May 18, 2025
While it's possible to use ScalaSQL on DuckDB, current limitations of ScalaSQL make some bits of data analytics work less than optimal.
DuckDB is the latest rage in the analytical database space. This space seems to be weirdly dominated by big data tools and products. The weird bit is when you find yourself using these products to handle small data. These tools are complex and the products are very expensive. As of late, I've inherited a project that boasts a monthly $700 bill to handle workloads that could fit in a single $5 VPS. A plausible explanation of how we ended up here can be found on Ludic's blog.
DuckDB presents itself as a sane, SQLite-like alternative for analytical databases. A detailed rationale on how DuckDB can be a feasible alternative to big data tools and products can be read in Jordan Tigani's Big Data is Dead.
ScalaSQL is a Scala ORM library from the Li Haoyi ecosystem, a set of libraries designed to be easy to use. The goal is to offer alternatives to the more powerful Scala frameworks while still providing the benefits of Scala.
Scala was once the workhorse of complex big data processing. However, in the last decade, it has been mostly replaced by Python. I believe that Scala's more disciplined approach and type safety provide significant help with the complexities of data analytical tasks. I also don't like Python.
With these turning tides in both the data analytics tooling and the Scala ecosystem, both moving towards greater simplicity, is Scala a feasible alternative again?
In order to figure out if this is a feasible stack, I've replicated the relevant bits of the DuckDB Java documentation using Scala and ScalaSQL.
I've not tested the streaming and updating sections, as those should be handled quite well by ScalaSQL and the default Java libraries.
All the code is available here. I used scala-cli
to build the project.
The first thing to have in mind is: ScalaSQL requires at least Scala 3.6.2.
This is not an LTS version of Scala. Project setup can be found here.
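For reference, the scala-cli setup for a project like this boils down to a handful of using directives. Here is a minimal sketch; the dependency versions below are illustrative, not necessarily the ones used in the linked repo.
// project.scala, versions are illustrative and not necessarily the repo's
//> using scala 3.6.2
//> using dep com.lihaoyi::scalasql:0.1.20
//> using dep org.duckdb:duckdb_jdbc:1.2.1
//> using dep org.apache.arrow:arrow-vector:18.2.0
//> using dep org.apache.arrow:arrow-c-data:18.2.0
//> using dep org.apache.arrow:arrow-memory-netty:18.2.0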
In order to use DuckDB, we need to establish a connection
with the database. We do this using tools from the standard java.sql
library
with the DuckDB jdbc
driver.
Finally we have to create a ScalaSQL client.
This requires that we define which SQL dialect we want to use.
The genius of DuckDB is that it uses the Postgres SQL dialect.
So using the PostgresDialect
from ScalaSQL to set up the connection
should work.
Because I ended up having multiple scripts that used the DB connection, I've put all my DB variables in the file Database.scala.
import java.sql.DriverManager
import org.duckdb.DuckDBConnection
import scalasql._, PostgresDialect._

// Open an in-memory DuckDB database through the standard JDBC driver manager
val conn = DriverManager
  .getConnection("jdbc:duckdb:")
  .asInstanceOf[DuckDBConnection]

// ScalaSQL client on top of the JDBC connection, using the Postgres dialect
val dbClient = DbClient.Connection(conn)
val db = dbClient.getAutoCommitClientConnection
The whole point of ScalaSQL is that we get to define our database tables as case classes. With that, we can build type-safe queries through their companion objects.
In BasicDuckDB.scala, I showcase how ScalaSQL and DuckDB actually work together. That code replicates the querying section of the DuckDB documentation.
The items table gets defined as a case class, along with the insert and select queries. This code does run, proving that we can use ScalaSQL on DuckDB.
import scalasql._, PostgresDialect._

// The items table, defined as a case class with a companion Table object
case class Items[T[_]](
  count: T[Int],
  item: T[String],
  value: T[Double]
)
object Items extends Table[Items]

// DDL still has to be plain SQL, read from a file and run raw
val createQuery = scala.io.Source.fromFile("./create_items.sql").mkString

// Type-safe insert queries built from the companion object
val jeans =
  Items.insert.columns(_.count := 1, _.item := "jeans", _.value := 20.0)
val hammer =
  Items.insert.columns(_.count := 2, _.item := "hammer", _.value := 40.2)

@main
def basicDuckDb = {
  db.updateRaw(createQuery)
  println(db.renderSql(jeans))
  println(db.run(jeans))
  println(db.renderSql(hammer))
  println(db.run(hammer))
  val items = db.run(Items.select)
  println(items)
  items.foreach(x => println(x.item))
}
At this point my main concern is the lack of Data Definition Language (DDL)
methods in ScalaSQL.
Notice that in order to create the items
table, we need to define a SQL
file and run it raw.
This may make ScalaSQL ill-suited for analytical or data engineering tasks.
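For reference, the DDL being run raw is just a plain CREATE TABLE mirroring the Items case class. Here is a minimal sketch; it is not necessarily the exact contents of create_items.sql in the repo.
// Sketch of the DDL that create_items.sql could contain; the actual file may differ
val createItems =
  """CREATE TABLE items (
    |  count INTEGER,
    |  item VARCHAR,
    |  value DOUBLE
    |)""".stripMargin

// ScalaSQL has no DDL builder, so the statement has to go through updateRaw
db.updateRaw(createItems)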
Emboldened by that earlier success, I decided to check whether the Arrow methods worked.
I wrote a script to create an Arrow vector. The main interest of that code is that it's a good reminder of how verbose Java can be. As such, its contents are omitted.
However, there is something that bit me. You need to generate a full Arrow VectorSchemaRoot. If you just try to load, say, an IntVector, DuckDB will throw an error, as it seems to require a Schema.
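Here is a minimal sketch of what writing such a file can look like. It is not the omitted script itself; the column name and output path are simply chosen to match the querying code below.
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.ipc.ArrowFileWriter
import java.io.FileOutputStream

@main
def createArrowFile = {
  val allocator = RootAllocator()

  // A single column named "integers", matching the case class used for querying
  val vector = IntVector("integers", allocator)
  vector.allocateNew(3)
  (0 until 3).foreach(i => vector.set(i, i + 1))
  vector.setValueCount(3)

  // DuckDB wants a schema, so wrap the vector in a VectorSchemaRoot
  // instead of handing it a bare IntVector
  val root = VectorSchemaRoot.of(vector)
  root.setRowCount(3)

  // Write the file that the querying script reads back (no dictionary provider needed)
  val out = FileOutputStream("data/test_file.arrow")
  val writer = ArrowFileWriter(root, null, out.getChannel())
  writer.start()
  writer.writeBatch()
  writer.end()
  writer.close()
  out.close()
}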
The other relevant bit regarding this setup is that if you want to use Arrow, you have to set the --add-opens JVM option (for Arrow that is typically --add-opens=java.base/java.nio=ALL-UNNAMED). I have not found a way to do this directly with scala-cli, so I have this config script.
With Arrow properly configured and the vector saved to disk, we can replicate this section of DuckDB's docs using ScalaSQL. We again define our table as a case class, then load and register the Arrow vector with the database. Finally, we query our vector using ScalaSQL. Find the code in ArrowTest.scala
import scalasql._, PostgresDialect._
import org.apache.arrow.c.ArrowArrayStream
import org.apache.arrow.c.Data
import org.apache.arrow.vector.ipc.ArrowFileReader
import org.apache.arrow.memory.RootAllocator
import java.io.File
import java.io.FileInputStream

val allocator = RootAllocator()

// The Arrow data exposed as a one-column table
case class Asdf[T[_]](
  integers: T[Int]
)
object Asdf extends Table[Asdf]

@main
def arrowTest = {
  // Read the Arrow file written earlier
  val file = File("data/test_file.arrow")
  val inputStream = FileInputStream(file)
  val reader = ArrowFileReader(inputStream.getChannel(), allocator)

  // Expose the reader as a C data interface stream and register it with DuckDB
  val arrowStream = ArrowArrayStream.allocateNew(allocator)
  Data.exportArrayStream(allocator, reader, arrowStream)
  conn.registerArrowStream("asdf", arrowStream)

  // Query the registered stream through ScalaSQL
  println(db.run(Asdf.select))
}
The benefits of type-safe queries are available on DuckDB through ScalaSQL, albeit in a limited fashion. ScalaSQL lacks methods to handle DDL queries, which makes the library suboptimal for the load bit of ETL work.
Furthermore, at the time of writing, ScalaSQL doesn't seem to have support for COPY ... TO statements. These statements are available in Postgres and DuckDB, and they are required to write output to Parquet files in cloud storage with DuckDB, which is pretty much the goal of current data engineering and analytical tasks.
All of that is no surprise, given that ScalaSQL is an ORM, mostly focused on supporting operational databases. Using ScalaSQL for analytical work may be a stretch of its current capabilities. However, extending ScalaSQL to handle those missing bits shouldn't be impossible.
With all these limitations, I can envision a workflow where all DDL and output work is handled in pure SQL, and the more complex transformations are handled with ScalaSQL. At the end of the day, we benefit from type safety when we want to bring query results into Scala to do some further processing.
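As a sketch of that split, the output step can stay as a raw DuckDB COPY statement issued through the same ScalaSQL connection. The path and options below are purely illustrative.
// Fall back to raw SQL for the Parquet export, since ScalaSQL has no COPY support;
// the output path and options are illustrative
db.updateRaw("COPY items TO 'output/items.parquet' (FORMAT parquet)")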