May 18, 2025
While it's possible to use ScalaSQL on DuckDB, current limitations of ScalaSQL make some bits of data analytics work less than optimal.
DuckDB is the latest rage in the analytical database space. This space seems to be weirdly dominated by big data tools and products. The weird bit is when you find yourself using these products to handle small data. These tools are complex and the products are very expensive. As of late, I've inherited a project that boasts a monthly $700 bill to handle workloads that could fit in a single $5 VPS. A plausible explanation of how we ended up here can be found on Ludic's blog.
DuckDB presents itself as a sane, SQLite-like alternative for analytical databases. A detailed rationale on how DuckDB can be a feasible alternative to big data tools and products can be read in Jordan Tigani's Big Data is Dead.
ScalaSQL is a Scala ORM library from the Li Haoyi ecosystem, a set of libraries designed to be easy to use. The goal is to offer alternatives to the more powerful Scala frameworks while still providing the benefits of Scala.
Scala was once the workhorse of complex big data processing. However, in the last decade, it has been mostly replaced by Python. I believe that Scala's more disciplined approach and type safety provide significant help with the complexities of data analytical tasks. I also don't like Python.
With these turning tides in both the data analytics tooling and the Scala ecosystem, both moving towards greater simplicity, is Scala a feasible alternative again?
In order to figure out if this is a feasible stack, I've replicated the relevant bits of the DuckDB Java documentation using Scala and ScalaSQL.
I've not tested the streaming and updating sections, as those should be handled quite well by ScalaSQL and the default Java libraries.
All the code is available here. I used scala-cli
to build the project.
The first thing to have in mind is: ScalaSQL requires at least Scala 3.6.2.
This is not an LTS version of Scala. Project setup can be found here.
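For reference, the scala-cli setup for a project like this boils down to a handful of using directives. Here is a minimal sketch; the dependency versions below are illustrative, not necessarily the ones used in the linked repo.
// project.scala, versions are illustrative and not necessarily the repo's
//> using scala 3.6.2
//> using dep com.lihaoyi::scalasql:0.1.20
//> using dep org.duckdb:duckdb_jdbc:1.2.1
//> using dep org.apache.arrow:arrow-vector:18.2.0
//> using dep org.apache.arrow:arrow-c-data:18.2.0
//> using dep org.apache.arrow:arrow-memory-netty:18.2.0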
In order to use DuckDB, we need to establish a connection
with the database. We do this using tools from the standard java.sql
library
with the DuckDB jdbc
driver.
Finally we have to create a ScalaSQL client.
This requires that we define which SQL dialect we want to use.
The genius of DuckDB is that it uses the Postgres SQL dialect.
So using the PostgresDialect
from ScalaSQL to set up the connection
should work.
Because I ended up having multiple scripts that used the DB connection, I've put all my DB variables in the file Database.scala.
import java.sql.DriverManager
import org.duckdb.DuckDBConnection
import scalasql._, PostgresDialect._

// Open an in-memory DuckDB database through the standard JDBC driver manager
val conn = DriverManager
  .getConnection("jdbc:duckdb:")
  .asInstanceOf[DuckDBConnection]

// ScalaSQL client on top of the JDBC connection, using the Postgres dialect
val dbClient = DbClient.Connection(conn)
val db = dbClient.getAutoCommitClientConnection
The whole point of ScalaSQL is that we get to define our database tables as case classes. With that, we can build type-safe queries through their companion objects.
In BasicDuckDB.scala, I showcase how ScalaSQL and DuckDB actually work together. That code replicates the querying section of the DuckDB documentation.
The items table gets defined as a case class, along with the insert and select queries. This code does run, proving that we can use ScalaSQL on DuckDB.
import scalasql._, PostgresDialect._

// The items table, defined as a case class with a companion Table object
case class Items[T[_]](
  count: T[Int],
  item: T[String],
  value: T[Double]
)
object Items extends Table[Items]

// DDL still has to be plain SQL, read from a file and run raw
val createQuery = scala.io.Source.fromFile("./create_items.sql").mkString

// Type-safe insert queries built from the companion object
val jeans =
  Items.insert.columns(_.count := 1, _.item := "jeans", _.value := 20.0)
val hammer =
  Items.insert.columns(_.count := 2, _.item := "hammer", _.value := 40.2)

@main
def basicDuckDb = {
  db.updateRaw(createQuery)
  println(db.renderSql(jeans))
  println(db.run(jeans))
  println(db.renderSql(hammer))
  println(db.run(hammer))
  val items = db.run(Items.select)
  println(items)
  items.foreach(x => println(x.item))
}
At this point my main concern is the lack of Data Definition Language (DDL)
methods in ScalaSQL.
Notice that in order to create the items
table, we need to define a SQL
file and run it raw.
This may make ScalaSQL ill-suited for analytical or data engineering tasks.
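For reference, the DDL being run raw is just a plain CREATE TABLE mirroring the Items case class. Here is a minimal sketch; it is not necessarily the exact contents of create_items.sql in the repo.
// Sketch of the DDL that create_items.sql could contain; the actual file may differ
val createItems =
  """CREATE TABLE items (
    |  count INTEGER,
    |  item VARCHAR,
    |  value DOUBLE
    |)""".stripMargin

// ScalaSQL has no DDL builder, so the statement has to go through updateRaw
db.updateRaw(createItems)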
Emboldened by that earlier success, I decided to check whether the Arrow methods worked.
I wrote a script to create an Arrow vector. The main interest of that code is that it's a good reminder of how verbose Java can be. As such, its contents are omitted.
However, there is something that bit me. You need to generate a full Arrow VectorSchemaRoot. If you just try to load, say, an IntVector, DuckDB will throw an error, as it seems to require a Schema.
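Here is a minimal sketch of what writing such a file can look like. It is not the omitted script itself; the column name and output path are simply chosen to match the querying code below.
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.ipc.ArrowFileWriter
import java.io.FileOutputStream

@main
def createArrowFile = {
  val allocator = RootAllocator()

  // A single column named "integers", matching the case class used for querying
  val vector = IntVector("integers", allocator)
  vector.allocateNew(3)
  (0 until 3).foreach(i => vector.set(i, i + 1))
  vector.setValueCount(3)

  // DuckDB wants a schema, so wrap the vector in a VectorSchemaRoot
  // instead of handing it a bare IntVector
  val root = VectorSchemaRoot.of(vector)
  root.setRowCount(3)

  // Write the file that the querying script reads back (no dictionary provider needed)
  val out = FileOutputStream("data/test_file.arrow")
  val writer = ArrowFileWriter(root, null, out.getChannel())
  writer.start()
  writer.writeBatch()
  writer.end()
  writer.close()
  out.close()
}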
The other relevant bit regarding this setup is that if you want to use Arrow, you have to set the --add-opens JVM option (for Arrow that is typically --add-opens=java.base/java.nio=ALL-UNNAMED). I have not found a way to do this directly with scala-cli, so I have this config script.
With Arrow properly configured and the vector saved to disk, we can replicate this section of DuckDB's docs using ScalaSQL. We again define our table as a case class, then load and register the Arrow vector with the database. Finally, we query our vector using ScalaSQL. Find the code in ArrowTest.scala
import scalasql._, PostgresDialect._
import org.apache.arrow.c.ArrowArrayStream
import org.apache.arrow.c.Data
import org.apache.arrow.vector.ipc.ArrowFileReader
import org.apache.arrow.memory.RootAllocator
import java.io.File
import java.io.FileInputStream

val allocator = RootAllocator()

// The Arrow data exposed as a one-column table
case class Asdf[T[_]](
  integers: T[Int]
)
object Asdf extends Table[Asdf]

@main
def arrowTest = {
  // Read the Arrow file written earlier
  val file = File("data/test_file.arrow")
  val inputStream = FileInputStream(file)
  val reader = ArrowFileReader(inputStream.getChannel(), allocator)

  // Expose the reader as a C data interface stream and register it with DuckDB
  val arrowStream = ArrowArrayStream.allocateNew(allocator)
  Data.exportArrayStream(allocator, reader, arrowStream)
  conn.registerArrowStream("asdf", arrowStream)

  // Query the registered stream through ScalaSQL
  println(db.run(Asdf.select))
}
The benefits of type-safe queries are available on DuckDB through ScalaSQL, albeit in a limited fashion. ScalaSQL lacks methods to handle DDL queries, which makes the library suboptimal for the load bit of ETL work.
Furthermore, at the time of writing, ScalaSQL doesn't seem to have support for COPY ... TO statements. These statements are available in Postgres and DuckDB, and they are required to write output to Parquet files in cloud storage with DuckDB, which is pretty much the goal of current data engineering and analytical tasks.
All of that is no surprise, given that ScalaSQL is an ORM, mostly focused on supporting operational databases. Using ScalaSQL for analytical work may be a stretch of its current capabilities. However, extending ScalaSQL to handle those missing bits shouldn't be impossible.
With all these limitations, I can envision a workflow where all DDL and output work is handled in pure SQL, and the more complex transformations are handled with ScalaSQL. At the end of the day, we benefit from type safety when we want to bring query results into Scala to do some further processing.
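As a sketch of that split, the output step can stay as a raw DuckDB COPY statement issued through the same ScalaSQL connection. The path and options below are purely illustrative.
// Fall back to raw SQL for the Parquet export, since ScalaSQL has no COPY support;
// the output path and options are illustrative
db.updateRaw("COPY items TO 'output/items.parquet' (FORMAT parquet)")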