On Spark in Scala 3

September 19, 2023

Today I had the chance to try out Spark on Scala 3. The conclusion: Quite underwhelming.

Introduction: Scala 3 is great.

I've been using Scala 3 to write my day-job scripts lately. I have to admit I was skeptical at first (particularly because significant whitespace has given me enough trouble with Python), but I've been pleasantly surprised. Mostly because I haven't managed to make significant whitespace work with my Emacs setup.

But overall, I find Scala 3 a great scripting language. The scala-cli tool and top-level definitions make the whole ordeal quite lightweight, while still giving you the power of the JVM ecosystem.
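
To give an idea of what I mean, here is a minimal sketch of a self-contained scala-cli script; the os-lib dependency and every name in it are purely illustrative, not lifted from my actual scripts.

  //> using scala 3.3.0
  //> using dep com.lihaoyi::os-lib:0.9.1

  // top-level definitions: no object wrapper or other boilerplate needed
  def greet(name: String): String = s"Hello, $name"

  @main def run() = {
    // a JVM library pulled in with a single directive
    println(greet("Spark"))
    println(os.pwd)
  }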

The setup: My first chance at Spark scripting.

I work on short-term contracts and usually spend half the allotted project time waiting to get access to the client's infrastructure. But this client is kinda nice: they sent me some .xlsx files with their database schema definitions, so I could set up some mock data and start working. Thus came the moment I had been waiting for, my chance to try out Spark on Scala 3. The trial by fire. I would test the Scala 2.13 to 3 cross-version compatibility, as well as how Spark, my main work tool, would behave in this setting.

The task involved getting some insight out of the .xlsx files, performing bog-standard DataFrame filters and aggregations. That all worked great.
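
That code stayed with the client, but the kind of thing that worked without a hitch looked roughly like the sketch below: mock data and the untyped Column-based API only, none of it the client's actual schema.

  //> using scala 3.3.0
  //> using dep org.apache.spark:spark-core_2.13:3.5.0
  //> using dep org.apache.spark:spark-sql_2.13:3.5.0

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.functions.{avg, col}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  @main def main() = {
    val spark = SparkSession.builder()
      .appName("Filter Aggregate Example")
      .config("spark.master", "local")
      .getOrCreate()

    // mock rows standing in for the data read from the .xlsx files
    val schema = StructType(Seq(
      StructField("name", StringType, true),
      StructField("age", IntegerType, false)
    ))
    val rows = java.util.Arrays.asList(
      Row("bob", 45), Row("liz", 25), Row("freeman", 32)
    )
    val df = spark.createDataFrame(rows, schema)

    // bog-standard filters and aggregations on the untyped DataFrame API: no trouble here
    df.filter(col("age") > 30)
      .agg(avg(col("age")).as("avg_age"))
      .show()
  }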

Trouble in paradise: Third parties to the rescue.

The first hiccup came with UDFs (user-defined functions) and Datasets (strongly typed DataFrames). Both lean on Scala's implicit black magic. Luckily enough, some brave souls have already worked on the issue. Importing the givens and the udf object from that library fixes that front. But at this point I was having second thoughts. It's not a big deal for a throwaway script, but this is definitely not production-ready.
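
For the Dataset half, the fix looks roughly like the sketch below (the udf object works along the same lines). Fair warning: the artifact name, version, and import path are written from memory and are assumptions on my part, so check the library's README before copying them.

  //> using scala 3.3.0
  //> using dep org.apache.spark:spark-core_2.13:3.5.0
  //> using dep org.apache.spark:spark-sql_2.13:3.5.0
  //> using dep io.github.vincenzobaz::spark-scala3-encoders:0.2.6

  import org.apache.spark.sql.SparkSession
  // the derived givens stand in for the TypeTag-based implicits Scala 3 cannot materialize;
  // the coordinates and import path above/below are from memory, double-check them
  import scala3encoders.given

  case class Person(name: String, age: Int)

  @main def main() = {
    val spark = SparkSession.builder()
      .appName("Dataset Example")
      .config("spark.master", "local")
      .getOrCreate()

    // with an Encoder[Person] derived by the imported givens, Datasets work again
    val people = spark.createDataset(Seq(Person("bob", 45), Person("liz", 25)))
    people.show()
  }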

Downhill we go: Last time Sparking with Scala 3 for the foreseeable future.

The real watershed came shortly after. Let me note that scripting is not really what Spark is made for. My end goal was writing some files to disk, not writing a proper table to cloud storage or a database. So at some point I had to .collect my DataFrame so I could comfortably create my outputs. Here things broke.

First of all, I could not .toList() my collected array. This is only an annoyance: you get to work with arrays rather than lists. And no amount of third-party imported givens seemed to fix the issue.

  //> using scala 3.3.0
  //> using dep org.apache.spark:spark-core_2.13:3.5.0
  //> using dep org.apache.spark:spark-sql_2.13:3.5.0
  //> using dep com.github.mrpowers:spark-daria_2.13:1.2.3

  import org.apache.spark.sql.{SparkSession, Row}
  import org.apache.spark.sql.types.{IntegerType, StringType}
  import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

  @main def main() = {
    val spark = SparkSession.builder()
      .appName("toList Example")
      .config("spark.master", "local")
      .getOrCreate()
    // example data from sparkdaria
    spark.createDF(
      List(
        ("bob", 45), ("liz", 25), ("freeman", 32)
      ), List(
        ("name", StringType, true),
        ("age", IntegerType, false)
      )
    ).collect().toList()
  }
  Compiling project (Scala 3.3.0, JVM)
  [error] ./ToListExample.scala:17:3
  [error] missing argument for parameter n of method apply
      in trait LinearSeqOps:(n: Int): org.apache.spark.sql.Row
  Error compiling project (Scala 3.3.0, JVM)
  Compilation failed

But the major showstopper came a few minutes later. Once I had grokked my data into a list of values, I was basically done: just parallelize, call .toDF(), and write to .csv.

  //> using scala 3.3.0
  //> using dep org.apache.spark:spark-core_2.13:3.5.0
  //> using dep org.apache.spark:spark-sql_2.13:3.5.0

  import org.apache.spark.sql.{SparkSession, Row}
  import org.apache.spark.sql.types.{IntegerType, StringType}

  @main def main() = {
    val spark = SparkSession.builder()
      .appName("toDF Example")
      .config("spark.master", "local")
      .getOrCreate()
    // example data from spark daria

    spark.sparkContext.parallelize(
      List(
        ("bob", 45), ("liz", 25), ("freeman", 32)
      )
    ).toDF()
  }
  Compiling project (Scala 3.3.0, JVM)
  [error] ./ToDFExample.scala:15:3
  [error] value toDF is not a member of org.apache.spark.rdd.RDD[(String, Int)]
      - did you mean org.apache.spark.rdd.RDD[(String, Int)].top?
  Error compiling project (Scala 3.3.0, JVM)
  Compilation failed

I had about 100 of these things that I needed to parallelize and write. I didn't know the schema of them all, so I could not take the .createDataFrame approach of the earlier example.

I worked around the issue with an ad hoc List-to-.csv encoder. And I had to use a var at some point, the first time I've ever used a var in Scala. I feel odd about that.
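
Not my actual code (and mercifully free of the var), but the workaround amounted to something in this spirit: a hand-rolled encoder that dumps a header and some rows as comma-separated lines. The writeCsv name and its signature are made up for the sketch, and it does no quoting or escaping, which was good enough for my well-behaved data.

  //> using scala 3.3.0

  // minimal sketch of an ad hoc List-to-.csv encoder: no quoting or escaping,
  // so it only holds up for values without commas, quotes or newlines
  def writeCsv(path: String, header: Seq[String], rows: Seq[Seq[Any]]): Unit = {
    val lines = (header +: rows.map(_.map(_.toString))).map(_.mkString(","))
    val writer = new java.io.PrintWriter(path)
    try lines.foreach(line => writer.println(line))
    finally writer.close()
  }

  @main def demo() = {
    writeCsv(
      "people.csv",
      Seq("name", "age"),
      Seq(Seq("bob", 45), Seq("liz", 25), Seq("freeman", 32))
    )
  }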

Conclusion: Final verdict.

For a throwaway script, all the issues could be worked around. Nothing took me too long to figure out, and this was definitely a great exercise in taking the pulse of the ecosystem. I feel like I was taking Spark out of its comfort zone with this task; Python with Pandas would have been a much better fit. So maybe someone else's use case does not suffer from these ails.

The final verdict: I would not recommend that a friend use Spark with Scala 3.

Colophon: A note on generative chatbots.

But I was telling my colleague about my adventures in real time, as they happened. He was the first to hear my final verdict as I messaged him:

– You cannot use .toDF nor .toList

In this year of 2023, everyone is raving about generative chatbots. So my colleague replied with a screenshot of his query to his favorite chatbot, the artificiality replying with an example all too similar to those shown above.

– It does not work. – I said.

Another screenshot came back, with the chatbot expressing how sure it was that the code worked. Not only was it sure: it had even tried the examples out!

Compiler errors have never felt as glorious as when I turned my screaming-red screen toward my colleague's face.