on Single Purpose Compute

October 10, 2024

For the past two years, most of my work has been tied to these Platforms as a Software Substitute. I work in the data sphere, so you might know the usual suspects: the one named after a process in your brain, the one whose name you could build a wall with, or the one named after crystallized water. No free publicity on these pages.

At the end of the day, what they provide is a managed Spark service and an interface from which you can easily run your code. And Spark's great. It's also hard to manage. So it makes a lot of sense to have managed instances.

These kinds of services are sold as the be-all and end-all of data work. However, they come with significant drawbacks.

Oftentimes, you do not need a Spark instance.

On a particular project we are handling four (4) spreadsheets a month. The largest of these runs in the thousands of records. The great thing is, we are not even using Spark on that project. Just plain old Pandas is doing all the work. But given that the main selling point of the service we use to run our code is Spark, it still has to spin up a whole Spark cluster (single node, but still), which takes a significant chunk of time. This means that, from the time a user clicks a button on the UI until they see some results, it takes ten minutes. Out of which nine and a half are compute setup.
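To put the scale in perspective, here is a toy-scale sketch. The data is made up (the real spreadsheets are obviously not public), but the point stands: aggregating a few thousand rows with Pandas takes milliseconds on a laptop, no cluster required.

```python
import io
import time

import pandas as pd

# Hypothetical stand-in for the month's largest spreadsheet:
# a few thousand records across a handful of accounts.
csv_data = "account,amount\n" + "\n".join(
    f"acc-{i % 50},{i * 1.5}" for i in range(5000)
)

start = time.perf_counter()
df = pd.read_csv(io.StringIO(csv_data))
totals = df.groupby("account")["amount"].sum()
elapsed = time.perf_counter() - start

print(f"{len(df)} rows aggregated in {elapsed:.4f}s")
```

Milliseconds of work, wrapped in nine and a half minutes of cluster setup.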

Oftentimes, coding your solution in Python is not the best option.

One of the aforementioned spreadsheets is downloaded from a financial data portal. As things were before the summer, someone logged in to that portal every morning. Searched for the required financial data. Filtered it. Downloaded the spreadsheet. And uploaded it to cloud storage. Big Four consultants had claimed such a process impossible to automate.

Being myself a radical believer in non-belief, I did a quick search and found a JavaScript library providing an API for the financial data portal. A couple of hours later, I had the whole process automated. And boldly stated my thoughts about Big Four consultants.

Just to swallow my own words a few minutes later. Turns out that automating such a process using our beloved platform is impossible. You get to run Python code or jars. You do not get to run your 80 lines of JavaScript.

Most times, you get to do horrible things with code on notebooks.

I mean, you get to run arbitrary Shell code on one of those platform notebooks. So I guess I could figure out a way to install Node there. And trigger my script from there.

That's exactly what we are doing on a different project. One of those "Python" notebooks installs some packages, runs some Spark code. Then executes some Shell commands. And then builds a curl command by Python string concatenation, and executes it using the os library… Because Python's requests is quite hit or miss within our beloved platform. Whoever wrote that was completely unconcerned with separation of concerns.
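For what it's worth, even when curl is the only HTTP client that behaves inside the platform, you can shell out to it without string concatenation. A minimal sketch (the URL, token, and payload here are made up, not the project's actual request):

```python
import subprocess

def build_curl_command(url: str, token: str, payload: str) -> list[str]:
    # Pass arguments as a list: no string concatenation, no shell,
    # no quoting or injection bugs when the payload contains spaces
    # or quotes.
    return [
        "curl", "--silent", "--fail",
        "-H", f"Authorization: Bearer {token}",
        "--data", payload,
        url,
    ]

cmd = build_curl_command("https://example.invalid/api", "t0k3n", '{"a": 1}')
# Then, instead of os.system on a concatenated string:
# subprocess.run(cmd, check=True, capture_output=True, text=True)
```

It does not fix the separation-of-concerns problem, but at least the command stops being a string-building exercise.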

But I've kind of refused to write code like that. So I've asked someone to just install Node on their multipurpose laptop, and run the code from there.

Because our work platform is single purpose, we do not get to use it to solve our work tasks. We only get to use it for our data tasks.

I asked my manager for permission to run my script on a free-tier lambda. But I'm in the Data Department; we use our platform, not lambdas. I only asked for permission because the senior architect claimed that, if I were to deploy a lambda, it would not go unnoticed. I am not one of those permission seekers.

Every time, your code is tied to the platform.

And that means you do not get to use standard tooling or practices. I just like testing my code. And using a debugger. Those are tools that make me productive. I also like grepping stuff. It makes me productive. But because all the code lives in notebooks on a remote server, I get to use none of that. And it gets worse: my younger colleagues do not get to see me using any of that, so they do not get to learn standard productivity tooling.

It's quite easy to write code that only runs on the platform. How on earth am I supposed to run that through a debugger? What I've seen is: you maintain two versions of the code. One of them is a notebook where each line lives in its own cell, so you get to do line-by-line execution. Again, DRY was not a concern.

I understand that some people do not like testing or debugging. But I like it. I do not go around forcing everyone to test or debug their code. I do not do ANY tech evangelism. Not even Emacs. But I am usually on the firefighting team. And I would really appreciate not being forced to use the tools someone else likes on a Saturday morning.

Conclusion

I've done questionable things with VPSs. I've broken a fair share of computers. And I see the point of these managed services, where you get to use only a sliver of the capabilities of the underlying server.

But these things feel absolutely disempowering. I know what I could be doing. I've done it. I've used the best tool for the job. I've set up seven hundred VPSs to carry out some computation. And nowadays I just get to write Python within a box. And I'm saddened.

But my junior colleagues do not get to learn what they could accomplish with a server. They do not get to code a line outside of Python. And they do not get to learn about testing or debugging. And that saddens me even more.