I will talk about a much simplified approach to functional programming in Scala, applied to building an abstraction for Feature Generation in the data engineering space.
The program written with this abstraction gets interpreted to the popular data-processing engine of our choice, such as Spark or Flink. Before it is interpreted to any of these engines, we will explain how such programs can be optimised by introspecting their nodes, making the interpretations run faster. The core idea is similar to that of Free Applicative; however, implementing it in Scala hasn't been straightforward. Here, we provide a similar capability without saying much about FreeAp, and without the usual Scala boilerplate of implicits, macros, and a proliferation of type classes.
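As a rough illustration of the idea (a minimal sketch with hypothetical names, not the talk's actual code), features are described as plain data, so the whole program is a value that can be inspected and optimised before any engine is touched:

```scala
// Features as plain data: the program is an ordinary value we can
// introspect, reorder, and merge before handing it to an interpreter.
sealed trait Expr
object Expr {
  final case class Col(name: String)            extends Expr
  final case class Sum(over: Expr)              extends Expr
  final case class Alias(e: Expr, name: String) extends Expr
}

final case class Feature(name: String, expr: Expr, source: String)

final case class FeatureProgram(features: List[Feature]) {
  // Features reading the same source can later be fused into one scan.
  def groupedBySource: Map[String, List[Feature]] =
    features.groupBy(_.source)
}

// An interpreter is an ordinary function from the description to an engine.
trait Interpreter[Target] {
  def run(program: FeatureProgram): Target
}
```

No implicits, macros, or type-class machinery are needed for this part; the optimisation step is plain pattern matching over the nodes.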
The purpose of the talk is not just to demonstrate a body of code, but to show that sticking to the fundamentals of functional programming and finding the right abstraction enables writing solutions quickly and relatively easily.
It shows we don't need to learn a mass of libraries to apply these concepts in real-world applications. In industry, functional programming in Scala has often been equated with a steep learning curve and a massive set of libraries, resulting in lower adoption and developers moving away from it. With this talk, my intention is to motivate developers to come back and start writing FP, even in the world of the JVM.
Q&A
Question: After that journey of recreating applicatives and semigroups, as well as a pluggable interpreter pattern for the data pipelines, and getting the team comfortable with it, is there any consideration of using cats/scalaz for that?
Answer: Indeed, cats or scalaz would give me Monoid. Adding a dependency is not going to scare developers, I hope. It's a matter of how much we are able to teach them (passionate developers who are keen to use these techniques within their limited time at work).
You can see a clear connection to Free Applicative here, and using it directly in Scala is a great way to solve the problem as well, if the team is comfortable with it.
Question: The problem you have solved seems quite generic (within the data engineering domain), and the solution is abstracted and expressed generically. Is this a capability that could be of general benefit to the community?
If so, have you open-sourced this, or at least thought of doing so?
Answer: I am solving one part of the large surface area of a general big-data engineering problem. Unless we spend more time on the Expr we discussed, an open-source release may not be useful for all use cases. What really worked here was identifying the specific problems at the client's site and making sure the code was geared towards solving them.
Question: To summarise, for my understanding: what you have basically done is abstract away the data structures and operations of a Spark (and, I think, Flink) dataframe with a custom algebra, and then used ZIO to run the jobs? A bit like a double abstraction of the effect: ZIO to start the jobs, and then the underlying Flink/Spark effects to perform the computation.
Answer: In short, yes.
Details: It's an abstraction that still has a bridge to the original engine's expression implementation; in Flink it is Expression, I think, and in Spark it is called Column. With this bridge we avoid the problem of "hey, I can use only a part of Spark's capabilities here".
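For illustration, a minimal sketch of that bridge (hypothetical node names; assumes the spark-sql dependency): the algebra keeps one escape-hatch constructor that wraps the engine's own expression type, so native capabilities are never locked out.

```scala
import org.apache.spark.sql.{functions, Column}

sealed trait Expr
object Expr {
  final case class Col(name: String)      extends Expr
  final case class Plus(l: Expr, r: Expr) extends Expr
  // Escape hatch: embed a raw Spark Column when the algebra has no node for it.
  final case class Native(col: Column)    extends Expr
}

// The Spark interpreter lowers the algebra to Columns; native nodes pass through.
def toColumn(e: Expr): Column = e match {
  case Expr.Col(n)     => functions.col(n)
  case Expr.Plus(l, r) => toColumn(l) + toColumn(r)
  case Expr.Native(c)  => c
}
```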
At the same time, we managed to optimise the program, which gave tangible performance differences (in terms of hours). I wanted to talk about the numbers, but that would in general become a discussion about Spark/Flink. As you mentioned, we then hand this over to the effect system. You can now imagine running two Spark actions in parallel from the same driver program.
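As a hedged sketch of that last point (assuming ZIO 2 and an already-started SparkSession; paths and names are illustrative), two Spark actions can run in parallel from one driver:

```scala
import zio._
import org.apache.spark.sql.DataFrame

// Wrap a blocking Spark action in a ZIO effect.
def writeFeatures(df: DataFrame, path: String): Task[Unit] =
  ZIO.attemptBlocking(df.write.parquet(path))

// zipPar runs both Spark actions concurrently from the same driver.
def writeBoth(a: DataFrame, b: DataFrame): Task[Unit] =
  writeFeatures(a, "/tmp/features_a")
    .zipPar(writeFeatures(b, "/tmp/features_b"))
    .unit
```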
Question: I see. Yes, incorporating the Expr implementation is great; there is nothing more annoying than providing an abstraction that you constantly have to extend because it wasn't exposing all the underlying capabilities.
Answer: Absolutely! It's a disaster in a setting where we don't have time. However, I am not fully convinced by the way Column and Expression are implemented, or by the optimisers behind them. So indeed, a bigger opportunity, if we have time, is to build our own.
But with the laws in place, we have put a wall against the impure nature of Spark/Flink's Column/Expression.
Question: It was a bit fast in the talk, but I found it really cool that you could optimise the program to a pretty large extent automatically. Going from individual computations to a single optimised computation with different views (basically, instead of calculating three results you calculated two and combined them for the third, right?) seems non-trivial.
Answer: That's right. Given the summary of feature generation, you can feed it back into the program as input, and that will merge the partitions together. It's hard to get into the details in 30 minutes, but I am pretty sure we could figure it out given a bit of time trying to solve it ourselves.
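To make the merging step concrete (my reconstruction of the idea, not the talk's exact code), aggregations over the same source and keys can be fused into one pass, with further results derived from the ones already computed:

```scala
// One requested aggregation: a feature computed over a source, grouped by keys.
final case class Agg(source: String, keys: List[String], feature: String)

// Fuse aggregations that share (source, keys) so they cost a single scan;
// e.g. count and sum are computed together, and mean is derived from the two.
def fuse(aggs: List[Agg]): List[(String, List[String], List[String])] =
  aggs
    .groupBy(a => (a.source, a.keys))
    .map { case ((src, keys), group) => (src, keys, group.map(_.feature)) }
    .toList
```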
Question: Do you have any datapoints on how your solution performs vs something like Spark SQL?
Answer: It was using Spark and Spark SQL, abstracting over how the dataframe queries were constructed.
It's more about constructing sensible dataframes, which is the responsibility of the developer rather than the Catalyst optimiser, and then feeding that sensible program to the Catalyst optimiser.
Question: Ah, I didn't realise; I thought you were using the regular DataFrame API under the hood.
Answer: Yes. The implementation of Dataservice does use dataframes, but note that they are all one-liner implementations. The composing of operations happens at the abstract level, not with dataframes. Meaning you'll see multiple joins in feature generation, but only one join call using a dataframe in the entire application.
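A sketch of what that can look like (Dataservice internals are hypothetical here): joins are composed symbolically in the algebra, and only the interpreter touches the DataFrame API, in a one-liner.

```scala
import org.apache.spark.sql.DataFrame

// Joins are described as data and composed at the abstract level.
sealed trait DataOp
final case class Source(name: String)                                extends DataOp
final case class Join(left: DataOp, right: DataOp, on: List[String]) extends DataOp

// The interpreter contains the only .join call in the entire application.
def interpret(d: DataOp, load: String => DataFrame): DataFrame = d match {
  case Source(n)      => load(n)
  case Join(l, r, on) => interpret(l, load).join(interpret(r, load), on)
}
```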
Question: You said "this is sellable to the client, we didn't talk about FP there" - do you find there's a resistance issue to FP from clients?
Answer: Well, the answer is "yes and no". It's more a matter of familiarity. Things are smooth as soon as we are able to talk about the benefits and to teach them. But if we are just developing in a silo, without talking, things get a bit hairy for them and result in rewrites. The resistance exists in general, and the usual feedback is that they can't get developers who know these topics to maintain the code. The solution, I think, is to write the code step by step and work along with them.