A couple of terabytes of data is not impressive by today's standards: a hard drive of that capacity costs about a hundred dollars. But things quickly get complicated when you need to draw insights from a corpus of unstructured game scenarios that is growing at a rate of a terabyte a year.
You will hear some lessons learned by a data scientist wearing the extra hat of data engineer on this fun side project. The talk covers topics ranging from using the Apache Spark distributed computing framework and optimizing Delta tables to making sense of the resulting mega-dataset with graph theory and an interactive Streamlit application.
Question: What do you think of Koalas? https://koalas.readthedocs.io/en/latest/
Answer: I think Koalas is a gold mine for creating blogs/tutorials for someone who wants to get online exposure. There is definitely a future for it: for Spark to be widely used it has to feel familiar, and people want to use Pandas.
Question: I'm just starting to explore Spark, so it's good to learn some of those gotchas. A couple of questions:
- When you say 'native Spark', does that include PySpark? How do you find using PySpark?
- How many different operations did you end up doing with pandas UDFs? How flexible were they? Did they also have some issues with being a relatively new feature?
- PySpark, Spark SQL, SparkR - they all compile down to the same JVM execution, as I understood it. I found PySpark is OK and not too different from plain Python once you get used to some quirks like lazy evaluation. It was infuriating in the beginning that I couldn't even find the max value of a column, or print out a column. Like, this is what it takes to get a single value out of a dataframe:
res = spark.sql("SELECT max(id) AS maxid FROM table_name")
id = res.collect()[0].asDict()['maxid']
So of course I was happy to find .toPandas()
- I ended up with 3 UDFs.
In the original data structure there are 3 different tables. A pandas UDF can return only 1 dataframe, so I created 3 different UDFs to produce all 3 tables, then wrote them to Delta.
Question: I have worked on parallelising some ML processes with pandas and sklearn using vanilla Python, but mostly with small CSV data, and even that could get pretty hairy. I would like to know what processes you encountered around testing during development with Spark and pandas UDFs?
Answer: Testing and debugging are hard on Spark, I found.
I switched to my IDE (PyCharm) and used databricks-connect, so the load was sent to my cluster in the cloud, but I was able to debug my code in the coziness of my favourite IDE.
If you're using EMR or another flavour of Spark, I am not sure whether such a connector exists.
Question: I always get confused about partitioning in Spark. Just wondering: do you simply append each day's new data to the existing parquet files, rather than update existing data? Do you need to load the existing parquet files into a dataframe and join/union them with the new data? Can you append data to an existing partition?
Answer: If you use Delta, it abstracts the underlying files and lets you treat a collection of files just like any other data table. So you can INSERT, MERGE, DELETE, UPDATE, SELECT. It's pretty amazing how that works.
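For a daily batch, for instance, an upsert can be expressed directly in SQL against the Delta table (a sketch: the table and column names here are invented, and it assumes a Delta-enabled Spark session):

```sql
-- Upsert the day's new games into the Delta table
MERGE INTO games AS t
USING new_games AS s
  ON t.game_id = s.game_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```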
If you're using vanilla parquet files, I think you can just append to them, yeah:
df.write \
    .mode("append") \
    .format("parquet") \
    .partitionBy("Country") \
    .save(parquetDataPath)