YOW! Data 2021

Wednesday, 12th - Thursday, 13th May, Online Conference

18 experts spoke.

YOW! Data is an opportunity for data professionals to share their challenges and experiences while our speakers share the latest in best practices, techniques, and tools.

The 2021 conference was an online two-day event featuring invited international and Australian speakers sharing their expertise in Data Science, Data Engineering, Machine Learning and AI.



Evolving the ML Platform organisation at Netflix: a case study

Do you wish there was a Machine Learning model to tell you how to structure your ML teams? So do I! While we're waiting for that, I'll share the story of how the ML Platform organisation evolved at Netflix. Although this story is specific to our own journey to expand Netflix ML investments, there are a few lessons learned along the way that you'll be able to relate to. There are several factors going into org structure that we'll discuss, including: the specialty and skillsets of ML practitioners, the variety and depth of ML use cases, who's responsible for the data, the ownership model as ML projects go to production, and how the underlying Platforms are situated. I look forward to sharing and hearing your own thoughts afterward!


Question: Do AEs at Netflix prototype ML models themselves or just validate/deploy models prepared by others e.g. data scientists? Considering the differences of skills and responsibilities, do AEs normally earn more than DSs? (Please ignore this Q if confidential)

Answer: AEs do research and prototyping in addition to productionising their models!

It's hard to compare salaries apples-to-apples because different folks are valued for different skills. So I wouldn't make a generalisation that one makes more than the other!

Question: How long did it take for Netflix to understand the differences between algorithm engineers and data scientists?

Answer: I would say that the key difference is where they fall on the software engineering spectrum and that folks were explicitly hired with a desired level of this skill. I would also say that AEs working in Personalisation were more likely to have a background in search & recommender systems.

Question: How important do you think having a product research team composed of Data Scientists/ ML Engineers is to a business? It seems like a lot of organisations have the ML hammer but are not always clear on which nail to hit. What’s your advice for setting up teams to identify potential use cases apart from capitalising on already existing customer data?

Answer: There are a lot of methods that are much easier to apply than ML, so I definitely don't think ML is right for every problem! It really depends on the business and the data.

Depending on the problem setup, basic heuristics, mathematical models, or even really basic linear models would be a starting point. I'd want to establish a baseline that needs to be improved upon before applying ML.

You also have to be able to define an objective to optimise and collect a dataset that can be labeled reliably with "correct" answers for the model to learn on.

Bottom line, see if you can come up with the least fancy way to solve the problem first, see how far you can get and when you've exhausted that option, start to consider ML, but not before. The size of the prize also needs to be big enough. If you have $1B in revenue but the opportunity is to save costs or increase revenue by $100K, it is likely not worth the cost to pay the Data Scientist to do the work!
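
The "least fancy first" advice can be made concrete. Below is a minimal, illustrative sketch (the dataset, metric, and numbers are all made up): establish a trivial baseline and a metric first, and only reach for ML once a candidate clearly beats it.

```python
# Before reaching for ML, establish a trivial baseline and measure it.
# Illustrative example: a "predict the historical mean" baseline for a
# numeric target, scored with mean absolute error.

def mean_baseline(train_targets):
    """The least fancy model possible: always predict the training mean."""
    mean = sum(train_targets) / len(train_targets)
    return lambda _features: mean

def mean_absolute_error(predict, rows):
    errors = [abs(predict(features) - target) for features, target in rows]
    return sum(errors) / len(errors)

# Toy data: (features, target) pairs.
train = [({"x": 1}, 10.0), ({"x": 2}, 12.0), ({"x": 3}, 14.0)]
test = [({"x": 4}, 16.0), ({"x": 5}, 18.0)]

baseline = mean_baseline([target for _, target in train])
baseline_mae = mean_absolute_error(baseline, test)
print(f"baseline MAE: {baseline_mae:.1f}")  # any ML model must beat this
```

If a proposed model can't beat this number by a margin that justifies its cost, the ML project isn't worth starting.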

Julie Amundson

Director of Machine Learning Platform Experience

Building & Operating Autonomous Data Streams

The world we live in today is fed by data. From self-driving cars and route planning to fraud prevention to content and network recommendations to ranking and bidding, today's world not only consumes low-latency data streams, it adapts to changing conditions modeled by that data.

While the world of software engineering has settled on best practices for developing and managing both stateless service architectures and database systems, the larger world of data infrastructure still presents a greenfield opportunity. To thrive, this field borrows from several disciplines: distributed systems, database systems, operating systems, control systems, and software engineering, to name a few.

Of particular interest to me is the subfield of data streams, specifically how to build high-fidelity nearline data streams as a service within a lean team. To build such systems, relying on human operations is a non-starter. All aspects of operating streaming data pipelines must be automated. Come to this talk to learn how to build such a system soup-to-nuts.


Question: In your case, are the Observer and Deployer the sum of a few systems and processes, or have your team(s) invested in creating centralised solutions?

Answer: I'm actually at a startup now, so I use a different system than I used at PayPal.

Let me describe PayPal here first…

The Deployer actually does something I missed in the talk:

  • It deploys code
  • It waits for the Observer's signal to settle, then checks lag and loss metrics (it can also check scoring skew)
  • If it detects something is wrong, it will roll back the code AND roll back the Kafka checkpoint
  • This is why my system is at-least-once delivery: the Deployer can replay messages from the window of a bad deployment
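
As a hypothetical sketch of that decision loop (function names and thresholds are mine, not PayPal's code): deploy, let the Observer's metrics settle, check lag and loss, and on failure roll back both the code and the Kafka checkpoint so the bad window gets replayed.

```python
def deploy_and_verify(deploy, read_metrics, rollback_code, rollback_checkpoint,
                      lag_threshold_s=60.0, max_loss=0):
    """Deploy new code, wait for the Observer's metrics to settle, then
    either accept the deployment or roll back both code and checkpoint."""
    deploy()
    lag_s, lost = read_metrics()        # Observer's rolling-window lag & loss
    if lag_s > lag_threshold_s or lost > max_loss:
        rollback_code()                 # revert the deployment...
        rollback_checkpoint()           # ...and replay from the pre-deploy offset
        return "rolled_back"            # replay => at-least-once delivery
    return "accepted"

# Toy run: pretend the Observer reports excessive lag after deploying.
events = []
status = deploy_and_verify(
    deploy=lambda: events.append("deploy"),
    read_metrics=lambda: (120.0, 0),    # 120s lag, no loss
    rollback_code=lambda: events.append("rollback_code"),
    rollback_checkpoint=lambda: events.append("rollback_checkpoint"),
)
print(status, events)
# rolled_back ['deploy', 'rollback_code', 'rollback_checkpoint']
```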

The Observer system uses reservoir sampling instrumentation in each microservice to collect data. The data is sent over Kafka to a Streaming Spark job that collects the data and writes it to ES (our metrics store).

A separate Spark job computes rolling-window loss and lag. This system will send alarms.
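
Reservoir sampling, the instrumentation trick mentioned above, keeps a uniform random sample of a stream of unknown length in constant memory. A minimal stdlib sketch:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)         # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decreasing probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

latencies = (x * 0.1 for x in range(10_000))   # e.g. per-message lag readings
print(reservoir_sample(latencies, k=5))
```

This is why sampling works for time-based metrics like lag (a uniform subsample still estimates the distribution) but not for loss metrics, where every data point matters.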

In my current startup, we run in AWS.

Our Observer system is 100% CloudWatch. To save money, we do some client-side metric sampling for all time-based metrics (e.g. lag). We can't sample loss metrics or data points at all.

In my startup, we run our own K8s on EC2. We do HPA CPU autoscaling on K8s on top of memory-based EC2 autoscaling (two-level autoscaling). It's a bit of an advanced topic for a keynote.

Our Deployer system is currently more complex. Deployment is via kubectl commands.

Question: What's the best way to learn more and practise with a solution?

Answer: I can write something up that shows more implementation details. You can connect with me on https://www.linkedin.com/in/siddharthanand/ (email: sanand@apache.org).

As I publish on this topic, I'll share it on LinkedIn.

You can email me at sanand@apache.org after this Q&A with questions too and I can always find time for zoom mentoring/etc…

Question: On the e2e lag, does the expel time mean the data has reached its destination, say becoming available in a database?

Answer: In the expel node, we:

  • Read from Kafka (autocommit disabled)

  • Write to, say, BigQuery

  • Wait for a 200 (OK)

  • When we get it, call KafkaConsumer.acknowledge()

So, yes: E2E lag encapsulates receiving the request at S and confirming the send at D. The entire chain is transactional.
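
The expel loop can be sketched in a few lines (the consumer and sink here are stubs; the real system reads Kafka and writes BigQuery): the offset is acknowledged only after the destination confirms the write, which is exactly what makes delivery at-least-once.

```python
# Minimal sketch of the expel node's loop: commit the Kafka offset only
# after the destination confirms the write. Consumer and sink are stubs.

def expel_loop(consumer, write_to_sink, acknowledge):
    for message in consumer:
        status = write_to_sink(message)   # e.g. an insert into BigQuery
        if status == 200:                 # destination confirmed (200 OK)
            acknowledge(message)          # only now advance the offset
        else:
            break                         # stop; message will be redelivered

acked = []
expel_loop(
    consumer=iter(["m1", "m2", "m3"]),
    write_to_sink=lambda m: 200 if m != "m3" else 503,
    acknowledge=acked.append,
)
print(acked)  # ['m1', 'm2'] -- m3 stays unacknowledged and will be replayed
```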

Sid Anand

Chief Architect

Foundations of Data Teams

Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure.

This talk will cover the importance of a solid foundation and what management should do to fix it. To do this I’ll be sharing a real-life analogy to show how we can be misled and what that means for our success rates.

We will talk about the teams in data teams: data science, data engineering, and operations. This will include detailing what each is, does, and the unique skills for the team. It will cover what happens when a team is missing and the effect on the other teams.

The analogy will come from my own experience with a house that had major cracks in the foundation. We were going to simply remodel the kitchen. We were never told about the cracks, and the house needed a completely new foundation. In a similar way, most managers think adding in advanced analytics such as machine learning is a simple addition (remodel the kitchen). However, management isn't ever told that you need all three data teams to do it right. Instead, management has to go all the way back to the foundation and fix it. If they don't, the house (team) will crumble underneath the strain.


Question: The house metaphor extends to “tech debt” too.

If I want to add one more level (feature) to this building, what is the cost of renovating versus rebuilding it from scratch with the new number of levels?

Renovating is a very different question depending upon the existing foundations.

Answer: It really depends on how well you built your foundation. A foundation built with duct tape and hope will be different than a solid one.

Question: The picture of the poorly engineered house was the image I have been needing to describe to management why a “simple renovation” is going to cost more than simply rebuilding a new foundation at this point in time.

Answer: There's a mountain of technical debt somewhere. It's a delicate surgery to get things fixed. Make sure to fix the org issues that brought the original problem or you're doomed to repeat them.

Question: What are your thoughts on cross-functional data teams versus more isolated platform/internal product data teams? Have you seen any great examples of journeys that transition between these subtypes of data teams to create value?

Answer: I talk about this in the book in the Data Ops chapter. IMHO cross-functional/DataOps teams are the highest and best usage. They create the optimal value relative to cost. However, it is an advanced configuration. IMHO you should only embark on this journey once you've established a solid foundation and friction is your biggest daily issue.

Question: Particularly interested in any experience and ideas you have working with companies that are consultancies - i.e. My company consults to other companies to solve specific problems, we develop data science models and deploy in a way the client can use them. We work with their data, but do everything else in-house. It seems different from a lot of the focus of the talks at this conference which are all about in-house data teams. What kind of different problems have you seen? How does your 3-pillar approach translate to this business model?

Answer: I think it's the same, with the possibility that operations isn't there. Theoretically, your client is doing the operations. IMHO you still need data science and data engineering to do this right. Another big issue will be communication between the client and your team.

Question: On a similar question, wondering if the need for operations peeps in a data team is reduced at companies where there are data platform teams offering central platforms for data teams to use?

Answer: It's reduced but never eliminated. It's a similar thing with the cloud: you can't fire the ops team, but there is a reduction. I've had discussions with companies who have done central platforms. I even interviewed one for the book. There are ops issues: who is responsible when something breaks? Who figures out what broke? Did the framework/hardware fail, or did the code you wrote fail? Ops people have to sort this out.

Jesse Anderson

Data Engineer, Creative Engineer and Managing Director
Big Data Institute

Islands in the Stream - What country music can teach us about event driven systems

Event driven systems are all the rage. It's with good reason we're witnessing a transformation as businesses adopt event driven systems. Building systems around an event-driven architecture is a powerful pattern for creating awesome data-intensive applications. But before we sail away to another world, let's avoid the common pitfalls of designing & running event driven systems.

Islands in the Stream - what Kenny Rogers can teach us about event driven systems from the wisdom of a country music classic


Question: When we use the event-driven approach, how can we guarantee the response time and maintain state? For example, with a synchronous API call we know whether it succeeded within a second; with an asynchronous call we don't know when we'll get the result. Do you have any ideas or suggestions for how to solve this?

Answer: Great question. Building EDA systems to a known SLA constraint is, I reckon, easier than a traditional microservices architecture because you have isolation for the consumers. Something like a bunch of Kafka consumers can be load balanced and scaled pretty independently. When a hard SLA outcome is required you can build a back-pressure system - but it gets a bit complex.
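
As an illustration of the back-pressure idea (my own toy sketch, not the speaker's design), a bounded queue is the simplest form: a fast producer blocks when the slower consumer falls behind, so work cannot pile up without bound.

```python
import queue
import threading
import time

# A bounded queue gives back-pressure for free: put() blocks when the
# buffer is full, pacing the producer to the consumer's speed.

events = queue.Queue(maxsize=4)    # small buffer, so back-pressure kicks in early
consumed = []

def consumer():
    while True:
        item = events.get()
        if item is None:           # sentinel: producer is done
            break
        time.sleep(0.001)          # simulate slow downstream work
        consumed.append(item)

worker = threading.Thread(target=consumer)
worker.start()
for i in range(20):
    events.put(i)                  # blocks when the buffer is full
events.put(None)
worker.join()
print(len(consumed))               # 20: nothing dropped, just slowed down
```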

Question: I remember going to town when I discovered the Observer pattern around 2002 and trying to create an event-driven engine model. Super hard at the time, but I suspect a whole lot easier with today's tools.

Answer: Thanks. Yeah - a lot of great frameworks exist now that manage the hard “distributed” problems so you can concentrate on just solving the unique (business) problems. I’m a huge fan of k-streams and flink - but these problems are much easier to solve these days for sure!

Question: A question about when things go wrong — if a producer produces erroneous events for a while, what's the pattern for telling consumers of that event topic to ignore those events?

Answer: There’s a pattern called a “poison pill” when you really need to discard a message logically from immutable storage. This can also take the form of a new-day offset if you just want to scrub history. In Kafka this takes the form of a consumer offset.
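
A toy sketch of the pattern (the offsets and events are made up): because the log is immutable, the "discard" happens on the consumer side, by skipping flagged offsets or everything before a new starting offset.

```python
# Logical discard over an immutable log: nothing is physically deleted;
# consumers simply skip the offsets flagged as poison, or scrub history
# by ignoring everything before a chosen "new day" offset.

BAD_OFFSETS = {2, 3}        # offsets the producer flagged as erroneous
SCRUB_BEFORE = 1            # or: ignore all history before this offset

def consume(log):
    for offset, event in enumerate(log):
        if offset < SCRUB_BEFORE or offset in BAD_OFFSETS:
            continue        # logically discarded, never physically deleted
        yield event

log = ["stale", "ok-1", "bad", "bad", "ok-2"]
print(list(consume(log)))   # ['ok-1', 'ok-2']
```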

Simon Aubury

Principal Data Engineer

Lessons learned from building ML products

Building products based on machine learning requires much more than taking an ML algorithm and deploying it in the cloud. Based on my experience as a researcher, in ecommerce, and as an independent consultant, I talk about some of the lessons learned about what is needed beyond pure ML algorithms to successfully build products with ML. How do you identify customer problems that can be tackled with ML? What does the technology landscape around ML look like? How do you set up teams and organizations to be "AI ready"? I'll be sharing some of my observations and insights.


Question: At our company we currently have a (small) centralised data team, and we often interact with the dev team to get systems productionised. We've been thinking about expanding the team to have "go-betweens" in other teams, which sounds like the cross-functional teams you mentioned in your talk.

Have you had experience moving from the centralised to the cross-functional paradigm? Any useful tips?

Answer: At Zalando, they did more of a big-bang approach where they closed down the central team at some point and moved people into the cross-functional product teams. But even before that they started to move out some teams, like recommendation, that were already further along.

I think once you realise that a solution is not just one off but needs continual improvement, you can start building cross-functional teams one by one.

At some point you might also go back to have a central team to look for new opportunities within the company.

Thanks to everyone who listened to my talk. Just some pointers if you want to stay in touch:

  1. You can follow me on twitter https://twitter.com/mikiobraun.

  2. Connect with me on Linkedin https://www.linkedin.com/in/mikiobraun/.

  3. I’ve also recently started a weekly newsletter if that’s your thing https://www.getrevue.co/profile/mikiobraun.

Mikio Braun

Independent Consultant

Yepoko Lessons For Machine Learning on Small Data

Let's face it, in most companies, the amount of good data available to perform machine learning is very small. Most data are small data. So how can we do good machine learning on small data?


Question: This started from a puzzle?!?! how could anyone just casually solve it?

Answer: 20 minutes of not looking at the screen, plus a pen and paper, gets you a lot of fun.

Question: I wonder how the model behaves when we categorise the words based on prime and composite numbers as another feature!

Answer: I'm working on a project that involves that, and mathematical reasoning!

Here's the puzzle https://ioling.org/problems/2012/i2/

To clarify what @Chewxy mentioned about how Chinese is more structured when it comes to numerals:

千 = in thousands

百 = in hundreds

十 = in tens

So if you were to construct a number in, let's say, the thousands, you would write something like this in Chinese:

五千五百二十

Which in English means five thousand five hundred and twenty.

But the direct translation means:

Five thousand five hundred and two tens.
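
The positional structure described above is regular enough to parse mechanically. A small illustrative parser (simplified: it handles fully spelled-out forms like the example, not shorthands such as a bare 十 for ten):

```python
# Each digit is followed by its place marker (千 thousand, 百 hundred,
# 十 ten); a trailing bare digit is the ones place.

DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
PLACES = {"千": 1000, "百": 100, "十": 10}

def parse(numeral):
    total, pending = 0, 0
    for ch in numeral:
        if ch in DIGITS:
            pending = DIGITS[ch]            # remember the digit...
        elif ch in PLACES:
            total += pending * PLACES[ch]   # ...until its place marker arrives
            pending = 0
    return total + pending                  # a trailing digit is the ones place

print(parse("五千五百二十"))   # 5520
```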

Xuanyi Chew

Chief Data Scientist

Data Mesh; A principled introduction

For over half a century, organizations have assumed that data is an asset to collect more of, and that data must be centralized to be useful. These assumptions have led to centralized and monolithic architectures, such as the data warehouse and the data lake, that limit an organization's ability to innovate with data at scale.

Data Mesh is an alternative architecture and organizational structure for managing analytical data. Its objective is to enable access to high-quality data for analytical and machine learning use cases - at scale.

It's an approach that shifts the data culture, technology and architecture:

  • from centralized collection and ownership of data to domain-oriented connection and ownership of data
  • from data as an asset to data as a product
  • from proprietary big platforms to an ecosystem of self-serve data infrastructure with open protocols
  • from top-down manual data governance to a federated computational one

In this talk, Zhamak will introduce the principles underpinning Data Mesh and its architecture.


Question: I'm an ML/big data consultant, and we are working with one of our clients who use the data mesh architecture. The problem they have is that, since we don't take ownership, the verticals in the organisation won't come to us for datasets and services. And since this is a different mentality from the rest of the organisation, we need to introduce ourselves in a different way, but we are not gaining much traction. Have you come across this before, and what is your recommended solution?

Answer: I need to understand this better… in my experience data scientists and analysts are so poorly served that they are always looking for data - even better, data products - and there is always friction to get to them. So if the DM platform removes friction, handles access control and discoverability, and serves high-quality data in a way that meets the needs of their native tools/processes, I'm curious why they still don't show up for a bucket?

Question: With the data mesh approach, at a high level it feels like it is domain driven and ownership will sit at the domain level. Will this make implementing master data management solutions or "one source of truth" needs difficult for complex organizations?

Answer: One source of truth is an ever-moving goal post. Yes, you are right that DM shifts from one source of truth to the most trusted truth, but it does provide certain guardrails that support the notion: (1) a data product is immutable and read-only - data never updates for a particular processing time, which removes a lot of the accidental complexity that comes from differently updated information; (2) a data product has SLOs that it needs to guarantee in terms of quality, accuracy, integrity, etc.; (3) aggregate data products can be created to provide the mastering capability, if needed, around core concepts.

Question: With distributed ownership of data, do you have any thoughts around how to incentivize data owners/producers to maintain baseline data usability, data quality, SLAs etc and own the data they produce so that they can think of data as a product.

Answer: Great question. Guaranteeing data quality in a way that makes the users happy is part of their job, and their OKRs should include that. NPS, platform observability, and automation to check data integrity, quality, etc. could be some of the tools. Dave and I from ThoughtWorks recently gave a talk on how to manage and guide the evolution of organizations using "fitness functions"; the talk, "Guiding the evolution of Data Mesh with fitness functions", is online.

Question: In most organizations the data is not owned by the business users. In your experience, what is the best way to educate business owners to own the data within the organization and educate them about various aspects of data, including data privacy?

Answer: It's very hard to do that without bringing the other three pillars to life: a business-domain-oriented tech (including data) capability that supports the business, a platform that empowers them, and a governance model that includes them.

So to get started, we need all of those pieces to come together.

Additionally, this is a "transformation", so organizational change is part of it - and, most importantly, top-down support.

Zhamak Dehghani

Principal Consultant

Taming the Long Tail of Industrial ML Applications

Data Science usage at Netflix goes far beyond our eponymous recommendation systems. It touches almost all aspects of our business - from optimizing content delivery and informing buying decisions to fighting fraud. Our unique culture affords our data scientists extraordinary freedom of choice in ML tools and libraries, all of which results in an ever-expanding set of interesting problem statements and a diverse set of ML approaches to tackle them. Our data scientists, at the same time, are expected to build, deploy, and operate complex ML workloads autonomously without the need to be significantly experienced with systems or data engineering. In this talk, I will discuss some of the challenges involved in improving the development and deployment experience for ML workloads. I will focus on Metaflow, our ML framework, which offers useful abstractions for managing the model's lifecycle end-to-end, and how a focus on human-centric design positively affects our data scientists' velocity.


Question: Does Metaflow provide an option for hyperparameter tuning using Bayesian methods rather than just grid search?

Answer: Great question. Yes, we are working on HPO integrations. This PR lays down the ground work for it - https://github.com/Netflix/metaflow/pull/510. We are exploring integrations with Optuna. I would love to know which HPO libraries/services you are using today!

Question: Metaflow is awesome, but if it uses AWS underneath, what is its advantage over similar MLOps features in AWS SageMaker?

Answer: Currently, our integrations are with the AWS cloud, but all our integrations are plugins so it's easy to support Azure, GCS and On-prem.

Question: First of all, we at REA Group recently decided to use Metaflow as our orchestration tool. It is an amazing tool; thanks for open sourcing it. The endpoint function is really neat! Have you thought about integrations with AWS SageMaker inference? If yes, would that be open sourced?

Answer: Nice! I would love to get feedback on your experience with Metaflow. Yes, indeed we will surface a bunch of inference backends when we open source @endpoint. However, if you would like to get started without a formal integration, here is how Metaflow can work with Cortex (which is similar to SageMaker inference): https://site-cd1e85.webflow.io/post/reproducible-machine-learning-pipelines-with-metaflow-and-cortex

Question: Are the steps of the training flow mentioned in the presentation (snapshot, restore, etc.) implemented via Metaflow?

Answer: Within Metaflow you can execute arbitrary Python code. We don't enforce many constraints, except that it needs to be a well-formed DAG with a single entry point and a single exit point. Here is some more information: https://docs.metaflow.org/metaflow/basics
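
That constraint - a well-formed DAG with a single entry and a single exit - is easy to check in isolation. A hypothetical validator over a plain adjacency-list graph (illustrative only, not Metaflow's code):

```python
# Check the shape Metaflow requires of a flow: exactly one entry node,
# exactly one exit node, and no cycles.

def acyclic(graph, nodes):
    seen, stack = set(), set()
    def visit(n):
        if n in stack:
            return False                 # back edge => cycle
        if n in seen:
            return True
        stack.add(n)
        ok = all(visit(m) for m in graph.get(n, ()))
        stack.discard(n)
        seen.add(n)
        return ok
    return all(visit(n) for n in nodes)

def is_valid_flow(graph):
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    entries = [n for n in nodes if all(n not in t for t in graph.values())]
    exits = [n for n in nodes if not graph.get(n)]
    return len(entries) == 1 and len(exits) == 1 and acyclic(graph, nodes)

# A branch-and-join flow: start -> (a, b) -> join -> end.
flow = {"start": ["a", "b"], "a": ["join"], "b": ["join"], "join": ["end"], "end": []}
print(is_valid_flow(flow))   # True
```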

Question: Just wondering how Metaflow approaches the reusability of models. Other than grid search, does Metaflow make use of similar previously cached workflows or model pickles automatically, or is that left solely to the user piecing them together separately?

Answer: You can access any prior result (within a Metaflow flow or any Python process): https://docs.metaflow.org/metaflow/client

Savin Goyal

Machine Learning Infrastructure

Rights, Sovereignty and Governance in Official Reporting: Considerations in the Use of Aboriginal and Torres Strait Islander Data

The right of Indigenous people in Australia to be counted in official statistics was realised in 1967. The identification of Indigenous people in Australia in national data highlights a range of historical and contemporary issues that require our attention. This includes how Indigenous people have been defined and by whom, as well as how identification is operationalised in official data collections. Furthermore, the completeness and accuracy of Indigenous people identified in the data, and the impact this has on the measurement of health and wellbeing, must also be taken into account. Official national reporting of Indigenous people is calculated using data from censuses, vital statistics, and existing administrative data collections and/or surveys. In alignment with human rights standards, individuals in Australia can opt to self-identify as ‘Indigenous’ in the data. Australia’s colonial context, in which Aboriginal and Torres Strait Islander data is derived, raises considerations about the sovereign rights of Indigenous people globally in the use of data, and how these can be actioned through data governance processes.


Question: How is the under-registration of people (eg with births) known? Seems like that would be in "unknown unknown" territory?

Answer: Most births that occur in Australia are picked up by the perinatal data set. These births trigger a notification to the birth registry, and when the parents register the birth it results in a 'Birth Registration'... so by comparing the perinatal data set with the birth registration data set we can see the differences.
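
In miniature, that comparison is a set difference between the two collections (real record linkage is probabilistic and far harder; the IDs below are purely illustrative):

```python
# Births notified in the perinatal collection but absent from the registry
# approximate the under-registration. IDs are toy linkage keys.

perinatal = {"P001", "P002", "P003", "P004", "P005"}   # notified births
registered = {"P001", "P002", "P004"}                  # completed registrations

unregistered = perinatal - registered
rate = len(unregistered) / len(perinatal)
print(f"{len(unregistered)} of {len(perinatal)} births unregistered ({rate:.0%})")
# 2 of 5 births unregistered (40%)
```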

Question: Do we have any idea why the rate of birth reporting for indigenous people is lower, particularly in rural areas? What's different between the processes to record data in the perinatal datasets and the birth registry?

Answer: The majority of Aboriginal and Torres Strait Islander people live regionally and remotely. This means that if the birth registration documents aren't completed in the hospital/clinic before people leave, there is potential that these documents are never sent to the registry office. Doctors, midwives and a 'responsible person' put the data into the perinatal collection.

We have been working with registrars, and the QLD Registry of Births, Deaths and Marriages has done some extensive work over the past few years to improve the issues in their state.

Question: There's been so much COVID related data flying around, have you seen these same kinds of issues with reporting of COVID rates and mortality amongst Aboriginal and Torres Strait Islander people or is it any better in these more recent data sets?

Answer: One issue with COVID data is that pathology centres (as private entities) are not required to collect Indigenous status... this has posed big issues for Indigenous reporting...

We published an article on the issues with COVID-19 and Aboriginal and Torres Strait Islander data, if you're interested in a deeper dive: https://content.iospress.com/articles/statistical-journal-of-the-iaos/sji210785

Kalinda Griffiths

Scientia Lecturer
University of New South Wales

Scaling the Machine Learning Platform at DoorDash

DoorDash’s mission is to grow and empower local economies. DoorDash’s business is a 3-sided marketplace composed of Dashers, consumers, and merchants.

As DoorDash's business grows, it is essential to establish a centralized ML platform to accelerate the ML development process and to power the numerous ML use cases.  We are making good progress, but we are still in the early days of building out our ML platform.

This presentation will detail the DoorDash ML platform journey that includes the way we establish a close collaboration and relationship with the Data Science community, how we intentionally set the guardrails in the early days to enable us to make progress, the principled approach of building out the ML platform while meeting the needs of the Data Science community, and finally the technology stack and architecture that powers billions of predictions per day and supports a diverse set of ML use cases. They include search ranking, recommendation, fraud detection, food delivery assignment, food delivery arrival time prediction, and more.


Question: Have you considered a graph DB for correlating features? (e.g. rather than a k/v)?

Answer: Yes, we are considering using Neo4j for a bigger upcoming project.

Question: How did you make sure the ML Platform you were building would solve most of the pain points for the customers (data scientists, ML engineers, etc.)? How did you go about collecting that feedback? Was it through the ML council as well?

Answer: We have many ways to interact with DSs: via Slack, regular training sessions, etc.

Question: Tying into the first question, sometimes data scientists might be nervous about the uptake of a new process/platform. How did you go about onboarding customers onto the platform?

Answer: Whenever there is a new use case, we collaborate pretty closely with them from the beginning of their project. Documentation plays an important part in onboarding customers.

Question: Bit of a silly and fun question, but kind of interesting in terms of service quality and machine learning I suppose: I always have problems with my pizza arriving cold when I use delivery apps - DoorDash as well as UberEats and Deliveroo. I've noticed this can be due to a number of reasons:

  1. The restaurant is far away and there's traffic

  2. The driver picks a slower route

  3. The driver arrives late at the restaurant

  4. The driver is delivering an order on the way

  5. The driver is delivering an order after mine but had to wait for the next order to be completed at the restaurant

Do you know of any strategies for what restaurant to choose to ensure my pizza arrives hot? Rating, distance, photo quality or other? Are there any good predictors of pizza arrival hotness in the application UI?

Also, I think you briefly mentioned food arrival quality in your talk. What kind of predictions are in place to ensure this, and what sort of models are you looking to train (or features to train on) to improve this in the future?

Answer: There are two parts - food order pickup time and food delivery time. These are both difficult problems because we don't control the weather, the traffic, or how busy the restaurants are. We have some control over when to ask the Dashers to pick up the food. The logistics team is constantly looking for ways to improve their models to reduce the pickup time and delivery time.

Question: Was wondering if DoorDash allows its merchants to give input into the weightings for the recommendations model? This could get rid of issues such as recommending unrelated sides/drinks in orders, like dumplings as sides for dumplings.

Answer: Good question. I am not aware of this at the moment. Maybe we should figure out a way for customers to give feedback on the bad recommendations.

Hien Luu

Head of Machine Learning Infrastructure

Using AI to Mine Unstructured Research Papers to Fight COVID-19

There is an overwhelming amount of information (and misinformation) about COVID-19. How can we use AI to better understand this disease? In this session, we take an open dataset of research papers on COVID-19 and apply several machine learning techniques (named entity recognition of medical terms, finding semantically similar words, contextual summarization, and knowledge graphs) that can help first responders and medical professionals better find and make sense of the research they need. We will dive into the techniques used and share the code repository, so developers will walk away with an understanding of how to build a similar solution using Cognitive Search.


Question: What are you using for the knowledge graph (I work for Dgraph Labs, so am generally curious about that side of the stack)?

Answer: Simple visualization code using D3 is at https://github.com/liamca/covid19search/blob/master/web-app/WebMedSearchV2/Views/Home/Graph.cshtml Not nearly as cool as Dgraph. :)

Question: With the dimensions/categories on the left side of https://covid19search.azurewebsites.net, how did the team come up with relevant categories to be made available for website users?

Journal & Contributors makes sense.

But stuff like:

  1. Genes

  2. Family Relations

  3. Examination Names

  4. Etc.

Answer: The Journal and Contributors were pulled directly from the metadata. The remaining categories were all extracted from the healthcare entity machine learning model that I mentioned. (Guys down the hall in our building at Microsoft built it.) It's documented at https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?tabs=ner but let me know if you have any questions. We only used the NER functionality, but there's lots of other great stuff there. The idea was to help researchers pinpoint the topics they cared about faster.

Question: Did your team also use the extracted entities for feature engineering or search weighting rather than just categorical labelling? If so, was it done without the medical domain knowledge?

Answer: In this scenario, we just used the extracted entities as facets for the search, as the end goal was to empower doctors/researchers to find the medical research they needed more efficiently. But yes, you certainly could use the entities for other purposes. We do have both generic entity extraction and medical/healthcare-specific domain entity extraction ML models.

Question: So you are using a DNN to rerank BM25... How do you evaluate the quality of the ranks?

Answer: Yes, we use a DNN to rerank BM25. We evaluate the quality of the ranks using xDCG scores. It's trained to predict DCG scores for relevance and click probability using billions of {query, click, document} tuples from Bing logs.
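For readers unfamiliar with the metric, here is a minimal sketch of how plain DCG and its normalised form are computed for a ranked list of graded relevance judgments. Note that Bing's xDCG is a proprietary variant, so this is the textbook formula, not their exact implementation:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalised by the ideal (sorted) ordering, giving a score in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking that places highly relevant documents first scores closer to 1.0.
```

A reranker is then judged by whether its orderings score closer to 1.0 than the BM25 baseline on held-out judgments.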

Jennifer Marsman

Principal Software Development Engineer

Playing with Words: Building Products with NLP

Imagine machines that interact with us using the same interface we use to interact with each other — spoken language! Recent progress in NLP has opened up new possibilities for language-based systems. In this talk, we'll explore the recent history of language models and highlight novel applications of statistical and deep learning approaches. Then, we'll explore emerging products that automate, generate, and create using these models, and discuss the implications for building them, including safety, ethics, and the invention of new design metaphors. Finally, we'll speculate about where this might take us in the next few years. Can machines ... play?


Question: "computable" = as numbers that you can do arithmetic on?

Answer: Yes, as in you can calculate the distance between two words (or sentences).
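To make that concrete, here is a toy sketch of measuring the distance between word embeddings with cosine similarity. The three-dimensional vectors are invented for illustration; real models use hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings"; real models use hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "pizza": [0.1, 0.2, 0.9],
}
```

Here "king" ends up closer to "queen" than to "pizza", which is the sense in which words become numbers you can calculate with.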

Question: Recently came across a documentary called "Coded Bias", which argues that there will be bias in how machines make decisions because there is bias in the data used to train them. Any thoughts on how big the problem is?

Answer: The bias is in both the data used to train the systems as well as the design of the systems themselves. It’s a huge problem.

There’s quite a lot of academic work on understanding this problem, and applied work on systems (like LIME and SHAP) to help interpret trained models.

Question: Do you have any resources you can suggest for ML + creativity, or people doing interesting work in this space?

Answer: I’m a fan of the work at RunwayML: https://runwayml.com/ … building tools for folks in the film and entertainment industries to use deep learning capabilities.

And in open source, the ml4a work: https://github.com/ml4a

Question: Very interesting point around the diversity/lack of consistency of roles/titles in the data space... What do you hope will be different in this area in 5-10 years?

Answer: I’d like to see the titles and job functions and management structure stabilize, so that if you’re a “data scientist” at one company it’s more or less the same job as at another company.

I look to “software engineer” as an example of what this should eventually look like as the profession matures.

Question: Liked your example of NLP around property descriptions. Have you come across other examples of useful ML in the real estate industry?

Answer: ML in real estate is generally around Bayesian inference for things like estimation of property prices; I haven't seen much NLP in real estate specifically.
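As a toy illustration of the Bayesian-inference flavour mentioned here (the single-parameter model and all the numbers are invented, not anything a real estate product actually ships): a conjugate normal update that refines a prior belief about a property's price as comparable sales are observed.

```python
def posterior_normal(prior_mean, prior_var, observations, obs_var):
    """Conjugate normal update with known observation variance:
    refine a prior belief about a price as comparable sales arrive."""
    mean, var = prior_mean, prior_var
    for price in observations:
        # Precisions (inverse variances) add; the posterior mean is the
        # precision-weighted average of prior and observation.
        precision = 1.0 / var + 1.0 / obs_var
        mean = (mean / var + price / obs_var) / precision
        var = 1.0 / precision
    return mean, var
```

Each observed sale pulls the estimate toward the data and shrinks the uncertainty, which is the basic mechanic behind this style of price estimation.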

Question: What was the name of the AI tool to help authors when they were stuck again?

Answer: It’s https://www.sudowrite.com/.

Question: Interesting mention of that creative story writing tool (what was it called again?) - a great example of using generative NLP to tackle creative issues. Do you have any insights into how this could affect the creative industry as a whole? E.g. potentially over-saturating the writing community even more, or being used to generate false information?

Answer: The tool I mentioned is Sudowrite: https://www.sudowrite.com/. I suspect that this becomes an efficiency tool for creators, the same way that gmail will suggest words for your e-mails, but only for the standard phrases. It’s great for mediocre content, not great for delicate creative surprise.

I like to imagine a world in which writers will curate and train their own model assistants over time, and keep them and bring them from task to task… but that's probably more poetic than probable.

Question: Have you ever come across any initiatives in companies that promote data science and machine learning internally within businesses (playful, creative or any other) that have high impact but might not be known about widely?

Answer: Can you give me an example of what you have in mind? For new product development?

Question: The end goal would be new product development, I suppose, but also to create a nurturing learning environment for employees who are interested in the field. Examples would be: hack days dedicated to building ML models on existing data, regular talks on ML topics, bringing in guest speakers on specialized topics that can be applied to work, internal machine learning competitions (an internal Kaggle), etc.

Answer: Ahh yes! I love a deliberate process of asking people for their silliest and “worst” ideas, and then allowing hack days to actually experiment and build some of them.

One bank I worked with a few years ago did a morning session on ML for product managers, and then an afternoon session on scoping out potential places in the product where ML could be helpful.

That was really successful, since product managers are generally better positioned to see the opportunities for ML.

Question: What's the timeline on hidden door? I have a 9-year-old daughter who loves to write…

Answer: We’re building to a private alpha at the end of the year. If she’s interested in playtesting, please join us here: https://www.hiddendoor.co/playtesting.

Hilary Mason

Hidden Door

Apache Pulsar and the Streaming Ecosystem

Apache Pulsar is an open-source distributed pub-sub messaging system, developed under the stewardship of the Apache Software Foundation.

This talk will show how its unique architecture enables Pulsar to seamlessly support both streaming and messaging use cases in a single unified platform.

We will also show where Pulsar fits with the broader ecosystem of data streaming technologies and all the interoperability that is available out of the box, making it a particularly good choice for supporting any kind of data platform, where versatility, interoperability and scalability are the key requirements.


Question: It looks to me like Pulsar has been designed with elastic infrastructure in mind, since you've separated out compute and storage, unlike other systems which are designed to run on their own cluster. Is this a fair assessment? Do you see people taking advantage of this?

Answer: Yes, it’s part of normal operations. It’s useful at any cluster size.

Other possibilities, like scaling serving and storage independently, are more useful only on clusters of a certain size.

Question: What’s the feedback around the cost savings, scalability benefits etc that users are experiencing?

Answer: Yes, the main driving factor for separating the storage layer (bookies) and the serving layer (brokers) is to avoid tying the data of a particular topic to one specific node.

The two layers allow for that, in conjunction with segmenting the data.

Not having the data tied to one node is key, because it makes it easy to add nodes, take down nodes, etc., without the need for expensive rebalancing operations.

Also, high write availability is only possible if you decouple serving from storage, because it allows you, in the presence of storage node failures, to immediately switch new writes to healthy nodes.

I think the benefits of this architecture are more visible in the scalability and operability of the system than directly in cost savings. Although, yes, auto scaling the cluster up and down does result in using infrastructure on demand.

Matteo Merli

Co-creator and PMC Chair
Apache Pulsar

Assisting design with machine learning in Canva’s editor

Our team at Canva focuses on building features that make design simple, enjoyable and collaborative for more than 55 million people across the globe. For many who haven’t used design tools, starting with a blank page can be intimidating, which is where Canva’s library of more than 500,000 templates comes in. Unfortunately, switching between templates once required retyping your content. To fix this, we created a feature for our users to bring their text with them while exploring the library. The initial challenge was that the template metadata the feature relied on was scarce and costly for our in-house designers to annotate.

We wanted to predict metadata for our designers inside the Canva editor, but had to consider a number of real-world engineering tradeoffs. First, we’ll explain the user problem and provide a glimpse inside some of our templates and the metadata that enables text transfer. Then, we’ll explain what features we extracted for our scikit-learn random forest classifier and how we combined it with a designer-in-the-loop to bootstrap enough batch-predicted metadata to launch an MVP version of the feature. Finally, we’ll explain how we decided to reimplement model storage and inference in our TypeScript frontend stack. Creating this new feature was a joint effort made possible by a multidisciplinary team of designers, engineers and data scientists. We’re looking forward to sharing some of the lessons we learned along the way to shipping this smart feature.


Question: Would be keen to understand how that implementation is built where ML is on the front end... that opens up pretty cool UX possibilities.

Answer: Yep, so the steps for us were:

  • Build parallel implementations of the feature-extraction code in Python and TypeScript, then check that the two are always equivalent. This was more complex for our use case because we need to do some geometric calculations.
  • Figure out how to serialize the trained scikit-learn model's numpy arrays into JSON.
  • Write code to deserialize it and run inference.
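A rough sketch of the serialize/deserialize idea, using a hand-rolled tree structure. The node layout loosely mimics the arrays inside a scikit-learn tree (children_left, children_right, feature, threshold), but the field names and the "heading"/"body" labels here are invented for illustration:

```python
import json

# Hypothetical node arrays standing in for a trained scikit-learn tree's
# internals; field names and labels are invented for this sketch.
tree = {
    "feature":   [0, -1, -1],        # feature index tested at each node; -1 = leaf
    "threshold": [0.5, 0.0, 0.0],    # split threshold per node
    "left":      [1, -1, -1],        # child node ids
    "right":     [2, -1, -1],
    "value":     [None, "heading", "body"],  # class predicted at each leaf
}

def predict(tree, x, node=0):
    """Walk the serialized tree for a single feature vector x."""
    if tree["feature"][node] == -1:
        return tree["value"][node]
    if x[tree["feature"][node]] <= tree["threshold"][node]:
        return predict(tree, x, tree["left"][node])
    return predict(tree, x, tree["right"][node])

# The plain dict survives a JSON round trip, so the same structure can be
# deserialized and evaluated by equivalent TypeScript code on the frontend.
payload = json.loads(json.dumps(tree))
```

A random forest would serialize a list of such trees and take a majority vote over their predictions at inference time.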

Question: Frontend protobuf/flatbuffers not working?

Answer: It's a bit of a yak shaving exercise, but it was going to be hard to include a dependency on a full ML library like TensorFlow.js, and we weren't confident enough that it would work to pull backend engineers off other tasks.

We use a lot of proto, but mainly only serialise out to Java and TypeScript - not Python. We did develop a Python dialect for our code-generation stack for this, which made it a lot easier.

We also considered this library: https://github.com/nok/sklearn-porter

But it quite literally turns the random forest into hundreds of lines of if statements, so it was too hard to code review.

I've looked around for neat cross-framework solutions to this. I couldn't find many.

Question: Maintaining parallel implementations is pretty amazing. Are the same methods used for the video templates as well?

Answer: It's a lot of work and I'm not sure I'd recommend it as it's so hard to maintain. We've also played around with running the TypeScript code inside node to do feature extraction.

We stuck to simple cases (e.g. social, posters, ...) and IIRC video templates were pretty new at that point.

There are definitely cool things you could do if you can understand the role that images and video are playing in a design.

Question: How long would you say it took for this feature from idea to production?

Answer: Thanks! I think we got it out within 4 months. As usual, the modelling part was only a tiny part of the effort; building the backfill tooling and figuring out how to move the text around took a while. There are still definitely rough edges. I don't think we ever solved the discoverability aspect neatly. Again, another trade-off between visual clutter and telling users about features!

Question: You didn't have to manually label because you already had pre-labelled data, right? (Didn't catch that bit, thanks to 4G internet.)

Answer: Definitely wasn't perfect.

We had partial coverage in our data - enough to train a model. We focused on shipping categories one at a time, so the goal was to predict the roles on the unlabelled templates and manually check these (designer-in-the-loop), as this is faster than marking them up from scratch.
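A hypothetical sketch of the designer-in-the-loop triage described here: batch-predict roles for unlabelled templates, auto-accept confident predictions, and queue the rest for a designer to check. The StubModel and the 0.8 threshold are invented for illustration; real code would call predict_proba on the trained scikit-learn classifier.

```python
class StubModel:
    """Stand-in for a trained classifier; invented for this sketch."""
    def predict_proba(self, features):
        # Pretend confidence simply tracks the first feature value.
        p = min(max(features[0], 0.0), 1.0)
        return {"heading": p, "body": 1.0 - p}

def triage(model, unlabelled, threshold=0.8):
    """Auto-accept confident predictions; queue the rest for designer review."""
    auto_accept, needs_review = [], []
    for template_id, features in unlabelled:
        probs = model.predict_proba(features)
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            auto_accept.append((template_id, label))
        else:
            needs_review.append((template_id, label, confidence))
    return auto_accept, needs_review
```

Checking a ranked queue of low-confidence predictions is much faster for a designer than annotating every template from scratch.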

Will Radford

Data Scientist

Sweet Streams are Made of These: Data Driven Development for Stream Processing

The strength of a powerful stream processing engine is in how fast, and how much data it can process. This naturally adds complexity to existing integration points and can lead to development overhead. Luckily, there is a set of data-driven development principles that are built to alleviate precisely these challenges. This talk will go over what these are and how to apply them at various points throughout the development process, using real-world successes (and failures!) as examples. Although the examples are for highly complex systems, this talk will be beginner-friendly and applicable to non-streaming use cases.


Question: You mention possible consumers of the dashboard being "leadership", "new hires", etc. Do you need to create different dashboards for different technical groups?

Answer: Typically I create one main dashboard for my team, and then I use integration tools to embed specific, single panels of my dashboard into whatever communications platform will be easiest.

So, for other engineers, I keep it nice and familiar with Slack integrations, usually for the exact same panel I'm using. And I've personally found that embedding a simplified visualization (e.g. a pie chart, or a single aggregated numerical output) into an internal wiki/blog space works best for leadership.

As for new hires, I didn't mention it, but I really appreciate that my company lets each team do a full intro presentation to each new group of hires, so I have a presentation ready to go. I'm actually currently working on getting some auto-updating visualizations in there too.

Caito Scherr

Developer Advocate

Analyzing a Terabyte of Game Data

A couple of terabytes of data is not impressive by today's standards. A hard drive of that capacity costs about a hundred dollars. But things quickly get complicated when one needs to draw insights from a corpus of unstructured game scenarios that are increasing at a rate of a terabyte a year.

You will hear some lessons learned by a data scientist wearing the extra hat of data engineer on this fun side project. The talk will cover topics from using the Apache Spark distributed computing framework and optimizing Delta tables to making sense of the resulting mega-dataset with graph theory and an interactive Streamlit application.


Question: What do you think of Koalas? https://koalas.readthedocs.io/en/latest/

Answer: I think Koalas is a gold mine for creating blogs/tutorials for someone who wants to get online exposure. There is definitely a future to it - for Spark to be widely used, it has to feel familiar, and people want to use pandas.

Question: I'm just starting to explore Spark so good to learn some of those gotchas. A couple of questions:

  • When you say 'native Spark', does that include PySpark? How do you find using PySpark?

  • How many different operations did you end up doing with pandas UDFs? How flexible was it? Did it also have issues with being a relatively new feature?


Answer:

  • PySpark, Spark SQL, SparkR - they all compile down to Scala and run that way, as I understood it. I found PySpark is OK and not too different from Python once one gets used to some quirks like lazy evaluation. It was infuriating in the beginning that I couldn't even find the max value of a column - or print out a column. Like, how is the below for getting a single value out of a dataframe?

    res = spark.sql("SELECT max(id) AS maxid FROM table_name")
    max_id = res.collect()[0].asDict()['maxid']

    So of course I was happy to find .toPandas()

  • I ended up with 3 pandas UDFs.

    In the original data structure there are 3 different tables. A UDF can return 1 dataframe, so I created 3 different UDFs to return all 3 tables, then wrote them to Delta.

Question: I have worked with parallelising some ML processes with pandas and sklearn using vanilla Python, but mostly with small CSV data, and even that could get pretty hairy. What processes did you use for testing during development with Spark and pandas UDFs?

Answer: Testing and debugging are hard on Spark, I found.

I switched to my IDE (PyCharm) and used databricks-connect, so the load was sent to my cluster in the cloud, but I was able to debug my code in the coziness of favourite IDE.

If you're using EMR or another flavour of Spark - I am not sure if such a connector exists.

Question: I always get confused about partitions in Spark.

Just wondering, do you just append each day's new data to the existing parquet files rather than updating existing data?

Do you need to load the existing parquet files into a dataframe and join/union them with the new data? Can you append data to an existing partition?

Answer: If you use Delta - it abstracts away the underlying files and allows you to treat a collection of files just like any other data table. So you can INSERT, MERGE, DELETE, UPDATE, SELECT. It's pretty amazing how that works.

If you're using vanilla parquet files - I think you can just append to them, yeah.
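For readers who haven't seen Delta's table semantics, here is a conceptual pure-Python sketch of what a MERGE (upsert) does. In practice it would be a Spark SQL statement against the Delta table, roughly `MERGE INTO games USING updates ON games.id = updates.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *` (the table names are invented for illustration):

```python
def merge(table, updates, key="id"):
    """Upsert rows from `updates` into `table` (lists of dicts) by key:
    matched rows are updated, unmatched rows are inserted."""
    by_key = {row[key]: row for row in table}
    for row in updates:
        # Merge the incoming row over any existing row with the same key.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())
```

Delta performs this over collections of parquet files transactionally, which is what makes daily appends and corrections to existing partitions painless.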


Rimma Shafikova

Data Scientist

Do you want ML with that? When to say yes and why to say no.

In this talk I'll speak about why you should only use ML when you really need to, some techniques we've used successfully at Xero to help cut through the noise/analysis paralysis, and why it might help to approach building an ML system the same way you might decide what car to buy.


Question: The "should we build it" ML template looks a lot like the decisions you'd go through to build any kind of product - was that a starting point and it was tweaked to make it ML-specific?

Answer: Hmm, I can't claim authorship, so I don't have the full lineage for this answer, but my best guess, knowing the usual approach of the peeps who did, is that they blank-sheet-of-paper-ed it, then canvassed other ideas (from the internet and other parts of Xero), merged them into a hybrid, and tweaked the resulting template over time. Hence our use of Confluence.

Question: Can you say anything about the volume and accept rate of the idea review pipeline?

Answer: Oh, that's a great question, and it will be a useful byproduct of formalising the pipeline in the medium term. It's too early to have any useful stats to share at this point.

Question: Really interesting talk, your process sounds very deliberative and considered. I think it'd avoid going on a lot of wild goose chases, have you found much resistance to it? Also do you ever flip the template and use it for deciding to kill a product in a case where there might not be a clear replacement?

Answer: You are very kind. We're aiming for “mindful and conscious” as the pace of change doesn't accommodate “deliberative and considered” very easily.

No, we've not found resistance at Xero although I know exactly where you're coming from as I have seen similar ideas run into resistance elsewhere.

We do put a lot of work into internal comms and 'explaining the why' though.

Kendra Vant

EGM - Data, ML & AI

Data Rainbows - select * from cloud;

Drowning in a lake? Stuck inside a warehouse? See your data in a different light! Postgres Foreign Data Wrappers provide SQL queries to live cloud data - all the structure and much lighter weight. In this session, we'll explore the potential of Data Rainbows for growing cloud environments and outline the challenges of working with data you can see but can't quite touch.


Question: Can you do joins? For example could you set up a virtual table for an exchange rate API, and then join that to your gamestop hourly data in a single query?

(Edit: posted this question a few minutes before you mentioned joins!)

Answer: Yep ... you can definitely do joins!

Disclaimer - It can get a little weird with the postgres query optimizer, so if you hit any weird edge cases please let us know.

Question: Do you have any caching features to improve query performance and to reduce the risk of getting hit by API rate limiting?

Answer: Yes, steampipe automatically caches queries. It's 5 mins by default, but configurable through the configuration file.

Other key ways we avoid rate limiting (features of the https://github.com/turbot/steampipe-plugin-sdk SDK that all plugins use):

  1. Plugins can set parallelism limits on a per-table basis. We find that throttling varies wildly by API, so it's helpful to do it this way.

  2. Retry with Fibonacci backoff is easy to add to tables as well.

  3. We stream results through, so you get immediate wins while waiting for throttled items to complete.
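The plugin SDK implements these in Go; as an illustration of point 2, here is a minimal Python sketch of retry with Fibonacci backoff (the function names are mine, not the SDK's):

```python
def fibonacci_backoff(max_retries=6, base_delay=1.0):
    """Yield wait times (seconds) following the Fibonacci sequence: 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    for _ in range(max_retries):
        yield a * base_delay
        a, b = b, a + b

def with_retries(call, sleep, max_retries=6):
    """Retry `call` on failure, sleeping per the backoff schedule."""
    last_error = None
    for delay in fibonacci_backoff(max_retries):
        try:
            return call()
        except Exception as err:  # a real plugin would catch throttling errors only
            last_error = err
            sleep(delay)
    raise last_error
```

Fibonacci growth backs off more gently than doubling, which suits APIs whose throttling windows reset quickly.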

Question: What’s the migration path from “live querying” during exploration phase to data warehousing?

Answer: Because Steampipe is really just a Postgres database, you should be able to treat it like any other data source.

We've had some users start embedding the Steampipe engine (with its embedded Postgres install) inside Docker containers to run in service mode.

You can learn more about running Steampipe in service mode at https://steampipe.io/docs/using-steampipe/integrations#steampipe-service.

Question: I was just wondering, what's the overhead in terms of time to set up steampipe against a number of APIs? And is there much overhead in maintaining connectivity to a significant number of APIs with this tool as time passes?

Answer: Writing a new plugin is days of work (typically in Go). We have 23 already at https://hub.steampipe.io, and a number of open-source repositories are springing up too, where people are starting to build their own.

If you'd like a sense of size, please check out https://github.com/topics/steampipe-plugin for example plugins. They are all open source!

Using plugins is very lightweight - I had 20+ installed for that talk. Installing a new plugin is just

steampipe plugin install aws

and it uses your default CLI creds (configurable, of course).

Question: I see that there are options for steampipe to output data in CSV and JSON formats rather than a table. Would these output formats be live as well, or is there support for a live format? It seems like feeding live output into a visualisation tool would be rewarding.

Answer: You can run steampipe in service mode:

~ $ steampipe service start
Steampipe database service is now running:
 Host(s):  localhost,,
 Port:     9193
 Database: steampipe
 User:     steampipe
 Password: 0068-EXAMPLE
Connection string:
Steampipe service is running in the background.
 # Get status of the service
 steampipe service status
 # Restart the service
 steampipe service restart
 # Stop the service
 steampipe service stop

So, just start it that way and then connect to it as you would any other Postgres database!

If you use our CLI, then you can set data formats as you say, and those come straight from the live data.

Question: I was wondering, if an API response changes (name/type/structure) and you update the plugin to take that change into account, does it also handle the Postgres table changes automatically? Does it create a new table for the updated plugin, or update the existing table?

Actually, just saw this in the documentation:

Steampipe tables provide an interface for querying dynamic data using standard SQL. Steampipe tables do not actually store data, they query the source on the fly

Answer: Correct ... it really focuses on the live data. Each query is effectively converting the source API data (usually JSON) into the SQL table columns.

Changes to table structure can be done with plugin versions. We're usually releasing new plugin versions each week right now with new tables, extra columns etc. (We're careful with deprecations.)

Dropping a few links here in case helpful to others:

  1. Nathan - https://twitter.com/nathanwallace

  2. Steampipe (CLI was in service mode during the talk) - https://steampipe.io

  3. GitHub repository for Steampipe (we'd love your help building plugins!) - https://github.com/turbot/steampipe

Nathan Wallace

