3 DAY CONFERENCE

YOW! Data 2020

Tue, 30th Jun - Thu, 2nd Jul, Online Conference

9 experts spoke.
Overview

We're delighted to present an online version of YOW! Data in 2020, featuring selected invited speakers from our face-to-face conference. YOW! Data is an opportunity for data professionals to share their challenges and experiences while our speakers share the latest best practices, techniques, and tools.

To avoid attendee "screen fatigue" and ensure we can accommodate reasonable time slots for both speakers and attendees across time zones, our online conference will take place across three days, with talks scheduled by theme (Data Science, Data Engineering, Machine Learning/AI) for a few hours each day. Talks will be delivered live and followed by a Q&A via chat. We’ll also schedule ample breaks so you have time to grab a coffee before networking and interacting with speakers and other attendees.

So although we can't meet face to face right now, you can still network, meet the experts, and hone your skills, all in one event!

Programme

How COVID-19 has Accelerated the Journey to Data-driven Health Decisions

The speed with which COVID-19 has taken over the world has raised the demand for data-driven health decisions, and the shift towards virtual care may actually enable the necessary data collection. This session discusses how CSIRO has leveraged cloud-native technologies to advance three areas of the COVID-19 response. Firstly, we worked with GISAID, the largest data resource for the virus causing COVID-19, and used standard health terminologies (FHIR) to help collect clinical patient data. This feeds into a Docker-based workflow that creates identifying “fingerprints” of the virus for guiding vaccine development and investigating whether there are more pathogenic versions of the virus. Secondly, we developed a fully serverless web service for tailoring diagnostic efforts, capable of differentiating between strains. Thirdly, we are creating a serverless COVID-19 analysis platform that allows distributed genomics and patient data to be shared and analysed in a privacy- and ownership-preserving manner, functioning as a surveillance system for detecting more virulent strains early.

Q&A

Question: Where is the most value a citizen data scientist can add in the fight against covid?

Answer: There are so many tasks to do that anyone who wants to chip in is certainly appreciated. I think the biggest value is in collecting and combining data sources. There are a lot of databases that could be tapped into, but converting between one format and another can be painstakingly slow, so having volunteers combine these would be a huge help.


Question: Is there any way to train the system to recognise all or most of the different ways to say (for example) "lost sense of smell" - or are there too many / takes too long?

Answer: Yes, this is basically what NLP and ontologies do. There is a huge body of work we can thankfully tap into, but if this data is collected well from the get-go, there is less room for error. So the magic lies in designing software that makes recording the data correctly as seamless as just typing in free text.
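
The answer above describes mapping free-text symptom descriptions onto standard terms. As a toy illustration only (the phrase list and the concept label below are invented, not drawn from any real ontology; production systems use standards such as SNOMED CT plus NLP matching), a minimal normaliser might look like:

```python
# Toy symptom normaliser: map free-text phrasings onto one canonical concept.
# The phrases and the concept label are invented for illustration; real
# systems use standard ontologies (e.g. SNOMED CT) plus NLP for the matching.

ANOSMIA_PHRASES = {
    "lost sense of smell",
    "loss of smell",
    "can't smell anything",
    "no sense of smell",
}

def normalise_symptom(free_text):
    """Return a canonical concept for a free-text symptom, or None if unknown."""
    cleaned = " ".join(free_text.lower().split())  # lowercase, squash whitespace
    if cleaned in ANOSMIA_PHRASES:
        return "anosmia"
    return None

print(normalise_symptom("  Lost   sense of smell "))  # anosmia
```

The point of the sketch is the design argument in the answer: capturing data against a controlled vocabulary at entry time leaves far less cleanup than parsing free text later.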


Question: Are those easy-to-use Folding@home COVID-19 efforts helping?

Answer: I am not a protein structure expert, so I don't know the extent to which it is used, or whether the same work could have been run on an HPC in 10 minutes. I know Vodafone came to researchers with the brief of coming up with something to use the mobile CPUs that sit idle overnight, so we came up with something - back then it was not very useful, but this might have changed since. Where it helps, though, is with awareness and the general idea of wanting to support science. If Folding@home evolves into you wanting to write a small Python script to combine two datasets, then it has helped in my books.


Question: It seems interpretable ML models are useful in this context, but a majority of tools are not. When is interpretability necessary for researchers?

Answer: It probably is not necessary for researchers, because people using ML are hopefully trained to a level where they can choose the right algorithm for the job, i.e. decide whether accuracy is important or whether they want to find out something about the data. Typically, we work with clinicians or biologists, and unless we have a mechanistic story (gene A works in this pathway to cause this molecular change), we don't get buy-in. So for establishing trust that the tools we build gain real insights, explainable ML is crucial.


Question: Do you see a data trust (independent & fiduciary steward of data) popping up in Australia connecting healthcare providers, insurers, cloud & platform vendors, researchers and citizens ?

Answer: That is essentially what MyHealth aims to achieve. It will be a while; there are more than just technical problems to work through, though.


Question: Are there mechanisms set up in your team to experiment with more relevant/new ML techniques or cloud tech to apply to genomics? If so, how do you keep up with that, on top of the research work you are doing?

Answer: We are on the boundary of applied/algorithmic ML, meaning that we are super interested in new ML if there is a strong reason to believe it will be better. E.g. we dabbled with deep learning but found that the hyperparameter optimization space is too large to get good results practically: for our application space, RF or LR outperforms it. But I guess once systematic hyperparameter search in the cloud becomes cheaper, this might change. So in short: we are always interested in new algorithms if there is a compelling argument for why they should be better. As for keeping up: yes, that is hard, especially with such a large volume of news/papers in which supposedly fundamentally new algorithms are just marketing.

Dr. Denis Bauer

Team Leader Transformational Bioinformatics
CSIRO

Applying Dynamic Embeddings in Natural Language Processing to track the Evolution of Tech Skills

Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models either need significant amounts of data, or tuning through transfer learning of a domain-specific vocabulary that is unique to most commercial applications.

In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets, which are specialized enough to require significant modifications of a word2vec model and contain more general data types (including categorical, count, and continuous). I will discuss how my team implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last few years. I will compare data science skill sets in US jobs vs Australian roles, specifically focusing on how tech and data science skill sets have developed, grown and pollinated other types of jobs over time.
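
To make the core idea concrete: a dynamic embedding model learns a separate vector for each term in each time slice, so a term's drift can be tracked by comparing its per-period vectors against probe terms via cosine similarity. A toy sketch with invented (not learned) vectors, purely to show the mechanics:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented 3-d embeddings for the term "data scientist" in two time slices,
# plus a fixed probe vector for "machine learning". In a real dynamic
# embedding model (e.g. Rudolph & Blei), these vectors are learned per period.
ds_2015 = [0.9, 0.1, 0.2]
ds_2019 = [0.5, 0.7, 0.3]
ml_probe = [0.4, 0.8, 0.4]

# A rising similarity to the probe over time suggests the term's usage
# has moved towards that skill area.
print(f"2015: {cosine(ds_2015, ml_probe):.2f}, 2019: {cosine(ds_2019, ml_probe):.2f}")
```

A real pipeline would fit these vectors on a corpus per time slice and then run exactly this kind of comparison at scale.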

Q&A

Question: Do you think Problem Solving as a skill is under-represented in the world of DS?

Answer: It’s hard to say. I am not sure problem-solving belongs in job descriptions because 1) it’s a skill that is mentioned in every job and 2) there isn’t an easy way to test for it without being biased. I do think that critical analysis is really important, as is being curious about a problem (and knowing when to stop digging!), but it’s hard to test for in an interview or process without creating lots of work for everyone involved and without it being completely artificial.


Question: Fascinating insight that the trend was declining in the US but rising in AU. Perhaps a lag in adopting the commoditized solutions in our tech markets?

Answer: That is exactly what I think. I think US corporates are also faster than Australian corporates in changing (some) systems, and the only way to know is to wait a little while longer so we can see how it plays out!


Question: I might have missed it, but what was the source of data for these studies, to make sure it is representative of the market and a fair comparison between the US and AU?

Answer: Great question, and I completely forgot to mention it. They were job descriptions from a variety of professional full-time jobs. We ran a job search site before we became a recruiting data tool, and this data was a random selection from that. The jobs were sourced from employers, and we ensured our data was clean (no duplications or jobs from job boards). There is a large variation in the companies represented, from small startups to SMBs and Fortune 500/public companies. It’s not 100%, but it seems like a pretty good sample.


Question: I wonder what other infrastructure-related skills appeared in DS roles (not just broadly AWS) - like in DevOps, where it's all Terraform/Chef/etc.

Answer: Before I lose access to my computer and go through the drama of restoring credentials (and crying a lot): I didn’t see much change in the more infrastructural skills in data roles, with the exception of AWS, which seems to be growing as a requirement. This is interesting because the results I showed from DevOps roles in Australia (in the thread from Birger’s first message) showed stronger growth for GCP; that GCP dominance is not the case in data roles. I have to caveat all of this with the idea that our coverage of Aussie jobs is not as deep as US jobs (and it’s a smaller market that doesn’t have the same velocity), so it’s hard to delve further.


Question: What advice would you give to people looking at studying in the DS area? Is it worth getting a degree and then post-grad qualifications?

Answer: Great question Gretchen. I have enjoyed the fact that data science hasn’t been an academic field per se (at least when I joined) so we have a variety of people from all sorts of different backgrounds and approaches. I think there are lots of different ‘feeder’ degrees in data science, including economics, bioinformatics, sociology. I hope we continue to steal people from these because a lot of these areas have developed impressive models that can be used in the industry.

I do think on a personal level though, having core skills in relational databases, some proficiency in statistics and domain expertise in an area (with the ability to ask questions about the data) is critical.


Question: Roughly where is the effort spent in this sort of analysis: % corpus; % topic model; % analysis; % visualization; % head-scratching?

Answer: The initial work was 30% corpus processing through to the model, which resulted in the key changing-words list; 30% testing and validation of important words and their synonyms/aliases; 30% head-scratching over the SQL results (and the results from the Large Corpus experiments!); and the remaining 10% was the analysis for Australia.


Question: Can you supply any links to references in relation to how dynamic word embeddings are performed?

Answer: Yes, I glossed over that in the interest of time and collective sanity:

Bamler and Mandt, arXiv:1702.08359

Yao, Sun, Ding, Rao and Xiong, arXiv:1703.00607

Rudolph and Blei, arXiv:1703.08052, and the Rudolph repo is http://bit.ly/dynbernemb


Question: Am wondering what intrinsic biases / blindspots / assumptions dynamics embeddings have and how you determine the confidence of topic evolution, like job skills you demonstrated?

Answer: I think biases of embeddings are problematic when you use them in an ML system (i.e. making decisions). It’s less (although not completely) of a problem when you are using the embedding to tell you how the world is, which is what we used it for.

In terms of confidence, we had a separate dataset, and we evaluated the top words/concepts on it to see if we could reproduce the trends. The important thing to note (which I completely forgot) is that the trending words don’t tell you the direction of the trend. So some things, like Hadoop, were decreasing and other things (like Tableau) were increasing, but we wouldn’t know until we checked the words against our independent dataset. All of the +/- % figures I showed were on the independent dataset using synonyms and extracted features.

My rule of thumb is that the dynamic embeddings didn’t highlight things that were super subtle (<20% change). You can change the parameters, but that will increase the false positive rate too.

Maryam Jahanshahi

Research Scientist
TapRecruit

The Data Literacy Revolution

The popularity and ubiquity of data science, data analytics, AI and the trend towards digital transformation have led to massive, repeated failures in many businesses. Despite billions spent, hundreds of Ph.D.s hired, and much boasting in conference presentations, many enterprises are still struggling to leverage the value of these new technologies. The missing ingredient is the literacy of the rest of the organisation, particularly senior management.

This presentation will describe this new literacy, “data literacy”, the analogy with computer literacy, and the reasons why this skill set will soon be as essential to all professionals as computer literacy is today. It will address issues of automation, the advent of decision making as the key managerial activity, and the resulting democratisation of AI and analytics, while still maintaining a class of data science and analytics experts. The presentation will address issues of mindset as well as skill set, and the ways in which management engagement with data analytics must change to leverage its value.

Q&A

Question: Do you see data literacy correlated with the culture of decision making and power distribution in an org?

Answer: My instinctive response would be “yes, of course”, but I would like to understand your question a bit better. I would also throw in the word “incentives”, which is a consequence of power and culture and directly influences the relationship with data-driven decision making.


Question: What factors lead to the tipping point for an org to seriously invest in data literacy? Compliance costs, regulations, customer attrition, bad PR, something else?

Answer: Seriously invest? Or actually take it seriously? It seems the last 20 years have been all about massive analytics investment without seriousness. But I would say what leads actual leaders to personally invest their time and effort into becoming data literate and engaged with analytics is falling into a simple trap: they have to make good decisions, bad decisions have bad consequences which they cannot escape/defer/dilute, and no decision is a bad decision.


Question: You mention lack of literacy and not being able to understand/make use of analytics. What about the perspective of not knowing how to/what to ask of analytics?

Answer: I would say “ask” and “make use of” are essentially one and the same.


Question: An interesting attribution exercise: how much of this is due to lack of knowledge or “illiteracy” and how much due to the sheer lack of curated and timely data?

Answer: Most of it is illiteracy.


Question: I've heard the opinion that 'Australia, USA, Canada and UK all use legal liability to ensure companies do the right thing. As such, companies really only have an obligation to demonstrate to a court that they did due diligence to make the decision.' Do you think that data analysis simply serves this role, to give the appearance of due diligence without actually having used the data properly?

Answer: What, another way to spend money on “data” without actually having to use it for decision making? I am sure that would have excited people with big budgets 6 months ago. Maybe even now. But 6 months from now? Dunno.


Question: Do you think MBAs of the future will change to include analytics in the development of the leaders?

Answer: Yep, plenty already have. Some of Melbourne's best analytics programs are actually offered by business schools (Melbourne Uni's Master of Business Analytics), and some of those courses are now also required for the MBA.


Question: Maybe executives and "analytics professionals" need more than just "data literacy"? How can you make good decisions if you don't understand what you are measuring (even if you do have the tech skills)? Have we even established that most people understand how their businesses operate?

Answer: This goes without saying. And I can say quite a bit about the lack of actual business skills, as distinct from PR / “getting promoted” skills of many folks. Nonetheless, the main thing preventing the realization of value from analytics is data literacy. But business literacy is a thing too.


Question: What is data literacy? How do we improve our data literacy for making informed decisions?

Answer: Learn critical thinking, statistics, science.


Question: Why do you think analytics matters more in current adverse circumstances than before?

Answer: Because making good decisions matters more. Because not making decisions is a recipe for disaster in a way it isn’t in good times. Because under competition there are only a few survivors, and they have to work very hard just to survive. And that work is thinking, decision work. And you will probably make better decisions if you use all available information and reason about it in the most effective way.

Dr Eugene Dubossarsky

Managing Partner of the Global Training Academy and Chief Data Scientist
AlphaZetta

Cluster-wide Scaling of Machine Learning with Ray

Popular ML techniques like reinforcement learning (RL) and hyperparameter optimization (HPO) require a variety of computational patterns for data processing, simulation (e.g., game engines), model search, training, serving, and other tasks. Few frameworks efficiently support all these patterns, especially when scaling to clusters.

Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales applications from a laptop to a cluster. It was created to address the needs of reinforcement learning and hyperparameter tuning, in particular, but it is broadly applicable for almost any distributed Python-based application, with support for other languages forthcoming.

I'll explain the problems Ray solves and how Ray works. Then I'll discuss RLlib and Tune, the RL and HPO systems implemented with Ray. You'll learn when to use Ray versus alternatives, and how to adopt it for your projects.

Q&A

Question: Does Ray broadcast to nodes when distributing state? If so, is that a bottleneck? And is it difficult to reason about when the broadcast might happen when writing code for Ray?

Answer: Ray doesn’t give you a lot of control (yet) over this fine-tuning of behavior. Mostly it tries to balance resources across a cluster. For example, an actor’s state is stored in a distributed object store (with some optimizations to skip it), so if another actor/task needs that state on another node, Ray will copy it over. That has the nice effect of co-locating the data with the compute over time. There are a few best practices, like not asking for results when you don’t need them (such as the results of those calls to make_array), but otherwise Ray is reasonably smart about where it schedules things, avoiding too many network hops to move data around.


Question: What are your thoughts on frameworks vs containerisation for MLOps?

Answer: Fortunately, you can use Ray for fine-grained distribution (at the MB level for example), and larger frameworks, like K8s, for macro-level scheduling. Ray has an autoscaler that people use to dynamically scale pods.


Question: I worked at a company that was reluctant to allow provisioning of cloud resources, so we made our own cluster from our work computers overnight.

Does Ray work with local commodity-hardware networks? It seems like Ray could have reduced a lot of the overhead work of coordinating our “poor-man's-clustering”.

Answer: Yes, it works great that way. That’s mostly how the Berkeley researchers worked at first, on donated clusters.


Question: Any comments on performance of Ray vs alternatives?

Answer: Two popular multithreading/multiprocessing libraries in Python are joblib and multiprocessing.Pool. Ray provides API-compatible replacements that are a bit slower on a single node, but break the node boundary if you want to scale beyond one machine. If you use asyncio, Ray works nicely with that too, as an alternative syntax to what I demonstrated.
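
As a sketch of the drop-in pattern being described (shown here with the stdlib multiprocessing.Pool; Ray's documented replacement, ray.util.multiprocessing.Pool, keeps the same API but can schedule the workers across a cluster):

```python
from multiprocessing import Pool

def square(x):
    """Stand-in for CPU-bound work to fan out across workers."""
    return x * x

if __name__ == "__main__":
    # With Ray installed, swapping the import above for
    # `from ray.util.multiprocessing import Pool` is, per Ray's docs,
    # the main change needed to run the same code on a cluster.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The design point is API compatibility: existing Pool-based code keeps working on one node and only the import changes when you want to break the node boundary.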

Dean Wampler

Dean Wampler, Ph.D., is the Architect for Big Data Products and Services in the Office of the CTO at Lightbend, where he focuses on the evolving “Fast Data” ecosystem for streaming applications based on the SMACK stack, Spark, Mesos, Akka (and the rest of the Lightbend Reactive Platform), Cassandra, Kafka, and other tools.

Stream Processing for Everyone with Continuous SQL Queries

About four years ago, we started to add SQL support to Apache Flink with the primary goal of making stream processing technology accessible to non-developers. An important design decision to achieve this goal was to provide the same syntax and semantics for continuous streaming queries as for traditional batch SQL queries. Today, Flink runs hundreds of business-critical streaming SQL queries at Alibaba, Criteo, DiDi, Huawei, Lyft, Uber, Yelp, and many other companies. Flink is obviously not the only system providing a SQL interface to process streaming data; there are several commercial and open source systems offering similar functionality. However, the syntax and semantics of the various streaming SQL offerings differ quite a lot.

In late 2018, members of the Apache Calcite, Beam, and Flink communities set out to write a paper discussing their joint approach to streaming SQL.
We submitted the paper "One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables" to SIGMOD - the world's no. 1 database research conference - and it got accepted. Our goal was to get our approach validated by the database research community and to trigger a wider discussion about streaming SQL semantics. Today, the SQL Standards committee is discussing an extension of the standard to pinpoint the syntax and semantics of streaming SQL queries.

In my talk, I will briefly introduce the motivation for SQL queries on streams. I'll present the three-part extension proposal that we discussed in our paper, consisting of (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, and (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Finally, I'll discuss how these concepts are implemented in Apache Flink and show some streaming SQL queries in action.

Q&A

Question: what's the best cloud-based solution to host Flink? Are there any SaaS solutions in a similar vein to Confluent Cloud?

Answer: Amazon's Kinesis Analytics for Java is backed by Flink and is a true cloud product. To my knowledge, most people run Flink on Kubernetes; YARN is another popular option, and a few run it on Mesos. There are a bunch of Kubernetes operators for Flink out there.


Question: Where do you see customers hosting Flink - is Kubernetes popular, or a cloud-specific container solution such as ECS on AWS? I'm also keen to find out more about Alibaba Cloud - I'd imagine there must be some customers using Flink at an incredible scale.

Answer: We (Ververica) have a product based on Kubernetes that manages Flink jobs and their lifecycle. We're currently working on SQL support for this product. Some users run Flink at a really large scale (10,000+ cores, 20+ TB of state), like Netflix or Alibaba.


Question: I see several use cases coming out of this. Perhaps I'm getting carried away, but what's your take on temporal joins, especially in the context of a serving layer? How are these joins resolved and updated between windows?

Answer: Temporal joins are a super interesting topic. We've identified 2 common patterns for temporal joins.

  • Interval joins: joins where a row of table A should be joined with the rows of table B that are not more than x minutes apart from it. You can imagine each row of A having a join window that depends on its own timestamp; the join connects the row with all rows of the other table that fall into that window.
  • Something we call an enrichment join: you have a (streaming) table and want to enrich its rows with data from another table (which is also changing, but usually more slowly). The tricky part is joining with the right version of the other table.

I guess you are referring to the second case. Flink supports these for lookup tables in external databases, but only with processing-time semantics so far (https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/streaming/joins.html#join-with-a-temporal-table).

The community is currently discussing how this could be supported with event-time semantics. By the way, my categorization of joins is based on use cases; both can be expressed with regular SQL but need to follow a specific pattern. For example, an "Interval Join" is expressed like this:

SELECT
 HOUR(r.eventTime) AS hourOfDay,
 r.psgCnt,
 AVG(f.tip) AS avgTip
FROM
 Rides r,
 Fares f
WHERE
 r.rideId = f.rideId AND
 NOT r.isStart AND
 f.payTime BETWEEN r.eventTime - INTERVAL '5' MINUTE AND r.eventTime
GROUP BY
 HOUR(r.eventTime), r.psgCnt;

Question: For the uninitiated, where do you see Flink fitting best compared to other frameworks and engines out there, e.g. Spark, Lightbend, Apex, etc.?

Answer: It's true, there are a bunch of stream processing systems out there, but the space seems to be consolidating a bit. The Apache Apex community decided to retire the project, and Apache Heron is also inactive. From my point of view, there are three major open-source distributed stream processing systems: Spark, Kafka Streams, and Flink. My impression is that the Spark community stopped investing in streaming features a while back. They support a feature set that's sufficient for real-time ETL with latencies of a few seconds, and this is what Spark does well. If you want to build more advanced stream processing applications or need lower latencies (e.g. for alerting, fraud detection, etc.), it's not the best choice IMO. I have to admit that I'm not super familiar with Lightbend's framework. Flink works well for big streaming problems with exactly-once correctness requirements and low latency. We run a community conference called Flink Forward and have collected ~350 talks on YouTube covering use cases.

Fabian Hueske

Co-Founder, Software Engineer
Ververica

Self-supervised Learning & Making Use of Unlabelled Data

The general supervised learning problem starts with a labelled dataset. It's common, though, to additionally have a large collection of unlabelled data. Self-supervision techniques are a way to make use of this data to boost performance. In this talk we'll review some contrastive learning techniques that can either be used to provide weakly labelled data or act as a form of pre-training for few-shot learning.
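
A minimal sketch of the contrastive idea the abstract mentions: an InfoNCE-style loss that is low when an anchor embedding agrees with its positive (e.g. another augmented view of the same example) and disagrees with negatives. The vectors and temperature below are invented for illustration; real implementations apply this to learned encoder outputs over large batches.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style contrastive loss: low when the anchor is close to its
    positive and far from the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    negs = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + negs))

# Toy embeddings: two "views" of the same example should agree.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]                     # another view of the anchor
negatives = [[-1.0, 0.0], [0.0, 1.0]]     # views of other examples

print(round(info_nce(anchor, positive, negatives), 3))
```

The choice of positive/negative pairs is exactly the design question raised in the Q&A below for tabular data: the loss itself is agnostic to the input modality.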

Q&A

Question: Are there any libraries you would recommend for self-supervision?

Answer: Not specifically; I see a lot of these techniques as being more about the orchestration of moving parts.


Question: Do you think there's a gap there for libraries that could replace the bespoke code or it just is what it is?

Answer: Yeah, absolutely! Transfer learning went through a similar thing a while back. It used to be tricky and clumsy to do but, as it became more and more common, the libraries abstracted more and more of it. Now it's trivial to do transfer learning. I think the same thing will happen for these techniques.


Question: Self-supervision has been applied to unstructured data (NLP, images, etc.). I haven't seen much work on tabular data beyond auto-encoding in TabNet (https://arxiv.org/abs/1908.07442), but it's an important data source at most companies.

What kind of techniques can be used on tabular data?

Answer: I think it's the same set of training techniques. The encoding part covers the inductive biases, e.g. modality, things like images vs text; but for the second part, from the encoding to the ypred, it no longer matters what the input was. I think for the self-supervision problem it's more about how you define the positive/negative pairs: which examples you want the encoding to be similar for, and which not.


Question: Where can I learn about self-supervised learning in detail? Any recommendations, particularly with reference to using it in the financial services sector?

Answer: Two interesting papers recently, I think, are https://arxiv.org/abs/2006.10029 & https://arxiv.org/abs/2006.07733. From these, I’d look back through common references and find what's interesting to you.


Question: On transfer learning, can we use encodings from a general domain to train more specific domains? For example, using an encoder trained on, say, a public feed from the internet to train on a highly specialized corpus, e.g. emails in a trading company?

Answer: Yes, as long as the specific domain is in some way a subset of the general domain. It's common for the large original model to deal in a wide set of labels, and you'll see a good result if your labels are in some way a subset. The best results come when the input data is similar; e.g. you might not see as good a result if the large model was trained on camera-phone images but you're trying to transfer to images taken by the Hubble Space Telescope.

Mat Kelcey

Machine Learning Principal
ThoughtWorks

How to Be a More Impactful Data Analyst

As the sole analyst in a fast-growing Australian startup, I experienced the pain of the traditional analyst workflow — stuck on a hamster wheel of report requests, Excel worksheets that frequently broke, an ever-growing backlog, and numbers that never quite matched up.

This story is familiar to almost any analyst. In this talk, I’ll draw on my own experience as well as similar experiences from others in the industry to share how I broke out of this cycle. You’ll learn how you can “scale yourself” by applying software engineering best practices to your analytics code, and how to turn this knowledge into an impactful analytics career.

Q&A

Question: I see trends to further expand roles into specialised areas (Analytics Engineer, Feature Engineer, DataOps Engineer). I guess I’m curious how “big” an organisation needs to be to warrant these specialisations?

Answer: I think we see specialization more around the “discipline” of analytics, e.g. “this analytics engineer is going to work with the product team”.


Question: It would be interesting to see how this progression leads to Data Scientist.

Answer: I did a data science course, and it was enough to teach me that data science wasn’t for me - definitely not knocking it, I think data scientists are incredibly talented, it just wasn’t the right fit!

Fortunately, we have a few different talks on how analytics engineering + data science complement each other! 

  • Predicting customer conversions using dbt + machine learning - Kenny Ning, Better.com
  • Using dbt in a machine learning pipeline - Sam Swift, Bowery Farming

I think data scientists spend a ton of time cleaning data and doing feature engineering (which I accidentally called “feature extraction”, oops!), and learning how to use these same patterns can free up time to do the “real” data science work


Question: When you transitioned from the Data Analyst role to Analytics Engineer, did you find that the Data Analyst role still had to be filled by someone else?

Answer: Not initially — the company was at a stage where the main need was just “reporting”. I moved to the US shortly after that, but have kept up a good relationship with the team there.

They experimented with encouraging stakeholders to self-serve their deep-dive analytics, but now I believe they are hiring specialist data analysts to do the real “why?”-style work


Question: Also, when you said you added tests to your queries/transformations, did you use dbt for this too?

Answer: Yep


Question: To follow up on the tests, did you find that the prebuilt tests in dbt were enough or did you have to build new ones to provide good coverage?

Answer: This is a pattern we’ve seen a few times at other companies. Normally a data analyst doesn’t have the bandwidth to do real analysis, because they are stuck churning out reports. It’s about solving the immediate need the right way, I guess. dbt tests get you 90% of the way there in my experience. I use a combination of the built-in tests + custom ones I write myself (which I normally open source — example). You can definitely use a complementary tool like Great Expectations with dbt though!
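To make the built-in/custom split concrete, here is a sketch of what dbt’s built-in schema tests look like in a `schema.yml` file (the model and column names below are hypothetical):

```yaml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique      # built-in: no duplicate keys
          - not_null    # built-in: no missing keys
      - name: status
        tests:
          - accepted_values:
              values: ['trial', 'active', 'churned']
```

A custom schema test is, roughly, a SQL query wrapped in a macro that returns failing records; dbt fails the test whenever the query indicates any failures.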


Question: Would you still say that the best way to store business rules is encoded within SQL? I've seen issues where these rules get buried deep within queries and people forget about them and don't know how to change them. That being said, I can't think of a better way to resolve the issue.

Answer: Writing your SQL queries in a maintainable and modular way goes a long way to solving this. For us, this means having the cleaning logic in one place (e.g., renaming columns, casting fields to the right timestamp type) and business logic in downstream models. Then adding descriptions and comments is a huge ++
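As a sketch of that split (table and column names here are hypothetical): a staging model that only cleans, and a downstream model that holds the business logic.

```sql
-- models/staging/stg_orders.sql: cleaning only -- rename columns, cast types.
-- No business logic lives here.
select
    id                            as order_id,
    user_id                       as customer_id,
    cast(created_at as timestamp) as ordered_at
from raw_app.orders
```

```sql
-- models/marts/customer_orders.sql: business logic, built on the cleaned model.
select
    customer_id,
    count(order_id) as lifetime_orders
from {{ ref('stg_orders') }}
group by 1
```

Because every downstream model reads from the cleaned staging layer via `ref()`, a renamed column or changed cast only ever needs to be fixed in one place.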


Question: What is the best link for people to learn more about dbt?

Answer: Best place to get started is our tutorial! All you need is a Gmail account to do it (it runs on BigQuery!)


Question: What were the unexpected things you found when you started thinking of your work as a product?

Answer: Things that spring to mind:

  • The amount of work it was to change other people’s minds that I was building a product — some people just want you to do the report for them (ofc many are happy they can do it themselves!).
  • Finding the balance between doing things the “quick and dirty” way versus the best-practice way. At times I’d sink too much time into building new data assets that weren’t used much, when I could have just done the rough version and gotten the same value.
  • Learning to not be too “clever”. Once I learned how to write Jinja (a templating language you can use in dbt, similar to Liquid), I wanted to make everything use Jinja. But that was a bad idea — too much Jinja is hard for others to read, especially compared to SQL.
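For illustration, here is roughly where the line sits: a small, targeted use of Jinja in a dbt model (the model and status values are hypothetical) that generates one pivot column per status and stays readable.

```sql
select
    customer_id
    {%- for status in ['placed', 'shipped', 'returned'] %},
    sum(case when status = '{{ status }}' then 1 else 0 end) as {{ status }}_orders
    {%- endfor %}
from {{ ref('stg_orders') }}
group by 1
```

Anything much more elaborate than a short, known loop like this is usually better written out as plain SQL.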

Question: What documentation did you find the most useful to add to help get others to use the products you were building? What didn’t work?

Answer: My rule of thumb — “document the things that don’t make sense”.

I have seen lots of companies try to document every single field, but sometimes you don’t really need to document that customer_id is “the primary key of the customers table”.


Question: Do you see data modelling (Vault, Kimball, Inmon, etc.) as a role for data engineering, the analytics engineer, or someone else?

Answer: We (Fishtown Analytics) take a “Kimball-lite” approach to data modeling — we’ve got conventions, but we’re not too strict about them. For us, it definitely falls on the AE. In bigger teams (e.g. the team at Casper — a mattress company here in the US), we see the analysts “spec out” the models, applying Kimball designs, while the analytics engineers translate them to code.


Question: How do you manage expectations during the process of tidying up your analytical pipelines? I guess you get more requests from stakeholders, but you had to select a few to accommodate your agenda?

Answer: It’s always an uphill battle for sure! Some tricks I have seen used:

  • Winning one stakeholder over at a time — find the person who has the most manual data process (usually they are in finance, and it involves getting CSVs emailed to them and putting them in Excel), and work with them to improve their process (by moving some of the work over to your transformation pipeline). Then, when their life is easier, they’ll (hopefully) become an advocate
  • Communicate widely about the wins you have! Sometimes you just gotta share the wins so people realize you’re working on something
  • Having dedicated office hours for ad-hoc questions. Especially if you have a Slack channel for it where people can drop questions, and then once every other day you’ll address those questions (rather than answering in real time)
  • Telling someone that if they need their work done this sprint, they have to be the one to inform the party who will miss out (amazing how many requests can suddenly wait until next sprint)

Question: I wonder if anomaly detection of data under this framework would fall under the responsibility of the analytics engineer then? I've known of companies that have data engineers managing anomaly detection instead

Answer: In my opinion, this falls within analytics engineering! You can add source tests in dbt that will find most anomalies. But the lines between each role are pretty blurry — some AEs are good at doing analytics work too, some are more comfortable writing Python (I’m in the latter category these days!). So, like lots of things, “it depends”
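For example, dbt source tests can be declared alongside a freshness check, which catches stale or broken loads, a common class of anomaly. The source, table, and column names below are hypothetical:

```yaml
version: 2

sources:
  - name: app
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
```

Running `dbt source snapshot-freshness` and `dbt test` then surfaces stale loads, duplicates, and nulls before any downstream model is built.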


Question: How would you manage stakeholders that give you vague “bug reports” like “this data looks fishy?”

Answer: Force them to create an issue in your dbt repo. Add an issue report template (https://github.com/fishtown-analytics/dbt-init/blame/master/starter-project/.github/issuetemplate.md) so that they have to write it well.

Claire Carroll

Analytics Engineer
Fishtown Analytics

Apache Pulsar: The Next Generation Messaging and Queuing System

Apache Pulsar is the next generation messaging and queuing system with unique design trade-offs driven by the need for scalability and durability. Its two-layered architecture, separating message storage from serving, led to an implementation that unifies the flexibility and high-level constructs of messaging, queuing and lightweight computing with the scalable properties of log storage systems. This allows Apache Pulsar to be dynamically scaled up or down without any downtime. Using Apache BookKeeper as the underlying data storage, Pulsar guarantees data consistency and durability while maintaining strict SLAs for throughput and latency. Furthermore, Apache Pulsar integrates Pulsar Functions, a lambda-style framework for writing serverless functions that natively process data immediately upon arrival. This serverless stream-processing approach is ideal for lightweight processing tasks like filtering, data routing and transformations. In this talk, we will give an overview of Apache Pulsar and delve into its unique architecture for messaging, storage and serverless data processing. We will also describe how Apache Pulsar is deployed in real use cases and explain how end-to-end streaming applications are written using Pulsar.
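As a sketch of the “lightweight processing” idea: in Pulsar’s native Python style, a function is simply a callable named `process` that receives each message’s payload; returning a value publishes it to the function’s output topic, while returning `None` drops the message. The cleaning logic below is a hypothetical example, not from the talk.

```python
def process(input):
    """Hypothetical lightweight transform: trim, lowercase, and drop blank payloads."""
    text = input.strip()
    if not text:
        return None  # nothing is published downstream for this message
    return text.lower()
```

Because the function is just ordinary Python, it can be unit-tested locally before being deployed to a Pulsar cluster.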

Q&A


Question: Are brokers completely stateless? Is there any overhead (e.g. rebalancing etc) when scaling up and down the number of brokers behind your load balancer?

Answer: Yes, it is more or less instantaneous.


Question:  Does Pulsar have a platform to run on in a similar vein to confluent cloud so you don't have to manage deployments?

Answer: check out https://kesque.com/


Question: Getting started with Pulsar on Kubernetes with the Helm Chart, for the persistent volume claims, does it automatically re-balance and you add pods?

Answer: yes, it automatically rebalances


Question: Are the Pulsar Functions where a lot of the data cleaning/formatting would happen before data enters a dataset, e.g. for training?

Answer: Definitely, and you can serve models using Pulsar Functions too.


Question: Does Pulsar sit on top of Zookeeper in the same way that Kafka does (or did)?

Answer: Pulsar uses ZK, but very minimally - just some metadata. Kafka uses ZK very intensively.


Question: Do you have magic dust which keeps Zookeeper running?  Many I know see it as the weak link.

Answer: Use ZK minimally; that's what Pulsar does with ZK - just the metadata.


Question: The companies currently using Pulsar seem to mostly be large (Chinese) companies with ridiculous amounts of traffic by Australian standards. Do you think there is a certain scale beyond which Pulsar becomes practical, or can it be used in a cost-effective way by smaller players?

Answer: Lots of US and European companies use it as well; they just have not talked about it in public. A big car manufacturer uses it for getting data from its cars and publishing it for internal consumption.

Karthik Ramasamy

Senior Director of Engineering
Splunk

Data Maturity Levels

At a startup, typically the main concern is survival. Advanced analysis techniques and machine learning are often a luxury, or even a distraction from the prime directive - don't die. However, as a startup grows, the data requirements evolve and eventually the startup morphs into a larger company where data is a core competitive advantage that drives decision making and product features.

In this talk, I describe what this evolution looks like and provide a framework for evaluating the different data maturity levels a company may be at. This framework can be applied not only to a growing company, but also to a team or department within an already established company.

Q&A

Question: Onboarding of data has never been easier, how do you control it and do it in a structured manner? I'm hearing about Data Product Managers with businesses - is this a role that you have in Canva?

Answer: For third-party data sources, there are typically tools and services that do handle it somewhat and give you structured data. However, the ELT approach mostly helps here, since you can load data into a warehouse and let the consumers tease out the structure they need.

We don’t have a Data PM at the moment at Canva. I think for us, the Data Analysts are producing “datasets”, which are the product they provide internally, so they are effectively the Data PM. If we started to build products with data in a more general sense, we’d probably look at getting a Product Manager.


Question: As companies evolve (grow, pivot etc), do you see them iteratively go through these levels over and over again?

Answer: Very much so. Companies often iterate within and across levels.


Question: Can the maturity levels happen in parallel? ie. like level 3 - Analysis and 4 - learning?

Answer: Yes. I would put Canva at a solid Level 3, but sprinklings of Level 4 for example. Some teams are definitely at different levels. I think we would prefer if everyone tried to operate at higher levels, but it will depend if there is a data-literate PM or member on the team building the feature. We certainly see teams adding more analytics as they’ve launched something as they realize they have blindspots or have a desire for more data.


Question: You mentioned in level 2 too much data, not thinking through the whole architecture well when bringing multiple data sources in. I've always found it hard to account for a huge scope of data. I prefer to just start with one piece - somewhat think out the architecture for a smaller piece, and deal with consequences later (however large) as they arise. Any thoughts on how to improve this?

Answer: It really is a big challenge. The iterative approach is definitely the best start, but getting buy-in from peers/managers is also key. It is important to get people to understand that you’re thinking about the best long-term outcome while also addressing the short-term needs. Saying “no” is tough, but often “not yet” works OK. It helps to have somebody who can explain why you want to start with only 1 or 2 datasets, so you can work towards a more long-term approach without too much cost. Reprocessing/remodelling 1 or 2 datasets isn’t too bad if you need to do it to guide a better long-term plan; after you’ve loaded 10s-100s of datasets, it becomes a big problem.


Question: Can you talk about how Canva has gone through the maturity levels and where it is at the moment?

Answer: At the moment, I place Canva at a solid Level 3 and with individual projects attempting Level 4. Some successful, some failures (but the failures are necessary for learning). When I joined Canva over 3 years ago, I’d say we were at Level 1 “Not enough data” and Level 2 “Too much data”. In some areas (mostly third-party data sources) we had huge gaps. In other areas (mostly our own 1st-party data) we had too much data (both in terms of scale and variety) to know where to look.


Question: Do you prefer building for maturity level, or building for business needs and increase maturity level as a side effect?

Answer: The way I think about it is that the levels give you somewhere to start a conversation. If somebody tells you we need “Level 4” AI as a business need, I think that’s an indication of some serious unrealistic expectations or misalignment in the business/organisation. Unless the stakeholder sponsoring the “Level 4” business need is willing to go on the journey and level up through all the preceding levels, I think the project has a very high chance of failure.


Question: I’ve had a vague idea of data maturity levels and turned down job offers because a company was trying to hire analysts and data scientists when they were effectively at level 0/1. Hiring for a role before they have the maturity to need that role.

Anyone else use a data maturity model in assessing if they should take a job or not?

Answer: Joining a company at L0/1 is often great for learning (if you’ve got enough experience or support to learn from). Otherwise, if you haven’t got the time/motivation/experience for the challenges and pain of working in an environment where you’ll be hired as a DS/DA but sometimes have to wear many other hats, then you probably want to think about L3 or above.

Greg Roodt

Data Engineering Lead
Canva
