YOW! Data is a two day conference that provides in-depth coverage of current and emerging technologies in the areas of Big Data, Analytics and Machine Learning.
The number of data generators (drones, cars, devices, home appliances, gaming consoles, online services, medical devices and wearables) is increasing rapidly, and with that comes increased demand for smarter solutions for gathering, handling and analysing data. YOW! Data 2019 will bring together leading industry practitioners and applied researchers working on data-driven technology and applications.
Which Plane Woke Snowy the Cat?
Our new cat, Snowy, is waking early. She is startled by the noise of jets flying over our house.
This talk describes how common radio receivers can be configured to gather aircraft transponder signals. With an open-source data streaming framework (Apache Kafka) we can build a streaming data pipeline to process aircraft movements in real time.
With this data and some spatial visualisations, we can determine exactly which plane woke Snowy the cat.
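As a rough, hedged illustration of the kind of pipeline described (not the speaker's actual code), here is a minimal sketch that polls a dump1090-style receiver feed and publishes aircraft records to a Kafka topic with the kafka-python client; the endpoint, topic name and record fields are assumptions.

```python
# Minimal sketch (assumptions: a dump1090-style receiver exposing aircraft.json,
# and a local Kafka broker). Illustrative only.
import json
import time

import requests
from kafka import KafkaProducer

RECEIVER_URL = "http://localhost:8080/data/aircraft.json"  # assumed receiver endpoint

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Poll the radio receiver's JSON feed for currently visible aircraft.
    aircraft = requests.get(RECEIVER_URL, timeout=5).json().get("aircraft", [])
    for plane in aircraft:
        # Each record carries the transponder (ICAO hex) code plus position/altitude when
        # available; keying by aircraft lets downstream consumers window per plane.
        producer.send("aircraft-positions", key=plane.get("hex", "").encode(), value=plane)
    producer.flush()
    time.sleep(5)
```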
Sketch algorithms
In this talk we will look at how to efficiently (in both space and time) summarise large, potentially unbounded streams of data by approximating the underlying distribution using so-called sketch algorithms. The main approach we will look at is summarisation via histograms. Histograms have a number of desirable properties: they work well in an online setting, are embarrassingly parallel, and are space-bound. They also capture the entire (empirical) distribution, something that often gets lost when doing descriptive statistics. Building on that, we will delve into the related problems of sampling in a stream setting and updating in a batch setting, and highlight some cool tricks such as capturing time dynamics via data snapshotting. To finish off, we will touch on algorithms for summarising categorical data, most notably the count-min sketch.
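To make the count-min sketch mentioned above concrete, here is a small, self-contained Python sketch of the core update and query operations; the width, depth and hashing scheme are illustrative choices, not anything from the talk.

```python
# Illustrative count-min sketch: estimates item counts in a stream using a small,
# fixed-size table. Over-estimates are possible; under-estimates are not.
import hashlib


class CountMinSketch:
    def __init__(self, width=2048, depth=5):
        self.width = width          # columns per row: more width, less over-counting
        self.depth = depth          # independent hash rows: more depth, higher confidence
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._hashes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The true count is at most the minimum over all rows.
        return min(self.table[row][col] for row, col in self._hashes(item))


cms = CountMinSketch()
for word in ["cat", "dog", "cat", "bird", "cat"]:
    cms.add(word)
print(cms.estimate("cat"))  # 3 (possibly more if collisions occur)
```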
Simon Belak
Simon built his first computer out of Lego bricks and learned to program soon after. Emergence, networks, modes of thought, limits of language and expression are what makes him smile (and stay up at night). The combination of lisp and machine learning put him on the path of always striving to make himself redundant if not outright obsolete.
From Sparse Data-sets to Graphs: When Explicit Relationships Bridge the Gaps
The way we frame a problem has a significant impact on how we approach it, from the questions we ask to the tools we use. A simple change of framing can have great repercussions and significantly benefit how we tackle the issue at hand.
At EB Games, in trying to develop a customer segmentation/targeting model, we struggled with traditional clustering algorithms on tabular data. All the necessary information was there, but the resulting datasets were so sparse that results were difficult to come by. When we changed to a graph-based model, it greatly increased the ease with which we could ask questions and add further detail. The power afforded to us by having explicit relationships meant that suggestions and ideas from subject matter experts were more easily translated into something that could be quantified and/or qualified.
In this presentation I will share the process and journey of this project and provide insights on the benefits gained from using a different structure to store and analyse your data.
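As a hedged illustration of the reframing described (the entities and relationships below are hypothetical, not EB Games' actual model), the same information that is sparse as a table becomes easy to query once customers, products and categories are explicit nodes and edges:

```python
# Toy example: a sparse customer x product table re-expressed as a graph,
# so questions become traversals rather than joins over mostly-empty columns.
import networkx as nx

G = nx.Graph()

# Hypothetical entities and relationships.
G.add_edge("customer:alice", "product:console_x", kind="purchased")
G.add_edge("customer:bob", "product:console_x", kind="purchased")
G.add_edge("customer:bob", "product:game_y", kind="purchased")
G.add_edge("product:game_y", "category:rpg", kind="belongs_to")

# "Who else bought what Alice bought?" becomes a two-hop neighbourhood query.
alice_products = set(G.neighbors("customer:alice"))
similar_customers = {
    n
    for p in alice_products
    for n in G.neighbors(p)
    if n.startswith("customer:") and n != "customer:alice"
}
print(similar_customers)  # {'customer:bob'}
```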
Game Engines and Machine Learning: Training a Self-Driving Car Without a Car?
Are you a scientist who wants to test a research problem without building costly and complicated real-world rigs? A self-driving car engineer who wants to test their AI logic in a constrained virtual world? A data scientist who needs to solve a thorny real-world problem without touching a production environment? Have you considered AI problem solving using game engines?
No? This session will teach you how to solve AI and ML problems using the Unity game engine, and Google’s TensorFlow for Python, as well as other popular ML tools.
In this session, we'll show you ML and AI problem solving with game engines. Learn how you can use a game engine to train, explore, and manipulate intelligent agents that learn.
Game engines are a great place to explore ML and AI. They're wonderful constrained problem spaces, tiny little ecosystems for you to explore a problem in. You'll learn how to use them even if you're not a game developer; no game development experience is required!
In this session, we’ll look at:
- how video game engines are a perfect environment to constrain a problem and train an agent
- how easy it is to get started, using the Unity engine and Google’s TensorFlow for Python
- how to build up a model, and use it in the engine, to explore a particular idea or problem
- PPO (proximal policy optimisation) for generic but useful machine learning
- deep reinforcement learning, and how it lets you explore and study complex behaviours
Specifically, this session will:
- teach the very basics of the Unity game engine
- explore how to set up a scene in Unity for both training and use of an ML model
- show how to train a model with TensorFlow (and Docker), using the Unity scene
- discuss the use of the trained model, and potential applications
- show you how to train AI agents in complicated scenarios and make the real world better by leveraging the virtual
We’ll explore fun, engaging scenarios, including virtual self-driving cars, bipedal human-like walking robots, and disembodied hands that can play tennis.
This session is for non-game developers to learn how they can use game technologies to further their understanding of machine learning fundamentals, and solve problems using a combination of open source tools and (sadly often not open source) game engines. Deep reinforcement learning using virtual environments is the beginning of an exciting new wave of AI.
It’s a bit technical, a bit creative.
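The session itself uses Unity ML-Agents and PPO; as a much-simplified stand-in for the idea of training an agent inside a constrained simulated environment, here is a tabular Q-learning loop on a toy one-dimensional "driving" task. The environment, rewards and hyperparameters are entirely hypothetical and are not the Unity setup.

```python
# Toy stand-in for "train an agent in a simulated world": tabular Q-learning on a
# 1-D track where the agent must drive from cell 0 to cell 9. Real Unity/PPO setups
# replace this hand-rolled environment and table with a game engine and a neural policy.
import random

TRACK_LENGTH = 10
ACTIONS = [-1, +1]  # reverse, forward


def step(state, action):
    next_state = max(0, min(TRACK_LENGTH - 1, state + action))
    reward = 1.0 if next_state == TRACK_LENGTH - 1 else -0.01
    done = next_state == TRACK_LENGTH - 1
    return next_state, reward, done


q = [[0.0, 0.0] for _ in range(TRACK_LENGTH)]  # Q[state][action_index]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(500):
    state, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < epsilon else q[state].index(max(q[state]))
        next_state, reward, done = step(state, ACTIONS[a])
        # Standard Q-learning update toward reward plus discounted best future value.
        q[state][a] += alpha * (reward + gamma * max(q[next_state]) - q[state][a])
        state = next_state

print([row.index(max(row)) for row in q])  # learned action per cell (1 == forward)
```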
New Technologies to the Rescue of Epileptics
Creating a device that can predict epileptic seizures to help patients regain autonomy in their everyday lives?
That's the amazing project we are working on with the association Aura.
Epilepsy is one of the most common neurological diseases. It remains a huge scourge for patients, and no definitive cure has yet been found. Seizures are so unpredictable that they have disastrous consequences on patients' lives and autonomy.
We are creating a mobile app to detect and warn patients when a seizure is coming, using technologies such as Airflow, Docker, Grafana and Ansible.
We will cover the full data architecture of the project and present the tremendous work done by everyone who has contributed to this meaningful project.
The project is now open source and available for anyone to contribute to.
#TechForGood #TechForCare
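As a hedged sketch of how a pipeline like this might be orchestrated with Airflow (the task names and placeholder functions below are hypothetical, not the Aura project's actual DAG; import paths vary slightly between Airflow versions):

```python
# Hypothetical Airflow DAG for a seizure-detection pipeline (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_ecg_data(**context):
    """Pull raw wearable/ECG readings into storage (placeholder)."""


def extract_features(**context):
    """Compute heart-rate-variability features from the raw signal (placeholder)."""


def run_detection_model(**context):
    """Score the features and raise an alert if a seizure is predicted (placeholder)."""


with DAG(
    dag_id="seizure_detection_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="*/5 * * * *",  # every five minutes
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_ecg_data", python_callable=ingest_ecg_data)
    features = PythonOperator(task_id="extract_features", python_callable=extract_features)
    detect = PythonOperator(task_id="run_detection_model", python_callable=run_detection_model)

    ingest >> features >> detect
```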
Making The Black Box Transparent: Lessons in Opacity
Deep Learning is all the rage now. It is powerful, and it is cheap. To proponents of "explainable" machine learning, however, this is not really good news: deep learning is essentially a black box that one can't look into.
To be sure, there are efforts to peek inside the black box to see what it's learning; saliency maps and various visualisation tools are useful for understanding what is going on inside deep learning neural networks. The question, of course, is whether it's worth it.
In this talk I shall cover the basics of looking into a deep neural network and share a different approach to peering inside.
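One of the "peeking" techniques mentioned, the saliency map, boils down to taking the gradient of a class score with respect to the input pixels. A hedged TensorFlow 2 sketch of that mechanic (the untrained stock model and random image are placeholders):

```python
# Minimal gradient-based saliency map: which input pixels most affect the top class score?
# A stock Keras architecture with untrained weights is used purely to show the mechanics.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)
image = tf.convert_to_tensor(np.random.rand(1, 224, 224, 3).astype("float32"))  # stand-in image

with tf.GradientTape() as tape:
    tape.watch(image)
    predictions = model(image)
    top_class_score = tf.reduce_max(predictions, axis=-1)

# Gradient of the winning class score w.r.t. every input pixel, collapsed over channels.
gradients = tape.gradient(top_class_score, image)
saliency = tf.reduce_max(tf.abs(gradients), axis=-1)[0]
print(saliency.shape)  # (224, 224) heat map of pixel importance
```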
Look at GDPR from a Data Engineering Perspective
GDPR has been discussed frequently over the past year, and many businesses need to invest more in R&D to become GDPR compliant. This talk discusses the concepts, challenges, analytical needs, ideal system setup and a possible roadmap for existing projects from a data engineering perspective. It is meant to be a conversation starter around how we address the needs of customer service, application debugging, event sourcing and analytics within the GDPR context, rather than a step-by-step guide.
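One concrete pattern that often comes up in GDPR-oriented data engineering discussions (not necessarily the approach this talk takes) is keyed pseudonymisation: events carry only a keyed hash of the user identifier, so analytics remain possible while the link back to a person can be controlled, and erased, separately. A minimal sketch, with hypothetical names:

```python
# Illustrative pseudonymisation of user identifiers before events land in the warehouse.
# Deleting a user's entry from the (separately stored) key table renders their historical
# events unlinkable -- one common pattern discussed for GDPR erasure requests.
import hashlib
import hmac
import secrets

user_keys = {}  # user_id -> per-user secret; in practice a secured key store


def pseudonymise(user_id: str) -> str:
    key = user_keys.setdefault(user_id, secrets.token_bytes(32))
    return hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()


def forget_user(user_id: str) -> None:
    # "Crypto-erasure": dropping the key breaks the link to all stored pseudonyms.
    user_keys.pop(user_id, None)


event = {"user": pseudonymise("user-123"), "action": "page_view"}
print(event)
forget_user("user-123")
```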
From Zero to Tensorflow: Building an Analytics Dept.
Day 1: one engineer vs. a heap of time-series data on a 1990s-era database
Four years on, there are eight of us running TensorFlow analytics on a Hadoop cluster to detect subtle signs of potential breakdowns in earthmoving equipment. We've prevented million-dollar component failures and reduced a lot of "parasite" stoppages.
This talk details the strategy and lessons learned from building an analytics department from scratch, in particular:
- Many analytics departments are created as a "flavour of the month". How do you deal with this perception, survive and go beyond it?
- Choosing the right projects to create a credible and sellable offering as quickly as possible to build your reputation.
- Expectation management and choosing projects: dealing with those who think "it won't work", and those who think you can solve all problems.
- Growing from a "start-up in a large company" to a more mature group. Change management, scaling, velocity, etc.
- Approach to R&D and launching new projects, dealing with the "shiny toys"
Practical Geometric Deep Learning in Python
Geometric Deep Learning (GDL) is a fast-developing machine learning specialisation that uses the network structure underlying the data to improve learning outcomes. GDL has been successfully applied to problems with network-structured data in various domains, such as social science, medicine, media and finance.
Inspired by the success of neural networks in domains such as computer vision and natural language processing, the core component driving GDL is the graph convolution operator. This operator is used as the building block for deep learning models applied to networks. This approach takes advantage of many algorithmic and computational developments from modern neural network research and practice – such as composability, optimisation, and end-to-end training – to improve predictive performance.
However, there is a lack of tools for geometric deep learning targeting data scientists and machine learning practitioners.
In response, CSIRO’s Data61 has developed StellarGraph, an open source Python library. StellarGraph implements a number of state-of-the-art methods for GDL with a clean and consistent API. Furthermore, StellarGraph is designed to make the application of GDL algorithms to network-structured data easy to integrate with existing machine learning workflows.
In this talk, we will start with an overview of GDL and its real-world applications. We will then introduce StellarGraph with a focus on its design philosophy, API and analytics workflow, and demonstrate StellarGraph's flexibility and ease of use for developing solutions targeting important applications such as product recommendation and social network moderation. Lastly, we will touch on the challenges of designing and implementing a library for a fast-evolving machine learning field.
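To give a feel for the workflow, here is a hedged node-classification sketch in the style of StellarGraph's 1.x API (exact class and method names may differ between library versions, and the toy dataframes are placeholders, not an example from the talk):

```python
# Hedged sketch of a node-classification workflow with StellarGraph's GCN layers
# (StellarGraph 1.x-style API; check the library docs for the current names).
import pandas as pd
from tensorflow.keras import Model, layers, optimizers

from stellargraph import StellarGraph
from stellargraph.layer import GCN
from stellargraph.mapper import FullBatchNodeGenerator

# Toy graph: node features plus an edge list.
nodes = pd.DataFrame({"f1": [0.1, 0.7, 0.3]}, index=["a", "b", "c"])
edges = pd.DataFrame({"source": ["a", "b"], "target": ["b", "c"]})
labels = pd.get_dummies(pd.Series({"a": "x", "b": "y", "c": "x"})).astype("float32")

G = StellarGraph(nodes=nodes, edges=edges)
generator = FullBatchNodeGenerator(G, method="gcn")

gcn = GCN(layer_sizes=[8, 8], activations=["relu", "relu"], generator=generator)
x_inp, x_out = gcn.in_out_tensors()
predictions = layers.Dense(labels.shape[1], activation="softmax")(x_out)

model = Model(inputs=x_inp, outputs=predictions)
model.compile(optimizer=optimizers.Adam(0.01), loss="categorical_crossentropy")
model.fit(generator.flow(labels.index, labels.values), epochs=20, verbose=0)
```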
Scaling Analytics as your Company Grows
Analysts are a bottleneck: they can't answer all the questions coming from their business users.
Business users are heavily dependent on their analysts; as a result, when an analyst is not available they either wait a long time or act on gut feeling.
Analysts are feeling frustrated because they are underutilised: most of their tasks require simple querying and dashboarding, while they want to do data science.
Does any of this sound familiar? Then you should join this talk.
Back in the day, when Atlassian had only a few hundred employees, we used to hire analysts to help the business teams with insight generation. As we grew, we hired more of them, but we came across these problems and realised that this approach was not scalable.
During this talk, I will show how we solved these problems. We will see Atlassian's journey towards self-serve analytics and a data-driven culture.
Transparent Government: The Stories we can Tell with Data
There is an increasing and powerful global push to open up the trove of information governments generate, collect, and manage. There is a vast array of data, ranging from open data and big data from multiple sources, to sensitive information about citizens and complex information about businesses' interactions with government, such as contracts and procurement, taxes and royalties.
This creates many opportunities to use data to tell stories. Helping citizens and the private sector understand what is happening across government requires not just access to this data, but tools that allow people to dig through and analyse it, use diagrams and maps to make sense of it, and connect information in a way that is understandable, engaging and useful to the general public.
This talk shows some of the possibilities available today and looks at what may be possible as governments around the world open up. It shows some of the ground-breaking work done in Australia. The talk touches on issues around transparency, accountability, systems of protection, data ethics frameworks, and ultimately how to build trust.
Throughout, we'll see some of the work by Nook Studios building systems for government, including the ground-breaking Common Ground mining title information system, as well as tools that help link and connect information in a meaningful way using data pathways.
Working with Large Numbers of Non-Trivial ETL Pipelines
Data pipelines need to be flexible, modular and easily monitored. They are not just set-and-forget. The team that monitors a pipeline might not have developed it and may not be experts on the dataset. End users must have confidence in the output.
This talk is a practical walkthrough of a suggested pipeline architecture on AWS using Step Functions, Spot Instances, AWS Batch, Glue, Lambda and Datadog.
I'll be covering techniques using AWS and Datadog, but many of the approaches are applicable in an Apache Airflow/Kibana environment.
Data Driven Diversity
Since January, we have RSVPd “yes” 26,610 times to tech Meetups in Brisbane*. That’s a lot of pizza.
Every Monday, I run a script which posts in the #meetup channel of the Brisbane Developers Slack group. It's a simple Node script that calls the Meetup API, and lists every tech event in Brisbane for the following week. The script was conceived from curiosity, a want to share information, and because I'm a stats nerd.
Apart from writing code, I also co-host two Brisbane tech Meetups: Women Who Code and CTO School. In 2018, these user groups grew from a humble handful of regulars to almost standing room only.
I will share with you a statistical analysis of a year's worth of Brisbane tech Meetup data (updated for YOW! Data 2019), the secret life of a Meetup organiser, and how the transparency of information (such as speaker gender ratio) has started to effect change in this community, for the better.
*data from Jan 1 to Sept 30, 2018
Auto feature engineering - Rapid feature harvesting using DFS and data engineering techniques
As machine learning adoption permeates many business models, so does the need to deliver models at a much faster rate. Feature engineering is arguably one of the core foundations of the model development cycle. While approaches like deep learning take a different path to feature engineering, it is no exaggeration to say that feature engineering is the core construct that can make or break a classical machine learning model. Automating feature engineering would immensely shorten the time to market for classical machine learning models.
Deep Feature Synthesis (DFS) is an algorithm implemented in the Featuretools Python package. DFS helps in rapidly harvesting new features by taking a stacking approach on top of a relational data model, and it has first-class support for time dimensions as a fundamental construct. These factors make Featuretools a compelling library for data practitioners. However, the base algorithm itself can be enriched in multiple ways to make it truly appealing for many other use cases. This session will present a high-level summary of the DFS algorithm, followed by enhancements that can be made to the Featuretools library to enable it for many other use cases.
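A hedged sketch of what DFS looks like in practice with Featuretools (argument names differ a little between library versions, and the transactions data below is a toy placeholder, not an example from the talk):

```python
# Illustrative Deep Feature Synthesis with Featuretools: stack aggregation primitives
# across a relationship to harvest per-customer features automatically.
import featuretools as ft
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": ["c1", "c1", "c2", "c2"],
    "amount": [20.0, 35.0, 5.0, 60.0],
    "transaction_time": pd.to_datetime(
        ["2019-01-01", "2019-01-05", "2019-01-02", "2019-01-07"]
    ),
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions,
    index="transaction_id",
    time_index="transaction_time",
)
# Deriving a "customers" dataframe gives DFS a relationship to stack aggregations over.
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="customers",
    index="customer_id",
)

# DFS produces features such as SUM(transactions.amount) or MEAN(transactions.amount)
# per customer, up to the requested stacking depth.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    max_depth=2,
)
print(feature_matrix.head())
```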
Building a Scalable Data Science Pipeline at REA
REA Group is a multinational digital advertising company specialising in property, best known for realestate.com.au.
REA has a 5+ year history of using machine learning to segment and profile consumer intent; for example, determining whether a user on our site is most likely a buyer, seller, renter or investor. While we have had success in applied data science, the journey from ideation to a shipped product has traditionally taken a considerable amount of time.
This talk will explore how REA rebuilt its data science pipeline to optimise data scientist autonomy. The focus will be on the technical solutions and social challenges faced by engineering and data science teams.
Custom Continuous Deployment to Uncover the Secrets in the Genome
Reading the genome to search for the cause of a disease has improved the lives of many children enrolled in clinical trials. However, to convert research into clinical practice requires the ability to query large volumes of data and find the needle in the haystack efficiently. This is hampered by traditional server- and database-based approaches being too expensive and unable to scale with accumulating medical information.
We hence developed a serverless approach to exchange human genomic information between organisations. The framework was architected to provide instantaneous analysis of non-local data on demand, with zero downtime costs and minimal running costs.
We used Terraform to write the infrastructure, enabling rapid iteration and version control at the architecture level. In order to maintain governance over our infrastructure created in this way, we developed a custom Continuous Deployment service that built and securely maintained each project, providing visibility and security over the entire organisation’s cloud infrastructure.
How to Experiment Quickly
The ‘science’ in data science refers to the underlying philosophy that you don’t know what works for your business until you make changes and rigorously measure impact. Rapid experimentation is a fundamental characteristic of high functioning data science teams. They experiment with models, business processes, user interfaces, marketing strategies, and anything else they can get their hands on. In this talk I will discuss what data platform tooling and organizational designs support rapid experimentation in data science teams.
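To make "rigorously measure impact" slightly more concrete, here is a minimal sketch of comparing a metric between control and treatment groups; the data is synthetic and a plain two-sample t-test is chosen purely for illustration, not as the speaker's recommended method.

```python
# Toy impact measurement: did the treatment move the metric beyond what noise explains?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # e.g. session length, control group
treatment = rng.normal(loc=10.4, scale=2.0, size=500)  # e.g. session length, new feature

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"lift={treatment.mean() - control.mean():.2f}, p={p_value:.4f}")
```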
Techniques Used to Analyse the Affordability, Commutability and Demographics of Real Estate in School Catchment Areas
School catchments, otherwise known as priority placement areas or intake zones, are zones in which children are entitled to enrol in a public school. Recent media coverage has drawn attention to the increased demand for residential real estate within high-performing school catchments.
Bitcoin Ransomware Detection with Scalable Graph Machine Learning
Ransomware is a type of malware that has become a major threat, rising to 600 million attacks per year, and this cyber-crime is very often facilitated via cryptocurrency. While ransomware relies on pseudonymity to send and receive payments that are difficult to trace, the fact that all transactions on the bitcoin blockchain are written publicly presents an opportunity to develop an analytics pipeline to detect such activities.
Graph Machine Learning is a rapidly developing research area which combines entity attributes and network structure to improve machine learning outcomes. These techniques are becoming increasingly popular, often outperforming traditional approaches when the underlying data can be naturally represented as a graph.
This talk will highlight two main outcomes: 1) how a graph machine learning pipeline is formulated to detect bitcoin addresses that are suspected to be associated with ransomware, and 2) how this algorithm is scaled out to process over 1 billion transactions using Apache Spark.
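As a hedged sketch of the kind of Spark preprocessing step implied by outcome (2), here is an aggregation of raw transactions into per-address features that a graph model could then consume; the schema, paths and feature choices are hypothetical.

```python
# Illustrative PySpark step: roll raw transactions up into per-address features that can
# become node attributes in a transaction graph; (sender, receiver) pairs become edges.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("btc-address-features").getOrCreate()

# Hypothetical schema: sender, receiver, value (BTC), block_timestamp.
tx = spark.read.parquet("s3://example-bucket/bitcoin/transactions/")  # placeholder path

address_features = (
    tx.groupBy("receiver")
    .agg(
        F.count(F.lit(1)).alias("n_incoming"),
        F.sum("value").alias("total_received"),
        F.countDistinct("sender").alias("n_unique_senders"),
    )
    .withColumnRenamed("receiver", "address")
)

address_features.write.mode("overwrite").parquet(
    "s3://example-bucket/bitcoin/address_features/"  # placeholder path
)
```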
Lessons from Building a Data Platform for Smart Cities
We've built a data platform for smart cities. This has been deployed in over a dozen cities, and we've learned a lot in the process, about:
- why data ingestion from IoT networks can range from trivial to very painful, and how to cope;
- how to architect the system to easily handle many different 'data domains';
- getting the architecture to work well, including making the addition of new data sources as simple as we can;
- approaches to analytics and visualisations that have been useful;
- why end-user analytics and visualisations are critical;
- how user permissions for smart city applications can be different to more 'normal' applications.
- and lots more
In the talk, I'll walk through the lessons learned and show off examples of the system in action.
The goal is to use the platform as an exemplar of the design principles; this is not a sales pitch for the tool itself.
Practical Learning To Learn
Gradient descent continues to be our main workhorse for training neural networks. One recurring problem, though, is the large amount of data required. Meta learning frames the problem not as learning from a single large dataset, but as learning how to learn from multiple related smaller datasets. In this talk we'll first discuss some key concepts around gradient descent: fine-tuning, transfer learning, joint training and catastrophic forgetting, and compare them with how simple meta-learning techniques can make optimisation feasible for much smaller datasets.
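Of the concepts listed, fine-tuning is the easiest to show in a few lines. A hedged Keras sketch of the baseline the talk contrasts meta-learning against: reuse a pretrained backbone and train only a small head on a new, smaller dataset (model choice, layer sizes and class count are arbitrary assumptions).

```python
# Fine-tuning baseline: freeze a pretrained backbone, train only a new head on the small
# target dataset. Meta-learning instead learns an initialisation that adapts quickly
# across many such small datasets.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights=None  # weights="imagenet" in practice
)
base.trainable = False  # freeze: only the new head's weights will be updated

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 classes in the small target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# model.fit(small_x, small_y, epochs=5)  # the small, task-specific dataset
```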
Taming the Beast: Automated Testing for Complex Data Pipelines
Massive datasets. Complex data pipelines. Machine learning. When faced with such a beast, how do you test it effectively? When your test results are less "pass" and "fail", and more "sort of" and "not really", how do you automate testing?
Trish Khoo draws upon her experience in testing complex data systems to demonstrate proven strategies for testing in this field. Her experience working on ultra-large-scale systems at Google in Mountain View, California shaped her technical approach to testing which she applies in her work as a consultant today.
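One practical trick for results that are "sort of" right rather than exactly right is to assert on tolerances and distributional properties instead of exact values. A small, hypothetical pytest example (the pipeline function is a stand-in, not from the talk):

```python
# Hypothetical tests for a pipeline whose output is statistical rather than exact:
# assert properties and tolerances, not precise values.
import numpy as np
import pytest


def run_pipeline(raw_scores):
    """Stand-in for the real pipeline: normalises scores to [0, 1]."""
    scores = np.asarray(raw_scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min())


def test_output_is_bounded():
    out = run_pipeline([3, 7, 11, 42])
    assert out.min() >= 0.0 and out.max() <= 1.0


def test_mean_within_expected_band():
    out = run_pipeline(np.random.default_rng(0).normal(size=10_000))
    # "Sort of correct": for symmetric input the mean should sit near 0.5, within tolerance.
    assert out.mean() == pytest.approx(0.5, abs=0.1)
```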
Is Agile Data Science a thing now?
How come there’s no standard text on how to operate a Data Science team? At its current scale this is a young practice without a widely accepted mode of operation. Because so many practitioners are housed in technology shops, we tend to align our delivery cycles with developers… and hence with the Agile framework.
I will argue that if a data team fits within Agile it is probably not performing data science but operational analytics—a separate and venerable practice, and a requisite for data science. To ‘do’ science we need a fair bit of leeway, although not a complete lack of boundaries. It’s a tricky balance.
In this talk I will share my experience as a data scientist in a variety of circumstances: in foundational, service, and advisory roles. I will also bring some parallels from my past life in scientific research to discuss how I think data science should be performed at scale. And I will share my current Agile-ish process at Atlassian.
Building Rome Every Day - Scaling ML Model Building Infrastructure
"I want to reset my password". "I ordered the wrong size". "These are not the droids I was looking for". Every day, a support agent fields thousands of these queries. Multiply that by the thousands of agents a company might have, and the sheer vastness of data being generated becomes hard to imagine. How can we make sense of it all? It seems a formidable task, but we have a formidable weapon in our arsenal—we have machine learning.
By combining deep learning, natural language processing and clustering techniques, we built a machine learning model that can take 100,000 tickets and efficiently cluster and summarise them into digestible topics. But that's only part of the challenge; we also had to scale it to build for 30,000 customers, in production, every day.
In this talk I'll share the story of Content Cues - Zendesk's latest Machine Learning product. It's the story of how we leveraged the power of AWS Batch to scale a model building platform. Of how we tackled challenges such as measuring how well an unsupervised model performs when it's not even clear what "well" means. Of how our team combined our pool of skills across data engineering, data science and product management to deliver a pipeline capable of building a thousand models for the price of a cup of coffee.
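As a heavily simplified, hypothetical sketch of the clustering-and-summarising idea (scikit-learn in place of the deep-learning components, and four toy tickets instead of 100,000 real ones):

```python
# Toy version of "cluster support tickets into digestible topics": vectorise ticket text,
# cluster it, then surface the most characteristic terms per cluster as a crude summary.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tickets = [
    "I want to reset my password",
    "password reset link not working",
    "I ordered the wrong size",
    "please change my order to a larger size",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tickets)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for cluster in range(2):
    # The cluster centroid's heaviest terms act as a rough topic label.
    top_terms = kmeans.cluster_centers_[cluster].argsort()[::-1][:3]
    print(cluster, [terms[i] for i in top_terms])
```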
The Three-Rs of Data-Science - Repeatability, Reproducibility, and Replicability
Adoption of data science in industry has been phenomenal in the last five years. The primary focus of these adoptions has been combining the three dimensions of machine learning, i.e. the 'data', the 'model architecture' and the 'parameters', to predict an outcome. A slight change in any of these dimensions can skew the predicted outcomes. So how do we build trust in our models? And how do we manage the variance across multiple models trained on varied sets of data, model architectures and parameters? Why might the three Rs, i.e. "Repeatability, Reproducibility, and Replicability", be relevant to industry applications of data science?
This talk has the following goals:
- Justify (with demonstrations) why "Repeatability, Reproducibility, and Replicability" are important in data science, even when the application is geared towards industry rather than experimental research.
- Discuss in detail the requirements around ensuring “Repeatability, Reproducibility, and Replicability” in data-science.
- Discuss ways to observe repeatability, reproducibility, and replicability with provenance and automated model management.
- Present various approaches and available tooling for provenance and model management, and compare and contrast them.
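A small, hedged illustration of the kind of provenance the list above argues for: pin the sources of variance (seeds, parameters, data version) and record them alongside the run. MLflow is used here purely as one example of such tooling; the names and values are hypothetical.

```python
# Illustrative provenance logging: fix the seeds, then record data version, parameters and
# metrics next to the model so a run can be repeated, reproduced and replicated.
import random

import mlflow
import numpy as np

PARAMS = {"n_estimators": 100, "max_depth": 6, "seed": 1234}
DATA_VERSION = "customers-2019-04-01"  # hypothetical dataset snapshot identifier

random.seed(PARAMS["seed"])
np.random.seed(PARAMS["seed"])

with mlflow.start_run():
    mlflow.log_params(PARAMS)
    mlflow.set_tag("data_version", DATA_VERSION)

    # ... train the model here, deterministically given the seeds above ...
    validation_auc = 0.87  # placeholder metric from the training run

    mlflow.log_metric("validation_auc", validation_auc)
```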
The Sceptical Data Scientist
More and more decisions are data driven now - and that’s awesome! Much better than ideologically driven, or personality driven, or “whatever mood management’s in today” driven. But it does mean we want to be confident of our analyses. And there’s a tendency to have deep faith in data science. “Look! I did a calculation! It must be true!” Numbers don’t lie. And maths is reliable.
But so much depends on the questions we ask, how we ask them, and how we test those results.
So how do we create a generation of sceptical data scientists? Whose first approach to a result is to challenge it. To try to disprove it.
We need to give them confidence in their skills, but teach them to doubt their own work.
Predictive Modelling for Online Advertising
No longer is the ‘spray and pray’ methodology for finding customers working. No more is spamming people with numerous, unsolicited emails effective. Never again will 'stalking' with clumsy banners be cutting edge. Today it’s all about a strategy based on finding the right prospects, using the right channels, at the right time AND making them feel like they found you – not the other way around.
I will walk attendees through Marketing Science in the era of big data. We'll begin by defining an 'ideal/value customer' and - spoiler alert - it is not set in stone: smart tracking elements and AI models allow your company to create its own portrait of the perfect customer and adjust it as you learn more, until you know every little thing about them - inside and out.
With this knowledge, half the journey is complete. We then capture those customers - at the right time and in the right place. How? We work to understand their behaviour, we capture their signals, we leverage advertising platform optimisation models, and we let the magic of data science do its thing - and shine. Every competitive advertising platform today incorporates optimisation models. I have extensive experience with some of them (Facebook, Google, Instagram) and I want to share what I have learnt and how you can take advantage of the gigantic data science effort put into those smart machines. To illustrate all of this we'll go through various approaches, and conclude with the model I built: the one ultimately probed and used with data scientists at one of the biggest online advertising platforms in the world.
The Magic of Unsupervised Learning: Teaching an AI to Understand Our World
There is no doubt that AI is the new kid on the block, with applications ranging from something as simple as classifying hot dog vs not hot dog or recognising flower species, to the science-fiction realm of generating fake videos.
This talk will cover the problem with supervised learning, which is what most current AI technologies are based on, and the promising trend towards the future of AI with unsupervised learning. As a use case, we will cover how image-generation techniques such as the variational autoencoder extract knowledge from images in an unsupervised manner.
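The piece of the variational autoencoder that makes unsupervised learning from images tractable is the reparameterisation trick plus a KL penalty on the latent code. A tiny numpy sketch of just that piece (the encoder outputs below are stand-in numbers, not from the talk):

```python
# The core of a VAE's unsupervised objective, in isolation: sample the latent code with the
# reparameterisation trick, then penalise divergence from a standard normal prior.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for what an encoder network would output for one image.
mu = np.array([0.3, -1.2])
log_var = np.array([-0.5, 0.1])

# Reparameterisation trick: z = mu + sigma * eps keeps sampling differentiable.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL(q(z|x) || N(0, I)) for a diagonal Gaussian, the regulariser added to reconstruction loss.
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z, kl)
```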
Engineering an Ethical AI System
To improve people's well-being, we must improve the decisions made about them. Consequential decisions are increasingly being made by AI, like selecting who to recruit, who receives a home loan or credit card, and how much someone pays for goods or services. AI systems have the potential to make these decisions more accurately and at a far greater scale than humans. However, if AI decision-making is improperly designed it runs the risk of doing unintentional harm, especially to already disadvantaged members of society. Only by building AI systems that accurately estimate the real impact of possible outcomes on a variety of ethically relevant measures, rather than just accuracy or profit, can we ensure this powerful technology improves the lives of everyone.
This talk focuses on the anatomy of these ethically-aware decision-making systems, and some design principles to help the data scientists, engineers and decision-makers collaborating to build them. We motivate the discussion with a high-level simulation of the "selection" problem where individuals are targeted, based on relevant features, for an opportunity or an intervention. We detail the necessary considerations and the potential pitfalls when engineering an ethically-aware automated solution, from initial conception through to causal analysis, deployment and on-going monitoring.
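As a minimal, hypothetical example of one "ethically relevant measure" for the selection problem described: compare how often the automated decision favours each group, rather than looking only at overall accuracy or profit.

```python
# Toy check of one ethically relevant measure: selection rates per group
# (a demographic-parity-style gap), computed alongside -- not instead of -- accuracy.
import pandas as pd

decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B"],
    "selected": [1,   0,   1,   0,   0,   1,   0],
})

rates = decisions.groupby("group")["selected"].mean()
print(rates)
print("parity gap:", rates.max() - rates.min())  # large gaps warrant investigation
```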
CLASSIEfier: Using Machine Learning to Paint a Picture of Social Sector Trends
Tracking the flow of funding and other support to social sector organisations in Australia has historically been difficult because of inconsistencies in categorisation, or the absence of categorisation entirely. Our Community (a Melbourne-based social enterprise) developed CLASSIE to serve as a universal classification system for Australian social sector initiatives and entities. We are now developing a machine learning algorithm to reduce or remove the need for manual (human) classification. Once released, CLASSIEfier will allow us to classify historical records on behalf of grantmakers and other social sector supporters, and reduce the need for human intervention in the classification of current and future records. In the long term, it will allow us to answer fundamental questions such as: Where is the money going? Are we helping the areas in most need?
I will present the project scope and development of CLASSIEfier, highlighting my experiences using Machine Learning in the social sector. I will also list the difficulties of working with text and sensitive data, and the methodologies to identify and mitigate algorithmic biases.
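A hedged sketch of the general shape of a text classifier like the one described (the records and categories below are invented; the real system is trained on CLASSIE's taxonomy and will differ):

```python
# Minimal text-classification pipeline of the kind used to map grant records to a
# taxonomy: TF-IDF features plus a linear classifier (toy data, invented labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

records = [
    "grant for youth mental health counselling services",
    "funding for wetland habitat restoration project",
    "support program for homeless young people",
    "community tree planting and bushland regeneration",
]
labels = ["health", "environment", "health", "environment"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(records, labels)

print(classifier.predict(["emergency housing grant for at-risk youth"]))
```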
Image Classification in a Noisy Fraudulent World - A Journey of Computational and Statistical Performance
Formbay's fraud detection system relies on classification of photographic evidence to verify solar installations. Over the last 10 years, Formbay has amassed over 10 million labelled images of solar installations. Image classification over Formbay's dataset sounds easy. Lots of data, apply neural networks and profit from automation! However with such a large dataset, there is room for lots of noise. Noise such as mislabelled images, overlapping classes, corrupted image data, imbalanced classes, rotational variance and more.
This presentation demonstrates how we built our image processing pipeline to tackle these noise issues while addressing class/concept drift. First we'll examine the data situation at Formbay when we started, and our initial model. Then we'll address each statistical and computational problem we met and how we decided to tackle it, slowly evolving our data pipeline over time.
This presentation focuses on the complexities of engineering production-ready ML systems, which involves balancing statistical ("how accurate") and computational ("how fast") performance.
Artificial Intelligence and Augmented Reality: A Match Made in Heaven
We begin with a story: the story of how I went from a painter to an artist working in the latest technologies. I will discuss how I fell into this line of research in my masters and how I am hooked on it now. I will provide practical knowledge about developing AR applications, using machine learning, and marrying the two together. I will explore WebGL and new mobile versions of machine learning frameworks, and how they relate to modes of mixed reality.
Search at Scale: Using Machine Learning to Automate Content Metadata
For media organisations, reach is everything. Getting eyeballs and ears in front of content is their raison d'être.
Search plays a critical role in connecting audiences with t-1 content (yesterday's news, last week's podcast). However, with audience expectations conditioned by Google and others, it is challenging to deliver robust, scalable search that people actually want to use.
The relevance of your results is everything, and to produce relevant results you need good metadata for every object in your search index. With hundreds of thousands of content objects and an audience of millions, the ABC has unique challenges in this regard.
This talk will explore the ABC's use of Machine Learning (ML) to automatically generate meaningful metadata for pieces of content (audio/video/text), including AWS MLaaS for full transcripts of audio podcasts and a platform developed in-house for NLP tasks such as entity recognition and automated document summarisation, and image-related tasks such as segmentation and tagging.
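As a hedged illustration of the entity-recognition task mentioned (spaCy is used here as a generic example library; the ABC's in-house NLP platform may work quite differently), extracted entities become searchable metadata fields:

```python
# Illustrative named-entity extraction for content metadata (requires the small English
# model: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Prime Minister visited Parliament House in Canberra on Tuesday.")

# Entities like people, organisations and places become metadata for the search index.
for ent in doc.ents:
    print(ent.text, ent.label_)
```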
Bootstrapping the Right Way
Bootstrap sampling is being touted as a simple technique that any hacker can easily employ to quantify the uncertainty of statistical estimates. However, despite its apparent simplicity, there are many ways to misuse bootstrapping and thereby draw wrong conclusions about your data and the world. This talk gives a brief overview of bootstrap sampling and discusses ways to avoid common pitfalls when bootstrapping your data.
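For concreteness, here is a minimal percentile-bootstrap confidence interval in numpy; the data is synthetic, and this shows only the basic mechanics rather than the pitfalls (and fixes) the talk covers.

```python
# Basic percentile bootstrap: resample the data with replacement many times and read a
# confidence interval for the statistic straight off the resampled distribution.
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=200)  # toy skewed sample

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={data.mean():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```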
Modern Time-Series Methods
Time series, as a field of study, has largely focused on statistical methods that work well under strict assumptions: specifically, when there is sufficient history, little metadata, and a well-formed autocorrelation structure. However, as an applied practitioner I know that most real-world time series problems violate these assumptions. This leaves us with an opportunity to use more modern time series methods, based on machine learning, to overcome these deficiencies.
This session is designed to briefly speak about the unique properties of time-series, how statistical methods work and how and why machine learning (and deep learning) methods can be used to improve accuracy.
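A hedged sketch of the "machine learning on time series" idea: turn the series into a supervised table of lagged features and fit a standard regressor. The data is synthetic, the lag window is arbitrary, and real setups need more care with feature design and time-aware validation.

```python
# Reframing a time series as supervised learning: lagged values become features,
# the next value becomes the target, and any tabular regressor can be used.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
y = pd.Series(np.sin(np.arange(300) / 10.0) + rng.normal(scale=0.1, size=300))

frame = pd.DataFrame({f"lag_{k}": y.shift(k) for k in range(1, 8)})
frame["target"] = y
frame = frame.dropna()

# Time-ordered split: never train on the future.
train, test = frame.iloc[:250], frame.iloc[250:]
model = GradientBoostingRegressor().fit(train.drop(columns="target"), train["target"])

print("test R^2:", model.score(test.drop(columns="target"), test["target"]))
```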
Emerging Best Practices for Machine Learning Engineering
In this talk, I'll walk through some of the emerging best practices for Machine Learning engineering and contrast them with those of traditional software development. I will be covering topics including Product Management; Research and Development; Deployment; QA and Lifecycle Management of Machine Learning projects.
My 5 Biggest Database Blunders
We've all made mistakes. With databases, mistakes are particularly costly because they lead to performance bottlenecks, deployment disasters, lost data and intractable technical debt. Join us and learn from my mistakes. You'll hear harrowing tales of schema design blunders that were never rectified, and where recursive SQL is a path to a dark place. You'll learn why databases make lousy queues, and what to use instead. You'll learn the perils of table locking and botched migrations that can cause downtime and data loss. You'll laugh at my futile attempt to tune queries after choosing the wrong database, and why certain workloads work well on some databases, but not on others. Whether you're new to database engineering, or have made all the same errors, hearing about my missteps will help you avoid mistakes in your own data engineering challenges.
Entity Resolution at Scale
Real world data is rarely clean: there are often corrupted and duplicate records, and even corrupted records that are duplicates! One step in data cleaning is entity resolution: connecting all of the duplicate records into the single underlying entity that they represent.
This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of Apache Spark, and scaling it to process billions of records.
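A much-simplified sketch of one standard building block, blocking, which keeps pairwise comparison tractable at billions of records by only comparing candidates that share a cheap key; the columns, normalisation and blocking key here are illustrative, not the approach from the talk.

```python
# Illustrative blocking step for entity resolution in Spark: normalise names, generate a
# cheap blocking key, and only compare record pairs that share it (instead of all pairs).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("entity-resolution-blocking").getOrCreate()

records = spark.createDataFrame(
    [(1, "Jane  Citizen", "2001"), (2, "jane citizen", "2001"), (3, "John Smith", "3000")],
    ["record_id", "name", "postcode"],
)

cleaned = records.withColumn("name_clean", F.lower(F.regexp_replace("name", r"\s+", " ")))
# Blocking key: first three letters of the normalised name plus postcode.
blocked = cleaned.withColumn("block", F.concat(F.substring("name_clean", 1, 3), F.col("postcode")))

# Candidate pairs only within a block; detailed similarity scoring would follow.
pairs = (
    blocked.alias("a")
    .join(blocked.alias("b"), on="block")
    .where(F.col("a.record_id") < F.col("b.record_id"))
    .select("a.record_id", "b.record_id", "block")
)
pairs.show()
```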
Bringing Continuous Delivery to Big Data Applications
In this presentation I will talk about our experience at SEEK implementing Continuous Integration & Delivery (CI/CD) in two of our Big Data applications.
I will talk about the Data Lake project and its use of micro-services to break down data ingestion and validation tasks, and how it enables us to deploy changes to production more frequently. Data enters SEEK’s data lake through a variety of sources, including AWS S3, Kinesis and SNS. We use a number of loosely coupled serverless microservices and Spark jobs to implement a multi-layer data ingestion and validation pipeline. Using the microservices architecture enables us to develop, test and deploy the components of the pipeline independently and while the pipeline is operating.
We use Test-Driven Development to define the behaviour of micro-services and verify that they transform the data correctly. Our deployment pipeline is triggered on each code check-in and deploys a component once its tests pass. The data processing pipeline is idempotent so if there is a bug or integration problem in a component we can fix it by replaying the affected data batches through the component.
In the last part of the talk, I’ll dive deeper into some of the challenges we solved to implement a CI/CD pipeline for our Spark applications written in Scala.
How Much Data do you _really_ need for Deep Learning?
A common assumption is that we need significant amounts of data in order to do deep learning. Many companies wanting to adopt AI find themselves stuck in the “data gathering” phase and as a result delaying the use of AI to gain competitive advantage in their business. But how much data is enough? Can we get by with less?
In this talk we will explore the impact on our results when we use different amounts of data to train a classification model. It is actually possible to get by with much less data than we might expect. We will discuss why this might be so, in which particular areas this applies, and how we can use these ideas to improve how we train, deploy and engage end-users in our models.
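One way to answer "how much data is enough" empirically is to train on increasing subsets and watch the validation score flatten. A hedged scikit-learn sketch of that experiment (a small classical model on a toy dataset stands in for an image dataset and a deep network):

```python
# Learning-curve experiment: how does validation accuracy change as training data grows?
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Where the validation curve flattens is roughly "enough data" for this model and task.
for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:4d} training examples -> validation accuracy {score:.3f}")
```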
Noon van der Silk
Senior Software Engineer, Fix Planet Club
I'm a long-time programmer who has recently become extremely passionate and interested in the climate emergency. I've been working as a Haskell programmer for the last few years, after a bit of a diverse (programming) career in different fields from creative AI, teaching, and quantum computing. I spend most of my time reading books, and posting some small reviews - https://betweenbooks.com.au/ - and otherwise am really interested in community building, connecting people, kindness, and understanding how to build a sustainable business.
Talks and speakers:
- From Sparse Data-sets to Graphs: When Explicit Relationships Bridge the Gaps - Featuring Enrique Bustamante (ai-&-ml)
- Data Driven Diversity - Featuring Larene Le Gassick (practice)
- CLASSIEfier: Using Machine Learning to Paint a Picture of Social Sector Trends - Featuring Paola Oliva-Altamirano (ai-&-ml)
- From Zero to Tensorflow: Building an Analytics Dept. - Featuring Antoine Desmet (practice)
- Bitcoin Ransomware Detection with Scalable Graph Machine Learning - Featuring Kevin Jung (ai-&-ml)
- Practical Geometric Deep Learning in Python - Featuring Pantelis Elinas (ai-&-ml)
- Emerging Best Practices for Machine Learning Engineering - Featuring Lex Toumbourou (practice)
- How Much Data do you _really_ need for Deep Learning? - Featuring Noon van der Silk (ai-&-ml)
- Game Engines and Machine Learning: Training a Self-Driving Car Without a Car? - Featuring Paris Buttfield-Addison (ai-&-ml)
- Making The Black Box Transparent: Lessons in Opacity - Featuring Xuanyi Chew (ai-&-ml)
- Lessons from Building a Data Platform for Smart Cities - Featuring Simon Kaplan
- Bringing Continuous Delivery to Big Data Applications - Featuring Reza Yousefzadeh (engineering)
- Building a Scalable Data Science Pipeline at REA - Featuring Justin Hamman (engineering)
- Modern Time-Series Methods - Featuring Kale Temple (ai-&-ml)
- Transparent Government: The Stories we can Tell with Data - Featuring Mel Flanagan (practice)
- The Sceptical Data Scientist - Featuring Linda McIver (practice)
- Working with Large Numbers of Non-Trivial ETL Pipelines - Featuring Jessica Flanagan (engineering)
- Look at GDPR from a Data Engineering Perspective - Featuring Daniel Deng (practice)
- Taming the Beast: Automated Testing for Complex Data Pipelines - Featuring Trish Khoo (practice)
- Artificial Intelligence and Augmented Reality: A Match Made in Heaven - Featuring J. Rosenbaum (ai-&-ml)
- Predictive Modelling for Online Advertising - Featuring Diana Mozo-Anderson (ai-&-ml)
- New Technologies to the Rescue of Epileptics - Featuring Robin Champseix (engineering)
- Search at Scale: Using Machine Learning to Automate Content Metadata - Featuring Gareth Seneque (ai-&-ml)
- My 5 Biggest Database Blunders - Featuring Brad Urani (engineering)
- Techniques Used to Analyse the Affordability, Commutability and Demographics of Real Estate in School Catchment Areas - Featuring Anthony I Joseph (ai-&-ml)
- The Magic of Unsupervised Learning: Teaching an AI to Understand Our World - Featuring Agustinus Nalwan (ai-&-ml)
- Scaling Analytics as your Company Grows - Featuring Itzik Feldman (practice)
- Which Plane Woke Snowy the Cat? - Featuring Simon Aubury (engineering)
- Bootstrapping the Right Way - Featuring Yanir Seroussi (ai-&-ml)
- Auto feature engineering - Rapid feature harvesting using DFS and data engineering techniques - Featuring Ananth Gundabattula (ai-&-ml)
- Is Agile Data Science a thing now? - Featuring Hercules Konstantopoulos (practice)
- Image Classification in a Noisy Fraudulent World - A Journey of Computational and Statistical Performance - Featuring Roger Qiu (ai-&-ml)
- The Three-Rs of Data-Science - Repeatability, Reproducibility, and Replicability - Featuring Suneeta Mall (practice)
- Entity Resolution at Scale - Featuring Huon Wilson (ai-&-ml)
- Engineering an Ethical AI System - Featuring Simon T. O'Callaghan (practice)
- Sketch algorithms - Featuring Simon Belak (ai-&-ml)
- Practical Learning To Learn - Featuring Mat Kelcey (ai-&-ml)
- Custom Continuous Deployment to Uncover the Secrets in the Genome - Featuring Brendan Hosking (engineering)
- Building Rome Every Day - Scaling ML Model Building Infrastructure - Featuring Dana Ma (engineering)
- How to Experiment Quickly - Featuring Juliet Hougland (practice)
Other YOW! Data conferences:
- YOW! Data 2022 - two days, online conference
- YOW! Data 2021 - two days, online conference
- YOW! Data 2020 - three days, online conference
- YOW! Data 2018 - two days in Sydney
- YOW! Data 2017 - two days in Sydney
- YOW! Data 2016 - two days in Sydney