YOW! Data 2017


Monday, 18th - Tuesday, 19th September in Sydney

22 experts spoke.

YOW! Data is a two day conference that provides in-depth coverage of current and emerging technologies in the areas of Big Data, Analytics and Machine Learning.

The number of data generators (drones, cars, devices, home appliances, gaming consoles, online services, medical and wearables) is increasing rapidly and with that comes increased demand for smarter solutions for gathering, handling and analysing data. YOW! Data 2017 will bring together leading industry practitioners and applied researchers working on data driven technology and applications.



Processing Data of Any Size with Apache Beam

Rewriting code as you scale is a terrible waste of time. You have perfectly working code, but it doesn’t scale. You really need code that works at any size, whether that’s a megabyte or a terabyte. Beam allows you to learn a single API and process data as it grows. You don’t have to rewrite at every step.

In this session, we will talk about Beam and its API. We’ll see how Beam executes on big data and small data alike. We’ll touch on some of the advanced features that make Beam an interesting choice.

Jesse Anderson

Data Engineer, Creative Engineer and Managing Director
Big Data Institute

From Little Things, Big Data Grow - IoT at Scale

The Internet of Things (IoT) is really about the ubiquity of data: the possibility of humans extending their awareness and reach globally, or further.
IoT frees us from the tedium of physically monitoring or maintaining remote systems, but to be effective we must be able to rely on data being not only accessible but comprehensible.

This presentation covers three main areas of an IoT big data strategy:

  • The Air Gap - options (from obvious to inventive) for connecting wireless devices to the internet
  • Tributaries - designing a scalable architecture for amalgamating IoT data flows into your data lake. Covers recommended API and message-bus architectures.
  • Management and visualisation - how to characterise and address IoT devices in ways that scale to continental populations. I will show some examples of large scale installations to which I've contributed and how to cope with information overload.

Christopher Biggs

Accelerando Consulting

Realizing the Promise of Portable Data Processing with Apache Beam

The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".

This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.

Davor Bonaci

Sr. Software Engineer
Google Inc.

What is the Most Common Street Name in Australia?

Finding the most common street name in Australia may sound relatively simple, but it quickly leads to other questions. What is a street name? Do The Avenue, The Grand Parade and The Serpentine all share the same name? And what is a street? Is the M5 Motorway a street? What about M5 Motorway Offramp?

This talk will answer these questions using OpenStreetMap and Python. In particular, it will cover reading in and manipulating OpenStreetMap data using geopandas, exploring the structure of OpenStreetMap, creating models for parsing street names, and, finally, finding the most common street name in Australia.
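As a toy illustration of the name-normalisation question the talk raises (the suffix list and sample names here are invented; the real analysis reads OpenStreetMap data with geopandas), one can split each name into a base plus a trailing street type, treating names like "The Avenue" as having no separate suffix:

```python
from collections import Counter

# illustrative, far-from-complete list of street-type suffixes
SUFFIXES = {"street", "road", "avenue", "parade", "motorway", "lane"}

def base_name(full_name):
    """Drop a trailing street-type suffix; 'The Avenue'-style names
    have no separate suffix and are kept whole."""
    words = full_name.lower().split()
    if len(words) > 1 and words[-1] in SUFFIXES and words[0] != "the":
        words = words[:-1]
    return " ".join(words)

names = ["Victoria Street", "Victoria Road", "The Avenue", "Station Street"]
print(Counter(base_name(n) for n in names).most_common(1))  # [('victoria', 2)]
```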

Rachel Bunder

Data Scientist
Solar Analytics

Energy Monitoring with Self-Taught Deep Networks

Energy disaggregation allows detection of individual electrical appliances from aggregated energy usage time series data. Insights into individual appliances are very useful for energy-related applications such as energy monitoring and demand response. Although it is very easy to collect large volumes of energy usage data, inspecting and labelling time series is tedious and expensive.

In this talk, I will present a solution that explores this unlabelled time series data using two deep networks. The first, an RNN-based deep network, extracts good representations of energy time series windows without much human intervention. By transferring these representations from unlabelled to labelled data, the second deep network learns a model of the targeted electrical appliance.
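A sketch of the windowing step that feeds such a network, assuming fixed-size, strided windows over the aggregate load signal (the sizes and the stand-in signal are illustrative, not from the talk):

```python
import numpy as np

def windows(signal, size, stride):
    """Slice a 1-D load signal into fixed-length, strided windows
    suitable as inputs to a representation-learning network."""
    return np.stack([signal[i:i + size]
                     for i in range(0, len(signal) - size + 1, stride)])

load = np.arange(10.0)          # stand-in for a mains power trace
w = windows(load, size=4, stride=2)
print(w.shape)  # (4, 4)
```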

Sau Sheong Chang

Managing Director
SP Digital

A Geometric Approach towards Data Analysis and Visualisation

Beginning with the work of Bertin, visualisation scholars have attempted to systematically study and deconstruct visualisations in order to gain insights about their fundamental structure. More recently, the idea of deconstructing visualizations into fine-grained, modular units of composition also lies at the heart of graphics grammars. These theories provide the foundation for visualization frameworks and interfaces developed as part of ongoing research, as well as state-of-the-art commercial software, such as Tableau. In a similar vein, scholars like Tufte have long advocated to forego embellishments and decorations in favor of abstract and minimalist representations. They argue that such representations facilitate data analysis by communicating only essential information and minimizing distraction.

This presentation continues along such lines of thought, proposing that this pursuit naturally leads to a geometric approach towards data analysis and visualisation. Looking at data from a sufficiently high level of abstraction, one inevitably returns to fundamental mathematical concepts. As one of the oldest branches of mathematics, geometry offers a vast amount of knowledge that can be applied to the formal study of visualisations.

“Visualization is a method of computing. It transforms the symbolic into the geometric.” (McCormick et al., 1987)

In other words, geometry is the mathematical link between abstract information and graphic representation. In order to graphically represent information, we assign to it a geometric form. In this presentation we will explore the nature of these mappings from symbolic to geometric representations. This geometric approach provides an alternative perspective towards analysing data. This perspective is inherently equipped with high-level abstractions and invites generalisation. It enables the study of abstract geometric objects independent of a concrete presentation medium. Consequently, it allows us to interpret data directly through geometric primitives and transformations.
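As a minimal illustration of such a mapping (the record fields and point properties are hypothetical), each symbolic record can be assigned a geometric primitive whose spatial and visual properties carry the data:

```python
# Symbolic-to-geometric mapping in its simplest form: a record (a tuple
# of values) becomes a point, with fields mapped to spatial properties.
records = [("A", 3), ("B", 7), ("C", 5)]

def to_point(record, index):
    label, value = record
    return {"x": index, "y": value, "shape": "circle", "label": label}

points = [to_point(r, i) for i, r in enumerate(records)]
print(points[1])  # {'x': 1, 'y': 7, 'shape': 'circle', 'label': 'B'}
```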

The presentation illustrates the geometric approach using diverse examples and illustrations. In turn, we discuss the opportunities and challenges that arise from this perspective. For instance, a key benefit of this approach is that it allows us to consider seemingly disparate visualisation types in a unified framework. By systematically enumerating the design space of geometric representations, it is possible to trivially apply extensions and modifications, resulting in great expressiveness. The approach naturally extends to visualisation techniques for complex, multidimensional, multivariate data sets. However, the effectiveness of the resulting representations and the cognitive challenges of interpretation require careful consideration.

Daniel Filonik

Postdoctoral Fellow
EPICentre, UNSW Art and Design

Low Latency Polyglot Model Scoring using Apache Apex

Data science is fast becoming a complementary approach and process for solving business challenges. The explosion of frameworks to help data scientists build models bears testimony to this. However, when a model needs to be turned into a production version in very low-latency, enterprise-grade environments, there are very few choices, each with its own strengths and weaknesses. Adding to this is the current disconnect between the data scientist's world, which is all about modelling, and the engineer's world, which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both worlds, helping enterprises drastically cut the cost of deploying models to production environments.

The talk will present Apache Apex as a framework that enables engineers and data scientists to build low-latency, enterprise-grade applications. We will cover the foundations of Apex that contribute to the low-latency processing capabilities of the platform, then discuss the aspects that qualify it as an enterprise-grade platform. Finally, we will show how models developed in Java, R and Python can co-exist in the same scoring application, enabling a truly polyglot framework.

Ananth Gundabattula

Sr. Architect
Commonwealth Bank of Australia

Apache Spark for Machine Learning on Large Data Sets

Apache Spark is a general purpose framework for distributed data processing. With MLlib, Spark's machine learning library, fitting a model to a huge data set becomes very easy. Similarly, Spark's general purpose functionality enables applying a model across a large collection of observations. We'll walk through fitting a model to a big data set using MLlib and applying a trained scikit-learn model to a large data set.
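The scoring half of this rests on a simple pattern: deserialise the trained scikit-learn model once per partition, then score each row. A plain-Python sketch of that pattern (in Spark the generator body would be handed to `mapPartitions`; the toy model here is a stand-in for an unpickled estimator):

```python
# Scoring pattern in miniature: load the model once per partition,
# not once per row, then stream (row, prediction) pairs back out.
def score_partition(rows, load_model):
    model = load_model()          # once per partition
    for row in rows:
        yield row, model(row)

# stand-in "model": in practice this would unpickle a scikit-learn estimator
toy_model = lambda: (lambda x: x > 0)
print(list(score_partition([-1, 2], toy_model)))  # [(-1, False), (2, True)]
```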

Juliet Hougland

Data Vagabond
Bagged & Boosted

Writing Better R Code

Data scientists, analysts, and statisticians are passionate about the data, models, and insights, but the code used to produce the results is, in many cases, left behind. We have a very good understanding of our code base while we are working on a project, but most of the time we do not write the code for the "future me".

In this talk, I describe and explain common coding pitfalls in R and then introduce functional programming using functions from base R, purrr (part of the tidyverse) and pipes as a preferred solution for creating robust and reusable R code. Along the way, I briefly touch on controversial claims such as "loops are bad" and "pipes are the best".

Ondrej Ivanič

Sr. Data Scientist
Commonwealth Bank of Australia

Image Recognition for Non-Experts: From Google Cloud Vision to TensorFlow

Displaying an inappropriate ad on a website can be a major headache for an Ad network. Showing ads for a site’s major competitor, or ads in a category at odds with the site’s brand, for example, can cause embarrassment and lost revenue. With the selection of ads being largely algorithmic it can be hard to set up rules to make sure this doesn’t happen. You also don’t want your first awareness of the problem being a call from an angry CEO.

This talk shows how we built a system that uses image recognition to detect Ad Breaches. Our first version makes use of Google’s Cloud Vision API. The Cloud Vision API is a pre-trained service that recognises many categories of objects from images, along with some text recognition. I’ll discuss how to use the Cloud Vision API in your applications, what it is good at, and what it is not.

I’ll then look at using transfer learning to improve our system’s ability to recognise Ad Breaches. I will look at how we can use the popular TensorFlow library to build our own image recognition model. TensorFlow comes with several pre-trained models for image recognition; using these, I will show you how to build your own specialised image recognition models in a fraction of the time, and with a fraction of the input data, by re-using existing pre-trained layers from the best models out there. I’ll investigate whether we can train a model to detect potential ad breaches from a small set of examples.
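Whichever vision service produces the labels, the breach decision itself is small enough to sketch. A hypothetical version (the blocklist, threshold and labels are invented for illustration; in a real pipeline the labels would come from the Cloud Vision label-detection response):

```python
# Decision step after image labelling: flag an ad whose labels hit a
# category blocklist with sufficient confidence.
BLOCKLIST = {"alcohol", "gambling"}  # illustrative breach categories

def is_breach(labels, threshold=0.7):
    """labels: iterable of (description, confidence score) pairs."""
    return any(desc in BLOCKLIST and score >= threshold
               for desc, score in labels)

print(is_breach([("car", 0.95), ("gambling", 0.81)]))  # True
```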

Gareth Jones

Shine Solutions

Batch as a Special Case of Streaming

In this talk I will share my team's gruelling journey in attempting to migrate a batch-like system onto a streaming framework.

Walking through the various solutions that we tested using Flink, I'll discuss each one's performance characteristics and bring to light misconceptions in their designs.

Roman Kovalik

Big Data Engineer

Cloud Data Pipelines for Genomics from a Bioinformatician and a Developer

Dr. Bauer and her team have been working to build genome-scale data pipelines that address the computational challenges and limits present in today’s cancer genomics (bioinformatics) data workflows.

Dr. Bauer and her team have built solutions which use modern architectures, such as serverless (AWS Lambda) and customised machine learning on Apache Spark. AWS Community Hero and cloud architect Lynn Langit is also collaborating with the CSIRO team to push solutions at the cutting edge of bioinformatics research which best utilise advances in cloud technologies.

In this demo-filled session Lynn and Denis will discuss and demonstrate some of the latest cloud data pipeline work they have been building together for the bioinformatics community.

Lynn Langit

Lynn Langit is an independent Cloud Architect and Developer. She works on genomic-scale cloud pipelines. Also Lynn is an author for LinkedIn Learning, having created 25 courses on cloud topics. For her technical education work, she has been awarded as an AWS Community Hero, Google Cloud Developer Expert and Microsoft Regional Director.

Beyond Relational: Applying Big Data Cloud Pipeline Patterns

In this full-day workshop, you will learn applied big data solution patterns, most often, but not always, using the public cloud. We’ll cover Amazon Web Services and Google Cloud Platform, and work in small groups to design data pipeline architectures for common scenarios.


Lynn Langit


Scalable IoT with Apache Cassandra

IoT and event-based systems can process huge volumes of data, which typically needs to be stored and read in near real time for event processing, in addition to being read in bulk to feed data-hungry learning systems. Apache Cassandra provides a high performance, scalable, and fault tolerant database platform with excellent support for the time series data models typically seen in IoT systems. Its millisecond (or better) latency can support systems that react to events in real time, while scalable bulk reads via batch processing systems such as Apache Hadoop and Apache Spark can support learning applications. These features, and more, make Cassandra an ideal persistence platform for modern data intensive, event driven systems.

In this talk Aaron Morton, CEO at The Last Pickle, will discuss lessons learned using Cassandra for IoT systems. He will explain how Cassandra fits into the modern technology landscape and dive into data modelling for common IoT use cases, capacity planning for huge data loads, tuning for high performance, and integration with other data driven systems. Whether starting a new project or deep in the weeds on an existing system, attendees will leave with an understanding of how Apache Cassandra can help build robust infrastructure for IoT systems.
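As a sketch of the kind of time series modelling involved, a common Cassandra pattern partitions readings by sensor and day so that no partition grows without bound and recent readings for one sensor are a single fast slice (the schema and names below are illustrative, not from the talk):

```python
from datetime import datetime, timezone

# Illustrative CQL: partition by (sensor, day), cluster rows by time
# descending so the newest readings come back first.
CREATE_READINGS = """
CREATE TABLE readings (
  sensor_id text,
  day       date,
  ts        timestamp,
  value     double,
  PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""

def partition_key(sensor_id, ts):
    """Day-bucketed partition key matching the schema above."""
    return (sensor_id, ts.date().isoformat())

print(partition_key("meter-42",
                    datetime(2017, 9, 18, 10, 30, tzinfo=timezone.utc)))
# ('meter-42', '2017-09-18')
```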

Aaron Morton

Aaron Morton is the Co-Founder & Principal Consultant at The Last Pickle, a professional services company that works with clients to deliver and improve Apache Cassandra based solutions. He's based in New Zealand, and is an Apache Cassandra Committer and a DataStax MVP for Apache Cassandra.

Video Game Analytics on AWS

This talk will cover how to use AWS technologies to build an analytics system for video games which can be used to analyse player behaviour in near real-time. This system enables developers to identify trends in player difficulties, ease of use, the highs and lows of player engagement and how to visualise these results in-game. This demo uses a serverless approach for data capturing, processing and serving using AWS Mobile Analytics, Apache Spark on DataBricks, Athena and Lambda technologies.

Representing data in-game enables developers to see results in an environment they are already very familiar with and adjust level design to maximise engagement. Developers can use this information to track the effects of new releases and easily identify whether changes have had the intended effect. These same techniques can be applied in many scenarios, including web tracking and clickstream analytics.
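A minimal, hypothetical shape for the capture step: a Lambda-style handler that reduces raw player events to per-level death counts (the event fields are invented for illustration and are not the Mobile Analytics schema):

```python
# Lambda-style handler: takes a batch of raw player events and returns
# a per-level count of deaths, a crude proxy for level difficulty.
def handler(event, context=None):
    deaths = {}
    for record in event["events"]:
        if record["type"] == "player_death":
            level = record["level"]
            deaths[level] = deaths.get(level, 0) + 1
    return deaths

sample = {"events": [
    {"type": "player_death", "level": "1-2"},
    {"type": "level_complete", "level": "1-1"},
    {"type": "player_death", "level": "1-2"},
]}
print(handler(sample))  # {'1-2': 2}
```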

Richard Morwood

Sr. Big Data Developer

Dipping into the Big Data River: Stream Analytics at Scale

This presentation explains the concept of Kappa and Lambda architectures and showcases how useful business knowledge can be extracted from the constantly flowing river of data.

It also demonstrates how a simple POC could be built in a day with only getting your toes wet by leveraging Docker and other technologies like Kafka, Spark and Cassandra.
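The Kappa idea can be shown in miniature with a framework-free sketch (a deliberate simplification of what Kafka and Spark provide): there is one streaming code path, and "batch" is just replaying the log from the beginning through the same function:

```python
# One code path for both modes: feed it live events as they arrive, or
# replay the whole log from offset zero to recompute state from scratch.
def running_counts(events):
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield dict(counts)   # emit updated state after every event

log = ["click", "view", "click"]
print(list(running_counts(log))[-1])  # {'click': 2, 'view': 1}
```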

Radek Ostrowski

Big Data Engineer

Learnings from Building Data Products at Zendesk

In this talk you will learn about the team structure and process for building data products, drawn from the lessons of one of the teams that builds them at Zendesk. The team uses machine learning to build data products that reduce the cost of customer support for Zendesk's 100,000-odd customers.

This talk will explain the journey of the Data Product team to date: its structure and how it has evolved, its challenges, as well as its successes and failures.

Bob Raman

Engineering Manager

Introduction to Apache Amaterasu (Incubating): A CD Framework for your Big Data Pipelines

In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.

Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.

In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and deploy pipelines. Amaterasu aims to help data engineers and data scientists compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today and how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.

Yaniv Rodenski

Software Developer

The Future of Art

Most people are aware of the impact machine learning will have on jobs, on the future of research and autonomous machines, but few seem to be aware of the future role machine learning could play in the creative arts, in visual art and music. What will art be like when artists and musicians routinely work collaboratively with machines to create new and interesting artworks? What can we learn from art created using neural networks and what can we create? From the frivolous to the beautiful what does art created by computers look like and where can it take us?

This talk will explore Magenta in TensorFlow and Neural Style in Caffe, Google Deep Dream, The Next Rembrandt, and convolutional neural networks. I will look into some of the beautiful applications of machine learning in art, and some of the ridiculous ones as well.

J. Rosenbaum


Cast a Net Over your Data Lake

As the variety of data continues to expand, the need for different kinds of analytics is increasing – big data is no longer just about the volume, but also about its increasing diversity. Unfortunately, there is no one-size-fits-all approach to analytics – no magic pill that will get your organization the insight it needs from data. Graph analytics offers a toolset to visualize your diverse data and to build more accurate predictive models by uncovering non-obvious inter-connections among your data sources.

In this talk we will discuss some use cases for graph analytics and walk through a particular scenario to find power-users for a promotion campaign. We will also cover machine learning approaches which can assist you in constructing graphs from diverse data sources.
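As a toy version of the power-user scenario (the interaction data is invented for illustration): build a user-interaction graph and rank users by how many distinct others they touch, i.e. by degree:

```python
from collections import defaultdict

# Illustrative user-interaction edges; in practice these would be mined
# from diverse data sources.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("ann", "dan")]

neighbours = defaultdict(set)
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)

# rank by degree: users touching the most distinct others come first
power_users = sorted(neighbours, key=lambda u: len(neighbours[u]), reverse=True)
print(power_users[0])  # ann
```

Degree is the simplest centrality; real campaigns would likely weigh edge types or use PageRank-style measures.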

Natalia Ruemmele

Data Scientist
Data61, CSIRO

Covariate Shift - Challenges and Good Practice

A fundamental assumption in supervised machine learning is that both the training and query data are drawn from the same population/distribution. However, in real-life applications this is very often not the case, as the query data distribution is unknown and cannot be guaranteed a priori. Selection bias in collecting training samples will change the distribution of the training data from that of the overall population. This problem is known as covariate shift in the machine learning literature, and using a machine learning algorithm in this situation can result in spurious and often over-confident predictions.

Covariate shift is only detectable when we have access to query data. Visualising the training and query data is helpful for gaining an initial impression. Machine learning models can also be used to detect covariate shift: for example, a Gaussian process can model how far each query point lies from the feature space of the training data, and a one-class SVM can flag query points that are outliers relative to the training data. Both strategies detect query points that live in a different region of feature space from the training dataset.

We suggest two strategies to mitigate covariate shift: re-weighting training data, and active learning with probabilistic models.

First, re-weighting the training data is the process of matching distribution statistics between the training and query sets in feature space. When the model is trained (and validated) on re-weighted data, it is expected to generalise better to query data. However, significant overlap between training and query datasets is required.

Second, there may be a situation where we can acquire labels for a small portion of the query set, potentially at great expense, to reduce the effects of covariate shift. Probabilistic models are required in this case because they indicate the uncertainty in their predictions. Active learning enables us to optimally select small subsets of query points that aim to maximally shrink the uncertainty in our overall prediction.
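The first strategy can be sketched in a few lines of numpy, estimating importance weights q(x)/p(x) with shared histogram bins (a deliberately crude density-ratio estimator for illustration; real applications would use something more robust):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)   # biased training sample, mean ~0
query = rng.normal(1.0, 1.0, 5000)   # shifted query population, mean ~1

# crude density-ratio estimate q(x)/p(x) via shared histogram bins
bins = np.linspace(-4, 5, 30)
p, _ = np.histogram(train, bins, density=True)
q, _ = np.histogram(query, bins, density=True)
idx = np.clip(np.digitize(train, bins) - 1, 0, len(p) - 1)
weights = q[idx] / np.maximum(p[idx], 1e-9)

# re-weighted training data now resembles the query distribution:
# its weighted mean moves from ~0 towards the query mean of ~1
print(np.average(train, weights=weights))
```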

Joyce Wang

Software Engineer
Data61, CSIRO

The Network, The Kingmaker: Distributed Tracing and Zipkin

Adding Zipkin instrumentation to a codebase makes it possible to create one tracing view across an entire platform. This is the oft-mentioned "correlation identifier" recommended in the microservices literature, yet one with few solid open-source solutions available. It is an aspect of monitoring distributed platforms akin to the separate concerns of aggregating metrics and logs.

This talk will use the case of extending Apache Cassandra's tracing to use Zipkin, so as to demonstrate a single tracing view across an entire system: from the browser and HTTP, through a distributed platform, and into the database, down to seeks on disk. Put together, it makes it easy to identify which queries to a particular service took the longest, and to trace back how the application made them.

This presentation will raise the requirements and expectations DevOps has of their infrastructural tools, for people who want to take those tools to the next level, where the network is known as the kingmaker.
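The heart of the correlation identifier is small enough to sketch: Zipkin's B3 headers carry one trace id across every hop, while each hop mints its own span id and records its parent. A stdlib-only sketch (real instrumentation libraries manage this propagation for you):

```python
import secrets

def b3_headers(trace_id=None, parent_span_id=None):
    """B3 propagation headers: one trace id shared by every hop,
    a fresh span id per hop, and a link back to the parent span."""
    return {
        "X-B3-TraceId": trace_id or secrets.token_hex(8),   # 64-bit id
        "X-B3-SpanId": secrets.token_hex(8),
        "X-B3-ParentSpanId": parent_span_id or "",
        "X-B3-Sampled": "1",
    }

incoming = b3_headers()                              # edge of the system
outgoing = b3_headers(trace_id=incoming["X-B3-TraceId"],
                      parent_span_id=incoming["X-B3-SpanId"])
print(outgoing["X-B3-TraceId"] == incoming["X-B3-TraceId"])  # True
```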

Mick Semb Wever

The Last Pickle

Looking Behind Microservices to Brewer's Theorem, Externalised Replication, and Event Driven Architecture

Scaling data is difficult, scaling people even more so.

Today, microservices make it possible to effectively scale both data and people by taking advantage of bounded contexts and Conway's law.
But there's still a lot more theory coming together in our adventures in dealing with ever more data. Some of these ideas and theories are just history repeating, while others are newer concepts.

These ideas can be seen in many Microservices platforms, within the services' code but also in the surrounding infrastructural tools we become ever more reliant upon.

Mick will take a dive into it using examples and offer recommendations from seven years of coding microservices around 'big data' platforms. The presentation will be relevant to people wanting to move beyond REST-based synchronous platforms, to eventually consistent asynchronous designs that aim towards the goal of linear scalability and 100% availability.

Mick Semb Wever

The Last Pickle

Metrivour - Recording and Analyzing Metrics from the Electric Power Network

Metrivour is a metrics recording and analytics system we developed for electric power operations and planning. The metrics are physical quantities such as voltage, current and power in an electric power network.

Some aspects of the system are familiar. The storage backend is a Cassandra database cluster (often used for metrics). The implementation consists of services written in Java and Scala.

Other aspects are distinctive. The system has an analytics engine and query language that are designed for purpose.

The goal is to reduce volumes of noisy, irregular transducer measurements to regular time series of reasonable dimensions. This enables the next level of analysis to be performed by standard tools.
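The reduction step can be sketched with pandas (the values and the 1-minute grid are illustrative; Metrivour's own engine and query language are purpose-built):

```python
import pandas as pd

# Irregular voltage readings reduced to a regular 1-minute series:
# bucket by minute, average within each bucket, fill any empty buckets.
ts = pd.Series(
    [230.1, 229.8, 231.0, 230.4],
    index=pd.to_datetime(["2017-09-18 10:00:07", "2017-09-18 10:00:42",
                          "2017-09-18 10:01:15", "2017-09-18 10:02:58"]),
)
regular = ts.resample("1min").mean().interpolate()
print(len(regular))  # 3
```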

Arnold deVos

Background Signal Pty Ltd
