2 DAY CONFERENCE

YOW! Data 2016

Thursday, 22nd - Friday, 23rd September in Sydney

23 experts spoke.
Overview

YOW! Data is a two day conference that provides in-depth coverage of current and emerging technologies in the areas of Big Data, Analytics and Machine Learning.

The number of data generators (drones, cars, devices, home appliances, gaming consoles, online services, medical devices and wearables) is increasing rapidly, and with that comes increased demand for smarter solutions for gathering, handling and analysing data. YOW! Data 2016 will bring together leading industry practitioners and applied researchers working on data-driven technology and applications.

Programme

Text Classification: Defining Targeted in Targeted Digital Advertising

This talk describes the ideas and conclusions drawn from five years of applying text classification in digital (also known as programmatic, or RTB) advertising to build targeted audiences to advertise to. I will talk about the importance of text classification for targeting, covering both the academic and the technical aspects of applying it to advertising, and will close with a summary of what text classification can achieve. The talk may be useful both for developers and for business people who run RTB advertising companies: developers may gain some technical knowledge, while those more interested in making their own companies competitive will learn whether building audiences for targeting in-house is the way to go.
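
As a rough sketch of what such a pipeline can look like, the Python example below trains a TF-IDF text classifier with scikit-learn to score pages against an audience segment; the example pages, labels and model choice are invented for illustration and are not the speaker's system.

    # Minimal text-classification sketch (illustrative only, not the speaker's system):
    # a TF-IDF vectoriser feeding a linear classifier, applied to hypothetical page
    # texts labelled with an audience segment of interest.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: page text paired with "in segment" / "not in segment".
    pages = [
        "compare family SUVs fuel economy and boot space",
        "celebrity gossip and red carpet photos",
        "best used hatchbacks under 15000 dollars",
        "latest football transfer rumours",
    ]
    labels = [1, 0, 1, 0]  # 1 = car-buyer audience, 0 = other

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(pages, labels)

    # Score a new page: the probability is what a bidder might use to decide whether
    # the impression belongs to the targeted audience.
    print(model.predict_proba(["family car reviews and safety ratings"])[0, 1])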



Elena Akhmatova

NLP & Big Data Scientist
Suncorp Group


Moving Forward Under the Weight of all that State

Keeping systems up to date is an inherently complex challenge. It is greatly complicated by integration challenges, organisational complexity and, increasingly, a massive amount of state. This talk will explore some of the drivers and patterns, and present some approaches we are taking to address this problem in a sustainable manner.



Quinton Anderson

Head of Engineering
Commonwealth Bank


Infrastructure for Smart Cities: Bridging Research and Production

This talk will explore our process for taking research algorithms into production as part of large-scale IoT systems. This will include our experiences developing a condition monitoring system for the Sydney Harbour Bridge, and case studies into some of the challenges we have faced. It will also cover general IoT challenges such as bandwidth limits, weatherproofing, and hardware lock-in, and how we have addressed them.



Ben Barnes

Software Engineer
Data61 | CSIRO


Big Data, Little Data, Fast Data, Slow… Understanding the Potential Value of Click-Stream Data Processing

Digital event data (or click-stream data) is the collective record of users’ interactions with a website or other application. With the growing popularity of collecting this data have come waves of hype about particular approaches and tools for processing and analysing it: from spreadsheets, to dynamic reporting tools, to data warehouses, to massively parallel architectures, to real-time processing. Which processing approach is right for you?

As for any data system, the collection of digital event data is fundamentally about supporting, or even completely automating, our decisions, and this has important implications for how it is processed. This presentation will examine some of the core considerations in formulating an approach to processing this data. What decisions does it support? Who uses the data? And, increasingly importantly as users become more aware of the value exchange for the data they provide, how do we ensure that this data ultimately provides a better service?



Sarah Bolt

Senior Data Scientist
ABC


Property Recommendations for all Australians

We would like to share our journey and experiences in building a large scale recommendation engine at REA. Attendees will learn about choosing the right algorithms, architecture and toolset for a highly-scalable recommender system.
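
As a hint of what choosing the right algorithms involves, here is a toy item-item similarity recommender in Python; the view matrix and the cosine-similarity approach are assumptions for illustration only, not REA's engine.

    # Toy item-item recommender (not REA's system): properties are represented by
    # which users viewed them, and "similar listings" are those whose view vectors
    # have the highest cosine similarity. Data is invented for illustration.
    import numpy as np

    # Rows = properties, columns = users; 1 means the user viewed the listing.
    views = np.array([
        [1, 1, 0, 0, 1],   # property A
        [1, 1, 0, 0, 0],   # property B
        [0, 0, 1, 1, 0],   # property C
    ])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Recommend listings similar to property A.
    scores = [cosine(views[0], views[i]) for i in range(1, len(views))]
    print(scores)  # B is far more similar to A than C is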



Glenn Bunker

Data Science Manager
realestate.com.au


The Best Data Isn’t Data: Why Experiments Are The Future of Data Science

Data is technically the plural of datum, which in Latin is the neuter past participle of dare, which means “to give”. Thus data means “givens”. Indeed, the overwhelming majority of data being analysed out there is given, i.e., the analyst can’t change it. It’s “just there” for you to analyse. You can slice and dice it, model it, act based on it, but you very likely didn’t control even partially the process that gave rise to it. In this talk I’ll try to convince you that, although this given, passive data is important, the real game changer for the future of data science is to combine it with the best possible data, which is not given at all: active data that arises as the outcome of carefully designed experiments. Just as experiments have propelled science to realms unattainable had it focused exclusively on passive observations, I expect the same to happen with data science. I’ll illustrate my arguments with real-world case studies from Ambiata’s experience in serving our clients.
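
To make the contrast concrete, here is a minimal sketch of "active" data in Python: users are randomly assigned to a control or a treatment and the two groups are compared. All numbers are simulated, and the example is not Ambiata's methodology, just the bare shape of a designed experiment.

    # Toy illustration of active data: randomly assign users to control or treatment,
    # observe an outcome, then compare the two groups. Everything here is simulated.
    import random

    random.seed(0)
    users = range(10_000)

    def outcome(in_treatment):
        # Simulated conversion: the treatment lifts the base rate from 5% to 6%.
        base = 0.06 if in_treatment else 0.05
        return 1 if random.random() < base else 0

    assignments = {u: random.random() < 0.5 for u in users}   # the randomisation step
    results = {u: outcome(assignments[u]) for u in users}     # observed outcomes

    treat = [results[u] for u in users if assignments[u]]
    ctrl = [results[u] for u in users if not assignments[u]]
    print("treatment rate:", sum(treat) / len(treat))
    print("control rate:  ", sum(ctrl) / len(ctrl))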



Tiberio Caetano

Chief Scientist
Ambiata


Stomping on Big Data using Google’s BigQuery

Managing infrastructure, worrying about scalability, and waiting for queries to finish executing are some of the biggest challenges when working with massive volumes of data. One solution is to outsource the heavy lifting to someone else, thereby allowing you to spend more time on actually analysing, and drawing insights out of, your data. In other words, look to harnessing the cloud to solve big data problems.

BigQuery is a SaaS tool from Google that is designed to make it easy for us to get up and running without the need to care about any operational overheads. It has a true zero-ops model. BigQuery’s bloodline traces back to Dremel, which was the inspiration for many open-source projects such as Apache Drill. Using massively parallel processing, a tree architecture and columnar storage, your queries will run on thousands of cores inside Google’s data centres without you spinning up a single VM. This talk will cover its core features, cost model, available APIs, and caveats. Finally, there will be a live demo of BigQuery in action.
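
For a flavour of the zero-ops model, the sketch below runs a query through the google-cloud-bigquery Python client; the public dataset referenced and the environment-based credentials are assumptions for illustration.

    # Sketch of running a query with the google-cloud-bigquery Python client.
    # Project and credentials are assumed to come from the environment, and the
    # public dataset below is only an example.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """

    # The query executes on Google's infrastructure; no VMs or clusters to manage.
    for row in client.query(query).result():
        print(row.name, row.total)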



Pablo Caif

Senior Software Engineer
Shine Technologies


The Why and How of Why in a World of What

All data exists in context, and understanding that context is key to unlocking its potential. In this talk you will learn how we consciously and unconsciously influence the context of data, and how qualitative and quantitative methods can be combined to better interpret and extract insights from data.

The presentation will cover:

  • What data in context means
  • How bias and interpretation affect data collection, data analysis, and the design of data-driven applications
  • The importance of combining quantitative data with qualitative insights; data tells you “what”, whereas qualitative insights tell you “why”
  • Lessons learned from five years and over 15 data-driven projects
  • A framework for connecting the “why” to the “what”


Hilary Cinis

User Experience and Design Group Leader
Data61


Data Visualisation for Analysts

The demand for analytical skills has increased rapidly in recent years, and with data analysts generating and analysing large quantities of data, the visualisation and communication of the output has never been more important. A skilled data analyst can not only synthesise information into a logical framework and summarise it into a meaningful format, but can also communicate the output or results of the analysis in a well-laid-out chart, infographic or other visual representation of the data. Hear from data modelling analyst and author of “Using Excel for Business Analysis”, Danielle Stein Fairhurst, as we study the principles of good design and how to convert your data into powerful visuals that tell a story and communicate the message uncovered by your analysis.



Danielle Stein Fairhurst

Principal Consultant
Plum Solutions Pty Ltd


Data Analytics for Accelerated Materials Discovery

Data analytics and machine learning are at the centre of social, marketing, healthcare and manufacturing research. In materials discovery, they play a fundamental role in tackling the exponential increase in the size and complexity of functional materials. This presentation will discuss how data analytics tools can drastically accelerate materials discovery and reveal intrinsic relationships between structural features and functional properties of novel materials. Multivariate statistics techniques and simple decision tree predictors can identify design principles from high-throughput data on candidate materials. Meanwhile, more complex deep learning models are calibrated on the performance of a small set of materials and later generalised to identify high-performing candidates across large virtual material libraries. It will be demonstrated that data-driven predictors can rapidly discriminate among potential candidate materials at a fraction of the traditional cost, whilst providing new opportunities to understand structure-performance paradigms for novel material applications.
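
As a rough illustration of the simple decision tree predictor idea, the Python sketch below fits a tree on a few hypothetical structural descriptors and uses it to screen a large virtual library; the features, targets and numbers are all invented.

    # Minimal materials-screening sketch: fit a decision tree on hypothetical
    # structural descriptors of measured materials, then rank a large virtual
    # library of candidates. All values are synthetic.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(42)

    # Hypothetical descriptors: [pore size, surface area, density] for 50 measured materials.
    X_measured = rng.uniform(0, 1, size=(50, 3))
    y_measured = X_measured @ np.array([0.7, 1.2, -0.5]) + rng.normal(0, 0.05, 50)  # fake property

    model = DecisionTreeRegressor(max_depth=4).fit(X_measured, y_measured)

    # Screen a much larger virtual library and keep the top-ranked candidates.
    X_library = rng.uniform(0, 1, size=(100_000, 3))
    scores = model.predict(X_library)
    top_candidates = np.argsort(scores)[-10:]
    print(top_candidates)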



Michael Fernandez

Research Scientist
Data61


The Why and How of Why in a World of What

All data exists in context, and understanding that context is key to unlocking its potential. In this talk you will learn how we consciously and unconsciously influence the context of data, and how qualitative and quantitative methods can be combined to better interpret and extract insights from data.

The presentation will cover:

  • What data in context means
  • How bias and interpretation affect data collection, data analysis, and the design of data-driven applications
  • The importance of combining quantitative data with qualitative insights; data tells you “what”, whereas qualitative insights tell you “why”
  • Lessons learned from five years and over 15 data-driven projects
  • A framework for connecting the “why” to the “what”


Cam Grant

Senior UX Designer
Data61


Intelligence Amplification with Artificial Intelligence

This talk is an introduction to intelligence augmentation as a framework for delivering systems and products. We will present a high-level survey and examples of the technology, architectural and UX patterns for systems built on data analytics that can be brought together to deliver magical user experiences.



Daniel Harrison

Analytics Technology and Product Consultant
Lever Analytics


Keeping RAFT afloat – Cloud Scale Distributed Consensus

Strong consistency for cloud-scale systems is typically viewed as too hard and too expensive. This talk provides an overview of how implementing the new distributed consensus algorithm, Raft, with high-performance methods and the Aeron network library enables the low-cost processing of over 100M transactions per day.
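
For a taste of the algorithm itself, here is one small, standard piece of Raft sketched in Python: the rule by which a leader advances its commit index to the highest entry replicated on a majority. It is a textbook fragment, not ThreatMetrix's implementation.

    # One standard piece of Raft: the leader may only advance its commit index to an
    # entry stored on a majority of servers. match_index[i] is the highest log entry
    # known to be replicated on follower i; the leader always has its own log.
    # (Real Raft additionally requires the entry to be from the leader's current term.)
    def majority_commit_index(leader_last_index: int, match_index: list[int]) -> int:
        replicated = sorted(match_index + [leader_last_index], reverse=True)
        majority = len(replicated) // 2      # position of the majority threshold
        return replicated[majority]

    # Example: a 5-node cluster where the leader is at index 10 and the followers
    # have acknowledged up to 10, 9, 4 and 3. Index 9 is on 3 of 5 servers, so it is
    # safe to commit; index 10 is only on 2 of 5.
    print(majority_commit_index(10, [10, 9, 4, 3]))  # -> 9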



Philip Haynes

Software Architect
ThreatMetrix


Lake, Swamp or Puddle: Data Quality at Scale

Data is a powerful tool. Data-driven systems leveraging modern analytical and predictive techniques can offer significant improvements over static or heuristic driven systems. The question is: how much can you trust your data?

Data collection, processing and aggregation is a challenging task. How do we build confidence in our data? Where did the data come from? How was it generated? What checks have or should be applied? What is affected when it all goes wrong?

This talk looks at the mechanics of maintaining data quality at scale. It first looks at bad data: what it is and where it comes from. It then dives into the techniques required to detect, avoid and ultimately deal with bad data. At the end of this talk the audience should come away with an idea of how to design quality data-driven systems that build confidence and trust rather than inflate expectations.
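
As a minimal sketch of the detection side, the Python example below runs a few automated checks (null rates, duplicate keys, value ranges) over an incoming batch; the column names and thresholds are illustrative, not from the talk.

    # Minimal automated data-quality checks on an incoming batch: null rates,
    # duplicate keys and value ranges. Thresholds and column names are invented.
    import pandas as pd

    def check_batch(df: pd.DataFrame) -> list[str]:
        problems = []
        if df["user_id"].isna().mean() > 0.01:
            problems.append("more than 1% of rows missing user_id")
        if df["user_id"].duplicated().any():
            problems.append("duplicate user_id values in batch")
        if not df["age"].between(0, 120).all():
            problems.append("age outside plausible range")
        return problems

    batch = pd.DataFrame({"user_id": [1, 2, 2, 4], "age": [34, 29, 29, 150]})
    for problem in check_batch(batch):
        print("bad data:", problem)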



Mark Hibberd

CTO
Kinesis


Don’t Give the Network a Function, Teach the Network how to Function!

Organisations are increasingly prone to outsource network functions to the cloud, aiming to reduce the cost and the complexity of maintaining network infrastructure. At the same time, however, outsourcing implies that sensitive network policies, such as firewall rules, are revealed to the cloud provider. In this talk, I will walk you through an investigation of the use of a few cryptographic primitives for processing outsourced network functions, so that the provider does not learn any sensitive information.

I will present a cryptographic treatment of privacy-preserving outsourcing of network functions, introducing security definitions as well as an abstract model of generic network functions, and then propose a few instantiations using homomorphic encryption and public-key encryption with keyword search. This will serve as an illustration of things you should not do if you are after high-performance function outsourcing; on the other hand, it shows that the approach is feasible when run-time performance is not critical.

I will then present SplitBox, an efficient system for privacy-preserving processing of network functions that are outsourced as software processes to the cloud. Specifically, the cloud providers processing the network functions do not learn the network policies instructing how the functions are to be processed. First, I will present an abstract model of a generic network function based on match-action pairs, which we assume is processed in a distributed manner by multiple honest-but-curious cloud service providers. Then, I will describe in detail the SplitBox system for private network function virtualisation and present a proof-of-concept implementation on FastClick, an extension of the Click modular router, using a firewall as a use case. This PoC achieves a throughput of over 2 Gbps with 1 kB-sized packets on average, traversing up to 60 firewall rules.
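
To make the abstraction concrete, here is a plaintext Python sketch of the match-action model of a network function (a firewall); the rules and packet fields are invented, and the cryptographic machinery that SplitBox adds to hide them from the provider is omitted entirely.

    # Plaintext sketch of the generic match-action model of a network function
    # (a firewall). SplitBox's contribution is to evaluate rules like these without
    # the cloud provider seeing them; no cryptography is shown here.
    from dataclasses import dataclass

    @dataclass
    class Rule:
        match: dict      # field -> required value, e.g. {"dst_port": 22}
        action: str      # "drop" or "allow"

    RULES = [
        Rule({"dst_port": 22, "src_ip": "10.0.0.5"}, "allow"),
        Rule({"dst_port": 22}, "drop"),
    ]

    def apply_rules(packet: dict, rules: list[Rule]) -> str:
        for rule in rules:                   # first matching rule wins
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.action
        return "allow"                       # default action

    print(apply_rules({"src_ip": "10.0.0.9", "dst_port": 22}, RULES))  # -> drop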



Dali Kaafar

Senior Principal Researcher
Data61


Property Recommendations for all Australians

We would like to share our journey and experiences in building a large scale recommendation engine at REA. Attendees will learn about choosing the right algorithms, architecture and toolset for a highly-scalable recommender system.



Ben Kuai

Senior Data Engineer
realestate.com.au


Data Analytics Without Seeing the Data

Today, we first need to collect data before we can analyse it. This not only creates privacy concerns but also security risks for the collector. For many use cases we really only want the analysis, and data collection becomes a necessary evil.

In this talk we describe some of the fundamental techniques which allow us to calculate with encrypted data, as well as protocols for distributed analysis and associated security models. We will use some of the standard algorithms, such as logistic regression, to highlight the differences to conventional big-data analytics frameworks.

Finally, we will discuss the architecture and some interesting implementation details of our N1 Analytics Platform, which is one of the few emerging industry-strength implementations in this space. We will present some performance and scalability measures we collected from initial customer trials.
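
As a small taste of calculating with encrypted data, the sketch below uses the open-source python-paillier (phe) library, whose additively homomorphic scheme lets an analyst aggregate values without seeing them; the salary figures are invented, and the example illustrates the principle rather than the N1 Analytics Platform itself.

    # Computing on encrypted data with an additively homomorphic scheme (Paillier),
    # via the open-source `phe` library. The analyst sums and scales ciphertexts;
    # only the data owner's private key can decrypt the result.
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair()

    # Data owner encrypts their values and shares only ciphertexts.
    salaries = [52_000, 61_500, 48_200]
    encrypted = [public_key.encrypt(s) for s in salaries]

    # Analyst computes on ciphertexts: sum, then scale by 1/n for the mean.
    encrypted_total = encrypted[0] + encrypted[1] + encrypted[2]
    encrypted_mean = encrypted_total * (1 / len(salaries))

    # Only the key holder can decrypt the aggregate.
    print(private_key.decrypt(encrypted_mean))  # ~53900.0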



Max Ott

Senior Principal Engineer
Data61


Big Data Feedback Architectures

Want to harness the real power of big data? Then you’ll need to build an architecture capable of closing the feedback loop through machine learning. In this presentation, I’ll share knowledge gathered from designing streaming big data systems for mobile advertising, where every minute taken off the feedback loop translates to real dollars.

The inherent challenge is balancing technology maturity, hardware cost, and the needs of machine learning. The streaming technology we used is similar to Apache Spark, and gave us a serious competitive edge in dealing with several hundred thousand auctions per second. By combining the power of Hadoop, Cassandra, Hive, and Pig, we managed to build a cost-effective solution capable of handling massive incoming traffic and tens of thousands of user-data enrichments per second, while maintaining zero loss of business-critical data.



Christian Rolf

Developer
Atlassian


Automating Data Integration with Machine Learning

The world of data is a messy and unstructured place, making it difficult to gain value from data. Things get worse when the data resides in different sources or systems. Before we can perform any analytics in such a case, we need to combine the sources and build a unified view of the data. To handle this situation, a data scientist would typically go through each data source, identify which data is of interest, and define transformations and mappings which unify these data with other sources. This process usually includes writing lots of scripts with potentially overlapping code – a real headache in the everyday life of a data scientist! In this talk we will discuss how machine learning techniques and semantic modelling can be applied to automate the data integration process.
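
As a toy illustration of one ingredient of such automation, the Python sketch below scores how likely a column in a new source maps to a known attribute using simple features (name similarity and value overlap); a real system would learn the weights and use richer semantic features, and everything here is invented for illustration.

    # Toy column-matching sketch for data integration: score candidate mappings
    # between a new source's columns and a known attribute using name similarity
    # and value overlap. Weights, data and names are illustrative only.
    from difflib import SequenceMatcher

    def match_score(col_name, col_values, attr_name, attr_values):
        name_sim = SequenceMatcher(None, col_name.lower(), attr_name.lower()).ratio()
        overlap = len(set(col_values) & set(attr_values)) / max(len(set(col_values)), 1)
        return 0.5 * name_sim + 0.5 * overlap

    known_attr = ("suburb", {"Newtown", "Glebe", "Redfern", "Surry Hills"})
    new_columns = {
        "locality": ["Glebe", "Redfern", "Bondi"],
        "price_aud": [720000, 950000, 1100000],
    }

    for name, values in new_columns.items():
        print(name, round(match_score(name, values, *known_attr), 2))
    # "locality" scores far higher than "price_aud", so it is the candidate mapping.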



Natalia Rümmele

Data Scientist
Data61


Infrastructure for Smart Cities: Bridging Research and Production

This talk will explore our process for taking research algorithms into production as part of large-scale IoT systems. This will include our experiences developing a condition monitoring system for the Sydney Harbour Bridge, and case studies into some of the challenges we have faced. It will also cover general IoT challenges such as bandwidth limits, weatherproofing, and hardware lock-in, and how we have addressed them.



Sandy Taylor

Software Engineer
Data61


Intermediate Datasets and Complex Problems

We often build data processing systems by starting from simple cases and slowly adding functionality. Usually, a system like this is thought of as a set of operations that take raw data and create meaningful output. These operations tend to organically grow into monoliths, which become very hard to debug and reason about. As such a system expands, development of new features tends to slow down and the cost of maintenance dramatically increases. One way to manage this complexity is to produce denormalized intermediate datasets which can then be reused for both automated processes and ad hoc querying. This separates the knowledge of how the data is connected from the process of extracting information, and allows these parts to be tested separately and more thoroughly. While there are disadvantages to this approach, there are many reasons to consider it. If this technique applies to you, it makes the hard things easy, and the impossible things merely hard.
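
A minimal pandas sketch of the idea: join the raw tables once into a denormalised intermediate dataset, then let downstream consumers reuse it without re-encoding the joins. The table and column names are invented for illustration.

    # Sketch of an intermediate dataset: join raw tables once into a denormalised
    # table, then derive reports and feed automated processes from that artefact.
    import pandas as pd

    accounts = pd.DataFrame({"account_id": [1, 2], "plan": ["pro", "free"]})
    tickets = pd.DataFrame({"ticket_id": [10, 11, 12], "account_id": [1, 1, 2],
                            "status": ["open", "solved", "open"]})

    # The intermediate dataset: one row per ticket with account context attached.
    ticket_facts = tickets.merge(accounts, on="account_id", how="left")

    # Downstream consumers reuse it without knowing how the raw tables relate.
    open_by_plan = ticket_facts[ticket_facts.status == "open"].groupby("plan").size()
    print(open_by_plan)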



Jeffrey Theobald

Senior Software Engineer
Zendesk


Fast Big Data – Enabling Financial Oversight

For the last decade, there has been increasing concern about the integrity of capital markets. The crash of 2008-2009, and the legal actions and press coverage that followed, have created an image of a world of high-frequency traders who can leverage their computing power to manipulate markets. Technical talks on performance, which is critical in finance, further characterise finance as hooked on speed and low latency. One gets the impression that fast data leads to a fast buck at the public’s expense. However, fast big data also enables the good guys!

We discuss how fast big data is being used in the financial industry to ensure good governance and protect consumers and businesses who depend on the integrity of financial markets. We discuss the better decisions enabled by algorithms; improved testing practices for algorithms; oversight of markets through surveillance; protection against cyber threats; and the use of data forensics to tell the true story of transactions past.



Dave Thomas

Co-Chair Conferences Program & Technical Advisory Board
YOW!


Unit Testing Data

“Can I trust this data?” This question can be difficult to measure and answer objectively. Similar to how unit tests have provided metrics for code coverage and bug regressions, this talk aims to show techniques and recipes developed to quantify data sanitisation and coverage. It also demonstrates an extensible design pattern that allows further tests to be developed.

If you can write a query, you can write data unit tests. These strategies have been implemented in Invoice2go’s ETL pipeline for the last two years to detect data regressions in their Amazon Redshift data warehouse.
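
In that "if you can write a query, you can write a test" spirit, here is a hedged sketch of data unit tests as queries that must return zero rows; it uses SQLite as a stand-in for a warehouse connection, and the tables and checks are invented rather than Invoice2go's actual tests.

    # Data unit tests expressed as queries that must return zero rows. SQLite stands
    # in for a Redshift connection; table names and checks are illustrative only.
    import sqlite3

    TESTS = {
        "invoices must have a positive total":
            "SELECT id FROM invoices WHERE total <= 0",
        "every invoice must reference an existing customer":
            "SELECT i.id FROM invoices i LEFT JOIN customers c ON i.customer_id = c.id "
            "WHERE c.id IS NULL",
    }

    def run_data_tests(conn):
        for name, query in TESTS.items():
            failures = conn.execute(query).fetchall()
            status = "PASS" if not failures else f"FAIL ({len(failures)} rows)"
            print(f"{status}: {name}")

    # Tiny in-memory warehouse with one deliberately bad row to show a failure.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY);
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1);
        INSERT INTO invoices VALUES (100, 1, 49.0), (101, 2, -5.0);
    """)
    run_data_tests(conn)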



Josh Wilson

Senior Data Engineer
Invoice2go

