The proliferation of Big Data systems means that an increasing amount of data is available to Data Scientists, but relatively little of it is collected in a controlled fashion. Most of it is purely observational.
Pearl’s do-calculus offers a way, given a causal model, to get the benefits of randomised controlled trials from purely observational data. This paper proposes a theoretical solution to the problem of combining data from heterogeneous sources that use different selection criteria, and outlines how to correct for confounding bias and selection bias.
- Causal inference and the data-fusion problem (Bareinboim and Pearl, 2016) - http://ftp.cs.ucla.edu/pub/stat_ser/r450-reprint.pdf
- Video - author Elias Bareinboim on causal data science and data fusion - http://www.cs.columbia.edu/streaming/2019-Spr/elias_bareinboim.mp4
- Blog - Adam Kelleher on correcting for selection bias (part of a longer series on causal inference) - https://medium.com/@akelleh/how-do-you-correct-selection-bias-d781a9b12de2
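As a possible entry point for discussion, here is a minimal sketch of what "correcting for confounding bias" can look like in the simplest case. It uses the backdoor adjustment formula, P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) P(Z=z), which is one special case derivable from do-calculus. It is not the paper's data-fusion method and does not cover selection bias. The causal model (Z causes both X and Y, X causes Y) and all the probabilities below are made up for illustration.

```python
# Hypothetical toy model: a binary confounder Z affects both the
# treatment X and the outcome Y. All numbers are invented for illustration.
P_Z1 = 0.5                        # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(X=1 | Z=z)


def p_y1(x, z):
    """P(Y=1 | X=x, Z=z); the true causal effect of X on Y is 0.2."""
    return 0.1 + 0.2 * x + 0.5 * z


def p_z(z):
    """P(Z=z)."""
    return P_Z1 if z == 1 else 1 - P_Z1


def p_x_given_z(x, z):
    """P(X=x | Z=z)."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1 - p1


def naive(x):
    """P(Y=1 | X=x): what conditioning on the observational data gives."""
    num = sum(p_y1(x, z) * p_x_given_z(x, z) * p_z(z) for z in (0, 1))
    den = sum(p_x_given_z(x, z) * p_z(z) for z in (0, 1))
    return num / den


def adjusted(x):
    """P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(p_y1(x, z) * p_z(z) for z in (0, 1))


naive_effect = naive(1) - naive(0)
adjusted_effect = adjusted(1) - adjusted(0)
print(f"naive difference:    {naive_effect:.2f}")
print(f"adjusted difference: {adjusted_effect:.2f}")
```

In this toy model the naive comparison overstates the effect (about 0.50) because Z pushes X and Y in the same direction, while the adjusted estimate recovers the 0.20 effect that was built into `p_y1`.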
Background reading (for wider context, not specifically covered in the session):
- Preview chapters of Causal Inference in Statistics: A Primer
  - Chapter 1 - Preliminaries: Statistical and Causal Models - https://media.wiley.com/product_data/excerpt/46/11191868/1119186846-9.pdf
  - Chapter 4 - Counterfactuals and their applications - http://bayes.cs.ucla.edu/PRIMER/
- Causality - Chapter 1 - http://bayes.cs.ucla.edu/BOOK-99/ch1.pdf
A note about the Journal Club format:
- The sessions usually start with a 5-10 minute introduction to the paper by the topic volunteer, followed by splitting into smaller groups to discuss the paper and other materials. We finish the session by coming together for about 15 minutes to discuss what we have learned as a group and ask questions around the room.
- There is no speaker at Journal Club. A member of the community has volunteered their time to suggest the topic and start the session, but most of the discussion comes from within the groups.
- You will get more benefit from the session if you read the paper or other materials in advance. We try to provide (where we can find them) accompanying blog posts, relevant code and other summaries of the topic to serve as entry points.
- If you don't have time to do much preparation, please come anyway. You will probably have something to contribute, and even if you just end up following the other discussions, you can still learn a lot.
- It's OK just to read the blog post or watch the video :)
- We don't have spare copies of the paper during the session, so please print out your own if you want a hard copy for discussion. For digital copies, you are welcome to use your laptops/tablets/phones during the session.