Department of Computer Science

Causal Data Mining: Identifying Causal Effects at Scale


Computer Science

By: Amit Sharma
Postdoctoral Researcher
From: Microsoft Research
When: Monday, July 17, 2017
4:00 PM - 5:00 PM
Where: Duncan Hall
Abstract: Identifying causal effects is an integral part of scientific inquiry, spanning a wide range of questions about behavior in online systems, the effects of social policies, and risk factors for diseases. However, current data mining and machine learning methods focus on prediction, often ignoring the goal of causal inference. This is partly because inferring causality from observed data is hard unless we make strong assumptions about the data-generating process. In this talk, I will show that we can use properties of the observed data to test many of these strong assumptions, thus enabling a data mining framework for estimating causal effects. The key idea is to look for naturally occurring variations in the data ("natural" experiments) that resemble an actual experiment. I will present two such methods. The first utilizes auxiliary data from large-scale systems to automate the search for natural experiments. Applying it to estimate the additional activity caused by Amazon's recommendation system, I find over 20,000 natural experiments, an order of magnitude more than in past work. These experiments indicate that less than half of the click-throughs typically attributed to the recommendation system are causal; the rest would have happened anyway. The second method is a general test for validating natural experiments in observed data, a task widely considered impossible. Results from the test show that many natural experiments used in recent studies from a premier economics journal are likely invalid. More generally, the proposed framework presents a viable way of doing causal inference in large-scale datasets with minimal assumptions.
Amit Sharma
His research focuses on understanding the underlying mechanisms that shape people's activities online, with a particular emphasis on the effect of recommendation systems and social influence. More generally, his work contributes to methods for causal inference from large-scale data, combining principles from Bayesian graphical models, data mining and machine learning. He completed his Ph.D. in computer science at Cornell University. He has received a Best Paper Honorable Mention Award at the 2016 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), the 2012 Yahoo! Key Scientific Challenges Award and the 2009 Honda Young Engineer and Scientist Award.