Why is it important?
Randomized controlled experiments are the gold standard for establishing causality, but sometimes such experiments cannot be run. Many organizations collect large amounts of data, and although observational causal studies are less trustworthy than controlled experiments, they can still be used to assess causality. When online controlled experiments are not possible, it is useful to know the available study designs and their common pitfalls.
If users switch their phones from iPhone to Samsung, what is the impact on product engagement? How many users come back after being forcibly signed out? If coupons are introduced as part of the business model, what is the impact on revenue? For all these questions, the goal of the analysis is to establish causality, which requires comparing the outcomes of the treated population with those of the untreated population. The "basic identity of causal inference" (Varian 2016) is:
Outcome for treated - Outcome for untreated
= [Outcome for treated - Outcome for treated if not treated] + [Outcome for treated if not treated - Outcome for untreated]
= Impact of treatment on treated + Selection bias
This shows that comparing the actual impact (what happens to the treated population) with the counterfactual (what would have happened to that same population had it not been treated) is the key to establishing causality.
Controlled experiments are the gold standard for assessing causality because, when units are randomly assigned, the first term is the observed difference between treatment and control, and the second term has an expected value of zero.
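The split of an observed difference into a treatment-effect term and a selection-bias term can be made concrete with toy numbers. The sketch below is illustrative only: it assumes both potential outcomes of every user are known (never true in practice), and the function name decompose and the data are hypothetical.

```python
from statistics import mean

def decompose(units):
    """Split the observed treated-vs-untreated gap into effect and selection bias.

    Each unit is (treated, y0, y1): y0 is the outcome without treatment and
    y1 the outcome with treatment. Knowing both is a toy-data assumption.
    """
    treated = [(y0, y1) for is_treated, y0, y1 in units if is_treated]
    untreated = [(y0, y1) for is_treated, y0, y1 in units if not is_treated]

    # What an analyst actually observes: treated show y1, untreated show y0.
    observed_diff = mean(y1 for _, y1 in treated) - mean(y0 for y0, _ in untreated)
    # First term: impact of the treatment on the treated.
    impact_on_treated = mean(y1 - y0 for y0, y1 in treated)
    # Second term: how the groups would differ with no treatment at all.
    selection_bias = mean(y0 for y0, _ in treated) - mean(y0 for y0, _ in untreated)
    return observed_diff, impact_on_treated, selection_bias

# Hypothetical data: heavy users (high y0) chose the treatment themselves.
self_selected = [(True, 10, 12), (True, 8, 10), (False, 4, 6), (False, 2, 4)]
# Same users, but assignment happens to balance heavy and light users.
balanced = [(True, 10, 12), (True, 2, 4), (False, 8, 10), (False, 4, 6)]

print(decompose(self_selected))  # observed gap is mostly selection bias
print(decompose(balanced))       # selection bias vanishes, gap equals the effect
```

In the self-selected data the true effect is 2 for every user, yet the observed gap is 8 because the treated users were already more active. Random assignment makes the second term zero in expectation, not in every sample.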
However, it is sometimes impossible to conduct properly controlled experiments. These situations include:
In the above cases, the best approach is usually to estimate the effect with several methods that sit lower in the hierarchy of evidence, that is, to answer the question in multiple ways, including small-scale user experience studies, surveys, and observational studies. See Chapter 10 for an introduction to these other techniques.
In this chapter, our focus is on estimating causal effects from observational studies, which we call observational causal studies. Some books, such as Shadish et al. (2001), use the term "observational (causal) studies" for studies in which units are not manipulated, and "quasi-experimental designs" for studies in which units are assigned to variants but the assignment is not random. For more information, see Varian (2016) and Angrist and Pischke (2009, 2014). Note that we distinguish observational causal studies from more general observational, or retrospective, data analyses. Both are based on historical log data, but the goal of an observational causal study is to get as close as possible to a causal result. As discussed in Chapter 10, retrospective data analyses have different goals, including summarizing distributions, seeing how common certain behavioral patterns are, analyzing possible metrics, and finding hypotheses that can be tested in controlled experiments.
Observational causal studies face the following challenges:
Interrupted Time Series (ITS) is a quasi-experimental design in which you control the change made to the system but cannot randomize the treatment. Instead, the same population serves as both control and treatment, and what that population experiences is varied over time.
Specifically, ITS takes multiple measurements over time before the intervention to build a model that provides a counterfactual estimate of the metric of interest after the intervention. After the intervention, multiple measurements are taken again, and the treatment effect is estimated as the average difference between the actual values of the metric and the values predicted by the model (Charles and Melvin 2004, 130). An extension of simple ITS is to introduce the treatment and then reverse it, optionally repeating the process several times. For example, the effect of police helicopter surveillance on home burglaries was estimated using multiple treatment interventions: over several months, surveillance was implemented and withdrawn several times. Each time helicopter surveillance was implemented, the number of burglaries decreased; each time it was withdrawn, the number increased (Charles and Melvin 2004). In an online setting, a similar example is understanding the impact of online advertising on search-related site visits. Note that sophisticated modeling may be needed to infer the impact, and Bayesian structural time series analysis (Charles and Melvin 2004) can be used.
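The basic ITS calculation can be sketched in a few lines of Python. This is a minimal illustration under a strong assumption, namely that the pre-intervention metric follows a straight-line trend; the function names and numbers are hypothetical, and a real analysis would typically need a richer model (for example one that handles seasonality).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def its_effect(pre, post):
    """Average gap between actual post-intervention values and the
    counterfactual extrapolated from the pre-intervention trend."""
    slope, intercept = fit_line(list(range(len(pre))), pre)
    # Counterfactual: what the pre-period trend predicts for the post period.
    predicted = [intercept + slope * (len(pre) + i) for i in range(len(post))]
    return sum(actual - pred for actual, pred in zip(post, predicted)) / len(post)

# Hypothetical daily metric: steady growth of 2 per day, then an intervention.
pre_period = [100, 102, 104, 106, 108]
post_period = [115, 117, 119]
print(its_effect(pre_period, post_period))  # 5.0
```

Here the trend alone predicts 110, 112, and 114 for the post period, so the estimated effect is the remaining gap of 5. Anything else that changed at the same time would be absorbed into that number.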
A common problem in observational causal studies is making sure that an effect is not attributed to the change when it is in fact caused by a confounder. The most common confounders for ITS are time-based effects, because the comparisons are made across different points in time. Seasonality is an obvious example, but other underlying system changes can also confound the result. Switching back and forth multiple times helps reduce this likelihood. Another concern with ITS is the user experience: will users notice their experience flipping back and forth? If so, the lack of consistency may irritate or frustrate them, so the measured effect may reflect the inconsistency rather than the change itself.
Interleaved experiment design is commonly used to evaluate changes to ranking algorithms, such as those in search engines or on-site search (Chapelle et al. 2012; Radlinski and Craswell 2013). Suppose an interleaved experiment has two ranking algorithms, X and Y. Algorithm X would show results x1, x2, ..., xn in that order, and algorithm Y would show y1, y2, ..., yn. An interleaved experiment intersperses the two lists, for example x1, y1, x2, y2, ..., xn, yn, with duplicate results removed.
One way to evaluate the algorithms is to compare the click-through rates on the results contributed by each algorithm. Although this is a powerful experiment design, its applicability is limited because the results must be homogeneous. If, as is common, the first result takes up more space or affects other areas of the page, the comparison becomes more complicated.
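A simplified interleaving scheme can be sketched as follows. This is an illustration, not the method of the cited papers: it always lets ranker X pick first so the output is deterministic, whereas production schemes such as team-draft interleaving randomize the picking order. The function names and document IDs are hypothetical.

```python
def interleave(ranking_x, ranking_y):
    """Alternate picks from X and Y, skipping documents already shown.

    Returns a list of (document, ranker) pairs so that clicks can later be
    credited to the ranker that contributed each document.
    """
    queues = {"X": list(ranking_x), "Y": list(ranking_y)}
    merged, seen = [], set()
    turn = "X"
    while queues["X"] or queues["Y"]:
        queue = queues[turn]
        # Drop duplicates that the other ranker has already contributed.
        while queue and queue[0] in seen:
            queue.pop(0)
        if queue:
            doc = queue.pop(0)
            merged.append((doc, turn))
            seen.add(doc)
        turn = "Y" if turn == "X" else "X"
    return merged

def credit_clicks(merged, clicked_docs):
    """Count clicks on the documents contributed by each ranker."""
    credit = {"X": 0, "Y": 0}
    for doc, ranker in merged:
        if doc in clicked_docs:
            credit[ranker] += 1
    return credit

merged = interleave(["a", "b", "c"], ["b", "d", "a"])
print(merged)                            # [('a', 'X'), ('b', 'Y'), ('c', 'X'), ('d', 'Y')]
print(credit_clicks(merged, {"b", "d"}))  # {'X': 0, 'Y': 2}
```

Aggregated over many queries, the ranker that collects more credited clicks is preferred. The sketch treats every result slot as equivalent, which is exactly the homogeneity assumption noted above.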
Regression Discontinuity Design (RDD) is a method that can be used whenever a clear threshold determines which population is treated. Using this threshold, we can reduce selection bias by taking the population just below the threshold as the control and comparing it with the population just above the threshold as the treatment.
For example, when a scholarship is awarded, the near-winners are easy to identify (Thistlethwaite and Campbell 1960). If the scholarship threshold is a score of 80, the treatment group that scored just above 80 is assumed to be similar to the control group that scored just below 80. This assumption is violated when participants can influence their own treatment, for example when the treatment depends on a passing grade and students can persuade their teachers to give them a "mercy pass" (McCrary 2008).
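The scholarship comparison can be sketched as a difference in mean outcomes inside a narrow band around the cutoff. This is a deliberately crude illustration: the function name, the bandwidth, and the data are hypothetical, and a real RDD analysis would usually fit a regression on each side of the threshold rather than compare raw means.

```python
from statistics import mean

def rdd_estimate(records, cutoff, bandwidth):
    """Mean outcome just above the cutoff minus mean outcome just below it.

    records is a list of (score, outcome) pairs. Units at or above the
    cutoff are treated. Raises statistics.StatisticsError if either side
    of the band is empty.
    """
    just_above = [y for score, y in records if cutoff <= score < cutoff + bandwidth]
    just_below = [y for score, y in records if cutoff - bandwidth <= score < cutoff]
    return mean(just_above) - mean(just_below)

# Hypothetical (exam score, later outcome) pairs; scholarship awarded at 80+.
students = [(60, 30), (78, 50), (79, 52), (80, 60), (81, 62), (95, 90)]
print(rdd_estimate(students, cutoff=80, bandwidth=2))  # 10
```

Students far from the cutoff (scores 60 and 95) are ignored because they differ in many ways besides the scholarship. A narrower band gives more comparable groups but fewer data points.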
An example of using RDD is assessing the impact of drinking on mortality: Americans over 21 can legally drink alcohol, so we can look at deaths by birthday, as shown in Figure 11.2. Mortality risk shoots up on and immediately after the twenty-first birthday, adding about 100 deaths to a baseline of about 150 per day. The spike at age 21 does not appear to be a generic birthday-party effect: if it reflected birthday partying alone, deaths should jump similarly after the twentieth and twenty-second birthdays, but that does not happen (Angrist and Pischke 2014).
As in the example above, a key issue is confounders. In RDD, the discontinuity at the threshold may be contaminated by other factors that share the same threshold. For example, a study of the impact of alcohol that uses the legal drinking age of 21 as its threshold may be contaminated by the fact that 21 is also the legal age for gambling, so the two effects cannot be separated.
RDD most commonly applies when an algorithm generates a score and something happens based on a threshold of that score. Note that when this happens in software, RDD is one option, but the setting also lends itself readily to a randomized controlled experiment, or to some hybrid of the two (Owen and Varian 2018).
Instrumental Variables (IV) is a technique that tries to approximate random assignment. Specifically, the goal is to identify an instrument that lets us approximate random assignment, as happens organically in a natural experiment (Angrist and Pischke 2014; Pearl 2009).
For example, to analyze differences in earnings between veterans and non-veterans, the Vietnam War draft lottery resembles random assignment of individuals into the military; charter school seats are allocated by lottery and so can be a good instrument for some studies. In both examples, the lottery does not guarantee attendance, but it has a large impact on it. A two-stage least squares regression model is then commonly used to estimate the effect.
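For the simplest case of one binary instrument (the lottery) and no other covariates, two-stage least squares reduces to a ratio of two differences in means, often called the Wald estimator. The sketch below illustrates that special case only; the function name and the data are hypothetical.

```python
from statistics import mean

def wald_estimate(rows):
    """IV estimate from rows of (won_lottery, attended, outcome).

    Numerator: effect of winning the lottery on the outcome.
    Denominator: effect of winning the lottery on attendance (first stage).
    """
    outcome_won = mean(y for z, _, y in rows if z)
    outcome_lost = mean(y for z, _, y in rows if not z)
    attend_won = mean(d for z, d, _ in rows if z)
    attend_lost = mean(d for z, d, _ in rows if not z)
    first_stage = attend_won - attend_lost
    if first_stage == 0:
        raise ValueError("the instrument does not change attendance")
    return (outcome_won - outcome_lost) / first_stage

# Hypothetical data: attending adds 2 to the outcome. Winning the lottery
# raises attendance from 25% to 75% but does not guarantee it.
lottery = (
    [(1, 1, 12)] * 3 + [(1, 0, 10)] +   # winners: three attend, one does not
    [(0, 0, 10)] * 3 + [(0, 1, 12)]     # losers: one attends anyway
)
print(wald_estimate(lottery))  # 2.0
```

Winning the lottery shifts the average outcome by only 1, but because it shifts attendance by 0.5, the implied effect of attending is 2. The estimate is only as good as the assumption that the lottery affects the outcome solely through attendance.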
Sometimes natural experiments that are "as good as random" occur. In medicine, monozygotic twins allow twin studies to be run as natural experiments (Harden et al. 2008; McGue 2014). When studying social or peer networks, running controlled experiments on members can be challenging because communication between members means the effect may not stay confined to the treatment population. However, notification queues and message delivery order are types of natural experiments that can be used to understand the impact of an intervention.
Another approach is to construct comparable control and treatment populations, often by segmenting users on common confounders, in something akin to stratified sampling. The idea is to ensure that the comparison between control and treatment is not driven by differences in population mix. For example, if we are examining an exogenous change such as the impact of users switching from Windows to iOS, we want to make sure we are not measuring a demographic difference between the populations.
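The effect of segmenting on a confounder can be illustrated with a small sketch that compares treatment and control within each segment and then averages the within-segment differences, weighted by segment size. The segment labels, function names, and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def naive_difference(rows):
    """Treated mean minus control mean, ignoring segments."""
    treated = [y for _, is_treated, y in rows if is_treated]
    control = [y for _, is_treated, y in rows if not is_treated]
    return mean(treated) - mean(control)

def stratified_difference(rows):
    """Size-weighted average of within-segment treated-minus-control gaps."""
    segments = defaultdict(lambda: {True: [], False: []})
    for segment, is_treated, y in rows:
        segments[segment][bool(is_treated)].append(y)
    weighted_sum, total = 0.0, 0
    for groups in segments.values():
        if not groups[True] or not groups[False]:
            continue  # segment has no comparison group, so skip it
        size = len(groups[True]) + len(groups[False])
        weighted_sum += size * (mean(groups[True]) - mean(groups[False]))
        total += size
    if total == 0:
        raise ValueError("no segment contains both treated and control units")
    return weighted_sum / total

# Hypothetical rows of (segment, treated, outcome). Heavy users are
# over-represented among the treated, which inflates the naive comparison.
rows = [
    ("heavy", True, 10), ("heavy", True, 12), ("heavy", False, 9), ("heavy", False, 11),
    ("light", True, 3), ("light", False, 2), ("light", False, 2), ("light", False, 2),
]
print(naive_difference(rows))       # about 3.13, driven by population mix
print(stratified_difference(rows))  # 1.0, the gap within each segment
```

This only removes bias from the confounders that were actually used to define the segments.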
We can take this approach further with propensity score matching (PSM), which, instead of matching units on covariates, matches on a single number: a constructed propensity score (Rosenbaum and Rubin 1983; Imbens and Rubin 2015). This method has been used in the online space, for example to evaluate the impact of online ad campaigns (Chan et al. 2010). The main concern with PSM is that only observed covariates are accounted for, so unmeasured factors may produce hidden biases. Judea Pearl (2009, 352) wrote: "Rosenbaum and Rubin ... What they failed to realize, however, is that it is not enough to warn people against dangers they cannot recognize." King and Nielsen (2018) claim that PSM often accomplishes the opposite of its intended goal, increasing imbalance, inefficiency, model dependence, and bias.
For all of these methods, the key concern is confounders.
Many of the methods above focus on how to identify a control group that is as similar as possible to the treatment group. Given such a group, one way to measure the effect of the treatment is difference in differences (DD or DID). Assuming common trends, the difference in differences is attributed to the treatment. In particular, the groups "may differ in the absence of treatment but move in parallel" (Angrist and Pischke 2014).
Geographically based experiments commonly use this technique. Suppose you want to understand the impact of TV advertising: run TV ads in one DMA (designated market area) and compare it with another DMA. As shown in the figure, a change is made to the treatment group at time T1. Measurements are taken for both treatment and control just before T1 and at a later time T2. The difference in the metric of interest (such as the OEC) between the two periods in the control group is assumed to capture external factors (such as seasonality, economic strength, and inflation), and thus provides the counterfactual of what would have happened to the treatment group. The treatment effect is estimated as the difference in the metric of interest for the treatment group minus the difference in that metric for the control group over the same period.
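The calculation is simple arithmetic, sketched below. The function name and the numbers are hypothetical, and the result is only meaningful if the common-trends assumption holds.

```python
def difference_in_differences(treat_pre, treat_post, control_pre, control_post):
    """Change in the treatment group minus change in the control group."""
    treatment_change = treat_post - treat_pre
    # The control group's change stands in for what would have happened
    # to the treatment group without the intervention.
    control_change = control_post - control_pre
    return treatment_change - control_change

# Hypothetical metric values just before T1 and at T2 for two DMAs.
effect = difference_in_differences(
    treat_pre=100, treat_post=130,    # DMA with TV ads: +30
    control_pre=90, control_post=105  # DMA without TV ads: +15
)
print(effect)  # 15
```

The two DMAs start at different levels (100 versus 90), which is fine: the method compares changes, not levels.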
Note that this method can also be applied when the change is made externally rather than by you. For example, when New Jersey changed its minimum wage, researchers who wanted to study the impact on employment levels in fast-food restaurants compared it with eastern Pennsylvania, which is similar to New Jersey on many characteristics (Card and Krueger 1994).
Although observational causal studies are sometimes the best option available, they have many pitfalls to watch for (see Newcomer et al. (2015) for a more exhaustive list). As mentioned above, the main pitfall, whatever the method, is unanticipated confounders that affect both the measured effect and the attribution of causality to the change of interest. Because of these confounders, observational causal studies require a great deal of care to yield trustworthy results. In addition, many observational causal studies have been refuted (see the sidebar "Refuted Observational Causal Studies" later in this chapter, and Chapter 17).
One common type of confounder is an unrecognized common cause. For example, in humans, palm size is strongly correlated with life expectancy: on average, the smaller the palm, the longer the life expectancy. However, the common cause of both smaller palms and longer life is gender: women have smaller palms and live longer on average (about six years longer in the United States).
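A small numerical sketch shows how a common cause can create a correlation that disappears once the cause is held fixed. The data below are invented purely for illustration (they are not real measurements), and the function name pearson is a local helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented (palm size in cm, lifespan in years) values. Within each group,
# palm size carries no information about lifespan.
women_palm, women_life = [16, 17, 18], [80, 82, 80]
men_palm, men_life = [19, 20, 21], [74, 76, 74]

print(pearson(women_palm, women_life))  # 0.0 within women
print(pearson(men_palm, men_life))      # 0.0 within men
# Pooling the groups produces a strong negative correlation.
print(pearson(women_palm + men_palm, women_life + men_life))
```

The pooled correlation comes entirely from the difference between the two groups, so shrinking someone's palm would do nothing for their lifespan.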
As another example, for many products, including Microsoft Office 365, users who see more errors typically churn less! But intuitively, errors are not what make users stay. The correlation has a common cause, usage: heavy users of the product see more errors and also churn less. It is not uncommon for feature owners to find that users of a new feature churn less, but this does not necessarily mean the new feature retains users; it may simply be that heavy users, who churn less anyway, are the ones who try new features. In these cases, evaluating whether a new feature really reduces churn requires a controlled experiment (with new and existing users analyzed separately).
Another pitfall to be aware of is spurious or deceptive correlation. Deceptive correlations can be caused by strong outliers. For example, as shown in Figure 11.5, a marketing company can claim that its energy drink is highly correlated with athletic performance and imply causation: drink our energy product and your athletic performance will improve (Orlin 2016).
Spurious correlations can almost always be found (Vigen 2018). When we test many hypotheses and lack the intuition to reject a causal claim, as we had in the example above, we may believe it. For example, if someone tells you they have found a factor that is strongly correlated (r = 0.86) with people killed by venomous spiders, you might be tempted to act on that information. But once you learn that the factor is the length of words in the National Spelling Bee, as shown in the figure, you would quickly dismiss a request to shorten the words in order to reduce deaths as irrational.
Even when care is taken, there is never a guarantee that an observational causal study is free of some other factor that affects the results. Quasi-experimental methods, which attempt to derive a counterfactual and hence establish causality, require many assumptions, any of which can be wrong and some of which are implicit. Incorrect assumptions can lead to a lack of internal validity and, depending on the assumptions and how limiting they are, can also affect the external validity of the study. As discussed in Chapter 1, intuition helps improve the quality of assumptions, but it cannot rule out every possible problem. The scientific gold standard for establishing causality therefore remains the controlled experiment.
Inferring causality from observational data requires multiple assumptions that cannot be tested and are easily violated. Although many observational causal studies were later confirmed by randomized controlled experiments (Concato, Shah and Horwitz 2000), others were refuted. Ioannidis (2005) evaluated claims from highly cited studies; of the six observational causal studies he included, five failed to replicate. Stanley Young and Alan Karr (2019) compared published medical results that were statistically significant in observational causal (that is, uncontrolled) studies with randomized clinical trials, which are considered more reliable. Of 52 claims in 12 papers, none replicated in the randomized controlled trials, and in 5 of the 52 cases the effect was statistically significant in the opposite direction from the observational causal study. Their conclusion: "Any claim coming from an observational study is most likely to be wrong."
One example from the online space is measuring the effectiveness of online advertising, in other words, whether online ads lead to increased brand activity or even user engagement. Observational causal studies are often needed to measure the effect because the intervention (the ad) and the effect (user sign-up or engagement) usually occur on different sites and therefore in different spheres of control. Lewis, Rao and Reiley (2011) compared the effectiveness of online advertising as estimated by observational causal studies with "gold standard" controlled experiments, and found that the observational causal studies vastly overestimated the effect. Specifically, they ran three experiments.
First, display advertisements were shown to users. The question was: what is the increase (lift) in the number of users who search using keywords related to the brand shown in the ad? Observational causal studies of 50 million users, including three regression analyses with control variables, estimated lifts ranging from 871% to 1198%. These estimates are orders of magnitude higher than the 5.4% lift measured in the controlled experiment. The confounder is a common cause: the user visiting Yahoo! in the first place. Users who actively visit Yahoo! on a given day are more likely both to see the display ad and to perform a Yahoo! search. Ad exposure is highly positively correlated with search behavior, but the causal effect of the display ad on search is very small.
Next, videos were shown to users, and the question was whether the videos led to increased activity. Users were recruited through Amazon Mechanical Turk; half were shown a 30-second video ad promoting Yahoo.com services (the treatment) and the other half were shown a political video ad (the control), with the goal of measuring whether activity on Yahoo! increased. The researchers conducted two analyses:
Finally, an ad campaign was shown to users on Yahoo!, with the goal of assessing whether users who saw the ad were more likely to sign up on a competitor's website on the day they saw it. The observational causal study compared exposed users' behavior on the day they saw the ad with their behavior a week earlier, whereas the controlled experiment compared users who visited Yahoo! and saw the ad with users who visited Yahoo! and did not. According to the observational causal study, users who saw the ad were more likely to sign up on the competitor's website on the day they saw it than in the previous week. In the controlled experiment, however, whether or not users saw the ad made no difference to this behavior. This result is similar to the churn and errors pitfall discussed earlier: more active users are simply more likely to do a broad range of activities. Here, activity is the confounder.
This is just one story. A more recent comparative study also found that observational causal studies are less accurate than online controlled experiments (Gordon et al. 2018). We provide more stories at https://bit.ly/experimentGuideRefutedObservationalStudies, showing examples of unidentified common causes, time-sensitive confounders, population differences that lead to a lack of external validity, and more. Be careful when you have to rely on observational causal studies.