We will now explore formal methodologies that can help us derive causal effects from observational data. These methodologies will ultimately allow us to answer the question raised by Simpson’s Paradox. The process of determining the size of a causal effect from observational data can be divided into two steps, namely identification and estimation.
Identification analysis is about determining whether or not a causal effect can be established from the observed data. This requires a formal causal model or at least partial knowledge of how the data was generated. In this chapter, all causal assumptions for identification are expressed explicitly in the form of a Directed Acyclic Graph (DAG) (Pearl 1995, 2009). This graph represents our complete causal understanding of the Data-Generating Process (DGP) for the system we are studying.
Where do we get such causal assumptions? We would like to say that advanced algorithms can generate causal assumptions from data. That is not the case, unfortunately. Causal assumptions do still require human expert knowledge or, more generally, theory. In practice, this means that we need to build (or draw) a causal graph of our domain. Then, we can examine this graph against formal criteria, which determine whether the effect is identifiable or not.
It is important to realize that the absence of causal assumptions cannot be compensated for by clever statistical techniques or by providing more data. So, recognizing that a causal effect is not identifiable brings the effect analysis to an abrupt halt.
But if the causal effect is identifiable, we can proceed to estimate the effect size. The same criteria that determine identifiability also tell us how to perform the effect estimation. With that, we can utilize the available observational data and estimate the causal effect. Depending on the complexity of the domain, the effect estimation can bring a new set of challenges. However, in the context of Simpson’s Paradox, the effect estimation will be very straightforward.
We need to understand some important properties before encoding our causal knowledge in a DAG. We learned in Chapter 2 that Bayesian networks use DAGs for the qualitative description of the Joint Probability Distribution.
In the causal context, however, the arcs in a DAG explicitly state causality instead of only representing direct probabilistic dependencies in a Bayesian network. We now designate a DAG with a causal semantic as a Causal DAG (CDAG) to highlight this distinction.
A DAG has three basic configurations in which nodes can be connected. Graphs of any size and complexity can be broken down into these basic graph structures. While these basic structures show direct dependencies/causes explicitly, there are more statements contained in them, albeit implicitly. In fact, we can read all marginal and conditional associations that exist between the nodes.
Why are we even interested in associations? Isn’t all this about understanding causal effects? It is essential to understand all associations in a system because, in non-experimental data, all we can do is observe associations, some of which represent non-causal relationships. Our objective is to identify causal effects from associations.
This DAG represents an indirect connection from A to B via C.
A Directed Arc represents a potential causal effect. The arc direction indicates the assumed causal direction, i.e., “A → C ” means “A causes C .”
A Missing Arc encodes the definitive absence of a direct causal effect, i.e., no arc between A and B means that neither variable directly causes the other. As such, a missing arc represents an assumption.
Implication for Causality
A has a potential causal effect on B, mediated by C.
Implication for Association
Marginally (or unconditionally), A and B are dependent. This means that without knowing the exact value of C, learning about A informs us about B and vice versa, i.e., the path between the nodes is unblocked, and information can flow in both directions.
Conditionally on C, i.e., by setting Hard Evidence on (or observing) C, A and B become independent. In other words, by “hard”-conditioning on C, we block the path from A to B and from B to A. Thus, A and B are conditionally independent, given C:
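$$A \perp\!\!\!\perp B \mid C$$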
Hard Evidence means that there is no uncertainty regarding the value of the observation or evidence. If uncertainty remains regarding the value of C, the path will not be entirely blocked, and an association will remain between A and B.
The second configuration has C as the common parent of A and B.
Implication for Causality
C is the common cause of both A and B.
Implication for Association
In terms of association, this structure is equivalent to the Indirect Connection. Thus, A and B are marginally dependent but conditionally independent given C (by setting Hard Evidence on C):
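$$A \perp\!\!\!\perp B \mid C$$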
The final structure has a common child C, with A and B being its parents. This structure is called a “V-Structure.” In this configuration, the common child C is also known as a “collider.”
Implication for Causality
A and B are the direct causes of C.
Implication for Association
Marginally (or unconditionally), A and B are independent, i.e., there is no information flow between A and B. Conditionally on C — with any kind of evidence — A and B become dependent. If we condition on the collider C, information can flow between A and B, i.e., conditioning on C opens the information flow between A and B:
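$$A \perp\!\!\!\perp B, \qquad A \not\perp\!\!\!\perp B \mid C$$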
Even introducing a minor change in the distribution of C, e.g., from no observation (“color unknown”) to a very vague observation (“it could be anything, but it is probably not purple”), opens the information flow.
For purposes of formal reasoning, this type of connection is of special significance. Conditioning on C facilitates inter-causal reasoning, often referred to as the ability to “explain away” the other cause, given that the common effect is observed (see Inter-Causal Reasoning in Chapter 4).
To begin the encoding of our causal knowledge in the form of a CDAG, we draw three nodes, which represent X (Treatment), Y (Outcome), and Z (Gender). For now, we are only using the qualitative part of the network, i.e., we are not considering probabilities.
The absence of further nodes means that we assume that there are no additional variables in the DGP, either observable or unobservable. Unfortunately, this is a very strong assumption that cannot be tested. We need a justification on purely theoretical grounds to make such an assumption.
In the next step, we must encode our causal assumptions regarding this domain. Given our background knowledge of this domain, we state that Z causes X and Y and that X causes Y.
This means that we believe that gender is a cause of taking the treatment and has a causal effect on the outcome, too. We also assume that the treatment has a potential causal effect on the outcome.
Having accepted these causal assumptions, we now wish to identify the causal effect of X on Y. The question is whether this is possible on the basis of this causal graph and the available observational data for these three variables. Before we can answer this question, we need to think about what this CDAG specifically implies. Recall the types of structures that can exist in a DAG (see Structures Within a DAG). As it turns out, we can find all three of the basic structures in this example:
Indirect Connection: Z causes Y via X
Common Parent: Z causes X and Y
Common Child: Z and X cause Y
Randomized experiments have always been the gold standard for establishing causal effects. For instance, in the drug approval process, controlled experiments are mandatory. Without first having established and quantified the treatment effect, and any associated side effects, no new drug could win approval from the Food and Drug Administration.
However, in many other domains, experiments are not feasible, be it for ethical, economic, or practical reasons. For example, it is clear that a government could not create two different tax regimes to evaluate their respective impact on economic growth. Neither would it be possible to experiment with two different levels of carbon emissions to measure a warming effect on the global climate.
“So, what does our existing data say?” would be an obvious question from policymakers, especially given today’s high expectations concerning Big Data. Indeed, in lieu of experiments, we can attempt to find instances in which the proposed policy already applies (by some assignment mechanism) and compare those to other instances in which the policy does not apply.
However, as we will see in this chapter, performing causal inference on the basis of observational data requires an extensive range of assumptions, which can only come from theory, i.e., domain-specific knowledge. Despite all the wonderful advances in analytics in recent years, data alone, even Big Data, cannot prove the existence of causal effects.
Today, we can openly discuss how to perform causal inference from observational data. For the better part of the 20th century, however, the prevailing opinion had been that speaking of causality without experiments is unscientific. Only towards the end of the century did this opposition slowly erode (Rubin 1974, Holland 1986), which subsequently led to numerous research efforts spanning philosophy, statistics, computer science, information theory, etc. The Potential Outcomes Framework has played an important role in this evolution of thought.
Although there is no question about the common-sense meaning of “cause and effect,” for formal analysis, we require a precise mathematical definition. In the fields of social science and biostatistics, the potential outcomes framework is a widely accepted formalism for studying causal effects (the potential outcomes framework is also known as the counterfactual model, the Rubin model, or the Neyman-Rubin model). Rubin (1974) defines "causal effect" as follows:
The individual-level causal effect (ICE) is defined as the difference between the individual’s two potential outcomes, i.e.,
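$$ICE(i) = Y_1(i) - Y_0(i)$$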
The question is, how can we move from what we can measure, i.e., the naive association, to the quantity of interest, i.e., the causal effect? Determining whether we can measure causation from association is known as identification analysis.
Remarkably, the conditions under which we can measure causal effects from observational data are very similar to those that justify causal inference in randomized experiments. A pure random selection of treated and untreated individuals does indeed remove any potential selection bias and leaves the confounding factor distributions identical in the sub-populations, thus allowing the estimation of the effect of the treatment alone. This condition is known as “ignorability,” which can be formally written as:
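$$(Y_1, Y_0) \perp\!\!\!\perp T$$

If ignorability only holds within subgroups defined by covariates X (conditional ignorability), the measure of association within those subgroups equals the measure of causation: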
$$\begin{aligned} ACE|X &= E[Y_1 \mid X] - E[Y_0 \mid X] \\ &= E[Y_1 \mid T=1, X] - E[Y_0 \mid T=0, X] \\ &= E[Y \mid T=1, X] - E[Y \mid T=0, X] \\ &= S|X \end{aligned}$$
The difficulty that most investigators experience in comprehending what “ignorability” means, and what judgment it summons them to exercise, has tempted them to assume that it is automatically satisfied, or at least is likely to be satisfied if one includes in the analysis as many covariates as possible. The prevailing attitude is that adding more covariates can cause no harm (Rosenbaum 2002, p. 76) and can absolve one from thinking about the causal relationships among those covariates, the treatment, the outcome, and, most importantly, the confounders left unmeasured (Rubin 2009).
The absence of hard-and-fast criteria makes ignorability a potentially dangerous concept for practitioners.
In a DAG, a path is a sequence of non-intersecting, adjacent arcs, regardless of their direction.
A causal path can be any path from cause to effect, in which all arcs are directed away from the cause and pointed toward the effect.
A non-causal path can be any path between cause and effect in which at least one of the arcs is oriented from effect to cause.
Our example contains both types of paths:
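Causal Path: X → Y
Non-Causal Path: X ← Z → Y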
This distinction between causal and non-causal paths is critically important for identification.
The Adjustment Criterion (Shpitser et al., 2010) is perhaps the most intuitive among several graphical identification criteria. The Adjustment Criterion states that a causal effect is identified if we can adjust for a set of variables such that:
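All non-causal paths between treatment and outcome are blocked.
All causal paths from treatment to outcome remain open.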
What does “adjust for” mean in practice? “Adjusting for a variable” can stand for any of the following operations, which all introduce information on a variable:
Controlling
Conditioning
Stratifying
Matching
At this point, the adjustment technique is irrelevant. Rather, we only need to determine which variables, if any, need to be adjusted for in order to block the non-causal paths while keeping the causal paths open. Revisiting both paths in our CDAG, we can now examine which ones are open or blocked:
In this example, the Adjustment Criterion can be met by blocking the non-causal path X ← Z → Y by means of adjusting for Z. In other words, adjusting for Z allows identifying the causal effect from X to Y. From now on, we will often refer to such variables Z as Confounders.
Thus far, we have assumed that our example has no unobserved (also called hidden or latent) variables. However, if we had reason to believe that there is another variable, U, which appears to be relevant on theoretical grounds but was not recorded in the dataset, identification could no longer be possible. Why? Let us assume U is a hidden common cause of X and Y. By adding this unobserved variable U, a new non-causal path appears between X and Y via U.
Given that U is hidden, there is no way to adjust for it, and, therefore, we have an open, non-causal path that cannot be blocked. Hence, the causal effect is no longer identifiable, and thus, it can no longer be estimated without bias.
This highlights how easily identification can be “ruined.” Once again, we can only justify the absence of unobserved variables on theoretical grounds.
“Δημόκριτος έλεγε βούλεσθαι μάλλον μίαν ευρείν αιτιολογίαν ή την Περσών βασιλείαν εαυτού γενέσθαι.” (“Democritus used to say that he would rather discover one causal explanation than become king of Persia.”) — Democritus, according to a late testimony of Dionysius, Bishop of Alexandria, by Eusebius of Caesarea in Præparatio Evangelica (Εὐαγγελικὴ προπαρασκευή)
Bayesian Belief networks and modern causality analysis are intimately tied to the seminal works of Judea Pearl. It is presumably fair to say that one of the “unique selling points” of Bayesian networks is their capability to perform causal inference. However, we do want to go beyond merely demonstrating the mechanics of causal inference. Rather, we want to establish under what conditions causal inference can be performed. More specifically, we want to see the assumptions required to perform causal inference with non-experimental data.
To approach this topic, we need to break with the pattern established in the earlier chapters of this book. Instead of starting with a case study, we start off at a higher level of abstraction. First, we discuss in theoretical terms what is required for performing causal identification, estimation, and inference. Once these fundamentals are established, we can proceed to discuss the methods, along with their limitations, including Directed Acyclic Graphs and Bayesian networks. These techniques can help us distinguish causation from association when working with non-experimental data.
This chapter was prepared in collaboration with Felix Elwert on the basis of his course, Causal Inference with Graphical Models.
In this chapter, we discuss causality mostly on the basis of a “toy problem,” i.e., a simplified and exaggerated version of a real-world challenge. As such, the issues we raise about causality may appear somewhat contrived. Additionally, given how routinely we perform intuitive causal inference in our daily lives, a formal discussion of it may seem somewhat artificial.
To highlight the importance of causal inference on a large scale, we want to consider how and under what conditions big decisions are typically made. Major government or business initiatives generally call for extensive studies to anticipate the consequences of actions not yet taken. Such studies are often referred to as “policy analysis” or “impact assessment.”
What can be the source of such predictive powers? Policy analysis must discover a causal mechanism that links a proposed action/policy to a potential consequence/impact. Unfortunately, experiments are typically out of the question in this context. Rather, impact assessments—from non-experimental observations alone—must determine the existence and the size of a causal effect.
Given the sheer number of impact analyses performed and their tremendous weight in decision-making, one would like to believe that there has been a long-established scientific foundation with regard to (non-experimental) causal effect identification, estimation, and inference. Quite naturally, as decision-makers quote statistics in support of policies, the field of statistics comes to mind as the discipline that studies such causal questions.
However, casual observers may be surprised to hear that causality has been anathema to statisticians for the longest time.
"Considerations of causality should be treated as they always have been treated in statistics, preferably not at all..." (Speed, 1990).
The repercussions of this chasm between statistics and causality can still be felt today. Judea Pearl highlights this unfortunate state of affairs in the preface of his book Causality:
"… I see no greater impediment to scientific progress than the prevailing practice of focusing all our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment." (Pearl, 1999)
Rubin (1974) and Holland (1986), who introduced the counterfactual (potential outcomes) approach to causal inference, can be credited with overcoming statisticians’ traditional reluctance to engage with causality. However, it will take many years for this fairly recent academic consensus to fully reach the world of practitioners, which is one of our key drivers for promoting Bayesian networks.
Please read the following subchapters in sequence to receive a complete explanation.
“Intuitively, the causal effect of one treatment, T = 1, over another, T = 0, for a particular unit and an interval of time from t₁ to t₂ is the difference between what would have happened at time t₂ if the unit had been exposed to T = 1 initiated at t₁ and what would have happened at t₂ if the unit had been exposed to T = 0 initiated at t₁: ‘If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone,’ or ‘because an hour ago I took two aspirins instead of just a glass of water, my headache is now gone.’ Our definition of the causal effect of the T = 1 versus T = 0 treatment will reflect this intuitive meaning.”
In this quote, we altered the original variable names E to T = 1 and C to T = 0 in order to be consistent with the nomenclature in the remainder of this chapter. T is commonly used in the literature to denote the treatment condition.
Y₁(i): Potential outcome of individual i given treatment, T = 1 (e.g., taking two aspirins)
Y₀(i): Potential outcome of individual i given no treatment, T = 0 (e.g., drinking a glass of water)
Given that we cannot rule out differences between individuals (effect heterogeneity), we define the average causal effect as the unweighted arithmetic mean of the individual-level causal effects:
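$$ACE = E[ICE(i)] = E[Y_1(i) - Y_0(i)]$$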
E[·] denotes the expected value, i.e., the unweighted arithmetic mean.
The challenge is that Y₁ (the outcome under treatment) and Y₀ (the outcome under non-treatment) can never both be observed for the same individual at the same time. We can only observe treatment or non-treatment, but not both.
So, where does this leave us? What we can produce easily is the “naive” estimator of association between the “treated” and the “untreated” sub-populations:
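$$S = E[Y \mid T=1] - E[Y \mid T=0]$$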
For notational convenience, we omit the index i because we are now referring to sub-populations and not to an individual.
Because the sub-populations in the treated and untreated groups contain different individuals, S is not necessarily a measure of causation, in contrast to ACE.
We must check whether there are any conditions under which the measure of association, S, equals the measure of causation, ACE. As a matter of fact, this would be the case if the sub-populations were comparable with respect to all confounders, i.e., the factors that could also influence the outcome.
This means that the potential outcomes, Y₁ and Y₀, must jointly be independent of the treatment assignment, T. This condition of ignorability holds in an ideal experiment. Unfortunately, this condition is very rarely met in observational studies. However, conditional ignorability may hold, which refers to ignorability within subgroups of the domain defined by the values of X (note that X can be a vector).
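Formally:

$$(Y_1, Y_0) \perp\!\!\!\perp T \mid X$$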
In words, conditional on variables X, Y₁ and Y₀ are jointly independent of T, the assignment mechanism. If conditional ignorability holds, we can utilize the estimator, S|X, to estimate the average causal effect, ACE|X.
How can we select the correct set of variables among all variables in a system? How do we know whether such variables are observed or even exist in a domain? This is what makes the concept of ignorability highly problematic in practice, as Pearl (2009) points out in the passage quoted earlier in this chapter.
It is self-evident that causal arcs have implications in terms of causation. However, as we pointed out earlier in this chapter (see Structures Within a DAG), there are also implications regarding the association of variables. This will perhaps become clearer as we introduce the concepts of “causal path” and “non-causal path.”
Non-Causal Path: X ← Z → Y
Causal Path: X → Y
All non-causal paths between treatment and outcome are “blocked” (non-causal relationships prevented).
All causal paths from treatment to outcome remain “open” (causal relationships preserved).
First, we look at the non-causal path in our CDAG: X ← Z → Y. This implies that there is an indirect association between X and Y via Z that has to be blocked by adjusting for Z.
Next is the causal path in our CDAG: X → Y. It consists of a single arc from X to Y, which is open by default and cannot be blocked.
Readers may be familiar with the expression “controlling for confounders.” What is important to bear in mind is that not all covariates in a system are Confounders! Recall Judea Pearl’s warning about ignorability and the risk of treating every covariate as a Confounder.
“Impact assessment, simply defined, is the process of identifying the future consequences of a current or proposed action.”
“Policy assessment seeks to inform decision-makers by predicting and evaluating the potential impacts of policy options.”
Let us summarize what we have so far: First, we have observational data from our domain. Second, we developed a theory about the DGP, i.e., the causal relationships in the domain. Both together will serve as the basis for estimating the causal effect. Before we do that, we should contemplate a very literal interpretation of these causal relationships.
If this causal graph is a correct representation of how the domain works, then every relationship between a pair of variables holds independently. Thus, the causal graph represents autonomous relationships between parent nodes and child nodes. It is as if each node were “listening for instructions” from its parents and only from its parents: the child node’s values are solely determined by the value of its parents, not of any other nodes in the system. Also, these relationships remain invariant regardless of any values that other nodes take on.
Let us now consider an outside intervention on X. Thus, rather than "listening" to its parent Z, X is now entirely determined by an external force and set to specific values, e.g., X=1 or X=0. This external intervention breaks the natural relationship between X and Z. Thus, Z no longer influences X. However, Z → Y and X → Y remain unaffected, and the original “natural” values of Z are not affected either.
What is the significance of all this? The idea is that intervening on X is like trying out, or simulating, what would happen if treatment were to be applied universally to the entire population — or withheld universally. Isn’t this the causal effect we are interested in? In other words, computing the causal effect is like simulating outside interventions on the treatment variable X.
How does this help us? By simulating an intervention, we “mutilate” the graph. This new graph looks as if we had severed the arcs going into the treatment variable X. This operation is what Judea Pearl has rather colorfully named “graph surgery” or “graph mutilation.”
Applying Graph Surgery allows us to transform a causal graph that represents the joint probability distribution P of observational data, i.e., the pre-intervention distribution, into a new mutilated graph that represents the joint probability distribution Pm of the same variables under a simulated intervention, i.e., the post-intervention distribution:

$$P(Y=y \mid do(X=x)) = P_m(Y=y \mid X=x)$$

where do(X = x) denotes setting the treatment variable X to the value x by outside intervention.
Returning to the original version of the CDAG, without the hidden variable, we are now ready to proceed with the estimation. However, this CDAG is only a qualitative representation of our theory about the DGP. We now need to consider this graph as a model representing the joint probability distribution of our three variables P(X, Y, Z).
We do not yet need to determine what this probability function is; we simply need to consider this graph as a non-parametric probability function linking X, Y, and Z. This will help us understand what it means to adjust for Z to estimate the causal effect.
How can we translate the abstract concept of Graph Surgery into something that can compute actual numerical values? In fact, we can work directly with graphs — in the form of Bayesian networks — and use BayesiaLab to perform Graph Surgery and simulate interventions.
However, before we illustrate that in the next section of this chapter, we want to formally conclude the line of reasoning that connects the pre-intervention distribution P to the post-intervention distribution Pm and introduce the Adjustment Formula. We paraphrase Pearl, Glymour, and Jewell (2016), p. 56f. to develop this formula.
In our example, we can easily estimate the pre-intervention distribution P from the available data, but we need the post-intervention distribution Pm to calculate the causal effect. The key lies in recognizing that Pm shares two essential properties with P.
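First, the conditional probability distribution of Y, given X and Z, remains invariant, because the process that determines how Y responds to its direct causes is unaffected by the removal of the arc Z → X:

$$P_m(Y=y \mid X=x, Z=z) = P(Y=y \mid X=x, Z=z)$$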
Furthermore, X and Z are marginally independent in the mutilated graph. This means that the conditional probability distribution of Z given X in the mutilated graph is the same as the marginal probability distribution of Z in the pre-intervention graph:
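$$P_m(Z=z \mid X=x) = P_m(Z=z) = P(Z=z)$$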
Since the Adjustment Criterion is satisfied in the mutilated graph, we have the following:
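$$P(Y=y \mid do(X=x)) = P_m(Y=y \mid X=x)$$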
By conditioning on Z and summing over all values z, we obtain:
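$$P(Y=y \mid do(X=x)) = \sum_z P_m(Y=y \mid X=x, Z=z)\, P_m(Z=z \mid X=x)$$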
Furthermore, X and Z are independent in the mutilated graph:
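$$P(Y=y \mid do(X=x)) = \sum_z P_m(Y=y \mid X=x, Z=z)\, P_m(Z=z)$$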
Using the two invariance equations above, we obtain what is known as the Adjustment Formula. It expresses the post-intervention distribution exclusively in terms of the pre-intervention distribution:
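$$P(Y=y \mid do(X=x)) = \sum_z P(Y=y \mid X=x, Z=z)\, P(Z=z)$$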
The Adjustment Formula computes the association between X and Y for each value z (or strata of z∈Z) and then produces a weighted average. On this basis, we can now estimate the Average Causal Effect (ACE):
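$$ACE = P(Y=y \mid do(X=1)) - P(Y=y \mid do(X=0))$$

where each term is computed with the Adjustment Formula.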
We know that by performing a randomized experiment, we obtain an unbiased estimate of the causal effect of the treatment. More specifically, through randomization, we randomly split the patient population into two sub-populations, force the first group to receive the treatment, and withhold the treatment from the second group. Through the random assignment of the treatment, we ensure that there is no association between Z and X. Also, all other properties remain unaffected by the randomization of treatment, including the distribution of Z, the relationship between Z and Y, and the relationship between X and Y.
As a result, graph surgery can be seen as a “randomization after the fact.” However, we need to realize that performing graph surgery can only achieve quasi-randomization with regard to observed and known confounders, in our case Z. A randomized experiment, however, can make treatment independent of all other confounders, observed, unobserved, and unknown. Thus, randomized experiments remain the gold standard for establishing causal effects.
All our efforts in estimating causal effects through adjustment or graph surgery are merely an attempt to mimic the properties of a randomized experiment. Unfortunately, we can never measure how close we are to achieving this goal. We can only be disciplined with our assumptions and make a causal claim based on that.
Simpson’s Paradox Resolved
Returning to Simpson’s Paradox, the equation
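$$P(Y=y \mid do(X=x)) = \sum_z P(Y=y \mid X=x, Z=z)\, P(Z=z)$$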
gives us the answer to our question of whether we need to look at the aggregate data table or the gender-specific data table for determining the true causal effect of treatment on the outcome: “Conditioning on Z and summing over all values z” means that we need to utilize the gender-specific table. More specifically, we need to compute the association between X and Y for each value of Z, i.e., each stratum of z ∈ Z, and then calculate the weighted average. This estimation method is also known as stratification.
Aggregate Table
Gender-Specific Table
The ACE turns out to be negative, i.e., it has the opposite sign of what we would have inferred by merely looking naively at the association between treatment and outcome. This illustrates that a bias in the estimation of an effect can be more than just a nuisance for the analyst. Bias can reverse the sign of the effect! In our example, the treatment under study would kill people instead of healing them. The good news is that we have a theory stipulating that gender is a confounder, and this variable is observed. If it were not recorded in our dataset (hidden variable), we would not be able to compute the causal effect of treatment. We can also imagine situations where we do not know that confounders exist and, therefore, do not measure them. This can lead to substantially wrong estimations of causal effects and lead to policies with catastrophic consequences.
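To make the stratification arithmetic concrete, the following minimal Python sketch (illustrative only, independent of BayesiaLab) reproduces the Average Causal Effect from the gender-specific recovery rates and the 50/50 gender distribution of this example:

```python
# Gender-specific recovery rates P(Outcome=Recovered | Treatment, Gender),
# taken from the gender-specific table of this example.
p_recover = {
    ("Male", True): 0.60, ("Male", False): 0.70,
    ("Female", True): 0.20, ("Female", False): 0.30,
}
p_gender = {"Male": 0.5, "Female": 0.5}  # the disease equally affects men and women

def p_do(treated: bool) -> float:
    """Adjustment Formula: average over the strata of Gender, weighted by P(Gender=z)."""
    return sum(p_recover[(z, treated)] * pz for z, pz in p_gender.items())

ace = p_do(True) - p_do(False)
print(f"P(Recovered | do(Treatment=True))  = {p_do(True):.2f}")   # 0.40
print(f"P(Recovered | do(Treatment=False)) = {p_do(False):.2f}")  # 0.50
print(f"ACE = {ace:.2f}")                                         # -0.10
```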
We now introduce Causal Effect Estimation by means of Likelihood Matching. Given the simplicity of Simpson’s Paradox example, the need for yet another estimation method may not be immediately apparent. The advantages of Likelihood Matching will only become clear as we study a more complex domain, such as the marketing mix example of the next chapter. However, the current example makes it easy to explain Likelihood Matching.
In statistics, matching refers to the technique that makes confounder distributions of the treated and untreated sub-populations as similar as possible to each other. As such, applying matching to variables qualifies as adjustment, and we can use it with the objective of keeping causal paths open and blocking non-causal paths. In the Simpson’s Paradox example, matching is fairly simple as we only need to match a single binary variable, i.e., Gender. That will meet our requirement for adjustment and block the only non-causal path in our model.
As our terminology of “blocking paths by matching” may not be understood outside the world of graphical models and Bayesian networks, we can offer a more intuitive interpretation of matching, which our example can illustrate very well.
Because of the self-selection phenomenon in our population, the Gender distribution is a function of Treatment. In other words, of those who are treated, 75% turn out to be male. Among untreated patients, only 25% are male.
Given that we know that Gender has a causal effect on Outcome (men are more likely to recover than women, with or without treatment), and given this difference in gender composition, comparing the outcomes between the treated and non-treated patients is clearly not an apples-to-apples comparison.
We can propose a common-sense solution to this predicament: how about searching for subsets of patients within the treated and non-treated groups that have an identical gender mix, as illustrated below?
In statistical matching, this process typically involves the selection of units in such a way that comparable groups are created:
In practice, this can be more challenging as the observed units typically have more than just a single binary attribute. So, the idea of matching has to be extended to higher dimensions, and the observed units need to be matched on a range of attributes, including both continuous and discrete variables.
However, matching observations exactly with regard to all covariates is rarely feasible. For instance, patients are characterized by dozens or even hundreds of attributes and comorbidities. Finding two matching patients would be difficult enough, but finding populations with many matching pairs of patients would presumably be impossible.
So, how does randomization do it? Actually, randomization does not guarantee identical populations; rather, it ensures that the distributions of confounders are balanced between the populations under study. To pursue balanced confounders instead of perfect matches, numerous similarity measures have been proposed for matching.
However, matching on the Propensity Score requires the score itself to be estimated first. Conventional models only represent the outcome variable as a function of the treatment variable and the confounders, i.e., P(Y|X, Z). If we need to understand the relationship between the treatment and the confounders, i.e., P(X|Z), we have to estimate this separately. This usually means fitting a function, such as a regression, that models the propensity score, PS=P(X|Z). For binary treatment variables, logistic regression is a common choice for the functional form.
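As an illustration of this two-step approach, here is a minimal sketch with synthetic data; the 75%/25% assignment probabilities mirror our example, but the variable layout and code are hypothetical and not part of BayesiaLab's workflow:

```python
# Step 1: estimate the propensity score PS = P(X=1 | Z) with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1200)                  # confounder, e.g., Gender (1 = male)
x = rng.binomial(1, np.where(z == 1, 0.75, 0.25))  # treatment assignment depends on Z

ps_model = LogisticRegression().fit(z.reshape(-1, 1), x)
propensity = ps_model.predict_proba(z.reshape(-1, 1))[:, 1]  # estimated P(X=1 | Z)

# Step 2: treated and untreated units with similar propensity scores
# would then be paired (matched) before comparing their outcomes.
```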
With BayesiaLab's Likelihood Matching, we do not directly match the underlying observations. Rather, we match the distributions of the relevant nodes on the basis of the joint probability distribution represented by the Bayesian network. In our example, we need to ensure that the gender compositions of the untreated and treated groups are the same, i.e., a 50/50 gender mix. This theoretically ideal condition is shown in the Monitors below.
However, the actual distributions reveal the inequality of gender distributions for the untreated and the treated.
How can we overcome this? We simply use Probabilistic Evidence to set a 50/50 gender mix, i.e., the marginal distribution of Gender, whenever we set evidence on Treatment. We can also right-click on the Monitor for Gender and select Fix Probabilities from the Contextual Menu. This will automatically use Probabilistic Evidence to restore the current marginal distribution after each piece of evidence is set on Treatment, or any other node.
With Fix Probabilities applied, the distribution of Gender in its Monitor is highlighted in purple.
Other than colors, nothing appears to have changed. However, once we set the values Treatment=False and Treatment=True, we see that the distribution for Gender does not change. We can also observe that the corresponding posterior probabilities for Outcome are the same as those obtained with Graph Surgery.
With that, we can once again calculate the Average Causal Effect:
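$$ACE = P(\text{Recovered} \mid do(\text{Treatment}=\text{True})) - P(\text{Recovered} \mid do(\text{Treatment}=\text{False})) = 0.4 - 0.5 = -0.1$$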
Note that the mutilated graph achieves what was stipulated by the Adjustment Criterion: the non-causal path X ← Z → Y does not exist any longer, as required by the Adjustment Criterion, and, given the autonomy of the other arcs, the causal path X → Y remains unblocked.
This new graph can tell us what happens to Y when we intervene and set X to a specific value, i.e., P(Y = y | do(X = x)). Note the do-operator! With this mutilated graph, we can compute the quantity of interest, the Average Causal Effect (ACE):
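$$ACE = P(Y=y \mid do(X=1)) - P(Y=y \mid do(X=0))$$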
First, the marginal distribution of Z remains invariant under the intervention because the process of determining Z is unaffected by removing the arc Z → X. In our example, this means that the share of men and women must remain the same before and after the intervention:
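$$P_m(Z=z) = P(Z=z), \quad \text{e.g.,}\ P_m(\text{Male}) = P(\text{Male}) = 0.5$$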
Secondly, the conditional probability distribution of Y, given X and Z, remains invariant under the intervention because the process that determines how Y responds to X and Z stays the same, regardless of whether X changes naturally or through external intervention. We can state this formally as follows:
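$$P_m(Y=y \mid X=x, Z=z) = P(Y=y \mid X=x, Z=z)$$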
The concept of Propensity Score Matching has become particularly popular (Rosenbaum and Rubin, 1983). Instead of matching individuals on their high-dimensional attributes, we match observations by their probability of receiving treatment, i.e., P(X=1|Z), which is known as the Propensity Score. Rosenbaum and Rubin have shown that matching on the propensity score achieves balance of the covariate distributions.
For now, Likelihood Matching applied to Simpson's Paradox may not seem like a breakthrough method. Conceptually and practically, it appears to be another form of adjustment. The fundamental advantages of Likelihood Matching will become clear in the context of the next chapter.
We now understand that Graph Surgery and Adjustment are equivalent. However, with Bayesian networks, we can go beyond the metaphor and—quite literally—perform graph surgery. In this section, we create a Bayesian network to represent Simpson’s Paradox example and then perform graph surgery to estimate the causal effect.
We have already defined a causal graph earlier when we encoded our causal assumptions regarding this domain. We can reuse this causal understanding for building a causal Bayesian network in BayesiaLab.
The Genetic Grid Layout algorithms are particularly useful for causal networks. We can, therefore, define one of these two algorithms as the one associated with the shortcut P via Main Menu > Window > Preferences > Automatic Layout > Layout Algorithm Associated with Shortcut.
Parameter Estimation
Previously, we acquired the data needed for Parameter Estimation via the Data Import Wizard. Now we will use the Associate Data Wizard for the same purpose. Whereas the Data Import Wizard generates new nodes from columns in a database, the Associate Data Wizard links columns of data with existing nodes. This way, we can “fill” our qualitative network with data and then perform Parameter Estimation to generate the quantitative part of the network. We now show the corresponding steps in detail.
We start the Associate Data Wizard from the main menu: Data > Associate Data Source > Text File
Given that the Associate Data Wizard mirrors the Data Import Wizard in most of its options, we do not describe them again here. We merely show the screens for reference as we click Next to progress through the wizard.
The last step shows how the variables in the dataset will be associated with the nodes of the network. If the column names in the dataset perfectly match the existing node names, BayesiaLab automatically creates an association. However, this is not the case in our example. Therefore, we have to manually define the association by iteratively selecting each Dataset Variable and Network Node, and then clicking on the right arrow.
Upon clicking on the right arrow, BayesiaLab brings up a screen for defining the association between the values used in the dataset and the states of the node. Again, the state names of our nodes do not correspond exactly to the values used in the dataset. So we have to manually define the association by iteratively selecting each Dataset Value and Network State and then clicking on the right arrow.
Once this is done for all three variables, the Associate Data Wizard displays how the columns in the dataset are associated with the nodes of the network.
Upon clicking Finish, we are prompted whether we want to view the Associate Report.
The Database icon in the lower right-hand corner of the main window indicates that our network now has a database associated with its structure. We now use this data to estimate the parameters of the network: Learning > Parameter Estimation.
Once the parameters are estimated, there are no longer any warning symbols tagged onto the nodes.
We now have a fully specified Bayesian network. By opening the Node Editor of Outcome, for instance, we see that the CPT is indeed filled with probabilities.
Upon clicking on the Occurrences tab, we can see the counts that were used by the Maximum Likelihood Estimation to derive these probabilities.
Recall that distinguishing between causal and non-causal paths is crucial for the application of the Adjustment Criterion. BayesiaLab can help us review the paths that are present in the graph. Given that we already understand the paths, showing the formal path analysis with BayesiaLab is merely for reference.
Once we define Outcome as Target Node and switch into the Validation Mode (F5), we can examine all possible paths to the Target Node in this network. We select Treatment and then select Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Then, BayesiaLab displays a pop-up window with the Influence Paths report. Selecting any of the listed paths shows the corresponding arcs in the Graph Panel. Causal paths are shown in blue; non-causal paths are pink.
It is easy to see that this automated path analysis can be particularly helpful with more complex networks. In any case, the result confirms our previous manual path analysis, which means that we need to adjust for Gender to block the non-causal path between Treatment and Outcome.
For instance, the screenshot below shows the prior distributions (left) and the posterior distributions (right) given the observation Treatment = True.
As expected, the target variable Outcome changes upon setting this evidence. However, Gender changes as well, even though we know that the treatment cannot possibly change the gender of a patient. What we observe here is a manifestation of the non-causal path: Treatment ← Gender → Outcome. These probabilities are obviously perfectly correct from the observational point of view: in the observed population of 1,200 individuals, three times as many men as women took the treatment.
For causal inference, however, we need a network that computes all probabilities under an intervention scenario. As we learned, Graph Surgery transforms the original causal network representing the pre-intervention distribution into a new, mutilated network that yields the post-intervention distribution.
In BayesiaLab, Graph Surgery is automated. After right-clicking the Monitor of the node Treatment, we select Intervention from the Contextual Menu.
The activation of the Intervention Mode for this node is highlighted by the blue background of the Treatment's Monitor and the arrow symbols (→) in the Treatment's badge.
By double-clicking a state of Treatment, we now set an Intervention and no longer an Observation.
By intervening on Treatment, BayesiaLab applies Graph Surgery and removes the inbound arc into Treatment.
Recall the formula that computes the Average Causal Effect (ACE):
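$$ACE = P(\text{Recovered} \mid do(\text{Treatment}=\text{True})) - P(\text{Recovered} \mid do(\text{Treatment}=\text{False}))$$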
We can take it directly as a set of instructions and compare the probability of Outcome=Recovered under do(Treatment=False) and do(Treatment=True). Note that the distribution of Gender remains the same pre- and post-intervention.
Simpson’s Paradox illustrates the implications of falsely assuming ignorability. This leads us to abandon the idea of ignorability and, along with it, the potential outcomes framework, replacing them with a formal identification and estimation process that relies on graphical models.
This is an important exercise as it illustrates how an incorrect interpretation of an association can produce bias. The word “bias” may not necessarily strike fear into our hearts. In our common understanding, “bias” implies “inclination” and “tendency,” and it is perhaps not a particularly forceful expression. Hence, we may not be overly troubled by a warning about bias. However, Simpson’s Paradox shows how bias can lead to catastrophically wrong estimates.
A hypothetical disease equally affects men and women. An observational study finds that a certain treatment is linked to an increase in the recovery rate among all treated patients from 40% to 50%. Based on the study, this new treatment is widely recognized as beneficial and subsequently promoted as a new therapy.
We can imagine a headline along the lines of “New Therapy Increases Recovery Rate by 10%.” However, when examining patient records by gender, the recovery rate for male patients—upon treatment—decreases from 70% to 60%; for female patients, the recovery rate declines from 30% to 20%. Men are, therefore, more likely to recover than women, with or without treatment.
So, is this new treatment effective overall or not? This puzzle can be resolved by realizing that, in this observed population, there was an unequal application of the treatment to men and women, i.e., some type of self-selection occurred. More specifically, 75% of male patients and only 25% of female patients received the treatment. Although the reason for this imbalance is irrelevant for inference, one could imagine that the side effects of this treatment are much more severe for females, who thus seek alternative therapies. As a result, there is a greater share of men among treated patients. Given that men have a better a priori recovery prospect with this type of disease, the recovery rate of all treated patients increases. So, what is the true causal effect of this treatment?
Our particular manifestation of Simpson’s Paradox is not very far-fetched, but it is still fictional. Therefore, we must rely on synthetic data to make this problem domain tangible for our study efforts. We generate 1,200 observations by sampling from the joint probability distribution of the original Data-Generating Process (DGP). Needless to say, for this dataset to be a suitable example of non-experimental observations, as we would encounter them under real-world conditions, we treat the true DGP as unknown; it remains merely an assumption.
Our synthetic dataset consists of three variables with two discrete states each:
X (Treatment): Yes (1)/No (0)
Y (Outcome): Recovered (1)/Not Recovered (0)
Z (Gender): Male (1)/Female (0)
The following table shows a preview of the first ten rows of the dataset:
As illustrated earlier in this book, we manually create the nodes and draw the arcs on BayesiaLab’s Graph Panel. We choose to use long names for the nodes instead of X, Y, and Z. Letters were very convenient for formulas, but long names increase the readability of Bayesian networks. To further help with interpretation, we also associate images with each node and display them as Badges. Then, we use View > Layout > Genetic Grid Layout > Top-Down Repartition to obtain a layout that takes into account the direction of the arcs and defines layers accordingly.
The yellow warning symbols remind us that the probability tables associated with the nodes have yet to be defined. At this point, we could set the parameters based on our knowledge of all the probabilities in this domain. Instead, we utilize the available data and use BayesiaLab’s Parameter Estimation to establish the quantitative part of this network via Maximum Likelihood Estimation. We have been using Parameter Estimation extensively in this book, either implicitly or explicitly, for instance, in the context of structural learning and missing values estimation.
This prompts us to select the text file containing our observational data. Upon selecting the file, BayesiaLab brings up the first screen of the Associate Data Wizard.
Before proceeding to the effect estimation, we bring up the Monitors of all three nodes and compare the probabilities reported by the network with the original data tables, which gave rise to the paradox.
Thus, we obtain an Average Causal Effect of −0.1, which agrees with our earlier computations.
We will use an example that appears trivial on the surface but which has produced countless instances of false inference throughout the history of science. Due to its counterintuitive nature, this example has become widely known as Simpson’s Paradox (Simpson, 1951).
                 Patient Recovered
Treatment        Yes        No
Yes              50%        50%
No               40%        60%
                            Patient Recovered
Gender      Treatment       Yes        No
Male        Yes             60%        40%
            No              70%        30%
Female      Yes             20%        80%
            No              30%        70%
Now that we have seen how to estimate the Average Causal Effect by manually interacting with BayesiaLab's Monitors, with both Graph Surgery and Likelihood Matching, we will use BayesiaLab's Direct and Total Effects functions to compute causal effects automatically for a set of variables. But first, we present a slightly more complex version of Simpson's Paradox to illustrate these features (see Example: Simpson's Paradox).
Our updated story contains four additional dimensions:
Treatment Availability: the treatment is not always available;
Side Effects: the treatment may produce severe side effects;
Efficacy: some patients do not respond to the drug;
Litigation: the families of patients who died may sue the pharmaceutical company that had provided the treatment.
The manually designed CDAG shown below describes this new domain qualitatively:
Next, we describe the quantitative part of the domain. First, we state that the probability of Treatment Availability is 75%.
We also assume that the treatment may have Side Effects, which are much more frequent for females. The following conditional probability table quantifies this direct causal dependency on Gender:
Patients decide whether or not to take the treatment based on two criteria, Treatment Availability and Side Effects. The dependencies are described in the following table. It states that if the treatment is unavailable, patients cannot have the treatment, which is deterministic (and obvious). However, if the treatment is available, those patients who do not have any risk of experiencing side effects will always choose the treatment, while those at risk will be unlikely to submit to the treatment:
Furthermore, the Efficacy of the treatment depends on Drug Administration plus some hidden factors that render the treatment ineffective in 20% of patients:
The Target Node, Outcome, is defined by Gender and Efficacy. In this context, "not recovered" means that the patient died — hence the grim illustration attached to the icon.
Finally, half of the families of those patients who took the treatment and died are pursuing litigation. More specifically, these families are suing the pharmaceutical company that provided the treatment.
We now list the paths between each variable and the target variable Outcome by using Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Recall the Adjustment Criterion, which stipulates that we must keep all of a variable's causal paths to the target variable open and simultaneously block all its non-causal paths for estimating its causal effect.
To illustrate BayesiaLab's Total and Direct Effects functions with Graph Surgery, we set all nodes but Outcome to Intervention Mode.
Notice the arrow symbols (→) in the badges of the nodes that are set to Intervention Mode.
Before using BayesiaLab's automated tools for computing causal effects, we manually estimate the causal effect of our main variable of interest, Drug Administration, by using the Monitors.
Setting a piece of Evidence in Intervention Mode simulates an intervention on Drug Administration and mutilates the graph, as shown below, which meets the Adjustment Criterion by blocking the non-causal path (cf. Path Analysis, #6).
The Average Causal Effect of Drug Administration on Outcome, mediated by Efficacy, is -0.08.
We have seen in Chapter 8 that BayesiaLab estimates Total Effects as the derivatives of Total Effect Curves. These curves are based on the Posterior Mean Values of the Target Node given Mean Values from the interval of the variable under study. While the variables are in Intervention Mode, the Posterior Mean Values are computed based on the mutilated graph.
We can plot these curves with Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Total Effects.
For generating this graph, we can set a number of options:
The x-axis, Variable Delta Means, represents the difference between the Mean Value generated for the analysis (here, Hard Evidence/Intervention on the states of the variable under study) and its Prior Mean Value. The y-axis represents the difference between the Posterior Mean Value of Outcome and its Prior Mean Value.
If we do not specifically associate numerical values with symbolic states, BayesiaLab uses the state index. In our example,
False is 0, and True is 1.
Male is 0, and Female is 1.
Not Recovered is 0, and Recovered is 1.
We see that Side Effects is the only variable with a positive causal effect. We also notice that Litigation has no causal effect.
Given that all variables are binary, the corresponding curves are linear. Therefore, the curves' derivatives will be perfect summaries of the Total Effect Curves, which we obtain via Main Menu > Analysis > Report > Target > Total Effects on Target:
The Total Effect is the derivative computed at (0, 0) in the previous Target Mean Analysis graph, i.e., the slope of the curve. The Standardized Total Effect is the Total Effect times the ratio between the standard deviation of the variable and the standard deviation of the Target Node.
The arrow symbols (→) in the results table indicate that Intervention Mode was active on all nodes, triggering Graph Surgery upon each observation/intervention during the estimation of the effects.
Gender is the variable with the strongest Total Effect. It is negative because of the index values of the states. Females (1) are recovering at a lower rate than Males (0).
Note that there are two paths from Gender to Outcome (paths #1 and #2 illustrated in the previous section), and they are both causal. Gender is indeed a root node, i.e., it has no parents, meaning the Adjustment Criterion is fulfilled by default.
The Total Effect measures the effects of these two causal paths: the direct path (#1) and the indirect path (#2), represented by the dashed blue arcs below.
Now suppose we are interested in estimating the effect of the direct paths only. This would require blocking not only the non-causal paths but also the indirect causal paths. This is the role of BayesiaLab's Direct Effect functions. The only difference between Direct and Total Effect functions is that, by default, all other nodes are held constant during the estimation of the variable's Direct Effect.
We generate the Direct Effect Curves with Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects, using the same parameters as those previously used for Total Effects:
Given that all nodes are in Intervention Mode, the only variables with Direct Effects are the Parents of Outcome. Indeed, intervening on all nodes to hold them constant triggers Graph Surgery and generates the mutilated graph below:
The function Main Menu > Analysis > Report > Target > Direct Effects on Target allows us to compute the Direct Effects, the single-point estimates of these curves:
The Direct Effect is the slope of the Direct Effect Curve between the endpoints of the variable interval.
The Standardized Direct Effect is the Direct Effect times the ratio between the standard deviation of the variables and the standard deviation of the Target Node.
The Elasticity is the Direct Effect times the ratio between the range of the variable and the range of the Target Node.
The Contribution is the Standardized Direct Effect divided by the total sum of Standardized Direct Effects.
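Expressed compactly, with σ denoting standard deviations and the sum running over all analyzed variables, these definitions read:

$$\begin{aligned} SDE_X &= DE_X \times \frac{\sigma_X}{\sigma_Y} \\ Elasticity_X &= DE_X \times \frac{range(X)}{range(Y)} \\ Contribution_X &= \frac{SDE_X}{\sum_{X'} SDE_{X'}} \end{aligned}$$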
Non-Confounders
By default, BayesiaLab's Direct Effect functions measure a variable's effect by holding all other variables constant. However, we can use the predefined class Non_Confounder to define the nodes we do not want to control.
In our example, the main variable of interest, Drug Administration, has no direct effect. The post-treatment variable Efficacy mediates its causal effect, and the Direct Effect analysis blocks that path. We must therefore use the predefined class Non_Confounder (Efficacy's Contextual Menu > Properties > Classes > Add > Predefined Class > Non_Confounders) to prevent BayesiaLab from holding Efficacy constant and to allow the estimation of the mediated causal effect. The new mutilated graph below is then used for estimating the Direct Effects:
Drug Administration's Direct Effect now equals the Average Causal Effect we manually computed with the Monitors. Note also that we no longer analyze the effect of the Non-Confounder Efficacy.
Now suppose we want to use Likelihood Matching instead of Graph Surgery. We first set all nodes back to Observation Mode via the Monitors' Contextual Menus.
The nodes of interest are the nodes for which we want to estimate the causal effect on the Target Node. We call them Treatments or Drivers.
In the previous section, we assumed that all nodes were of interest and set them in Intervention Mode. With Likelihood Matching, the workflow is less straightforward. For each Driver, we need to analyze the paths to the Target (cf. Path Analysis) to define the set of nodes that must be controlled for to block the non-causal paths and keep the causal paths open. Note that these sets of nodes may differ for each Driver, requiring us to perform multiple Total Effect analyses to avoid conflicting adjustments.
The first step, then, is to define our nodes of interest. In the Augmented Simpson's Paradox, the main variable of interest is obviously Drug Administration, but for illustrative purposes, let's consider Gender as well.
We have seen in the Path Analysis section that there are two paths between Gender and Outcome, both causal (#1 and #2). Thus, there is no variable to adjust for to estimate the Total Effect.
The Path Analysis indicates that there are also two paths between Drug Administration and Outcome, one causal (#7) and one non-causal (#6): Drug Administration ← Side Effects ← Gender → Outcome. So we need to adjust for Side Effects, or for Gender, to block this path. This conflicts with the analysis of Gender's effect: we cannot estimate the Total Effects of Gender and Drug Administration in the same analysis with Likelihood Matching!
So let's start with Gender. We select the node, go to Main Menu > Analysis > Report > Target > Total Effects on Target, and confirm that we want to perform the analysis on the selected node only:
For Drug Administration, let's suppose we choose to adjust for Side Effects. We right-click on its associated Monitor and select Fix Probabilities from its Contextual Menu.
Then, we select the node Drug Administration, go to Analysis > Report > Target > Total Effects on Target, and confirm that we want to perform the analysis on the selected node only:
Now let's look at the workflow for estimating Direct Effects with Likelihood Matching, i.e., how to assess the effects of the direct paths only. Remember that, by default, BayesiaLab's Direct Effect functions measure a variable's effect by holding all variables constant except those associated with the predefined class Non_Confounder.
Holding a variable constant with Graph Surgery implies the deletion of its incoming arcs. Thus, there is no risk of biasing the estimation of Direct Effects. With Likelihood Matching, this risk exists because we set evidence on the variable to adjust for it. Indeed, controlling for descendants of the Target Node (e.g., Litigation) automatically biases the estimate.
While we previously added Efficacy to the Non_Confounder class to let it mediate the effect of Drug Administration, we must also add Litigation to prevent its adjustment.
Notice that there is no conflict in this analysis:
Gender
Controlling for Side Effects and Drug Administration cuts the indirect causal path (#2);
Controlling for Treatment Availability has no impact;
Not controlling for Efficacy has no effect, as path #2 is already blocked;
Not controlling for Litigation prevents biasing the estimate of the effect;
Drug Administration
Controlling for Side Effects and Gender cuts the non-causal path (#6);
Controlling for Treatment Availability has no impact;
Not controlling for Efficacy lets the information flow from Drug Administration to Outcome;
Not controlling for Litigation prevents biasing the estimate of the effect.
We can, therefore, select our two nodes of interest, use Analysis > Report > Target > Direct Effects on Target, and confirm that we want to perform the analysis on the selected nodes only:
Before concluding this chapter, let's summarize the main characteristics of Graph Surgery and Likelihood Matching:
Graph Surgery
requires a fully specified Causal Bayesian Network;
uses the mutilated Causal Bayesian Network for causal inference;
Likelihood Matching
requires a causal analysis of the domain to define the variables that need to be adjusted for in order to block the non-causal paths and keep the causal paths open;
uses the Bayesian network to carry out probabilistic inference with the adjusted variables. Note that this network does not have to be causal! It just needs to represent the Joint Probability Distribution of the domain.
This last point is especially important. It is indeed sometimes challenging, if not impossible, to design the fully specified Causal Bayesian Network. However, BayesiaLab offers a wide range of machine-learning algorithms that we can use to induce a network that represents the Joint Probability Distribution. Hence, we only need to have a limited amount of causal knowledge to define the variables that have to be adjusted for.
For example, suppose we machine-learned the network below with Main Menu > Learning > Supervised Learning > Augmented Naive Bayes:
The main architecture of the network is Naïve, i.e., the Target Node is the parent of all nodes. Therefore, this Bayesian network is clearly not causal. If we were to use Graph Surgery, we would not find any total or direct effects (see the corresponding mutilated graph below when estimating the Direct Effects with Efficacy and Litigation defined as Non_Confounder).
However, Likelihood Matching returns the correct estimations for the Total Effects with two separate analyses.
One analysis for Gender, without adjusting for any variables:
And one analysis for Drug Administration, holding Side Effects constant:
As for Direct Effects, the analysis can be carried out for both variables with the current definition of Non_Confounders.
This chapter highlights how much effort is required to derive causal effect estimates from observational data. Simpson’s Paradox illustrates how much can go wrong even in simple circumstances. Given such potentially serious consequences, it is a must for policy analysts to examine all aspects of causality formally. To paraphrase Judea Pearl, we must not leave causal considerations to the mercy of intuition and good judgment. Fortunately, causality has emerged from its pariah status in recent decades, which has allowed tremendous progress in theoretical research and practical tools: “[…] practical problems relying on causal information that long were regarded as either metaphysical or unmanageable can now be solved using elementary mathematics” (Pearl, 2000).
The causal paths are highlighted in blue, and the non-causal paths (i.e., paths with at least one "backward" arrow ←) are shown in pink: