We will now explore formal methodologies that can help us derive causal effects from observational data. These methodologies will ultimately allow us to answer the question raised by Simpson’s Paradox. The process of determining the size of a causal effect from observational data can be divided into two steps, namely identification and estimation.
Identification analysis is about determining whether or not a causal effect can be established from the observed data. This requires a formal causal model or at least partial knowledge of how the data was generated. In this chapter, all causal assumptions for identification are expressed explicitly in the form of a Directed Acyclic Graph (DAG) (Pearl 1995, 2009). This graph represents our complete causal understanding of the Data-Generating Process (DGP) for the system we are studying.
Where do we get such causal assumptions? We would like to say that advanced algorithms can generate causal assumptions from data. That is not the case, unfortunately. Causal assumptions do still require human expert knowledge or, more generally, theory. In practice, this means that we need to build (or draw) a causal graph of our domain. Then, we can examine this graph against formal criteria, which determine whether the effect is identifiable or not.
It is important to realize that the absence of causal assumptions cannot be compensated for by clever statistical techniques or by providing more data. So, recognizing that a causal effect is not identifiable brings the effect analysis to an abrupt halt.
But if the causal effect is identifiable, we can proceed to estimate the effect size. The same criteria that determine identifiability also tell us how to perform the effect estimation. With that, we can utilize the available observational data and estimate the causal effect. Depending on the complexity of the domain, the effect estimation can bring a new set of challenges. However, in the context of Simpson’s Paradox, the effect estimation will be very straightforward.
We need to understand some important properties before encoding our causal knowledge in a DAG. We learned in Chapter 2 that Bayesian networks use DAGs for the qualitative description of the Joint Probability Distribution.
In the causal context, however, the arcs in a DAG explicitly state causality instead of only representing direct probabilistic dependencies in a Bayesian network. We now designate a DAG with a causal semantic as a Causal DAG (CDAG) to highlight this distinction.
A DAG has three basic configurations in which nodes can be connected. Graphs of any size and complexity can be broken down into these basic graph structures. While these basic structures show direct dependencies/causes explicitly, there are more statements contained in them, albeit implicitly. In fact, we can read all marginal and conditional associations that exist between the nodes.
Why are we even interested in associations? Isn’t all this about understanding causal effects? It is essential to understand all associations in a system because, in non-experimental data, all we can do is observe associations, some of which represent non-causal relationships. Our objective is to identify causal effects from associations.
This DAG represents an indirect connection from A to B via C.
A Directed Arc represents a potential causal effect. The arc direction indicates the assumed causal direction, i.e., “A → C ” means “A causes C .”
A Missing Arc encodes the definitive absence of a direct causal effect, i.e., no arc between A and B means that neither variable directly causes the other. As such, a missing arc represents an assumption.
Implication for Causality
A has a potential causal effect on B, mediated by C.
Implication for Association
Marginally (or unconditionally), A and B are dependent. This means that without knowing the exact value of C, learning about A informs us about B and vice versa, i.e., the path between the nodes is unblocked, and information can flow in both directions.
Conditionally on C, i.e., by setting Hard Evidence on (or observing) C, A and B become independent. In other words, by “hard”-conditioning on C, we block the path from A to B and from B to A. Thus, A and B are conditionally independent, given C:
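$$A \perp\!\!\!\perp B \mid C$$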
Hard Evidence means that there is no uncertainty regarding the value of the observation or evidence. If uncertainty remains regarding the value of C, the path will not be entirely blocked, and an association will remain between A and B.
The second configuration has C as the common parent of A and B.
Implication for Causality
C is the common cause of both A and B.
Implication for Association
In terms of association, this structure is equivalent to the Indirect Connection. Thus, A and B are marginally dependent but conditionally independent given C (by setting Hard Evidence on C):
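$$A \perp\!\!\!\perp B \mid C$$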
The final structure has a common child C, with A and B being its parents. This structure is called a “V-Structure.” In this configuration, the common child C is also known as a “collider.”
Implication for Causality
A and B are the direct causes of C.
Implication for Association
Marginally (or unconditionally), A and B are independent, i.e., there is no information flow between A and B. Conditionally on C — with any kind of evidence — A and B become dependent. If we condition on the collider C, information can flow between A and B, i.e., conditioning on C opens the information flow between A and B:
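$$A \perp\!\!\!\perp B, \qquad A \not\perp\!\!\!\perp B \mid C$$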
Even introducing a minor change in the distribution of C, e.g., from no observation (“color unknown”) to a very vague observation (“it could be anything, but it is probably not purple”), opens the information flow.
For purposes of formal reasoning, this type of connection is of special significance. Conditioning on C facilitates inter-causal reasoning, often referred to as the ability to “explain away” the other cause, given that the common effect is observed (see Inter-Causal Reasoning in Chapter 4).
To begin the encoding of our causal knowledge in the form of a CDAG, we draw three nodes, which represent X (Treatment), Y (Outcome), and Z (Gender). For now, we are only using the qualitative part of the network, i.e., we are not considering probabilities.
The absence of further nodes means that we assume that there are no additional variables in the DGP, either observable or unobservable. Unfortunately, this is a very strong assumption that cannot be tested. We need a justification on purely theoretical grounds to make such an assumption.
In the next step, we must encode our causal assumptions regarding this domain. Given our background knowledge of this domain, we state that Z causes X and Y and that X causes Y.
This means that we believe that gender is a cause of taking the treatment and has a causal effect on the outcome, too. We also assume that the treatment has a potential causal effect on the outcome.
Having accepted these causal assumptions, we now wish to identify the causal effect of X on Y. The question is whether this is possible on the basis of this causal graph and the available observational data for these three variables. Before we can answer this question, we need to think about what this CDAG specifically implies. Recall the types of structures that can exist in a DAG (see Structures Within a DAG). As it turns out, we can find all three of the basic structures in this example:
Indirect Connection: Z causes Y via X
Common Parent: Z causes X and Y
Common Child: Z and X cause Y
Randomized experiments have always been the gold standard for establishing causal effects. For instance, in the drug approval process, controlled experiments are mandatory. Without first having established and quantified the treatment effect, and any associated side effects, no new drug could win approval from the Food and Drug Administration.
However, in many other domains, experiments are not feasible, be it for ethical, economic, or practical reasons. For example, it is clear that a government could not create two different tax regimes to evaluate their respective impact on economic growth. Neither would it be possible to experiment with two different levels of carbon emissions to measure a warming effect on the global climate.
“So, what does our existing data say?” would be an obvious question from policymakers, especially given today’s high expectations concerning Big Data. Indeed, in lieu of experiments, we can attempt to find instances in which the proposed policy already applies (by some assignment mechanism) and compare those to other instances in which the policy does not apply.
However, as we will see in this chapter, performing causal inference on the basis of observational data requires an extensive range of assumptions, which can only come from theory, i.e., domain-specific knowledge. Despite all the wonderful advances in analytics in recent years, data alone, even Big Data, cannot prove the existence of causal effects.
Today, we can openly discuss how to perform causal inference from observational data. For the better part of the 20th century, however, the prevailing opinion had been that speaking of causality without experiments is unscientific. Only towards the end of the century did this opposition slowly erode (Rubin 1974, Holland 1986), which subsequently led to numerous research efforts spanning philosophy, statistics, computer science, information theory, etc. The Potential Outcomes Framework has played an important role in this evolution of thought.
Although there is no question about the common-sense meaning of “cause and effect,” for formal analysis, we require a precise mathematical definition. In the fields of social science and biostatistics, the potential outcomes framework is a widely accepted formalism for studying causal effects (the potential outcomes framework is also known as the counterfactual model, the Rubin model, or the Neyman-Rubin model). Rubin (1974) defines "causal effect" as follows:
The individual-level causal effect (ICE) is defined as the difference between the individual’s two potential outcomes, i.e.,
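$$ICE(i) = Y_1(i) - Y_0(i)$$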
The question is, how can we move from what we can measure, i.e., the naive association, to the quantity of interest, i.e., the causal effect? Determining whether we can measure causation from association is known as identification analysis.
Remarkably, the conditions under which we can measure causal effects from observational data are very similar to those that justify causal inference in randomized experiments. A pure random selection of treated and untreated individuals does indeed remove any potential selection bias and leaves the confounding factor distributions identical in the sub-populations, thus allowing the estimation of the effect of the treatment alone. This condition is known as “ignorability,” which can be formally written as:
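$$(Y_1, Y_0) \perp\!\!\!\perp T$$

If ignorability only holds within subgroups defined by covariates X (conditional ignorability), the measure of association within those subgroups equals the measure of causation: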
$$\begin{aligned} ACE|X &= E[Y_1 \mid X] - E[Y_0 \mid X] \\ &= E[Y_1 \mid T=1, X] - E[Y_0 \mid T=0, X] \\ &= E[Y \mid T=1, X] - E[Y \mid T=0, X] \\ &= S|X \end{aligned}$$
The difficulty that most investigators experience in comprehending what “ignorability” means, and what judgment it summons them to exercise, has tempted them to assume that it is automatically satisfied, or at least is likely to be satisfied if one includes in the analysis as many covariates as possible. The prevailing attitude is that adding more covariates can cause no harm (Rosenbaum 2002, p. 76) and can absolve one from thinking about the causal relationships among those covariates, the treatment, the outcome, and, most importantly, the confounders left unmeasured (Rubin 2009).
The absence of hard-and-fast criteria makes ignorability a potentially dangerous concept for practitioners.
In a DAG, a path is a sequence of non-intersecting, adjacent arcs, regardless of their direction.
A causal path can be any path from cause to effect, in which all arcs are directed away from the cause and pointed toward the effect.
A non-causal path can be any path between cause and effect in which at least one of the arcs is oriented from effect to cause.
Our example contains both types of paths:
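Causal Path: X → Y
Non-Causal Path: X ← Z → Y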
This distinction between causal and non-causal paths is critically important for identification.
The Adjustment Criterion (Shpitser et al., 2010) is perhaps the most intuitive among several graphical identification criteria. The Adjustment Criterion states that a causal effect is identified if we can adjust for a set of variables such that:
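All non-causal paths between treatment and outcome are blocked.
All causal paths from treatment to outcome remain open.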
What does “adjust for” mean in practice? “Adjusting for a variable” can stand for any of the following operations, which all introduce information on a variable:
Controlling
Conditioning
Stratifying
Matching
At this point, the adjustment technique is irrelevant. Rather, we only need to determine which variables, if any, need to be adjusted for in order to block the non-causal paths while keeping the causal paths open. Revisiting both paths in our CDAG, we can now examine which ones are open or blocked:
In this example, the Adjustment Criterion can be met by blocking the non-causal path X ← Z → Y by means of adjusting for Z. In other words, adjusting for Z allows identifying the causal effect from X to Y. From now on, we will often refer to such variables Z as Confounders.
Thus far, we have assumed that our example has no unobserved (also called hidden or latent) variables. However, if we had reason to believe that there is another variable, U, which appears to be relevant on theoretical grounds but was not recorded in the dataset, identification could no longer be possible. Why? Let us assume U is a hidden common cause of X and Y. By adding this unobserved variable U, a new non-causal path appears between X and Y via U.
Given that U is hidden, there is no way to adjust for it, and, therefore, we have an open, non-causal path that cannot be blocked. Hence, the causal effect is no longer identifiable, and thus, it can no longer be estimated without bias.
This highlights how easily identification can be “ruined.” Once again, we can only justify the absence of unobserved variables on theoretical grounds.
“Δημόκριτος έλεγε βούλεσθαι μάλλον μίαν ευρείν αιτιολογίαν ή την Περσών βασιλείαν εαυτού γενέσθαι.” (“Democritus used to say that he would rather discover one causal explanation than become king of Persia.”) — Democritus, according to a late testimony of Dionysius, Bishop of Alexandria, by Eusebius of Caesarea in Præparatio Evangelica (Εὐαγγελικὴ προπαρασκευή)
Bayesian Belief networks and modern causality analysis are intimately tied to the seminal works of Judea Pearl. It is presumably fair to say that one of the “unique selling points” of Bayesian networks is their capability to perform causal inference. However, we do want to go beyond merely demonstrating the mechanics of causal inference. Rather, we want to establish under what conditions causal inference can be performed. More specifically, we want to see the assumptions required to perform causal inference with non-experimental data.
To approach this topic, we need to break with the pattern established in the earlier chapters of this book. Instead of starting with a case study, we start off at a higher level of abstraction. First, we discuss in theoretical terms what is required for performing causal identification, estimation, and inference. Once these fundamentals are established, we can proceed to discuss the methods, along with their limitations, including Directed Acyclic Graphs and Bayesian networks. These techniques can help us distinguish causation from association when working with non-experimental data.
This chapter was prepared in collaboration with Felix Elwert on the basis of his course, Causal Inference with Graphical Models.
In this chapter, we discuss causality mostly on the basis of a “toy problem,” i.e., a simplified and exaggerated version of a real-world challenge. As such, the issues we raise about causality may appear somewhat contrived. Additionally, given how routinely we perform intuitive causal inference in our daily lives, a formal discussion of it may seem somewhat artificial.
To highlight the importance of causal inference on a large scale, we want to consider how and under what conditions big decisions are typically made. Major government or business initiatives generally call for extensive studies to anticipate the consequences of actions not yet taken. Such studies are often referred to as “policy analysis” or “impact assessment.”
What can be the source of such predictive powers? Policy analysis must discover a causal mechanism that links a proposed action/policy to a potential consequence/impact. Unfortunately, experiments are typically out of the question in this context. Rather, impact assessments—from non-experimental observations alone—must determine the existence and the size of a causal effect.
Given the sheer number of impact analyses performed and their tremendous weight in decision-making, one would like to believe that there has been a long-established scientific foundation with regard to (non-experimental) causal effect identification, estimation, and inference. Quite naturally, as decision-makers quote statistics in support of policies, the field of statistics comes to mind as the discipline that studies such causal questions.
However, casual observers may be surprised to hear that causality has been anathema to statisticians for the longest time.
"Considerations of causality should be treated as they always have been treated in statistics, preferably not at all..." (Speed, 1990).
The repercussions of this chasm between statistics and causality can still be felt today. Judea Pearl highlights this unfortunate state of affairs in the preface of his book Causality:
"… I see no greater impediment to scientific progress than the prevailing practice of focusing all our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment." (Pearl, 1999)
Rubin (1974) and Holland (1986), who introduced the counterfactual (potential outcomes) approach to causal inference, can be credited with overcoming statisticians’ traditional reluctance to engage with causality. However, it will take many years for this fairly recent academic consensus to fully reach the world of practitioners, which is one of our key drivers for promoting Bayesian networks.
Please read the following subchapters in sequence to receive a complete explanation.
“Intuitively, the causal effect of one treatment, T = 1, over another, T = 0, for a particular unit and an interval of time from t₁ to t₂ is the difference between what would have happened at time t₂ if the unit had been exposed to T = 1 initiated at t₁ and what would have happened at t₂ if the unit had been exposed to T = 0 initiated at t₁: ‘If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone,’ or ‘because an hour ago I took two aspirins instead of just a glass of water, my headache is now gone.’ Our definition of the causal effect of the T = 1 versus T = 0 treatment will reflect this intuitive meaning.”
In this quote, we altered the original variable names E to T = 1 and C to T = 0 in order to be consistent with the nomenclature in the remainder of this chapter. T is commonly used in the literature to denote the treatment condition.
Y₁(i): Potential outcome of individual i given treatment, T = 1 (e.g., taking two aspirins)
Y₀(i): Potential outcome of individual i given no treatment, T = 0 (e.g., drinking a glass of water)
Given that we cannot rule out differences between individuals (effect heterogeneity), we define the average causal effect as the unweighted arithmetic mean of the individual-level causal effects:
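$$ACE = E[ICE(i)] = E[Y_1(i) - Y_0(i)]$$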
E[·] denotes the expected value, i.e., the unweighted arithmetic mean.
The challenge is that Y₁ (the outcome under treatment) and Y₀ (the outcome under non-treatment) can never both be observed for the same individual at the same time. We can only observe treatment or non-treatment, but not both.
So, where does this leave us? What we can produce easily is the “naive” estimator of association between the “treated” and the “untreated” sub-populations:
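$$S = E[Y \mid T=1] - E[Y \mid T=0]$$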
For notational convenience, we omit the index i because we are now referring to sub-populations and not to an individual.
Because the sub-populations in the treated and untreated groups contain different individuals, S is not necessarily a measure of causation, in contrast to ACE.
We must check whether there are any conditions under which the measure of association, S, equals the measure of causation, ACE. As a matter of fact, this would be the case if the sub-populations were comparable with respect to all confounders, i.e., the factors that could also influence the outcome.
This means that the potential outcomes, Y₁ and Y₀, must jointly be independent of the treatment assignment, T. This condition of ignorability holds in an ideal experiment. Unfortunately, this condition is very rarely met in observational studies. However, conditional ignorability may hold, which refers to ignorability within subgroups of the domain defined by the values of X (note that X can be a vector).
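Formally:

$$(Y_1, Y_0) \perp\!\!\!\perp T \mid X$$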
In words, conditional on variables X, Y₁ and Y₀ are jointly independent of T, the assignment mechanism. If conditional ignorability holds, we can utilize the estimator, S|X, to estimate the average causal effect, ACE|X.
How can we select the correct set of variables among all variables in a system? How do we know whether such variables are observed or even exist in a domain? This is what makes the concept of ignorability highly problematic in practice, as Pearl (2009) points out in the passage quoted earlier in this chapter.
It is self-evident that causal arcs have implications in terms of causation. However, as we pointed out earlier in this chapter (see Structures Within a DAG), there are also implications regarding the association of variables. This will perhaps become clearer as we introduce the concepts of “causal path” and “non-causal path.”
Non-Causal Path: X ← Z → Y
Causal Path: X → Y
All non-causal paths between treatment and outcome are “blocked” (non-causal relationships prevented).
All causal paths from treatment to outcome remain “open” (causal relationships preserved).
First, we look at the non-causal path in our CDAG: X ← Z → Y. This implies that there is an indirect association between X and Y via Z that has to be blocked by adjusting for Z.
Next is the causal path in our CDAG: X → Y. It consists of a single arc from X to Y, which is open by default and cannot be blocked.
Readers may be familiar with the expression “controlling for confounders.” What is important to bear in mind is that not all covariates in a system are Confounders! Recall Judea Pearl’s warning about ignorability and the risk of treating every covariate as a Confounder.
“Impact assessment, simply defined, is the process of identifying the future consequences of a current or proposed action.”
“Policy assessment seeks to inform decision-makers by predicting and evaluating the potential impacts of policy options.”
Let us summarize what we have so far: First, we have observational data from our domain. Second, we developed a theory about the DGP, i.e., the causal relationships in the domain. Both together will serve as the basis for estimating the causal effect. Before we do that, we should contemplate a very literal interpretation of these causal relationships.
If this causal graph is a correct representation of how the domain works, then every relationship between a pair of variables holds independently. Thus, the causal graph represents autonomous relationships between parent nodes and child nodes. It is as if each node were “listening for instructions” from its parents and only from its parents: the child node’s values are solely determined by the value of its parents, not of any other nodes in the system. Also, these relationships remain invariant regardless of any values that other nodes take on.
Let us now consider an outside intervention on X. Thus, rather than "listening" to its parent Z, X is now entirely determined by an external force and set to specific values, e.g., X=1 or X=0. This external intervention breaks the natural relationship between X and Z. Thus, Z no longer influences X. However, Z → Y and X → Y remain unaffected, and the original “natural” values of Z are not affected either.
What is the significance of all this? The idea is that intervening on X is like trying out, or simulating, what would happen if treatment were to be applied universally to the entire population — or withheld universally. Isn’t this the causal effect we are interested in? In other words, computing the causal effect is like simulating outside interventions on the treatment variable X.
How does this help us? By simulating an intervention, we “mutilate” the graph. This new graph looks as if we had severed the arcs going into the treatment variable X. This operation is what Judea Pearl has rather colorfully named “graph surgery” or “graph mutilation.”
Applying Graph Surgery allows us to transform a causal graph that represents the joint probability distribution P of observational data, i.e., the pre-intervention distribution, into a new mutilated graph that represents the joint probability distribution Pm of the same variables under a simulated intervention, i.e., the post-intervention distribution:

$$P(Y=y \mid do(X=x)) = P_m(Y=y \mid X=x)$$

where do(X = x) denotes setting the treatment variable X to the value x by outside intervention.
Returning to the original version of the CDAG, without the hidden variable, we are now ready to proceed with the estimation. However, this CDAG is only a qualitative representation of our theory about the DGP. We now need to consider this graph as a model representing the joint probability distribution of our three variables P(X, Y, Z).
We do not yet need to determine what this probability function is; we simply need to consider this graph as a non-parametric probability function linking X, Y, and Z. This will help us understand what it means to adjust for Z to estimate the causal effect.
How can we translate the abstract concept of Graph Surgery into something that can compute actual numerical values? In fact, we can work directly with graphs — in the form of Bayesian networks — and use BayesiaLab to perform Graph Surgery and simulate interventions.
However, before we illustrate that in the next section of this chapter, we want to formally conclude the line of reasoning that connects the pre-intervention distribution P to the post-intervention distribution Pm and introduce the Adjustment Formula. We paraphrase Pearl, Glymour, and Jewell (2016), p. 56f. to develop this formula.
In our example, we can easily estimate the pre-intervention distribution P from the available data, but we need the post-intervention distribution Pm to calculate the causal effect. The key lies in recognizing that Pm shares two essential properties with P.
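First, the conditional probability distribution of Y, given X and Z, remains invariant, because the process that determines how Y responds to its direct causes is unaffected by the removal of the arc Z → X:

$$P_m(Y=y \mid X=x, Z=z) = P(Y=y \mid X=x, Z=z)$$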
Furthermore, X and Z are marginally independent in the mutilated graph. This means that the conditional probability distribution of Z given X in the mutilated graph is the same as the marginal probability distribution of Z in the pre-intervention graph:
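$$P_m(Z=z \mid X=x) = P_m(Z=z) = P(Z=z)$$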
Since the Adjustment Criterion is satisfied in the mutilated graph, we have the following:
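$$P(Y=y \mid do(X=x)) = P_m(Y=y \mid X=x)$$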
By conditioning on Z and summing over all values z, we obtain:
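$$P(Y=y \mid do(X=x)) = \sum_z P_m(Y=y \mid X=x, Z=z)\, P_m(Z=z \mid X=x)$$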
Furthermore, X and Z are independent in the mutilated graph:
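$$P(Y=y \mid do(X=x)) = \sum_z P_m(Y=y \mid X=x, Z=z)\, P_m(Z=z)$$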
Using the two invariance equations above, we obtain what is known as the Adjustment Formula. It expresses the post-intervention distribution exclusively in terms of the pre-intervention distribution:
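$$P(Y=y \mid do(X=x)) = \sum_z P(Y=y \mid X=x, Z=z)\, P(Z=z)$$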
The Adjustment Formula computes the association between X and Y for each value z (or strata of z∈Z) and then produces a weighted average. On this basis, we can now estimate the Average Causal Effect (ACE):
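$$ACE = P(Y=y \mid do(X=1)) - P(Y=y \mid do(X=0))$$

where each term is computed with the Adjustment Formula.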
We know that by performing a randomized experiment, we obtain an unbiased estimate of the causal effect of the treatment. More specifically, through randomization, we randomly split the patient population into two sub-populations, force the first group to receive the treatment, and withhold the treatment from the second group. Through the random assignment of the treatment, we ensure that there is no association between Z and X. Also, all other properties remain unaffected by the randomization of treatment, including the distribution of Z, the relationship between Z and Y, and the relationship between X and Y.
As a result, graph surgery can be seen as a “randomization after the fact.” However, we need to realize that performing graph surgery can only achieve quasi-randomization with regard to observed and known confounders, in our case Z. A randomized experiment, however, can make treatment independent of all other confounders, observed, unobserved, and unknown. Thus, randomized experiments remain the gold standard for establishing causal effects.
All our efforts in estimating causal effects through adjustment or graph surgery are merely an attempt to mimic the properties of a randomized experiment. Unfortunately, we can never measure how close we are to achieving this goal. We can only be disciplined with our assumptions and make a causal claim based on that.
Simpson’s Paradox Resolved
Returning to Simpson’s Paradox, the equation
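$$P(Y=y \mid do(X=x)) = \sum_z P(Y=y \mid X=x, Z=z)\, P(Z=z)$$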
gives us the answer to our question of whether we need to look at the aggregate data table or the gender-specific data table for determining the true causal effect of treatment on the outcome: “Conditioning on Z and summing over all values z” means that we need to utilize the gender-specific table. More specifically, we need to compute the association between X and Y for each value of Z, i.e., each stratum of z ∈ Z, and then calculate the weighted average. This estimation method is also known as stratification.
Aggregate Table
Gender-Specific Table
The ACE turns out to be negative, i.e., it has the opposite sign of what we would have inferred by merely looking naively at the association between treatment and outcome. This illustrates that a bias in the estimation of an effect can be more than just a nuisance for the analyst. Bias can reverse the sign of the effect! In our example, the treatment under study would kill people instead of healing them. The good news is that we have a theory stipulating that gender is a confounder, and this variable is observed. If it were not recorded in our dataset (hidden variable), we would not be able to compute the causal effect of treatment. We can also imagine situations where we do not know that confounders exist and, therefore, do not measure them. This can lead to substantially wrong estimations of causal effects and lead to policies with catastrophic consequences.
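To make the stratification arithmetic concrete, the following minimal Python sketch (illustrative only, independent of BayesiaLab) reproduces the Average Causal Effect from the gender-specific recovery rates and the 50/50 gender distribution of this example:

```python
# Gender-specific recovery rates P(Outcome=Recovered | Treatment, Gender),
# taken from the gender-specific table of this example.
p_recover = {
    ("Male", True): 0.60, ("Male", False): 0.70,
    ("Female", True): 0.20, ("Female", False): 0.30,
}
p_gender = {"Male": 0.5, "Female": 0.5}  # the disease equally affects men and women

def p_do(treated: bool) -> float:
    """Adjustment Formula: average over the strata of Gender, weighted by P(Gender=z)."""
    return sum(p_recover[(z, treated)] * pz for z, pz in p_gender.items())

ace = p_do(True) - p_do(False)
print(f"P(Recovered | do(Treatment=True))  = {p_do(True):.2f}")   # 0.40
print(f"P(Recovered | do(Treatment=False)) = {p_do(False):.2f}")  # 0.50
print(f"ACE = {ace:.2f}")                                         # -0.10
```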
We now introduce Causal Effect Estimation by means of Likelihood Matching. Given the simplicity of Simpson’s Paradox example, the need for yet another estimation method may not be immediately apparent. The advantages of Likelihood Matching will only become clear as we study a more complex domain, such as the marketing mix example of the next chapter. However, the current example makes it easy to explain Likelihood Matching.
In statistics, matching refers to the technique that makes confounder distributions of the treated and untreated sub-populations as similar as possible to each other. As such, applying matching to variables qualifies as adjustment, and we can use it with the objective of keeping causal paths open and blocking non-causal paths. In the Simpson’s Paradox example, matching is fairly simple as we only need to match a single binary variable, i.e., Gender. That will meet our requirement for adjustment and block the only non-causal path in our model.
As our terminology of “blocking paths by matching” may not be understood outside the world of graphical models and Bayesian networks, we can offer a more intuitive interpretation of matching, which our example can illustrate very well.
Because of the self-selection phenomenon in our population, the Gender distribution is a function of Treatment. In other words, of those who are treated, 75% turn out to be male. Among untreated patients, only 25% are male.
Given that we know that Gender has a causal effect on Outcome (men are more likely to recover than women, with or without treatment), and given this difference in gender composition, comparing the outcomes between the treated and non-treated patients is clearly not an apples-to-apples comparison.
We can propose a common-sense solution to this predicament: how about searching for subsets of patients within the treated and non-treated groups that have an identical gender mix, as illustrated below?
In statistical matching, this process typically involves the selection of units in such a way that comparable groups are created:
In practice, this can be more challenging as the observed units typically have more than just a single binary attribute. So, the idea of matching has to be extended to higher dimensions, and the observed units need to be matched on a range of attributes, including both continuous and discrete variables.
However, matching observations exactly with regard to all covariates is rarely feasible. For instance, patients are characterized by dozens or even hundreds of attributes and comorbidities. Finding two matching patients would be difficult enough, but finding populations with many matching pairs of patients would presumably be impossible.
So, how does randomization do it? Actually, randomization does not guarantee identical populations; rather, it ensures that the distributions of confounders are balanced between the populations under study. To pursue balanced confounders instead of perfect matches, numerous similarity measures have been proposed for matching.
However, matching on the Propensity Score requires the score itself to be estimated first. Conventional models only represent the outcome variable as a function of the treatment variable and the confounders, i.e., P(Y|X, Z). If we need to understand the relationship between the treatment and the confounders, i.e., P(X|Z), we have to estimate this separately. This usually means fitting a function, such as a regression, that models the propensity score, PS=P(X|Z). For binary treatment variables, logistic regression is a common choice for the functional form.
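As an illustration of this two-step approach, here is a minimal sketch with synthetic data; the 75%/25% assignment probabilities mirror our example, but the variable layout and code are hypothetical and not part of BayesiaLab's workflow:

```python
# Step 1: estimate the propensity score PS = P(X=1 | Z) with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1200)                  # confounder, e.g., Gender (1 = male)
x = rng.binomial(1, np.where(z == 1, 0.75, 0.25))  # treatment assignment depends on Z

ps_model = LogisticRegression().fit(z.reshape(-1, 1), x)
propensity = ps_model.predict_proba(z.reshape(-1, 1))[:, 1]  # estimated P(X=1 | Z)

# Step 2: treated and untreated units with similar propensity scores
# would then be paired (matched) before comparing their outcomes.
```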
With BayesiaLab's Likelihood Matching, we do not directly match the underlying observations. Rather, we match the distributions of the relevant nodes on the basis of the joint probability distribution represented by the Bayesian network. In our example, we need to ensure that the gender compositions of the untreated and treated groups are the same, i.e., a 50/50 gender mix. This theoretically ideal condition is shown in the Monitors below.
However, the actual distributions reveal the inequality of gender distributions for the untreated and the treated.
How can we overcome this? We simply use Probabilistic Evidence to set a 50/50 gender mix, i.e., the marginal distribution of Gender, whenever we set evidence on Treatment. We can also right-click on the Monitor for Gender and select Fix Probabilities from the Contextual Menu. This will automatically use Probabilistic Evidence to restore the current marginal distribution after each piece of evidence is set on Treatment, or any other node.
With Fix Probabilities applied, the distribution of Gender in its Monitor is highlighted in purple.
Other than colors, nothing appears to have changed. However, once we set the values Treatment=False and Treatment=True, we see that the distribution for Gender does not change. We can also observe that the corresponding posterior probabilities for Outcome are the same as those obtained with Graph Surgery.
With that, we can once again calculate the Average Causal Effect:
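$$ACE = P(\text{Recovered} \mid do(\text{Treatment}=\text{True})) - P(\text{Recovered} \mid do(\text{Treatment}=\text{False})) = 0.4 - 0.5 = -0.1$$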
Note that the mutilated graph achieves what was stipulated by the Adjustment Criterion: the non-causal path X ← Z → Y does not exist any longer, as required by the Adjustment Criterion, and, given the autonomy of the other arcs, the causal path X → Y remains unblocked.
This new graph can tell us what happens to Y when we intervene and set X to a specific value, i.e., P(Y = y | do(X = x)). Note the do-operator! With this mutilated graph, we can compute the quantity of interest, the Average Causal Effect (ACE):
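$$ACE = P(Y=y \mid do(X=1)) - P(Y=y \mid do(X=0))$$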
First, the marginal distribution of Z remains invariant under the intervention because the process of determining Z is unaffected by removing the arc Z → X. In our example, this means that the share of men and women must remain the same before and after the intervention:
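$$P_m(Z=z) = P(Z=z), \quad \text{e.g.,}\ P_m(\text{Male}) = P(\text{Male}) = 0.5$$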
Secondly, the conditional probability distribution of Y, given X and Z, remains invariant under the intervention because the process that determines how Y responds to X and Z stays the same, regardless of whether X changes naturally or through external intervention. We can state this formally as follows:
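$$P_m(Y=y \mid X=x, Z=z) = P(Y=y \mid X=x, Z=z)$$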
The concept of Propensity Score Matching has become particularly popular (Rosenbaum and Rubin, 1983). Instead of matching individuals on their high-dimensional attributes, we match observations by their probability of receiving treatment, i.e., P(X=1|Z), which is known as the Propensity Score. Rosenbaum and Rubin have shown that matching on the propensity score achieves balance of the covariate distributions.
For now, Likelihood Matching applied to Simpson's Paradox may not seem like a breakthrough method. Conceptually and practically, it appears to be another form of adjustment. The fundamental advantages of Likelihood Matching will become clear in the context of the next chapter.
We now understand that Graph Surgery and Adjustment are equivalent. However, with Bayesian networks, we can go beyond the metaphor and—quite literally—perform graph surgery. In this section, we create a Bayesian network to represent Simpson’s Paradox example and then perform graph surgery to estimate the causal effect.
We have already defined a causal graph earlier when we encoded our causal assumptions regarding this domain. We can reuse this causal understanding for building a causal Bayesian network in BayesiaLab.
The Genetic Grid Layout algorithms are particularly useful for causal networks. We can, therefore, define one of these two algorithms as the one associated with the shortcut P via Main Menu > Window > Preferences > Automatic Layout > Layout Algorithm Associated with Shortcut.
Parameter Estimation
Previously, we acquired the data needed for Parameter Estimation via the Data Import Wizard. Now we will use the Associate Data Wizard for the same purpose. Whereas the Data Import Wizard generates new nodes from columns in a database, the Associate Data Wizard links columns of data with existing nodes. This way, we can “fill” our qualitative network with data and then perform Parameter Estimation to generate the quantitative part of the network. We now show the corresponding steps in detail.
We start the Associate Data Wizard from the main menu: Data > Associate Data Source > Text File
Given that the Associate Data Wizard mirrors the Data Import Wizard in most of its options, we do not describe them again here. We merely show the screens for reference as we click Next to progress through the wizard.
The last step shows how the variables in the dataset will be associated with the nodes of the network. If the column names in the dataset perfectly match the existing node names, BayesiaLab automatically creates an association. However, this is not the case in our example. Therefore, we have to manually define the association by iteratively selecting each Dataset Variable and Network Node, and then clicking on the right arrow.
Upon clicking on the right arrow, BayesiaLab brings up a screen for defining the association between the values used in the dataset and the states of the node. Again, the state names of our nodes do not correspond exactly to the values used in the dataset. So we have to manually define the association by iteratively selecting each Dataset Value and Network State and then clicking on the right arrow.
Once this is done for all three variables, the Associate Data Wizard displays how the columns in the dataset are associated with the nodes of the network.
Upon clicking Finish, we are prompted whether we want to view the Associate Report.
The Database icon in the lower right-hand corner of the main window indicates that our network now has a database associated with its structure. We now use this data to estimate the parameters of the network: Learning > Parameter Estimation.
Once the parameters are estimated, there are no longer any warning symbols tagged onto the nodes.
We now have a fully specified Bayesian network. By opening the Node Editor of Outcome, for instance, we see that the CPT is indeed filled with probabilities.
Upon clicking on the Occurrences tab, we can see the counts that were used by the Maximum Likelihood Estimation to derive these probabilities.
Recall that distinguishing between causal and non-causal paths is crucial for the application of the Adjustment Criterion. BayesiaLab can help us review the paths that are present in the graph. Given that we already understand the paths, showing the formal path analysis with BayesiaLab is merely for reference.
Once we define Outcome as Target Node and switch into the Validation Mode (F5), we can examine all possible paths to the Target Node in this network. We select Treatment and then select Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Then, BayesiaLab displays a pop-up window with the Influence Paths report. Selecting any of the listed paths shows the corresponding arcs in the Graph Panel. Causal paths are shown in blue; non-causal paths are pink.
It is easy to see that this automated path analysis can be particularly helpful with more complex networks. In any case, the result confirms our previous manual path analysis, which means that we need to adjust for Gender to block the non-causal path between Treatment and Outcome.
For instance, the screenshot below shows the prior distributions (left) and the posterior distributions (right) given the observation Treatment = True.
As expected, the target variable Outcome changes upon setting this evidence. However, Gender changes as well, even though we know that the treatment cannot possibly change the gender of a patient. What we observe here is a manifestation of the non-causal path: Treatment ← Gender → Outcome. These probabilities are obviously perfectly correct from the observational point of view: in the observed population of 1,200 individuals, three times as many men as women took the treatment.
For causal inference, however, we need a network that computes all probabilities under an intervention scenario. As we learned, Graph Surgery transforms the original causal network representing the pre-intervention distribution into a new, mutilated network that yields the post-intervention distribution.
In BayesiaLab, Graph Surgery is automated. After right-clicking the Monitor of the node Treatment, we select Intervention from the Contextual Menu.
The activation of the Intervention Mode for this node is highlighted by the blue background of the Treatment's Monitor and the arrow symbols (→) in the Treatment's badge.
By double-clicking a state of Treatment, we now set an Intervention and no longer an Observation.
By intervening on Treatment, BayesiaLab applies Graph Surgery and removes the inbound arc into Treatment.
Recall the formula that computes the Average Causal Effect (ACE):
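$$ACE = P(\text{Recovered} \mid do(\text{Treatment}=\text{True})) - P(\text{Recovered} \mid do(\text{Treatment}=\text{False}))$$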
We can take it directly as a set of instructions and compare the probability of Outcome=Recovered under do(Treatment=False) and do(Treatment=True). Note that the distribution of Gender remains the same pre- and post-intervention.
Simpson’s Paradox illustrates the implications of falsely assuming ignorability. This leads us to abandon the idea of ignorability and, along with it, the potential outcomes framework, replacing them with a formal identification and estimation process that relies on graphical models.
This is an important exercise as it illustrates how an incorrect interpretation of an association can produce bias. The word “bias” may not necessarily strike fear into our hearts. In our common understanding, “bias” implies “inclination” and “tendency,” and it is perhaps not a particularly forceful expression. Hence, we may not be overly troubled by a warning about bias. However, Simpson’s Paradox shows how bias can lead to catastrophically wrong estimates.
A hypothetical disease equally affects men and women. An observational study finds that a certain treatment is linked to an increase in the recovery rate among all treated patients from 40% to 50%. Based on the study, this new treatment is widely recognized as beneficial and subsequently promoted as a new therapy.
We can imagine a headline along the lines of “New Therapy Increases Recovery Rate by 10%.” However, when examining patient records by gender, the recovery rate for male patients—upon treatment—decreases from 70% to 60%; for female patients, the recovery rate declines from 30% to 20%. Men are, therefore, more likely to recover than women, with or without treatment.
So, is this new treatment effective overall or not? This puzzle can be resolved by realizing that, in this observed population, there was an unequal application of the treatment to men and women, i.e., some type of self-selection occurred. More specifically, 75% of male patients and only 25% of female patients received the treatment. Although the reason for this imbalance is irrelevant for inference, one could imagine that the side effects of this treatment are much more severe for females, who thus seek alternative therapies. As a result, there is a greater share of men among treated patients. Given that men have a better a priori recovery prospect with this type of disease, the recovery rate of all treated patients increases. So, what is the true causal effect of this treatment?
Our particular manifestation of Simpson’s Paradox is not very far-fetched, but it is still fictional. Therefore, we must rely on synthetic data to make this problem domain tangible for our study efforts. We generate 1,200 observations by sampling from the joint probability distribution of the original Data-Generating Process (DGP). Needless to say, for this dataset to be a suitable example of non-experimental observations, as we would encounter them under real-world conditions, we treat the true DGP as unknown; it remains merely an assumption.
Our synthetic dataset consists of three variables with two discrete states each:
X (Treatment): Yes (1)/No (0)
Y (Outcome): Recovered (1)/Not Recovered (0)
Z (Gender): Male (1)/Female (0)
The following table shows a preview of the first ten rows of the dataset:
As illustrated earlier in this book, we manually create the nodes and draw the arcs on BayesiaLab’s Graph Panel. We choose to use long names for the nodes instead of X, Y, and Z. Letters were very convenient for formulas, but long names increase the readability of Bayesian networks. To further help with interpretation, we also associate images with each node and display them as Badges. Then, we use View > Layout > Genetic Grid Layout > Top-Down Repartition to obtain a layout that takes into account the direction of the arcs and defines layers accordingly.
The yellow warning symbols remind us that the probability tables associated with the nodes have yet to be defined. At this point, we could set the parameters based on our knowledge of all the probabilities in this domain. Instead, we utilize the available data and use BayesiaLab’s Parameter Estimation to establish the quantitative part of this network via Maximum Likelihood Estimation. We have been using Parameter Estimation extensively in this book, either implicitly or explicitly, for instance, in the context of structural learning and missing values estimation.
This prompts us to select the text file containing our observational data. Upon selecting the file, BayesiaLab brings up the first screen of the Associate Data Wizard.
Before proceeding to the effect estimation, we bring up the Monitors of all three nodes and compare the probabilities reported by the network with the original data tables, which gave rise to the paradox.
Thus, we obtain an Average Causal Effect of −0.1, which agrees with our earlier computations.
We will use an example that appears trivial on the surface but which has produced countless instances of false inference throughout the history of science. Due to its counterintuitive nature, this example has become widely known as Simpson’s Paradox (Simpson, 1951).
                 Patient Recovered
Treatment        Yes        No
Yes              50%        50%
No               40%        60%
                            Patient Recovered
Gender      Treatment       Yes        No
Male        Yes             60%        40%
            No              70%        30%
Female      Yes             20%        80%
            No              30%        70%
Now that we have seen how to estimate the Average Causal Effect by manually interacting with BayesiaLab's Monitors, with both Graph Surgery and Likelihood Matching, we will use BayesiaLab's Direct and Total Effects functions to compute causal effects automatically for a set of variables. But first, we present a slightly more complex version of Simpson's Paradox to illustrate these features (see Example: Simpson's Paradox).
Our updated story contains four additional dimensions:
Treatment Availability: the treatment is not always available;
Side Effects: the treatment may produce severe side effects;
Efficacy: some patients do not respond to the drug;
Litigation: the families of patients who died may sue the pharmaceutical company that had provided the treatment.
The manually designed CDAG shown below describes this new domain qualitatively:
Next, we describe the quantitative part of the domain. First, we state that the probability of Treatment Availability is 75%.
We also assume that the treatment may have Side Effects, which are much more frequent for females. The following conditional probability table quantifies this direct causal dependency on Gender:
Patients decide whether or not to take the treatment based on two criteria, Treatment Availability and Side Effects. The dependencies are described in the following table. It states that if the treatment is unavailable, patients cannot have the treatment, which is deterministic (and obvious). However, if the treatment is available, those patients who do not have any risk of experiencing side effects will always choose the treatment, while those at risk will be unlikely to submit to the treatment:
Furthermore, the Efficacy of the treatment depends on Drug Administration plus some hidden factors that render the treatment ineffective in 20% of patients:
The Target Node, Outcome, is defined by Gender and Efficacy. In this context, "not recovered" means that the patient died — hence the grim illustration attached to the icon.
Finally, half of the families of those patients who took the treatment and died are pursuing litigation. More specifically, these families are suing the pharmaceutical company that provided the treatment.
We now list the paths between each variable and the target variable Outcome by using Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Recall the Adjustment Criterion, which stipulates that we must keep all of a variable's causal paths to the target variable open and simultaneously block all its non-causal paths for estimating its causal effect.
To illustrate BayesiaLab's Total and Direct Effects functions with Graph Surgery, we set all nodes but Outcome to Intervention Mode.
Notice the arrow symbols (→) in the badges of the nodes that are set to Intervention Mode.
Before using BayesiaLab's automated tools for computing causal effects, we manually estimate the causal effect of our main variable of interest, Drug Administration, by using the Monitors.
Setting a piece of Evidence in Intervention Mode simulates an intervention on Drug Administration and mutilates the graph, as shown below, which meets the Adjustment Criterion by blocking the non-causal path (cf. Path Analysis, #6).
The Average Causal Effect of Drug Administration on Outcome, mediated by Efficacy, is -0.08.
We have seen in Chapter 8 that BayesiaLab estimates Total Effects as the derivatives of Total Effect Curves. These curves are based on the Posterior Mean Values of the Target Node given Mean Values from the interval of the variable under study. While the variables are in Intervention Mode, the Posterior Mean Values are computed based on the mutilated graph.
We can plot these curves with Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Total Effects.
For generating this graph, we can set a number of options:
The x-axis, Variable Delta Means, represents the difference between the Mean Value generated for the analysis (here, Hard Evidence/Intervention on the states of the variable under study) and its Prior Mean Value. The y-axis represents the difference between the Posterior Mean Value of Outcome and its Prior Mean Value.
If we do not specifically associate numerical values with symbolic states, BayesiaLab uses the state index. In our example,
False is 0, and True is 1.
Male is 0, and Female is 1.
Not Recovered is 0, and Recovered is 1.
We see that Side Effects is the only variable with a positive causal effect. We also notice that Litigation has no causal effect.
Given that all variables are binary, the corresponding curves are linear. Therefore, the curves' derivatives will be perfect summaries of the Total Effect Curves, which we obtain via Main Menu > Analysis > Report > Target > Total Effects on Target:
The Total Effect is the derivative computed at (0, 0) in the previous Target Mean Analysis graph, i.e., the slope of the curve. The Standardized Total Effect is the Total Effect times the ratio between the standard deviation of the variable and the standard deviation of the Target Node.
The arrow symbols (→) in the results table indicate that Intervention Mode was active on all nodes, triggering Graph Surgery upon each observation/intervention during the estimation of the effects.
Gender is the variable with the strongest Total Effect. It is negative because of the index values of the states. Females (1) are recovering at a lower rate than Males (0).
Note that there are two paths from Gender to Outcome (paths #1 and #2 illustrated in the previous section), and they are both causal. Gender is indeed a root node, i.e., it has no parents, meaning the Adjustment Criterion is fulfilled by default.
The Total Effect measures the effects of these two causal paths: the direct path (#1) and the indirect path (#2), represented by the dashed blue arcs below.
Now suppose we are interested in estimating the effect of the direct paths only. This would require blocking not only the non-causal paths but also the indirect causal paths. This is the role of BayesiaLab's Direct Effect functions. The only difference between Direct and Total Effect functions is that, by default, all other nodes are held constant during the estimation of the variable's Direct Effect.
We generate the Direct Effect Curves with Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects, using the same parameters as those previously used for Total Effects:
Given that all nodes are in Intervention Mode, the only variables with Direct Effects are the Parents of Outcome. Indeed, intervening on all nodes to hold them constant triggers Graph Surgery and generates the mutilated graph below:
The function Main Menu > Analysis > Report > Target > Direct Effects on Target allows us to compute the Direct Effects, the single-point estimates of these curves:
The Direct Effect is the slope of the Direct Effect Curve between the endpoints of the variable interval.
The Standardized Direct Effect is the Direct Effect times the ratio between the standard deviation of the variables and the standard deviation of the Target Node.
The Elasticity is the Direct Effect times the ratio between the range of the variable and the range of the Target Node.
The Contribution is the Standardized Direct Effect divided by the total sum of Standardized Direct Effects.
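Expressed compactly, with σ denoting standard deviations and the sum running over all analyzed variables, these definitions read:

$$\begin{aligned} SDE_X &= DE_X \times \frac{\sigma_X}{\sigma_Y} \\ Elasticity_X &= DE_X \times \frac{range(X)}{range(Y)} \\ Contribution_X &= \frac{SDE_X}{\sum_{X'} SDE_{X'}} \end{aligned}$$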
Non-Confounders
By default, BayesiaLab's Direct Effect functions measure a variable's effect by holding all other variables constant. However, we can use the predefined class Non_Confounder to define the nodes we do not want to control.
In our example, the main variable of interest, Drug Administration, has no direct effect. The post-treatment variable Efficacy mediates its causal effect, and the Direct Effect analysis blocks that path. We must therefore use the predefined class Non_Confounder (Efficacy's Contextual Menu > Properties > Classes > Add > Predefined Class > Non_Confounders) to prevent BayesiaLab from holding Efficacy constant and to allow the estimation of the mediated causal effect. The new mutilated graph below is then used for estimating the Direct Effects:
Drug Administration's Direct Effect now equals the Average Causal Effect we manually computed with the Monitors. Note also that we no longer analyze the effect of the Non-Confounder Efficacy.
Now suppose we want to use Likelihood Matching instead of Graph Surgery. We first set all nodes back to Observation Mode via the Monitors' Contextual Menus.
The nodes of interest are the nodes for which we want to estimate the causal effect on the Target Node. We call them Treatments or Drivers.
In the previous section, we assumed that all nodes were of interest and set them in Intervention Mode. With Likelihood Matching, the workflow is less straightforward. For each Driver, we need to analyze the paths to the Target (cf. Path Analysis) to define the set of nodes that must be controlled for to block the non-causal paths and keep the causal paths open. Note that these sets of nodes may differ for each Driver, requiring us to perform multiple Total Effect analyses to avoid conflicting adjustments.
The first step, then, is to define our nodes of interest. In the Augmented Simpson's Paradox, the main variable of interest is obviously Drug Administration, but for illustrative purposes, let's consider Gender as well.
We have seen in the Path Analysis section that there are two paths between Gender and Outcome, both causal (#1 and #2). Thus, there is no variable to adjust for to estimate the Total Effect.
The Path Analysis indicates that there are also two paths between Drug Administration and Outcome, one causal (#7) and one non-causal (#6): Drug Administration ← Side Effects ← Gender → Outcome. So we need to adjust for Side Effects, or for Gender, to block this path. This conflicts with the analysis of Gender's effect: we cannot estimate the Total Effects of Gender and Drug Administration in the same analysis with Likelihood Matching!
So let's start with Gender. We select the node, go to Main Menu > Analysis > Report > Target > Total Effects on Target, and confirm that we want to perform the analysis on the selected node only:
For Drug Administration, let's suppose we choose to adjust for Side Effects. We right-click on its associated Monitor and select Fix Probabilities from its Contextual Menu.
Then, we select the node Drug Administration, go to Analysis > Report > Target > Total Effects on Target, and confirm that we want to perform the analysis on the selected node only:
Now let's look at the workflow for estimating Direct Effects with Likelihood Matching, i.e., how to assess the effects of the direct paths only. Remember that, by default, BayesiaLab's Direct Effect functions measure a variable's effect by holding all variables constant except those associated with the predefined class Non_Confounder.
Holding a variable constant with Graph Surgery implies the deletion of its incoming arcs. Thus, there is no risk of biasing the estimation of Direct Effects. With Likelihood Matching, this risk exists because we set evidence on the variable to adjust for it. Indeed, controlling for descendants of the Target Node (e.g., Litigation) automatically biases the estimate.
While we previously added Efficacy to the Non_Confounder class to let it mediate the effect of Drug Administration, we must also add Litigation to prevent its adjustment.
Notice that there is no conflict in this analysis:
Gender
Controlling for Side Effects and Drug Administration cuts the indirect causal path (#2);
Controlling for Treatment Availability has no impact;
Not controlling for Efficacy has no effect, as path #2 is already blocked;
Not controlling for Litigation prevents biasing the estimate of the effect;
Drug Administration
Controlling for Side Effects and Gender cuts the non-causal path (#6);
Controlling for Treatment Availability has no impact;
Not controlling for Efficacy lets the information flow from Drug Administration to Outcome;
Not controlling for Litigation prevents biasing the estimate of the effect.
We can, therefore, select our two nodes of interest, use Analysis > Report > Target > Direct Effects on Target, and confirm that we want to perform the analysis on the selected nodes only:
Before concluding this chapter, let's summarize the main characteristics of Graph Surgery and Likelihood Matching:
Graph Surgery
requires a fully specified Causal Bayesian Network;
uses the mutilated Causal Bayesian Network for causal inference;
Likelihood Matching
requires a causal analysis of the domain to define the variables that need to be adjusted for in order to block the non-causal paths and keep the causal paths open;
uses the Bayesian network to carry out probabilistic inference with the adjusted variables. Note that this network does not have to be causal! It just needs to represent the Joint Probability Distribution of the domain.
This last point is especially important. It is indeed sometimes challenging, if not impossible, to design the fully specified Causal Bayesian Network. However, BayesiaLab offers a wide range of machine-learning algorithms that we can use to induce a network that represents the Joint Probability Distribution. Hence, we only need to have a limited amount of causal knowledge to define the variables that have to be adjusted for.
For example, suppose we machine-learned the network below with Main Menu > Learning > Supervised Learning > Augmented Naive Bayes:
The main architecture of the network is Naïve, i.e., the Target Node is the parent of all nodes. Therefore, this Bayesian network is clearly not causal. If we were to use Graph Surgery, we would not find any total or direct effects (see the corresponding mutilated graph below when estimating the Direct Effects with Efficacy and Litigation defined as Non_Confounder).
However, Likelihood Matching returns the correct estimations for the Total Effects with two separate analyses.
One analysis for Gender, without adjusting for any variables:
And one analysis for Drug Administration, holding Side Effects constant:
As for Direct Effects, the analysis can be carried out for both variables with the current definition of Non_Confounders.
This chapter highlights how much effort is required to derive causal effect estimates from observational data. Simpson’s Paradox illustrates how much can go wrong even in simple circumstances. Given such potentially serious consequences, it is a must for policy analysts to examine all aspects of causality formally. To paraphrase Judea Pearl, we must not leave causal considerations to the mercy of intuition and good judgment. Fortunately, causality has emerged from its pariah status in recent decades, which has allowed tremendous progress in theoretical research and practical tools: “[…] practical problems relying on causal information that long were regarded as either metaphysical or unmanageable can now be solved using elementary mathematics” (Pearl, 2000).
The causal paths are highlighted in blue, and the non-causal paths (i.e., paths with at least one "backward" arrow ←) are shown in pink: