Smoking (chapter 5)

Authors: Aymen Merrouche and Pierre-Henri Wuillemin.

This notebook follows the example from “The Book Of Why” (Pearl, 2018), chapter 5.

In [1]:

from IPython.display import display, Math, Latex, HTML

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb
import os

In the 1950s the strong association between smoking and lung cancer provoked a debate on the issue. Does smoking cause lung cancer?

Corresponding causal diagram:

The corresponding causal diagram is the following:

In [2]:

sc = gum.fastBN("Smoking->Lung Cancer")
sc

Out[2]:

Constitutional Hypothesis:

The tobacco industry and some skeptical statisticians advanced the theory that smokers are genetically different from nonsmokers. A smoking gene could be a confounder that would explain the observed association.

In [3]:

msc = csl.CausalModel(sc, [("Smoking Gene", ["Smoking","Lung Cancer"])])
cslnb.showCausalImpact(msc, "Lung Cancer", doing="Smoking",values={})

Causal Model

$$\begin{equation*}P( Lung Cancer \mid \text{do}(Smoking)) = P\left(Lung Cancer\right)\end{equation*}$$
Explanation : No causal effect of X on Y, because they are d-separated (conditioning on the observed variables if any).

Lung Cancer
0	1
0.3320	0.6680

Impact

This constitutional hypothesis was untestable because the human genome could not be sequenced at the time. However, it was not plausible: the observed association was far too strong to be explained by a confounder alone.
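One way to make "too strong" concrete is Cornfield's inequality, which this notebook does not spell out: if a binary gene is the only reason smokers get more cancer, the observed risk ratio cannot exceed the ratio of the gene's prevalence among smokers to its prevalence among nonsmokers. The sketch below checks this bound by brute force on a small grid. All probabilities and names (`observed_risk_ratio`, `grid`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def observed_risk_ratio(p_gene_smoker, p_gene_nonsmoker, risk_gene, risk_no_gene):
    """P(cancer | smoker) / P(cancer | nonsmoker) when the gene is the only cause of cancer."""
    risk_smoker = p_gene_smoker * risk_gene + (1 - p_gene_smoker) * risk_no_gene
    risk_nonsmoker = p_gene_nonsmoker * risk_gene + (1 - p_gene_nonsmoker) * risk_no_gene
    return risk_smoker / risk_nonsmoker


# Hypothetical grid of probabilities (not taken from any study).
grid = [0.05, 0.2, 0.5, 0.8, 0.95]

# Largest value of (observed risk ratio - gene prevalence ratio) over the grid,
# restricted to cases where the gene is at least as common among smokers.
worst_gap = max(
    observed_risk_ratio(p1, p0, r1, r0) - p1 / p0
    for p1, p0, r1, r0 in product(grid, repeat=4)
    if p1 >= p0
)
print("bound holds on the grid:", worst_gap <= 1e-12)

# Even an extreme gene (9x more common among smokers, 50x more dangerous)
# yields an observed risk ratio below 9.
example_rr = observed_risk_ratio(0.9, 0.1, 0.5, 0.01)
print("example risk ratio:", round(example_rr, 2), "prevalence ratio:", 0.9 / 0.1)
```

Under these assumptions, a gene would have to be distributed very unevenly between smokers and nonsmokers to account for a large risk ratio on its own.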

Another explanation

Another hypothesis was that a smoking gene could be a confounder, but that smoking still had a direct causal effect on lung cancer:

In [4]:

msc = csl.CausalModel(sc, [("Smoking Gene", ["Smoking","Lung Cancer"])], True)
cslnb.showCausalImpact(msc, "Lung Cancer", doing="Smoking",values={})

Causal Model

Hedge Error: G={'Smoking', 'Lung Cancer'}, G[S]={'Lung Cancer'}
Impossible

No result
Impact
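The hedge reported above means that, with an unobserved gene and a direct arc, the causal effect cannot be computed from observational data alone. A minimal way to see why, sketched here in plain Python rather than pyAgrum, is to build two models compatible with this diagram that produce exactly the same joint distribution over Smoking and Lung Cancer yet disagree on the effect of an intervention. All numbers and helper names are hypothetical.

```python
# Hypothetical parameters, chosen only for illustration.
p_gene = 0.3                             # P(G=1)
p_smoke_given_gene = {0: 0.2, 1: 0.8}    # P(S=1 | G=g)
p_cancer_given_gene = {0: 0.1, 1: 0.6}   # P(C=1 | G=g)


def bern(p, x):
    """Probability that a Bernoulli(p) variable takes the value x."""
    return p if x == 1 else 1 - p


# Model A: the gene confounds, and smoking has NO effect on cancer.
def joint_a(s, c):
    return sum(
        bern(p_gene, g) * bern(p_smoke_given_gene[g], s) * bern(p_cancer_given_gene[g], c)
        for g in (0, 1)
    )


p_smoke = {s: joint_a(s, 0) + joint_a(s, 1) for s in (0, 1)}
p_cancer_given_smoke = {s: joint_a(s, 1) / p_smoke[s] for s in (0, 1)}


# Model B: the gene is inert, and smoking DIRECTLY causes cancer,
# with P(C | S) copied from what model A produces observationally.
def joint_b(s, c):
    return p_smoke[s] * bern(p_cancer_given_smoke[s], c)


same_observations = all(
    abs(joint_a(s, c) - joint_b(s, c)) < 1e-12 for s in (0, 1) for c in (0, 1)
)

# Interventional distributions P(C=1 | do(S=s)) in each model.
do_a = {s: sum(bern(p_gene, g) * p_cancer_given_gene[g] for g in (0, 1)) for s in (0, 1)}
do_b = dict(p_cancer_given_smoke)

print("same observational joint:", same_observations)
print("model A, P(C=1 | do(S)):", do_a)
print("model B, P(C=1 | do(S)):", do_b)
```

Since both models fit the data equally well, no formula based only on the observed distribution can decide between them.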

Front door criterion:

Let’s suppose now that smoking causes cancer only through tar deposits, which are entirely due to the physical action of cigarettes. The causal diagram becomes:

In [5]:

sct = gum.fastBN("Smoking->Tar->Lung Cancer")
sct

Out[5]:

In [6]:

msct = csl.CausalModel(sct, [("Smoking Gene", ["Smoking","Lung Cancer"])], True)
gnb.show(msct)

[Figure: causal model msct, with the latent variable Smoking Gene pointing to Smoking and Lung Cancer]

In [7]:

cslnb.showCausalImpact(msct, "Lung Cancer", doing="Smoking",values={})

Causal Model

$$\begin{equation*}P( Lung Cancer \mid \text{do}(Smoking)) = \sum_{Tar}{P\left(Tar\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Lung Cancer\mid Smoking',Tar\right) \cdot P\left(Smoking'\right)}\right)}\end{equation*}$$
Explanation : frontdoor ['Tar'] found.

	Lung Cancer
Smoking	0	1
0	0.5364	0.4636
1	0.3692	0.6308

Impact

Even if the smoking gene is unobservable, we can assess the causal effect of Smoking on Lung Cancer using the front-door method. In this case, the front-door is:

\[Smoking \rightarrow \color{red}{Tar} \rightarrow LungCancer\]

It consists of variables that we have observed:

We can measure the causal effect of $Smoking$ on $Tar$ directly because there are no open back-door paths between the two: the only one, $Smoking \leftarrow SmokingGene \rightarrow LungCancer \leftarrow Tar$, is blocked by the collider $LungCancer$. Therefore:

\[P(Tar \mid do(Smoking)) = P (Tar \mid Smoking)\]

In [8]:

formula, adj, exp = csl.causalImpact(msct, on="Tar", doing="Smoking", values={})
display(Math(formula.toLatex()))

$\displaystyle P( Tar \mid \text{do}(Smoking)) = P\left(Tar\mid Smoking\right)$

We can measure the causal effect of $Tar$ on $LungCancer$: we just need to adjust for $Smoking$ to block the back-door path $Tar \leftarrow Smoking \leftarrow SmokingGene \rightarrow LungCancer$:

\[P(LungCancer \mid do(Tar)) = \sum_{Smoking}{P(LungCancer \mid Tar, Smoking) \times P(Smoking)}\]

In [9]:

formula, adj, exp = csl.causalImpact(msct, on="Lung Cancer", doing="Tar", values={})
display(Math(formula.toLatex()))

$\displaystyle P( Lung Cancer \mid \text{do}(Tar)) = \sum_{Smoking}{P\left(Lung Cancer\mid Smoking,Tar\right) \cdot P\left(Smoking\right)}$

We can now combine these two pieces of information to have the causal effect of $Smoking$ on $LungCancer$ and reduce the expression of $P(LungCancer \mid do(Smoking))$ to elements that we observed:

\[P(LungCancer \mid do(Smoking)) = \sum_{Tar}{(P(Tar \mid Smoking) \times \sum_{Smoking^{'}}{P(LungCancer \mid Tar, Smoking^{'}) \times P(Smoking^{'})})}\]
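As a sanity check on the front-door formula above, the following sketch builds a fully specified model with a hidden gene, derives the observational distribution over (Smoking, Tar, Lung Cancer) only, and compares the front-door estimate with the true interventional probability computed from the full model. It is plain Python rather than pyAgrum, and every parameter and helper name is a hypothetical choice made for illustration.

```python
from itertools import product

# Hypothetical parameters of the full model (the gene G is hidden from the analyst).
p_gene = 0.3                                   # P(G=1)
p_smoke = {0: 0.2, 1: 0.8}                     # P(S=1 | G=g)
p_tar = {0: 0.05, 1: 0.9}                      # P(T=1 | S=s)
p_cancer = {(0, 0): 0.05, (0, 1): 0.3,         # P(C=1 | T=t, G=g), keyed by (t, g)
            (1, 0): 0.4, (1, 1): 0.8}


def bern(p, x):
    """Probability that a Bernoulli(p) variable takes the value x."""
    return p if x == 1 else 1 - p


# Observational joint P(S, T, C): the gene is summed out.
obs = {
    (s, t, c): sum(
        bern(p_gene, g) * bern(p_smoke[g], s) * bern(p_tar[s], t) * bern(p_cancer[t, g], c)
        for g in (0, 1)
    )
    for s, t, c in product((0, 1), repeat=3)
}


def prob(pred):
    """Probability of the event described by pred(s, t, c) under the observed joint."""
    return sum(p for key, p in obs.items() if pred(*key))


def front_door(s):
    """P(C=1 | do(S=s)) estimated from observed quantities only."""
    total = 0.0
    for t in (0, 1):
        p_t_given_s = prob(lambda S, T, C: S == s and T == t) / prob(lambda S, T, C: S == s)
        inner = 0.0
        for s2 in (0, 1):
            p_c = (prob(lambda S, T, C: S == s2 and T == t and C == 1)
                   / prob(lambda S, T, C: S == s2 and T == t))
            inner += p_c * prob(lambda S, T, C: S == s2)
        total += p_t_given_s * inner
    return total


def truth(s):
    """P(C=1 | do(S=s)) computed from the full model, using the hidden gene."""
    return sum(
        bern(p_tar[s], t) * sum(bern(p_gene, g) * p_cancer[t, g] for g in (0, 1))
        for t in (0, 1)
    )


def naive(s):
    """Plain conditional P(C=1 | S=s), which is biased by the gene."""
    return prob(lambda S, T, C: S == s and C == 1) / prob(lambda S, T, C: S == s)


for s in (0, 1):
    print(s, round(front_door(s), 4), round(truth(s), 4), round(naive(s), 4))
```

In this toy model the front-door estimate matches the true effect, while the plain conditional probability does not, because it mixes in the influence of the gene.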

Birth-weight paradox:

Studies have shown that babies of smoking mothers tend to weigh less than average. Other studies have shown that low-birth-weight babies have a higher mortality rate than normal-birth-weight babies. The corresponding causal diagram is the following:

In [10]:

bwp = gum.fastBN("Smoking->Low Birth Weight->Mortality")
bwp

Out[10]:

In [11]:

# Causal effect of Smoking on neo-natal mortality
bwpModele = csl.CausalModel(bwp)
cslnb.showCausalImpact(bwpModele, "Mortality", doing="Smoking",values={})

Causal Model

$$\begin{equation*}P( Mortality \mid \text{do}(Smoking)) = \sum_{Low Birth Weight}{P\left(Low Birth Weight\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Mortality\mid Low Birth Weight\right) \cdot P\left(Smoking'\right)}\right)}\end{equation*}$$
Explanation : frontdoor ['Low Birth Weight'] found.

	Mortality
Smoking	0	1
0	0.6329	0.3671
1	0.5871	0.4129

Impact

However, the data also showed that low-birth-weight babies of mothers who smoke had lower mortality rates than low-birth-weight babies of mothers who do not. An explanation for this paradoxical situation is that low birth weight is due either to a smoking mother or to a birth defect that is much more threatening to the baby’s health. The causal diagram becomes:

In [12]:

bwpe = gum.fastBN("Smoking->Low Birth Weight->Mortality<-Smoking;Birth defect->Low Birth Weight;Mortality<-Birth defect")
bwpe

Out[12]:

This causal diagram makes it easy to pinpoint the source of the paradox: collider bias. “Low Birth Weight” is a collider! The data only concerned low-birth-weight babies, which amounts to conditioning on “Low Birth Weight”. Knowing that the mother doesn’t smoke increases our belief that a birth defect is the cause of the low birth weight, and a birth defect is more threatening to the baby’s health. Conditioning on the collider opens the formerly blocked path $Smoking \rightarrow LowBirthWeight \leftarrow BirthDefect \rightarrow Mortality$ and lets non-causal information flow from Smoking to Mortality, introducing a bias.
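The collider effect described above can be reproduced with a small numerical sketch in plain Python. Smoking and birth defects are drawn independently, both raise the chance of low birth weight, and the defect is far more dangerous than smoking. All probabilities and names are hypothetical; they are picked only so that the paradox shows up clearly.

```python
# Hypothetical parameters, chosen only for illustration.
p_smoking = 0.3                                   # P(S=1)
p_defect = 0.05                                   # P(D=1), independent of smoking
p_low_weight = {(0, 0): 0.05, (1, 0): 0.3,        # P(L=1 | S=s, D=d), keyed by (s, d)
                (0, 1): 0.8, (1, 1): 0.9}


def p_death(l, s, d):
    """P(M=1 | L=l, S=s, D=d): smoking is mildly harmful, the defect severely so."""
    return 0.01 + 0.02 * l + 0.01 * s + 0.3 * d


def bern(p, x):
    """Probability that a Bernoulli(p) variable takes the value x."""
    return p if x == 1 else 1 - p


def mortality_among_low_weight(s):
    """P(M=1 | S=s, L=1): what a study restricted to low-birth-weight babies sees."""
    weights = {d: bern(p_defect, d) * p_low_weight[s, d] for d in (0, 1)}
    norm = sum(weights.values())
    return sum(weights[d] / norm * p_death(1, s, d) for d in (0, 1))


def mortality_do(s):
    """P(M=1 | do(S=s)): sum over D and L of P(D) * P(L | S, D) * P(M | L, S, D)."""
    return sum(
        bern(p_defect, d) * bern(p_low_weight[s, d], l) * p_death(l, s, d)
        for d in (0, 1)
        for l in (0, 1)
    )


conditioned = {s: mortality_among_low_weight(s) for s in (0, 1)}
intervened = {s: mortality_do(s) for s in (0, 1)}

print("P(M=1 | S, L=1):", conditioned)   # lower for smokers: the paradox
print("P(M=1 | do(S)): ", intervened)    # higher for smokers: the causal effect
```

With these numbers, smoking appears protective among low-birth-weight babies even though it raises mortality in every stratum of the model, which is the pattern that conditioning on a collider produces.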

In [13]:

bwpeModele = csl.CausalModel(bwpe)
cslnb.showCausalImpact(bwpeModele, "Mortality", doing="Smoking",values={})

Causal Model

$$\begin{equation*}P( Mortality \mid \text{do}(Smoking)) = \sum_{Birth defect,Low Birth Weight}{P\left(Birth defect\right) \cdot P\left(Mortality\mid Birth defect,Low Birth Weight,Smoking\right) \cdot P\left(Low Birth Weight\mid Birth defect,Smoking\right)}\end{equation*}$$
Explanation : Do-calculus computations

	Mortality
Smoking	0	1
0	0.5185	0.4815
1	0.6012	0.3988

Impact
