Learning and causality

In [1]:
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

Model

Let’s assume a process \(X_1\rightarrow Y_1\), with a control \(C_a\) on \(X_1\) and a parameter \(P_b\) acting on \(Y_1\).

In [2]:
bn=gum.fastBN("Ca->X1->Y1<-Pb")
bn.cpt("Ca").fillWith([0.8,0.2])
bn.cpt("Pb").fillWith([0.3,0.7])

bn.cpt("X1")[:]=[[0.9,0.1],[0.1,0.9]]

bn.cpt("Y1")[{"X1":0,"Pb":0}]=[0.8,0.2]
bn.cpt("Y1")[{"X1":1,"Pb":0}]=[0.2,0.8]
bn.cpt("Y1")[{"X1":0,"Pb":1}]=[0.6,0.4]
bn.cpt("Y1")[{"X1":1,"Pb":1}]=[0.4,0.6]

gnb.flow.row(bn,*[bn.cpt(x) for x in bn.nodes()])
[Output: the BN Ca->X1->Y1<-Pb and its CPTs]

P(Ca):                 Ca=0    Ca=1
                       0.8000  0.2000

P(X1 | Ca):            X1=0    X1=1
  Ca=0                 0.9000  0.1000
  Ca=1                 0.1000  0.9000

P(Y1 | X1, Pb):        Y1=0    Y1=1
  Pb=0, X1=0           0.8000  0.2000
  Pb=0, X1=1           0.2000  0.8000
  Pb=1, X1=0           0.6000  0.4000
  Pb=1, X1=1           0.4000  0.6000

P(Pb):                 Pb=0    Pb=1
                       0.3000  0.7000

Actually, the process is duplicated in the system, but the control \(C_a\) and the parameter \(P_b\) are shared.

In [3]:
bn.add("X2",2)
bn.add("Y2",2)
bn.addArc("X2","Y2")
bn.addArc("Ca","X2")
bn.addArc("Pb","Y2")

bn.cpt("X2").fillWith(bn.cpt("X1"),["X1","Ca"]) # copy cpt(X1) with the translation X2<-X1,Ca<-Ca
bn.cpt("Y2").fillWith(bn.cpt("Y1"),["Y1","X1","Pb"]) # copy cpt(Y1) with translation Y2<-Y1,X2<-X1,Pb<-Pb

gnb.flow.row(bn,bn.cpt("X2"),bn.cpt("Y2"))
[Output: the extended BN, with the duplicated process X2->Y2 sharing Ca and Pb, and the copied CPTs]

P(X2 | Ca):            X2=0    X2=1
  Ca=0                 0.9000  0.1000
  Ca=1                 0.1000  0.9000

P(Y2 | X2, Pb):        Y2=0    Y2=1
  Pb=0, X2=0           0.8000  0.2000
  Pb=0, X2=1           0.2000  0.8000
  Pb=1, X2=0           0.6000  0.4000
  Pb=1, X2=1           0.4000  0.6000

Simulation of the data

The process is partially observed: the control \(C_a\) has been taken into account, but the parameter \(P_b\) has not been identified and is therefore not collected.

In [4]:
# the databases will be saved in completeData="out/complete_data.csv" and observedData="out/observed_data.csv"
import os
completeData="out/complete_data.csv"
observedData="out/observed_data.csv"
fixedObsData="res/fixed_observed_data.csv"

# generating the complete data with pyAgrum
size=35000
#gum.generateSample(bn,5000,"data.csv",random_order=True)
generator=gum.BNDatabaseGenerator(bn)
generator.setRandomVarOrder()
generator.drawSamples(size)
generator.toCSV(completeData)
In [5]:
# selecting some variables using pandas
import pandas as pd
f=pd.read_csv(completeData)
keep_col = ["X1","Y1","X2","Y2","Ca"] # Pb is removed
new_f = f[keep_col]
new_f.to_csv(observedData, index=False)

In order to have reproducible results, we will now use the database fixed_observed_data.csv generated with the process above.
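
As a quick sanity check (a sketch, assuming pandas is available and the path fixedObsData defined above), one can verify that \(P_b\) is indeed absent from the observed columns:

import pandas as pd

# sketch: the observed database should contain everything except Pb
obs = pd.read_csv(fixedObsData)
print(sorted(obs.columns))   # expected: ['Ca', 'X1', 'X2', 'Y1', 'Y2']
print(len(obs))              # number of observed rows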

Statistical learning

Using a classical statistical learning method, one can approximate a model from the observed data.

In [6]:
learner=gum.BNLearner(fixedObsData)
learner.useGreedyHillClimbing()
bn2=learner.learnBN()
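
To get a feeling for how close the learned structure is to the original one, the two graphs can be displayed side by side (a sketch reusing the gnb.flow.row helper from above):

# sketch: visually compare the original structure with the learned one
gnb.flow.row(bn, bn2, captions=["original model", "learned model (Greedy Hill Climbing)"])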

Evaluating the impact of \(X_2\) on \(Y_1\)

Using the database, a question for the user is to evaluate the impact of the value of \(X_2\) on \(Y_1\).

In [7]:
target="Y1"
evs="X2"
ie=gum.LazyPropagation(bn)
ie2=gum.LazyPropagation(bn2)
p1=ie.evidenceImpact(target,[evs])
p2=gum.Potential(p1).fillWith(ie2.evidenceImpact(target,[evs]),[target,evs])
errs=((p1-p2)/p1).scale(100)
quaderr=(errs*errs).sum()
gnb.flow.row(p1,p2,errs,f"$${quaderr:3.5f}$$",
              captions=['in original model','in learned model','relative errors','quadratic error'])

P(Y1 | X2) in original model:
         Y1=0    Y1=1
  X2=0   0.6211  0.3789
  X2=1   0.4508  0.5492

P(Y1 | X2) in learned model:
         Y1=0    Y1=1
  X2=0   0.6183  0.3817
  X2=1   0.4415  0.5585

relative errors (%):
         Y1=0    Y1=1
  X2=0   0.4422  -0.7249
  X2=1   2.0453  -1.6787

quadratic error: 7.72226

Evaluating the causal impact of \(X_2\) on \(Y_1\) with the learned model

The statistician notes that the change the user wants to apply to \(X_2\) is not an observation but rather an intervention.
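
A way to see the difference, using the factorization of the original model: observing \(X_2\) lets information flow back through the shared control \(C_a\), whereas intervening on \(X_2\) cuts the arc \(C_a \rightarrow X_2\).

$$\begin{aligned}
P(Y_1 \mid X_2) &= \sum_{C_a, X_1, P_b} P(Y_1 \mid X_1, P_b)\, P(P_b)\, P(X_1 \mid C_a)\, P(C_a \mid X_2)\\
P(Y_1 \mid \hookrightarrow X_2) &= \sum_{C_a, X_1, P_b} P(Y_1 \mid X_1, P_b)\, P(P_b)\, P(X_1 \mid C_a)\, P(C_a) \;=\; P(Y_1)
\end{aligned}$$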

In [8]:
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb

model=csl.CausalModel(bn)
model2=csl.CausalModel(bn2)
In [9]:
cslnb.showCausalModel(model)
[Output: the causal model built from the original BN]
In [10]:
gum.config['notebook','graph_format']='svg'
cslnb.showCausalImpact(model,on=target, doing={evs})
cslnb.showCausalImpact(model2,on=target, doing={evs})
Causal impact in the original model:
$$\begin{equation*}P( Y1 \mid \hookrightarrow\mkern-6.5muX2) = \sum_{Ca}{P\left(Y1\mid Ca\right) \cdot P\left(Ca\right)}\end{equation*}$$
Explanation: backdoor ['Ca'] found.
Impact:  Y1=0    Y1=1
         0.5768  0.4232

Causal impact in the learned model:
$$\begin{equation*}P( Y1 \mid \hookrightarrow\mkern-6.5muX2) = \sum_{X1}{P\left(Y1\mid X1,X2\right) \cdot P\left(X1\right)}\end{equation*}$$
Explanation: backdoor ['X1'] found.
Impact:         Y1=0    Y1=1
         X2=0   0.5743  0.4257
         X2=1   0.5628  0.4372

Unfortunately, since \(P_b\) is not learned, the computation of the causal impact is still imprecise.

In [11]:
_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact2orig, _ = csl.causalImpact(model2, on=target, doing={evs})

impact2=gum.Potential(p2).fillWith(impact2orig,['Y1','X2'])
errs=((impact1-impact2)/impact1).scale(100)
quaderr=(errs*errs).sum()
gnb.flow.row(impact1,impact2,errs,f"$${quaderr:3.5f}$$",
              captions=['$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in original model',
                        '$P( Y_1 \mid \hookrightarrow X_2)$  <br/>in learned model',' <br/>relative errors',' <br/>quadratic error'])
$P( Y_1 \mid \hookrightarrow X_2)$ in original model:
         Y1=0    Y1=1
         0.5768  0.4232

$P( Y_1 \mid \hookrightarrow X_2)$ in learned model:
                Y1=0    Y1=1
         X2=0   0.5743  0.4257
         X2=1   0.5628  0.4372

relative errors (%):
                Y1=0    Y1=1
         X2=0   0.4414  -0.6015
         X2=1   2.4251  -3.3052

quadratic error: 17.36225

Just to be certain, we can verify that, in the original model, \(P( Y_1 \mid \hookrightarrow X_2)=P(Y_1)\).

In [12]:
gnb.flow.row(impact1,ie.evidenceImpact(target,[]),
               captions=["$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in the original model","$P(Y_1)$ <br/>in the original model"])
$P( Y_1 \mid \hookrightarrow X_2)$ in the original model:
         Y1=0    Y1=1
         0.5768  0.4232

$P(Y_1)$ in the original model:
         Y1=0    Y1=1
         0.5768  0.4232
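
As a cross-check, this value can be recovered by hand from the CPTs of the original model:

$$\begin{aligned}
P(X_1=0) &= 0.9 \cdot 0.8 + 0.1 \cdot 0.2 = 0.74\\
P(Y_1=0) &= \sum_{X_1, P_b} P(Y_1=0 \mid X_1, P_b)\, P(X_1)\, P(P_b)\\
&= 0.3\,(0.8 \cdot 0.74 + 0.2 \cdot 0.26) + 0.7\,(0.6 \cdot 0.74 + 0.4 \cdot 0.26) = 0.5768
\end{aligned}$$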

Causal learning and causal impact

Some learning algorithms, such as MIIC (Verny et al., 2017), aim to find the trace of latent variables in the data.

In [13]:
learner=gum.BNLearner(fixedObsData)
learner.useMIIC()
bn3=learner.learnBN()
In [14]:
gnb.flow.row(bn,bn3,f"$${[(bn3.variable(i).name(),bn3.variable(j).name()) for (i,j) in learner.latentVariables()]}$$",
              captions=['original model','learned model','Latent variables found'])
original model: Ca->X1, Ca->X2, X1->Y1, X2->Y2, Pb->Y1, Pb->Y2
learned model:  Ca->X1, X2->Ca, X1->Y1, X2->Y2, Y2->Y1
Latent variables found: $$[('Y2', 'Y1')]$$

Then we can build a causal model taking into account this latent variable found by MIIC.
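
As a variant (a sketch, not the cell below, which writes the descriptor by hand), the latent-variable descriptor could be derived directly from the pairs returned by learner.latentVariables():

# hypothetical variant: build the latent-variable descriptor from MIIC's output
latents = [(f"L{k}", (bn3.variable(i).name(), bn3.variable(j).name()))
           for k, (i, j) in enumerate(learner.latentVariables(), start=1)]
model3 = csl.CausalModel(bn2, latents)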

In [15]:
model3=csl.CausalModel(bn2,[("L1",("Y1","Y2"))])
cslnb.showCausalImpact(model3,target, {evs})
Causal model with the latent variable L1 over (Y1, Y2):
$$\begin{equation*}P( Y1 \mid \hookrightarrow\mkern-6.5muX2) = \sum_{X1}{P\left(Y1\mid X1\right) \cdot P\left(X1\right)}\end{equation*}$$
Explanation: backdoor ['X1'] found.
Impact:  Y1=0    Y1=1
         0.5725  0.4275

From the data, the statistician can at least conclude that \(X_2\) has no causal impact on \(Y_1\). The remaining error is only due to the approximation of the parameters learned from the database.

In [16]:
_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact3orig, _ = csl.causalImpact(model3, on=target, doing={evs})

impact3=gum.Potential(impact1).fillWith(impact3orig,['Y1'])
errs=((impact1-impact3)/impact1).scale(100)
quaderr=(errs*errs).sum()
gnb.flow.row(impact1,impact3,errs,f"$${quaderr:3.5f}$$",
              captions=['in original model','in learned model','relative errors','quadratic error'])
in original model:
         Y1=0    Y1=1
         0.5768  0.4232

in learned model:
         Y1=0    Y1=1
         0.5725  0.4275

relative errors (%):
         Y1=0    Y1=1
         0.7454  -1.0160

quadratic error: 1.58783