Learning and causality
In [1]:
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
Model
Let’s assume a process \(X_1\rightarrow Y_1\), with a control \(C_a\) on \(X_1\) and a parameter \(P_b\) acting on \(Y_1\).
In [2]:
bn=gum.fastBN("Ca->X1->Y1<-Pb")
bn.cpt("Ca").fillWith([0.8,0.2])
bn.cpt("Pb").fillWith([0.3,0.7])
bn.cpt("X1")[:]=[[0.9,0.1],[0.1,0.9]]
bn.cpt("Y1")[{"X1":0,"Pb":0}]=[0.8,0.2]
bn.cpt("Y1")[{"X1":1,"Pb":0}]=[0.2,0.8]
bn.cpt("Y1")[{"X1":0,"Pb":1}]=[0.6,0.4]
bn.cpt("Y1")[{"X1":1,"Pb":1}]=[0.4,0.6]
gnb.flow.row(bn,*[bn.cpt(x) for x in bn.nodes()])
(the network \(C_a \rightarrow X_1 \rightarrow Y_1 \leftarrow P_b\) is displayed, followed by its CPTs:)

\(P(C_a)\):

| \(C_a=0\) | \(C_a=1\) |
|---|---|
| 0.8000 | 0.2000 |

\(P(X_1 \mid C_a)\):

| \(C_a\) | \(X_1=0\) | \(X_1=1\) |
|---|---|---|
| 0 | 0.9000 | 0.1000 |
| 1 | 0.1000 | 0.9000 |

\(P(Y_1 \mid X_1, P_b)\):

| \(P_b\) | \(X_1\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|---|
| 0 | 0 | 0.8000 | 0.2000 |
| 0 | 1 | 0.2000 | 0.8000 |
| 1 | 0 | 0.6000 | 0.4000 |
| 1 | 1 | 0.4000 | 0.6000 |

\(P(P_b)\):

| \(P_b=0\) | \(P_b=1\) |
|---|---|
| 0.3000 | 0.7000 |
The process is actually duplicated in the system, but the control \(C_a\) and the parameter \(P_b\) are shared by both copies.
In [3]:
bn.add("X2",2)
bn.add("Y2",2)
bn.addArc("X2","Y2")
bn.addArc("Ca","X2")
bn.addArc("Pb","Y2")
bn.cpt("X2").fillWith(bn.cpt("X1"),["X1","Ca"]) # copy cpt(X1) with the translation X2<-X1,Ca<-Ca
bn.cpt("Y2").fillWith(bn.cpt("Y1"),["Y1","X1","Pb"]) # copy cpt(Y1) with translation Y2<-Y1,X2<-X1,Pb<-Pb
gnb.flow.row(bn,bn.cpt("X2"),bn.cpt("Y2"))
(the extended network is displayed, followed by the CPTs of \(X_2\) and \(Y_2\):)

\(P(X_2 \mid C_a)\):

| \(C_a\) | \(X_2=0\) | \(X_2=1\) |
|---|---|---|
| 0 | 0.9000 | 0.1000 |
| 1 | 0.1000 | 0.9000 |

\(P(Y_2 \mid X_2, P_b)\):

| \(P_b\) | \(X_2\) | \(Y_2=0\) | \(Y_2=1\) |
|---|---|---|---|
| 0 | 0 | 0.8000 | 0.2000 |
| 0 | 1 | 0.2000 | 0.8000 |
| 1 | 0 | 0.6000 | 0.4000 |
| 1 | 1 | 0.4000 | 0.6000 |
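As a quick sanity check, we can verify that the copied CPTs are numerically identical to the originals, using the same `fillWith` translation mechanism in the opposite direction (a minimal sketch relying only on `Potential` operations already used in this notebook):

In [ ]:
# copy cpt(X2) into a potential shaped like cpt(X1) (translation X1<-X2, Ca<-Ca)
# and compare: the maximal absolute difference should be exactly 0
diff=gum.Potential(bn.cpt("X1")).fillWith(bn.cpt("X2"),["X2","Ca"])-bn.cpt("X1")
print(diff.abs().max()) # expected: 0.0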
Simulation of the data
The process is only partially observed: the control is recorded, but the parameter has not been identified and is therefore not collected.
In [4]:
# the databases will be saved in completeData="out/complete_data.csv" and observedData="out/observed_data.csv"
import os
os.makedirs("out",exist_ok=True) # make sure the output folder exists
completeData="out/complete_data.csv"
observedData="out/observed_data.csv"
# generating complete data with pyAgrum
size=35000
#gum.generateSample(bn,5000,"data.csv",random_order=True)
generator=gum.BNDatabaseGenerator(bn)
generator.setRandomVarOrder()
generator.drawSamples(size)
generator.toCSV(completeData)
In [5]:
# selecting some variables using pandas
import pandas as pd
f=pd.read_csv(completeData)
keep_col = ["X1","Y1","X2","Y2","Ca"] # Pb is removed
new_f = f[keep_col]
new_f.to_csv(observedData, index=False)
We will now use the database fixed_observed_data.csv. While both databases originate from the same process (the cell above), using fixed_observed_data.csv instead of observed_data.csv guarantees a deterministic and stable behavior for the rest of the notebook.
In [6]:
fixedObsData="res/fixed_observed_data.csv"
Statistical learning
Using a classical statistical learning method, one can approximate a model from the observed data.
In [7]:
learner=gum.BNLearner(fixedObsData)
learner.useGreedyHillClimbing()
bn2=learner.learnBN()
bn2
Out[7]:
(graph of the learned network bn2)
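To quantify how close the learned structure is to the ground truth, one possibility is pyAgrum's BN comparison helper (a sketch; `GraphicalBNComparator` and its `hamming()` method are assumed to be available in `pyAgrum.lib.bn_vs_bn`):

In [ ]:
from pyAgrum.lib.bn_vs_bn import GraphicalBNComparator
# structural comparison between the original and the learned networks
comparator=GraphicalBNComparator(bn,bn2)
print(comparator.hamming()) # (structural) Hamming distances between the two graphs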
Evaluating the impact of \(X_2\) on \(Y_1\)
Using the database, the user would like to evaluate the impact of the value of \(X_2\) on \(Y_1\).
In [8]:
target="Y1"
evs="X2"
ie=gum.LazyPropagation(bn)
ie2=gum.LazyPropagation(bn2)
p1=ie.evidenceImpact(target,[evs])
p2=gum.Potential(p1).fillWith(ie2.evidenceImpact(target,[evs]),[target,evs])
errs=((p1-p2)/p1)
quaderr1=(errs*errs).sum()
gnb.flow.row(p1,p2,errs,rf"$${100*quaderr1:3.5f}\%$$",
captions=['in original model','in learned model','relative errors','quadratic relative error'])
\(P(Y_1 \mid X_2)\) in the original model:

| \(X_2\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|
| 0 | 0.6211 | 0.3789 |
| 1 | 0.4508 | 0.5492 |

\(P(Y_1 \mid X_2)\) in the learned model:

| \(X_2\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|
| 0 | 0.6183 | 0.3817 |
| 1 | 0.4415 | 0.5585 |

Relative errors:

| \(X_2\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|
| 0 | 0.0044 | -0.0072 |
| 1 | 0.0205 | -0.0168 |

Quadratic relative error: \(0.07722\%\)
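Note that `evidenceImpact(target,[evs])` simply collects the posteriors \(P(Y_1 \mid X_2=x)\) for every value \(x\); a minimal sketch of this equivalence for \(X_2=0\):

In [ ]:
# P(Y1 | X2=0) computed with explicit evidence should match the first line of p1
ie_check=gum.LazyPropagation(bn)
ie_check.setEvidence({"X2":0})
ie_check.makeInference()
print(ie_check.posterior("Y1"))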
Evaluating the causal impact of \(X_2\) on \(Y_1\) with the learned model
The statistician notes that the change the user wants to apply to \(X_2\) is not an observation but rather an intervention.
In [9]:
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb
model=csl.CausalModel(bn)
model2=csl.CausalModel(bn2)
In [10]:
cslnb.showCausalModel(model)
In [11]:
gum.config['notebook','graph_format']='svg'
cslnb.showCausalImpact(model,on=target, doing={evs})
cslnb.showCausalImpact(model2,on=target, doing={evs})
(for each call, the causal graph and the identified formula are displayed, followed by the computed impact:)

\(P(Y_1 \mid \hookrightarrow X_2)\) in the original model (it does not depend on \(X_2\)):

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5768 | 0.4232 |

\(P(Y_1 \mid \hookrightarrow X_2)\) in the learned model:

| \(X_2\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|
| 0 | 0.5743 | 0.4257 |
| 1 | 0.5628 | 0.4372 |
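In the original model, intervening on \(X_2\) amounts to cutting the arcs pointing to \(X_2\) (graph mutilation), which removes the back-door path \(X_2 \leftarrow C_a \rightarrow X_1 \rightarrow Y_1\); hence \(X_2\) carries no information about \(Y_1\) and \(P(Y_1 \mid \hookrightarrow X_2)=P(Y_1)\). A hand-made sketch of this mutilation (assuming `eraseArc` accepts variable names, as `addArc` does above):

In [ ]:
# mutilated network: removing Ca->X2 mimics do(X2) in the true model
bn_do=gum.BayesNet(bn)
bn_do.eraseArc("Ca","X2")
ie_do=gum.LazyPropagation(bn_do)
print(ie_do.evidenceImpact("Y1",["X2"])) # both lines should equal P(Y1)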
Unfortunately, since \(P_b\) is not part of the learned model, the computation of the causal impact is still imprecise.
In [12]:
_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact2orig, _ = csl.causalImpact(model2, on=target, doing={evs})
impact2=gum.Potential(p2).fillWith(impact2orig,['Y1','X2'])
errs=((impact1-impact2)/impact1)
quaderr2=(errs*errs).sum()
gnb.flow.row(impact1,impact2,errs,rf"$${100*quaderr2:3.5f}\%$$",
captions=[r'$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in original model',
r'$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in learned model',' <br/>relative errors',' <br/>quadratic relative error'])
\(P( Y_1 \mid \hookrightarrow X_2)\) in the original model:

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5768 | 0.4232 |

\(P( Y_1 \mid \hookrightarrow X_2)\) in the learned model:

| \(X_2\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|
| 0 | 0.5743 | 0.4257 |
| 1 | 0.5628 | 0.4372 |

Relative errors:

| \(X_2\) | \(Y_1=0\) | \(Y_1=1\) |
|---|---|---|
| 0 | 0.0044 | -0.0060 |
| 1 | 0.0243 | -0.0331 |

Quadratic relative error: \(0.17362\%\)
Just to be certain, we can verify that, in the original model, \(P( Y_1 \mid \hookrightarrow X_2)=P(Y_1)\):
In [13]:
gnb.flow.row(impact1,ie.evidenceImpact(target,[]),
captions=[r"$P( Y_1 \mid \hookrightarrow X_2)$ <br/>in the original model","$P(Y_1)$ <br/>in the original model"])
\(P( Y_1 \mid \hookrightarrow X_2)\) in the original model:

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5768 | 0.4232 |

\(P(Y_1)\) in the original model:

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5768 | 0.4232 |
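Indeed, this value can be recomputed by hand from the CPTs of the original model:
\[P(X_1{=}1)=0.8\times 0.1+0.2\times 0.9=0.26,\qquad P(X_1{=}0)=0.74\]
\[P(Y_1{=}1)=0.74\,(0.3\times 0.2+0.7\times 0.4)+0.26\,(0.3\times 0.8+0.7\times 0.6)=0.4232\]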
Causal learning and causal impact
Some learning algorithms, such as MIIC (Verny et al., 2017), aim at finding the trace of latent variables in the data!
In [14]:
learner=gum.BNLearner(fixedObsData)
learner.useMIIC()
bn3=learner.learnBN()
In [15]:
gnb.flow.row(bn,bn3,f"$${[(bn3.variable(i).name(),bn3.variable(j).name()) for (i,j) in learner.latentVariables()]}$$",
captions=['original model','learned model','Latent variables found'])
A latent variable (common cause) has been found in the data between \(Y_1\) and \(Y_2\)!
Therefore we can build a causal model that takes into account this latent variable found by MIIC.
In [16]:
model3=csl.CausalModel(bn2,[("L1",("Y1","Y2"))])
cslnb.showCausalImpact(model3,target, {evs})
(the causal graph, now including the latent variable \(L_1\), and the identified formula are displayed, followed by the impact:)

\(P(Y_1 \mid \hookrightarrow X_2)\) in the model with the latent variable (it does not depend on \(X_2\)):

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5725 | 0.4275 |
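It can be instructive to inspect the estimand that the do-calculus actually identified for this model (a sketch, assuming the `(formula, potential, explanation)` return convention used elsewhere in this notebook and a `toLatex()` method on the formula object):

In [ ]:
# how was the causal effect identified in the presence of the latent variable L1?
formula,pot,explanation=csl.causalImpact(model3,on=target,doing={evs})
print(explanation) # textual justification from the do-calculus
print(formula.toLatex()) # the identified adjustment formula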
The statistician can now at least conclude from the data that \(X_2\) has no causal impact on \(Y_1\): the remaining error is only due to the approximation of the parameters from the database.
In [17]:
_, impact1, _ = csl.causalImpact(model, on=target, doing={evs})
_, impact3orig, _ = csl.causalImpact(model3, on=target, doing={evs})
impact3=gum.Potential(impact1).fillWith(impact3orig,['Y1'])
errs=((impact1-impact3)/impact1)
quaderr3=(errs*errs).sum()
gnb.flow.row(impact1,impact3,errs,rf"$${100*quaderr3:3.5f}\%$$",
captions=['in original model','in learned model','relative errors','quadratic relative error'])
\(P(Y_1 \mid \hookrightarrow X_2)\) in the original model:

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5768 | 0.4232 |

\(P(Y_1 \mid \hookrightarrow X_2)\) in the learned model:

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.5725 | 0.4275 |

Relative errors:

| \(Y_1=0\) | \(Y_1=1\) |
|---|---|
| 0.0075 | -0.0102 |

Quadratic relative error: \(0.01588\%\)
In [18]:
print("In conclusion :")
print(rf"- Error with spurious structure and classical inference : {100*quaderr1:3.5f}%")
print(rf"- Error with spurious structure and do-calculus : {100*quaderr2:3.5f}%")
print(rf"- Error with correct causal structure and do-calculus : {100*quaderr3:3.5f}%")
In conclusion :
- Error with spurious structure and classical inference : 0.07722%
- Error with spurious structure and do-calculus : 0.17362%
- Error with correct causal structure and do-calculus : 0.01588%