Multinomial Simpson Paradox

This notebook builds a model exhibiting a multinomial Simpson paradox.


In [1]:
import matplotlib.pyplot as plt
import random

import pandas as pd
import numpy as np

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb

Building the models

In [2]:
# building a model including a Simpson's paradox
def fillWithUniform(p, fmin=None, fmax=None):
  """Fill the CPT p with, for each parent configuration, a uniform distribution
  over the values of its first variable restricted to [fmin, fmax], where fmin
  and fmax are expressions evaluated on the values of the conditioning variables."""
  if fmin is None:
    vmin = 0
  if fmax is None:
    vmax = p.variable(0).domainSize() - 1

  # numerical bounds of the conditioned variable
  mi = int(p.variable(0).numerical(0))
  ma = int(p.variable(0).numerical(p.variable(0).domainSize() - 1))

  p.fillWith(0)

  I = gum.Instantiation(p)

  I.setFirst()
  while not I.end():
    # values of the conditioning variables for the current instantiation
    vars = {p.variable(i).name(): p.variable(i).numerical(I.val(i)) for i in range(1, p.nbrDim())}
    if fmin is not None:
      vmin = int(eval(fmin, None, vars))
    if fmax is not None:
      vmax = int(eval(fmax, None, vars))
    # clamp the bounds to the domain of the conditioned variable
    vmin = min(max(vmin, mi), ma)
    vmax = min(max(vmax, mi), ma)

    # uniform mass on [vmin, vmax]
    for pos in range(vmin, vmax + 1):
      I.chgVal(0, pos)
      p.set(I, 1)
    I.incNotVar(p.variable(0))
  p.normalizeAsCPT()

size=70
sizeZ=5
bn=gum.fastBN(f"A[0,{size-1}]->B[0,{size-1}]<-C[0,{sizeZ-1}]->A")

bn.cpt("C").fillWith(1).normalize()
fillWithUniform(bn.cpt("A"),fmin="C*12",fmax="C*12+30")
bn.cpt("B").fillWithFunction("5+C*4-int(A/8)",[0.05,0.2,0.5,0.2,0.05]);
In [3]:
# generating a CSV, taking this model as the causal one.
gum.generateSample(bn,400,"out/sample.csv")
df=pd.read_csv("out/sample.csv")
df.plot.scatter(x='A', y='B', c='C',colormap="tab20");
[scatter plot of B against A, colored by C]
In [4]:
cm=csl.CausalModel(bn)
_,p,_=csl.causalImpact(cm,on="B",doing="A")
In [5]:
# building a Markov-equivalent model, generating a CSV, and taking this model as the causal one.
bn2=gum.BayesNet(bn)
bn2.reverseArc("C","A")

gum.generateSample(bn2,400,"out/sample2.csv")
df2=pd.read_csv("out/sample2.csv")

cm2=csl.CausalModel(bn2)
_,p2,_=csl.causalImpact(cm2,on="B",doing="A")

The observational model and its paradoxical structure (exactly the same for the second, Markov-equivalent, model)

In [6]:
gnb.flow.row(gnb.getBN(bn),
             df.plot.scatter(x='A', y='B'),
             df.plot.scatter(x='A', y='B', c='C',colormap="tab20"),
             captions=["the observational model","the trend is increasing","the trend is decreasing for every value of C!"])
gnb.flow.row(gnb.getBN(bn2),
             df2.plot.scatter(x='A', y='B'),
             df2.plot.scatter(x='A', y='B', c='C',colormap="tab20"),
             captions=["the Markov-equivalent model","the trend is increasing","the trend is decreasing for every value of C!"])
the observational model

the trend is increasing

the trend is decreasing for every value of C!

the Markov-equivalent model

the trend is increasing

the trend is decreasing for every value of C!

The paradox is revealed in the trend of the inferred means: the mean of \(B\) increases with the value of \(A\) overall, yet decreases with \(A\) for every fixed value of \(C\).

In [7]:
for v in [10,20,30]:
  gnb.flow.add_html(gnb.getPosterior(bn,target="B",evs={"A":v}),f"$P(B|A={v})$")
gnb.flow.new_line()
for v in [10,20,30]:
  gnb.flow.add_html(gnb.getPosterior(bn,target="B",evs={"A":v,"C":0}),f"$P(B|A={v},C=0)$")
gnb.flow.new_line()
for v in [10,20,30]:
  gnb.flow.add_html(gnb.getPosterior(bn,target="B",evs={"A":v,"C":2}),f"$P(B|A={v},C=2)$")
gnb.flow.new_line()
for v in [10,20,30]:
  gnb.flow.add_html(gnb.getPosterior(bn,target="B",evs={"A":v,"C":4}),f"$P(B|A={v},C=4)$")
gnb.flow.display()

$P(B|A=10)$

$P(B|A=20)$

$P(B|A=30)$


$P(B|A=10,C=0)$

$P(B|A=20,C=0)$

$P(B|A=30,C=0)$


$P(B|A=10,C=2)$

$P(B|A=20,C=2)$

$P(B|A=30,C=2)$


$P(B|A=10,C=4)$

$P(B|A=20,C=4)$

$P(B|A=30,C=4)$
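The reversal can also be checked directly on data. A minimal pandas/numpy sketch, with hypothetical toy numbers (not the notebook's sample): the aggregate regression slope of \(B\) on \(A\) is positive, while the slope within each stratum of \(C\) is negative.

```python
# Illustrative toy data (hypothetical numbers, not the notebook's sample):
# within each stratum of C the trend of B in A is decreasing, yet the
# aggregate trend is increasing because C shifts A and B in the same direction.
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    "C": [0, 0, 0, 1, 1, 1],
    "A": [0, 1, 2, 3, 4, 5],
    "B": [2, 1, 0, 7, 6, 5],
})

# slope of the aggregate regression of B on A (positive)
slope_all = np.polyfit(df_toy["A"], df_toy["B"], 1)[0]
# slope within each stratum of C (negative in both strata)
slopes_c = {c: np.polyfit(g["A"], g["B"], 1)[0] for c, g in df_toy.groupby("C")}

print(slope_all, slopes_c)
```

The same `groupby` computation on the generated `df` would show the same sign flip between the pooled and the per-`C` slopes.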

Now that the paradoxical structure is understood and the paradox is revealed, should we observe \(C\) (or not) before deciding to increase or decrease \(A\) (with the goal of maximizing \(B\))?

Of course, it depends on the causal structure of the problem!

In [8]:
gnb.flow.add_html(cslnb.getCausalModel(cm),"the first causal model")
for v in [10,20,30]:
  gnb.flow.add_html(gnb.getProba(p.extract({'A':v})),f"Doing $A={v}$")
gnb.flow.display()
the first causal model

Doing $A=10$

Doing $A=20$

Doing $A=30$

If \(C\) is a cause of \(A\), observing \(C\) genuinely provides new information about \(B\).
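In this case \(C\) is a confounder, and the interventional distribution is given by the backdoor adjustment \(P(B\mid do(A=a))=\sum_c P(c)\,P(B\mid a,c)\). A minimal numpy sketch with hypothetical binary CPTs (illustrative numbers, not the notebook's tables) shows that the adjustment can even reverse the observational trend:

```python
# Hypothetical binary CPTs for the confounded structure C -> A, {A,C} -> B
# (illustrative numbers only, not the notebook's tables).
import numpy as np

p_c = np.array([0.5, 0.5])                    # P(C=c)
p_a_given_c = np.array([[0.9, 0.1],           # P(A=a | C=0)
                        [0.1, 0.9]])          # P(A=a | C=1)
p_b1 = np.array([[0.3, 0.9],                  # P(B=1 | A=0, C=c)
                 [0.1, 0.7]])                 # P(B=1 | A=1, C=c)

# observational: P(B=1 | A=a) = sum_c P(c | a) P(B=1 | a, c)
p_a = p_c @ p_a_given_c                                       # P(A=a)
p_c_given_a = (p_a_given_c * p_c[:, None]).T / p_a[:, None]   # [a, c]
obs = (p_c_given_a * p_b1).sum(axis=1)

# interventional: backdoor adjustment on the confounder C
# P(B=1 | do(A=a)) = sum_c P(c) P(B=1 | a, c)
do = p_b1 @ p_c

print(obs)  # [0.36 0.64] : observing A=1 predicts a higher B
print(do)   # [0.6  0.4 ] : forcing A=1 actually lowers B
```

This is exactly the situation computed by `csl.causalImpact(cm, on="B", doing="A")` above, where conditioning and intervening disagree.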

In [9]:
gnb.flow.add_html(cslnb.getCausalModel(cm2),"the second causal model")
for v in [10,20,30]:
  gnb.flow.add_html(gnb.getProba(p2.extract({'A':v})),f"Doing $A={v}$")
gnb.flow.display()
the second causal model

Doing $A=10$

Doing $A=20$

Doing $A=30$

If \(A\) is a cause of \(C\), observing \(C\) may lead to misinterpreting the causal role of \(A\).
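A matching sketch for the reversed arc, again with hypothetical binary CPTs: when \(A \to C\), \(C\) is a mediator, there is no backdoor path from \(A\) to \(B\), and the truncated factorization gives \(P(B\mid do(A=a)) = P(B\mid a)\).

```python
# Hypothetical binary CPTs for the reversed structure A -> C, {A,C} -> B
# (illustrative numbers only, not the notebook's tables). Here C is a
# mediator: there is no backdoor path from A to B, so the truncated
# factorization leaves P(C|A) intact and P(B | do(A=a)) = P(B | A=a).
import numpy as np

p_a = np.array([0.5, 0.5])                    # P(A=a)
p_c_given_a = np.array([[0.9, 0.1],           # P(C=c | A=0)
                        [0.1, 0.9]])          # P(C=c | A=1)
p_b1 = np.array([[0.3, 0.9],                  # P(B=1 | A=0, C=c)
                 [0.1, 0.7]])                 # P(B=1 | A=1, C=c)

# interventional: P(B=1 | do(A=a)) = sum_c P(c | a) P(B=1 | a, c)
do = (p_c_given_a * p_b1).sum(axis=1)

# observational P(B=1 | A=a), computed from the full joint, is identical
joint_b1 = p_a[:, None] * p_c_given_a * p_b1  # P(A=a, C=c, B=1)
obs = joint_b1.sum(axis=1) / p_a

print(np.allclose(do, obs))  # True: no adjustment is needed in this model
```

Note that in these tables \(P(B=1\mid A,C)\) still decreases with \(A\) for each fixed \(C\): conditioning on the mediator would wrongly suggest a negative effect of \(A\), while its total effect on \(B\) is positive.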
