Dirichlet prior


Dirichlet prior as database

BNLearner gives access to many priors for parameter and structure learning. One of them is the Dirichlet prior, which needs a prior value for every possible parameter in a BN. aGrUM/pyAgrum allows one to use a database as the source of such a Dirichlet prior.
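As a quick preview (a minimal sketch, not taken verbatim from this notebook; the file names data.csv and prior.csv are hypothetical, and the prior database must contain the same variables as the learning database), the two calls demonstrated below are:

import pyAgrum as gum

learner = gum.BNLearner("data.csv")            # hypothetical learning database
learner.useDirichletPrior("prior.csv", 1000)   # hypothetical Dirichlet database, read with an equivalent sample size of 1000
bn = learner.learnBN()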

In [1]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt

import os

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.lib.explain as explain

sizePrior=30000
sizeData=20000

# the databases will be saved in "out/dirichlet_database.csv" and "out/observation_database.csv"
dirichletDatabase="out/dirichlet_database.csv"
obsDatabase="out/observation_database.csv"

Generating the databases for the Dirichlet prior and for the learning

In [2]:
bnPrior = gum.fastBN("A->B;C;D")
bnData = gum.fastBN("A->B->C->D")
bnData.cpt("B").fillWith([0.99,0.01,
                          0.01,0.99])
bnData.cpt("C").fillWith([0.9,0.1,
                          0.1,0.9])
bnData.cpt("D").fillWith([0.9,0.1,
                          0.1,0.9])
bnPrior.cpt("B").fillWith(bnData.cpt("B"))

gum.generateSample(bnPrior, sizePrior, dirichletDatabase, with_labels=True,random_order=True)

gum.generateSample(bnData, sizeData, obsDatabase, with_labels=True,random_order=False)

gnb.sideBySide(bnData,bnPrior,
               captions=[f"Database ({sizeData} cases)",f"Prior ({sizePrior} cases)"])
[side-by-side graphs: bnData (A->B->C->D), captioned "Database (20000 cases)", and bnPrior (A->B ; C ; D), captioned "Prior (30000 cases)"]

Learning databases

In [3]:
# learn a structure from each database separately
learnerData = gum.BNLearner(obsDatabase)
learnerPrior = gum.BNLearner(dirichletDatabase)
learnerData.useScoreBIC()
learnerPrior.useScoreBIC()
gnb.sideBySide(learnerData.learnBN(),learnerPrior.learnBN(),
              captions=["Learning from Data","Learning from Prior"])
[learned graphs, side by side: "Learning from Data" (the chain D->C->B->A) and "Learning from Prior" (the single arc B->A)]

Learning with Dirichlet prior

Now we use the Dirichlet prior. To get an idea of the influence of the prior, we vary the weights of the Data and of the Prior from [0,1] to [1,0] using a \(ratio \in [0,1]\). The weight of a database is the sum of the weights of its rows, so what is given is in fact an equivalent sample size. For instance, with \(ratio=0.2\) and \(sizeData=20000\), the prior receives an equivalent sample size of 4000 and the data a weight of 16000.

In [4]:
learner = gum.BNLearner(obsDatabase, bnPrior)
print(learner)
Filename       : out/observation_database.csv
Size           : (20000,4)
Variables      : A[2], B[2], C[2], D[2]
Induced types  : False
Missing values : False
Algorithm      : MIIC
Score          : BDeu  (Not used for constraint-based algorithms)
Correction     : MDL  (Not used for score-based algorithms)
Prior          : -

In [5]:
def learnWithRatio(ratio):
    # bnPrior is used to give the variables and their domains

    learner = gum.BNLearner(obsDatabase, bnPrior)
    learner.useGreedyHillClimbing()
    learner.useDirichletPrior(dirichletDatabase,ratio*sizeData)
    learner.setDatabaseWeight((1-ratio)*sizeData)
    learner.useScoreBIC() # or another score with no included prior
    return learner.learnBN()

ratios=[0.0,0.01,0.05,0.2,0.5,0.8,0.9,0.95,0.99,1.0]
bns=[learnWithRatio(r) for r in ratios]
gnb.sideBySide(*bns,
              captions=[*[f"with ratio {r}<br/> [datasize : {(int(r*sizeData),int((1-r)*sizeData))}]" for r in ratios]],
              valign="bottom")

[learned BNs for each ratio, captioned with the (prior, data) equivalent sample sizes; intermediate ratios produce denser graphs:
ratio 0.0 (0, 20000), ratio 0.01 (200, 19800), ratio 0.05 (1000, 19000), ratio 0.2 (4000, 16000), ratio 0.5 (10000, 10000), ratio 0.8 (16000, 3999), ratio 0.9 (18000, 1999), ratio 0.95 (19000, 1000), ratio 0.99 (19800, 200), ratio 1.0 (20000, 0)]

The BNs learned when mixing the two data sources (with \(ratio \in [0.01,0.99]\)) look much more complex than both the data structure and the Dirichlet structure. This may seem odd. However, if one looks at the mutual information carried by each arc,

In [6]:
gnb.sideBySide(*[explain.getInformation(bn) for bn in bns],
              captions=[*[f"with ratio {r}<br/> [datasize : {r*sizePrior+(1-r)*sizeData}]" for r in ratios]],
              valign="bottom")
[the same learned BNs, now displayed with the mutual information of each arc, captioned with the total equivalent sample size:
ratio 0.0 (20000.0), ratio 0.01 (20100.0), ratio 0.05 (20500.0), ratio 0.2 (22000.0), ratio 0.5 (25000.0), ratio 0.8 (28000.0), ratio 0.9 (29000.0), ratio 0.95 (29500.0), ratio 0.99 (29900.0), ratio 1.0 (30000.0)]

These extra arcs clearly represent weak, spurious correlations created by the mixing of the two distributions (see Pennock and Wellman, 1999), and they become weaker as the weight of the prior increases.
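To illustrate this mixing effect (a small illustrative sketch, not part of the original notebook, using only numpy): mixing two distributions in which A and B are independent can yield a mixture in which A and B are dependent, which is exactly what creates these extra arcs.

import numpy as np

# component 1 : A and B independent, both mostly equal to 0
p1_a, p1_b = np.array([0.99, 0.01]), np.array([0.99, 0.01])
# component 2 : A and B independent, both mostly equal to 1
p2_a, p2_b = np.array([0.01, 0.99]), np.array([0.01, 0.99])

mix = 0.5*np.outer(p1_a, p1_b) + 0.5*np.outer(p2_a, p2_b)   # 50/50 mixture of the two joints
pa, pb = mix.sum(axis=1), mix.sum(axis=0)                    # marginals of the mixture
mi = (mix*np.log2(mix/np.outer(pa, pb))).sum()               # mutual information I(A;B)
print(f"I(A;B) in the mixture: {mi:.2f} bits")               # > 0 : a spurious dependence appears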

Another way to look at the mixing is to plot the Kullback-Leibler divergence between the learned BNs and the two templates (\(bnData\) and \(bnPrior\)).
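As a reminder of the quantities computed below: for kl = gum.ExactBNdistance(P,Q), klPQ denotes \(KL(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}\) and klQP denotes \(KL(Q\|P)\); minimizing the former over \(Q\) yields the M-projection of \(P\), minimizing the latter its I-projection.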

In [7]:
ratios=[i/100.0 for i in range(101)]
bns=[learnWithRatio(r) for r in ratios]


def kls(i):
    kl=gum.ExactBNdistance(bnPrior,bns[i])
    y1=kl.compute()
    kl=gum.ExactBNdistance(bnData,bns[i])
    y2=kl.compute()
    return y1['klPQ'],y2['klPQ'],y1['klQP'],y2['klQP']


fig=figure(figsize=(10,6))
ax  = fig.add_subplot(1, 1, 1)

x=ratios
y1,y2,y3,y4=zip(*[kls(i) for i in range(len(ratios))])
ax.plot(x,y1,label="M-projection with bnPrior")
ax.plot(x,y3,label="I-projection with bnPrior")
ax.plot(x,y2,label="M-projection with bnData")
ax.plot(x,y4,label="I-projection with bnData")

ax.set_xlabel("weight ratio between data and prior")
ax.set_ylabel("KL")
ax.legend(bbox_to_anchor=(0.15, 0.88, 0.7, .102), loc=3,ncol=2, mode="expand", borderaxespad=0.)
t=ax.set_title("Weight ratio's Impact on KLs")
plt.show()
[figure: Weight ratio's impact on the KL divergences (M- and I-projections w.r.t. bnPrior and bnData)]

We can use other divergences (or distances)

In [8]:
def distances(i):
    kl=gum.ExactBNdistance(bnPrior,bns[i])
    y1=kl.compute()
    kl=gum.ExactBNdistance(bnData,bns[i])
    y2=kl.compute()
    return y1['hellinger'],y2['hellinger'],y1['bhattacharya'],y2['bhattacharya'],y1['jensen-shannon'],y2['jensen-shannon']


fig=figure(figsize=(10,6))
ax  = fig.add_subplot(1, 1, 1)

x=ratios
y1,y2,y3,y4,y5,y6=zip(*[distances(i) for i in range(len(ratios))])
ax.plot(x,y1,label="Hellinger with bnPrior")
ax.plot(x,y3,label="Bhattacharya with bnPrior")
ax.plot(x,y5,label="Jensen-Shannon with bnPrior")
ax.plot(x,y2,label="Hellinger with bnData")
ax.plot(x,y4,label="Bhattacharya with bnData")
ax.plot(x,y6,label="Jensen-Shannon with bnData")

ax.set_xlabel("weight ratio between data and prior")
ax.set_ylabel("distances")
ax.legend(bbox_to_anchor=(0.15, 0.85, 0.7, .102), loc=3,ncol=2, mode="expand", borderaxespad=0.)
t=ax.set_title("Weight ratio's Impact on distances")
plt.show()
[figure: Weight ratio's impact on the Hellinger, Bhattacharya and Jensen-Shannon distances w.r.t. bnPrior and bnData]

Less informative but still possible: we can also plot the structural scores (precision, recall, etc.) computed by a pyAgrum.lib.bn_vs_bn.GraphicalBNComparator (see 07-ComparingBN for more details).

In [9]:
import pyAgrum.lib.bn_vs_bn as gcm

def scores(i):
    cmp=gcm.GraphicalBNComparator(bnPrior,bns[i])
    y1=cmp.scores()
    cmp=gcm.GraphicalBNComparator(bnData,bns[i])
    y2=cmp.scores()
    return y1['recall']   ,y2['recall'],y1['precision'],y2['precision'],y1['fscore'],y2['fscore'],y1['dist2opt'] ,y2['dist2opt']


fig=figure(figsize=(20,6))
ax1  = fig.add_subplot(1, 2, 1)
ax2  = fig.add_subplot(1, 2, 2)

x=ratios
y1,y2,y3,y4,y5,y6,y7,y8=zip(*[scores(i) for i in range(len(ratios))])
ax1.plot(x,y1,label="recall with bnPrior")
ax1.plot(x,y3,label="precision with bnPrior")
ax1.plot(x,y5,label="fscore with bnPrior")
ax1.plot(x,y7,label="dist2opt with bnPrior")

ax2.plot(x,y2,label="recall with bnData")
ax2.plot(x,y4,label="precision with bnData")
ax2.plot(x,y6,label="fscore with bnData")
ax2.plot(x,y8,label="dist2opt with bnData")

ax1.set_xlabel("weight ratio between data and prior")
ax1.set_ylabel("scores")
ax1.legend(bbox_to_anchor=(0.15, 0.88, 0.7, .102), loc=3,ncol=2, mode="expand", borderaxespad=0.)
ax1.set_title("Weight ratio's Impact on scores")

ax2.set_xlabel("weight ratio between data and prior")
ax2.set_ylabel("scores")
ax2.legend(bbox_to_anchor=(0.15, 0.88, 0.7, .102), loc=3,ncol=2, mode="expand", borderaxespad=0.)
ax2.set_title("Weight ratio's Impact on scores")

plt.show()
[figure: Weight ratio's impact on the structural scores (recall, precision, fscore, dist2opt) w.r.t. bnPrior and bnData]

Weighted database and records

A database can be weighted globally, as done above, but you can also set the weight record by record. Note that the weight of the database is the sum of the weights of its records. Hence

learner.setDatabaseWeight(2.5)

is equivalent to

siz=learner.nbRows()
for i in range(siz):
    learner.setRecordWeight(i,2.5/siz)
In [10]:
bn=gum.fastBN("X->Y")
bn1=gum.BayesNet(bn)
bn2=gum.BayesNet(bn)
#the base will be saved in basefile="out/dataW.csv"
basefile="out/dataW.csv"

Learning parameters with weighted records

In the next 2 cells, we compute the parameters of bn from 2 databases: in the first cell, the database contains 8 rows, each of weight 1; in the second, it contains only 4 rows, but two of them carry larger weights so that the sum of the weights is also 8.

The fitted parameters are therefore exactly the same.

In [11]:
%%writefile 'out/dataW.csv'
X,Y
1,0
0,1
0,1
0,0
1,0
0,1
1,1
0,1
Overwriting out/dataW.csv
In [12]:
learner=gum.BNLearner(basefile)
learner.fitParameters(bn1)
gnb.flow.row(bn1.cpt("X"),bn1.cpt("Y"))
P(X):          X=0      X=1
               0.6250   0.3750

P(Y|X):        Y=0      Y=1
       X=0     0.2000   0.8000
       X=1     0.6667   0.3333
In [13]:
%%writefile 'out/dataW.csv'
X,Y
0,0
1,0
0,1
1,1
Overwriting out/dataW.csv
In [14]:
learner=gum.BNLearner(basefile)
learner.setRecordWeight(1,2.0) # line #1 has a weight of 2
learner.setRecordWeight(2,4.0) # line #2 has a weight of 4

learner.fitParameters(bn2)
gnb.flow.row(bn2.cpt("X"),bn2.cpt("Y"))
P(X):          X=0      X=1
               0.6250   0.3750

P(Y|X):        Y=0      Y=1
       X=0     0.2000   0.8000
       X=1     0.6667   0.3333
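As a quick sanity check, one could compare the two fitted BNs directly; since they encode the same joint distribution, their exact divergence should be (numerically) zero (illustrative snippet):

# illustrative check : bn1 (8 unweighted rows) and bn2 (4 weighted rows) encode the same joint distribution
print(gum.ExactBNdistance(bn1, bn2).compute()['klPQ'])   # expected : (close to) 0.0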

Learning structure with weighted records

In the next 2 cells, we learn the structure of a BN from 2 databases: in the first cell, the database contains 12 rows, each of weight 1; in the second, it contains only 6 rows, but three of them carry larger weights so that the sum of the weights is also 12.

The learned structure and parameters are therefore exactly the same.

In [15]:
%%writefile 'out/dataW.csv'
X,Y,Z
1,0,1
1,0,1
1,0,0
1,0,1
1,0,1
0,1,0
0,1,0
0,0,1
1,0,0
0,1,1
1,1,0
0,1,0
Overwriting out/dataW.csv
In [16]:
learner=gum.BNLearner(basefile)
bn1=learner.learnBN()
gnb.flow.row(bn1,bn1.cpt("Z"),bn1.cpt("X"),bn1.cpt("Y"))
[learned graph: Z->Y->X]

P(Z):          Z=0      Z=1
               0.5000   0.5000

P(X|Y):        X=0      X=1
       Y=0     0.1667   0.8333
       Y=1     0.7727   0.2273

P(Y|Z):        Y=0      Y=1
       Z=0     0.3462   0.6538
       Z=1     0.8077   0.1923
In [17]:
%%writefile 'out/dataW.csv'
X,Y,Z
0,0,1
0,1,0
0,1,1
1,0,0
1,0,1
1,1,0
Overwriting out/dataW.csv
In [18]:
learner=gum.BNLearner(basefile)
learner.setRecordWeight(1,3.0) # line #1 has a weight of 3
learner.setRecordWeight(3,2.0) # line #3 has a weight of 2
learner.setRecordWeight(4,4.0) # line #4 has a weight of 4
# the other lines keep a weight of 1

bn2=learner.learnBN()
gnb.flow.row(bn2,bn2.cpt("Z"),bn2.cpt("X"),bn2.cpt("Y"))
[learned graph: Z->Y->X]

P(Z):          Z=0      Z=1
               0.5000   0.5000

P(X|Y):        X=0      X=1
       Y=0     0.1667   0.8333
       Y=1     0.7727   0.2273

P(Y|Z):        Y=0      Y=1
       Z=0     0.3462   0.6538
       Z=1     0.8077   0.1923
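As a quick check, one could compare the two learned BNs with the comparator already imported above; since they have the same structure, the fscore should be 1 (illustrative snippet):

# illustrative check : bn1 (12 unweighted rows) and bn2 (6 weighted rows) should have the same structure
print(gcm.GraphicalBNComparator(bn1, bn2).scores()['fscore'])   # expected : 1.0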