Conditional Linear Gaussian models
In [1]:
import pyagrum as gum
import pyagrum.lib.notebook as gnb
import pyagrum.lib.bn_vs_bn as gcm
import pyagrum.clg as gclg
import pyagrum.clg.notebook as gclgnb
Build a CLG model
From scratch
Suppose we want to build a CLG with these specifications \(A={\cal N}(5,1)\), \(B={\cal N}(4,3)\) and \(C=2.A+3.B+{\cal N}(3,2)\)
In [2]:
model = gclg.CLG()
model.add(gclg.GaussianVariable("A", 5, 1))
model.add(gclg.GaussianVariable("C", 3, 2))
model.add(gclg.GaussianVariable("B", 4, 3))
model.addArc("A", "C", 2)
model.addArc("B", "C", 3)
model
Out[2]:
From SEM (Structural Equation Model)
We can create a Conditional Linear Gaussian Bayesian networ(CLG model) using a SEM-like syntax.
A = 4.5 [0.3] means that the mean of the distribution for Gaussian random variable A is 4.5 and ist standard deviation is 0.3.
B = 3 + 0.8F [0.3] means that the mean of the distribution for the Gaussian random variable B is 3 and the standard deviation is 0.3.
pyagrum.CLG.SEM is a set of static methods to manipulate this kind of SEM.
In [3]:
sem2 = """
A=4.5 [0.3] # comments are allowed
F=7 [0.5]
B=3 + 1.2F [0.3]
C=9 + 2A + 1.5B [0.6]
D=9 + C + F[0.7]
E=9 + D [0.9]
"""
model2 = gclg.SEM.toclg(sem2)
In [4]:
gnb.show(model2)
One can of course build the SEM from a CLG using pyagrum.CLG.SEM.tosem :
In [5]:
gnb.flow.row(
model,
"<pre><div align='left'>" + gclg.SEM.tosem(model) + "</div></pre>",
captions=["the first CLG model", "the SEM from the CLG"],
)
B=4[3] A=5[1] C=3+2A+3B[2]
And this SEM allows of course input/output format for CLG
In [6]:
gclg.SEM.saveCLG(model2, "out/model2.sem")
print("=== file content ===")
with open("out/model2.sem", "r") as file:
for line in file.readlines():
print(line, end="")
print("====================")
=== file content ===
F=7.0[0.5]
B=3.0+1.2F[0.3]
A=4.5[0.3]
C=9.0+2.0A+1.5B[0.6]
D=9.0+F+C[0.7]
E=9.0+D[0.9]
====================
In [7]:
model3 = gclg.SEM.loadCLG("out/model2.sem")
gnb.sideBySide(model2, model3, captions=["saved model", "loaded model"])
input/output with pickle for CLG
In [8]:
import pickle
with open("test.pkl", "bw") as f:
pickle.dump(model3, f)
model3
Out[8]:
In [9]:
with open("test.pkl", "br") as f:
copyModel3 = pickle.load(f)
copyModel3
Out[9]:
Exact or approximated Inference
Exact inference : Variable Elimination
Compute some posterior using difference exact inference
In [10]:
ie = gclg.CLGVariableElimination(model2)
ie.updateEvidence({"D": 3})
print(ie.posterior("A"))
print(ie.posterior("B"))
print(ie.posterior("C"))
print(ie.posterior("D"))
print(ie.posterior("E"))
print(ie.posterior("F"))
v = ie.posterior("E")
print(v)
print(f" - mean(E|D=3)={v.mu()}")
print(f" - stdev(E|D=3)={v.sigma()}")
A:1.9327650111193468[0.28353638852446156]
B:-2.5058561897702[0.41002992170553515]
C:3.9722757598220895[0.5657771474513671]
D:3[0]
E:12.0[0.9]
F:-2.9836916234247597[0.32358490464094586]
E:12.0[0.9]
- mean(E|D=3)=12.0
- stdev(E|D=3)=0.9
In [11]:
gnb.sideBySide(
model2,
gclgnb.getInference(model2, evs={"D": 3}, size="3!"),
gclgnb.getInference(model2, evs={"D": 3, "F": 1}),
captions=["The CLG", "First inference", "Second inference"],
)
Approximated inference : MonteCarlo Sampling
When the model is too complex for exact infernece, we can use forward sampling to generate 5000 samples from the original CLG model.
In [12]:
fs = gclg.ForwardSampling(model2)
fs.makeSample(5000).tocsv("./out/model2.csv")
We will use the generated database to do learning. But before, we can also compute posterior but without evidence :
In [13]:
ie = gclg.CLGVariableElimination(model2)
print("| 'Exact' inference | Results from sampling |")
print("|------------------------------------------|------------------------------------------|")
for i in model2.names():
print(f"| {str(ie.posterior(i)):40} | {str(gclg.GaussianVariable(i, fs.mean_sample(i), fs.stddev_sample(i))):40} |")
| 'Exact' inference | Results from sampling |
|------------------------------------------|------------------------------------------|
| A:4.499999999999998[0.3] | A:4.506411314616108[0.3047201154128288] |
| F:7.000000000000008[0.5000000000000002] | F:7.004625249073706[0.4992357073054573] |
| B:11.399999999999999[0.6708203932499367] | B:11.406069053496976[0.6753879538148174] |
| C:35.099999999999994[1.3162446581088183] | C:35.119368681209885[1.3231024261603428] |
| D:51.10000000000002[1.8364367672206963] | D:51.128075206127676[1.8457570890362218] |
| E:60.100000000000016[2.0451161336217565] | E:60.11128006318519[2.0481536533271183] |
Now with the generated database and the original model, we can calculate the log-likelihood of the model.
In [14]:
print("log-likelihood w.r.t orignal model : ", model2.logLikelihood("./out/model2.csv"))
log-likelihood w.r.t orignal model : -22190.293165304694
Learning a CLG from data
Use the generated database to do our RAvel Learning. This part needs some time to run.
In [15]:
# RAveL learning
learner = gclg.CLGLearner("./out/model2.csv")
We can get the learned_clg model with function learn_clg() which contains structure learning and parameter estimation.
In [16]:
learned_clg = learner.learnCLG()
gnb.sideBySide(model2, learned_clg, captions=["original CLG", "learned CLG"])
Compare the learned model’s structure with that of the original model’.
In [17]:
cmp = gcm.GraphicalBNComparator(model2, learned_clg)
print(f"F-score(original_clg,learned_clg) : {cmp.scores()['fscore']}")
F-score(original_clg,learned_clg) : 0.5454545454545454
Get the learned model’s parameters and compare them with the original model’s parameters using the SEM syntax.
In [18]:
gnb.flow.row(
"<pre><div align='left'>" + gclg.SEM.tosem(model2) + "</div></pre>",
"<pre><div align='left'>" + gclg.SEM.tosem(learned_clg) + "</div></pre>",
captions=["original sem", "learned sem"],
)
F=7.0[0.5] B=3.0+1.2F[0.3] A=4.5[0.3] C=9.0+2.0A+1.5B[0.6] D=9.0+F+C[0.7] E=9.0+D[0.9]
F=7.005[0.499] B=2.911+1.21F[0.299] D=26.445+2.16B[1.127] E=9.026+D[0.891] A=4.506[0.305] C=0.53+0.79A+0.61D[0.511]
We can algo do parameter estimation only with function fitParameters() if we already have the structure of the model.
In [19]:
# We can copy the original CLG
copy_original = gclg.CLG(model2)
# RAveL learning again
RAveL_l = gclg.CLGLearner("./out/model2.csv")
# Fit the parameters of the copy clg
RAveL_l.fitParameters(copy_original)
copy_original
Out[19]:
Compare two CLG models
We first create two CLG from two SEMs.
In [20]:
# TWO DIFFERENT CLGs
# FIRST CLG
clg1 = gclg.SEM.toclg("""
# hyper parameters
A=4[1]
B=3[5]
C=-2[5]
#equations
D=A[.2] # D is a noisy version of A
E=1+D+2B [2]
F=E+C+B+E [0.001]
""")
# SECOND CLG
clg2 = gclg.SEM.toclg("""
# hyper parameters
A=4[1]
B=3+A[5]
C=-2+2B+A[5]
#equations
D=A[.2] # D is a noisy version of A
E=1+D+2B [2]
F=E+C [0.001]
""")
This cell shows how to have a quick view of the differences
In [21]:
gnb.flow.row(clg1, clg2, gcm.graphDiff(clg1, clg2), gcm.graphDiffLegend(), gcm.graphDiff(clg2, clg1))
We compare the CLG models.
In [22]:
# We use the F-score to compare the two CLGs
cmp = gcm.GraphicalBNComparator(clg1, clg1)
print(f"F-score(clg1,clg1) : {cmp.scores()['fscore']}")
cmp = gcm.GraphicalBNComparator(clg1, clg2)
print(f"F-score(clg1,clg2) : {cmp.scores()['fscore']}")
F-score(clg1,clg1) : 1.0
F-score(clg1,clg2) : 0.7142857142857143
In [23]:
# The complete list of structural scores is :
print("score(clg1,clg2) :")
for score, val in cmp.scores().items():
print(f" - {score} : {val}")
score(clg1,clg2) :
- count : {'tp': 5, 'tn': 21, 'fp': 3, 'fn': 1}
- recall : 0.8333333333333334
- precision : 0.625
- fscore : 0.7142857142857143
- dist2opt : 0.41036907507483766
Forward Sampling
In [24]:
# We create a simple CLG with 3 variables
clg = gclg.CLG()
# prog=« sigma=2;X=N(5);Y=N(3);Z=X+Y »
A = gclg.GaussianVariable(mu=2, sigma=1, name="A")
B = gclg.GaussianVariable(mu=1, sigma=2, name="B")
C = gclg.GaussianVariable(mu=2, sigma=3, name="C")
idA = clg.add(A)
idB = clg.add(B)
idC = clg.add(C)
clg.addArc(idA, idB, 1.5)
clg.addArc(idB, idC, 0.75)
# We can show it as a graph
original_clg = gclgnb.CLG2dot(clg)
original_clg
Out[24]:
In [25]:
fs = gclg.ForwardSampling(clg)
fs.makeSample(10)
Out[25]:
<pyagrum.clg.forwardSampling.ForwardSampling at 0x123d465d0>
In [26]:
print("A's sample_variance: ", fs.variance_sample(0))
print("B's sample_variance: ", fs.variance_sample("B"))
print("C's sample_variance: ", fs.variance_sample(2))
A's sample_variance: 0.3594438509509771
B's sample_variance: 7.87614159531051
C's sample_variance: 7.362055598235939
In [27]:
print("A's sample_mean: ", fs.mean_sample("A"))
print("B's sample_mean: ", fs.mean_sample("B"))
print("C's sample_mean: ", fs.mean_sample("C"))
A's sample_mean: 1.320826152929032
B's sample_mean: 3.6641975569680048
C's sample_mean: 4.767190127723293
In [28]:
fs.toarray()
Out[28]:
array([[ 1.07459513, 0.04808577, 0.3849656 ],
[ 0.3073453 , -0.19441256, 2.66658039],
[ 1.77906467, 2.43565606, 0.90365465],
[ 1.66956915, 5.79951864, 6.21164636],
[ 2.65139076, 6.73601829, 6.033236 ],
[ 0.99228303, 1.46300046, 7.2093204 ],
[ 1.15542912, 4.877287 , 8.45064496],
[ 1.16605401, 3.34509417, 7.00633286],
[ 1.53729338, 9.01006178, 2.58938582],
[ 0.87523698, 3.12166595, 6.21613425]])
In [29]:
# export to dataframe
fs.topandas()
Out[29]:
| A | B | C | |
|---|---|---|---|
| 0 | 1.074595 | 0.048086 | 0.384966 |
| 1 | 0.307345 | -0.194413 | 2.666580 |
| 2 | 1.779065 | 2.435656 | 0.903655 |
| 3 | 1.669569 | 5.799519 | 6.211646 |
| 4 | 2.651391 | 6.736018 | 6.033236 |
| 5 | 0.992283 | 1.463000 | 7.209320 |
| 6 | 1.155429 | 4.877287 | 8.450645 |
| 7 | 1.166054 | 3.345094 | 7.006333 |
| 8 | 1.537293 | 9.010062 | 2.589386 |
| 9 | 0.875237 | 3.121666 | 6.216134 |
In [30]:
# export to csv
fs.makeSample(10000)
fs.tocsv("./out/samples.csv")
PC-algorithm & Parameter Estimation
The module allows to investigale more deeply into the learning algorithm.
We first create a random CLG model with 5 variables.
In [31]:
# Create a new random CLG
clg = gclg.randomCLG(nb_variables=5, names="ABCDE")
# Display the CLG
print(clg)
CLG{nodes: 5, arcs: 6, parameters: 16}
We then do the Forward Sampling and CLGLearner.
In [32]:
n = 20 # n is the selected values of MC number n in n-MCERA
K = 10000 # K is the list of selected values of number of samples
Delta = 0.05 # Delta is the FWER we want to control
# Sample generation
fs = gclg.ForwardSampling(clg)
fs.makeSample(K).tocsv("./out/clg.csv")
# Learning
RAveL_l = gclg.CLGLearner("./out/clg.csv", n_sample=n, fwer_delta=Delta)
We use the PC algorithme to learn the structure of the model.
In [33]:
# Use the PC algorithm to get the skeleton
C = RAveL_l.PC_algorithm(order=clg.nodes(), verbose=False)
print("The final skeleton is:\n", C)
The final skeleton is:
{0: {3}, 1: {4}, 2: {1, 4}, 3: set(), 4: set()}
In [34]:
# Create a Mixedgraph to display the skeleton
RAveL_MixGraph = gum.MixedGraph()
# Add variables
for i in range(len(clg.names())):
RAveL_MixGraph.addNodeWithId(i)
# Add arcs and edges
for father, kids in C.items():
for kid in kids:
if father in C[kid]:
RAveL_MixGraph.addEdge(father, kid)
else:
RAveL_MixGraph.addArc(father, kid)
RAveL_MixGraph
Out[34]:
In [35]:
# Create a BN with the same structure as the CLG
bn = gum.BayesNet()
# add variables
for name in clg.names():
new_variable = gum.LabelizedVariable(name, "a labelized variable", 2)
bn.add(new_variable)
# add arcs
for arc in clg.arcs():
bn.addArc(arc[0], arc[1])
# Compare the result above with the EssentialGraph
Real_EssentialGraph = gum.EssentialGraph(bn)
Real_EssentialGraph
Out[35]:
In [36]:
# create a CLG from the skeleton of PC algorithm
clg_PC = gclg.CLG()
for node in clg.nodes():
clg_PC.add(clg.variable(node))
for father, kids in C.items():
for kid in kids:
clg_PC.addArc(father, kid)
# Compare the structure of the created CLG and the original CLG
print(f"F-score : {clg.CompareStructure(clg_PC)}")
F-score : 0.8
We can also do the parameter learning.
In [37]:
id2mu, id2sigma, arc2coef = RAveL_l.estimate_parameters(C)
for node in clg.nodes():
print(f"Real Value: node {node} : mu = {clg.variable(node)._mu}, sigma = {clg.variable(node)._sigma}")
print(f"Estimation: node {node} : mu = {id2mu[node]}, sigma = {id2sigma[node]}")
for arc in clg.arcs():
print(f"Real Value: arc {arc} : coef = {clg.coefArc(*arc)}")
print(f"Estimation: arc {arc} : coef = {(arc2coef[arc] if arc in arc2coef else '-')}")
Real Value: node 0 : mu = 0.08093620608756957, sigma = 1.9489905333443087
Estimation: node 0 : mu = 0.08028388526738774, sigma = 1.9565448414861146
Real Value: node 1 : mu = 1.0874077664315536, sigma = 6.043672025109873
Estimation: node 1 : mu = -17.252714398799895, sigma = 77.1193292765197
Real Value: node 2 : mu = -3.1269394358113645, sigma = 8.360069498146906
Estimation: node 2 : mu = -3.0667574696888202, sigma = 8.344086888381153
Real Value: node 3 : mu = 2.031068416871209, sigma = 2.2128045054914476
Estimation: node 3 : mu = 2.0474699873534403, sigma = 2.2051555550932185
Real Value: node 4 : mu = 0.2661515090013564, sigma = 5.372195542567968
Estimation: node 4 : mu = 0.2738042115977919, sigma = 5.381400340713036
Real Value: arc (0, 1) : coef = 7.621804351857164
Estimation: arc (0, 1) : coef = -
Real Value: arc (2, 4) : coef = 5.094580524038454
Estimation: arc (2, 4) : coef = 5.09601400263659
Real Value: arc (2, 1) : coef = 9.300182015962493
Estimation: arc (2, 1) : coef = 9.26420864191017
Real Value: arc (3, 1) : coef = -7.404504505825376
Estimation: arc (3, 1) : coef = -
Real Value: arc (0, 3) : coef = 6.226617111602195
Estimation: arc (0, 3) : coef = 6.215576032269046
Real Value: arc (1, 4) : coef = 9.644159715611616
Estimation: arc (1, 4) : coef = 9.644082765212918
In [ ]:

