Conditional Linear Gaussian models

In [1]:

import pyagrum as gum
import pyagrum.lib.notebook as gnb
import pyagrum.lib.bn_vs_bn as gcm

import pyagrum.clg as gclg
import pyagrum.clg.notebook as gclgnb

Build a CLG model

From scratch

Suppose we want to build a CLG with these specifications \(A={\cal N}(5,1)\), \(B={\cal N}(4,3)\) and \(C=2.A+3.B+{\cal N}(3,2)\)

In [2]:

model = gclg.CLG()
model.add(gclg.GaussianVariable("A", 5, 1))
model.add(gclg.GaussianVariable("C", 3, 2))
model.add(gclg.GaussianVariable("B", 4, 3))
model.addArc("A", "C", 2)
model.addArc("B", "C", 3)
model

Out[2]:

From SEM (Structural Equation Model)

We can create a Conditional Linear Gaussian Bayesian networ(CLG model) using a SEM-like syntax.

A = 4.5 [0.3] means that the mean of the distribution for Gaussian random variable A is 4.5 and ist standard deviation is 0.3.

B = 3 + 0.8F [0.3] means that the mean of the distribution for the Gaussian random variable B is 3 and the standard deviation is 0.3.

pyagrum.CLG.SEM is a set of static methods to manipulate this kind of SEM.

In [3]:

sem2 = """
A=4.5 [0.3] # comments are allowed
F=7 [0.5]
B=3 + 1.2F [0.3]
C=9 +  2A + 1.5B [0.6]
D=9 + C + F[0.7]
E=9 + D [0.9]
"""

model2 = gclg.SEM.toclg(sem2)

In [4]:

gnb.show(model2)

../_images/notebooks_72-PyModels_CLG_7_0.svg

One can of course build the SEM from a CLG using pyagrum.CLG.SEM.tosem :

In [5]:

gnb.flow.row(
  model,
  "<pre><div align='left'>" + gclg.SEM.tosem(model) + "</div></pre>",
  captions=["the first CLG model", "the SEM from the CLG"],
)

Out[5]:

the first CLG model

B=4[3]
A=5[1]
C=3+2A+3B[2]

the SEM from the CLG

And this SEM allows of course input/output format for CLG

In [6]:

gclg.SEM.saveCLG(model2, "out/model2.sem")

print("=== file content ===")
with open("out/model2.sem", "r") as file:
  for line in file.readlines():
    print(line, end="")
print("====================")

=== file content ===
F=7.0[0.5]
B=3.0+1.2F[0.3]
A=4.5[0.3]
C=9.0+2.0A+1.5B[0.6]
D=9.0+F+C[0.7]
E=9.0+D[0.9]
====================

In [7]:

model3 = gclg.SEM.loadCLG("out/model2.sem")
gnb.sideBySide(model2, model3, captions=["saved model", "loaded model"])

saved model

loaded model

Exact or approximated Inference

Exact inference : Variable Elimination

Compute some posterior using difference exact inference

In [11]:

ie = gclg.CLGVariableElimination(model2)
ie.updateEvidence({"D": 3})

print(ie.posterior("A"))
print(ie.posterior("B"))
print(ie.posterior("C"))
print(ie.posterior("D"))
print(ie.posterior("E"))
print(ie.posterior("F"))


v = ie.posterior("E")
print(v)
print(f"  - mean(E|D=3)={v.mu()}")
print(f"  - stdev(E|D=3)={v.sigma()}")

A:1.9327650111193468[0.28353638852446156]
B:-2.5058561897702[0.41002992170553515]
C:3.9722757598220895[0.5657771474513671]
D:3[0]
E:12.0[0.9]
F:-2.9836916234247597[0.32358490464094586]
E:12.0[0.9]
  - mean(E|D=3)=12.0
  - stdev(E|D=3)=0.9

In [12]:

gnb.sideBySide(
  model2,
  gclgnb.getInference(model2, evs={"D": 3}, size="3!"),
  gclgnb.getInference(model2, evs={"D": 3, "F": 1}),
  captions=["The CLG", "First inference", "Second inference"],
)

The CLG

First inference

Second inference

Approximated inference : MonteCarlo Sampling

When the model is too complex for exact infernece, we can use forward sampling to generate 5000 samples from the original CLG model.

In [13]:

fs = gclg.ForwardSampling(model2)
fs.makeSample(5000).tocsv("./out/model2.csv")

We will use the generated database to do learning. But before, we can also compute posterior but without evidence :

In [14]:

ie = gclg.CLGVariableElimination(model2)
print("| 'Exact' inference                        | Results from sampling                    |")
print("|------------------------------------------|------------------------------------------|")
for i in model2.names():
  print(f"| {str(ie.posterior(i)):40} | {str(gclg.GaussianVariable(i, fs.mean_sample(i), fs.stddev_sample(i))):40} |")

| 'Exact' inference                        | Results from sampling                    |
|------------------------------------------|------------------------------------------|
| A:4.499999999999998[0.3]                 | A:4.497297718913133[0.3030205502235108]  |
| F:7.000000000000008[0.5000000000000002]  | F:7.000933183671481[0.506059158525254]   |
| B:11.399999999999999[0.6708203932499367] | B:11.398376840036336[0.6846889936191312] |
| C:35.099999999999994[1.3162446581088183] | C:35.09175884424366[1.3115102638060472]  |
| D:51.10000000000002[1.8364367672206963]  | D:51.096242691481464[1.8442312206837241] |
| E:60.100000000000016[2.0451161336217565] | E:60.10227082682586[2.0518924137010677]  |

Now with the generated database and the original model, we can calculate the log-likelihood of the model.

In [15]:

print("log-likelihood w.r.t orignal model : ", model2.logLikelihood("./out/model2.csv"))

log-likelihood w.r.t orignal model :  -22099.896933122218

Learning a CLG from data

Use the generated database to do our RAvel Learning. This part needs some time to run.

In [16]:

# RAveL learning
learner = gclg.CLGLearner("./out/model2.csv")

We can get the learned_clg model with function learn_clg() which contains structure learning and parameter estimation.

In [17]:

learned_clg = learner.learnCLG()
gnb.sideBySide(model2, learned_clg, captions=["original CLG", "learned CLG"])

original CLG

learned CLG

Compare the learned model’s structure with that of the original model’.

In [18]:

cmp = gcm.GraphicalBNComparator(model2.asDiscreteBN(), learned_clg.asDiscreteBN())
print(f"F-score(original_clg,learned_clg) : {cmp.scores()['fscore']}")

F-score(original_clg,learned_clg) : 0.4444444444444445

Get the learned model’s parameters and compare them with the original model’s parameters using the SEM syntax.

In [19]:

gnb.flow.row(
  "<pre><div align='left'>" + gclg.SEM.tosem(model2) + "</div></pre>",
  "<pre><div align='left'>" + gclg.SEM.tosem(learned_clg) + "</div></pre>",
  captions=["original sem", "learned sem"],
)

Out[19]:

F=7.0[0.5]
B=3.0+1.2F[0.3]
A=4.5[0.3]
C=9.0+2.0A+1.5B[0.6]
D=9.0+F+C[0.7]
E=9.0+D[0.9]

original sem

F=7.000933183671481[0.506059158525254]
B=2.87563302514263+1.2173725403881295F[0.2987741375094306]
D=26.640495974872238+2.145546428216815B[1.1149590601244737]
E=8.786082521511304+1.004304535955043D[0.8830229795492328]
A=4.497297718913133[0.3030205502235108]
C=0.8091816377159518+0.7769793533488817A+0.6025544758539018D[0.518336060972927]

learned sem

We can algo do parameter estimation only with function fitParameters() if we already have the structure of the model.

In [20]:

# We can copy the original CLG
copy_original = gclg.CLG(model2)

# RAveL learning again
RAveL_l = gclg.CLGLearner("./out/model2.csv")

# Fit the parameters of the copy clg
RAveL_l.fitParameters(copy_original)

copy_original

Out[20]:

Compare two CLG models

We first create two CLG from two SEMs.

In [21]:

# TWO DIFFERENT CLGs

# FIRST CLG
clg1 = gclg.SEM.toclg("""
# hyper parameters
A=4[1]
B=3[5]
C=-2[5]

#equations
D=A[.2] # D is a noisy version of A
E=1+D+2B [2]
F=E+C+B+E [0.001]
""")

# SECOND CLG
clg2 = gclg.SEM.toclg("""
# hyper parameters
A=4[1]
B=3+A[5]
C=-2+2B+A[5]

#equations
D=A[.2] # D is a noisy version of A
E=1+D+2B [2]
F=E+C [0.001]
""")

This cell shows how to have a quick view of the differences

In [22]:

gnb.flow.row(clg1, clg2, gcm.graphDiff(clg1, clg2), gcm.graphDiffLegend(), gcm.graphDiff(clg2, clg1))

Out[22]:

We compare the CLG models.

In [23]:

# We use the F-score to compare the two CLGs
cmp = gcm.GraphicalBNComparator(clg1.asDiscreteBN(), clg1.asDiscreteBN())
print(f"F-score(clg1,clg1) : {cmp.scores()['fscore']}")

cmp = gcm.GraphicalBNComparator(clg1.asDiscreteBN(), clg2.asDiscreteBN())
print(f"F-score(clg1,clg2) : {cmp.scores()['fscore']}")

F-score(clg1,clg1) : 1.0
F-score(clg1,clg2) : 0.7142857142857143

In [24]:

# The complete list of structural scores is :
print("score(clg1,clg2) :")
for score, val in cmp.scores().items():
  print(f"  - {score} : {val}")

score(clg1,clg2) :
  - count : {'tp': 5, 'tn': 6, 'fp': 3, 'fn': 1}
  - recall : 0.8333333333333334
  - precision : 0.625
  - fscore : 0.7142857142857143
  - dist2opt : 0.41036907507483766
  - sid : 3

Forward Sampling

In [25]:

# We create a simple CLG with 3 variables
clg = gclg.CLG()
# prog=« sigma=2;X=N(5);Y=N(3);Z=X+Y »
A = gclg.GaussianVariable(mu=2, sigma=1, name="A")
B = gclg.GaussianVariable(mu=1, sigma=2, name="B")
C = gclg.GaussianVariable(mu=2, sigma=3, name="C")

idA = clg.add(A)
idB = clg.add(B)
idC = clg.add(C)

clg.addArc(idA, idB, 1.5)
clg.addArc(idB, idC, 0.75)

# We can show it as a graph
original_clg = gclgnb.CLG2dot(clg)
original_clg

Out[25]:

In [26]:

fs = gclg.ForwardSampling(clg)
fs.makeSample(10)

Out[26]:

<pyagrum.clg.forwardSampling.ForwardSampling at 0x113c5c050>

In [27]:

print("A's sample_variance: ", fs.variance_sample(0))
print("B's sample_variance: ", fs.variance_sample("B"))
print("C's sample_variance: ", fs.variance_sample(2))

A's sample_variance:  1.5536715494590174
B's sample_variance:  8.104376746610344
C's sample_variance:  9.317711054893955

In [28]:

print("A's sample_mean: ", fs.mean_sample("A"))
print("B's sample_mean: ", fs.mean_sample("B"))
print("C's sample_mean: ", fs.mean_sample("C"))

A's sample_mean:  1.898378004671517
B's sample_mean:  4.009560045839862
C's sample_mean:  4.429983805464735

In [29]:

fs.toarray()

Out[29]:

array([[ 2.04611608e-03, -1.81215231e+00, -1.77388968e+00],
       [ 3.70353629e+00,  7.64983713e+00,  6.09679613e+00],
       [ 2.00890627e+00,  3.95592693e+00,  4.36553057e+00],
       [ 2.31583527e+00,  3.46675143e+00,  7.32060229e+00],
       [ 2.54396727e+00,  5.33070331e+00,  6.43275884e+00],
       [ 4.04628059e+00,  8.43172749e+00,  3.75677808e+00],
       [ 1.63736962e+00,  3.97280451e+00,  6.99425709e-01],
       [ 7.09426274e-01,  5.33917442e-01,  2.40269541e+00],
       [ 1.45471649e+00,  4.47050663e+00,  8.55194213e+00],
       [ 5.61695846e-01,  4.09557790e+00,  6.44719857e+00]])

In [30]:

# export to dataframe
fs.topandas()

Out[30]:

	A	B	C
0	0.002046	-1.812152	-1.773890
1	3.703536	7.649837	6.096796
2	2.008906	3.955927	4.365531
3	2.315835	3.466751	7.320602
4	2.543967	5.330703	6.432759
5	4.046281	8.431727	3.756778
6	1.637370	3.972805	0.699426
7	0.709426	0.533917	2.402695
8	1.454716	4.470507	8.551942
9	0.561696	4.095578	6.447199

In [31]:

# export to csv
fs.makeSample(10000)
fs.tocsv("./out/samples.csv")

PC-algorithm & Parameter Estimation

The module allows to investigale more deeply into the learning algorithm.

We first create a random CLG model with 5 variables.

In [32]:

# Create a new random CLG
clg = gclg.randomCLG(nb_variables=5, names="ABCDE")

# Display the CLG
print(clg)

A=1.0206830157456794[5.8275511445019195]
B=2.9212765644834704-9.685092060974101A[5.772850840277943]
C=-4.4033954359007215[4.477027145048831]
D=3.955794416407027-8.20412108135363C+9.68740842349083A[9.075785768606641]
E=-0.41692306161996484+6.887125456821082A-6.908527740591809D[8.012333922404032]

We then do the Forward Sampling and CLGLearner.

In [33]:

n = 20  # n is the selected values of MC number n in n-MCERA
K = 10000  # K is the list of selected values of number of samples
Delta = 0.05  # Delta is the FWER we want to control

# Sample generation
fs = gclg.ForwardSampling(clg)
fs.makeSample(K).tocsv("./out/clg.csv")

# Learning
RAveL_l = gclg.CLGLearner("./out/clg.csv", n_sample=n, fwer_delta=Delta)

We use the PC algorithme to learn the structure of the model.

In [34]:

# Use the PC algorithm to get the skeleton
C = RAveL_l.PC_algorithm(order=clg.nodes(), verbose=False)
print("The final skeleton is:\n", C)

The final skeleton is:
 {0: {4}, 1: set(), 2: {0}, 3: {1, 4}, 4: set()}

In [35]:

# Create a Mixedgraph to display the skeleton
RAveL_MixGraph = gum.MixedGraph()

# Add variables
for i in range(len(clg.names())):
  RAveL_MixGraph.addNodeWithId(i)
  RAveL_MixGraph.setName(i, clg.variable(i).name())

# Add arcs and edges
for father, kids in C.items():
  for kid in kids:
    if father in C[kid]:
      RAveL_MixGraph.addEdge(father, kid)
    else:
      RAveL_MixGraph.addArc(father, kid)

RAveL_MixGraph

Out[35]:

In [36]:

# Create a BN with the same structure as the CLG
bn = clg.asDiscreteBN()

# Compare the result above with the EssentialGraph
Real_EssentialGraph = gum.EssentialGraph(bn)

Real_EssentialGraph

Out[36]:

In [37]:

# create a CLG from the skeleton of PC algorithm
clg_PC = gclg.CLG()
for node in clg.nodes():
  clg_PC.add(clg.variable(node))
for father, kids in C.items():
  for kid in kids:
    clg_PC.addArc(father, kid)

# Compare the structure of the created CLG and the original CLG
print(f"F-score : {clg.structuralFScore(clg_PC)}")

F-score : 0.5

We can also do the parameter learning.

In [38]:

id2mu, id2sigma, arc2coef = RAveL_l.estimate_parameters(C)

for node in clg.nodes():
  print(f"Real Value: node {node} : mu = {clg.variable(node)._mu}, sigma = {clg.variable(node)._sigma}")
  print(f"Estimation: node {node} : mu = {id2mu[node]}, sigma = {id2sigma[node]}")


for arc in clg.arcs():
  print(f"Real Value: arc {arc} : coef = {clg.coefArc(*arc)}")
  print(f"Estimation: arc {arc} : coef = {(arc2coef[arc] if arc in arc2coef else '-')}")

Real Value: node 0 : mu = -0.41692306161996484, sigma = 8.012333922404032
Estimation: node 0 : mu = -92.58990873949037, sigma = 357.69829875049476
Real Value: node 1 : mu = 2.9212765644834704, sigma = 5.772850840277943
Estimation: node 1 : mu = 2.960059221198133, sigma = 5.68779648184328
Real Value: node 2 : mu = -4.4033954359007215, sigma = 4.477027145048831
Estimation: node 2 : mu = -4.435138289812253, sigma = 4.475039827898229
Real Value: node 3 : mu = 1.0206830157456794, sigma = 5.8275511445019195
Estimation: node 3 : mu = 1.0343711946412903, sigma = 5.877217571356469
Real Value: node 4 : mu = 3.955794416407027, sigma = 9.075785768606641
Estimation: node 4 : mu = -0.038856813158837156, sigma = 1.1638595997289731
Real Value: arc (2, 4) : coef = -8.20412108135363
Estimation: arc (2, 4) : coef = -
Real Value: arc (4, 0) : coef = -6.908527740591809
Estimation: arc (4, 0) : coef = -
Real Value: arc (3, 4) : coef = 9.68740842349083
Estimation: arc (3, 4) : coef = 1.002979240908741
Real Value: arc (3, 1) : coef = -9.685092060974101
Estimation: arc (3, 1) : coef = -9.68501359662286
Real Value: arc (3, 0) : coef = 6.887125456821082
Estimation: arc (3, 0) : coef = -