Learning a CLG

One of the main features of this library is the ability to learn a CLG from data.

More precisely, what can be learned is:
  • The dependency graph of a CLG

  • The parameters of a CLG: the mean (mu) and standard deviation (sigma) of each variable, and the coefficients of the arcs
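To make these parameters concrete: in a CLG, each node X follows a normal distribution whose mean is its own mu plus a weighted sum of its parents' values. A minimal pure-Python sketch (not the pyAgrum.clg API; the dictionaries below are illustrative structures of our own) shows how the marginal mean of every node follows from mu and the arc coefficients by propagation in topological order:

```python
def marginal_means(parents, mu, coef):
    """parents: node -> list of parent nodes; mu: node -> intercept;
    coef: (parent, child) -> arc coefficient.
    Returns the marginal mean of every node, computed recursively as
    E[X] = mu_X + sum_i coef_i * E[parent_i]."""
    means = {}

    def mean(n):
        if n not in means:
            means[n] = mu[n] + sum(coef[(p, n)] * mean(p) for p in parents[n])
        return means[n]

    for n in parents:
        mean(n)
    return means

# Toy chain A -> B -> C with B = 1 + 2*A + noise, C = 0.5 + 1*B + noise
parents = {"A": [], "B": ["A"], "C": ["B"]}
mu = {"A": 3.0, "B": 1.0, "C": 0.5}
coef = {("A", "B"): 2.0, ("B", "C"): 1.0}
print(marginal_means(parents, mu, coef))  # {'A': 3.0, 'B': 7.0, 'C': 7.5}
```

The sigma of each node does not affect the marginal means; it only widens the distribution around them.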

Learning the graph

To learn the graph of a CLG (i.e. the dependencies between variables), we use a modified PC algorithm based on the work of Diego Colombo and Marloes H. Maathuis: Order-Independent Constraint-Based Causal Structure Learning (2014).

The independence test used is based on the work of Dario Simionato and Fabio Vandin: Bounding the Family-Wise Error Rate in Local Causal Discovery using Rademacher Averages (2022).

class pyAgrum.clg.learning.CLGLearner(filename, *, n_sample=15, fwer_delta=0.05)

Uses Rademacher averages to guarantee the FWER (family-wise error rate) of the independence tests (see "Bounding the Family-Wise Error Rate in Local Causal Discovery using Rademacher Averages", Dario Simionato, Fabio Vandin, 2022).

Parameters:
  • filename (str)

  • n_sample (int)

  • fwer_delta (float)

This function performs the first step of the PC algorithm, adjacency search: it applies indep_test() to prune the complete graph down to a skeleton.

Parameters:
  • order (List[NodeId]) – A particular order of the Nodes.

  • verbose (bool) – Whether to print the process of Adjacency Search.

Returns:

  • C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.

  • sepset (Dict[Tuple[NodeId, NodeId], Set[NodeId]]) – The separation sets (used in Steps 2 and 3 of the PC algorithm).

PC_algorithm(order, verbose=False)

This function is an advanced version of the PC algorithm: it uses Indep_test_Rademacher() in place of indep_test(), and orients the remaining undirected edges of the skeleton C by comparing the variances of the two endpoints.

Parameters:
  • order (List[NodeId]) – A particular order of the Nodes.

  • verbose (bool) – Whether to print the process of the PC algorithm.

Returns:

C – A DAG (directed acyclic graph) representing the causal structure.

Return type:

Dict[NodeId, Set[NodeId]]
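Since the returned DAG is a plain Dict[NodeId, Set[NodeId]], it can be inspected with ordinary Python. As an illustrative sketch (assuming the oriented mapping is node → children; check the library source for the exact convention), acyclicity can be verified with Kahn's algorithm:

```python
def is_dag(children):
    """children: node -> set of child nodes.
    Returns True iff the directed graph has no cycle (Kahn's algorithm)."""
    indeg = {n: 0 for n in children}
    for kids in children.values():
        for c in kids:
            indeg[c] += 1
    queue = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return seen == len(children)  # all nodes removed => no cycle

assert is_dag({0: {1}, 1: {2}, 2: set()})  # chain 0 -> 1 -> 2
assert not is_dag({0: {1}, 1: {0}})        # 2-cycle
```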

Pearson_coeff(X, Y, Z)

Estimate Pearson's linear correlation coefficient (using linear regression when Z is not empty).

Parameters:
  • X (NodeId) – The id of the first variable tested.

  • Y (NodeId) – The id of the second variable tested.

  • Z (Set[NodeId]) – The set of ids of the conditioning variables.
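To make the residual-based idea concrete, here is a minimal pure-Python sketch (not the library's implementation, which handles arbitrary conditioning sets): when Z is non-empty, regress X and Y on Z and correlate the residuals.

```python
from math import sqrt

def _corr(xs, ys):
    # plain Pearson correlation of two samples
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def _residuals(ys, zs):
    # ordinary least squares of y on a single regressor z
    n = len(ys)
    mz, my = sum(zs) / n, sum(ys) / n
    beta = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
            / sum((z - mz) ** 2 for z in zs))
    return [y - (my + beta * (z - mz)) for y, z in zip(ys, zs)]

def partial_corr(xs, ys, zs=None):
    # correlation of X and Y, conditioned on one variable Z if given
    if not zs:
        return _corr(xs, ys)
    return _corr(_residuals(xs, zs), _residuals(ys, zs))

# X and Y are both driven by Z: marginally correlated, conditionally not.
z = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [2 * v + e for v, e in zip(z, [1, -2, 0, 2, -1])]
y = [3 * v + e for v, e in zip(z, [-2, 1, 2, 1, -2])]
print(partial_corr(x, y))     # strong positive marginal correlation
print(partial_corr(x, y, z))  # 0.0: X and Y are independent given Z
```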

RAveL_MB(T)

Find the Markov Boundary of variable T with FWER lower than Delta.

Parameters:

T (NodeId) – The id of the target variable T.

Returns:

MB – The Markov Boundary of variable T with FWER lower than Delta.

Return type:

Set[NodeId]

RAveL_PC(T)

Find the Parent-Children of variable T with FWER lower than Delta.

Parameters:

T (NodeId) – The id of the target variable T.

Returns:

The Parent-Children of variable T with FWER lower than Delta.

Return type:

Set[NodeId]

Repeat_II(order, C, l, verbose=False)

This function is the second part of Step 1 of the PC algorithm.

Parameters:
  • order (List[NodeId]) – The order of the variables.

  • C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.

  • l (int) – The size of the sepset.

  • verbose (bool) – Whether to print.

Returns:

found_edge – True if a new edge is found, False if not.

Return type:

bool

Step4(C, verbose=False)

This function is the fourth step of the PC algorithm: orient the remaining undirected edges by comparing the variances of the two endpoints.

Parameters:
  • C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.

  • verbose (bool) – Whether to print the process of Step4.

Returns:

  • C (Dict[NodeId, Set[NodeId]]) – The final skeleton (of Step4).

  • new_oriented (bool) – Whether there is a new edge oriented in the fourth step.
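The documentation does not spell out the variance comparison here; one plausible reading, shown below as an assumption of ours rather than the library's exact rule, is that an undirected edge is pointed toward the endpoint with the larger sample variance (in a linear Gaussian model, a child accumulates its parents' variance plus its own noise when the coefficient magnitudes are at least 1):

```python
from statistics import pvariance

def orient_by_variance(x_samples, y_samples):
    """Illustrative heuristic (assumed, not the library's exact rule):
    point the edge at the endpoint with the larger sample variance."""
    return "X->Y" if pvariance(x_samples) < pvariance(y_samples) else "Y->X"

x = [0.0, 1.0, 2.0, 3.0]
y = [2 * v for v in x]           # Y = 2X has four times X's variance
print(orient_by_variance(x, y))  # X->Y
```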

estimate_parameters(C)

This function is used to estimate the parameters of the CLG model.

Parameters:

C (Dict[NodeId, Set[NodeId]]) – A DAG (directed acyclic graph) representing the causal structure.

Returns:

  • id2mu (Dict[NodeId, float]) – The estimated mean of each node.

  • id2sigma (Dict[NodeId, float]) – The estimated variance of each node.

  • arc2coef (Dict[Tuple[NodeId, NodeId], float]) – The estimated coefficients of each arc.
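For intuition, here is a pure-Python sketch of the single-parent case (an assumption for illustration; the library estimates these jointly for arbitrary parent sets): the arc coefficient comes from ordinary least squares, mu from the intercept, and sigma from the residual standard deviation.

```python
from math import sqrt

def fit_node(y, parents_samples):
    """OLS fit of the assumed form Y = mu + coef * X + N(0, sigma^2),
    for a node with exactly one parent X."""
    (x,) = parents_samples
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    coef = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    mu = my - coef * mx
    resid = [b - (mu + coef * a) for a, b in zip(x, y)]
    sigma = sqrt(sum(r * r for r in resid) / n)
    return mu, coef, sigma

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * v for v in x]  # noiseless data: mu=1, coef=2, sigma=0
print(fit_node(y, [x]))         # (1.0, 2.0, 0.0)
```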

fitParameters(clg)

This function fits the parameters of a given CLG model.

Parameters:

clg (CLG) – The CLG model whose parameters will be set.

static generate_XYZ(l)

Find all the possible combinations of X, Y and Z.

Parameters:

l (int) – The size of the conditioning set Z.

Returns:

All the possible combinations of X, Y and Z.

Return type:

List[Tuple[Set[NodeId], Set[NodeId]]]

static generate_subsets(S)

Generator that iterates over all the subsets of S (from the smallest to the largest).

Parameters:

S (Set[NodeId]) – The set of variables.
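A plain-Python sketch of such a generator (illustrative, not the library's code): subsets are enumerated by increasing size with itertools.combinations.

```python
from itertools import combinations

def generate_subsets(S):
    """Yield every subset of S, smallest first (empty set included)."""
    elems = sorted(S)
    for size in range(len(elems) + 1):
        for subset in combinations(elems, size):
            yield set(subset)

print(list(generate_subsets({1, 2})))  # [set(), {1}, {2}, {1, 2}]
```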

id2samples: Dict[NodeId, List]
learnCLG()

First, use the PC algorithm to learn the structure of the CLG model; then estimate its parameters; finally, create and return the learned CLG model.

Returns:

learned_clg – The learned CLG model.

Return type:

CLG

r_XYZ: Dict[Tuple[FrozenSet[NodeId], FrozenSet[NodeId]], List[float]]
sepset: Dict[Tuple[NodeId, NodeId], Set[NodeId]]
supremum_deviation(n_sample, fwer_delta)

Use the n-MCERA to compute the supremum deviation.

Parameters:
  • n_sample (int) – The MC number n in n-MCERA.

  • fwer_delta (float in (0, 1]) – The FWER threshold.

Returns:

SD – The supremum deviation.

Return type:

float

test_indep(X, Y, Z)

Perform a standard statistical test and use Bonferroni correction to correct for multiple hypothesis testing.

Parameters:
  • X (NodeId) – The id of the first variable tested.

  • Y (NodeId) – The id of the second variable tested.

  • Z (Set[NodeId]) – The set of ids of the conditioning variables.

Returns:

True if X and Y are independent given Z, False otherwise.

Return type:

bool
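The Bonferroni idea itself is simple: with m simultaneous tests and a target family-wise error rate delta, each individual test is run at level delta/m. A small sketch (illustrative, not the library's code):

```python
def bonferroni_reject(p_values, delta=0.05):
    """Reject the hypotheses whose p-value clears the Bonferroni-corrected
    threshold delta / m, where m is the number of simultaneous tests."""
    m = len(p_values)
    return [p <= delta / m for p in p_values]

# Three tests at FWER 0.05: each is held to the level 0.05/3 ~ 0.0167.
print(bonferroni_reject([0.001, 0.02, 0.4]))  # [True, False, False]
```

Note that 0.02 would be rejected at the uncorrected level 0.05 but survives after correction.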

three_rules(C, verbose=False)

This function is the third step of the PC algorithm: orient as many of the remaining undirected edges as possible by repeated application of the three orientation rules.

Parameters:
  • C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.

  • verbose (bool) – Whether to print the process of this function.

Returns:

C – The final skeleton (of Step3).

Return type:

Dict[NodeId, Set[NodeId]]