Learning a CLG
One of the main features of this library is the ability to learn a CLG from data.
- More precisely, what can be learned is:
The dependency graph of a CLG
The parameters of a CLG: the mu and sigma of each variable, and the coefficients of the arcs
Learning the graph
To learn the graph of a CLG (i.e. the dependencies between variables), we use a modified PC algorithm based on the work of Diego Colombo, Marloes H. Maathuis: Order-Independent Constraint-Based Causal Structure Learning (2014).
The independence test used is based on the work of Dario Simionato, Fabio Vandin: Bounding the Family-Wise Error Rate in Local Causal Discovery using Rademacher Averages (2022).
- class pyAgrum.clg.learning.CLGLearner(filename, *, n_sample=15, fwer_delta=0.05)
Uses Rademacher averages to guarantee the FWER (Family-Wise Error Rate) of the independence tests (see "Bounding the Family-Wise Error Rate in Local Causal Discovery using Rademacher Averages", Dario Simionato, Fabio Vandin, 2022).
- Parameters:
filename (str)
n_sample (int)
fwer_delta (float)
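As an orientation, the snippet below sketches how one might prepare a sample file for CLGLearner. The variable names and the linear-Gaussian generating process are illustrative assumptions, and the commented-out calls only show the intended call pattern; they are not guaranteed to match every pyAgrum version.

```python
import csv
import random

random.seed(0)

# Illustrative ground truth (an assumption, not from the library):
# A ~ N(5, 1), B = 2*A + N(0, 1), C = -1.5*B + N(0, 1)
rows = []
for _ in range(500):
    a = random.gauss(5, 1)
    b = 2 * a + random.gauss(0, 1)
    c = -1.5 * b + random.gauss(0, 1)
    rows.append((a, b, c))

with open("clg_samples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["A", "B", "C"])  # one column per variable
    writer.writerows(rows)

# Hypothetical usage (requires pyAgrum with the clg module installed):
# from pyAgrum.clg.learning import CLGLearner
# learner = CLGLearner("clg_samples.csv", n_sample=15, fwer_delta=0.05)
# clg = learner.learnCLG()
```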
- Adjacency_search(order, verbose=False)
This function is the first step of the PC algorithm: adjacency search. It applies the independence test to remove edges between (conditionally) independent variables and build the skeleton.
- Parameters:
order (List[NodeId]) – A particular order of the Nodes.
verbose (bool) – Whether to print the process of Adjacency Search.
- Returns:
C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
sepset (Dict[Tuple[NodeId, NodeId], Set[NodeId]]) – The separation sets (used in steps 2 and 3 of the PC algorithm).
- PC_algorithm(order, verbose=False)
This function is an advanced version of the PC algorithm: it replaces indep_test() with Indep_test_Rademacher(), and orients the remaining undirected edges of the skeleton C by comparing the variances of the two endpoint nodes.
- Parameters:
order (List[NodeId]) – A particular order of the Nodes.
verbose (bool) – Whether to print the process of the PC algorithm.
- Returns:
C – A DAG representing the causal structure.
- Return type:
Dict[NodeId, Set[NodeId]]
- Pearson_coeff(X, Y, Z)
Estimate Pearson’s linear correlation coefficient (using linear regression when Z is not empty).
- Parameters:
X (NodeId) – The id of the first variable tested.
Y (NodeId) – The id of the second variable tested.
Z (Set[NodeId]) – The conditioned variable’s id set.
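To illustrate the idea (a stdlib-only sketch, not the library's internal code): when Z is non-empty, the partial correlation of X and Y given Z can be obtained by regressing both X and Y on Z and correlating the residuals. Here is the single-conditioning-variable case:

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, zs):
    """Residuals of the least-squares regression of ys on zs."""
    mz, my = mean(zs), mean(ys)
    beta = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum((z - mz) ** 2 for z in zs)
    return [y - (my + beta * (z - mz)) for y, z in zip(ys, zs)]

def partial_corr(xs, ys, zs):
    """Correlation of X and Y after removing the linear effect of Z."""
    return pearson(residuals(xs, zs), residuals(ys, zs))

# X and Y both driven by Z: strongly correlated marginally,
# nearly uncorrelated once Z is regressed out.
random.seed(1)
zs = [random.gauss(0, 1) for _ in range(2000)]
xs = [z + random.gauss(0, 0.5) for z in zs]
ys = [z + random.gauss(0, 0.5) for z in zs]
print(abs(pearson(xs, ys)) > 0.5)        # marginal correlation is large
print(abs(partial_corr(xs, ys, zs)) < 0.1)  # vanishes given Z
```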
- RAveL_MB(T)
Find the Markov Boundary of variable T with FWER lower than Delta.
- Parameters:
T (NodeId) – The id of the target variable T.
- Returns:
MB – The Markov Boundary of variable T with FWER lower than Delta.
- Return type:
Set[NodeId]
- RAveL_PC(T)
Find the Parent-Children of variable T with FWER lower than Delta.
- Parameters:
T (NodeId) – The id of the target variable T.
- Returns:
The Parent-Children of variable T with FWER lower than Delta.
- Return type:
Set[NodeId]
- Repeat_II(order, C, l, verbose=False)
This function is the second part of Step 1 of the PC algorithm.
- Parameters:
order (List[NodeId]) – The order of the variables.
C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
l (int) – The size of the sepset
verbose (bool) – Whether to print.
- Returns:
found_edge – True if a new edge is found, False if not.
- Return type:
bool
- Step4(C, verbose=False)
This function is the fourth step of the PC algorithm. Orient the remaining undirected edges by comparing the variances of the two endpoint nodes.
- Parameters:
C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
verbose (bool) – Whether to print the process of Step4.
- Returns:
C (Dict[NodeId, Set[NodeId]]) – The final skeleton (of Step4).
new_oriented (bool) – Whether there is a new edge oriented in the fourth step.
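A sketch of a variance-based orientation rule, under the assumption that the arc points from the lower-variance node to the higher-variance node (in a linear-Gaussian model, a child typically accumulates its parent's variance plus its own noise). This is an illustration of the heuristic, not the library's code:

```python
import random
from statistics import variance

def orient_by_variance(samples_u, samples_v):
    """Illustrative rule (an assumption about the heuristic): point the
    arc from the lower-variance node to the higher-variance node."""
    return "u->v" if variance(samples_u) < variance(samples_v) else "v->u"

random.seed(2)
us = [random.gauss(0, 1) for _ in range(1000)]
vs = [2 * u + random.gauss(0, 1) for u in us]  # v = 2u + noise, so var(v) ≈ 5
print(orient_by_variance(us, vs))  # "u->v" under this heuristic
```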
- estimate_parameters(C)
This function is used to estimate the parameters of the CLG model.
- Parameters:
C (Dict[NodeId, Set[NodeId]]) – A DAG representing the causal structure.
- Returns:
id2mu (Dict[NodeId, float]) – The estimated mean of each node.
id2sigma (Dict[NodeId, float]) – The estimated variance of each node.
arc2coef (Dict[Tuple[NodeId, NodeId], float]) – The estimated coefficients of each arc.
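For intuition (a stdlib-only sketch, not the library's implementation): for a node Y with a single parent X in a linear-Gaussian model Y = mu + coef·X + N(0, sigma²), the three quantities estimated above can be recovered by least squares:

```python
import random
from statistics import mean

random.seed(3)
xs = [random.gauss(5, 1) for _ in range(5000)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5) for x in xs]  # true mu=1, coef=2, sigma=0.5

mx, my = mean(xs), mean(ys)
# Arc coefficient: cov(X, Y) / var(X)
coef = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
# Intercept (the node's mu)
mu = my - coef * mx
# Residual standard deviation (the node's sigma)
resid = [y - (mu + coef * x) for x, y in zip(xs, ys)]
sigma = (sum(r * r for r in resid) / len(resid)) ** 0.5

print(round(coef, 1), round(mu, 1), round(sigma, 1))  # close to 2.0, 1.0, 0.5
```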
- fitParameters(clg)
In this function, we fit the parameters of the CLG model.
- Parameters:
clg (CLG) – The CLG model whose parameters are to be updated.
- static generate_XYZ(l)
Find all the possible combinations of X, Y and Z.
- Returns:
All the possible combinations of X, Y and Z.
- Return type:
List[Tuple[Set[NodeId], Set[NodeId]]]
- static generate_subsets(S)
Generator that iterates over all the subsets of S (from the smallest to the largest).
- Parameters:
S (Set[NodeId]) – The set of variables.
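A sketch of what such a generator can look like (illustrative, using itertools.combinations; the library's actual implementation may differ):

```python
from itertools import combinations

def subsets_by_size(S):
    """Yield every subset of S, from the empty set up to S itself."""
    items = sorted(S)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

print(list(subsets_by_size({1, 2})))  # [set(), {1}, {2}, {1, 2}]
```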
- id2samples: Dict[NodeId, List]
- learnCLG()
First use PC algorithm to learn the skeleton of the CLG model. Then estimate the parameters of the CLG model. Finally create a CLG model and return it.
- Returns:
learned_clg – The learned CLG model.
- Return type:
CLG
- r_XYZ: Dict[Tuple[FrozenSet[NodeId], FrozenSet[NodeId]], List[float]]
- sepset: Dict[Tuple[NodeId, NodeId], Set[NodeId]]
- supremum_deviation(n_sample, fwer_delta)
Use n-MCERA to get supremum deviation.
- Parameters:
n_sample (int) – The MC number n in n-MCERA.
fwer_delta (float ∈ (0,1]) – The FWER threshold.
- Returns:
SD – The supremum deviation.
- Return type:
float
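For intuition, the n-MCERA of a family of functions F on a sample is a Monte-Carlo estimate of the empirical Rademacher average: draw n vectors of random signs and, for each, take the supremum over F of the signed empirical mean. A stdlib-only sketch (illustrative only; the supremum deviation bound actually used also involves fwer_delta-dependent terms):

```python
import random

def n_mcera(fam_values, n):
    """fam_values: one list of m per-sample values for each function in the family.
    Returns the n-MCERA: average over n sign draws of sup_f (1/m) sum_i s_i * f(x_i)."""
    m = len(fam_values[0])
    total = 0.0
    for _ in range(n):
        signs = [random.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(signs, vals)) / m for vals in fam_values)
    return total / n

random.seed(4)
# Two illustrative functions evaluated on 200 sample points:
f1 = [random.uniform(-1, 1) for _ in range(200)]
f2 = [0.0] * 200  # the zero function keeps the supremum non-negative
est = n_mcera([f1, f2], n=15)
print(0.0 <= est <= 1.0)  # the estimate stays small for this simple family
```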
- test_indep(X, Y, Z)
Perform a standard statistical test and use Bonferroni correction to correct for multiple hypothesis testing.
- Parameters:
X (NodeId) – The id of the first variable tested.
Y (NodeId) – The id of the second variable tested.
Z (Set[NodeId]) – The conditioned variable’s id set.
- Returns:
True if X and Y are independent given Z, False otherwise.
- Return type:
bool
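A sketch of the Bonferroni idea (illustrative, not the library's code): when k hypotheses are tested, each individual test is run at level delta/k, which guarantees a family-wise error rate of at most delta:

```python
def bonferroni_reject(p_values, delta=0.05):
    """Reject H0_i iff p_i < delta / k, guaranteeing FWER <= delta."""
    k = len(p_values)
    return [p < delta / k for p in p_values]

# Three tests at family level 0.05: the per-test threshold is 0.05/3 ≈ 0.0167.
print(bonferroni_reject([0.001, 0.02, 0.30]))  # [True, False, False]
```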
- three_rules(C, verbose=False)
This function is the third step of the PC algorithm. Orient as many of the remaining undirected edges as possible by repeated application of the three orientation rules.
- Parameters:
C (Dict[NodeId, Set[NodeId]]) – The temporary skeleton.
verbose (bool) – Whether to print the process of this function.
- Returns:
C – The final skeleton (of Step3).
- Return type:
Dict[NodeId, Set[NodeId]]