Discretizer for Bayesian networks
- class pyAgrum.skbn.BNDiscretizer(defaultDiscretizationMethod='quantile', defaultNumberOfBins=10, discretizationThreshold=25)
A tool for discretizing selected variables in a database so that a pyAgrum (discrete) Bayesian network can be learned from it.
- parameters:
- defaultDiscretizationMethod: str
sets the default method of discretization for this discretizer. Possible values are: ‘quantile’, ‘uniform’, ‘kmeans’, ‘NML’, ‘CAIM’ and ‘MDLP’. This method will be used if the user has not specified another method for that specific variable using the setDiscretizationParameters method.
- defaultNumberOfBins: str or int
sets the number of bins if the method used is quantile, kmeans or uniform. In that case, this parameter can also be set to the string ‘elbowMethod’ so that the best number of bins is found automatically. If the method used is NML, this parameter sets the maximum number of bins up to which the NML algorithm searches for the optimal number of bins; in that case it must be an int. If any other discretization method is used, this parameter is ignored.
- discretizationThreshold: int or float
When using default parameters, a variable is treated as continuous only if it has more unique values than this number (if the number is an int greater than 1). If the number is a float between 0 and 1, the test is instead whether the proportion of unique values exceeds this number. For example, with 0.95, the variable is treated as continuous only if more than 95% of its values are unique.
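The threshold rule can be illustrated with a small standalone sketch. The helper `is_treated_as_continuous` below is hypothetical and is not part of pyAgrum; it only mirrors the rule as described for `discretizationThreshold`, and the library's actual implementation may differ in details.

```python
def is_treated_as_continuous(values, threshold=25):
    """Hypothetical helper mirroring the documented threshold rule."""
    n_unique = len(set(values))
    if isinstance(threshold, float) and 0 < threshold < 1:
        # Fractional threshold: compare the proportion of unique values.
        return n_unique / len(values) > threshold
    # Integer threshold: compare the raw count of unique values.
    return n_unique > threshold


ages = [20 + (i % 30) for i in range(200)]   # 30 distinct values
flags = [i % 2 for i in range(200)]          # 2 distinct values

print(is_treated_as_continuous(ages))        # True: 30 > 25
print(is_treated_as_continuous(flags))       # False: 2 <= 25
print(is_treated_as_continuous(ages, 0.95))  # False: only 15% of values are unique
```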
- audit(X, y=None)
- parameters:
- X: {array-like, sparse matrix} of shape (n_samples, n_features)
training data
- y: array-like of shape (n_samples,)
Target values
- returns:
auditDict: dict()
Audits the passed values of X and y. Reports which columns of X are considered already discrete and which need to be discretized, along with the discretization algorithm that will be used for each. The suggested parameters will be used when creating the variables. To change them, set the discretization parameters for each variable manually with the setDiscretizationParameters method.
- clear(clearDiscretizationParameters=False)
- parameters:
- clearDiscretizationParameters: bool
if True, this method also clears the parameters the user has set for each variable and resets them to the default.
- returns:
void
Sets the number of continuous variables and the total number of bins created by this discretizer to 0. If clearDiscretizationParameters is True, also clears the discretization parameters the user has set for each variable.
- createVariable(variableName, X, y=None, possibleValuesY=None)
- parameters:
- variableName:
the name of the created variable
- X: ndarray shape(n,1)
A column vector containing n samples of a feature. The column for which the variable will be created
- y: ndarray shape(n,1)
A column vector containing the corresponding class for each element in X.
- possibleValuesX: onedimensional ndarray
An ndarray containing all the unique values of X
- possibleValuesY: onedimensional ndarray
An ndarray containing all the unique values of y
- returnModifiedX: bool
X could be modified by this function during
- returns:
- var: pyAgrum.DiscreteVariable
the created variable
Creates a variable for the column passed in as a parameter and places it in the Bayesian network
- discretizationCAIM(x, y, possibleValuesX, possibleValuesY)
- parameters:
- x: ndarray with shape (n,1) where n is the number of samples
Column-vector that contains all the data that needs to be discretized
- y: ndarray with shape (n,1) where n is the number of samples
Column-vector that contains the class for each sample. This vector will not be discretized, but the class-value of each sample is needed to properly apply the algorithm
- possibleValuesX: one dimensional ndarray
Contains all the possible values that x can take, sorted in increasing order. It must not contain duplicates.
- possibleValuesY: one dimensional ndarray
Contains the possible values of y. There should be two possible values since this is a binary classifier
- returns:
binEdges: a list of the edges of the bins that are chosen by this algorithm
Applies the CAIM algorithm to discretize the values of x
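As background, the CAIM algorithm greedily adds bin edges that increase the CAIM criterion, which rewards intervals dominated by a single class. The sketch below only evaluates that criterion for a given set of edges; it is not the full greedy search and not pyAgrum's code. The function name `caim_score` and the use of flat Python lists (instead of (n,1) ndarrays) are assumptions made for illustration.

```python
from collections import Counter


def caim_score(x, y, edges):
    """CAIM criterion for the intervals defined by `edges`.

    `edges` is sorted and includes both the minimum and maximum of x.
    Each interval is [lo, hi), except the last one, which also contains hi.
    """
    n_intervals = len(edges) - 1
    total = 0.0
    for r in range(n_intervals):
        lo, hi = edges[r], edges[r + 1]
        is_last = r == n_intervals - 1
        # Class counts of the samples that fall in interval r.
        counts = Counter(
            c for v, c in zip(x, y) if lo <= v < hi or (is_last and v == hi)
        )
        size = sum(counts.values())
        if size:
            # Squared count of the dominant class, divided by the interval size.
            total += max(counts.values()) ** 2 / size
    return total / n_intervals


x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
print(caim_score(x, y, [1, 3.5, 6]))  # 3.0   (both intervals are pure)
print(caim_score(x, y, [1, 2.5, 6]))  # 2.125 (second interval mixes classes)
```

The edge at 3.5 separates the two classes perfectly and therefore scores higher than the edge at 2.5.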
- discretizationElbowMethodRotation(discretizationStrategy, X)
- parameters:
- discretizationStrategy: str
The method of discretization that will be used. Possible values are: ‘quantile’ , ‘kmeans’ and ‘uniform’
- X: one dimensional ndarray
Contains the data that should be discretized
- returns:
binEdges: the edges of the bins the algorithm has chosen.
Calculates the sum of squared errors as a function of the number of clusters using the discretization strategy that is passed as a parameter. Returns the bins that are optimal for minimizing the variation and the number of bins at the same time. Uses the elbow method to find this optimal point. To find the “elbow” we rotate the curve and look for its minimum.
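The "rotate the curve and look for its minimum" step can be sketched as follows. This is one possible reading of the description, not pyAgrum's implementation: the function name `elbow_by_rotation`, the normalization of both axes to [0, 1], and the choice of rotating so that the chord between the first and last points becomes horizontal are all assumptions.

```python
import math


def elbow_by_rotation(ks, errors):
    """Return the k at the elbow of a decreasing error curve.

    ks:     candidate numbers of bins, in increasing order
    errors: sum of squared errors obtained for each k
    """
    # Normalize both axes to [0, 1] so the result does not depend on scale.
    k_first, k_last = ks[0], ks[-1]
    e_min, e_max = min(errors), max(errors)
    xs = [(k - k_first) / (k_last - k_first) for k in ks]
    ys = [(e - e_min) / (e_max - e_min) for e in errors]

    # Angle of the chord joining the first and last points of the curve.
    theta = math.atan2(ys[-1] - ys[0], xs[-1] - xs[0])

    # Rotate by -theta so the chord is horizontal; keep only the new y values.
    rotated = [-x * math.sin(theta) + y * math.cos(theta) for x, y in zip(xs, ys)]

    # The elbow is the point that lies furthest below the chord.
    best = min(range(len(ks)), key=rotated.__getitem__)
    return ks[best]


ks = [1, 2, 3, 4, 5, 6]
errors = [100, 40, 15, 12, 10, 9]
print(elbow_by_rotation(ks, errors))  # 3
```

With these illustrative error values, going beyond 3 bins barely reduces the error, so 3 is reported as the elbow.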
- discretizationMDLP(x, y, possibleValuesX, possibleValuesY)
- parameters:
- x: ndarray with shape (n,1) where n is the number of samples
Column-vector that contains all the data that needs to be discretized
- y: ndarray with shape (n,1) where n is the number of samples
Column-vector that contains the class for each sample. This vector will not be discretized, but the class-value of each sample is needed to properly apply the algorithm
- possibleValuesX: one dimensional ndarray
Contains all the possible values that x can take, sorted in increasing order. It must not contain duplicates.
- possibleValuesY: one dimensional ndarray
Contains the possible values of y. There should be two possible values since this is a binary classifier
- returns:
binEdges: a list of the edges of the bins that are chosen by this algorithm
Uses the MDLP algorithm described in Fayyad, 1995 to discretize the values of x.
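For orientation, a single step of an MDLP-style discretization picks the cut point with the highest class-entropy gain and keeps it only if the gain passes the minimum description length test; the full algorithm then recurses on each side. The sketch below shows one such step using the acceptance criterion usually attributed to Fayyad and Irani. The function names and the use of flat Python lists are assumptions, and pyAgrum's implementation may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_cut(x, y):
    """Return (cut, gain, accepted) for the best binary split, or None."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    ent_s = entropy(ys)

    best = None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # a cut must fall between two distinct values
        left, right = ys[:i], ys[i:]
        gain = ent_s - (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or gain > best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, gain, left, right)
    if best is None:
        return None

    cut, gain, left, right = best
    # MDLP acceptance test: the gain must outweigh the cost of encoding the cut.
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * entropy(left) - k2 * entropy(right))
    accepted = gain > (math.log2(n - 1) + delta) / n
    return cut, gain, accepted


x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [0, 0, 0, 0, 1, 1, 1, 1]
cut, gain, accepted = best_cut(x, y)
print(cut, gain, accepted)  # 7.0 1.0 True
```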
- discretizationNML(X, possibleValuesX, kMax=10, epsilon=None)
- parameters:
- X: one dimensional ndarray
array that contains all the data that needs to be discretized
- possibleValuesX: one dimensional ndarray
Contains all the possible values that x can take, sorted in increasing order. It must not contain duplicates.
- kMax: int
the maximum number of bins before the algorithm stops itself.
- epsilon: float or None
the value of epsilon used in the algorithm. Should be as small as possible. If None is passed the value is automatically calculated.
- returns:
binEdges: a list of the edges of the bins that are chosen by this algorithm
Uses the discretization algorithm described in “MDL Histogram Density Estimation”, Kontkanen and Myllymäki, 2007 to discretize.
- setDiscretizationParameters(variableName=None, method=None, numberOfBins=None)
- parameters:
- variableName: str
the name of the variable whose discretization parameters you want to set. Set to None to set the new default for this BNDiscretizer.
- method: str
The method of discretization used for this variable. Type “NoDiscretization” if you do not want to discretize this variable. Possible values are: ‘NoDiscretization’, ‘quantile’, ‘uniform’, ‘kmeans’, ‘NML’, ‘CAIM’ and ‘MDLP’
- numberOfBins:
sets the number of bins if the method used is quantile, kmeans or uniform. In that case, this parameter can also be set to the string ‘elbowMethod’ so that the best number of bins is found automatically. If the method used is NML, this parameter sets the maximum number of bins up to which the NML algorithm searches for the optimal number of bins; in that case it must be an int. If any other discretization method is used, this parameter is ignored.
- returns:
void
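To make the difference between the ‘uniform’ and ‘quantile’ methods concrete, the sketch below computes bin edges both ways on a small sample. It is a standalone illustration: the helpers `uniform_edges` and `quantile_edges` are hypothetical, the quantile interpolation rule is an assumption, and the edges produced by pyAgrum may not match exactly.

```python
def uniform_edges(data, n_bins):
    """Equal-width bins between the minimum and maximum of the data."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def quantile_edges(data, n_bins):
    """Bins holding roughly the same number of samples (linear interpolation)."""
    s = sorted(data)
    edges = []
    for i in range(n_bins + 1):
        pos = i * (len(s) - 1) / n_bins   # fractional index into the sorted data
        low = int(pos)
        high = min(low + 1, len(s) - 1)
        edges.append(s[low] + (pos - low) * (s[high] - s[low]))
    return edges


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]      # one large outlier
print(uniform_edges(data, 4))             # [1.0, 25.75, 50.5, 75.25, 100.0]
print(quantile_edges(data, 4))            # [1.0, 3.0, 5.0, 7.0, 100.0]
```

With an outlier in the data, equal-width bins leave most samples in the first bin, whereas quantile bins stay balanced, which is one reason to choose the method per variable.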