Discretizer for Bayesian networks

class pyAgrum.skbn.BNDiscretizer(defaultDiscretizationMethod='quantile', defaultNumberOfBins=10, discretizationThreshold=25)

Represents a tool to discretize some variables in a database so that a (discrete) pyAgrum Bayesian network can be learned from it.

parameters:
defaultDiscretizationMethod: str

sets the default method of discretization for this discretizer. Possible values are: ‘quantile’, ‘uniform’, ‘kmeans’, ‘NML’, ‘CAIM’ and ‘MDLP’. This method will be used if the user has not specified another method for that specific variable using the setDiscretizationParameters method.

defaultNumberOfBins: str or int

sets the number of bins if the method used is ‘quantile’, ‘kmeans’ or ‘uniform’. In this case this parameter can also be set to the string ‘elbowMethod’ so that the best number of bins is found automatically. If the method used is ‘NML’, this parameter must be an int and sets the maximum number of bins up to which the NML algorithm searches for the optimal number of bins. If any other discretization method is used, this parameter is ignored.

discretizationThreshold: int or float

When using the default parameters, a variable is treated as continuous only if it has more unique values than this number (if the number is an int greater than 1). If the number is a float between 0 and 1, the variable is treated as continuous only if the proportion of unique values among its samples exceeds this number. For example, with 0.95 the variable is treated as continuous only if more than 95% of its values are unique.
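The threshold rule can be sketched as follows. This is an illustrative re-implementation, not pyAgrum's code; the function name `looks_continuous` is invented here:

```python
# Illustrative sketch of the discretizationThreshold rule (not pyAgrum's code).
# An int threshold compares the count of unique values; a float in (0, 1)
# compares the proportion of unique values to the sample size.

def looks_continuous(column, threshold):
    """Return True if the column should be treated as continuous."""
    n_unique = len(set(column))
    if isinstance(threshold, int) and threshold > 1:
        return n_unique > threshold
    # float between 0 and 1: proportion of unique values
    return n_unique / len(column) > threshold

# A column of 4 repeated labels stays discrete under the default threshold of 25:
categorical = [0, 1, 2, 3] * 50
# 200 distinct measurements exceed 25 unique values, so they are continuous:
measurements = [i * 0.37 for i in range(200)]
print(looks_continuous(categorical, 25))     # False
print(looks_continuous(measurements, 25))    # True
print(looks_continuous(measurements, 0.95))  # True: 100% unique > 95%
```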

audit(X, y=None)
parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

Training data

y: array-like of shape (n_samples,)

Target values

returns:

auditDict: dict()

Audits the passed values of X and y: reports which columns of X are considered already discrete and which need to be discretized, together with the discretization algorithm that will be used for each. The suggested parameters will be used when creating the variables. To change them, the user can manually set discretization parameters for each variable using the setDiscretizationParameters method.
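The per-column decision audit makes can be sketched roughly as follows. This is an illustrative simplification, not pyAgrum's implementation; the name `audit_columns` and the report keys are invented here:

```python
# Hypothetical sketch of the audit decision (illustrative only, not
# pyAgrum's implementation; audit_columns and the report keys are invented).

def audit_columns(columns, threshold=25, default_method="quantile"):
    """columns: dict mapping a column name to its list of values."""
    report = {}
    for name, values in columns.items():
        if len(set(values)) > threshold:
            # too many distinct values: treat as continuous and discretize
            report[name] = {"treated_as": "continuous", "method": default_method}
        else:
            report[name] = {"treated_as": "discrete", "method": "NoDiscretization"}
    return report

data = {"age": [float(a) for a in range(120)],  # 120 unique values > 25
        "smoker": [0, 1] * 60}                  # 2 unique values
for name, info in audit_columns(data).items():
    print(name, info)
```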

clear(clearDiscretizationParameters=False)
parameters:
clearDiscretizationParameters: bool

if True, this method also clears the parameters the user has set for each variable and resets them to the default.

returns:

void

Resets the number of continuous variables and the total number of bins created by this discretizer to 0. If clearDiscretizationParameters is True, also clears the discretization parameters the user has set for each variable.

createVariable(variableName, X, y=None, possibleValuesY=None)
parameters:
variableName:

the name of the created variable

X: ndarray shape(n,1)

A column vector containing n samples of a feature. The column for which the variable will be created

y: ndarray shape(n,1)

A column vector containing the target value corresponding to each element in X.

possibleValuesX: onedimensional ndarray

An ndarray containing all the unique values of X

possibleValuesY: onedimensional ndarray

An ndarray containing all the unique values of y

returnModifiedX: bool

if True, the function also returns X, which may have been modified during discretization

returns:
var: pyagrum.DiscreteVariable

the created variable

Creates a variable for the column passed in as a parameter and places it in the Bayesian network.

discretizationCAIM(x, y, possibleValuesX, possibleValuesY)
parameters:
x: ndarray with shape (n,1) where n is the number of samples

Column-vector that contains all the data that needs to be discretized

y: ndarray with shape (n,1) where n is the number of samples

Column-vector that contains the class for each sample. This vector will not be discretized, but the class-value of each sample is needed to properly apply the algorithm

possibleValuesX: one dimensional ndarray

Contains all the possible values that x can take, sorted in increasing order, without duplicates

possibleValuesY: one dimensional ndarray

Contains the possible values of y. There should be exactly two possible values, since this algorithm applies to binary classification

returns:

binEdges: a list of the edges of the bins that are chosen by this algorithm

Applies the CAIM algorithm to discretize the values of x
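The CAIM criterion and its greedy boundary search can be sketched as follows. This is an illustrative re-implementation under the standard formulation of CAIM (Kurgan and Cichosz), not pyAgrum's internal code:

```python
import numpy as np

# Illustrative re-implementation of the CAIM criterion and greedy search
# (not pyAgrum's internal code).

def caim_value(edges, x, y, classes):
    """CAIM = (1/n) * sum over intervals of max_class_count^2 / interval_size."""
    total = 0.0
    n_intervals = len(edges) - 1
    for i in range(n_intervals):
        lo, hi = edges[i], edges[i + 1]
        mask = (x >= lo) & (x <= hi) if i == 0 else (x > lo) & (x <= hi)
        m = mask.sum()
        if m == 0:
            continue
        max_q = max((y[mask] == c).sum() for c in classes)
        total += max_q ** 2 / m
    return total / n_intervals

def caim_discretize(x, y):
    x, y = np.asarray(x, float).ravel(), np.asarray(y).ravel()
    classes, vals = np.unique(y), np.unique(x)
    candidates = list((vals[:-1] + vals[1:]) / 2.0)  # midpoints between values
    edges, best, k = [vals[0], vals[-1]], 0.0, 0
    while candidates:
        scored = [(caim_value(sorted(edges + [c]), x, y, classes), c)
                  for c in candidates]
        score, c = max(scored)
        k += 1
        # accept while CAIM improves, or until every class has an interval
        if score > best or k < len(classes):
            edges = sorted(edges + [c])
            candidates.remove(c)
            best = score
        else:
            break
    return [float(e) for e in edges]

x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(caim_discretize(x, y))  # → [1.0, 7.0, 13.0]: one cut between the classes
```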

discretizationElbowMethodRotation(discretizationStrategy, X)
parameters:
discretizationStrategy: str

The method of discretization that will be used. Possible values are: ‘quantile’ , ‘kmeans’ and ‘uniform’

X: one dimensional ndarray

Contains the data that should be discretized

returns:

binEdges: the edges of the bins the algorithm has chosen.

Calculates the sum of squared errors as a function of the number of clusters using the discretization strategy that is passed as a parameter. Returns the bins that are optimal for minimizing the variation and the number of bins at the same time. Uses the elbow method to find this optimal point. To find the “elbow” we rotate the curve and look for its minimum.
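The rotation idea can be sketched as follows, using quantile binning for the SSE curve. This is an illustrative sketch, not pyAgrum's implementation; the helpers `sse_for_k` and `elbow_k` are invented here:

```python
import numpy as np

# Illustrative sketch of the "rotated elbow" idea (not pyAgrum's code).

def sse_for_k(X, k):
    """Sum of squared distances of each point to its bin mean (quantile bins)."""
    edges = np.quantile(X, np.linspace(0, 1, k + 1))
    bins = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, k - 1)
    return sum(((X[bins == b] - X[bins == b].mean()) ** 2).sum()
               for b in range(k) if np.any(bins == b))

def elbow_k(X, k_max=10):
    X = np.asarray(X, float)
    ks = np.arange(1, k_max + 1)
    sse = np.array([sse_for_k(X, k) for k in ks])
    # rotate the (k, SSE) curve so the chord from first to last point is
    # horizontal; the elbow is then the minimum of the rotated curve
    angle = np.arctan2(sse[-1] - sse[0], ks[-1] - ks[0])
    rotated = -np.sin(angle) * (ks - ks[0]) + np.cos(angle) * (sse - sse[0])
    return int(ks[np.argmin(rotated)])

rng = np.random.default_rng(0)
# three well-separated clusters: the elbow should land near k = 3
X = np.concatenate([rng.normal(0, .5, 100), rng.normal(10, .5, 100),
                    rng.normal(20, .5, 100)])
print(elbow_k(X))
```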

discretizationMDLP(x, y, possibleValuesX, possibleValuesY)
parameters:
x: ndarray with shape (n,1) where n is the number of samples

Column-vector that contains all the data that needs to be discretized

y: ndarray with shape (n,1) where n is the number of samples

Column-vector that contains the class for each sample. This vector will not be discretized, but the class-value of each sample is needed to properly apply the algorithm

possibleValuesX: one dimensional ndarray

Contains all the possible values that x can take, sorted in increasing order, without duplicates

possibleValuesY: one dimensional ndarray

Contains the possible values of y. There should be exactly two possible values, since this algorithm applies to binary classification

returns:

binEdges: a list of the edges of the bins that are chosen by this algorithm

Uses the MDLP algorithm described in Fayyad, 1995 to discretize the values of x.
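The core of MDLP, recursive entropy-minimizing splits accepted only when the information gain beats the MDL cost of encoding the cut, can be sketched as follows. This is an illustrative re-implementation of the published criterion, not pyAgrum's code:

```python
import math

# Illustrative sketch of the MDLP stopping rule (not pyAgrum's code):
# recursively pick the entropy-minimizing cut, keep it only if the
# information gain exceeds the MDL cost of encoding the cut.

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def mdlp_cuts(xs, ys):
    pairs = sorted(zip(xs, ys))
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    n, ent_s = len(ys), entropy(ys)
    best = None
    for i in range(1, n):                      # candidate cut positions
        if xs[i] == xs[i - 1]:
            continue
        e = (i * entropy(ys[:i]) + (n - i) * entropy(ys[i:])) / n
        if best is None or e < best[0]:
            best = (e, i)
    if best is None:
        return []
    e, i = best
    gain = ent_s - e
    left, right = ys[:i], ys[i:]
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * entropy(left)
                                     - k2 * entropy(right))
    if gain <= (math.log2(n - 1) + delta) / n:  # MDL criterion rejects the cut
        return []
    cut = (xs[i - 1] + xs[i]) / 2
    return mdlp_cuts(xs[:i], ys[:i]) + [cut] + mdlp_cuts(xs[i:], ys[i:])

x = [1, 2, 3, 4, 5, 20, 21, 22, 23, 24]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(mdlp_cuts(x, y))  # → [12.5]: a single cut between the two classes
```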

discretizationNML(X, possibleValuesX, kMax=10, epsilon=None)
parameters:
X: one dimensional ndarray

Array that contains all the data that needs to be discretized

possibleValuesX: one dimensional ndarray

Contains all the possible values that x can take, sorted in increasing order, without duplicates.

kMax: int

the maximum number of bins before the algorithm stops itself.

epsilon: float or None

the value of epsilon used in the algorithm. Should be as small as possible. If None is passed the value is automatically calculated.

returns:

binEdges: a list of the edges of the bins that are chosen by this algorithm

Uses the discretization algorithm described in “MDL Histogram Density Estimation” (Kontkanen and Myllymäki, 2007) to discretize.

setDiscretizationParameters(variableName=None, methode=None, numberOfBins=None)
parameters:
variableName: str

the name of the variable you want to set the discretization parameters of. Set to None to set the new default for this BNClassifier.

methode: str

The method of discretization used for this variable. Type “NoDiscretization” if you do not want to discretize this variable. Possible values are: ‘NoDiscretization’, ‘quantile’, ‘uniform’, ‘kmeans’, ‘NML’, ‘CAIM’ and ‘MDLP’

numberOfBins:

sets the number of bins if the method used is ‘quantile’, ‘kmeans’ or ‘uniform’. In this case this parameter can also be set to the string ‘elbowMethod’ so that the best number of bins is found automatically. If the method used is ‘NML’, this parameter must be an int and sets the maximum number of bins up to which the NML algorithm searches for the optimal number of bins. If any other discretization method is used, this parameter is ignored.

returns:

void