Discretizer for graphical Models

class pyAgrum.lib.discretizer.Discretizer(defaultDiscretizationMethod='quantile', defaultNumberOfBins=10, discretizationThreshold=25)

A tool that discretizes some of the variables in a database so that a pyAgrum (discrete) graphical model can be learned from them.

Warning

  • The data are represented as tabular data (X and possibly y) where the columns are the variables and the rows are the samples. Generally, X can be replaced by the name of a CSV file.

  • In the case of a classification, y is the class variable and X contains the features. y does not have to be binary.

Parameters:
  • defaultDiscretizationMethod (str) – sets the default method of discretization for this discretizer. Possible values are: quantile, uniform, kmeans, NML, CAIM and MDLP. This method will be used if the user has not specified another method for that specific variable using the setDiscretizationParameters method.

  • defaultNumberOfBins (str or int) – sets the number of bins if the method used is quantile, kmeans or uniform. In this case this parameter can also be set to the string elbowMethod so that the best number of bins is found automatically. If the method used is NML, this parameter sets the maximum number of bins up to which the NML algorithm searches for the optimal number of bins. In this case this parameter must be an int. If any other discretization method is used, this parameter is ignored.

  • discretizationThreshold (int or float) – When using default parameters, a variable is treated as continuous only if it has more unique values than this number (if the number is an int greater than 1). If the number is a float between 0 and 1, the test is whether the proportion of unique values is greater than this number. For example, if you have entered 0.95, the variable will be treated as continuous only if more than 95% of its values are unique.
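To make the difference between the uniform and quantile methods concrete, here is a minimal pure-Python sketch of how bin edges could be computed for a given number of bins. It is an illustration only, not pyAgrum's implementation: the helper names `uniform_edges` and `quantile_edges` are hypothetical, and the linear interpolation used for quantiles is an assumption.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part of pyAgrum.

def uniform_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges between the min and max of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def quantile_edges(values, n_bins):
    """Return n_bins + 1 edges so that each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    last = len(ordered) - 1
    edges = []
    for i in range(n_bins + 1):
        # Position of the i-th quantile, linearly interpolated between neighbours
        # (the interpolation scheme is an assumption made for this sketch).
        pos = i * last / n_bins
        below = int(pos)
        above = min(below + 1, last)
        frac = pos - below
        edges.append(ordered[below] * (1 - frac) + ordered[above] * frac)
    return edges


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(uniform_edges(data, 4))   # equal-width bins, stretched by the outlier
print(quantile_edges(data, 4))  # bins that follow the density of the data
```

With an outlier such as 100 in the data, the equal-width edges leave most bins nearly empty, while the quantile edges stay close to where the samples are.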

audit(X, y=None)

Audits the passed values of X and y. Guesses which columns in X are already discrete and which need to be discretized, as well as the discretization algorithm that will be used to discretize them. The suggested parameters will be used when creating the variables. To change this, the user can manually set discretization parameters for each variable using the setDiscretizationParameters function.

Parameters:
  • X ({array-like, pandas or polars dataframe} of shape (n_samples, n_features) or str (filename)) – training data

  • y ({array-like, pandas or polars dataframe} of shape (n_samples,) or str (classname)) – Target values

Returns:

Dict

for each variable, the discretization proposed by the audit
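The kind of decision an audit makes can be sketched in plain Python, following the discretizationThreshold rule described for the constructor. This is an assumption-laden illustration, not the library's code: the function names `exceeds_threshold` and `sketch_audit` and the layout of the returned dict are hypothetical and may differ from what audit actually returns.

```python
# Illustrative sketch only: names and dict layout are hypothetical, not pyAgrum's output.

def exceeds_threshold(values, threshold):
    """Apply the discretizationThreshold rule: True means 'treat as continuous'."""
    n_unique = len(set(values))
    if isinstance(threshold, float) and 0 < threshold < 1:
        # Float between 0 and 1: compare the proportion of unique values.
        return n_unique / len(values) > threshold
    # Otherwise (int greater than 1): compare the number of unique values.
    return n_unique > threshold


def sketch_audit(columns, threshold=25, method="quantile", n_bins=10):
    """Propose, for each column, whether to discretize it and with which method."""
    proposal = {}
    for name, values in columns.items():
        if exceeds_threshold(values, threshold):
            proposal[name] = {"type": "continuous", "method": method, "bins": n_bins}
        else:
            proposal[name] = {"type": "discrete", "method": "NoDiscretization"}
    return proposal


columns = {
    "age": [20 + 0.5 * i for i in range(40)],  # 40 distinct values
    "smoker": [i % 2 for i in range(40)],      # 2 distinct values
}
print(sketch_audit(columns))
```

In this sketch, "age" has more than 25 unique values and is proposed for discretization, while "smoker" is kept as an already discrete variable.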

clear(clearDiscretizationParameters=False)

Sets the number of continuous variables and the total number of bins created by this discretizer to 0. If clearDiscretizationParameters is True, also clears the discretization parameters the user has set for each variable.

Parameters:

clearDiscretizationParameters (bool) – if True, this method also clears the parameters the user has set for each variable and resets them to the default.

discretizedBN(X, y=None, *, possibleValuesY=None, template=None)
discretizedTemplate(X, y=None, *, possibleValuesY=None, template=None)

Returns a graphical model discretized according to the suggestions of the Discretizer for data source X (and for target y). This graphical model only contains the discretized variables. For instance, it can be used as a template for a BNLearner.

Parameters:
  • X ({array-like, sparse matrix, pandas or polars dataframe} of shape (n_samples, n_features) or str (filename)) – training data

  • y ({array-like, pandas or polars dataframe} of shape (n_samples,) or str (classname)) – Target values

  • possibleValuesY (ndarray) – An ndarray containing all the unique values of y

  • template (a graphical model such as pyagrum.BayesNet, pyAgrum.MRF, etc...) – the template that will contain the discretized variables. If None, a new Bayesian network is created.

Returns:

pyagrum.BayesNet or other graphical model:

the discretized graphical model (only (discretized) random variables are created in the model)

Example

>>> discretizer=Discretizer(defaultDiscretizationMethod='uniform',defaultNumberOfBins=7,discretizationThreshold=10)
>>> learner=gum.BNLearner(data,discretizer.discretizedTemplate(data))

setDiscretizationParameters(variableName=None, method=None, paramDiscretizationMethod=None)

Sets the discretization parameters for a variable. If variableName is None, sets the default parameters.

Parameters:
  • variableName (str) – the name of the variable you want to set the discretization parameters of. Set to None to set the new default.

  • method (str) – The method of discretization used for this variable. Use ‘NoDiscretization’ if you do not want to discretize this variable. Possible values are: ‘NoDiscretization’, ‘quantile’, ‘uniform’, ‘kmeans’, ‘NML’, ‘CAIM’, ‘MDLP’ and ‘expert’.

  • paramDiscretizationMethod – Each method of discretization has a parameter that can be set:

      - ‘quantile’: the number of bins.
      - ‘kmeans’, ‘uniform’: the number of bins. The parameter can also be set to the string ‘elbowMethod’ so that the best number of bins is found automatically.
      - ‘NML’: the maximum number of bins up to which the NML algorithm searches for the optimal number of bins.
      - ‘MDLP’, ‘CAIM’: this parameter is ignored.
      - ‘expert’: the set of ticks proposed by the expert. The discretized variable has the ‘empirical’ flag set, which means that values found in the data that fall outside the proposed intervals do not raise an exception but are accepted (as belonging to the smallest or biggest interval).
      - ‘NoDiscretization’: a superset of the values for the variable found in the database.
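The ‘empirical’ behaviour described for the ‘expert’ method can be illustrated with a short pure-Python sketch: values outside the expert's ticks are assigned to the first or last interval instead of raising an exception. This is only a sketch under assumptions, not pyAgrum's implementation: the function name `expert_bin` is hypothetical, and treating intervals as closed on the left is a choice made here for illustration.

```python
# Illustrative sketch only: expert_bin is hypothetical and is not part of pyAgrum.
import bisect


def expert_bin(value, ticks):
    """Return the index of the interval containing value, clamping out-of-range values."""
    ticks = sorted(ticks)
    n_intervals = len(ticks) - 1
    # bisect_right counts the ticks <= value; subtracting 1 gives the interval index
    # (intervals are assumed closed on the left for this sketch).
    index = bisect.bisect_right(ticks, value) - 1
    # Clamp: values below the first tick go to interval 0,
    # values at or above the last tick go to the last interval.
    return max(0, min(index, n_intervals - 1))


ticks = [0, 10, 20, 30]  # three intervals: 0 to 10, 10 to 20, 20 to 30

for v in (-4, 5, 10, 30, 99):
    print(v, "->", expert_bin(v, ticks))
```

Here -4 and 99 lie outside the ticks and are placed in the first and last interval respectively, which mirrors the "accepted as belonging to the smallest or biggest interval" behaviour described above.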