Data Transformation

Discretization

The discretization process assigns a discrete value to each interval of a continuous attribute a, creating a new discrete attribute a’. A discretization algorithm chooses interval boundaries that are likely to preserve as much of the useful information in the original attribute as possible. If the data set is to be used to build a classification model, discretization should preserve the relationship between the class and the discretized attributes.

class nzpyida.analytics.transform.discretization.Discretization(idadb: IdaDataBase)

Bases: object

Generic class for handling data discretization.

Methods

`apply`(in_df, in_bin_df[, keep_org_values, ...])	Apply discretization limits to the given data frame.
`fit`(in_df[, out_table])	Create bin limits based on the given data frame.

apply(in_df: IdaDataFrame, in_bin_df: IdaDataFrame, keep_org_values: bool = False, out_table: str = None) → IdaDataFrame

Apply discretization limits to the given data frame.

Parameters:

in_df: IdaDataFrame: the input data frame
in_bin_df: IdaDataFrame: the data frame with the discretization bins
keep_org_values: bool, optional: whether to keep the original columns. If False (the default), the discretized columns replace the original columns; if True, they are added as new columns whose names are prefixed with ‘disc_’.
out_table: str, optional: the output table or view in which to store the discretized data

Returns:

IdaDataFrame: the data frame with the discretized input data

fit(in_df: IdaDataFrame, out_table: str = None) → IdaDataFrame

Create bin limits based on the given data frame.

Parameters:

in_df: IdaDataFrame: the input data frame
out_table: str, optional: the output table in which to store the discretization bins

Returns:

IdaDataFrame: the data frame with discretization bins

class nzpyida.analytics.transform.discretization.EFDisc(idadb: IdaDataBase, bins: int = 10, bin_precision: float = 0.1)

Bases: Discretization

Discretization with equal frequency of data.

Methods

`apply`(in_df, in_bin_df[, keep_org_values, ...])	Apply discretization limits to the given data frame.
`fit`(in_df[, out_table])	Create bin limits based on the given data frame.

class nzpyida.analytics.transform.discretization.EMDisc(idadb: IdaDataBase, target: str)

Bases: Discretization

Discretization based on minimizing entropy of the data in the target column.

Methods

`apply`(in_df, in_bin_df[, keep_org_values, ...])	Apply discretization limits to the given data frame.
`fit`(in_df[, out_table])	Create bin limits based on the given data frame.

class nzpyida.analytics.transform.discretization.EWDisc(idadb: IdaDataBase, bins: int = 10)

Bases: Discretization

Discretization with equal width and the given number of bins.

Methods

`apply`(in_df, in_bin_df[, keep_org_values, ...])	Apply discretization limits to the given data frame.
`fit`(in_df[, out_table])	Create bin limits based on the given data frame.

nzpyida.analytics.transform.discretization.ef_disc(in_df: IdaDataFrame, bins: int = 10, bin_precision: float = 0.1, keep_org_values: bool = False, out_table: str = None)

Discretizes the given data frame with equal frequency of data. This is a helper function that creates an EFDisc instance and then calls its fit() and apply() methods, returning the output of the latter.

Parameters:

in_df: IdaDataFrame: the input data frame
bins: int, optional: the default number of discretization bins to be calculated, by default 10
bin_precision: float, optional: the precision allowed when considering whether the data records are evenly distributed across the calculated discretization bins, by default 0.1. The number of data records in each bin must be within [iw - bin_precision*iw, iw + bin_precision*iw], where iw is the size of the input table divided by the number of requested discretization bin limits.
keep_org_values: bool, optional: whether to keep the original columns. If False (the default), the discretized columns replace the original columns; if True, they are added as new columns whose names are prefixed with ‘disc_’.
out_table: str, optional: the output table or view in which to store the discretized data

Returns:

IdaDataFrame: the data frame with the discretized input data
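The tolerance described for bin_precision can be worked through with plain arithmetic. The sketch below is not part of nzpyida: the helper names allowed_bin_count_range and within_tolerance are hypothetical, and it assumes that "the number of requested discretization bin limits" equals the bins argument.

```python
def allowed_bin_count_range(n_rows, bins=10, bin_precision=0.1):
    """Return the (low, high) record counts allowed per bin.

    Assumption: iw is the number of input rows divided by `bins`.
    """
    iw = n_rows / bins  # ideal number of records per bin
    return iw - bin_precision * iw, iw + bin_precision * iw


def within_tolerance(bin_counts, n_rows, bins=10, bin_precision=0.1):
    """Check whether every observed bin count lies inside the allowed range."""
    low, high = allowed_bin_count_range(n_rows, bins, bin_precision)
    return all(low <= count <= high for count in bin_counts)


# 1000 rows in 10 bins: the ideal count is 100 records per bin,
# so with a precision of 0.1 each bin may hold roughly 90 to 110 records.
low, high = allowed_bin_count_range(1000, bins=10, bin_precision=0.1)
balanced = within_tolerance([95, 105] * 5, n_rows=1000)
skewed = within_tolerance([50, 150] * 5, n_rows=1000)
```

A smaller bin_precision forces the bins closer to perfectly equal frequency, while a larger value gives the algorithm more freedom in placing the limits.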

nzpyida.analytics.transform.discretization.em_disc(in_df: IdaDataFrame, target: str, keep_org_values: bool = False, out_table: str = None)

Discretizes the given data frame based on minimizing entropy of the data in the target column. This is a helper function that creates an EMDisc instance and then calls its fit() and apply() methods, returning the output of the latter.

Parameters:

in_df: IdaDataFrame: the input data frame
target: str: the input table column containing a class label
keep_org_values: bool, optional: whether to keep the original columns. If False (the default), the discretized columns replace the original columns; if True, they are added as new columns whose names are prefixed with ‘disc_’.
out_table: str, optional: the output table or view in which to store the discretized data

Returns:

IdaDataFrame: the data frame with the discretized input data
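To illustrate what minimizing the entropy of the target column means, the following plain-Python sketch searches for the single cut point that yields the lowest class-weighted entropy. It shows the general idea only and is not the algorithm used by EMDisc; the function names entropy and best_cut are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best single split.

    Candidate cuts are the midpoints between consecutive distinct values.
    Returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or weighted < best[1]:
            best = (cut, weighted)
    return best


# The two classes separate cleanly between 3 and 10,
# so the best cut falls at the midpoint between those values.
cut, weighted = best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

A cut with low weighted entropy produces intervals that are nearly pure with respect to the class label, which is the relationship a classification model relies on.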

nzpyida.analytics.transform.discretization.ew_disc(in_df: IdaDataFrame, bins: int = 10, keep_org_values: bool = False, out_table: str = None) → IdaDataFrame

Discretizes the given data frame with equal width and the given number of bins. This is a helper function that creates an EWDisc instance and then calls its fit() and apply() methods, returning the output of the latter.

Parameters:

in_df: IdaDataFrame: the input data frame
bins: int, optional: the default number of discretization bins to be calculated, by default 10
keep_org_values: bool, optional: whether to keep the original columns. If False (the default), the discretized columns replace the original columns; if True, they are added as new columns whose names are prefixed with ‘disc_’.
out_table: str, optional: the output table or view in which to store the discretized data

Returns:

IdaDataFrame: the data frame with the discretized input data
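The fit-then-apply pattern for equal-width discretization can be sketched in plain Python. This is an illustration of the concept rather than the nzpyida implementation: the names equal_width_limits and assign_bin are hypothetical, and the treatment of values that fall exactly on a limit is an assumption.

```python
from bisect import bisect_right


def equal_width_limits(values, bins=10):
    """'Fit' step: return the bins - 1 inner limits that split the
    range [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def assign_bin(value, limits):
    """'Apply' step: return the zero-based index of the bin holding `value`.

    Assumption: a value equal to a limit goes to the upper bin.
    """
    return bisect_right(limits, value)


values = [0, 10, 20, 40, 60, 80, 100]
limits = equal_width_limits(values, bins=4)       # [25.0, 50.0, 75.0]
discretized = [assign_bin(v, limits) for v in values]
```

Equal-width bins are simple to compute but can end up with very uneven record counts when the data is skewed, which is the case equal-frequency discretization addresses.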

Data Preparation

This module contains functions that can be used to prepare an input data frame for machine learning.

nzpyida.analytics.transform.preparation.impute_data(in_df: IdaDataFrame, in_column: str = None, method: str = None, numeric_value: float = -1, nominal_value: str = 'missing', out_table: str = None) → IdaDataFrame

Many analytic algorithms require that the data set has no missing attribute values. However, real-world data sets frequently suffer from missing attribute values. Missing value imputation provides usable attribute values in place of the missing values, allowing the algorithms to run.

This function replaces missing values in the input data frame and returns the result in a new data frame.

Parameters:

in_df: IdaDataFrame: the input data frame
in_column: str, optional: the input table column in which missing values are to be replaced. If not specified, all input data columns are considered.
method: str, optional: the data imputation method. Allowed values are mean, median, freq (most frequent value), and replace. If not specified, the method is median for numeric columns and freq for nominal columns. The methods mean and median cannot be used with nominal columns.
numeric_value: float, optional: the numeric replacement value when method=replace, by default -1
nominal_value: str, optional: the nominal replacement value when method=replace, by default 'missing'
out_table: str, optional: the output table with the modified data

Returns:

IdaDataFrame: the data frame with requested transformations
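The method rules described above can be illustrated on a single column held as a Python list, with None standing in for a missing value. This is a conceptual sketch only, not the in-database implementation; the function name impute is hypothetical and the column is assumed to contain at least one non-missing value.

```python
from collections import Counter
from statistics import mean, median


def impute(values, method=None, numeric_value=-1, nominal_value="missing"):
    """Return a copy of `values` with None entries replaced."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)

    # Default method: median for numeric columns, freq for nominal ones.
    if method is None:
        method = "median" if numeric else "freq"
    if method in ("mean", "median") and not numeric:
        raise ValueError(f"method {method!r} cannot be used with nominal columns")

    if method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    elif method == "freq":
        fill = Counter(present).most_common(1)[0][0]  # most frequent value
    elif method == "replace":
        fill = numeric_value if numeric else nominal_value
    else:
        raise ValueError(f"unknown method {method!r}")

    return [fill if v is None else v for v in values]


ages = impute([1, None, 3, 10])                 # median of 1, 3, 10 fills the gap
colors = impute(["red", None, "blue", "red"])   # most frequent value fills the gap
flagged = impute([1, None], method="replace")   # numeric_value fills the gap
```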

nzpyida.analytics.transform.preparation.random_sample(in_df: IdaDataFrame, size: int = None, fraction: float = None, by_column: str = None, out_signature: str = None, rand_seed: int = None, out_table: str = None) → IdaDataFrame

Random sampling procedures are a vital component of many analytical systems. They can be used to select a test sample and a training sample for a model building process (machine learning). They can also be used to draw a smaller sample of the training set, for example when the complexity of the learning algorithm makes the full set too costly to process. In both cases, you would sample without replacement.

Another application of sampling is learning methods based on bootstrapping. These methods require many independent samples from the same data and are typically preferred when the available data sets are small or when sample independence is vital for other reasons. Samples with replacement are usually drawn in this case.

In practice, sampling is also used for promotion campaigns, for example when you want only a representative set of customers to be subjects of an action. In all cases, whether for use with scientific methods or business practices, uniform sampling is important.

This function creates a random sample of a data frame with a fixed size or a fixed probability and returns the result in a new data frame.

Parameters:

in_df: IdaDataFrame: the input data frame
size: int, optional: the number of rows in the sample. If specified, fraction must not be specified.
fraction: float, optional: the probability of each row being included in the sample. If specified, size must not be specified. Exactly one of size and fraction must be specified.
by_column: str, optional: the column used to stratify the input table. If specified, stratified sampling is performed, which ensures that each value of the column is represented in the sample in about the same percentage as in the original input table.
out_signature: str, optional: the input table columns to keep in the sample, separated by semicolons (;). If not specified, all columns are kept in the output table.
rand_seed: int, optional: the seed of the random function
out_table: str, optional: the output table with the modified data

Returns:

IdaDataFrame: the data frame with requested transformations
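Stratified sampling without replacement, as described for by_column, can be sketched with the standard random module. This is an illustration under stated assumptions, not the nzpyida implementation: the function name stratified_sample is hypothetical, rows are plain dictionaries, and each stratum contributes a fixed share of its rows rather than including each row with an independent probability.

```python
import random
from collections import defaultdict


def stratified_sample(rows, fraction, by_column, rand_seed=None):
    """Draw about `fraction` of the rows from every group of `by_column`."""
    rng = random.Random(rand_seed)  # seeded generator for reproducibility

    # Group the rows by the value of the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[by_column]].append(row)

    sample = []
    for key in sorted(groups):
        members = groups[key]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # without replacement
    return sample


customers = (
    [{"id": i, "segment": "A"} for i in range(80)]
    + [{"id": 80 + i, "segment": "B"} for i in range(20)]
)
sample = stratified_sample(customers, fraction=0.1, by_column="segment", rand_seed=42)
```

Because each segment is sampled separately, the 80/20 ratio of the input is preserved in the sample, and reusing the same seed yields the same rows.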

nzpyida.analytics.transform.preparation.std_norm(in_df: IdaDataFrame, in_column: List[str], id_column: str = None, by_column: str = None, out_table: str = None) → IdaDataFrame

Standardization and normalization transformations use the original continuous attribute a to generate a new continuous attribute a’ that has a different range or distribution than the original attribute. Common transformations modify the range to fit the [-1, 1] interval (normalization) or modify the distribution to have a mean of 0 and a standard deviation of 1 (standardization).

This function normalizes and standardizes columns of the input data frame and returns the result in a new data frame.

Parameters:

in_df: IdaDataFrame: the input data frame
in_column: List[str]: the list of input table columns to consider. Each column name may be followed by :L to leave it unchanged, by :S to standardize its values, by :N to normalize its values, or by :U to make it of unit length. Additionally, two columns may be specified, separated by a slash (/), followed by :C to make the columns a row unit vector or by :V to divide the column values by the length of the longest row vector.
id_column: str, optional: the input table column that identifies a unique instance ID. If omitted, the indexer of the input data frame must be set and is used as the instance ID.
by_column: str, optional: the input table column that splits the data into groups for which the operation is to be performed
out_table: str, optional: the output table with the modified data

Returns:

IdaDataFrame: the data frame with requested transformations
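The two transformations named above can be written out for a single column in plain Python. This sketch is not the nzpyida implementation: the function names are hypothetical, standardization here uses the population standard deviation, and normalization uses min-max scaling onto [-1, 1]. The library's :S and :N options may use different formulas.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift and scale so the result has mean 0 and standard deviation 1.

    Assumption: population standard deviation; values are not all equal.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Rescale linearly so the minimum maps to -1 and the maximum to 1.

    Assumption: min-max scaling; values are not all equal.
    """
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
scaled = normalize([0, 5, 10])  # [-1.0, 0.0, 1.0]
```

Standardization is usually the safer choice when a column contains outliers, since min-max scaling lets a single extreme value compress all other values into a narrow band.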

nzpyida.analytics.transform.preparation.train_test_split(in_df: IdaDataFrame, out_table_train: str = None, out_table_test: str = None, id_column: str = None, fraction: float = 0.5, rand_seed: float = None) → Tuple[IdaDataFrame, IdaDataFrame]

Parameters:

in_df: IdaDataFrame: the input data frame
out_table_train: str, optional: the name of the output data frame that will contain the given fraction of the input records
out_table_test: str, optional: the name of the output data frame that will contain the rest (1-<fraction>) of the input records
id_column: str, optional: the input data frame column identifying a unique instance ID
fraction: float, optional: the fraction of the data that goes to the training data frame, by default 0.5
rand_seed: int, optional: the seed of the random function

Returns:

IdaDataFrame: the data frame with train data
IdaDataFrame: the data frame with test data
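The relationship between fraction, the training output, and the test output can be shown on an in-memory list. This is a minimal sketch of a random split, not the in-database implementation; the function name split_rows is hypothetical.

```python
import random


def split_rows(rows, fraction=0.5, rand_seed=None):
    """Return (train, test), where train holds about `fraction` of the rows
    and test holds the remaining 1 - fraction."""
    rng = random.Random(rand_seed)  # seeded generator for reproducibility
    shuffled = list(rows)           # copy so the input is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_rows(range(100), fraction=0.7, rand_seed=1)
```

Every input row lands in exactly one of the two outputs, and passing the same seed reproduces the same split.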