Data Transformation
Discretization
The discretization process assigns a discrete value to each interval of a continuous attribute a to create a new discrete attribute a’. A discretization algorithm determines interval boundaries that preserve as much of the useful information in the original attribute as possible. If the data set is to be used to build a classification model, discretization should preserve the relationship between the class and the discretized attributes.
- class nzpyida.analytics.transform.discretization.Discretization(idadb: IdaDataBase)
Bases: object
Generic class for handling data discretization.
Methods
apply(in_df, in_bin_df[, keep_org_values, ...]): Apply discretization limits to the given data frame.
fit(in_df[, out_table]): Create bin limits based on the given data frame.
- apply(in_df: IdaDataFrame, in_bin_df: IdaDataFrame, keep_org_values: bool = False, out_table: str = None) -> IdaDataFrame
Apply discretization limits to the given data frame.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- in_bin_df: IdaDataFrame
the data frame with discretization bins
- keep_org_values: bool, optional
a flag indicating whether the discretized columns replace the original columns (False) or are added as new columns (True). In the latter case, the new column names are prefixed with ‘disc_’
- out_table: str, optional
the output table or view to store the discretized data into
- Returns:
- IdaDataFrame
the data frame with the discretized input data
- fit(in_df: IdaDataFrame, out_table: str = None) -> IdaDataFrame
Create bin limits based on the given data frame.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- out_table: str, optional
the output table with discretization bins
- Returns:
- IdaDataFrame
the data frame with discretization bins
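As a rough sketch of how the two methods fit together, the helper below computes bin limits on one data frame with fit() and reuses them on another with apply(). It relies only on the signatures shown on this page and uses EWDisc, one of the concrete subclasses documented below. The function name fit_then_apply and its arguments are hypothetical, and the connection object idadb is assumed to have been created elsewhere.

```python
def fit_then_apply(idadb, train_df, new_df, bins=10):
    """Learn equal-width bin limits on train_df and apply them to new_df.

    Hypothetical helper: idadb is an existing IdaDataBase connection and
    both data frames are IdaDataFrame objects with the same columns.
    """
    # Imported lazily so this sketch can be loaded without nzpyida installed.
    from nzpyida.analytics.transform.discretization import EWDisc

    disc = EWDisc(idadb, bins=bins)
    # Step 1: compute the bin limits once.
    bin_df = disc.fit(train_df)
    # Step 2: reuse the same limits; keep_org_values=True keeps the original
    # columns and adds the discretized ones with the 'disc_' prefix.
    return disc.apply(new_df, bin_df, keep_org_values=True)
```

Separating the two steps lets limits learned on training data be applied unchanged to later data, which the one-call helper functions further down this page do not expose.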
- class nzpyida.analytics.transform.discretization.EFDisc(idadb: IdaDataBase, bins: int = 10, bin_precision: float = 0.1)
Bases: Discretization
Discretization with equal frequency of data.
Methods
apply(in_df, in_bin_df[, keep_org_values, ...]): Apply discretization limits to the given data frame.
fit(in_df[, out_table]): Create bin limits based on the given data frame.
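To illustrate what equal-frequency discretization means, here is a minimal in-memory sketch in plain Python. It is not the in-database implementation behind EFDisc; the function names and the convention that a value equal to a limit falls into the upper bin are assumptions made for the example.

```python
from bisect import bisect_right


def equal_frequency_limits(values, bins=10):
    """Return bins - 1 cut points that split the sorted values into groups of similar size."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th limit is the value found i/bins of the way through the sorted data.
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def bin_index(value, limits):
    """Map a value to the zero-based index of its bin (a limit belongs to the upper bin)."""
    return bisect_right(limits, value)


if __name__ == "__main__":
    data = list(range(1, 13))
    limits = equal_frequency_limits(data, bins=4)
    print(limits, [bin_index(v, limits) for v in data])
```

With distinct values, every bin receives the same number of rows; heavy ties in the data can make exact balance impossible, which is the situation a tolerance such as bin_precision addresses.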
- class nzpyida.analytics.transform.discretization.EMDisc(idadb: IdaDataBase, target: str)
Bases: Discretization
Discretization based on minimizing entropy of the data in the target column.
Methods
apply(in_df, in_bin_df[, keep_org_values, ...]): Apply discretization limits to the given data frame.
fit(in_df[, out_table]): Create bin limits based on the given data frame.
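The idea of entropy-based discretization can be sketched with a single split: among all candidate cut points, choose the one that minimizes the size-weighted class entropy of the two resulting intervals. The code below is only an illustration in plain Python; how EMDisc chooses multiple cut points and when it stops splitting is not described on this page, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best single split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        # A cut is only possible between two different attribute values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (cut, score)
    return best
```

Because the class labels drive the choice of boundaries, this kind of discretization needs the target column, which is why EMDisc takes a target argument while the other two classes do not.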
- class nzpyida.analytics.transform.discretization.EWDisc(idadb: IdaDataBase, bins: int = 10)
Bases: Discretization
Discretization with equal width and the given number of bins.
Methods
apply(in_df, in_bin_df[, keep_org_values, ...]): Apply discretization limits to the given data frame.
fit(in_df[, out_table]): Create bin limits based on the given data frame.
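For comparison, equal-width discretization divides the observed range of an attribute into intervals of identical length, regardless of how many rows land in each. The following plain-Python sketch shows one way to compute such limits; it is an illustration under that assumption, not the code EWDisc runs in the database, and the function name is hypothetical.

```python
def equal_width_limits(values, bins=10):
    """Return bins - 1 interior cut points that split [min, max] into equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    # Interior limits only: the outer edges are the minimum and maximum themselves.
    return [low + i * width for i in range(1, bins)]


if __name__ == "__main__":
    print(equal_width_limits([0, 3, 4, 10], bins=5))
```

Only the minimum and maximum influence the limits, so a single outlier can leave most bins nearly empty; equal-frequency binning avoids that at the cost of uneven interval widths.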
- nzpyida.analytics.transform.discretization.ef_disc(in_df: IdaDataFrame, bins: int = 10, bin_precision: float = 0.1, keep_org_values: bool = False, out_table: str = None)
Discretizes the given data frame with equal frequency of data. This is a helper function that creates an EFDisc instance, calls its fit() and apply() methods, and returns the output of the latter.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- bins: int, optional
the default number of discretization bins to be calculated, by default 10
- bin_precision: float, optional
the precision allowed when judging whether data records are evenly distributed across the calculated discretization bins, by default 0.1. The number of data records in each bin must be within [iw - bin_precision*iw, iw + bin_precision*iw], where iw is the size of the input table divided by the number of requested discretization bin limits.
- keep_org_values: bool, optional
a flag indicating whether the discretized columns replace the original columns (False) or are added as new columns (True). In the latter case, the new column names are prefixed with ‘disc_’
- out_table: str, optional
the output table or view to store the discretized data into
- Returns:
- IdaDataFrame
the data frame with the discretized input data
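The tolerance expressed by bin_precision is simple arithmetic and can be checked by hand. The sketch below computes the allowed range of rows per bin from the formula given for bin_precision; it assumes that the divisor in the definition of iw equals the number of bins, and the function names are hypothetical.

```python
def allowed_bin_size_range(n_rows, bins=10, bin_precision=0.1):
    """Return the (low, high) number of rows a bin may hold to count as evenly filled."""
    iw = n_rows / bins  # ideal number of rows per bin
    return iw - bin_precision * iw, iw + bin_precision * iw


def is_even_enough(bin_counts, bin_precision=0.1):
    """Check whether every bin count lies inside the allowed range."""
    low, high = allowed_bin_size_range(sum(bin_counts), len(bin_counts), bin_precision)
    return all(low <= count <= high for count in bin_counts)
```

For a table of 1000 rows and 10 bins, the ideal bin holds 100 rows, so the default precision of 0.1 accepts bins holding roughly 90 to 110 rows.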
- nzpyida.analytics.transform.discretization.em_disc(in_df: IdaDataFrame, target: str, keep_org_values: bool = False, out_table: str = None)
Discretizes the given data frame based on minimizing entropy of the data in the target column. This is a helper function that creates an EMDisc instance, calls its fit() and apply() methods, and returns the output of the latter.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- target: str
the input table column containing a class label
- keep_org_values: bool, optional
a flag indicating whether the discretized columns replace the original columns (False) or are added as new columns (True). In the latter case, the new column names are prefixed with ‘disc_’
- out_table: str, optional
the output table or view to store the discretized data into
- Returns:
- IdaDataFrame
the data frame with the discretized input data
- nzpyida.analytics.transform.discretization.ew_disc(in_df: IdaDataFrame, bins: int = 10, keep_org_values: bool = False, out_table: str = None) -> IdaDataFrame
Discretizes the given data frame with equal width and the given number of bins. This is a helper function that creates an EWDisc instance, calls its fit() and apply() methods, and returns the output of the latter.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- bins: int, optional
the default number of discretization bins to be calculated, by default 10
- keep_org_values: bool, optional
a flag indicating whether the discretized columns replace the original columns (False) or are added as new columns (True). In the latter case, the new column names are prefixed with ‘disc_’
- out_table: str, optional
the output table or view to store the discretized data into
- Returns:
- IdaDataFrame
the data frame with the discretized input data
Data Preparation
This module contains functions that can be used to prepare an input data frame for machine learning.
- nzpyida.analytics.transform.preparation.impute_data(in_df: IdaDataFrame, in_column: str = None, method: str = None, numeric_value: float = -1, nominal_value: str = 'missing', out_table: str = None) -> IdaDataFrame
Many analytic algorithms require that the data set has no missing attribute values. However, real-world data sets frequently suffer from missing attribute values. Missing value imputation provides usable attribute values in place of the missing values, allowing the algorithms to run.
This function replaces missing values in the input data frame and returns the result in a new data frame.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- in_column: str, optional
the input table column where missing values have to be replaced. If not specified, all input data columns are considered.
- method: str, optional
the data imputation method. Allowed values are: mean, median, freq (most frequent value), replace. If not specified, the method is median for the numeric columns and freq for the nominal columns. The methods mean and median cannot be used with nominal columns.
- numeric_value: float, optional
the numeric replacement value when method=replace, by default -1
- nominal_value: str, optional
the nominal replacement value when method=replace, by default 'missing'
- out_table: str, optional
the output table with the modified data
- Returns:
- IdaDataFrame
the data frame with requested transformations
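The rules for choosing an imputation method can be mirrored in a few lines of plain Python. This sketch works on a single in-memory column with None marking missing values; it only illustrates the documented defaults (median for numeric data, most frequent value for nominal data) and is not how impute_data is implemented. The function name impute is hypothetical.

```python
from collections import Counter
from statistics import mean, median


def impute(column, method=None, numeric_value=-1, nominal_value="missing"):
    """Return a copy of column with None replaced according to the chosen method."""
    present = [v for v in column if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    if method is None:
        # Documented defaults: median for numeric columns, freq for nominal ones.
        method = "median" if numeric else "freq"
    if method in ("mean", "median") and not numeric:
        raise ValueError("mean and median cannot be used with nominal columns")
    if method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    elif method == "freq":
        fill = Counter(present).most_common(1)[0][0]
    elif method == "replace":
        fill = numeric_value if numeric else nominal_value
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in column]
```

The median default is less sensitive to outliers than the mean, which is a common reason for preferring it when nothing is known about the distribution.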
- nzpyida.analytics.transform.preparation.random_sample(in_df: IdaDataFrame, size: int = None, fraction: float = None, by_column: str = None, out_signature: str = None, rand_seed: int = None, out_table: str = None) -> IdaDataFrame
Random sampling procedures are a vital component of many analytical systems. They can be used to select a test sample and a training sample for a model building process (machine learning). They can also be used to get a smaller sample of the training set, which you may do because of learning algorithm complexity considerations. In both cases, you would sample without replacement.
Another application of sampling is the learning methods based on bootstrapping. This requires many independent samples from the same data, which are preferentially applied if the available data sets are small or for other reasons where the sample independence is vital. Samples with replacement are usually drawn in this case.
In application, sampling is used for promotion campaigns, for example when you want only a representative set of customers to be subjects of an action. In all cases, whether for use with scientific methods or business practices, uniform sampling is important.
This function creates a random sample of a data frame, using either a fixed size or a fixed probability, and returns the result in a new data frame.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- size: int, optional
the number of rows in the sample. If specified, the parameter fraction must not be specified. Exactly one of size and fraction must be specified.
- fraction: float, optional
the probability of each row being included in the sample. If specified, the parameter size must not be specified.
- by_column: str, optional
the column used to stratify the input table. If specified, stratified sampling is done: it ensures that each value of the column is represented in the sample in about the same percentage as in the original input table.
- out_signature: str, optional
the input table columns to keep in the sample, separated by a semicolon (;). If not specified, all columns are kept in the output table.
- rand_seed: int, optional
the seed of the random function
- out_table: str, optional
the output table with the modified data
- Returns:
- IdaDataFrame
the data frame with requested transformations
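The difference between the sampling modes can be shown with Python's standard random module. This is an in-memory sketch of sampling without replacement, not the database procedure that random_sample calls; the function names are hypothetical, and the stratified variant assumes a per-group fraction rounded to the nearest whole row.

```python
import random
from collections import defaultdict


def sample_rows(rows, size=None, fraction=None, rand_seed=None):
    """Sample without replacement, either by fixed size or by per-row probability."""
    if (size is None) == (fraction is None):
        raise ValueError("specify exactly one of size or fraction")
    rng = random.Random(rand_seed)
    if size is not None:
        return rng.sample(list(rows), size)
    # Each row is kept independently, so the sample size varies around fraction * len(rows).
    return [row for row in rows if rng.random() < fraction]


def stratified_sample(rows, key, fraction, rand_seed=None):
    """Sample the same fraction from every group so group proportions are preserved."""
    rng = random.Random(rand_seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(fraction * len(members))))
    return sample
```

Passing the same rand_seed reproduces the same sample, which is useful when a train or test selection has to be repeatable.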
- nzpyida.analytics.transform.preparation.std_norm(in_df: IdaDataFrame, in_column: List[str], id_column: str = None, by_column: str = None, out_table: str = None) -> IdaDataFrame
Standardization and normalization transformations use the original continuous attribute a to generate a new continuous attribute a’ that has a different range or distribution than the original attribute. Common transformations modify the range to fit the [-1, 1] interval (normalization) or modify the distribution to have a mean of 0 and a standard deviation of 1 (standardization).
This function normalizes and standardizes columns of the input data frame and returns the result in a new data frame.
- Parameters:
- in_df: IdaDataFrame
the input data frame
- in_column: List[str]
the list of input table columns to consider. Each column name may be followed by :L to leave it unchanged, by :S to standardize its values, by :N to normalize its values or by :U to make it of unit length. Additionally, two columns may be indicated, separated by a slash (/), followed by :C to make the columns be a row unit vector or by :V to divide the column values by the length of the longest row vector.
- id_column: str, optional
the input table column identifying a unique instance id. If omitted, the indexer of the input data frame must be set and is used as the instance id
- by_column: str, optional
the input table column which splits the data into groups for which the operation is to be performed
- out_table: str, optional
the output table with the modified data
- Returns:
- IdaDataFrame
the data frame with requested transformations
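The two most common transformations, selected with the :S and :N suffixes, correspond to short formulas. The sketch below applies them to an in-memory list; it assumes the population standard deviation and a min-max mapping onto [-1, 1], since this page does not say which variants the database procedure uses. The function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1 (values must not all be equal)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Map values linearly onto the interval [-1, 1] (values must not all be equal)."""
    low, high = min(values), max(values)
    return [2 * (v - low) / (high - low) - 1 for v in values]
```

With the function itself, the choice is made per column in the in_column list, for example ["AGE:S", "INCOME:N"] for two hypothetical columns AGE and INCOME.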
- nzpyida.analytics.transform.preparation.train_test_split(in_df: IdaDataFrame, out_table_train: str = None, out_table_test: str = None, id_column: str = None, fraction: float = 0.5, rand_seed: float = None) -> Tuple[IdaDataFrame, IdaDataFrame]
- Parameters:
- in_df: IdaDataFrame
the input data frame
- out_table_train: str, optional
the name of the output data frame that will contain the given fraction of the input records
- out_table_test: str, optional
the name of the output data frame that will contain the rest (1-<fraction>) of the input records
- id_column: str, optional
the input data frame column identifying a unique instance id
- fraction: float, optional
the fraction of the data that goes to the training data frame, by default 0.5
- rand_seed: int, optional
the seed of the random function
- Returns:
- IdaDataFrame
the data frame with train data
- IdaDataFrame
the data frame with test data
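The split described by these parameters can be pictured as a seeded shuffle followed by a cut at the requested fraction. The following plain-Python sketch shows that idea on an in-memory list; it is an assumption about the general technique, not a description of how train_test_split selects rows in the database, and the function name split_rows is hypothetical.

```python
import random


def split_rows(rows, fraction=0.5, rand_seed=None):
    """Return (train, test) where train holds about fraction of the rows."""
    rng = random.Random(rand_seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(fraction * len(shuffled))
    # Every row lands in exactly one of the two parts.
    return shuffled[:cut], shuffled[cut:]
```

The two parts are disjoint and together cover the input, so the test part always holds the remaining 1 - fraction of the rows.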