Predictive Modeling

Decision Trees

In many classification applications it may be required or desirable not only to accurately classify instances, but also to inspect the model. Inspection makes it possible to explain the model's decisions, modify it, or combine it with existing background knowledge. In such applications, where both high classification accuracy and human-readability of the model are required, the method of choice is typically decision trees.

A decision tree is a hierarchical structure that represents a classification model using a “divide and conquer” approach. Internal tree nodes represent splits applied to decompose the data set into subsets, and terminal nodes, also referred to as leaves, assign class labels to sufficiently small or uniform subsets. Splits are specified by logical conditions based on selected single attributes, with a separate outgoing branch corresponding to each possible outcome.

The concept of decision tree construction is to select splits that decrease the impurity of class distribution in the resulting subsets of instances, and increase the domination of one or more classes over the others. The goal is to find a subset containing only or mostly instances of one class after a small number of splits, so that a leaf with that class label is created. This approach promotes simple trees, which typically generalize better.
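The idea of choosing splits that decrease class impurity can be sketched in plain Python. This is a minimal illustration using the textbook definitions of entropy and Gini impurity; the helper names are hypothetical and the in-database implementation may compute its split evaluation differently.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of the class distribution in `labels`."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_improvement(parent, subsets, measure=gini):
    """Impurity of the parent minus the size-weighted impurity of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * measure(s) for s in subsets)
    return measure(parent) - weighted


labels = ["yes", "yes", "yes", "no", "no", "no"]
pure_split = [["yes", "yes", "yes"], ["no", "no", "no"]]
mixed_split = [["yes", "no", "yes"], ["no", "yes", "no"]]

print(split_improvement(labels, pure_split))   # both subsets are pure
print(split_improvement(labels, mixed_split))  # subsets stay mixed, smaller gain
```

A split that separates the classes completely removes all impurity, while a split that leaves both subsets mixed yields only a small improvement, so the first split would be preferred.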

class nzpyida.analytics.predictive.decision_trees.DecisionTreeClassifier(idadb: IdaDataBase, model_name: str)[source]

Bases: Classification

Decision tree based classifier.

Methods

`conf_matrix`(in_df, target_column[, ...])	Makes a prediction for a test data set given by the user and returns a confusion matrix, together with other stats (ACC and WACC).
`cross_validation`(in_df, target_column[, ...])	Performs a cross validation on <in_df> data for given model.
`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	Grows the decision tree and stores its model in the database.
`predict`(in_df[, out_table, id_column, prob, ...])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.

fit(in_df: IdaDataFrame, target_column: str, id_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, weights: str = None, eval_measure: str = None, min_improve: float = 0.02, min_split: int = 50, max_depth: int = 10, val_table: str = None, val_weights: str = None, qmeasure: str = None, statistics: str = None)[source]

Grows the decision tree and stores its model in the database.

Parameters:

in_dfIdaDataFrame

the input data frame

target_columnstr

the input table column representing the class

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous).
Per default, all numerical types are continuous, other types are nominal.

its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.
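As an illustration of the column property syntax described above, the following sketch builds an in_columns list. The helper column_spec and the column names are hypothetical, and the sketch assumes that properties are appended directly to the column name, type first and role second.

```python
def column_spec(name, col_type=None, role=None):
    """Return 'NAME[:type][:role]' for one in_columns entry (hypothetical helper)."""
    allowed_types = {None, "nom", "cont"}
    allowed_roles = {None, "id", "target", "input", "ignore"}
    if col_type not in allowed_types:
        raise ValueError(f"unknown column type: {col_type!r}")
    if role not in allowed_roles:
        raise ValueError(f"unknown column role: {role!r}")
    parts = [name]
    if col_type is not None:
        parts.append(col_type)
    if role is not None:
        parts.append(role)
    return ":".join(parts)


# Hypothetical column names, used only for illustration.
in_columns = [
    column_spec("ID", role="id"),
    column_spec("AGE", col_type="cont"),
    column_spec("ZIP_CODE", col_type="nom", role="input"),
    column_spec("COMMENTS", role="ignore"),
]
print(in_columns)
```

Columns that are not listed keep their default properties, so only the exceptions need to be spelled out.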

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

weightsstr, optional

the input table containing optional instance or class weights for the input table columns. If the parameter is undefined, we assume that the weights are uniformly equal to 1. The <weights> table contains following columns:

weight: a numeric column containing the instance or class weight
id: a column to be joined with the <id> column of <intable>, defining instance weights
class: a column to be joined with the <target> column of <intable>, defining class weights

Either the id column or the class column can be omitted, but at least one of them must be present. For instances or classes not occurring in this table, weights of 1 are assumed.

eval_measurestr, optional

the class impurity measure used for split evaluation. Allowed values are ‘entropy’ and ‘gini’

min_improvefloat, optional

the minimum improvement of the split evaluation measure required

min_splitint, optional

the minimum number of instances per tree node that can be split

max_depthint, optional

the maximum number of tree levels (including leaves)

val_tablestr, optional

the input table containing the validation dataset. If this parameter is undefined, no pruning will be performed.

val_weightsstr, optional

the input table containing optional instance or class weights for the validation dataset. It is similar to the <weights> table.

qmeasurestr, optional

the quality measure for pruning. Allowed values are Acc or wAcc.

statisticsstr, optional

flags indicating which statistics to collect. Allowed values are: none, columns, values:n, all. If statistics=none, no statistics are collected. If statistics=columns, statistics on the input table columns like mean value are collected. If statistics=values:n with n a positive number, statistics about the columns and the column values are collected. Up to <n> column value statistics are collected:

If a nominal column contains more than <n> values, only the <n> most frequent column statistics are kept. If a numeric column contains more than <n> values, the values will be discretized and the statistics will be collected on the discretized values.

Indicating statistics=all is equal to statistics=values:100.

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None, prob: bool = False, out_table_prob: str = None) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the predictions will be stored
id_columnstr, optional: the input table column identifying a unique instance id. Default: the id column used to build the model
probbool, optional: the flag indicating whether the probability of the predicted class should be included into the output table or not
out_table_probstr, optional: if specified, the probability output table where class probability predictions will be stored

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

KMeans

The k-means algorithm is the most widely used clustering algorithm that relies on an explicit distance measure to partition the data set into clusters. The main concept behind the k-means algorithm is to represent each cluster by the vector of mean attribute values of all training instances assigned to that cluster, called the cluster’s center. Such a cluster representation has two direct consequences:

the algorithm handles continuous attributes only, although workarounds for discrete attributes are possible

both the cluster formation and cluster modeling processes can be performed in a computationally efficient way by applying the specified distance function to match instances against cluster centers

class nzpyida.analytics.predictive.kmeans.KMeans(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

KMeans clustering.

Methods

`describe`()	Returns model description.
`fit`(in_df[, id_column, in_columns, ...])	Creates and trains a model for clustering based on provided data and stores it in a database.
`predict`(in_df[, out_table, id_column])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.

fit(in_df: IdaDataFrame, id_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, out_table: str = None, distance: str = 'norm_euclidean', k: int = 3, max_iter: int = 5, rand_seed: int = 12345, id_based: bool = False, statistics: str = None, transform: str = 'L') → IdaDataFrame[source]

Creates and trains a model for clustering based on provided data and stores it in a database.

The training algorithm operates by performing several iterations of the same basic process. Each training instance is assigned to the closest cluster with respect to the specified distance function, applied to the instance and cluster center. All cluster centers are then re-calculated as the mean attribute value vectors of the instances assigned to particular clusters. The cluster centers are initialized by randomly picking k training instances, where k is the desired number of clusters. The iterative process should terminate when there are either no or sufficiently few changes in cluster assignments. In practice, however, it is sufficient to specify the number of iterations, typically a number between 3 and 36.
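The iterative process described above can be sketched in plain Python. This is a minimal illustration, not the in-database implementation: it assumes squared Euclidean distance, keeps an old center when a cluster ends up empty, and borrows the parameter names k, max_iter and rand_seed from the signature only for readability.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=3, max_iter=5, rand_seed=12345):
    """Plain k-means sketch: returns (centers, assignments)."""
    rng = random.Random(rand_seed)
    # Initialize the centers by randomly picking k training instances.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Assign each instance to the closest cluster center.
        new_assignments = [
            min(range(k), key=lambda c: squared_euclidean(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no changes in cluster assignments
        assignments = new_assignments
        # Recompute each center as the mean vector of its assigned instances.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments = kmeans(points, k=2)
print(centers, assignments)
```

On this toy data the two well-separated groups are recovered within a few iterations, which is why a small fixed iteration count is often sufficient in practice.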

Parameters:

in_dfIdaDataFrame

the input data frame

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). Per default, all numerical types are continuous, other types are nominal.
its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

out_tablestr, optional

the output table where clusters are assigned to each input table record

distancestr, optional

the distance function. Allowed values are: euclidean, norm_euclidean, manhattan, canberra, maximum, mahalanobis.
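For orientation, four of the listed distance functions can be written out with their textbook definitions. This is a sketch only: the library's own computation may differ in details, and norm_euclidean and mahalanobis are left out because they depend on column statistics of the whole data set.

```python
from math import sqrt


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|); terms with x = y = 0 are skipped."""
    return sum(
        abs(x - y) / (abs(x) + abs(y))
        for x, y in zip(a, b)
        if x != 0 or y != 0
    )


def maximum(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


a, b = (1, 2), (4, 6)
print(euclidean(a, b), manhattan(a, b), canberra(a, b), maximum(a, b))
```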

kint, optional

number of centers

max_iterint, optional

the maximum number of iterations to perform

rand_seedint, optional

the random generator seed

id_basedbool, optional

flag indicating whether the random generator seed is based on the id column value

statisticsstr, optional

flags indicating which statistics to collect. Allowed values are: none, columns, values:n, all. If statistics=none, no statistics are collected. If statistics=columns, statistics on the input table columns like mean value are collected. If statistics=values:n with n a positive number, statistics about the columns and the column values are collected. Up to <n> column value statistics are collected: If a nominal column contains more than <n> values, only the <n> most frequent column statistics are kept. If a numeric column contains more than <n> values, the values will be discretized and the statistics will be collected on the discretized values. Indicating statistics=all is equal to statistics=values:100.

transformstr, optional

flag indicating if the input table columns have to be transformed. Allowed values are: L (for leave as is), N (for normalization) or S (for standardization). If it is not specified, no transformation will be performed.
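The two transformations can be illustrated on a single numeric column. The sketch below assumes that normalization means min-max scaling to the range [0, 1] and that standardization means subtracting the mean and dividing by the population standard deviation; the documentation above does not spell out these formulas, so treat both as assumptions.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling to [0, 1] (assumed meaning of transform='N')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling (assumed meaning of transform='S')."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


column = [10, 20, 30]
print(normalize(column))
print(standardize(column))
```

Either transformation keeps attributes with large numeric ranges from dominating the distance computation.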

Returns:

IdaDataFrame: output table with following columns: id, cluster_id, distance. The id column matches the <id_column> of the input table. Each input table record is associated with a cluster, where the distance from the record to the cluster center is the smallest. The cluster ID and the distance to the cluster center are given in the columns cluster_id and distance

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the assigned clusters will be stored
id_columnstr, optional: the input table column identifying a unique instance id. Default: the id column used to build the model

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

score(in_df: IdaDataFrame, target_column: str, id_column: str = None) → float[source]

Scores the model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame for scoring
target_columnstr: the input table column representing the class
id_columnstr, optional: the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

Returns:

float: the model score

KNN

The nearest neighbor family of classification and regression algorithms is frequently referred to as memory-based or instance-based learning, and sometimes also as lazy learning. These terms correspond to the main concept of this approach, which is to replace model creation by memorizing the training data set and using it appropriately to make predictions.
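A minimal sketch of this idea for classification: the "model" is just the stored training data, and a prediction is a majority vote among the k closest stored instances. The function name and the toy data are hypothetical, and the sketch uses unweighted votes and Euclidean distance.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training instances nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 9)))
```

All the work happens at prediction time, which is why this family is also called lazy learning.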

class nzpyida.analytics.predictive.knn.KNeighborsClassifier(idadb: IdaDataBase, model_name: str)[source]

Bases: Classification

K-neighbors based classifier.

Methods

`conf_matrix`(in_df, target_column[, ...])	Makes a prediction for a test data set given by the user and returns a confusion matrix, together with other stats (ACC and WACC).
`cross_validation`(in_df, target_column[, ...])	Performs a cross validation on <in_df> data for given model.
`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	Builds a K-Nearest Neighbors Classification or Regression model.
`predict`(in_df[, out_table, id_column, ...])	Applies a K-Nearest Neighbors model to generate classification or regression predictions for a data frame.
`score`(in_df, target_column[, id_column, ...])	Scores the model and returns classification error ratio.

conf_matrix(in_df: IdaDataFrame, target_column: str, id_column: str = None, out_matrix_table: str = None, distance: str = 'euclidean', k: int = 3, stand: bool = True, fast: bool = True, weights: str = None)[source]

Makes a prediction for a test data set given by the user and returns a confusion matrix, together with other stats (ACC and WACC).

Parameters:

in_dfIdaDataFrame: the input data frame for scoring
target_columnstr: the input table column representing the class
id_columnstr, optional: the input table column identifying a unique instance id
out_matrix_tablestr, optional: the output table where the confusion matrix will be stored

Returns:

IdaDataFrame: the confusion matrix data frame
float: classification accuracy (ACC)
float: weighted classification accuracy (WACC)
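To make the returned quantities concrete, here is a small sketch of how a confusion matrix and the plain accuracy (ACC) relate, using the usual definitions. The helper names are hypothetical. The weighted accuracy (WACC) is not reproduced because its exact formula is not given here.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    matrix = confusion_matrix(actual, predicted)
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / len(actual)


actual = ["a", "a", "b", "b", "b"]
predicted = ["a", "b", "b", "b", "a"]
print(confusion_matrix(actual, predicted))
print(accuracy(actual, predicted))
```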

fit(in_df: IdaDataFrame, target_column: str, id_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None)[source]

Builds a K-Nearest Neighbors Classification or Regression model.

Parameters:

in_dfIdaDataFrame

the input data frame

target_columnstr

the input table column representing the class

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties: its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous).

Per default, all numerical types are continuous, other types are nominal.

its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. If the parameter is undefined, the input table column properties will be detected automatically.

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None, distance: str = 'euclidean', k: int = 3, stand: bool = True, fast: bool = True, weights: str = None) → IdaDataFrame[source]

Applies a K-Nearest Neighbors model to generate classification or regression predictions for a data frame.

Parameters:

in_dfIdaDataFrame

the input data frame

out_tablestr, optional

the output table where the predictions will be stored

id_columnstr, optional

the input table column identifying a unique instance id. Default: the id column used to build the model

distancestr, optional

the distance function. Allowed values are: euclidean, manhattan, canberra, maximum

kint, optional

number of nearest neighbors to consider

standbool, optional

flag indicating whether the measurements in the input table are standardized before calculating the distance

fastbool, optional

flag indicating whether the algorithm uses a coreset-based method

weightsstr, optional

the input table containing optional class weights for the input table <target> column. The <weights> table is used only when the <target> column is not numeric. If the parameter is undefined, we assume that the weights are uniformly equal to 1. The <weights> table contains following columns:

weight: a numeric column containing the class weight, class: a column to be joined with the <target> column of <intable>, defining class weights.

For classes not occurring in this table, weights of 1 are assumed.

Returns:

IdaDataFrame: a data frame with id and predicted class

score(in_df: IdaDataFrame, target_column: str, id_column: str = None, distance: str = 'euclidean', k: int = 3, stand: bool = True, fast: bool = True, weights: str = None) → float[source]

Scores the model and returns classification error ratio.

Parameters:

in_dfIdaDataFrame

the input data frame used to test the model

target_columnstr

the input table column representing the class in the input data frame

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

distancestr, optional

the distance function. Allowed values are: euclidean, manhattan, canberra, maximum

kint, optional

number of nearest neighbors to consider

standbool, optional

flag indicating whether the measurements in the input table are standardized before calculating the distance

fastbool, optional

flag indicating whether the algorithm uses a coreset-based method

weightsstr, optional

the input table containing optional class weights for the input table <target> column. The <weights> table is used only when the <target> column is not numeric. If the parameter is undefined, we assume that the weights are uniformly equal to 1. The <weights> table contains following columns:

weight: a numeric column containing the class weight, class: a column to be joined with the <target> column of <intable>, defining class weights.

For classes not occurring in this table, weights of 1 are assumed.

Returns:

float: model classification error ratio

Linear Regression

Linear regression is a simple but very useful and commonly applied approach to the regression task, even though it directly models only linear relationships. The same linear model representation that limits its applicability also makes it fast, efficient, and easy to use compared to more refined regression algorithms.
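For a single input attribute, the least-squares fit has a closed form, sketched below. The function fit_line is a hypothetical illustration and not the library's solver; its intercept argument mirrors the idea of building the model with or without an intercept value.

```python
def fit_line(xs, ys, intercept=True):
    """Least-squares fit of y = slope * x (+ intercept) for one attribute."""
    n = len(xs)
    if intercept:
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        return slope, mean_y - slope * mean_x
    # Without an intercept the line is forced through the origin.
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return slope, 0.0


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # generated by y = 2x + 1
print(fit_line(xs, ys))
print(fit_line(xs, ys, intercept=False))
```

With the intercept enabled the generating line is recovered exactly; forcing the line through the origin gives a steeper slope to compensate.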

class nzpyida.analytics.predictive.linear_regression.LinearRegression(idadb: IdaDataBase, model_name: str)[source]

Bases: Regression

Linear regression predictive model.

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	Creates a linear regression model based on provided data and stores it in a database.
`predict`(in_df[, out_table, id_column])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.
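The four measures named for score_all can be illustrated as follows, assuming the abbreviations carry their usual textbook meanings: mean squared error, mean absolute error, relative squared error and relative absolute error. The function regression_scores is a hypothetical sketch; the library may compute or normalize these values differently.

```python
def regression_scores(actual, predicted):
    """Return MSE, MAE, RSE and RAE under their textbook definitions."""
    n = len(actual)
    mean_y = sum(actual) / n
    sq_err = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    abs_err = sum(abs(y - p) for y, p in zip(actual, predicted))
    return {
        "MSE": sq_err / n,
        "MAE": abs_err / n,
        # Relative measures compare the model with always predicting the mean.
        "RSE": sq_err / sum((y - mean_y) ** 2 for y in actual),
        "RAE": abs_err / sum(abs(y - mean_y) for y in actual),
    }


scores = regression_scores([1, 2, 3, 4], [1, 2, 3, 6])
print(scores)
```

Relative measures below 1 indicate that the model does better than the constant mean predictor.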

fit(in_df: IdaDataFrame, target_column: str, id_column: str = None, in_columns: List[str] = None, nominal_colums: str = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, use_svd_solver: bool = False, intercept: bool = True, calculate_diagnostics: bool = False)[source]

Creates a linear regression model based on provided data and stores it in a database.

Parameters:

in_dfIdaDataFrame

the input data frame

target_columnstr

the input table column representing the prediction target. Multiple targets can be defined through the ‘incolumn’ parameter and column properties.

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

nominal_columsstr, optional

the input table nominal columns, if any, separated by a semi-colon (;). Parameter ‘nominalCols’ is deprecated; please use ‘incolumn’ instead.

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). Per default, all numerical types are continuous, other types are nominal.
its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

use_svd_solverbool, optional

a flag indicating whether Singular Value Decomposition and matrix multiplication should be used for solving the matrix equation

interceptbool, optional

flag indicating whether the model is built with or without an intercept value. The default has changed to true.

calculate_diagnosticsbool, optional

a flag indicating whether diagnostics information should be displayed

Naive Bayes

The naive Bayes classifier is a simpler classification algorithm than most, which makes it quick and easy to apply. While it does not compete with more sophisticated algorithms with respect to classification accuracy, in some cases it may be able to deliver similar results in a fraction of the computation time.
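The simplicity of the algorithm shows in a short sketch for nominal attributes: training only counts class and attribute-value frequencies, and classification multiplies the class prior by the per-attribute conditional frequencies. The function names and the toy data are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def classify(model, row):
    """Return the class maximizing P(class) * product of P(value | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(row):
            score *= value_counts[(i, label)][value] / n  # P(value | class)
        scores[label] = score
    return max(scores, key=scores.get)


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(classify(model, ("rainy", "mild")))
print(classify(model, ("sunny", "hot")))
```

A single pass over the data is enough to build the model, which explains the low computation time.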

class nzpyida.analytics.predictive.naive_bayes.NaiveBayesClassifier(idadb: IdaDataBase, model_name: str)[source]

Bases: Classification

Naive Bayes classifier

Methods

`conf_matrix`(in_df, target_column[, ...])	Makes a prediction for a test data set given by the user and returns a confusion matrix, together with other stats (ACC and WACC).
`cross_validation`(in_df, target_column[, ...])	Performs a cross validation on <in_df> data for given model.
`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	Builds a Naive Bayes model.
`predict`(in_df[, out_table, id_column, ...])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.

fit(in_df: IdaDataFrame, target_column: str, id_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, disc: str = None, bins: int = 10)[source]

Builds a Naive Bayes model.

Parameters:

in_dfIdaDataFrame

the input data frame

target_columnstr

the input table column representing the class

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

in_columnsList[str], optional

the input table columns with special properties, separated by a semi-colon (;). Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). Per default, all numerical types are continuous, other types are nominal.
its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

discstr, optional

discretization type for numeric columns [ew, ef, em]

binsint, optional

default number of bins for numeric columns
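As an illustration of discretizing a numeric column into bins, the sketch below implements equal-width binning. It assumes that ‘ew’ stands for equal-width discretization, which the text above does not state explicitly, and the function name is hypothetical.

```python
def equal_width_bins(values, bins=10):
    """Map each value to a bin index in 0..bins-1 using equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls into the last bin rather than a bin of its own.
    return [min(int((v - lo) / width), bins - 1) for v in values]


values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(equal_width_bins(values, bins=5))
```

After discretization a numeric column can be handled with the same frequency counts as a nominal one.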

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None, out_table_prob: str = None, mestimation: str = None)[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the predictions will be stored
id_columnstr, optional: the input table column identifying a unique instance id. Default: the id column used to build the model
out_table_probstr, optional: if specified, the probability output table where class probability predictions will be stored
mestimationstr, optional: flag indicating whether to use m-estimation for probabilities. This kind of probability estimation may be slower but can give better results for small or heavily unbalanced datasets.

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values
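The effect of m-estimation on small counts can be shown with the textbook formula (count + m * prior) / (total + m). This is a sketch only: the value of m and the prior actually used by the library are not documented here, so both are assumptions, and the function name is hypothetical.

```python
def m_estimate(count, total, prior, m=2.0):
    """Textbook m-estimate of a probability: (count + m * prior) / (total + m)."""
    return (count + m * prior) / (total + m)


# A value never seen with a class gets a small non-zero probability
# instead of zeroing out the whole product of conditional probabilities.
print(m_estimate(0, 2, prior=0.5))
print(m_estimate(3, 3, prior=0.5))
```

This is why the estimate tends to help on small or heavily unbalanced data, where raw relative frequencies are often exactly 0 or 1.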

Regression Trees

Regression trees are decision trees adapted to the regression task, which store numeric target attribute values instead of class labels in leaves, and use appropriately modified split selection and stop criteria.

As with decision trees, regression tree nodes decompose the data into subsets, and regression tree leaves correspond to sufficiently small or sufficiently uniform subsets. Splits are selected to decrease the dispersion of target attribute values, so that they can be reasonably well predicted by their mean values at leaves. The resulting model is piecewise-constant, with fixed predicted values assigned to regions to which the domain is decomposed by the tree structure.
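Split selection by dispersion can be sketched for one numeric attribute: each candidate threshold is scored by how much it reduces the size-weighted variance of the target values. The function names are hypothetical, and the sketch ignores stop criteria such as a minimum improvement or a minimum node size.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(xs, ys):
    """Return (threshold, improvement) for the split x <= threshold
    that reduces the target variance the most."""
    pairs = sorted(zip(xs, ys))
    parent = variance(ys)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * variance(left)
                    + len(right) * variance(right)) / len(pairs)
        improvement = parent - weighted
        if improvement > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, improvement)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, improvement = best_threshold(xs, ys)
print(threshold, improvement)
```

After this split both subsets are uniform, so each leaf would predict the mean of its subset (5.0 and 20.0), which is the piecewise-constant behavior described above.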

class nzpyida.analytics.predictive.regression_trees.DecisionTreeRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: Regression

Decision tree based regressor

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	Creates a regression tree model based on provided data and stores it in a database.
`predict`(in_df[, out_table, id_column, variance])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

fit(in_df: IdaDataFrame, target_column: str, id_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, eval_measure: str = None, min_improve: float = 0.1, min_split: int = 50, max_depth: int = 10, val_table: str = None, qmeasure: str = None, statistics: str = None)[source]

Creates a regression tree model based on provided data and stores it in a database.

Parameters:

in_dfIdaDataFrame

the input data frame

target_columnstr

the input table column representing the prediction target. Multiple targets can be defined through the ‘incolumn’ parameter and column properties.

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

nominal_columsstr, optional

the input table nominal columns, if any, separated by a semi-colon (;). Parameter ‘nominalCols’ is deprecated; please use ‘incolumn’ instead.

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). Per default, all numerical types are continuous, other types are nominal.
its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

eval_measurestr, optional

the split evaluation measure. Allowed values are: variance.

min_improvefloat, optional

the minimum improvement of the split evaluation measure required

min_splitint, optional

the minimum number of instances per tree node that can be split

max_depthint, optional

the maximum number of tree levels (including leaves)

val_tablestr, optional

the input table containing the validation dataset. If this parameter is undefined, no pruning will be performed.

qmeasurestr, optional

the quality measure for pruning the tree. Allowed values are: mse, r2.

statisticsstr, optional

flags indicating which statistics to collect. Allowed values are: none, columns, values:n, all. If statistics=none, no statistics are collected. If statistics=columns, statistics on the input table columns like mean value are collected. If statistics=values:n with n a positive number, statistics about the columns and the column values are collected. Up to <n> column value statistics are collected: If a nominal column contains more than <n> values, only the <n> most frequent column statistics are kept. If a numeric column contains more than <n> values, the values will be discretized and the statistics will be collected on the discretized values. Indicating statistics=all is equal to statistics=values:100.
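
The interplay of eval_measure=variance, min_improve, and min_split can be illustrated with a small pure-Python sketch. It is not the in-database implementation: the helper names are hypothetical, and it assumes that "improvement" means the absolute drop in size-weighted variance after a binary split, which the parameter description above does not spell out.

```python
from statistics import pvariance


def variance_improvement(targets, left_mask):
    """Drop in size-weighted target variance achieved by a binary split.

    Assumption: improvement is measured as an absolute variance reduction.
    """
    left = [t for t, m in zip(targets, left_mask) if m]
    right = [t for t, m in zip(targets, left_mask) if not m]
    if not left or not right:
        return 0.0  # a split that leaves one side empty improves nothing
    n = len(targets)
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
    return pvariance(targets) - weighted


def should_split(targets, left_mask, min_improve=0.1, min_split=50):
    """Apply the two stopping rules: enough instances and enough improvement."""
    if len(targets) < min_split:
        return False
    return variance_improvement(targets, left_mask) >= min_improve
```

With the default min_split of 50, a node holding only a handful of instances is never split in this sketch, however good the candidate split looks.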

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None, variance: bool = False)[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame to predict
out_tablestr, optional: the output table where the predictions will be stored
id_columnstr, optional: the input table column identifying a unique instance id. Default: the id column used to build the model
variancebool, optional: a flag indicating whether the variance of the predictions should be included into the output table

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

Association Rules

Association rules mining is a popular method for discovering interesting and useful patterns in a large scale transaction database. The database contains transactions which consist of a set of items and a transaction identifier (e.g., a market basket). Association rules are implications of the form X -> Y where X and Y are two disjoint subsets of all available items. X is called the antecedent or LHS (left hand side) and Y is called the consequent or RHS (right hand side). Discovered association rules have to satisfy user-defined constraints on measures of significance and interest.
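
Two commonly used measures for a rule X -> Y are support and confidence, which also appear as parameters of the functions documented in this section. The following is a minimal sketch of their textbook definitions on an in-memory list of transactions; the function names and the sample baskets are illustrative assumptions and are not part of nzpyida.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of (LHS and RHS together) divided by the support of LHS."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the antecedent never occurs, so the rule never applies
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```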

The Apriori algorithm organizes the search for frequent itemsets by systematically considering itemsets of increasing size in consecutive iterations. Due to its method of calculation, the number of candidates identified by the Apriori algorithm may be overwhelming for extremely large data sets or a low support threshold. Because of this limitation, the FP-growth algorithm is provided in the IBM Netezza In-Database Analytics package instead.

The FP-growth algorithm avoids candidate generation as well as multiple passes through the data by creating a data structure called a frequent pattern tree, or FP-tree. This tree is a compact representation of the data set contents sufficient for finding frequent itemsets. Nodes of the tree represent single items and store their occurrence counts. Only items with sufficiently high support, that is, frequent itemsets of size 1, are represented. Branches, called node-links, in the FP-tree connect nodes that represent items co-occurring for some instances in the data set. There is also a frequent item header table that points to nodes corresponding to particular items.

The tree is built by identifying all frequent items and their counts, then consecutively “inserting” each transaction into the tree. This requires exactly two scans of the data set, regardless of its size or support threshold level. The FP-tree is used to identify frequent itemsets using a frequent pattern growth process, which traverses the tree by following node-links in an appropriate way.
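
The two-scan construction can be sketched in plain Python as follows. This is an illustration of the general FP-tree idea under simplifying assumptions (in-memory transactions, an absolute minimum count, ties between equally frequent items broken alphabetically), not the code used by the in-database procedure; the class and function names are hypothetical.

```python
from collections import Counter


class FPNode:
    """One tree node: an item, its occurrence count, and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count every item once per transaction and keep frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    header = {item: [] for item in frequent}  # item -> nodes holding it

    # Scan 2: insert each transaction, most frequent items first, so that
    # transactions sharing frequent items also share a path prefix.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header
```

The header table returned here plays the role of the frequent item header table: for each frequent item it lists the tree nodes that represent it.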

By avoiding explicit candidate generation, the FP-growth algorithm reduces the number of data set scans. It can also perform efficiently, regardless of the threshold support.

class nzpyida.analytics.predictive.association_rules.ARule(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

Methods

`describe`()	Returns model description.
`fit`(in_df[, transaction_id_column, ...])	This function builds an Association Rules model.
`predict`(in_df[, out_table, ...])	Makes predictions based on this model.

fit(in_df: IdaDataFrame, transaction_id_column: str = 'tid', item_column: str = 'item', by_column: str = None, level: int = 1, max_set_size: int = 6, support: float = None, support_type: str = 'percent', confidence: float = 0.5)[source]

This function builds an Association Rules model. The model is saved to the database in a set of tables and registered in the database model metadata. Use the function ‘describe’ to display the Association Rules of the model, or the Model Management functions to further manipulate the model.

Parameters:

in_dfIdaDataFrame

the input data frame

transaction_id_columnstr, optional

the input table column identifying transactions

item_columnstr, optional

the input table column identifying items in transactions

by_columnstr, optional

the input table column identifying groups of transactions if any. Association Rules mining is done separately on each of these groups. Leave the parameter undefined if no groups are to be considered.

levelint, optional

ARULE first temporarily redistributes the data into overlapping parts in such a way that each part can be processed in parallel without communication between the SPUs. Note that for this to work, there can be redundancy between the parts, so the cumulative size of the temporary parts can be much higher than that of the original data set. The parameter level controls how many parts are created. The higher the level:

The more computation and temporary database space is required for the splitting

The smaller the amount of main memory that is required for each data slice

Note: To fully use the benefits of parallel computing, do not set the level parameter too low. Additionally, the lower the value of the level parameter, the higher the memory consumption for each part. The higher memory consumption might cause an out-of-memory error on the SPUs. If an out-of-memory error occurs on the SPUs, increase the level parameter.

If you specify the value 0, the algorithm is executed serially for each data set group. This might be faster only if the data set fits in one node and the splitting would increase the total number of rows dramatically. Default: 1. Minimum: 0.

support: float, optional

minimum support value satisfied by all association rules. According to support_type, it defines the absolute number (#supporting transactions) or the percentage of transactions (#supporting transactions/#total transactions*100). Too low a minimum support increases the number of generated rules and the computational expense.

support_type: str, optional

how the minimum support should be interpreted. The following values are allowed: absolute, percent. Note that the support and support_type values are common to all groups in the dataset. For example, if 3 is the absolute minimum support, an itemset is considered frequent if at least 3 transactions contain its items, regardless of the number of transactions in the group. Use support_type=‘percent’ to indicate a minimum support that depends on the size of the groups. Specifying support_type=‘absolute’ takes effect only if a support is explicitly supplied.

confidence: float, optional

the minimum confidence required for an association rule. Default: 0.5. Minimum: 0. Maximum: 1.
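
The difference between the two support_type values can be made concrete with a short sketch that converts a support setting into a minimum number of supporting transactions for one group. The helper name is hypothetical, and rounding the percentage up to a whole transaction is an assumption made here for illustration.

```python
import math


def min_support_count(support, support_type, n_transactions):
    """Minimum number of supporting transactions for one group.

    Assumption: a fractional threshold is rounded up to a whole transaction.
    """
    if support_type == "absolute":
        return math.ceil(support)  # independent of the group size
    if support_type == "percent":
        return math.ceil(support * n_transactions / 100.0)
    raise ValueError("support_type must be 'absolute' or 'percent'")
```

An absolute support of 3 yields the same threshold for a group of 10 and a group of 1000 transactions, whereas a percentage scales with the group size.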

predict(in_df: IdaDataFrame, out_table: str = None, transaction_id_column: str = 'tid', item_column: str = 'item', by_column: str = None, scoring_type: str = 'exclusiveRecommend', name_map_column: str = None, item_name_column: str = 'item', item_name_mapped_column: str = 'item_name', min_size: int = 1, max_size: int = 64, min_support: float = 0.0, max_support: float = 1.0, min_confidence: float = 0.0, max_confidence: float = 1.0, min_lift: float = None, max_lift: float = None, min_conviction: float = 0.0, max_conviction: float = None, min_affinity: float = 0.0, max_affinity: float = 1.0, min_leverage: float = -0.25, max_leverage: float = 1.0) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the predictions will be stored
transaction_id_columnstr, optional: the input table column identifying transactions
item_columnstr, optional: the input table column identifying items in transactions
by_columnstr, optional: the input table column identifying groups of transactions if any. Association Rules mining is done separately on each of these groups. Leave the parameter undefined if no groups are to be considered.
scoring_type: str, optional: how the scoring algorithm should be applied to the input data. The following values are allowed: recommend, exclusiveRecommend. recommend - A rule is returned if its left hand side itemset is a subset of the transaction. exclusiveRecommend - A rule is returned if its left hand side itemset is a subset of the input itemset, and its right hand side itemset is not a subset of the transaction.
name_map_column: str, optional: table which provides names of items and their associated mapped values in LHS_ITEMS, RHS_ITEMS columns of outtable
item_name_column: str, optional: the column name of namemap table where the item identifiers are stored
item_name_mapped_column: str, optional: the column name of namemap table where the item names are stored which should be used instead of the item identifier
min_size: int, optional: The minimum number of items per association rule to be applied
max_size: int, optional: The maximum number of items per association rule to be applied
min_support: float, optional: The minimum support of an association rule to be applied
max_support: float, optional: The maximum support of an association rule to be applied
min_confidence: float, optional: The minimum confidence of an association rule to be applied.
max_confidence: float, optional: The maximum confidence of an association rule to be applied
min_lift: float, optional: The minimum lift of an association rule to be applied
max_lift: float, optional: The maximum lift of an association rule to be applied
min_conviction: float, optional: The minimum conviction of an association rule to be applied
max_conviction: float, optional: The maximum conviction of an association rule to be applied
min_affinity: float, optional: The minimum affinity of an association rule to be applied
max_affinity: float, optional: The maximum affinity of an association rule to be applied
min_leverage: float, optional: The minimum leverage of an association rule to be applied
max_leverage: float, optional: The maximum leverage of an association rule to be applied

Returns:

IdaDataFrame: the data frame containing the output of an Association Rules model prediction
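
The two scoring_type values described above differ only in how the right hand side of a rule is treated. A minimal in-memory sketch of that selection logic follows; the rule representation (pairs of item tuples) and the function name are assumptions for illustration, and the filters on size, support, confidence, and the other measures are left out.

```python
def applicable_rules(rules, transaction, scoring_type="exclusiveRecommend"):
    """Select the rules that would be returned for one transaction."""
    if scoring_type not in ("recommend", "exclusiveRecommend"):
        raise ValueError("unknown scoring_type: %r" % scoring_type)
    items = set(transaction)
    selected = []
    for lhs, rhs in rules:
        if not set(lhs) <= items:
            continue  # the left hand side must be contained in the transaction
        if scoring_type == "exclusiveRecommend" and set(rhs) <= items:
            continue  # skip rules whose right hand side is already present
        selected.append((lhs, rhs))
    return selected


# Hypothetical rules as (LHS, RHS) pairs.
rules = [
    (("bread",), ("butter",)),
    (("bread",), ("milk",)),
    (("eggs",), ("milk",)),
]
print(applicable_rules(rules, {"bread", "milk"}, "recommend"))
print(applicable_rules(rules, {"bread", "milk"}, "exclusiveRecommend"))
```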

Bisecting KMeans

The divisive clustering algorithm is a computationally efficient, top-down approach to creating hierarchical clustering models. Conceptually, it can be thought of as a wrapper around the k-means algorithm (with a specialized method for initial centroid setting), running the algorithm several times to divide clusters into subclusters. The internal k-means algorithm assumes a fixed k=2 value.

The divisive clustering algorithm may return different results for the same data set and the same random generator seed when you use different input data distribution or a different number of dataslices. This is due to the behavior of the random number generator, which generates random sequences depending on the number of dataslices and data distribution. The algorithm returns the same model when you use the same machine, the same input data distribution, and the same random seed.

The cluster formation process of the divisive clustering algorithm begins with a single cluster containing all training instances, then the first invocation of k-means divides it into two subclusters by creating two descendant nodes of the clustering tree. Subsequent invocations divide these clusters into more subclusters, and so on, until a stop criterion is satisfied. The stop criterion can be specified by the maximum clustering tree depth or by the minimum required number of instances for further partitioning. The resulting hierarchical clustering tree can be used to classify instances by propagating them down from the root node, and choosing at each level the best matching sub-cluster with respect to the instance’s distance from sub-cluster centers.
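
The top-down classification step can be sketched in a few lines of Python. The tree layout (a dictionary from a node to its two children), the node identifiers, and the use of Euclidean distance are assumptions chosen for illustration; they do not reflect how the model is stored in the database.

```python
import math


def assign_cluster(tree, centers, point, root=1):
    """Propagate `point` from the root down to a leaf of the cluster tree.

    tree:    node id -> (left child id, right child id), internal nodes only
    centers: node id -> cluster center coordinates
    """
    node = root
    while node in tree:  # stop once a node without children is reached
        left, right = tree[node]
        # Follow the child whose center is closest to the point.
        node = min((left, right), key=lambda c: math.dist(point, centers[c]))
    return node


# Hypothetical two-level tree: node 1 splits into 2 and 3, node 2 into 4 and 5.
tree = {1: (2, 3), 2: (4, 5)}
centers = {2: (0.0, 0.0), 3: (10.0, 10.0), 4: (-1.0, 0.0), 5: (1.0, 0.0)}
print(assign_cluster(tree, centers, (0.8, 0.1)))
```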

The internal k-means process of the divisive clustering algorithm uses the ordinary k-means algorithm (with the modified initial centroid generation), discussed in the K-Means Clustering section, with a fixed value of k=2 and working on the subset of data from the parent cluster. The initial centroid generation consists of two steps: random generation of n>>k candidates, followed by selection of the outermost pair of candidates. The cluster center representation and distance measures remain the same. The numbering scheme for clusters in a clustering tree is the same as for decision trees: the root node is number 1, and the descendants of node number ‘i’ have numbers ‘2i’ and ‘2i+1’.

Additionally, leaves, which are clusters with no subclusters, are designated by negative numbers.
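
The numbering scheme lends itself to simple integer arithmetic. The helpers below are a sketch with hypothetical names; they assume that a leaf is labeled with the negated value of the position it would have under the ‘2i’ / ‘2i+1’ scheme, which is one plausible reading of the description above.

```python
def children(node_id):
    """Position numbers of the two descendants of a node."""
    i = abs(node_id)
    return 2 * i, 2 * i + 1


def parent(node_id):
    """Position number of the parent, or None for the root."""
    i = abs(node_id)
    return None if i == 1 else i // 2


def depth(node_id):
    """Number of splits between the root (depth 0) and the node."""
    return abs(node_id).bit_length() - 1


def path_from_root(node_id):
    """Position numbers visited when descending from the root to the node."""
    path = []
    i = abs(node_id)
    while i >= 1:
        path.append(i)
        i //= 2
    return path[::-1]
```

For example, under these assumptions a leaf labeled -5 sits at position 5, below node 2, two levels under the root.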

class nzpyida.analytics.predictive.bisecting_kmeans.BisectingKMeans(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

Divisive Clustering

Methods

`describe`()	Returns model description.
`fit`(in_df[, id_column, target_column, ...])	Builds a Hierarchical Clustering model using a divisive method (top-down).
`predict`(in_df[, out_table, id_column, level])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column, level])	Scores the model.

fit(in_df: IdaDataFrame, id_column: str = None, target_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, out_table: str = None, distance: str = 'euclidean', max_iter: int = 5, min_split: int = 5, max_depth: int = 3, rand_seed: int = 12345) → IdaDataFrame[source]

Builds a Hierarchical Clustering model using a divisive method (top-down). The K-means algorithm is used recursively. The hierarchy of clusters is represented in a binary tree structure (each parent node has exactly 2 child nodes). The leaves of the cluster tree are identified by negative numbers. The divisive clustering algorithm may return different results for the same dataset and the same random generator seed when you use different input data distribution or a different number of dataslices. This is due to the behavior of the random number generator, which generates random sequences depending on the number of dataslices and data distribution. The algorithm returns the same model when the same machine, the same input data distribution, and the same random seed are used.

Parameters:

in_dfIdaDataFrame

the input data frame

id_columnstr, optional

the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

target_columnstr, optional

the input table column representing a class or a value to predict, this column is ignored by the Hierarchical Clustering algorithm

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

out_tablestr, optional

the output table where clusters are assigned to each input table record

distancestr, optional

the distance function. Allowed values are: euclidean, norm_euclidean, manhattan, canberra, maximum, mahalanobis.

max_iterint, optional

the maximum number of iterations to perform in the base K-means Clustering algorithm

min_splitint, optional

the minimum number of instances per cluster that can be split

max_depthint, optional

the maximum number of cluster levels (including leaves)

rand_seedint, optional

the random generator seed

Returns:

IdaDataFrame: the data frame containing row identifiers, cluster_id and distance to cluster center

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None, level: int = -1) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the assigned clusters will be stored
id_columnstr, optional: the input table column identifying a unique instance id. Default: the id column used to build the model
levelint, optional: the level of the cluster hierarchy which should be applied to the data. For level=-1, only the leaves of the clustering tree are considered

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

score(in_df: IdaDataFrame, target_column: str, id_column: str = None, level: int = -1) → float[source]

Scores the model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame for scoring
target_columnstr: the input table column representing the class
id_columnstr, optional: the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id
levelint, optional: the level of the cluster hierarchy which should be applied to the data. For level=-1, only the leaves of the clustering tree are considered

Returns:

float: the model score

Two Step Clustering

TwoStep clustering is a data mining algorithm for large data sets. It is faster than traditional methods because it typically scans a data set only once before it saves the data to a clustering feature (CF) tree. TwoStep clustering can make clustering decisions without repeated data scans, whereas other clustering methods scan all data points, which requires multiple iterations. Non-uniform points are not gathered, so each iteration requires a reinspection of each data point, regardless of the significance of the data point. Because TwoStep clustering treats dense areas as a single unit and ignores pattern outliers, it provides high-quality clustering results without exceeding memory constraints.

The TwoStep algorithm has the following advantages:
- It automatically determines the optimal number of clusters. You do not have to manually create a different clustering model for each number of clusters.
- It detects input columns that are not useful for the clustering process. These columns are automatically set to supplementary. Statistics are gathered for these columns but they do not influence the clustering algorithm.
- The configuration of the CF tree can be granular, so that you can balance between memory usage and model quality, according to the environment and needs.

class nzpyida.analytics.predictive.two_step_clustering.TwoStepClustering(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

TwoStep Clustering

Methods

`describe`()	Returns model description.
`fit`(in_df[, id_column, target_column, ...])	Builds a TwoStep Clustering model that first distributes the input data into a hierarchical tree structure according to the distance between the data records, then reduces the tree into k clusters.
`predict`(in_df[, out_table, id_column])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.

fit(in_df: IdaDataFrame, id_column: str = None, target_column: str = None, in_columns: List[str] = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None, out_table: str = None, k: int = 0, max_k: int = 20, bins: int = 10, statistics: str = None, rand_seed: int = 12345, distance: str = 'loglikelihood', distance_threshold: float = None, distance_threshold_factor: float = 2.0, epsilon: float = 0.0, node_capacity: int = 6, leaf_capacity: int = 8, max_leaves: int = 1000, outlier_fraction: float = 0.0) → IdaDataFrame[source]

Builds a TwoStep Clustering model that first distributes the input data into a hierarchical tree structure according to the distance between the data records, then reduces the tree into k clusters. A second pass over the data associates the input data records to the next cluster.

Parameters:

in_dfIdaDataFrame

the input data frame

id_columnstr, optional

the input table column identifying a unique instance id

target_columnstr, optional

the input table column representing a class or a value to predict, this column is ignored by the TwoStep Clustering algorithm

in_columnsList[str], optional

the list of input table columns with special properties. Each column is followed by one or several of the following properties:

its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties.

col_def_typestr, optional

default type of the input table columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input table columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input table columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

out_tablestr, optional

the output table where clusters are assigned to each input table record

kint, optional

the number of clusters. If k is 0 or less, the procedure determines the optimal number of clusters

max_kint, optional

the maximum number of clusters that can be determined automatically. If k is bigger than 0, this parameter is ignored

binsint, optional

the average number of bins for numerical statistics with more than <n> values

statisticsstr, optional

flags indicating which statistics to collect. Allowed values are: none, columns, values:n, all. Regardless of the value of the parameter statistics, all statistics are gathered since they are needed to call PREDICT_TWOSTEP on this model. If statistics=none or statistics=columns, the importance of the attributes is not calculated. If statistics=none, statistics=columns or statistics=all, up to 100 discrete values are gathered. If statistics=values:n with n a positive number, up to <n> column value statistics are collected:

- If a nominal column contains more than <n> values, only the <n> most frequent column statistics are kept.
- If a numeric column contains more than <n> values, the values will be discretized and the statistics will be collected on the discretized values.

Indicating statistics=all is equal to statistics=values:100.

rand_seedint, optional

the random generator seed

distancestr, optional

the distance function. Allowed values are: euclidean, norm_euclidean, loglikelihood

distance_thresholdfloat, optional

the threshold under which 2 data records can be merged into one cluster during the first pass. If not set, the distance threshold is calculated automatically

distance_threshold_factorfloat, optional

the factor used to calculate the distance threshold automatically. The distance threshold is then the median distance value minus distance_threshold_factor times the interquartile distance (or the minimum distance if this value is below it). If distance_threshold is set, this parameter is ignored

epsilonfloat, optional

the value to be used as global variance of all continuous fields for the loglikelihood distance. If the value is 0.0 or less, the global variance is calculated for each continuous field. If distance is not loglikelihood, this parameter is ignored

node_capacityint, optional

the branching factor of the internal tree used in pass 1. Each node can have up to node_capacity subnodes

leaf_capacityint, optional

the number of clusters per leaf node in the internal tree used in pass 1

max_leavesint, optional

the maximum number of leaf nodes in the internal tree used in pass 1. When the tree contains max_leaves leaf nodes, the following data records are aggregated into the existing clusters

outlier_fractionfloat, optional

the fraction of the records to be considered as outliers in the internal tree used in pass 1. Clusters containing fewer than outlier_fraction times the mean number of data records per cluster are removed

Returns:

IdaDataFrame: the data frame containing row identifiers, cluster_id and distance to cluster center
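
The rule given for distance_threshold_factor (median distance minus the factor times the interquartile distance, but never below the minimum distance) can be written down directly. The sketch below uses Python's statistics.quantiles with its default method; which distances are sampled and how the quartiles are computed inside the database procedure are not stated in this reference, so both are assumptions, as is the function name.

```python
from statistics import quantiles


def auto_distance_threshold(distances, factor=2.0):
    """Median minus factor * interquartile range, floored at the minimum."""
    q1, median, q3 = quantiles(distances, n=4)
    candidate = median - factor * (q3 - q1)
    # Fall back to the smallest observed distance if the candidate is below it.
    return max(candidate, min(distances))
```

With widely spread distances and the default factor of 2.0, the candidate value easily drops below the smallest distance, in which case the minimum distance is used.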

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the assigned clusters will be stored
id_columnstr, optional: the input table column identifying a unique instance id. Default: the id column used to build the model

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

score(in_df: IdaDataFrame, target_column: str, id_column: str = None) → float[source]

Scores the model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame for scoring
target_columnstr: the input table column representing the class
id_columnstr, optional: the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

Returns:

float: the model score

Time Series Forecasting

Many types of business-relevant or scientific data have values that change over time. Some typical examples are:
- Daily sales figures for a store
- Energy consumption readings from household electric meters
- Price per gallon at a local gas station

It is often useful to analyze the behavior of such changes, both to describe the development over time, specifically for a particular trend and seasonality, as well as to predict unknown values of the series, usually for the future. A typical area of application is supply chain management, where future needs may be predicted based on past trends.

A time series is a sequence of numerical data values, measured at successive, but not necessarily equidistant, points in time. Examples are daily stock prices, monthly unemployment counts, or annual changes in global temperature. The two main goals of time series analysis are to understand the underlying patterns which are represented by the observed data and to make forecasts.
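
Because observations need not be equidistant, a series is often resampled onto a regular grid before analysis. The sketch below shows linear interpolation, the simplest such method, on plain Python lists. It is an illustration only: the function name is hypothetical, it assumes at least two observations with strictly increasing numeric times, and it makes no claim about how the in-database procedure interpolates.

```python
def resample_linear(times, values, step):
    """Resample a series onto an equidistant grid by linear interpolation.

    Assumes len(times) >= 2 and strictly increasing numeric `times`.
    """
    n_points = int((times[-1] - times[0]) // step) + 1
    out = []
    i = 0  # index of the segment [times[i], times[i + 1]] containing t
    for k in range(n_points):
        t = times[0] + k * step
        while i < len(times) - 2 and times[i + 1] < t:
            i += 1
        t0, t1 = times[i], times[i + 1]
        w = (t - t0) / (t1 - t0)  # relative position inside the segment
        out.append((t, values[i] + w * (values[i + 1] - values[i])))
    return out


# Observations at times 0, 1 and 4, resampled to a step of 1.
print(resample_linear([0, 1, 4], [0.0, 10.0, 40.0], 1))
```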

class nzpyida.analytics.predictive.timeseries.TimeSeries(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

Time Series Model

Methods

`describe`()	Returns model description.
`fit_predict`(in_df, time_column, target_column)	Predicts future values of series of timed numeric values

fit_predict(in_df: IdaDataFrame, time_column: str, target_column: str, by_column: str = None, out_table: str = None, description_table: str = None, algorithm: str = 'ExponentialSmoothing', interpolation_method: str = 'linear', from_time=None, to_time=None, forecast_horizon: str = None, forecast_times: str = None, trend: str = None, seasonality: str = None, period: float = None, unit: str = None, p: int = None, d: int = None, q: int = None, sp: int = None, sd: int = None, sq: int = None, saesonally_adjusted_table: str = None) → IdaDataFrame[source]

Predicts future values of series of timed numeric values

Parameters:

in_dfIdaDataFrame: the input data frame
time_columnstr: the input data frame column which defines an order on the numeric values
target_columnstr: the input data frame column which contains the numeric values
by_columnstr: the input data frame column which uniquely identifies a series of values. If not specified, all numeric values belong to only one time series.
out_tablestr: the output data frame containing predicted future values. This parameter is not allowed for algorithm = SpectralAnalysis. If not specified, no output table is written out
description_tablestr: the optional input data frame containing the name and descriptions of the time series. The table must contain following columns: <by_column>, ‘NAME’=str, ‘DESCRIPTION’=str. If not specified, the series do not have a name or a description
algorithmstr: the time series algorithm to use. Allowed values are: ExponentialSmoothing, ARIMA, SeasonalTrendDecomposition, SpectralAnalysis
interpolation_methodstr: the interpolation method. Allowed values are: linear, cubicspline, exponentialspline
from_timesame as type of <time column>: the value of column time to start the analysis from. If not specified, the analysis starts from the first value of the time series in the input table
to_timesame as type of <time column>: the value of column time to stop the analysis at. If not specified, the analysis stops at the last value of the time series in the input table
forecast_horizonstr: the value of column time until which to predict. This parameter is not allowed for algorithm=SpectralAnalysis. If not specified, the algorithm itself determines the time until which it predicts values
forecast_timesstr: list of semicolon-separated values of column time to predict at. This parameter is not allowed for algorithm=SpectralAnalysis. If not specified, the times to predict values at are determined by the algorithm
trendstr: the trend type for algorithm=ExponentialSmoothing. Allowed values are: N (none), A (additive), DA (damped additive), M (multiplicative), DM (damped multiplicative). If not specified, the trend type is determined by the algorithm
seasonalitystr: the seasonality type for algorithm=ExponentialSmoothing. Allowed values are: N (none), A (additive), M (multiplicative). If not specified, the seasonality type is determined by the algorithm
periodfloat: the seasonality period. This parameter is not allowed for algorithm=SpectralAnalysis. If not specified, the seasonality period is determined by the algorithm. If set to 0, no seasonality period will be considered by the algorithm
unitstr: the seasonality period unit. This parameter is not allowed for algorithm=SpectralAnalysis. This parameter must be specified if the parameter period is specified and the <time_column> is of type date, time or timestamp. Otherwise, it must not be specified. Allowed values are: ms, s, min, h, d, wk, qtr, q, a, y
pint: the parameter p for algorithm=ARIMA, either equal to or below specified value. If not specified, the algorithm will determine its best value automatically
dint: the parameter d for algorithm=ARIMA, either equal to or below specified value. If not specified, the algorithm will determine its best value automatically
qint: the parameter q for algorithm=ARIMA, either equal to or below specified value. If not specified, the algorithm will determine its best value automatically
spint: the seasonal parameter SP for algorithm=ARIMA, either equal to or below specified value. If not specified, the algorithm will determine its best value automatically
sdint: the seasonal parameter SD for algorithm=ARIMA, either equal to or below specified value. If not specified, the algorithm will determine its best value automatically
sqint: the seasonal parameter SQ for algorithm=ARIMA, either equal to or below specified value. If not specified, the algorithm will determine its best value automatically
saesonally_adjusted_tablestr: the output table containing seasonally adjusted values. This parameter is not allowed for algorithm=SpectralAnalysis or algorithm=ARIMA. If not specified, no output table is written out
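
The constraints listed above (allowed algorithm names, allowed units, parameters that are rejected for SpectralAnalysis, and the semicolon-separated format of forecast_times) can be checked before the call is sent to the database. The following is a minimal sketch only: build_fit_predict_kwargs is a hypothetical helper that is not part of nzpyida, and it covers only a subset of the documented parameters.

```python
from typing import Any, Dict, Optional, Sequence

# Values copied from the parameter descriptions above.
ALGORITHMS = {"ExponentialSmoothing", "ARIMA",
              "SeasonalTrendDecomposition", "SpectralAnalysis"}
UNITS = {"ms", "s", "min", "h", "d", "wk", "qtr", "q", "a", "y"}


def build_fit_predict_kwargs(
    time_column: str,
    target_column: str,
    algorithm: str = "ExponentialSmoothing",
    out_table: Optional[str] = None,
    forecast_times: Optional[Sequence[str]] = None,
    period: Optional[float] = None,
    unit: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate documented constraints, return kwargs."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    if unit is not None and unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    if algorithm == "SpectralAnalysis":
        # These parameters are documented as not allowed for this algorithm.
        restricted = {"out_table": out_table, "forecast_times": forecast_times,
                      "period": period, "unit": unit}
        used = sorted(name for name, value in restricted.items()
                      if value is not None)
        if used:
            raise ValueError("not allowed for SpectralAnalysis: " + ", ".join(used))

    kwargs: Dict[str, Any] = {
        "time_column": time_column,
        "target_column": target_column,
        "algorithm": algorithm,
    }
    if out_table is not None:
        kwargs["out_table"] = out_table
    if forecast_times:
        # forecast_times is documented as a semicolon-separated string.
        kwargs["forecast_times"] = ";".join(forecast_times)
    if period is not None:
        kwargs["period"] = period
    if unit is not None:
        kwargs["unit"] = unit
    return kwargs
```

The returned dictionary is meant to be unpacked into the call, for example fit_predict(in_df, **kwargs). The rule that unit is required only when the time column is of type date, time or timestamp is not checked here, because the helper does not see the column type.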

Tree Bayesian Networks

Tree-shaped Bayesian networks formally belong to the data exploration category. However, this algorithm is considerably more complex than other data exploration algorithms and not as widely known, warranting detailed description.

A Bayesian network can be considered a graphical representation of probabilistically described relationships within a set of attributes, allowing probabilistic inference to be performed. The representation is created by extracting the structural properties of the distribution from the data.

Creating and using general Bayesian networks is algorithmically and computationally complex. Tree-shaped Bayesian networks, however, constitute a simplified subclass of Bayesian networks with restrictions imposed on the type of attribute relationships that can be discovered and represented. The restrictions permit simpler and more efficient algorithms as well as more straightforward interpretation. Tree-shaped Bayesian networks may not be sufficient for highly accurate prediction, but they provide an excellent qualitative description of the relationship structure observed in the data.

class nzpyida.analytics.predictive.bayesian_networks.BinaryTreeBayesNetwork(idadb: IdaDataBase, model_name: str)[source]

Bases: TreeBayesNetwork

Methods

`describe`()	Returns model description.
`fit`(in_df[, in_columns, base_index, ...])	Builds a tree-like Bayesian Network for continuous variables.
`predict`(in_df[, target_column, id_column, ...])	Makes predictions based on this model.
`score`(in_df[, target_column, id_column, ...])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.bayesian_networks.MultiTreeBayesNetwork(idadb: IdaDataBase, model_name: str)[source]

Bases: TreeBayesNetwork

Methods

`describe`()	Returns model description.
`fit`(in_df, class_column[, in_columns, ...])	Parameters:
`predict`(in_df[, target_column, id_column, ...])	Makes predictions based on this model.
`score`(in_df[, target_column, id_column, ...])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

fit(in_df: IdaDataFrame, class_column: str, in_columns: List[str] = None, base_index: int = None, sample_size: int = None, talk: str = None, edge_lab_sort: str = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None) → None[source]

Parameters:

in_dfIdaDataFrame

the input data frame

class_columnstr

the target class; this should be a column with nominal values

in_columnsList[str]

List of the input dataframe columns with special properties. Each column is followed by one or several of the following properties:

- type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
- role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’).

If the parameter is undefined, all columns of the input table have default properties. Note that this procedure only accepts continuous columns with role ‘input’. Additionally, each column is followed by a colon (:) and either X or Y to distinguish the two sets of variables.

base_indexint, optional

the numeric id to be assigned to the first variable

sample_sizeint, optional

the sample size to take if the number of records is too large

talkstr, optional

if talk=yes then additional information on progress will be displayed

edge_lab_sortstr, optional

if edge_lab_sort=yes then the left end of the edge will have a name lower in alphabetic order than the right one

col_def_typestr, optional

default type of the input dataframe columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input dataframe columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input dataframe columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input dataframe column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)
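
The in_columns entries described above combine a column name with optional type and role suffixes. The sketch below shows one way to assemble such entries. column_spec is a hypothetical helper, and it assumes the suffixes are appended directly to the column name (for example AGE:cont:input), which the text above implies but does not spell out.

```python
from typing import Optional

# Suffix values taken from the in_columns description above.
TYPES = {"nom", "cont"}
ROLES = {"id", "target", "input", "ignore"}


def column_spec(name: str, col_type: Optional[str] = None,
                role: Optional[str] = None) -> str:
    """Hypothetical helper: build one in_columns entry such as 'AGE:cont:input'."""
    if col_type is not None and col_type not in TYPES:
        raise ValueError(f"unknown type: {col_type!r}")
    if role is not None and role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    parts = [name]
    if col_type is not None:
        parts.append(col_type)
    if role is not None:
        parts.append(role)
    return ":".join(parts)


# Illustrative column names only.
in_columns = [
    column_spec("AGE", "cont", "input"),
    column_spec("CUSTOMER_ID", role="id"),
    column_spec("NOTES", role="ignore"),
]
```

The resulting list has the List[str] shape that the in_columns parameter expects.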

class nzpyida.analytics.predictive.bayesian_networks.TreeAgumentedNetwork(idadb: IdaDataBase, model_name: str)[source]

Bases: TreeBayesNetwork

Methods

`describe`()	Returns model description.
`fit`(in_df, in_model, class_column[, ...])	Parameters:
`predict`(in_df[, target_column, id_column, ...])	Makes predictions based on this model.
`score`(in_df[, target_column, id_column, ...])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

fit(in_df: IdaDataFrame, in_model: str, class_column: str, edge_lab_sort: str = None) → None[source]

Parameters:

in_dfIdaDataFrame: the input data frame
in_modelstr: the name of the input Bayesian Network model
class_columnstr: the target class; this should be a column with nominal values
edge_lab_sortstr, optional: if edge_lab_sort=yes then the left end of the edge will have a name lower in alphabetic order than the right one

class nzpyida.analytics.predictive.bayesian_networks.TreeBayesNetwork(idadb: IdaDataBase, model_name: str)[source]

Bases: Regression

Methods

`describe`()	Returns model description.
`fit`(in_df[, in_columns, base_index, ...])	Builds a tree-like Bayesian Network for continuous variables.
`predict`(in_df[, target_column, id_column, ...])	Makes predictions based on this model.
`score`(in_df[, target_column, id_column, ...])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

fit(in_df: IdaDataFrame, in_columns: List[str] = None, base_index: int = 777, sample_size: int = None, talk: str = None, size_warning: str = None, edge_lab_sort: str = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None) → None[source]

Builds a tree-like Bayesian Network for continuous variables. A spanning tree is constructed that joins all the variables on the basis of the strongest correlations. This gives the user an overview of the most significant interrelations governing the whole set of variables.
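
To make the idea of a correlation-based spanning tree concrete, the following sketch builds a maximum spanning tree over absolute Pearson correlations with Kruskal's algorithm in plain Python. It illustrates the general technique only. The in-database stored procedure may use a different algorithm, and the variable names and data are made up.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def correlation_spanning_tree(
    data: Dict[str, Sequence[float]]
) -> List[Tuple[str, str, float]]:
    """Join all variables with the strongest |correlation| edges (Kruskal)."""
    edges = sorted(
        ((abs(pearson(data[a], data[b])), a, b)
         for a, b in combinations(sorted(data), 2)),
        reverse=True,
    )
    parent = {name: name for name in data}

    def find(node: str) -> str:
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge connects two separate components
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Made-up example data with four continuous variables.
example = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10.5],
    "C": [5, 3, 4, 1, 2],
    "D": [1, 0, 1, 0, 1],
}
tree = correlation_spanning_tree(example)
```

For n variables the tree has n - 1 edges, so each variable is linked to the rest of the network through its strongest available correlations.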

Parameters:

in_dfIdaDataFrame

the input data frame

in_columnsList[str]

List of the input dataframe columns with special properties. Each column is followed by one or several of the following properties:

- type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
- role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’).

If the parameter is undefined, all columns of the input dataframe have default properties. Note that this procedure only accepts continuous columns with role ‘input’.

base_indexint, optional

the numeric id to be assigned to the first variable

sample_sizeint, optional

the sample size to take if the number of records is too large

talkstr, optional

if talk=yes then additional information on progress will be displayed

size_warningstr, optional

if size_warning=yes then no exception is thrown when there are fewer records than 3 times the number of columns. Instead, a notice is displayed and the stored procedure returns ‘sizewarn’

edge_lab_sortstr, optional

if edge_lab_sort=yes then the left end of the edge will have a name lower in alphabetic order than the right one

col_def_typestr, optional

default type of the input dataframe columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input dataframe columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input dataframe columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

predict(in_df: IdaDataFrame, target_column: str = None, id_column: str = None, prediction_type: str = 'best', out_table: str = None) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame
target_columnstr: The model variable to be predicted
id_columnstr, optional: The column of the input dataframe that identifies a unique instance ID
prediction_typestr, optional: The type of prediction to be made. Valid values are best (most correlated neighbor), neighbors (weighted prediction of neighbors), and nn-neighbors (non null neighbors)
out_tablestr, optional: The name of the output dataframe where the predictions are to be stored

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

score(in_df: IdaDataFrame, target_column: str = None, id_column: str = None, prediction_type: str = 'best') → float[source]

Scores the model. The model must exist.

Parameters:

in_dfIdaDataFrame: the input data frame for scoring
target_columnstr: the input dataframe column representing the class
id_columnstr: the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id
prediction_typestr, optional: The type of prediction to be made. Valid values are best (most correlated neighbor), neighbors (weighted prediction of neighbors), and nn-neighbors (non null neighbors)

Returns:

float: the model score
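
The fit, predict and score methods documented above are typically called in sequence. The sketch below wraps that sequence in a hypothetical helper, fit_predict_score, which accepts any object exposing these three methods with the documented keyword arguments (for example a TreeBayesNetwork instance bound to an open database connection). The helper and its argument names are illustrative and not part of nzpyida.

```python
# Values listed in the prediction_type description above.
PREDICTION_TYPES = ("best", "neighbors", "nn-neighbors")


def fit_predict_score(model, train_df, test_df, target_column, id_column,
                      prediction_type="best"):
    """Hypothetical helper: train, predict and score with one prediction type."""
    if prediction_type not in PREDICTION_TYPES:
        raise ValueError(f"unknown prediction_type: {prediction_type!r}")
    # fit() stores the model in the database under the model's name.
    model.fit(train_df)
    predictions = model.predict(
        test_df,
        target_column=target_column,
        id_column=id_column,
        prediction_type=prediction_type,
    )
    score = model.score(
        test_df,
        target_column=target_column,
        id_column=id_column,
        prediction_type=prediction_type,
    )
    return predictions, score
```

Validating prediction_type on the client side surfaces a misspelled value before any work is sent to the database.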

class nzpyida.analytics.predictive.bayesian_networks.TreeBayesNetwork1G(idadb: IdaDataBase, model_name: str)[source]

Bases: TreeBayesNetworkBase

Methods

`describe`()	Returns model description.
`grow`(in_df[, in_columns, base_index, ...])	Parameters:

grow(in_df: IdaDataFrame, in_columns: List[str] = None, base_index: int = 777, sample_size: int = 330000, talk: str = None, no_check: str = None, edge_lab_sort: str = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None) → IdaDataFrame[source]

Parameters:

in_dfIdaDataFrame

the input data frame

in_columnsList[str]

List of the input dataframe columns with special properties. Each column is followed by one or several of the following properties:

- type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
- role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’).

If the parameter is undefined, all columns of the input table have default properties. Note that this procedure only accepts continuous columns with role ‘input’. Additionally, each column is followed by a colon (:) and either X or Y to distinguish the two sets of variables.

base_indexint, optional

the numeric id to be assigned to the first variable

sample_sizeint, optional

the sample size to take if the number of records is too large

talkstr, optional

if talk=yes then additional information on progress will be displayed

no_checkstr, optional

if no_check=yes then no exception is thrown when a column in <in_columns> does not exist

edge_lab_sortstr, optional

if edge_lab_sort=yes then the left end of the edge will have a name lower in alphabetic order than the right one

col_def_typestr, optional

default type of the input dataframe columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input dataframe columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input dataframe columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input dataframe column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

Returns:

IdaDataFrame: the data frame containing statistics

class nzpyida.analytics.predictive.bayesian_networks.TreeBayesNetwork1G2P(idadb, model_name)[source]

Bases: TreeBayesNetworkBase

Methods

`describe`()	Returns model description.
`grow`(in_df[, in_columns, base_index, talk, ...])	This stored procedure builds a tree-like Bayesian Network for continuous variables.

grow(in_df: IdaDataFrame, in_columns: List[str] = None, base_index: int = 777, talk: str = None, no_check: str = None, edge_lab_sort: str = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None) → IdaDataFrame[source]

This stored procedure builds a tree-like Bayesian Network for continuous variables. A spanning tree is constructed that joins all the variables on the basis of the strongest correlations. This gives the user an overview of the most significant interrelations governing the whole set of variables.

The stored procedure constructs the tree in an incremental manner. It calculates correlations on one set of variables, then on the other set of variables, then between variables of the two sets. The final model is obtained by joining the three sub-models.

Parameters:

in_dfIdaDataFrame

the input data frame

in_columnsList[str]

List of the input dataframe columns with special properties. Each column is followed by one or several of the following properties:

- type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
- role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’).

If the parameter is undefined, all columns of the input table have default properties. Note that this procedure only accepts continuous columns with role ‘input’. Additionally, each column is followed by a colon (:) and either X or Y to distinguish the two sets of variables.

base_indexint, optional

the numeric id to be assigned to the first variable

talkstr, optional

if talk=yes then additional information on progress will be displayed

no_checkstr, optional

if no_check=yes then no exception is thrown when a column in <in_columns> does not exist

edge_lab_sortstr, optional

if edge_lab_sort=yes then the left end of the edge will have a name lower in alphabetic order than the right one

col_def_typestr, optional

default type of the input dataframe columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input dataframe columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input dataframe columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input dataframe column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

Returns:

IdaDataFrame: the data frame containing statistics

class nzpyida.analytics.predictive.bayesian_networks.TreeBayesNetwork2G(idadb, model_name)[source]

Bases: TreeBayesNetworkBase

Methods

`describe`()	Returns model description.
`grow`(in_df[, in_columns, base_index, talk, ...])	Builds a tree-like Bayesian Network for continuous variables.

grow(in_df: IdaDataFrame, in_columns: List[str] = None, base_index: int = 777, talk: str = None, no_check: str = None, edge_lab_sort: str = None, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None) → IdaDataFrame[source]

Builds a tree-like Bayesian Network for continuous variables. A spanning tree is constructed that joins all the variables on the basis of the strongest correlations. This gives the user an overview of the most significant interrelations governing the whole set of variables.

The stored procedure operates with two sets of variables and the resulting tree will be bipartite. The correlations between variables within each set will not be calculated. This feature is useful when the two sets characterize distinct objects and only links between the objects are of interest.

Parameters:

in_dfIdaDataFrame

the input data frame

in_columnsList[str]

List of the input dataframe columns with special properties. Each column is followed by one or several of the following properties:

- type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
- role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’.

(Remark: ‘:objweight’ is unsupported, i.e. ‘:objweight’ same as ‘:ignore’). (Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’).

If the parameter is undefined, all columns of the input table have default properties. Note that this procedure only accepts continuous columns with role ‘input’. Additionally, each column is followed by a colon (:) and either X or Y to distinguish the two sets of variables.

base_indexint, optional

the numeric id to be assigned to the first variable

talkstr, optional

if talk=yes then additional information on progress will be displayed

no_checkstr, optional

if no_check=yes then no exception is thrown when a column in <in_columns> does not exist

edge_lab_sortstr, optional

if edge_lab_sort=yes then the left end of the edge will have a name lower in alphabetic order than the right one

col_def_typestr, optional

default type of the input dataframe columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal.

col_def_rolestr, optional

default role of the input dataframe columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns.

col_properties_tablestr, optional

the input table where column properties for the input dataframe columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input dataframe column properties will be detected automatically. (Remark: colPropertiesTable with “COLROLE” column with value ‘objweight’ is unsupported, i.e. same as ‘ignore’) (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)

Returns:

IdaDataFrame: the data frame containing statistics
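
The two-set variants expect every in_columns entry to carry an X or Y marker. The sketch below shows one way to produce such a list. tag_two_sets is a hypothetical helper, and it assumes the marker is appended directly to the column name (for example HEIGHT:X). How the marker combines with type and role suffixes is not specified above, so that case is left out.

```python
from typing import Iterable, List


def tag_two_sets(x_columns: Iterable[str], y_columns: Iterable[str]) -> List[str]:
    """Hypothetical helper: mark each column as belonging to set X or set Y."""
    x_columns, y_columns = list(x_columns), list(y_columns)
    if not x_columns or not y_columns:
        raise ValueError("both sets must contain at least one column")
    overlap = set(x_columns) & set(y_columns)
    if overlap:
        raise ValueError(f"columns in both sets: {sorted(overlap)}")
    return [f"{c}:X" for c in x_columns] + [f"{c}:Y" for c in y_columns]


# Illustrative column names only.
in_columns = tag_two_sets(["HEIGHT", "WEIGHT"], ["INCOME"])
```

Rejecting overlapping sets up front keeps the grouping unambiguous, since a bipartite tree only links a variable from one set to a variable from the other.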

class nzpyida.analytics.predictive.bayesian_networks.TreeBayesNetworkBase(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

Methods

`describe`()	Returns model description.

Generalized Linear Models

class nzpyida.analytics.predictive.glm.BernoulliRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.glm.BinomialRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.glm.GLM(idadb: IdaDataBase, model_name: str)[source]

Bases: Regression

Generalized Linear Model (GLM)

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

fit(in_df: IdaDataFrame, target_column: str, id_column: str = None, in_columns: List[str] = None, intercept: bool = True, interaction: str = '', family_param: float = -1, link: str = 'logit', link_param: float = 1, max_iter: int = 20, epsilon: float = 0.001, tolerance: float = 1e-07, method: str = 'irls', trials: str = '', debug: bool = False, col_def_type: str = None, col_def_role: str = None, col_properties_table: str = None)[source]

in_dfIdaDataFrame

the input data frame

target_columnstr

the input dataframe column to predict a value for. Only numeric type of target column is accepted

id_columnstr, optional

the input dataframe column identifying a unique instance id

in_columnsList[str], optional

the list of input dataframe columns with special properties. Each column is followed by one or several of the following properties:

- its type: ‘:nom’ (for nominal), ‘:cont’ (for continuous). By default, all numerical types are continuous, other types are nominal.
- its role: ‘:id’, ‘:target’, ‘:input’, ‘:ignore’, ‘:objweight’.

(Remark: ‘:colweight(<wgt>)’ is unsupported, i.e. ‘:colweight(<wgt>)’ same as ‘:colweight(1)’ same as ‘:input’). If the parameter is undefined, all columns of the input table have default properties

intercept: bool, optional

flag indicating whether the model is built with or without an intercept value

interaction: str, optional

the definition of the allowed interactions between input columns. The interaction is a list of factors separated by a semicolon (;). A factor is a list of variables separated by a star (*). A variable is a column name of the input table. A continuous variable can be followed by a caret (^) and a numeric value, in which case the values of this column are raised to the given power. A nominal variable can be followed by an equals sign (=) and a value, so that only the given value of the variable is allowed to interact with the other variables of this factor. If no value is indicated after a nominal variable, all distinct values interact independently with the other variables of the factor. By default, all input columns are considered independent and do not interact with each other

family_param: float, optional

additional parameter used for some distributions. If family_param=’quasi’, the quasi-likelihood is optimized in the case of Poisson and Binomial distributions. If family_param=-1 (or is omitted), the distribution parameter is estimated from the data. If family_param is given explicitly, it should be > 0

link: str, optional

the type of the link function. Allowed values are: canbinom, cangeom, cannegbinom, cauchit, clog, cloglog, gaussit, identity, inverse, invnegative, invsquare, log, logit, loglog, oddspower, power, probit, sqrt

link_param: float, optional

an additional parameter used for some links like: cannegbinom, oddspower, power. The range of value depends on the used link function

max_iter: int, optional

the maximum number of iterations

epsilon: float, optional

the maximum (relative) error used as stopping criteria

tolerance: float, optional

the tolerance for the linear equation solver when to consider a value to be equal to zero

method: str, optional

the method used to calculate a GLM model. Allowed values are: irls, psgd

trials: str, optional

the input table column containing the number of trials for the binomial distribution. This parameter must be specified when family=binomial. This parameter is ignored for other distributions

debug: bool, optional

flag indicating to display debug information

col_def_type: str, optional

default type of the input dataframe columns. Allowed values are ‘nom’ and ‘cont’. If the parameter is undefined, all numeric columns are considered continuous, other columns nominal

col_def_role: str, optional

default role of the input dataframe columns. Allowed values are ‘input’ and ‘ignore’. If the parameter is undefined, all columns are considered ‘input’ columns

col_properties_table: str, optional

the input table where column properties for the input dataframe columns are stored. The format of this table is the output format of stored procedure nza..COLUMN_PROPERTIES(). If the parameter is undefined, the input table column properties will be detected automatically. (Remark: colPropertiesTable with “COLWEIGHT” column with value ‘<wgt>’ is unsupported, i.e. same as ‘1’)
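
The interaction syntax described for the interaction parameter above (factors separated by semicolons, variables within a factor separated by stars, an optional ^power for continuous variables and an optional =value for nominal variables) can be assembled programmatically. The helpers below are hypothetical and only build the string. The column names are made up.

```python
from typing import Optional, Union

Number = Union[int, float]


def variable(name: str, power: Optional[Number] = None,
             value: Optional[str] = None) -> str:
    """One variable, optionally with ^power (continuous) or =value (nominal)."""
    if power is not None and value is not None:
        raise ValueError("use power for continuous or value for nominal, not both")
    if power is not None:
        return f"{name}^{power}"
    if value is not None:
        return f"{name}={value}"
    return name


def factor(*variables: str) -> str:
    """Variables of one factor are separated by a star."""
    return "*".join(variables)


def interaction(*factors: str) -> str:
    """Factors are separated by a semicolon."""
    return ";".join(factors)


spec = interaction(
    factor(variable("AGE")),
    factor(variable("AGE", power=2)),
    factor(variable("INCOME"), variable("REGION", value="EU")),
)
# spec == "AGE;AGE^2;INCOME*REGION=EU"
```

The resulting string has the form that the interaction argument of fit expects.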

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None, debug: bool = False)[source]

in_dfIdaDataFrame: the input data frame
out_tablestr, optional: the output table where the predictions will be stored
id_columnstr, optional: the input data frame column identifying a unique instance
debugbool, optional: flag indicating to display debug information

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values
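
A typical GLM workflow calls fit and then predict. The sketch below wraps both calls in a hypothetical helper, fit_then_predict, that first checks the link and method values against the lists documented above. The model argument is assumed to be a GLM (or subclass) instance created for an open database connection. The helper itself is not part of nzpyida.

```python
# Allowed values copied from the link and method descriptions above.
LINKS = {
    "canbinom", "cangeom", "cannegbinom", "cauchit", "clog", "cloglog",
    "gaussit", "identity", "inverse", "invnegative", "invsquare", "log",
    "logit", "loglog", "oddspower", "power", "probit", "sqrt",
}
METHODS = {"irls", "psgd"}


def fit_then_predict(model, train_df, new_df, target_column, id_column,
                     link="logit", method="irls", out_table=None):
    """Hypothetical helper: validate options, train the model, then predict."""
    if link not in LINKS:
        raise ValueError(f"unknown link function: {link!r}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    model.fit(
        train_df,
        target_column=target_column,
        id_column=id_column,
        link=link,
        method=method,
    )
    # predict() returns a data frame with row identifiers and predicted values.
    return model.predict(new_df, out_table=out_table, id_column=id_column)
```

Checking the option values locally gives an immediate, readable error for a misspelled link name.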

class nzpyida.analytics.predictive.glm.GammaRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.glm.GaussianRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.glm.NegativeBinomialRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.glm.PoissonRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

class nzpyida.analytics.predictive.glm.WaldRegressor(idadb: IdaDataBase, model_name: str)[source]

Bases: GLM

Methods

`describe`()	Returns model description.
`fit`(in_df, target_column[, id_column, ...])	in_df : IdaDataFrame
`predict`(in_df[, out_table, id_column, debug])	in_df : IdaDataFrame
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

Classification base module

This module contains a class that is the base for all classification algorithms.

class nzpyida.analytics.predictive.classification.Classification(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

Base class for classification algorithms.

Methods

`conf_matrix`(in_df, target_column[, ...])	Makes a prediction for a test data set given by the user and returns a confusion matrix, together with other stats (ACC and WACC).
`cross_validation`(in_df, target_column[, ...])	Performs a cross validation on <in_df> data for given model.
`describe`()	Returns model description.
`predict`(in_df[, out_table, id_column])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.

conf_matrix(in_df: IdaDataFrame, target_column: str, id_column: str = None, out_matrix_table: str = None) → Tuple[IdaDataFrame, float, float][source]

Makes a prediction for a test data set given by the user and returns a confusion matrix, together with other stats (ACC and WACC).

Parameters:

in_df (IdaDataFrame): the input data frame for scoring
target_column (str): the input table column representing the class
id_column (str, optional): the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id
out_matrix_table (str, optional): the output table where the confusion matrix will be stored

Returns:

IdaDataFrame: the confusion matrix data frame
float: classification accuracy (ACC)
float: weighted classification accuracy (WACC)
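
To make the two returned statistics concrete, the sketch below computes them from plain Python lists of labels. It does not use nzpyida. It assumes that ACC is the fraction of correctly classified instances and that WACC is the mean of the per-class recall values, so every class counts equally regardless of its size. The library's exact WACC definition is not stated here and may differ.

```python
from collections import Counter


def acc_and_wacc(actual, predicted):
    """Return (ACC, WACC) for two equally long sequences of class labels.

    WACC is computed here as the mean per-class recall, which is an
    assumption about the intended meaning of "weighted accuracy".
    """
    # The confusion matrix as a mapping (actual, predicted) -> count.
    matrix = Counter(zip(actual, predicted))
    class_sizes = Counter(actual)

    correct = sum(matrix[(label, label)] for label in class_sizes)
    acc = correct / len(actual)

    # Recall of a class: correctly predicted instances / instances of the class.
    recalls = [matrix[(label, label)] / size for label, size in class_sizes.items()]
    wacc = sum(recalls) / len(recalls)
    return acc, wacc


actual = ["a", "a", "a", "a", "b", "b"]
predicted = ["a", "a", "a", "b", "b", "a"]
acc, wacc = acc_and_wacc(actual, predicted)
print(acc, wacc)
```

In this example four of six instances are classified correctly, while class "a" has recall 3/4 and class "b" has recall 1/2, so under these assumptions WACC (0.625) is lower than ACC because the smaller class is predicted less reliably.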

cross_validation(in_df: IdaDataFrame, target_column: str, id_column: str = None, out_table: str = None, folds: int = 10, rand_seed: float = None) → Tuple[IdaDataFrame, float][source]

Performs a cross validation on <in_df> data for the given model. The number of batches and the size of the train/test split are determined by the <folds> parameter.

Parameters:

in_df (IdaDataFrame): the input data frame for scoring
target_column (str): the input table column representing the class
id_column (str, optional): the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id
out_table (str, optional): the output table where the predicted values will be stored
folds (int, optional): the number of batches the data is split into. Default: 10
rand_seed (float, optional): the seed of the random number generator

Returns:

IdaDataFrame: the data frame with predicted values for all rows of <in_df>
float: classification accuracy (ACC) for all batches
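
The relation between the number of folds and the train/test split sizes can be illustrated without a database. The sketch below is a generic k-fold partition, not the library's implementation: the function name k_fold_indices is hypothetical, and rows are assigned to folds in order, whereas the library presumably assigns them randomly.

```python
def k_fold_indices(n_rows, folds=10):
    """Yield (train_indices, test_indices) for each of the folds.

    Every row index appears in exactly one test set, so predictions
    collected over all folds cover the whole input.
    """
    if not 2 <= folds <= n_rows:
        raise ValueError("folds must be between 2 and the number of rows")

    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_rows, folds)
    start = 0
    for fold in range(folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


splits = list(k_fold_indices(25, folds=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With 25 rows and 10 folds, each batch tests 2 or 3 rows and trains on the remaining 22 or 23, so each model sees roughly 90% of the data.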

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_df (IdaDataFrame): the input data frame for predictions
out_table (str, optional): the output table where the predictions will be stored
id_column (str, optional): the input table column identifying a unique instance id. Default: the id column used to build the model

score(in_df: IdaDataFrame, target_column: str, id_column: str = None) → float[source]

Scores the model. The model must exist.

Parameters:

in_df (IdaDataFrame): the input data frame for scoring
target_column (str): the input table column representing the class
id_column (str, optional): the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

Returns:

float: the model score
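
The evaluation methods of Classification can be combined in one place. The helper below is a sketch under stated assumptions, not part of nzpyida: evaluate_classifier is a hypothetical name, and model is assumed to be an already fitted instance of a Classification subclass. The calls use only the signatures documented above.

```python
def evaluate_classifier(model, test_df, target_column, id_column=None):
    """Run predict, score and conf_matrix on test_df and collect the results.

    The model must already exist in the database (that is, it has been
    fitted), because all three methods require an existing model.
    """
    predictions = model.predict(test_df, id_column=id_column)
    score = model.score(test_df, target_column=target_column,
                        id_column=id_column)

    # conf_matrix returns a tuple: (confusion matrix, ACC, WACC).
    matrix, acc, wacc = model.conf_matrix(test_df, target_column=target_column,
                                          id_column=id_column)
    return {
        "predictions": predictions,
        "score": score,
        "conf_matrix": matrix,
        "acc": acc,
        "wacc": wacc,
    }
```

A call might look like evaluate_classifier(model, test_df, target_column="CLASS", id_column="ID"), where the column names are placeholders for the columns of your own table.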

Regression base module

This module contains a class that is the base for all regression algorithms.

class nzpyida.analytics.predictive.regression.Regression(idadb: IdaDataBase, model_name: str)[source]

Bases: PredictiveModeling

Base class for regression algorithms.

Methods

`describe`()	Returns model description.
`predict`(in_df[, out_table, id_column])	Makes predictions based on this model.
`score`(in_df, target_column[, id_column])	Scores the model.
`score_all`(in_df, target_column[, id_column])	Scores the model using MSE, MAE, RSE and RAE.

predict(in_df: IdaDataFrame, out_table: str = None, id_column: str = None) → IdaDataFrame[source]

Makes predictions based on this model. The model must exist.

Parameters:

in_df (IdaDataFrame): the input data frame to predict
out_table (str, optional): the output table where the predictions will be stored
id_column (str, optional): the input table column identifying a unique instance id. Default: the id column used to build the model

Returns:

IdaDataFrame: the data frame containing row identifiers and predicted target values

score(in_df: IdaDataFrame, target_column: str, id_column: str = None) → float[source]

Scores the model. The model must exist.

Parameters:

in_df (IdaDataFrame): the input data frame for scoring
target_column (str): the input table column representing the target value
id_column (str, optional): the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

Returns:

float: the model score

score_all(in_df: IdaDataFrame, target_column: str, id_column: str = None) → Dict[str, float][source]

Scores the model using MSE, MAE, RSE and RAE. The model must exist.

Parameters:

in_df (IdaDataFrame): the input data frame for scoring
target_column (str): the input table column representing the target value
id_column (str, optional): the input table column identifying a unique instance id - if skipped, the input data frame indexer must be set and will be used as an instance id

Returns:

dict: the model scores in a dictionary with MSE, MAE, RSE and RAE as keys
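
For orientation, the four measures can be reproduced on plain Python lists. The sketch below does not use nzpyida and assumes the common textbook definitions: MSE and MAE are the mean squared and mean absolute errors, while RSE and RAE divide the total squared or absolute error by the error of a baseline that always predicts the mean of the actual values. The library may compute them differently in detail.

```python
def regression_scores(actual, predicted):
    """Return a dict with MSE, MAE, RSE and RAE for two numeric sequences."""
    n = len(actual)
    mean_actual = sum(actual) / n

    # Total errors of the model.
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))

    # Total errors of the baseline that always predicts the mean.
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    abs_dev = sum(abs(a - mean_actual) for a in actual)

    return {
        "MSE": sq_err / n,
        "MAE": abs_err / n,
        "RSE": sq_err / sq_dev,
        "RAE": abs_err / abs_dev,
    }


scores = regression_scores([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(scores)
```

Under these definitions, RSE and RAE values below 1 mean the model does better than the mean baseline. In the example RSE is 0.1 and RAE is 0.25.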