Source code for nzpyida.analytics.predictive.association_rules

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#-----------------------------------------------------------------------------
# Copyright (c) 2023, IBM Corp.
# All rights reserved.
#
# Distributed under the terms of the BSD Simplified License.
#
# The full license is in the LICENSE file, distributed with this software.
#-----------------------------------------------------------------------------
"""
Association rules mining is a popular method for discovering interesting
and useful patterns in a large-scale transaction database. The database
contains transactions which consist of a set of items and a transaction 
identifier (e.g., a market basket). Association rules are implications 
of the form X -> Y where X and Y are two disjoint subsets of all available 
items. X is called the antecedent or LHS (left hand side) and Y is called 
the consequent or RHS (right hand side). Discovered association rules have 
to satisfy user-defined constraints on measures of significance and interest.

The Apriori algorithm organizes the search for frequent itemsets by systematically 
considering itemsets of increasing size in consecutive iterations. Due to its method 
of calculation, the number of candidates identified by the Apriori algorithm may 
be overwhelming for extremely large data sets or a low support threshold. Because 
of this limitation, the FP-growth algorithm is provided in the IBM Netezza In-Database 
Analytics package instead.

The FP-growth algorithm avoids candidate generation as well as multiple passes through 
the data by creating a data structure called a frequent pattern tree, or FP-tree. 
This tree is a compact representation of the data set contents sufficient for finding 
frequent itemsets. Nodes of the tree represent single items and store their occurrence 
counts. Only items with sufficiently high support (frequent itemsets of size 1) are
represented. Branches, called node-links, in the FP-tree connect nodes that represent 
items co-occurring for some instances in the data set. There is also a frequent item 
header table that points to nodes corresponding to particular items.

The tree is built by identifying all frequent items and their counts, then consecutively 
“inserting” each transaction into the tree. This requires exactly two scans of the
data set, regardless of its size or support threshold level. The FP-tree is used to
identify frequent itemsets using a frequent pattern growth process, which traverses
the tree by following node-links in an appropriate way.

By avoiding explicit candidate generation, the FP-growth algorithm reduces the number 
of data set scans. It also performs efficiently regardless of the support threshold.
"""
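
The two-scan FP-tree construction described in the module docstring can be sketched in plain Python. This is a minimal illustration only, not the in-database implementation: the names `FPNode`, `build_fp_tree` and `min_count`, and the sample transactions, are hypothetical and appear nowhere in nzpyida.

```python
from collections import Counter, defaultdict


class FPNode:
    """One FP-tree node: a single item plus its occurrence count (hypothetical helper)."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child FPNode


def build_fp_tree(transactions, min_count):
    """Build an FP-tree in two scans and return (root, header_table)."""
    # Scan 1: count in how many transactions each item occurs, keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    header = defaultdict(list)  # frequent item header table: item -> its nodes

    # Scan 2: insert each transaction, most frequent items first, so that
    # transactions sharing a prefix of items also share a path in the tree.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, dict(header)


# Made-up sample data for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "jam"],
]
root, header = build_fp_tree(transactions, min_count=2)

# Summing the counts of all nodes of an item recovers its support count.
item_totals = {item: sum(n.count for n in nodes) for item, nodes in header.items()}
```

With `min_count=2`, "jam" occurs in only one transaction and is therefore dropped during the first scan, so it never enters the tree or the header table.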

from typing import List
from nzpyida.frame import IdaDataFrame
from nzpyida.base import IdaDataBase
from nzpyida.analytics.predictive.predictive_modeling import PredictiveModeling
from nzpyida.analytics.utils import q

class ARule(PredictiveModeling):

    def __init__(self, idadb: IdaDataBase, model_name: str):
        """
        Creates an instance of the Association Rules class.

        Parameters
        ----------
        idadb : IdaDataBase
            database connector

        model_name : str
            model name - the name of the Association Rules model to build
        """
        super().__init__(idadb, model_name)
        self.has_print_proc = True
        # Names of the procedures used by fit() and predict()
        self.fit_proc = 'ARULE'
        self.predict_proc = 'PREDICT_ARULE'
        self.model_name = model_name

    def fit(self, in_df: IdaDataFrame, transaction_id_column: str='tid', item_column: str='item',
        by_column: str=None, level: int=1, max_set_size: int=6, support: float=None,
        support_type: str='percent', confidence: float=0.5):
        """
        Builds an Association Rules model. The model is saved to the database in a set
        of tables and registered in the database model metadata. Use the function
        'describe' to display the Association Rules of the model, or the Model Management
        functions to further manipulate the model.

        Parameters
        ----------
        in_df : IdaDataFrame
            the input data frame

        transaction_id_column : str, optional
            the input table column identifying transactions

        item_column : str, optional
            the input table column identifying items in transactions

        by_column : str, optional
            the input table column identifying groups of transactions, if any.
            Association Rules mining is done separately on each of these groups.
            Leave the parameter undefined if no groups are to be considered.

        level : int, optional
            ARULE first temporarily redistributes the data into overlapping parts in such
            a way that each part can be processed in parallel without communication
            between the SPUs. For this to work, there can be redundancy between the parts,
            so the cumulative size of the temporary parts can be much higher than the size
            of the original data set. This parameter (passed to the stored procedure as
            lvl) controls how many parts are created. The higher the level:

            - the more computation and temporary database space is required for the
              splitting
            - the smaller the amount of main memory that is required for each data slice

            Note: To fully use the benefits of parallel computing, do not set the level
            too low. The lower the level, the higher the memory consumption for each part,
            which might cause an out-of-memory error on the SPUs. If an out-of-memory
            error on the SPUs occurs, increase the level.

            If you specify the value 0, the algorithm is executed serially for each data
            set group. This might be faster only if the data set fits in one node and
            the splitting would increase the total number of rows dramatically.

            Default: 1. Minimum: 0.

        max_set_size : int, optional
            the maximum size of the itemsets to consider

        support : float, optional
            minimum support value satisfied by all association rules. Depending on
            support_type, it is either the absolute number of supporting transactions
            or the percentage of transactions
            (#supporting transactions / #total transactions * 100). Too low a minimum
            support increases the number of generated rules and the computational expense.
            If no value is given and support_type is 'percent', 5.0 is used.

        support_type : str, optional
            how the minimum support should be interpreted. The following values are
            allowed: absolute, percent.

            Note that the support and support_type values are common to all groups in the
            data set. For example, if 3 is the absolute minimum support, then an itemset
            is considered frequent if at least 3 transactions contain its items, no matter
            how many transactions the group contains. Use support_type='percent' to
            specify a minimum support that depends on the size of the groups.
            Specifying support_type='absolute' takes effect only if a support is
            explicitly supplied.

        confidence : float, optional
            the minimum confidence required of an association rule.
            Default: 0.5. Minimum: 0. Maximum: 1.
        """
        # Fall back to a 5 percent minimum support when none is given
        if support_type == 'percent' and not support:
            support = 5.0

        params = {
            'tid': q(transaction_id_column),
            'item': q(item_column),
            'by': q(by_column),
            'lvl': level,
            'maxsetsize': max_set_size,
            'support': support,
            'supporttype': support_type,
            'confidence': confidence
        }

        self._fit(in_df=in_df, params=params, needs_id=False)

    def predict(self, in_df: IdaDataFrame, out_table: str=None, transaction_id_column: str='tid',
        item_column: str='item', by_column: str=None, scoring_type: str='exclusiveRecommend',
        name_map_column: str=None, item_name_column: str='item',
        item_name_mapped_column: str='item_name', min_size: int=1, max_size: int=64,
        min_support: float=0.0, max_support: float=1.0, min_confidence: float=0.0,
        max_confidence: float=1.0, min_lift: float=None, max_lift: float=None,
        min_conviction: float=0.0, max_conviction: float=None, min_affinity: float=0.0,
        max_affinity: float=1.0, min_leverage: float=-0.25,
        max_leverage: float=1.0) -> IdaDataFrame:
        """
        Makes predictions based on this model. The model must exist.

        Parameters
        ----------
        in_df : IdaDataFrame
            the input data frame

        out_table : str, optional
            the output table where the predictions will be stored

        transaction_id_column : str, optional
            the input table column identifying transactions

        item_column : str, optional
            the input table column identifying items in transactions

        by_column : str, optional
            the input table column identifying groups of transactions, if any.
            Association Rules mining is done separately on each of these groups.
            Leave the parameter undefined if no groups are to be considered.

        scoring_type : str, optional
            how the scoring algorithm is applied to the input data.
            The following values are allowed: recommend, exclusiveRecommend.

            - recommend - a rule is returned if its left hand side itemset is a subset
              of the transaction.
            - exclusiveRecommend - a rule is returned if its left hand side itemset is
              a subset of the transaction, and its right hand side itemset is not
              a subset of the transaction.

        name_map_column : str, optional
            the table that provides names of items and their associated mapped values
            in the LHS_ITEMS and RHS_ITEMS columns of the output table

        item_name_column : str, optional
            the column of the name map table where the item identifiers are stored

        item_name_mapped_column : str, optional
            the column of the name map table where the item names are stored that
            should be used instead of the item identifiers

        min_size : int, optional
            the minimum number of items per association rule to be applied

        max_size : int, optional
            the maximum number of items per association rule to be applied

        min_support : float, optional
            the minimum support of an association rule to be applied

        max_support : float, optional
            the maximum support of an association rule to be applied

        min_confidence : float, optional
            the minimum confidence of an association rule to be applied

        max_confidence : float, optional
            the maximum confidence of an association rule to be applied

        min_lift : float, optional
            the minimum lift of an association rule to be applied

        max_lift : float, optional
            the maximum lift of an association rule to be applied

        min_conviction : float, optional
            the minimum conviction of an association rule to be applied

        max_conviction : float, optional
            the maximum conviction of an association rule to be applied

        min_affinity : float, optional
            the minimum affinity of an association rule to be applied

        max_affinity : float, optional
            the maximum affinity of an association rule to be applied

        min_leverage : float, optional
            the minimum leverage of an association rule to be applied

        max_leverage : float, optional
            the maximum leverage of an association rule to be applied

        Returns
        -------
        IdaDataFrame
            the data frame containing the output of an Association Rules model prediction
        """
        # Map the Python arguments to the parameter names of the stored procedure
        params = {
            'tid': q(transaction_id_column),
            'item': q(item_column),
            'by': q(by_column),
            'type': scoring_type,
            'namemap': name_map_column,
            'itemname': item_name_column,
            'itemnamemapped': item_name_mapped_column,
            'minsize': min_size,
            'maxsize': max_size,
            'minsupp': min_support,
            'maxsupp': max_support,
            'minconf': min_confidence,
            'maxconf': max_confidence,
            'minlift': min_lift,
            'maxlift': max_lift,
            'minconv': min_conviction,
            'maxconv': max_conviction,
            'minaffi': min_affinity,
            'maxaffi': max_affinity,
            'minleve': min_leverage,
            'maxleve': max_leverage
        }

        return self._predict(in_df=in_df, params=params, out_table=out_table)
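
The difference between the two scoring types documented for `predict` can be illustrated with a small pure-Python sketch. It only mirrors the subset conditions stated in the docstring. It is not how PREDICT_ARULE is implemented, and the function `matching_rules`, the rule list and the baskets are hypothetical.

```python
def matching_rules(transaction, rules, scoring_type="exclusiveRecommend"):
    """Return the (lhs, rhs) rules that apply to one transaction (hypothetical helper)."""
    if scoring_type not in ("recommend", "exclusiveRecommend"):
        raise ValueError(f"unknown scoring type: {scoring_type}")

    items = set(transaction)
    matched = []
    for lhs, rhs in rules:
        lhs, rhs = set(lhs), set(rhs)
        # Both types require the left hand side to be contained in the transaction.
        if not lhs <= items:
            continue
        # exclusiveRecommend additionally drops rules whose right hand side
        # is already completely contained in the transaction.
        if scoring_type == "exclusiveRecommend" and rhs <= items:
            continue
        matched.append((sorted(lhs), sorted(rhs)))
    return matched


# Made-up rules of the form (LHS items, RHS items).
rules = [
    (["bread"], ["butter"]),
    (["bread", "milk"], ["jam"]),
    (["jam"], ["milk"]),
]

basket = ["bread", "butter"]
recommended = matching_rules(basket, rules, "recommend")
exclusive = matching_rules(basket, rules, "exclusiveRecommend")
```

For the basket above, `recommend` returns the rule bread -> butter, while `exclusiveRecommend` returns nothing, because butter is already part of the transaction.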