IdaDataFrame

Open an IdaDataFrame Object.

class nzpyida.frame.IdaDataFrame(idadb, tablename, indexer=None)[source]

An IdaDataFrame object is a reference to a table in a remote Db2 Warehouse database. IDA stands for In-DataBase Analytics. IdaDataFrame copies the Pandas interface for DataFrame objects to ensure intuitive interaction for end-users.

Examples

>>> idadb = IdaDataBase('BLUDB') # See documentation for IdaDataBase
>>> ida_iris = IdaDataFrame(idadb, 'IRIS')
>>> ida_iris.cov()
                  sepal_length  sepal_width  petal_length  petal_width
sepal_length      0.685694    -0.042434      1.274315     0.516271
sepal_width      -0.042434     0.189979     -0.329656    -0.121639
petal_length      1.274315    -0.329656      3.116278     1.295609
petal_width       0.516271    -0.121639      1.295609     0.581006

Attributes:

columns: Index containing the column names in self.
empty: Boolean that is True if the table is empty (no rows).
indexer: The indexer attribute refers to the name of a column that should be used to index the table.
name

Methods

`as_dataframe`()	Download and return an in-memory representation of the dataset as a Pandas DataFrame.
`commit`()	Commit operations in the database.
`corr`([method, features, ignore_indexer])	Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self.
`corrwith`(other)	Compute the correlation matrix, composed of correlation coefficients between the columns of self and the columns of another IdaDataFrame.
`count`()	Compute the count of non-missing values for all columns of self.
`count_distinct`()	Compute the count of distinct values for all numeric columns of self.
`count_groupby`([columns, count_only, having])	Count the occurence of the values of a column or group of columns
`cov`()	Compute the covariance matrix, composed of covariance coefficients between all pairs of columns in self.
`delete_na`(columns[, logic, inplace])	Filter rows containing NULL values.
`describe`([percentiles])	A basic statistical summary about current IdaDataFrame.
`exists`()	Convenience function delegated from IdaDataBase.
`get_primary_key`()	Get the name of the primary key of self, if there is one.
`groupby`(by)	Creates the groupby object
`head`([nrow, sort])	Print the n first rows of the instance, n is set to 5 by default.
`ida_query`(query[, silent, first_row_only, ...])	Convenience function delegated from IdaDataBase.
`ida_scalar_query`(query[, silent, autocommit])	Convenience function delegated from IdaDataBase.
`info`([buf])	Some information about current IdaDataFrame.
`is_table`()	Convenience function delegated from IdaDataBase.
`is_view`()	Convenience function delegated from IdaDataBase.
`join`(other[, on, how, lsuffix, rsuffix])	Implement pandas-like interface to join tables
`levels`([columns])	Return the numbers of distinct values
`mad`()	Compute the mean absolute distance for all numeric columns of self.
`max`()	Compute the maximum value over for all numeric columns of self.
`mean`()	Compute the mean for each numeric columns of self.
`mean_freq_of_instance`([columns])	Return the average occurence of the values of a column or group of columns
`median`()	Compute the median for all numeric columns of self.
`merge`(right[, how, on, left_on, right_on, ...])	Implement pandas-like interface to merge IdaDataFrames
`min`()	Compute the minimum value for all numerics column of self.
`mode`()	Compute the most common value for each non numeric column self.
`pivot_table`([values, columns, max_entries, ...])	Compute an aggregation function over all rows of each column that is specified as a value on the dataset.
`quantile`([q])	Compute row wise quantiles for each numeric column.
`rank`()	Compute the rank over all entries for all columns of self.
`rollback`()	Rollback operations in the database.
`save_as`(tablename[, clear_existing])	Save self as a table name in the remote database with the name tablename.
`sort`([columns, axis, ascending, inplace])	Sort the IdaDataFrame row wise or column wise.
`std`()	Compute the standard deviation for all numeric columns of self.
`sum`()	Compute the sum of values for all numeric columns of self.
`summary`()	A basic statistical summary about current IdaDataFrame.
`tail`([nrow, sort])	Print the n last rows of the instance, n is set to 5 by default.
`train_test_split`(train_table, test_table, ...)	Split the table into train and test sets
`unique`(column)	Return the unique values of a column
`var`()	Compute the variance for all numeric columns of self.

mean_groupby
print
within_class_std
within_class_var

__init__(idadb, tablename, indexer=None)[source]

Constructor for IdaDataFrame objects.

Parameters:

idadbIdaDataBase: IdaDataBase instance which contains the connection to be used.
tablenamestr: Name of the table to be opened in the database. It should contain only alphanumeric characters and underscores. All lower case characters will be converted to upper case characters.
indexerstr, optional: Name of the column that should be used as an index. This is optional. However, if no indexer is given, the order of rows issued by the head and tail functions is not guaranteed. Also, several in-database machine learning algorithms need an indexer as a parameter to be executed.

Notes

Attributes “type”, “dtypes”, “index”, “columns”, “axes”, and “shape” are evaluated in a lazy way to avoid an overhead when creating an IdaDataFrame. Sometimes the index may be too big to be downloaded.

Examples

>>> idadb = IdaDataBase('BLUDB')
>>> ida_iris = IdaDataFrame(idadb, "IRIS")

Attributes:

_idadbIdaDataBase: IdaDataBase object parent of the current instance.
tablenamestr: Name of the table self references.
namestr: Full name of the table self references, including schema.
schemastr: Name of the schema the table belongs to.
indexerstr: Name of the column used as an index. “None” if no indexer.
locstr: Indexer that enables the selection and projection of IdaDataFrame instances. For more information, see the loc class documentation.
internal_stateInternalState: Object used to internally store the state of the IdaDataFrame. It also allows several non-destructive manipulation methods.
typestr: Type of the IdaDataFrame : “Table”, “View”, or “Unknown”.
dtypesDataFrame: Data type in the database for each column.
indexpandas.core.index: Index containing the row names in this table.
columnspandas.core.index: Index containing the columns names in this table.
axeslist: List containing columns and index attributes.
shapeTuple: Number of rows and number of columns.

Methods

DataFrame introspection

internal_state

IdaDataFrame.internal_state = <lazy.lazy.lazy object>[source]

indexer

IdaDataFrame.indexer: The indexer attribute refers to the name of a column that should be used to index the table. This makes sense because Db2 Warehouse is a column-based database, so row IDs do not make sense and are not deterministic. As a consequence, the only way to address a particular row is to refer to it by its index. If no indexer is provided, ibmdbpy still works but a correct row order is not guaranteed as far as the dataset is not sorted. Also, note that the indexer column is not taken into account in data mining algorithms.

type

IdaDataFrame.type = <lazy.lazy.lazy object>[source]

dtypes

IdaDataFrame.dtypes = <lazy.lazy.lazy object>[source]

index

IdaDataFrame.index = <lazy.lazy.lazy object>[source]

columns

IdaDataFrame.columns

Index containing the column names in self.

Returns:

Index

Examples

>>> ida_iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')

axes

IdaDataFrame.axes = <lazy.lazy.lazy object>[source]

shape

IdaDataFrame.shape = <lazy.lazy.lazy object>[source]

empty

IdaDataFrame.empty

Boolean that is True if the table is empty (no rows).

Returns:

Boolean

len

IdaDataFrame.__len__()[source]

Number of records.

Returns:

int

Examples

>>> len(idadf)
150

iter

IdaDataFrame.__iter__()[source]: Iterate over columns.

DataFrame modification

Selection, Projection

IdaDataFrame.__getitem__(item)[source]

Enable label based projection (selection of columns) in IdaDataFrames.

Enable slice based selection of rows in IdaDataFrames.

Enable row filtering.

The syntax is similar to Pandas.

Notes

The row order is not guaranteed if no indexer is given and the dataset is not sorted

Examples

>>> idadf['col1'] # return an IdaSeries

>>> idadf[['col1']] # return an IdaDataFrame with one column

>>> idadf[['col1', 'col2', 'col3']] # return an IdaDataFrame with 3 columns

>>> idadf[0:9] # Select the 10 first rows

>>> idadf[idadf['col1'] = "test"]
# select of rows for which attribute col1 is equal to "test"

Selection and Projection are also possible using the nzpyida.Loc object stored in IdaDataFrame.loc.

class nzpyida.indexing.Loc(idadf)[source]: The Loc class is used to select and project IdaDataFrames. It implements a Pandas-like interface.

Filtering

IdaDataFrame.delete_na(columns, logic='any', inplace=False)[source]

Filter rows containing NULL values. Can be done in a destructive way (rows are deleted in the physical table) or in a non destructive way (a new IdaDataFrame is defined, original table and IdaDataFrame are preserved).

Parameters:

columnslist of strings: list of eligible column names
logicstr, optional: “any” by default. If logic is set to “any” then all rows which contain a NaN value in any (id est at least one) of the cited columns is deleted, this is a union condition; if logic is set to “all”, then only rows containing null values in all the fields are deleted, this is an intersection condition.
inplacebool, optional: False by default. If True, then the underlying table is physically modified, if False, then the original objects remain unaffected, a copy of the IdaDataFrame is made and modified.

Returns:

If inplace is True, the original table is modified and the IdaDataFrame will be modified accordingly. No return.
If inplace is False, a new IdaDataFrame is returned, the original table is not affected. This new IdaDataFrame points to a
new table which has been created according to the user defined criterions on columns.

Examples

>>> idadf.delete_na(["COL_1", "COL_2"], inplace = True)
>>> #each row of the original table is physically deleted if there is a NULL value in one of the listed columns
>>> new_idadf = idadf.delete_na(["COL_1", "COL_2"], logic = 'all')
>>> # a new IdaDataFrame is created, rows will be selected only if all the listed columns have a NULL value.

class nzpyida.filtering.FilterQuery(columns, tablename, method, value)[source]

FilterQueries are used to represent the filtering of an IdaDataFrame in ibmdbpy. The use of comparison operators, such as <, <=, ==, >=, > on an IdaDataFrame instance produces a FilterQuery instance which acts as a container for the where clause of the corresponding SQL request.

Filtering is possible using a Pandas-like syntax. Applying comparison operators to IdaDataFrames produces a FilterQuery instance which contains the string embedding the corresponding where clause in the “wherestr” attribute.

FilterQuery objects also contain a logic that allows them to be combined, thus allowing complex filtering.

You can combine the following operators: |, &, ^ (OR, AND and XOR)

Notes

It is not possible to filter an IdaDataFrame by using an IdaDataFrame that is opened in a different data source in the database. This is due to the fact that, using a Pandas-like syntax, “idadf[‘petal_width’] < 5” will return a Boolean array that is used to subset the original DataFrame. This is a fundamental restriction of ibmdbpy: we cannot afford to compute and download such an array because we cannot assume that the result will fit into user’s memory. Download time can also be a performance issue.

Examples

>>> idadf[['sepal_length', 'petal_width'] < 5]
>>> <ibmdbpy.filtering.FilterQuery at 0xa65ba90>
>>> _.wherestr
'("sepal_length" < 5 AND "petal_width" < 5)'

>>> idadf[idadf[['sepal_length', 'petal_width']] < 5]
<ibmdbpy.frame.IdaDataFrame at 0xa73a860>
>>> _.head() # filtered IdaDataFrame
    sepal_length  sepal_width  petal_length  petal_width     species
0           4.4          2.9           1.4          0.2      setosa
1           4.7          3.2           1.6          0.2      setosa
2           4.9          2.5           4.5          1.7   virginica
3           4.9          2.4           3.3          1.0  versicolor
4           4.6          3.2           1.4          0.2      setosa

>>> idadf[(idadf['sepal_length'] < 5) & (idadf[petal_width'] > 1.5)]
<ibmdbpy.frame.IdaDataFrame at 0xa74b9b0>
>>> _.head()
   sepal_length  sepal_width  petal_length  petal_width    species
0           4.9          2.5           4.5          1.7  virginica

Attributes:

query: Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.

property query: Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.

IdaDataFrame.__lt__(value)[source]: ibmdbpy.filtering.FilterQuery object when comparing self using “<”.

IdaDataFrame.__le__(value)[source]: ibmdbpy.filtering.FilterQuery object when comparing self using “<=”.

IdaDataFrame.__eq__(value)[source]: ibmdbpy.filtering.FilterQuery object when comparing self using “==”.

IdaDataFrame.__ne__(value)[source]: ibmdbpy.filtering.FilterQuery object when comparing self using “!=”.

IdaDataFrame.__ge__(value)[source]: ibmdbpy.filtering.FilterQuery object when comparing self using “>=”.

IdaDataFrame.__gt__(value)[source]: ibmdbpy.filtering.FilterQuery object when comparing self using “>”.

Feature Engineering

IdaDataFrame.__setitem__(key, item)[source]

Enable the creation and aggregation of columns.

Examples

>>> idadf['new'] = idadf['sepal_length'] * idadf['sepalwidth']
# select a new column as the product of two existing columns

>>> idadf['sepal_length'] = idadf['sepal_length'] / idadf['sepal_length'].mean()
# modify an existing column

aggregation.aggregate_idadf(method, other, swap=False)

Modify internal state variables to represent the aggregation of columns of an IdaDataFrame or IdaSeries in ibmdbpy.

The following comparison operators are supported: +, *, /, -, //, %, **.

The syntax is similar to Pandas.

Parameters:

idadfIdaDataFrame or IdaSeries: IdaDataFrame or IdaSerie on the left (if swap is False)
methodstr: Aggregation method that is computed: the following values are admissible: “add”,”mul”,”div”,”sub”,”floordiv”,”mod”,”neg”,”pow”
other: Number or IdaDataFrame or IdaSeries: Another object that idadf will be aggregated with (on the right if swap is False).
swapbool, default: False: Internally used to handle cases where the call is made reflexively, that is when the main IdaDataFrame/IdaSeries is not on the left. If swap is True, this also implies that other is not of type IdaDataFrame/IdaSeries.

Returns:

Aggregated IdaDataFrame or IdaSeries

Raises:

ValueError: Aggregation method not supported.
TypeError: Type not supported for aggregation.

Notes

It is not possible to create aggregations between columns that are stored in different Db2 Warehouse tables.

Examples

>>> idairis['SepalLength'] = idairis['SepalLength'] * 2
...

IdaDataFrame.__add__(other)[source]

Perform an addition between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] + 3

IdaDataFrame.__radd__(other)[source]

Enable the reflexivity of the addtion operation.

Examples

>>> ida = idadf['sepal_length'] + 3

>>> ida = 3 + idadf['sepal_length']

IdaDataFrame.__div__(other)[source]

Perform a division between self and another IdaDataFrame or number.

When __future__.division is not in effect.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] / 3

IdaDataFrame.__rdiv__(other)[source]

Enable the reflexivity of the division operation.

When __future__.division is not in effect.

Examples

>>> ida = idadf['sepal_length'] / 3

>>> ida = 3 / idadf['sepal_length']

IdaDataFrame.__truediv__(other)[source]

Perform a division between self and another IdaDataFrame or number.

When __future__.division is in effect.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] / 3

IdaDataFrame.__rtruediv__(other)[source]

Enable the reflexivity of the division operation.

When __future__.division is in effect.

Examples

>>> ida = idadf['sepal_length'] / 3

>>> ida = 3 / idadf['sepal_length']

IdaDataFrame.__floordiv__(other)[source]

Perform an integer division between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] // 3

IdaDataFrame.__rfloordiv__(other)[source]

Enable the reflexivity of the integer division operation.

Examples

>>> ida = idadf['sepal_length'] // 3

>>> ida = 3 // idadf['sepal_length']

IdaDataFrame.__mod__(other)[source]

Perform a modulo operation between self and another IdaDataFrame or number.

Notes

Arithmetic operations make sense if self has only numeric columns.

Examples

>>> ida = idadf['sepal_length'] % 3

IdaDataFrame.__rmod__(other)[source]

Enable the reflexivity of the modulo operation.

Examples

>>> ida = idadf['sepal_length'] % 3

>>> ida = 3 % idadf['sepal_length']

IdaDataFrame.__mul__(other)[source]

Perform a multiplication between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] * 3

IdaDataFrame.__rmul__(other)[source]

Enable the reflexivity of the multiplication operation.

Examples

>>> ida = idadf['sepal_length'] % 3

>>> ida = 3 % idadf['sepal_length']

IdaDataFrame.__neg__()[source]

Calculate the absolute negative of all columns in self.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

IdaDataFrame.__rpos__(other)[source]: Calculate the absolute positive. No operation required.

IdaDataFrame.__pow__(other)[source]

Perform a power operation between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] ** 3

IdaDataFrame.__rpow__(other)[source]

Enable the reflexivity of the power operation.

Examples

>>> ida = idadf['sepal_length'] ** 3

>>> ida = 3 ** idadf['sepal_length']

IdaDataFrame.__sub__(other)[source]

Perform a substraction between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] - 3

IdaDataFrame.__rsub__(other)[source]

Enable the reflexivity of the substraction operation.

Examples

>>> ida = idadf['sepal_length'] - 3

>>> ida = 3 - idadf['sepal_length']

IdaDataFrame.__delitem__(item)[source]

Enable non-destructive deletion of columns using a Pandas style syntax. This happens inplace, which means that the current IdaDataFrame is modified.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> idadf.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
>>> del idadf['sepal_length']
>>> idadf.columns
Index(['sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')

IdaDataFrame.save_as(tablename, clear_existing=False)[source]: Save self as a table name in the remote database with the name tablename. This function might erase an existing table if tablename already exists and clear_existing is True.

DataBase Features

exists

IdaDataFrame.exists()[source]

Convenience function delegated from IdaDataBase.

Check if the data still exists in the database.

is_view / is_table

IdaDataFrame.is_view()[source]

Convenience function delegated from IdaDataBase.

Check if the IdaDataFrame corresponds to a view in the database.

IdaDataFrame.is_table()[source]

Convenience function delegated from IdaDataBase.

Check if the IdaDataFrame corresponds to a table in the database.

get_primary_key

IdaDataFrame.get_primary_key()[source]: Get the name of the primary key of self, if there is one. Otherwise, this function returns 0. This function may be deprecated in future versions because it is not very useful.

ida_query

IdaDataFrame.ida_query(query, silent=False, first_row_only=False, autocommit=False)[source]

Convenience function delegated from IdaDataBase.

Prepare, execute and format the result of a query in a data frame or in a tuple. See the IdaDataBase.ida_query documentation.

ida_scalar_query

IdaDataFrame.ida_scalar_query(query, silent=False, autocommit=False)[source]

Convenience function delegated from IdaDataBase.

Prepare and execute a query and return only the first element as a string. See the IdaDataBase.ida_scalar_query documentation.

Data Exploration

head

IdaDataFrame.head(nrow=5, sort=True)[source]

Print the n first rows of the instance, n is set to 5 by default.

Parameters:

nrowint > 0: Number of rows to be included in the result.
sort: default is True: If set to True and no indexer is set the data will be sorted by the first numeric column or if no numeric column is available by the first column of the dataframe. If set to False and no indexer is set the row order is not guaranteed and can vary with each execution. For big tables this option might save query processing time.

Returns:

DataFrame or Series: The index of the corresponding row number and the columns are all columns of self. If the IdaDataFrame has only one column, it returns a Series.

Examples

>>> ida_iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa

tail

IdaDataFrame.tail(nrow=5, sort=True)[source]

Print the n last rows of the instance, n is set to 5 by default.

Parameters:

nrowint > 0: The number of rows to be included in the result.
sort: default is True: If set to True and no indexer is set the data will be sorted by the first numeric column or if no numeric column is available by the first column of the dataframe. If set to False and no indexer is set the row order is not guaranteed and can vary with each execution. For big tables this option might save query processing time.

Returns:

DataFrame: The index of the corresponding row number and the columns are all columns of self.

Examples

>>> ida_iris.tail()
     sepal_length  sepal_width  petal_length  petal_width    species
         6.7          3.0           5.2          2.3  virginica
         6.3          2.5           5.0          1.9  virginica
         6.5          3.0           5.2          2.0  virginica
         6.2          3.4           5.4          2.3  virginica
         5.9          3.0           5.1          1.8  virginica

pivot_table

IdaDataFrame.pivot_table(values=None, columns=None, max_entries=1000, sort=None, factor_threshold=20, interactive=False, aggfunc='count')[source]

Compute an aggregation function over all rows of each column that is specified as a value on the dataset. The result grouped by the columns defined in “columns”.

Parameters:

values: str or list or str optional

List of columns on which “aggfunc” is computed.

columns: str or list or str optional

List of columns that is used as an index and by which the dataframe is grouped.

max_entries: int, default=1000

The maximum number of cells to be part of the output. By default, set to 1000.

sort: str, optional

Admissible values are: “alpha” and “factors”.

If “alpha”, the index of the output is sorted according to the alphabetical order.
If “factors”, the index of the output will be sorted according to increasing number of the distinct values.

By default, the index will be sorted in the same order that is specified in “columns” argument.

factor_threshold: int, default: 20

Number of distinct values above which a categorical column should not be considered categorical anymore and under which a numerical column column should not be considered numerical anymore.

interactive: bool

If True, the user is asked if he wants to display the output, given its size.

aggfunc: str

Aggregation function to be computed on each column specified in the argument “values”. Admissible values are: “count”, “sum”, “avg”. This entry is not case-sensitive.

Returns:

Pandas Series with Multi-index (columns)

Examples

>>> val = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> ida_iris.pivot_table(values= val, aggfunc="avg")
              species
sepal_length  setosa        5.006
              versicolor    5.936
              virginica     6.588
sepal_width   setosa        3.428
              versicolor    2.770
              virginica     2.974
petal_length  setosa        1.462
              versicolor    4.260
              virginica     5.552
petal_width   setosa        0.246
              versicolor    1.326
              virginica     2.026
dtype: float64

join

IdaDataFrame.join(other, on: str = None, how: str = 'left', lsuffix: str = '_x', rsuffix: str = '_y')[source]

Implement pandas-like interface to join tables

Parameters:

otherIdaDataFrame: IdaDataFrame to join to this object
onstr, optional: name of column in both IdaDataFrames to do join on
howstr, optional: type of join, possible types - ‘outer’, ‘inner’, ‘right’, ‘left’, ‘cross’
lsuffixstr, optional: suffix to add to columns in this IdaDataFrame that are present in both IdaDataFrames
rsuffixstr, optional: suffix to add to columns in other IdaDataFrame that are present in both IdaDataFrames

Returns:

IdaDataFrame: result of joining IdaDataFrames

Examples

>>> from nzpyida.sampledata.iris import iris
>>> from nzpyida.sampledata.iris import iris as iris2
>>> iris.reset_index()
>>> iris2.reset_index()
>>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length',             'PETAL_WIDTH', 'CLASS']
>>> iris2['id'] = iris2['id'] + 50
>>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index')
>>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id')
>>> iris_merge = iris1_ida.join(iris2_ida)
>>> iris_merge.tail()
    index       sepal_length_x  sepal_width_x   petal_length_x  petal_width     species     id  sepal_length_y  sepal_width_y   petal_length_y  PETAL_WIDTH     CLASS                                                                                           
95      145     6.7             3.0             5.2             2.3             virginica   145 5.7             3.0             4.2             1.2             versicolor
96      146     6.3             2.5             5.0             1.9             virginica   146 5.7             2.9             4.2             1.3             versicolor
97      147     6.5             3.0             5.2             2.0             virginica   147 6.2             2.9             4.3             1.3             versicolor
98      148     6.2             3.4             5.4             2.3             virginica   148 5.1             2.5             3.0             1.1             versicolor
99      149     5.9             3.0             5.1             1.8             virginica   149 5.7             2.8             4.1             1.3             versicolor
>>> len(iris_merge)
100

merge

IdaDataFrame.merge(right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, suffixes: List[str] = ['_x', '_y'], indicator: bool = False)[source]

Implement pandas-like interface to merge IdaDataFrames

Parameters:

rightIdaDataFrame: right IdaDataDrame to merge
howstr, optional: type of join, possible types - ‘outer’, ‘inner’, ‘right’, ‘left’, ‘cross’
onstr, optional: name of column in both IdaDataFrames to do join on
left_onstr, optional: name of column in left IdaDataFrame to do join on
right_onstr, optional: name of column in right IdaDataFrame to do join on
left_indexbool, optional: whether to join on indexer of left IdaDataFrame
right_indexbool, optional: whether to join on indexer of right IdaDataFrame
suffixes: List[str], optional: list of suffixes to add to columns that are present in both IdaDataFrames
indicator: bool optional: whether to add to output IdaDataFrame, column with information about the source of a given row

Returns:

IdaDataFrame: result of merging IdaDataFrames

Examples

>>> from nzpyida.sampledata.iris import iris
>>> from nzpyida.sampledata.iris import iris
>>> iris.reset_index()
>>> iris2.reset_index()
>>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length',             'PETAL_WIDTH', 'CLASS']
>>> iris2['id'] = iris2['id'] + 50
>>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index')
>>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id')
>>> iris_merge = iris1_ida.merge(iris2_ida, left_on='index', right_on='id')
>>> iris_merge.tail()
    index       sepal_length_x  sepal_width_x   petal_length_x  petal_width     species     id  sepal_length_y  sepal_width_y   petal_length_y  PETAL_WIDTH     CLASS                                                                                           
95      145     6.7             3.0             5.2             2.3             virginica   145 5.7             3.0             4.2             1.2             versicolor
96      146     6.3             2.5             5.0             1.9             virginica   146 5.7             2.9             4.2             1.3             versicolor
97      147     6.5             3.0             5.2             2.0             virginica   147 6.2             2.9             4.3             1.3             versicolor
98      148     6.2             3.4             5.4             2.3             virginica   148 5.1             2.5             3.0             1.1             versicolor
99      149     5.9             3.0             5.1             1.8             virginica   149 5.7             2.8             4.1             1.3             versicolor
>>> len(iris_merge)
100
>>> iris_merge = iris1_ida.merge() iris2_ida, left_on='index', right_index=True, 
                    how='outer', indicator=True, suffixes=("_a", "_b"))
>>> iris_merge.head()
    index       sepal_length_a  sepal_width_a   petal_length_a  petal_width     species         id      sepal_length_b  sepal_width_b   petal_length_b  PETAL_WIDTH CLASS       INDICATOR
0       None    None            None            None            None            None            152     7.1             3.0             5.9             2.1         virginica   right_only
1       None    None            None            None            None            None            151     5.8             2.7             5.1             1.9         virginica   right_only
2       None    None            None            None            None            None            154     6.5             3.0             5.8             2.2         virginica   right_only
3       None    None            None            None            None            None            153     6.3             2.9             5.6             1.8         virginica   right_only
4       None    None            None            None            None            None            150     6.3             3.3             6.0             2.5         virginica   right_only

concat

join_tables.concat(axis: int = 0, join: str = 'outer', keys: List[str] = None)

Implement pandas-like interface to concateate IdaDataFrames

Parameters:

objsList[IdaDataFrame]: list of IdaDataFrames to be concatenated
axisint, optional: axis to concatenate on
joinstr, optional: type of join, possible types - ‘outer’, ‘inner’
keysList[str], optional: if List is present then add column with table indetifier

Returns:

IdaDataFrame: reult of concataneting IdaDataFrames

Examples

>>> from nzpyida.sampledata.iris import iris
>>> from nzpyida.sampledata.iris import iris as iris2
>>> iris.reset_index()
>>> iris2.reset_index()
>>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length',         'PETAL_WIDTH', 'CLASS']
>>> iris2['id'] = iris2['id'] + 50
>>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index')
>>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id')
>>> iris_concat = nzpyida.concat([iris1_ida, iris2_ida])
>>> len(iris1_ida), len(iris2_ida), len(iris_concat)
(150, 150, 300)
>>> iris_concat.head()
    index   sepal_length    sepal_width     petal_length    petal_width     species     id      PETAL_WIDTH CLASS
0   None    4.9             3.0             1.4             None            None        51      0.2         setosa
1   None    4.6             3.1             1.5             None            None        53      0.2         setosa
2   None    4.7             3.2             1.3             None            None        52      0.2         setosa
3   None    5.1             3.5             1.4             None            None        50      0.2         setosa
4   None    5.9             3.0             5.1             None            None        199     1.8         virginica
>>> iris_concat.tail()
    index   sepal_length    sepal_width     petal_length    petal_width     species         id      PETAL_WIDTH     CLASS                                                                   
295 145     6.7             3.0             5.2             2.3             virginica       None    None            None
296 146     6.3             2.5             5.0             1.9             virginica       None    None            None
297 147     6.5             3.0             5.2             2.0             virginica       None    None            None
298 148     6.2             3.4             5.4             2.3             virginica       None    None            None
299 149     5.9             3.0             5.1             1.8             virginica       None    None            None

sort

IdaDataFrame.sort(columns=None, axis=0, ascending=True, inplace=False)[source]

Sort the IdaDataFrame row wise or column wise.

Parameters:

columnsstr or list of str: Columns that should be used to sort the rows in the IdaDataFrame. If columns is set to None and axis to 0, then the IdaDataFrame columns are sorted in lexicographical order.
axisint (0/1): Axis that is sorted. 0 for sorting row wise, 1 for sorting column wise.
ascendingbool, default: True: Sorting order, True : ascending, False : descending
inplacebool, default: False: The current object is modified or creates a modified copy. If False, the function creates a modified copy of the current dataframe. If True, the function modifies the current dataframe.

Raises:

ValueError

When sorting by column (column not None), the axis value must be 0 (rows).
A column does not belong to self.
The axis argument has a value other than 0 or 1.

Notes

If columns is set to None and axis to 0, this undoes all sorting the IdaDataFrame and returns the original sorting in the Db2 Warehouse database.

No actual changes are made in Db2 Warehouse, only the querying changes. Everything is registered in an InternalState object. Changes can be observed by using head and tail function.

IdaDataFrame.groupby(by)[source]

Creates the groupby object

Parameters:

bystr: Column to group the Data Frame by

Returns:

IdaDataFrameGroupBy

Examples

>>> ida_iris.groupby("CLASS")
<nzpyida.groupby.IdaDataFrameGroupBy at 0x11c288e10>

Obtaing IdaDataFrame with grouped data is possible with aggregate methods of IdaDataFrameGroupBy

IdaDataFrameGroupBy

class nzpyida.groupby.IdaDataFrameGroupBy(in_df, columns_to_aggregate, by_column)[source]

Class representing groupby object, that is object created from IdaDataFrame. It implements Pandas-like object for grouping.

Examples

>>> ida_iris.groupby("CLASS")
<nzpyida.groupby.IdaDataFrameGroupBy at 0x11c288e10>

>>> df.groupby("CLASS").mean().head()
        AVG_SEPAL_LENGTH    AVG_SEPAL_WIDTH         AVG_PETAL_LENGTH        AVG_PETAL_WIDTH         CLASS
    0       5.006               3.418                   1.464                   0.244                   Iris-setosa
    1       5.936               2.770                   4.260                   1.326                   Iris-versicolor
    2       6.588               2.974                   5.552                   2.026                   Iris-virginica

>>> mean_sepal_length = ida_iris[["SEPAL_LENGTH", "CLASS"]].groupby("CLASS").mean()
>>> mean_sepal_length.head()
        AVG_SEPAL_LENGTH    CLASS
    0       5.006               Iris-setosa
    1       5.936               Iris-versicolor
    2       6.588               Iris-virginica

>>> iris_groupby = ida_iris[(ida_iris["SEPAL_LENGTH"]>4) & (ida_iris["SEPAL_WIDTH"]<3)][
>>>     ["SEPAL_WIDTH", "SEPAL_LENGTH", "CLASS"]].groupby("CLASS")
>>> iris_groupby
<nzpyida.groupby.IdaDataFrameGroupBy object at 0x119863190>
>>> iris_groupby.count().head()
        COUNT_SEPAL_WIDTH   COUNT_SEPAL_LENGTH      CLASS
    0       2                       2                       Iris-setosa
    1       21                      21                      Iris-virginica
    2       34                      34                      Iris-versicolor

Methods

`count`()	Aggregates all columns with COUNT method.
`max`()	Aggregates all columns with MAX method.
`mean`()	Aggregates all columns with AVG method.
`min`()	Aggregates all columns with MIN method.
`sum`()	Aggregates all columns with SUM method.

count()[source]

Aggregates all columns with COUNT method. Counts all non null values in column with respect to grouped classes

Returns:

IdaDataFrame: Object with grouped data

max()[source]

Aggregates all columns with MAX method. Selects the maximum value from column with respect to grouped classes

Returns:

IdaDataFrame: Object with grouped data

mean()[source]

Aggregates all columns with AVG method. Calculates mean of values in column with respect to grouped classes

Returns:

IdaDataFrame: Object with grouped data

min()[source]

Aggregates all columns with MIN method. Selects the minimum value from column with respect to grouped classes

Returns:

IdaDataFrame: Object with grouped data

sum()[source]

Aggregates all columns with SUM method. Sums all the values in column with respect to grouped classes

Returns:

IdaDataFrame: Object with grouped data

Descriptive Statistics

describe

IdaDataFrame.describe(percentiles=[0.25, 0.5, 0.75])[source]

A basic statistical summary about current IdaDataFrame. If at least one numerical column exists, the summary includes:

The count of non-missing values for each numerical column.

The mean for each numerical column.

The standart deviation for each numerical column.

The minimum and maximum for each numerical column.

A list of percentiles set by the user (default : the quartiles).

Parameters:

idadfIdaDataFrame
percentilesFloat or list of floats, default: [0.25, 0.50, 0.75].: percentiles to be computed on numerical columns. All values in percentiles must be > 0 and < 1.

Returns:

summary: DataFrame, where

Index is the name of the computed values.
Columns are either numerical or categorical columns of self.

cov (covariance)

IdaDataFrame.cov()[source]

Compute the covariance matrix, composed of covariance coefficients between all pairs of columns in self. It must have at least two numeric columns.

Returns:

covariance matrix: DataFrame: The axes are the columns of self and the values are the covariance coefficients.

corr (correlation)

IdaDataFrame.corr(method='pearson', features=None, ignore_indexer=True)[source]

Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self. It must have at least two numeric columns.

Parameters:

methodstr, default: pearson: Method to be used to compute the correlation. By default, compute the pearson correlation coefficient. The Spearman rank correlation is also available. Admissible values are: “pearson”, “spearman”.

Returns:

correlation matrix: DataFrame: The axes are the columns of self and the values are the correlation coefficients.

Notes

For the Spearman rank correlation, the ordinal rank of columns is computed. For performance reasons this is easier to compute than the fractional rank traditionally computed for the Spearman rank correlation method. This strategy has the property that the sum of the ranking numbers is the same as under ordinal ranking. We then apply the pearson correlation coefficient method to these ranks.

quantile

IdaDataFrame.quantile(q=0.5)[source]

Compute row wise quantiles for each numeric column.

Parameters:

qfloat or array-like, default 0.5 (50% quantile): 0 <= q <= 1, the quantile(s) to compute

Returns:

quantiles: Series or DataFrame: If q is an array, the function returns a DataFrame in which the index is q. The columns are the columns of sel, and the values are the quantiles. If q is a float, a Series is returned where the index is the columns of self and the values are the quantiles.

mad (mean absolute deviation)

IdaDataFrame.mad()[source]

Compute the mean absolute distance for all numeric columns of self. It must have at least two numeric columns.

Returns:

mad: Series: The index consists of the columns of self and the values are the mean absolute distance.

min (minimum)

IdaDataFrame.min()[source]

Compute the minimum value for all numerics column of self.

Returns:

min: Series: The index consists of the columns of self and the values are the minimum.

max (maximum)

IdaDataFrame.max()[source]

Compute the maximum value over for all numeric columns of self.

Returns:

max: Series.: The index consists of the columns of self and the values are the maximum.

count

IdaDataFrame.count()[source]

Compute the count of non-missing values for all columns of self.

Returns:

count: Series.: The index consists of the columns of self and the values are the number of non-missing values.

count_distinct

IdaDataFrame.count_distinct()[source]

Compute the count of distinct values for all numeric columns of self.

Returns:

disctinct count: Series: The index consists of the columns of self and values are the number of distinct values.

std (standard deviation)

IdaDataFrame.std()[source]

Compute the standard deviation for all numeric columns of self.

Returns:

std: Series: The index consists of the columns of self and the values are the standard deviation.

var (variance)

IdaDataFrame.var()[source]

Compute the variance for all numeric columns of self.

Returns:

var: Series: The index consists of the columns of self and the values are the variance.

mean

IdaDataFrame.mean()[source]

Compute the mean for each numeric columns of self.

Returns:

mean: Series: The index consists of the columns of self and the values are the mean.

sum

IdaDataFrame.sum()[source]

Compute the sum of values for all numeric columns of self.

Returns:

sum: Series: The index consists of the columns of self and the values are the sum.

median

IdaDataFrame.median()[source]

Compute the median for all numeric columns of self.

Returns:

median: Series: The index consists of the columns of self and the values are the median.

Import as DataFrame

as_dataframe

IdaDataFrame.as_dataframe()[source]

Download and return an in-memory representation of the dataset as a Pandas DataFrame.

Returns:

DataFrame: Columns and records are the same as in self.

Examples

>>> iris = ida_iris.as_dataframe()
>>> iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Connection Management

commit

IdaDataFrame.commit()[source]

Commit operations in the database.

Notes

All changes that are made in the database after the last commit, including those in the child IdaDataFrames, are commited.

If the environment variable ‘VERBOSE’ is set to True, the commit operations are notified in the console.

rollback

IdaDataFrame.rollback()[source]

Rollback operations in the database.

Notes

All changes that are made in the database after the last commit, including those in the child IdaDataFrames, are discarded.

IdaDataFrame

Open an IdaDataFrame Object.

DataFrame introspection

internal_state

indexer

type

dtypes

index

columns

axes

shape

empty

__len__

__iter__

DataFrame modification

Selection, Projection

Filtering

Feature Engineering

DataBase Features

exists

is_view / is_table

get_primary_key

ida_query

ida_scalar_query

Data Exploration

head

tail

pivot_table

join

merge

concat

sort

IdaDataFrameGroupBy

Descriptive Statistics

describe

cov (covariance)

corr (correlation)

quantile

mad (mean absolute deviation)

min (minimum)

max (maximum)

count

count_distinct

std (standard deviation)

var (variance)

mean

sum

median

Import as DataFrame

as_dataframe

Connection Management

commit

rollback

len

iter