IdaDataFrame

Open an IdaDataFrame Object.

class nzpyida.frame.IdaDataFrame(idadb, tablename, indexer=None)[source]

An IdaDataFrame object is a reference to a table in a remote Db2 Warehouse database. IDA stands for In-DataBase Analytics. IdaDataFrame copies the Pandas interface for DataFrame objects to ensure intuitive interaction for end-users.

Examples

>>> idadb = IdaDataBase('BLUDB') # See documentation for IdaDataBase
>>> ida_iris = IdaDataFrame(idadb, 'IRIS')
>>> ida_iris.cov()
                  sepal_length  sepal_width  petal_length  petal_width
sepal_length      0.685694    -0.042434      1.274315     0.516271
sepal_width      -0.042434     0.189979     -0.329656    -0.121639
petal_length      1.274315    -0.329656      3.116278     1.295609
petal_width       0.516271    -0.121639      1.295609     0.581006
Attributes:
columns

Index containing the column names in self.

empty

Boolean that is True if the table is empty (no rows).

indexer

The indexer attribute refers to the name of a column that should be used to index the table.

name

Methods

as_dataframe()

Download and return an in-memory representation of the dataset as a Pandas DataFrame.

commit()

Commit operations in the database.

corr([method, features, ignore_indexer])

Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self.

corrwith(other)

Compute the correlation matrix, composed of correlation coefficients between the columns of self and the columns of another IdaDataFrame.

count()

Compute the count of non-missing values for all columns of self.

count_distinct()

Compute the count of distinct values for all numeric columns of self.

count_groupby([columns, count_only, having])

Count the occurence of the values of a column or group of columns

cov()

Compute the covariance matrix, composed of covariance coefficients between all pairs of columns in self.

delete_na(columns[, logic, inplace])

Filter rows containing NULL values.

describe([percentiles])

A basic statistical summary about current IdaDataFrame.

exists()

Convenience function delegated from IdaDataBase.

get_primary_key()

Get the name of the primary key of self, if there is one.

groupby(by)

Creates the groupby object

head([nrow, sort])

Print the n first rows of the instance, n is set to 5 by default.

ida_query(query[, silent, first_row_only, ...])

Convenience function delegated from IdaDataBase.

ida_scalar_query(query[, silent, autocommit])

Convenience function delegated from IdaDataBase.

info([buf])

Some information about current IdaDataFrame.

is_table()

Convenience function delegated from IdaDataBase.

is_view()

Convenience function delegated from IdaDataBase.

join(other[, on, how, lsuffix, rsuffix])

Implement pandas-like interface to join tables

levels([columns])

Return the numbers of distinct values

mad()

Compute the mean absolute distance for all numeric columns of self.

max()

Compute the maximum value over for all numeric columns of self.

mean()

Compute the mean for each numeric columns of self.

mean_freq_of_instance([columns])

Return the average occurence of the values of a column or group of columns

median()

Compute the median for all numeric columns of self.

merge(right[, how, on, left_on, right_on, ...])

Implement pandas-like interface to merge IdaDataFrames

min()

Compute the minimum value for all numerics column of self.

mode()

Compute the most common value for each non numeric column self.

pivot_table([values, columns, max_entries, ...])

Compute an aggregation function over all rows of each column that is specified as a value on the dataset.

quantile([q])

Compute row wise quantiles for each numeric column.

rank()

Compute the rank over all entries for all columns of self.

rollback()

Rollback operations in the database.

save_as(tablename[, clear_existing])

Save self as a table name in the remote database with the name tablename.

sort([columns, axis, ascending, inplace])

Sort the IdaDataFrame row wise or column wise.

std()

Compute the standard deviation for all numeric columns of self.

sum()

Compute the sum of values for all numeric columns of self.

summary()

A basic statistical summary about current IdaDataFrame.

tail([nrow, sort])

Print the n last rows of the instance, n is set to 5 by default.

train_test_split(train_table, test_table, ...)

Split the table into train and test sets

unique(column)

Return the unique values of a column

var()

Compute the variance for all numeric columns of self.

mean_groupby

print

within_class_std

within_class_var

__init__(idadb, tablename, indexer=None)[source]

Constructor for IdaDataFrame objects.

Parameters:
idadbIdaDataBase

IdaDataBase instance which contains the connection to be used.

tablenamestr

Name of the table to be opened in the database. It should contain only alphanumeric characters and underscores. All lower case characters will be converted to upper case characters.

indexerstr, optional

Name of the column that should be used as an index. This is optional. However, if no indexer is given, the order of rows issued by the head and tail functions is not guaranteed. Also, several in-database machine learning algorithms need an indexer as a parameter to be executed.

Notes

Attributes “type”, “dtypes”, “index”, “columns”, “axes”, and “shape” are evaluated in a lazy way to avoid an overhead when creating an IdaDataFrame. Sometimes the index may be too big to be downloaded.

Examples

>>> idadb = IdaDataBase('BLUDB')
>>> ida_iris = IdaDataFrame(idadb, "IRIS")
Attributes:
_idadbIdaDataBase

IdaDataBase object parent of the current instance.

tablenamestr

Name of the table self references.

namestr

Full name of the table self references, including schema.

schemastr

Name of the schema the table belongs to.

indexerstr

Name of the column used as an index. “None” if no indexer.

locstr

Indexer that enables the selection and projection of IdaDataFrame instances. For more information, see the loc class documentation.

internal_stateInternalState

Object used to internally store the state of the IdaDataFrame. It also allows several non-destructive manipulation methods.

typestr

Type of the IdaDataFrame : “Table”, “View”, or “Unknown”.

dtypesDataFrame

Data type in the database for each column.

indexpandas.core.index

Index containing the row names in this table.

columnspandas.core.index

Index containing the columns names in this table.

axeslist

List containing columns and index attributes.

shapeTuple

Number of rows and number of columns.

Methods

DataFrame introspection

internal_state

IdaDataFrame.internal_state = <lazy.lazy.lazy object>[source]

indexer

IdaDataFrame.indexer

The indexer attribute refers to the name of a column that should be used to index the table. This makes sense because Db2 Warehouse is a column-based database, so row IDs do not make sense and are not deterministic. As a consequence, the only way to address a particular row is to refer to it by its index. If no indexer is provided, ibmdbpy still works but a correct row order is not guaranteed as far as the dataset is not sorted. Also, note that the indexer column is not taken into account in data mining algorithms.

type

IdaDataFrame.type = <lazy.lazy.lazy object>[source]

dtypes

IdaDataFrame.dtypes = <lazy.lazy.lazy object>[source]

index

IdaDataFrame.index = <lazy.lazy.lazy object>[source]

columns

IdaDataFrame.columns

Index containing the column names in self.

Returns:
Index

Examples

>>> ida_iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')

axes

IdaDataFrame.axes = <lazy.lazy.lazy object>[source]

shape

IdaDataFrame.shape = <lazy.lazy.lazy object>[source]

empty

IdaDataFrame.empty

Boolean that is True if the table is empty (no rows).

Returns:
Boolean

__len__

IdaDataFrame.__len__()[source]

Number of records.

Returns:
int

Examples

>>> len(idadf)
150

__iter__

IdaDataFrame.__iter__()[source]

Iterate over columns.

DataFrame modification

Selection, Projection

IdaDataFrame.__getitem__(item)[source]

Enable label based projection (selection of columns) in IdaDataFrames.

Enable slice based selection of rows in IdaDataFrames.

Enable row filtering.

The syntax is similar to Pandas.

Notes

The row order is not guaranteed if no indexer is given and the dataset is not sorted

Examples

>>> idadf['col1'] # return an IdaSeries
>>> idadf[['col1']] # return an IdaDataFrame with one column
>>> idadf[['col1', 'col2', 'col3']] # return an IdaDataFrame with 3 columns
>>> idadf[0:9] # Select the 10 first rows
>>> idadf[idadf['col1'] = "test"]
# select of rows for which attribute col1 is equal to "test"

Selection and Projection are also possible using the nzpyida.Loc object stored in IdaDataFrame.loc.

class nzpyida.indexing.Loc(idadf)[source]

The Loc class is used to select and project IdaDataFrames. It implements a Pandas-like interface.

Filtering

IdaDataFrame.delete_na(columns, logic='any', inplace=False)[source]

Filter rows containing NULL values. Can be done in a destructive way (rows are deleted in the physical table) or in a non destructive way (a new IdaDataFrame is defined, original table and IdaDataFrame are preserved).

Parameters:
columnslist of strings

list of eligible column names

logicstr, optional

“any” by default. If logic is set to “any” then all rows which contain a NaN value in any (id est at least one) of the cited columns is deleted, this is a union condition; if logic is set to “all”, then only rows containing null values in all the fields are deleted, this is an intersection condition.

inplacebool, optional

False by default. If True, then the underlying table is physically modified, if False, then the original objects remain unaffected, a copy of the IdaDataFrame is made and modified.

Returns:
If inplace is True, the original table is modified and the IdaDataFrame will be modified accordingly. No return.
If inplace is False, a new IdaDataFrame is returned, the original table is not affected. This new IdaDataFrame points to a
new table which has been created according to the user defined criterions on columns.

Examples

>>> idadf.delete_na(["COL_1", "COL_2"], inplace = True)
>>> #each row of the original table is physically deleted if there is a NULL value in one of the listed columns
>>> new_idadf = idadf.delete_na(["COL_1", "COL_2"], logic = 'all')
>>> # a new IdaDataFrame is created, rows will be selected only if all the listed columns have a NULL value.
class nzpyida.filtering.FilterQuery(columns, tablename, method, value)[source]

FilterQueries are used to represent the filtering of an IdaDataFrame in ibmdbpy. The use of comparison operators, such as <, <=, ==, >=, > on an IdaDataFrame instance produces a FilterQuery instance which acts as a container for the where clause of the corresponding SQL request.

Filtering is possible using a Pandas-like syntax. Applying comparison operators to IdaDataFrames produces a FilterQuery instance which contains the string embedding the corresponding where clause in the “wherestr” attribute.

FilterQuery objects also contain a logic that allows them to be combined, thus allowing complex filtering.

You can combine the following operators: |, &, ^ (OR, AND and XOR)

Notes

It is not possible to filter an IdaDataFrame by using an IdaDataFrame that is opened in a different data source in the database. This is due to the fact that, using a Pandas-like syntax, “idadf[‘petal_width’] < 5” will return a Boolean array that is used to subset the original DataFrame. This is a fundamental restriction of ibmdbpy: we cannot afford to compute and download such an array because we cannot assume that the result will fit into user’s memory. Download time can also be a performance issue.

Examples

>>> idadf[['sepal_length', 'petal_width'] < 5]
>>> <ibmdbpy.filtering.FilterQuery at 0xa65ba90>
>>> _.wherestr
'("sepal_length" < 5 AND "petal_width" < 5)'
>>> idadf[idadf[['sepal_length', 'petal_width']] < 5]
<ibmdbpy.frame.IdaDataFrame at 0xa73a860>
>>> _.head() # filtered IdaDataFrame
    sepal_length  sepal_width  petal_length  petal_width     species
0           4.4          2.9           1.4          0.2      setosa
1           4.7          3.2           1.6          0.2      setosa
2           4.9          2.5           4.5          1.7   virginica
3           4.9          2.4           3.3          1.0  versicolor
4           4.6          3.2           1.4          0.2      setosa
>>> idadf[(idadf['sepal_length'] < 5) & (idadf[petal_width'] > 1.5)]
<ibmdbpy.frame.IdaDataFrame at 0xa74b9b0>
>>> _.head()
   sepal_length  sepal_width  petal_length  petal_width    species
0           4.9          2.5           4.5          1.7  virginica
Attributes:
query

Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.

property query

Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.

IdaDataFrame.__lt__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “<”.

IdaDataFrame.__le__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “<=”.

IdaDataFrame.__eq__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “==”.

IdaDataFrame.__ne__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “!=”.

IdaDataFrame.__ge__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “>=”.

IdaDataFrame.__gt__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “>”.

Feature Engineering

IdaDataFrame.__setitem__(key, item)[source]

Enable the creation and aggregation of columns.

Examples

>>> idadf['new'] = idadf['sepal_length'] * idadf['sepalwidth']
# select a new column as the product of two existing columns
>>> idadf['sepal_length'] = idadf['sepal_length'] / idadf['sepal_length'].mean()
# modify an existing column
aggregation.aggregate_idadf(method, other, swap=False)

Modify internal state variables to represent the aggregation of columns of an IdaDataFrame or IdaSeries in ibmdbpy.

The following comparison operators are supported: +, *, /, -, //, %, **.

The syntax is similar to Pandas.

Parameters:
idadfIdaDataFrame or IdaSeries

IdaDataFrame or IdaSerie on the left (if swap is False)

methodstr

Aggregation method that is computed: the following values are admissible: “add”,”mul”,”div”,”sub”,”floordiv”,”mod”,”neg”,”pow”

other: Number or IdaDataFrame or IdaSeries

Another object that idadf will be aggregated with (on the right if swap is False).

swapbool, default: False

Internally used to handle cases where the call is made reflexively, that is when the main IdaDataFrame/IdaSeries is not on the left. If swap is True, this also implies that other is not of type IdaDataFrame/IdaSeries.

Returns:
Aggregated IdaDataFrame or IdaSeries
Raises:
ValueError

Aggregation method not supported.

TypeError

Type not supported for aggregation.

Notes

It is not possible to create aggregations between columns that are stored in different Db2 Warehouse tables.

Examples

>>> idairis['SepalLength'] = idairis['SepalLength'] * 2
...
IdaDataFrame.__add__(other)[source]

Perform an addition between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] + 3
IdaDataFrame.__radd__(other)[source]

Enable the reflexivity of the addtion operation.

Examples

>>> ida = idadf['sepal_length'] + 3
>>> ida = 3 + idadf['sepal_length']
IdaDataFrame.__div__(other)[source]

Perform a division between self and another IdaDataFrame or number.

When __future__.division is not in effect.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] / 3
IdaDataFrame.__rdiv__(other)[source]

Enable the reflexivity of the division operation.

When __future__.division is not in effect.

Examples

>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
IdaDataFrame.__truediv__(other)[source]

Perform a division between self and another IdaDataFrame or number.

When __future__.division is in effect.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] / 3
IdaDataFrame.__rtruediv__(other)[source]

Enable the reflexivity of the division operation.

When __future__.division is in effect.

Examples

>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
IdaDataFrame.__floordiv__(other)[source]

Perform an integer division between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] // 3
IdaDataFrame.__rfloordiv__(other)[source]

Enable the reflexivity of the integer division operation.

Examples

>>> ida = idadf['sepal_length'] // 3
>>> ida = 3 // idadf['sepal_length']
IdaDataFrame.__mod__(other)[source]

Perform a modulo operation between self and another IdaDataFrame or number.

Notes

Arithmetic operations make sense if self has only numeric columns.

Examples

>>> ida = idadf['sepal_length'] % 3
IdaDataFrame.__rmod__(other)[source]

Enable the reflexivity of the modulo operation.

Examples

>>> ida = idadf['sepal_length'] % 3
>>> ida = 3 % idadf['sepal_length']
IdaDataFrame.__mul__(other)[source]

Perform a multiplication between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] * 3
IdaDataFrame.__rmul__(other)[source]

Enable the reflexivity of the multiplication operation.

Examples

>>> ida = idadf['sepal_length'] % 3
>>> ida = 3 % idadf['sepal_length']
IdaDataFrame.__neg__()[source]

Calculate the absolute negative of all columns in self.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

IdaDataFrame.__rpos__(other)[source]

Calculate the absolute positive. No operation required.

IdaDataFrame.__pow__(other)[source]

Perform a power operation between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] ** 3
IdaDataFrame.__rpow__(other)[source]

Enable the reflexivity of the power operation.

Examples

>>> ida = idadf['sepal_length'] ** 3
>>> ida = 3 ** idadf['sepal_length']
IdaDataFrame.__sub__(other)[source]

Perform a substraction between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] - 3
IdaDataFrame.__rsub__(other)[source]

Enable the reflexivity of the substraction operation.

Examples

>>> ida = idadf['sepal_length'] - 3
>>> ida = 3 - idadf['sepal_length']
IdaDataFrame.__delitem__(item)[source]

Enable non-destructive deletion of columns using a Pandas style syntax. This happens inplace, which means that the current IdaDataFrame is modified.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> idadf.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
>>> del idadf['sepal_length']
>>> idadf.columns
Index(['sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
IdaDataFrame.save_as(tablename, clear_existing=False)[source]

Save self as a table name in the remote database with the name tablename. This function might erase an existing table if tablename already exists and clear_existing is True.

DataBase Features

exists

IdaDataFrame.exists()[source]

Convenience function delegated from IdaDataBase.

Check if the data still exists in the database.

is_view / is_table

IdaDataFrame.is_view()[source]

Convenience function delegated from IdaDataBase.

Check if the IdaDataFrame corresponds to a view in the database.

IdaDataFrame.is_table()[source]

Convenience function delegated from IdaDataBase.

Check if the IdaDataFrame corresponds to a table in the database.

get_primary_key

IdaDataFrame.get_primary_key()[source]

Get the name of the primary key of self, if there is one. Otherwise, this function returns 0. This function may be deprecated in future versions because it is not very useful.

ida_query

IdaDataFrame.ida_query(query, silent=False, first_row_only=False, autocommit=False)[source]

Convenience function delegated from IdaDataBase.

Prepare, execute and format the result of a query in a data frame or in a tuple. See the IdaDataBase.ida_query documentation.

ida_scalar_query

IdaDataFrame.ida_scalar_query(query, silent=False, autocommit=False)[source]

Convenience function delegated from IdaDataBase.

Prepare and execute a query and return only the first element as a string. See the IdaDataBase.ida_scalar_query documentation.

Data Exploration

tail

IdaDataFrame.tail(nrow=5, sort=True)[source]

Print the n last rows of the instance, n is set to 5 by default.

Parameters:
nrowint > 0

The number of rows to be included in the result.

sort: default is True

If set to True and no indexer is set the data will be sorted by the first numeric column or if no numeric column is available by the first column of the dataframe. If set to False and no indexer is set the row order is not guaranteed and can vary with each execution. For big tables this option might save query processing time.

Returns:
DataFrame

The index of the corresponding row number and the columns are all columns of self.

Examples

>>> ida_iris.tail()
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

pivot_table

IdaDataFrame.pivot_table(values=None, columns=None, max_entries=1000, sort=None, factor_threshold=20, interactive=False, aggfunc='count')[source]

Compute an aggregation function over all rows of each column that is specified as a value on the dataset. The result grouped by the columns defined in “columns”.

Parameters:
values: str or list or str optional

List of columns on which “aggfunc” is computed.

columns: str or list or str optional

List of columns that is used as an index and by which the dataframe is grouped.

max_entries: int, default=1000

The maximum number of cells to be part of the output. By default, set to 1000.

sort: str, optional
Admissible values are: “alpha” and “factors”.
  • If “alpha”, the index of the output is sorted according to the alphabetical order.

  • If “factors”, the index of the output will be sorted according to increasing number of the distinct values.

By default, the index will be sorted in the same order that is specified in “columns” argument.

factor_threshold: int, default: 20

Number of distinct values above which a categorical column should not be considered categorical anymore and under which a numerical column column should not be considered numerical anymore.

interactive: bool

If True, the user is asked if he wants to display the output, given its size.

aggfunc: str

Aggregation function to be computed on each column specified in the argument “values”. Admissible values are: “count”, “sum”, “avg”. This entry is not case-sensitive.

Returns:
Pandas Series with Multi-index (columns)

Examples

>>> val = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> ida_iris.pivot_table(values= val, aggfunc="avg")
              species
sepal_length  setosa        5.006
              versicolor    5.936
              virginica     6.588
sepal_width   setosa        3.428
              versicolor    2.770
              virginica     2.974
petal_length  setosa        1.462
              versicolor    4.260
              virginica     5.552
petal_width   setosa        0.246
              versicolor    1.326
              virginica     2.026
dtype: float64

join

IdaDataFrame.join(other, on: str = None, how: str = 'left', lsuffix: str = '_x', rsuffix: str = '_y')[source]

Implement pandas-like interface to join tables

Parameters:
otherIdaDataFrame

IdaDataFrame to join to this object

onstr, optional

name of column in both IdaDataFrames to do join on

howstr, optional

type of join, possible types - ‘outer’, ‘inner’, ‘right’, ‘left’, ‘cross’

lsuffixstr, optional

suffix to add to columns in this IdaDataFrame that are present in both IdaDataFrames

rsuffixstr, optional

suffix to add to columns in other IdaDataFrame that are present in both IdaDataFrames

Returns:
IdaDataFrame

result of joining IdaDataFrames

Examples

>>> from nzpyida.sampledata.iris import iris
>>> from nzpyida.sampledata.iris import iris as iris2
>>> iris.reset_index()
>>> iris2.reset_index()
>>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length',             'PETAL_WIDTH', 'CLASS']
>>> iris2['id'] = iris2['id'] + 50
>>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index')
>>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id')
>>> iris_merge = iris1_ida.join(iris2_ida)
>>> iris_merge.tail()
    index       sepal_length_x  sepal_width_x   petal_length_x  petal_width     species     id  sepal_length_y  sepal_width_y   petal_length_y  PETAL_WIDTH     CLASS                                                                                           
95      145     6.7             3.0             5.2             2.3             virginica   145 5.7             3.0             4.2             1.2             versicolor
96      146     6.3             2.5             5.0             1.9             virginica   146 5.7             2.9             4.2             1.3             versicolor
97      147     6.5             3.0             5.2             2.0             virginica   147 6.2             2.9             4.3             1.3             versicolor
98      148     6.2             3.4             5.4             2.3             virginica   148 5.1             2.5             3.0             1.1             versicolor
99      149     5.9             3.0             5.1             1.8             virginica   149 5.7             2.8             4.1             1.3             versicolor
>>> len(iris_merge)
100

merge

IdaDataFrame.merge(right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, suffixes: List[str] = ['_x', '_y'], indicator: bool = False)[source]

Implement pandas-like interface to merge IdaDataFrames

Parameters:
rightIdaDataFrame

right IdaDataDrame to merge

howstr, optional

type of join, possible types - ‘outer’, ‘inner’, ‘right’, ‘left’, ‘cross’

onstr, optional

name of column in both IdaDataFrames to do join on

left_onstr, optional

name of column in left IdaDataFrame to do join on

right_onstr, optional

name of column in right IdaDataFrame to do join on

left_indexbool, optional

whether to join on indexer of left IdaDataFrame

right_indexbool, optional

whether to join on indexer of right IdaDataFrame

suffixes: List[str], optional

list of suffixes to add to columns that are present in both IdaDataFrames

indicator: bool optional

whether to add to output IdaDataFrame, column with information about the source of a given row

Returns:
IdaDataFrame

result of merging IdaDataFrames

Examples

>>> from nzpyida.sampledata.iris import iris
>>> from nzpyida.sampledata.iris import iris
>>> iris.reset_index()
>>> iris2.reset_index()
>>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length',             'PETAL_WIDTH', 'CLASS']
>>> iris2['id'] = iris2['id'] + 50
>>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index')
>>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id')
>>> iris_merge = iris1_ida.merge(iris2_ida, left_on='index', right_on='id')
>>> iris_merge.tail()
    index       sepal_length_x  sepal_width_x   petal_length_x  petal_width     species     id  sepal_length_y  sepal_width_y   petal_length_y  PETAL_WIDTH     CLASS                                                                                           
95      145     6.7             3.0             5.2             2.3             virginica   145 5.7             3.0             4.2             1.2             versicolor
96      146     6.3             2.5             5.0             1.9             virginica   146 5.7             2.9             4.2             1.3             versicolor
97      147     6.5             3.0             5.2             2.0             virginica   147 6.2             2.9             4.3             1.3             versicolor
98      148     6.2             3.4             5.4             2.3             virginica   148 5.1             2.5             3.0             1.1             versicolor
99      149     5.9             3.0             5.1             1.8             virginica   149 5.7             2.8             4.1             1.3             versicolor
>>> len(iris_merge)
100
>>> iris_merge = iris1_ida.merge() iris2_ida, left_on='index', right_index=True, 
                    how='outer', indicator=True, suffixes=("_a", "_b"))
>>> iris_merge.head()
    index       sepal_length_a  sepal_width_a   petal_length_a  petal_width     species         id      sepal_length_b  sepal_width_b   petal_length_b  PETAL_WIDTH CLASS       INDICATOR
0       None    None            None            None            None            None            152     7.1             3.0             5.9             2.1         virginica   right_only
1       None    None            None            None            None            None            151     5.8             2.7             5.1             1.9         virginica   right_only
2       None    None            None            None            None            None            154     6.5             3.0             5.8             2.2         virginica   right_only
3       None    None            None            None            None            None            153     6.3             2.9             5.6             1.8         virginica   right_only
4       None    None            None            None            None            None            150     6.3             3.3             6.0             2.5         virginica   right_only

concat

join_tables.concat(axis: int = 0, join: str = 'outer', keys: List[str] = None)

Implement pandas-like interface to concateate IdaDataFrames

Parameters:
objsList[IdaDataFrame]

list of IdaDataFrames to be concatenated

axisint, optional

axis to concatenate on

joinstr, optional

type of join, possible types - ‘outer’, ‘inner’

keysList[str], optional

if List is present then add column with table indetifier

Returns:
IdaDataFrame

reult of concataneting IdaDataFrames

Examples

>>> from nzpyida.sampledata.iris import iris
>>> from nzpyida.sampledata.iris import iris as iris2
>>> iris.reset_index()
>>> iris2.reset_index()
>>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length',         'PETAL_WIDTH', 'CLASS']
>>> iris2['id'] = iris2['id'] + 50
>>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index')
>>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id')
>>> iris_concat = nzpyida.concat([iris1_ida, iris2_ida])
>>> len(iris1_ida), len(iris2_ida), len(iris_concat)
(150, 150, 300)
>>> iris_concat.head()
    index   sepal_length    sepal_width     petal_length    petal_width     species     id      PETAL_WIDTH CLASS
0   None    4.9             3.0             1.4             None            None        51      0.2         setosa
1   None    4.6             3.1             1.5             None            None        53      0.2         setosa
2   None    4.7             3.2             1.3             None            None        52      0.2         setosa
3   None    5.1             3.5             1.4             None            None        50      0.2         setosa
4   None    5.9             3.0             5.1             None            None        199     1.8         virginica
>>> iris_concat.tail()
    index   sepal_length    sepal_width     petal_length    petal_width     species         id      PETAL_WIDTH     CLASS                                                                   
295 145     6.7             3.0             5.2             2.3             virginica       None    None            None
296 146     6.3             2.5             5.0             1.9             virginica       None    None            None
297 147     6.5             3.0             5.2             2.0             virginica       None    None            None
298 148     6.2             3.4             5.4             2.3             virginica       None    None            None
299 149     5.9             3.0             5.1             1.8             virginica       None    None            None

sort

IdaDataFrame.sort(columns=None, axis=0, ascending=True, inplace=False)[source]

Sort the IdaDataFrame row wise or column wise.

Parameters:
columnsstr or list of str

Columns that should be used to sort the rows in the IdaDataFrame. If columns is set to None and axis to 0, then the IdaDataFrame columns are sorted in lexicographical order.

axisint (0/1)

Axis that is sorted. 0 for sorting row wise, 1 for sorting column wise.

ascendingbool, default: True

Sorting order, True : ascending, False : descending

inplacebool, default: False

The current object is modified or creates a modified copy. If False, the function creates a modified copy of the current dataframe. If True, the function modifies the current dataframe.

Raises:
ValueError
  • When sorting by column (column not None), the axis value must be 0 (rows).

  • A column does not belong to self.

  • The axis argument has a value other than 0 or 1.

Notes

If columns is set to None and axis to 0, this undoes all sorting the IdaDataFrame and returns the original sorting in the Db2 Warehouse database.

No actual changes are made in Db2 Warehouse, only the querying changes. Everything is registered in an InternalState object. Changes can be observed by using head and tail function.

IdaDataFrame.groupby(by)[source]

Creates the groupby object

Parameters:
bystr

Column to group the Data Frame by

Returns:
IdaDataFrameGroupBy

Examples

>>> ida_iris.groupby("CLASS")
<nzpyida.groupby.IdaDataFrameGroupBy at 0x11c288e10>

Obtaing IdaDataFrame with grouped data is possible with aggregate methods of IdaDataFrameGroupBy

IdaDataFrameGroupBy

class nzpyida.groupby.IdaDataFrameGroupBy(in_df, columns_to_aggregate, by_column)[source]

Class representing groupby object, that is object created from IdaDataFrame. It implements Pandas-like object for grouping.

Examples

>>> ida_iris.groupby("CLASS")
<nzpyida.groupby.IdaDataFrameGroupBy at 0x11c288e10>
>>> df.groupby("CLASS").mean().head()
        AVG_SEPAL_LENGTH    AVG_SEPAL_WIDTH         AVG_PETAL_LENGTH        AVG_PETAL_WIDTH         CLASS
    0       5.006               3.418                   1.464                   0.244                   Iris-setosa
    1       5.936               2.770                   4.260                   1.326                   Iris-versicolor
    2       6.588               2.974                   5.552                   2.026                   Iris-virginica
>>> mean_sepal_length = ida_iris[["SEPAL_LENGTH", "CLASS"]].groupby("CLASS").mean()
>>> mean_sepal_length.head()
        AVG_SEPAL_LENGTH    CLASS
    0       5.006               Iris-setosa
    1       5.936               Iris-versicolor
    2       6.588               Iris-virginica
>>> iris_groupby = ida_iris[(ida_iris["SEPAL_LENGTH"]>4) & (ida_iris["SEPAL_WIDTH"]<3)][
>>>     ["SEPAL_WIDTH", "SEPAL_LENGTH", "CLASS"]].groupby("CLASS")
>>> iris_groupby
<nzpyida.groupby.IdaDataFrameGroupBy object at 0x119863190>
>>> iris_groupby.count().head()
        COUNT_SEPAL_WIDTH   COUNT_SEPAL_LENGTH      CLASS
    0       2                       2                       Iris-setosa
    1       21                      21                      Iris-virginica
    2       34                      34                      Iris-versicolor

Methods

count()

Aggregates all columns with COUNT method.

max()

Aggregates all columns with MAX method.

mean()

Aggregates all columns with AVG method.

min()

Aggregates all columns with MIN method.

sum()

Aggregates all columns with SUM method.

count()[source]

Aggregates all columns with COUNT method. Counts all non null values in column with respect to grouped classes

Returns:
IdaDataFrame

Object with grouped data

max()[source]

Aggregates all columns with MAX method. Selects the maximum value from column with respect to grouped classes

Returns:
IdaDataFrame

Object with grouped data

mean()[source]

Aggregates all columns with AVG method. Calculates mean of values in column with respect to grouped classes

Returns:
IdaDataFrame

Object with grouped data

min()[source]

Aggregates all columns with MIN method. Selects the minimum value from column with respect to grouped classes

Returns:
IdaDataFrame

Object with grouped data

sum()[source]

Aggregates all columns with SUM method. Sums all the values in column with respect to grouped classes

Returns:
IdaDataFrame

Object with grouped data

Descriptive Statistics

describe

IdaDataFrame.describe(percentiles=[0.25, 0.5, 0.75])[source]

A basic statistical summary about current IdaDataFrame. If at least one numerical column exists, the summary includes:

  • The count of non-missing values for each numerical column.

  • The mean for each numerical column.

  • The standart deviation for each numerical column.

  • The minimum and maximum for each numerical column.

  • A list of percentiles set by the user (default : the quartiles).

Parameters:
idadfIdaDataFrame
percentilesFloat or list of floats, default: [0.25, 0.50, 0.75].

percentiles to be computed on numerical columns. All values in percentiles must be > 0 and < 1.

Returns:
summary: DataFrame, where
  • Index is the name of the computed values.

  • Columns are either numerical or categorical columns of self.

cov (covariance)

IdaDataFrame.cov()[source]

Compute the covariance matrix, composed of covariance coefficients between all pairs of columns in self. It must have at least two numeric columns.

Returns:
covariance matrix: DataFrame

The axes are the columns of self and the values are the covariance coefficients.

corr (correlation)

IdaDataFrame.corr(method='pearson', features=None, ignore_indexer=True)[source]

Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self. It must have at least two numeric columns.

Parameters:
methodstr, default: pearson

Method to be used to compute the correlation. By default, compute the pearson correlation coefficient. The Spearman rank correlation is also available. Admissible values are: “pearson”, “spearman”.

Returns:
correlation matrix: DataFrame

The axes are the columns of self and the values are the correlation coefficients.

Notes

For the Spearman rank correlation, the ordinal rank of columns is computed. For performance reasons this is easier to compute than the fractional rank traditionally computed for the Spearman rank correlation method. This strategy has the property that the sum of the ranking numbers is the same as under ordinal ranking. We then apply the pearson correlation coefficient method to these ranks.

quantile

IdaDataFrame.quantile(q=0.5)[source]

Compute row wise quantiles for each numeric column.

Parameters:
qfloat or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute

Returns:
quantiles: Series or DataFrame

If q is an array, the function returns a DataFrame in which the index is q. The columns are the columns of sel, and the values are the quantiles. If q is a float, a Series is returned where the index is the columns of self and the values are the quantiles.

mad (mean absolute deviation)

IdaDataFrame.mad()[source]

Compute the mean absolute distance for all numeric columns of self. It must have at least two numeric columns.

Returns:
mad: Series

The index consists of the columns of self and the values are the mean absolute distance.

min (minimum)

IdaDataFrame.min()[source]

Compute the minimum value for all numerics column of self.

Returns:
min: Series

The index consists of the columns of self and the values are the minimum.

max (maximum)

IdaDataFrame.max()[source]

Compute the maximum value over for all numeric columns of self.

Returns:
max: Series.

The index consists of the columns of self and the values are the maximum.

count

IdaDataFrame.count()[source]

Compute the count of non-missing values for all columns of self.

Returns:
count: Series.

The index consists of the columns of self and the values are the number of non-missing values.

count_distinct

IdaDataFrame.count_distinct()[source]

Compute the count of distinct values for all numeric columns of self.

Returns:
disctinct count: Series

The index consists of the columns of self and values are the number of distinct values.

std (standard deviation)

IdaDataFrame.std()[source]

Compute the standard deviation for all numeric columns of self.

Returns:
std: Series

The index consists of the columns of self and the values are the standard deviation.

var (variance)

IdaDataFrame.var()[source]

Compute the variance for all numeric columns of self.

Returns:
var: Series

The index consists of the columns of self and the values are the variance.

mean

IdaDataFrame.mean()[source]

Compute the mean for each numeric columns of self.

Returns:
mean: Series

The index consists of the columns of self and the values are the mean.

sum

IdaDataFrame.sum()[source]

Compute the sum of values for all numeric columns of self.

Returns:
sum: Series

The index consists of the columns of self and the values are the sum.

median

IdaDataFrame.median()[source]

Compute the median for all numeric columns of self.

Returns:
median: Series

The index consists of the columns of self and the values are the median.

Import as DataFrame

as_dataframe

IdaDataFrame.as_dataframe()[source]

Download and return an in-memory representation of the dataset as a Pandas DataFrame.

Returns:
DataFrame

Columns and records are the same as in self.

Examples

>>> iris = ida_iris.as_dataframe()
>>> iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Connection Management

commit

IdaDataFrame.commit()[source]

Commit operations in the database.

Notes

All changes that are made in the database after the last commit, including those in the child IdaDataFrames, are commited.

If the environment variable ‘VERBOSE’ is set to True, the commit operations are notified in the console.

rollback

IdaDataFrame.rollback()[source]

Rollback operations in the database.

Notes

All changes that are made in the database after the last commit, including those in the child IdaDataFrames, are discarded.