IdaDataFrame
Open an IdaDataFrame Object.
- class nzpyida.frame.IdaDataFrame(idadb, tablename, indexer=None)[source]
An IdaDataFrame object is a reference to a table in a remote Db2 Warehouse database. IDA stands for In-DataBase Analytics. IdaDataFrame copies the Pandas interface for DataFrame objects to ensure intuitive interaction for end-users.
Examples
>>> idadb = IdaDataBase('BLUDB') # See documentation for IdaDataBase >>> ida_iris = IdaDataFrame(idadb, 'IRIS') >>> ida_iris.cov() sepal_length sepal_width petal_length petal_width sepal_length 0.685694 -0.042434 1.274315 0.516271 sepal_width -0.042434 0.189979 -0.329656 -0.121639 petal_length 1.274315 -0.329656 3.116278 1.295609 petal_width 0.516271 -0.121639 1.295609 0.581006
- Attributes:
Methods
Download and return an in-memory representation of the dataset as a Pandas DataFrame.
commit()Commit operations in the database.
corr([method, features, ignore_indexer])Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self.
corrwith(other)Compute the correlation matrix, composed of correlation coefficients between the columns of self and the columns of another IdaDataFrame.
count()Compute the count of non-missing values for all columns of self.
Compute the count of distinct values for all numeric columns of self.
count_groupby([columns, count_only, having])Count the occurence of the values of a column or group of columns
cov()Compute the covariance matrix, composed of covariance coefficients between all pairs of columns in self.
delete_na(columns[, logic, inplace])Filter rows containing NULL values.
describe([percentiles])A basic statistical summary about current IdaDataFrame.
exists()Convenience function delegated from IdaDataBase.
Get the name of the primary key of self, if there is one.
groupby(by)Creates the groupby object
head([nrow, sort])Print the n first rows of the instance, n is set to 5 by default.
ida_query(query[, silent, first_row_only, ...])Convenience function delegated from IdaDataBase.
ida_scalar_query(query[, silent, autocommit])Convenience function delegated from IdaDataBase.
info([buf])Some information about current IdaDataFrame.
is_table()Convenience function delegated from IdaDataBase.
is_view()Convenience function delegated from IdaDataBase.
join(other[, on, how, lsuffix, rsuffix])Implement pandas-like interface to join tables
levels([columns])Return the numbers of distinct values
mad()Compute the mean absolute distance for all numeric columns of self.
max()Compute the maximum value over for all numeric columns of self.
mean()Compute the mean for each numeric columns of self.
mean_freq_of_instance([columns])Return the average occurence of the values of a column or group of columns
median()Compute the median for all numeric columns of self.
merge(right[, how, on, left_on, right_on, ...])Implement pandas-like interface to merge IdaDataFrames
min()Compute the minimum value for all numerics column of self.
mode()Compute the most common value for each non numeric column self.
pivot_table([values, columns, max_entries, ...])Compute an aggregation function over all rows of each column that is specified as a value on the dataset.
quantile([q])Compute row wise quantiles for each numeric column.
rank()Compute the rank over all entries for all columns of self.
rollback()Rollback operations in the database.
save_as(tablename[, clear_existing])Save self as a table name in the remote database with the name tablename.
sort([columns, axis, ascending, inplace])Sort the IdaDataFrame row wise or column wise.
std()Compute the standard deviation for all numeric columns of self.
sum()Compute the sum of values for all numeric columns of self.
summary()A basic statistical summary about current IdaDataFrame.
tail([nrow, sort])Print the n last rows of the instance, n is set to 5 by default.
train_test_split(train_table, test_table, ...)Split the table into train and test sets
unique(column)Return the unique values of a column
var()Compute the variance for all numeric columns of self.
mean_groupby
print
within_class_std
within_class_var
- __init__(idadb, tablename, indexer=None)[source]
Constructor for IdaDataFrame objects.
- Parameters:
- idadbIdaDataBase
IdaDataBase instance which contains the connection to be used.
- tablenamestr
Name of the table to be opened in the database. It should contain only alphanumeric characters and underscores. All lower case characters will be converted to upper case characters.
- indexerstr, optional
Name of the column that should be used as an index. This is optional. However, if no indexer is given, the order of rows issued by the head and tail functions is not guaranteed. Also, several in-database machine learning algorithms need an indexer as a parameter to be executed.
Notes
Attributes “type”, “dtypes”, “index”, “columns”, “axes”, and “shape” are evaluated in a lazy way to avoid an overhead when creating an IdaDataFrame. Sometimes the index may be too big to be downloaded.
Examples
>>> idadb = IdaDataBase('BLUDB') >>> ida_iris = IdaDataFrame(idadb, "IRIS")
- Attributes:
- _idadbIdaDataBase
IdaDataBase object parent of the current instance.
- tablenamestr
Name of the table self references.
- namestr
Full name of the table self references, including schema.
- schemastr
Name of the schema the table belongs to.
- indexerstr
Name of the column used as an index. “None” if no indexer.
- locstr
Indexer that enables the selection and projection of IdaDataFrame instances. For more information, see the loc class documentation.
- internal_stateInternalState
Object used to internally store the state of the IdaDataFrame. It also allows several non-destructive manipulation methods.
- typestr
Type of the IdaDataFrame : “Table”, “View”, or “Unknown”.
- dtypesDataFrame
Data type in the database for each column.
- indexpandas.core.index
Index containing the row names in this table.
- columnspandas.core.index
Index containing the columns names in this table.
- axeslist
List containing columns and index attributes.
- shapeTuple
Number of rows and number of columns.
Methods
DataFrame introspection
internal_state
indexer
- IdaDataFrame.indexer
The indexer attribute refers to the name of a column that should be used to index the table. This makes sense because Db2 Warehouse is a column-based database, so row IDs do not make sense and are not deterministic. As a consequence, the only way to address a particular row is to refer to it by its index. If no indexer is provided, ibmdbpy still works but a correct row order is not guaranteed as far as the dataset is not sorted. Also, note that the indexer column is not taken into account in data mining algorithms.
type
dtypes
index
columns
- IdaDataFrame.columns
Index containing the column names in self.
- Returns:
- Index
Examples
>>> ida_iris.columns Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
axes
shape
empty
- IdaDataFrame.empty
Boolean that is True if the table is empty (no rows).
- Returns:
- Boolean
__len__
__iter__
DataFrame modification
Selection, Projection
- IdaDataFrame.__getitem__(item)[source]
Enable label based projection (selection of columns) in IdaDataFrames.
Enable slice based selection of rows in IdaDataFrames.
Enable row filtering.
The syntax is similar to Pandas.
Notes
The row order is not guaranteed if no indexer is given and the dataset is not sorted
Examples
>>> idadf['col1'] # return an IdaSeries
>>> idadf[['col1']] # return an IdaDataFrame with one column
>>> idadf[['col1', 'col2', 'col3']] # return an IdaDataFrame with 3 columns
>>> idadf[0:9] # Select the 10 first rows
>>> idadf[idadf['col1'] = "test"] # select of rows for which attribute col1 is equal to "test"
Selection and Projection are also possible using the nzpyida.Loc object stored in IdaDataFrame.loc.
Filtering
- IdaDataFrame.delete_na(columns, logic='any', inplace=False)[source]
Filter rows containing NULL values. Can be done in a destructive way (rows are deleted in the physical table) or in a non destructive way (a new IdaDataFrame is defined, original table and IdaDataFrame are preserved).
- Parameters:
- columnslist of strings
list of eligible column names
- logicstr, optional
“any” by default. If logic is set to “any” then all rows which contain a NaN value in any (id est at least one) of the cited columns is deleted, this is a union condition; if logic is set to “all”, then only rows containing null values in all the fields are deleted, this is an intersection condition.
- inplacebool, optional
False by default. If True, then the underlying table is physically modified, if False, then the original objects remain unaffected, a copy of the IdaDataFrame is made and modified.
- Returns:
- If inplace is True, the original table is modified and the IdaDataFrame will be modified accordingly. No return.
- If inplace is False, a new IdaDataFrame is returned, the original table is not affected. This new IdaDataFrame points to a
- new table which has been created according to the user defined criterions on columns.
Examples
>>> idadf.delete_na(["COL_1", "COL_2"], inplace = True) >>> #each row of the original table is physically deleted if there is a NULL value in one of the listed columns >>> new_idadf = idadf.delete_na(["COL_1", "COL_2"], logic = 'all') >>> # a new IdaDataFrame is created, rows will be selected only if all the listed columns have a NULL value.
- class nzpyida.filtering.FilterQuery(columns, tablename, method, value)[source]
FilterQueries are used to represent the filtering of an IdaDataFrame in ibmdbpy. The use of comparison operators, such as <, <=, ==, >=, > on an IdaDataFrame instance produces a FilterQuery instance which acts as a container for the where clause of the corresponding SQL request.
Filtering is possible using a Pandas-like syntax. Applying comparison operators to IdaDataFrames produces a FilterQuery instance which contains the string embedding the corresponding where clause in the “wherestr” attribute.
FilterQuery objects also contain a logic that allows them to be combined, thus allowing complex filtering.
You can combine the following operators: |, &, ^ (OR, AND and XOR)
Notes
It is not possible to filter an IdaDataFrame by using an IdaDataFrame that is opened in a different data source in the database. This is due to the fact that, using a Pandas-like syntax, “idadf[‘petal_width’] < 5” will return a Boolean array that is used to subset the original DataFrame. This is a fundamental restriction of ibmdbpy: we cannot afford to compute and download such an array because we cannot assume that the result will fit into user’s memory. Download time can also be a performance issue.
Examples
>>> idadf[['sepal_length', 'petal_width'] < 5] >>> <ibmdbpy.filtering.FilterQuery at 0xa65ba90> >>> _.wherestr '("sepal_length" < 5 AND "petal_width" < 5)'
>>> idadf[idadf[['sepal_length', 'petal_width']] < 5] <ibmdbpy.frame.IdaDataFrame at 0xa73a860> >>> _.head() # filtered IdaDataFrame sepal_length sepal_width petal_length petal_width species 0 4.4 2.9 1.4 0.2 setosa 1 4.7 3.2 1.6 0.2 setosa 2 4.9 2.5 4.5 1.7 virginica 3 4.9 2.4 3.3 1.0 versicolor 4 4.6 3.2 1.4 0.2 setosa
>>> idadf[(idadf['sepal_length'] < 5) & (idadf[petal_width'] > 1.5)] <ibmdbpy.frame.IdaDataFrame at 0xa74b9b0> >>> _.head() sepal_length sepal_width petal_length petal_width species 0 4.9 2.5 4.5 1.7 virginica
- Attributes:
queryReturn an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.
- property query
Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.
- IdaDataFrame.__lt__(value)[source]
ibmdbpy.filtering.FilterQuery object when comparing self using “<”.
- IdaDataFrame.__le__(value)[source]
ibmdbpy.filtering.FilterQuery object when comparing self using “<=”.
- IdaDataFrame.__eq__(value)[source]
ibmdbpy.filtering.FilterQuery object when comparing self using “==”.
- IdaDataFrame.__ne__(value)[source]
ibmdbpy.filtering.FilterQuery object when comparing self using “!=”.
Feature Engineering
- IdaDataFrame.__setitem__(key, item)[source]
Enable the creation and aggregation of columns.
Examples
>>> idadf['new'] = idadf['sepal_length'] * idadf['sepalwidth'] # select a new column as the product of two existing columns
>>> idadf['sepal_length'] = idadf['sepal_length'] / idadf['sepal_length'].mean() # modify an existing column
- aggregation.aggregate_idadf(method, other, swap=False)
Modify internal state variables to represent the aggregation of columns of an IdaDataFrame or IdaSeries in ibmdbpy.
The following comparison operators are supported: +, *, /, -, //, %, **.
The syntax is similar to Pandas.
- Parameters:
- idadfIdaDataFrame or IdaSeries
IdaDataFrame or IdaSerie on the left (if swap is False)
- methodstr
Aggregation method that is computed: the following values are admissible: “add”,”mul”,”div”,”sub”,”floordiv”,”mod”,”neg”,”pow”
- other: Number or IdaDataFrame or IdaSeries
Another object that idadf will be aggregated with (on the right if swap is False).
- swapbool, default: False
Internally used to handle cases where the call is made reflexively, that is when the main IdaDataFrame/IdaSeries is not on the left. If swap is True, this also implies that other is not of type IdaDataFrame/IdaSeries.
- Returns:
- Aggregated IdaDataFrame or IdaSeries
- Raises:
- ValueError
Aggregation method not supported.
- TypeError
Type not supported for aggregation.
Notes
It is not possible to create aggregations between columns that are stored in different Db2 Warehouse tables.
Examples
>>> idairis['SepalLength'] = idairis['SepalLength'] * 2 ...
- IdaDataFrame.__add__(other)[source]
Perform an addition between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] + 3
- IdaDataFrame.__radd__(other)[source]
Enable the reflexivity of the addtion operation.
Examples
>>> ida = idadf['sepal_length'] + 3
>>> ida = 3 + idadf['sepal_length']
- IdaDataFrame.__div__(other)[source]
Perform a division between self and another IdaDataFrame or number.
When __future__.division is not in effect.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] / 3
- IdaDataFrame.__rdiv__(other)[source]
Enable the reflexivity of the division operation.
When __future__.division is not in effect.
Examples
>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
- IdaDataFrame.__truediv__(other)[source]
Perform a division between self and another IdaDataFrame or number.
When __future__.division is in effect.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] / 3
- IdaDataFrame.__rtruediv__(other)[source]
Enable the reflexivity of the division operation.
When __future__.division is in effect.
Examples
>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
- IdaDataFrame.__floordiv__(other)[source]
Perform an integer division between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] // 3
- IdaDataFrame.__rfloordiv__(other)[source]
Enable the reflexivity of the integer division operation.
Examples
>>> ida = idadf['sepal_length'] // 3
>>> ida = 3 // idadf['sepal_length']
- IdaDataFrame.__mod__(other)[source]
Perform a modulo operation between self and another IdaDataFrame or number.
Notes
Arithmetic operations make sense if self has only numeric columns.
Examples
>>> ida = idadf['sepal_length'] % 3
- IdaDataFrame.__rmod__(other)[source]
Enable the reflexivity of the modulo operation.
Examples
>>> ida = idadf['sepal_length'] % 3
>>> ida = 3 % idadf['sepal_length']
- IdaDataFrame.__mul__(other)[source]
Perform a multiplication between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] * 3
- IdaDataFrame.__rmul__(other)[source]
Enable the reflexivity of the multiplication operation.
Examples
>>> ida = idadf['sepal_length'] % 3
>>> ida = 3 % idadf['sepal_length']
- IdaDataFrame.__neg__()[source]
Calculate the absolute negative of all columns in self.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
- IdaDataFrame.__pow__(other)[source]
Perform a power operation between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] ** 3
- IdaDataFrame.__rpow__(other)[source]
Enable the reflexivity of the power operation.
Examples
>>> ida = idadf['sepal_length'] ** 3
>>> ida = 3 ** idadf['sepal_length']
- IdaDataFrame.__sub__(other)[source]
Perform a substraction between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] - 3
- IdaDataFrame.__rsub__(other)[source]
Enable the reflexivity of the substraction operation.
Examples
>>> ida = idadf['sepal_length'] - 3
>>> ida = 3 - idadf['sepal_length']
- IdaDataFrame.__delitem__(item)[source]
Enable non-destructive deletion of columns using a Pandas style syntax. This happens inplace, which means that the current IdaDataFrame is modified.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS") >>> idadf.columns Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object') >>> del idadf['sepal_length'] >>> idadf.columns Index(['sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
DataBase Features
exists
is_view / is_table
get_primary_key
ida_query
ida_scalar_query
Data Exploration
head
- IdaDataFrame.head(nrow=5, sort=True)[source]
Print the n first rows of the instance, n is set to 5 by default.
- Parameters:
- nrowint > 0
Number of rows to be included in the result.
- sort: default is True
If set to True and no indexer is set the data will be sorted by the first numeric column or if no numeric column is available by the first column of the dataframe. If set to False and no indexer is set the row order is not guaranteed and can vary with each execution. For big tables this option might save query processing time.
- Returns:
- DataFrame or Series
The index of the corresponding row number and the columns are all columns of self. If the IdaDataFrame has only one column, it returns a Series.
Examples
>>> ida_iris.head() sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa
tail
- IdaDataFrame.tail(nrow=5, sort=True)[source]
Print the n last rows of the instance, n is set to 5 by default.
- Parameters:
- nrowint > 0
The number of rows to be included in the result.
- sort: default is True
If set to True and no indexer is set the data will be sorted by the first numeric column or if no numeric column is available by the first column of the dataframe. If set to False and no indexer is set the row order is not guaranteed and can vary with each execution. For big tables this option might save query processing time.
- Returns:
- DataFrame
The index of the corresponding row number and the columns are all columns of self.
Examples
>>> ida_iris.tail() sepal_length sepal_width petal_length petal_width species 145 6.7 3.0 5.2 2.3 virginica 146 6.3 2.5 5.0 1.9 virginica 147 6.5 3.0 5.2 2.0 virginica 148 6.2 3.4 5.4 2.3 virginica 149 5.9 3.0 5.1 1.8 virginica
pivot_table
- IdaDataFrame.pivot_table(values=None, columns=None, max_entries=1000, sort=None, factor_threshold=20, interactive=False, aggfunc='count')[source]
Compute an aggregation function over all rows of each column that is specified as a value on the dataset. The result grouped by the columns defined in “columns”.
- Parameters:
- values: str or list or str optional
List of columns on which “aggfunc” is computed.
- columns: str or list or str optional
List of columns that is used as an index and by which the dataframe is grouped.
- max_entries: int, default=1000
The maximum number of cells to be part of the output. By default, set to 1000.
- sort: str, optional
- Admissible values are: “alpha” and “factors”.
If “alpha”, the index of the output is sorted according to the alphabetical order.
If “factors”, the index of the output will be sorted according to increasing number of the distinct values.
By default, the index will be sorted in the same order that is specified in “columns” argument.
- factor_threshold: int, default: 20
Number of distinct values above which a categorical column should not be considered categorical anymore and under which a numerical column column should not be considered numerical anymore.
- interactive: bool
If True, the user is asked if he wants to display the output, given its size.
- aggfunc: str
Aggregation function to be computed on each column specified in the argument “values”. Admissible values are: “count”, “sum”, “avg”. This entry is not case-sensitive.
- Returns:
- Pandas Series with Multi-index (columns)
Examples
>>> val = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'] >>> ida_iris.pivot_table(values= val, aggfunc="avg") species sepal_length setosa 5.006 versicolor 5.936 virginica 6.588 sepal_width setosa 3.428 versicolor 2.770 virginica 2.974 petal_length setosa 1.462 versicolor 4.260 virginica 5.552 petal_width setosa 0.246 versicolor 1.326 virginica 2.026 dtype: float64
join
- IdaDataFrame.join(other, on: str = None, how: str = 'left', lsuffix: str = '_x', rsuffix: str = '_y')[source]
Implement pandas-like interface to join tables
- Parameters:
- otherIdaDataFrame
IdaDataFrame to join to this object
- onstr, optional
name of column in both IdaDataFrames to do join on
- howstr, optional
type of join, possible types - ‘outer’, ‘inner’, ‘right’, ‘left’, ‘cross’
- lsuffixstr, optional
suffix to add to columns in this IdaDataFrame that are present in both IdaDataFrames
- rsuffixstr, optional
suffix to add to columns in other IdaDataFrame that are present in both IdaDataFrames
- Returns:
- IdaDataFrame
result of joining IdaDataFrames
Examples
>>> from nzpyida.sampledata.iris import iris >>> from nzpyida.sampledata.iris import iris as iris2 >>> iris.reset_index() >>> iris2.reset_index() >>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'PETAL_WIDTH', 'CLASS'] >>> iris2['id'] = iris2['id'] + 50 >>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index') >>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id') >>> iris_merge = iris1_ida.join(iris2_ida) >>> iris_merge.tail() index sepal_length_x sepal_width_x petal_length_x petal_width species id sepal_length_y sepal_width_y petal_length_y PETAL_WIDTH CLASS 95 145 6.7 3.0 5.2 2.3 virginica 145 5.7 3.0 4.2 1.2 versicolor 96 146 6.3 2.5 5.0 1.9 virginica 146 5.7 2.9 4.2 1.3 versicolor 97 147 6.5 3.0 5.2 2.0 virginica 147 6.2 2.9 4.3 1.3 versicolor 98 148 6.2 3.4 5.4 2.3 virginica 148 5.1 2.5 3.0 1.1 versicolor 99 149 5.9 3.0 5.1 1.8 virginica 149 5.7 2.8 4.1 1.3 versicolor >>> len(iris_merge) 100
merge
- IdaDataFrame.merge(right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, suffixes: List[str] = ['_x', '_y'], indicator: bool = False)[source]
Implement pandas-like interface to merge IdaDataFrames
- Parameters:
- rightIdaDataFrame
right IdaDataDrame to merge
- howstr, optional
type of join, possible types - ‘outer’, ‘inner’, ‘right’, ‘left’, ‘cross’
- onstr, optional
name of column in both IdaDataFrames to do join on
- left_onstr, optional
name of column in left IdaDataFrame to do join on
- right_onstr, optional
name of column in right IdaDataFrame to do join on
- left_indexbool, optional
whether to join on indexer of left IdaDataFrame
- right_indexbool, optional
whether to join on indexer of right IdaDataFrame
- suffixes: List[str], optional
list of suffixes to add to columns that are present in both IdaDataFrames
- indicator: bool optional
whether to add to output IdaDataFrame, column with information about the source of a given row
- Returns:
- IdaDataFrame
result of merging IdaDataFrames
Examples
>>> from nzpyida.sampledata.iris import iris >>> from nzpyida.sampledata.iris import iris >>> iris.reset_index() >>> iris2.reset_index() >>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'PETAL_WIDTH', 'CLASS'] >>> iris2['id'] = iris2['id'] + 50 >>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index') >>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id') >>> iris_merge = iris1_ida.merge(iris2_ida, left_on='index', right_on='id') >>> iris_merge.tail() index sepal_length_x sepal_width_x petal_length_x petal_width species id sepal_length_y sepal_width_y petal_length_y PETAL_WIDTH CLASS 95 145 6.7 3.0 5.2 2.3 virginica 145 5.7 3.0 4.2 1.2 versicolor 96 146 6.3 2.5 5.0 1.9 virginica 146 5.7 2.9 4.2 1.3 versicolor 97 147 6.5 3.0 5.2 2.0 virginica 147 6.2 2.9 4.3 1.3 versicolor 98 148 6.2 3.4 5.4 2.3 virginica 148 5.1 2.5 3.0 1.1 versicolor 99 149 5.9 3.0 5.1 1.8 virginica 149 5.7 2.8 4.1 1.3 versicolor >>> len(iris_merge) 100 >>> iris_merge = iris1_ida.merge() iris2_ida, left_on='index', right_index=True, how='outer', indicator=True, suffixes=("_a", "_b")) >>> iris_merge.head() index sepal_length_a sepal_width_a petal_length_a petal_width species id sepal_length_b sepal_width_b petal_length_b PETAL_WIDTH CLASS INDICATOR 0 None None None None None None 152 7.1 3.0 5.9 2.1 virginica right_only 1 None None None None None None 151 5.8 2.7 5.1 1.9 virginica right_only 2 None None None None None None 154 6.5 3.0 5.8 2.2 virginica right_only 3 None None None None None None 153 6.3 2.9 5.6 1.8 virginica right_only 4 None None None None None None 150 6.3 3.3 6.0 2.5 virginica right_only
concat
- join_tables.concat(axis: int = 0, join: str = 'outer', keys: List[str] = None)
Implement pandas-like interface to concateate IdaDataFrames
- Parameters:
- objsList[IdaDataFrame]
list of IdaDataFrames to be concatenated
- axisint, optional
axis to concatenate on
- joinstr, optional
type of join, possible types - ‘outer’, ‘inner’
- keysList[str], optional
if List is present then add column with table indetifier
- Returns:
- IdaDataFrame
reult of concataneting IdaDataFrames
Examples
>>> from nzpyida.sampledata.iris import iris >>> from nzpyida.sampledata.iris import iris as iris2 >>> iris.reset_index() >>> iris2.reset_index() >>> iris2.columns = ['id', 'sepal_length', 'sepal_width', 'petal_length', 'PETAL_WIDTH', 'CLASS'] >>> iris2['id'] = iris2['id'] + 50 >>> iris1_ida = idadb.as_idadataframe(iris, 'IRIS_TEST1', indexer='index') >>> iris2_ida = idadb.as_idadataframe(iris2, 'IRIS_TEST2', indexer='id') >>> iris_concat = nzpyida.concat([iris1_ida, iris2_ida]) >>> len(iris1_ida), len(iris2_ida), len(iris_concat) (150, 150, 300) >>> iris_concat.head() index sepal_length sepal_width petal_length petal_width species id PETAL_WIDTH CLASS 0 None 4.9 3.0 1.4 None None 51 0.2 setosa 1 None 4.6 3.1 1.5 None None 53 0.2 setosa 2 None 4.7 3.2 1.3 None None 52 0.2 setosa 3 None 5.1 3.5 1.4 None None 50 0.2 setosa 4 None 5.9 3.0 5.1 None None 199 1.8 virginica >>> iris_concat.tail() index sepal_length sepal_width petal_length petal_width species id PETAL_WIDTH CLASS 295 145 6.7 3.0 5.2 2.3 virginica None None None 296 146 6.3 2.5 5.0 1.9 virginica None None None 297 147 6.5 3.0 5.2 2.0 virginica None None None 298 148 6.2 3.4 5.4 2.3 virginica None None None 299 149 5.9 3.0 5.1 1.8 virginica None None None
sort
- IdaDataFrame.sort(columns=None, axis=0, ascending=True, inplace=False)[source]
Sort the IdaDataFrame row wise or column wise.
- Parameters:
- columnsstr or list of str
Columns that should be used to sort the rows in the IdaDataFrame. If columns is set to None and axis to 0, then the IdaDataFrame columns are sorted in lexicographical order.
- axisint (0/1)
Axis that is sorted. 0 for sorting row wise, 1 for sorting column wise.
- ascendingbool, default: True
Sorting order, True : ascending, False : descending
- inplacebool, default: False
The current object is modified or creates a modified copy. If False, the function creates a modified copy of the current dataframe. If True, the function modifies the current dataframe.
- Raises:
- ValueError
When sorting by column (column not None), the axis value must be 0 (rows).
A column does not belong to self.
The axis argument has a value other than 0 or 1.
Notes
If columns is set to None and axis to 0, this undoes all sorting the IdaDataFrame and returns the original sorting in the Db2 Warehouse database.
No actual changes are made in Db2 Warehouse, only the querying changes. Everything is registered in an InternalState object. Changes can be observed by using head and tail function.
- IdaDataFrame.groupby(by)[source]
Creates the groupby object
- Parameters:
- bystr
Column to group the Data Frame by
- Returns:
- IdaDataFrameGroupBy
Examples
>>> ida_iris.groupby("CLASS") <nzpyida.groupby.IdaDataFrameGroupBy at 0x11c288e10>
Obtaing IdaDataFrame with grouped data is possible with aggregate methods of IdaDataFrameGroupBy
IdaDataFrameGroupBy
- class nzpyida.groupby.IdaDataFrameGroupBy(in_df, columns_to_aggregate, by_column)[source]
Class representing groupby object, that is object created from IdaDataFrame. It implements Pandas-like object for grouping.
Examples
>>> ida_iris.groupby("CLASS") <nzpyida.groupby.IdaDataFrameGroupBy at 0x11c288e10>
>>> df.groupby("CLASS").mean().head() AVG_SEPAL_LENGTH AVG_SEPAL_WIDTH AVG_PETAL_LENGTH AVG_PETAL_WIDTH CLASS 0 5.006 3.418 1.464 0.244 Iris-setosa 1 5.936 2.770 4.260 1.326 Iris-versicolor 2 6.588 2.974 5.552 2.026 Iris-virginica
>>> mean_sepal_length = ida_iris[["SEPAL_LENGTH", "CLASS"]].groupby("CLASS").mean() >>> mean_sepal_length.head() AVG_SEPAL_LENGTH CLASS 0 5.006 Iris-setosa 1 5.936 Iris-versicolor 2 6.588 Iris-virginica
>>> iris_groupby = ida_iris[(ida_iris["SEPAL_LENGTH"]>4) & (ida_iris["SEPAL_WIDTH"]<3)][ >>> ["SEPAL_WIDTH", "SEPAL_LENGTH", "CLASS"]].groupby("CLASS") >>> iris_groupby <nzpyida.groupby.IdaDataFrameGroupBy object at 0x119863190> >>> iris_groupby.count().head() COUNT_SEPAL_WIDTH COUNT_SEPAL_LENGTH CLASS 0 2 2 Iris-setosa 1 21 21 Iris-virginica 2 34 34 Iris-versicolor
Methods
count()Aggregates all columns with COUNT method.
max()Aggregates all columns with MAX method.
mean()Aggregates all columns with AVG method.
min()Aggregates all columns with MIN method.
sum()Aggregates all columns with SUM method.
- count()[source]
Aggregates all columns with COUNT method. Counts all non null values in column with respect to grouped classes
- Returns:
- IdaDataFrame
Object with grouped data
- max()[source]
Aggregates all columns with MAX method. Selects the maximum value from column with respect to grouped classes
- Returns:
- IdaDataFrame
Object with grouped data
- mean()[source]
Aggregates all columns with AVG method. Calculates mean of values in column with respect to grouped classes
- Returns:
- IdaDataFrame
Object with grouped data
Descriptive Statistics
describe
- IdaDataFrame.describe(percentiles=[0.25, 0.5, 0.75])[source]
A basic statistical summary about current IdaDataFrame. If at least one numerical column exists, the summary includes:
The count of non-missing values for each numerical column.
The mean for each numerical column.
The standart deviation for each numerical column.
The minimum and maximum for each numerical column.
A list of percentiles set by the user (default : the quartiles).
- Parameters:
- idadfIdaDataFrame
- percentilesFloat or list of floats, default: [0.25, 0.50, 0.75].
percentiles to be computed on numerical columns. All values in percentiles must be > 0 and < 1.
- Returns:
- summary: DataFrame, where
Index is the name of the computed values.
Columns are either numerical or categorical columns of self.
cov (covariance)
corr (correlation)
- IdaDataFrame.corr(method='pearson', features=None, ignore_indexer=True)[source]
Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self. It must have at least two numeric columns.
- Parameters:
- methodstr, default: pearson
Method to be used to compute the correlation. By default, compute the pearson correlation coefficient. The Spearman rank correlation is also available. Admissible values are: “pearson”, “spearman”.
- Returns:
- correlation matrix: DataFrame
The axes are the columns of self and the values are the correlation coefficients.
Notes
For the Spearman rank correlation, the ordinal rank of columns is computed. For performance reasons this is easier to compute than the fractional rank traditionally computed for the Spearman rank correlation method. This strategy has the property that the sum of the ranking numbers is the same as under ordinal ranking. We then apply the pearson correlation coefficient method to these ranks.
quantile
- IdaDataFrame.quantile(q=0.5)[source]
Compute row wise quantiles for each numeric column.
- Parameters:
- qfloat or array-like, default 0.5 (50% quantile)
0 <= q <= 1, the quantile(s) to compute
- Returns:
- quantiles: Series or DataFrame
If q is an array, the function returns a DataFrame in which the index is q. The columns are the columns of sel, and the values are the quantiles. If q is a float, a Series is returned where the index is the columns of self and the values are the quantiles.
mad (mean absolute deviation)
min (minimum)
max (maximum)
count
count_distinct
std (standard deviation)
var (variance)
mean
sum
median
Import as DataFrame
as_dataframe
- IdaDataFrame.as_dataframe()[source]
Download and return an in-memory representation of the dataset as a Pandas DataFrame.
- Returns:
- DataFrame
Columns and records are the same as in self.
Examples
>>> iris = ida_iris.as_dataframe() >>> iris.head() sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa