wide vs long format when saving data in pandas hdf5

wide vs long format when saving data in pandas hdf5 - pandas

pandas data frame are in general represented in long ( a lot of rows) or wide (a lot of columns) format.
I'm wondering which format is faster to read and occupies less memory when saved as hdf file (df.to_hdf).
Is there a general rule or some cases where one of the format should be preferred?

IMO long format is much more preferable as you will have much less metadata overhead (information about column names, dtypes, etc.).
In term of memory usage they are going to be more or less the same:
In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))
In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))
In [24]: long.shape
Out[24]: (10000, 4)
In [25]: wide.shape
Out[25]: (4, 10000)
In [26]: sys.getsizeof(long)
Out[26]: 160104
In [27]: sys.getsizeof(wide)
Out[27]: 160104
In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB
In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0 10000 non-null int32
1 10000 non-null int32
2 10000 non-null int32
3 10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB

Related

Extract human readable memory usage for Pandas data frame

I have a data frame:
pd.DataFrame({'A': range(1, 10000)})
I can get a nice human-readable thing saying that it has a memory usage of 78.2 KB using df.info():
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
I can get an unhelpful statement with similar effect using df.memory_usage() (and this is how Pandas itself calculates its own memory usage) but would like to avoid having to roll my own. I've looked at the df.info source and traced the source of the string all the way to this line.
How is this specific string generated and how can I pull that out so I can print it to a log?
Nb I can't parse the df.info() output because it prints directly to buffer; calling str on it just returns None.
Nb This line also does not help, what is initialised is merely a boolean flag for whether memory usage should be printed at all.

You can create an instance of pandas.io.formats.info.DataFrameInfo and read the memory_usage_string property, which is exactly what df.info() does:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
>>> pd.io.formats.info.DataFrameInfo(df).memory_usage_string.strip()
'78.2 KB'
If you're passing memory_usage to df.info, you can pass it directly to DataFrameInfo:
pd.io.formats.info.DataFrameInfo(df, memory_usage='deep').memory_usage_string.strip()

When plotting a pandas dataframe, the y-axis values are not displayed correctly

I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of the minimum and maximum values from the M and F columns.
To verify that it's not my environment, I created a simple dataframe, with just a few values and plot the line graph for that dataframe and it works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my github page for a working example <git#github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB

Dataframe size increases after saving to .h5 the first time

A pandas dataframe file size increases significantly after saving it as .h5 the first time. If I save loaded dataframe, the file size doesn't increase again. It makes me suspect that some kind of meta-data is being saved the first time. What is it the reason for this crease?
Is there an easy way to avoid it?
I can compress the file but I am making comparisons without compression. Would the problem scale differently with compression?
Example code below. The file size increases from 15.3 MB to 22.9 MB
import numpy as np
import pandas as pd
x = np.random.normal (0,1, 1000000)
y = x*2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
print (dataset.info(memory_usage='deep'))
dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf("data.h5")
print (dataset2.info(memory_usage='deep'))
dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf("data2.h5")
print (dataset3.info(memory_usage='deep'))
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
It is happening because RangeIndex is converted to Int64Index on save. Is there a way to optimise this ? Looks like there's no way to drop the index:
https://github.com/pandas-dev/pandas/issues/8319

The best solution I found till now is to save as pickle:
dataset.to_pickle("datapkl.pkl")
less convenient option is to convert to numpy and save with h5py, but then loading and converting back to pandas takes a lot of time
a = dataset.to_numpy()
h5f = h5py.File('datah5.h5', 'w')
h5f.create_dataset('dataset_1', data=a)

sns.lmplot(x='coursetotal', y='grade', data=ml_full, col='assignment') error

My data:
In [30]: ml_full.columns
Out[30]: Index([u'coursetotal', u'email', u'assignment', u'grade'], dtype='object')
In [32]: ml_full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 4 columns):
coursetotal 496 non-null float64
email 496 non-null object
assignment 496 non-null object
grade 270 non-null float64
dtypes: float64(2), object(2)
memory usage: 15.6+ KB.
Then I try and plot it and get this error:
In [33]: sns.lmplot(x='coursetotal', y='grade', data=ml_full, col='assignment')
mtrand.pyx in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:16157)()
ValueError: low >= high
I've reduced this to a small case. There are definitely NaN's in the data by the way in case that's what the problem is. I don't know how to further diagnose what the problem is with my data. Any tips?

Selecting columns from numpy recarray

I have an object from type numpy.core.records.recarray. I want to use it effectively as pandas dataframe. More precisely, I want to use a subset of its columns in order to obtain a new recarray, the same way you would do pandas_dataframe[[selected_columns]].
What's the easiest way to achieve this?

Without using pandas you can select a subset of the fields of a structured array (recarray). For example:
In [338]: dt=np.dtype('i,f,i,f')
In [340]: A=np.ones((3,),dtype=dt)
In [341]: A[:]=(1,2,3,4)
In [342]: A
Out[342]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
a subset of the fields.
In [343]: B=A[['f1','f3']].copy()
In [344]: B
Out[344]:
array([(2.0, 4.0), (2.0, 4.0), (2.0, 4.0)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
that can be modified independently of A:
In [346]: B['f3']=[.1,.2,.3]
In [347]: B
Out[347]:
array([(2.0, 0.10000000149011612), (2.0, 0.20000000298023224),
(2.0, 0.30000001192092896)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
In [348]: A
Out[348]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
The structured subset of fields is not highly developed. A[['f0','f1']] is enough for viewing, but it will warn or give an error if you try to modify that subset. That's why I used copy with B.
There's a set of functions that facilitate adding and removing fields from recarrays. I'll have to look up the access pattern. But mostly the construct a new dtype, and empty array, and then copy fields by name.
import numpy.lib.recfunctions as rf
update
With newer numpy versions, the multi-field index has changed
In [17]: B=A[['f1','f3']]
In [18]: B
Out[18]:
array([(2., 4.), (2., 4.), (2., 4.)],
dtype={'names':['f1','f3'], 'formats':['<f4','<f4'], 'offsets':[4,12], 'itemsize':16})
This B is a true view, referencing the same data buffer as A. The offsets lets it ignore the missing fields. Those fields can be removed with repack_fields as just documented.
But when putting this into a dataframe, it doesn't look like we need to do that.
In [19]: df = pd.DataFrame(A)
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f0 3 non-null int32
1 f1 3 non-null float32
2 f2 3 non-null int32
3 f3 3 non-null float32
dtypes: float32(2), int32(2)
memory usage: 176.0 bytes
In [22]: df = pd.DataFrame(B)
In [24]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 3 non-null float32
1 f3 3 non-null float32
dtypes: float32(2)
memory usage: 152.0 bytes
The frame created from B is smaller.
Sometimes when making a dataframe from an array, the array itself is used as the frame's memory. Changing values in the source array will change the values in the frame. But with structured arrays, pandas makes a copy of the data, with a different memory layout.
Columns of matching dtype are grouped into a common NumericBlock:
In [42]: pd.DataFrame(A)._data
Out[42]:
BlockManager
Items: Index(['f0', 'f1', 'f2', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 5, 2), 2 x 3, dtype: float32
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int32
In [43]: pd.DataFrame(B)._data
Out[43]:
BlockManager
Items: Index(['f1', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: float32

In addition to #hpaulj answer, you'll want to repack the copy, otherwise the copied subset will have the same size memory footprint as the original.
import numpy as np
# note that you have to import this library explicitly
import numpy.lib.recfunctions
# B has a subset of "colums" but uses the same amount of memory as A
B = A[['f1','f3']].copy()
# C has a smaller memory footprint
C = np.lib.recfunctions.repack_fields(B)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

wide vs long format when saving data in pandas hdf5 - pandas

pandas data frame are in general represented in long ( a lot of rows) or wide (a lot of columns) format. I'm wondering which format is faster to read and occupies less memory when saved as hdf file (df.to_hdf). Is there a general rule or some cases where one of the format should be preferred?

Related

Extract human readable memory usage for Pandas data frame

When plotting a pandas dataframe, the y-axis values are not displayed correctly

Dataframe size increases after saving to .h5 the first time

sns.lmplot(x='coursetotal', y='grade', data=ml_full, col='assignment') error

Selecting columns from numpy recarray

Categories

Resources