A pandas DataFrame's file size increases significantly after saving it as .h5 the first time. If I save the loaded DataFrame again, the file size doesn't increase. This makes me suspect that some kind of metadata is being saved the first time. What is the reason for this increase?
Is there an easy way to avoid it?
I can compress the file but I am making comparisons without compression. Would the problem scale differently with compression?
Example code below. The in-memory size reported by info() increases from 15.3 MB to 22.9 MB after the first save/load round trip.
import numpy as np
import pandas as pd

# Build a two-column frame of one million floats
x = np.random.normal(0, 1, 1000000)
y = x * 2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
print(dataset.info(memory_usage='deep'))

# First save and reload
dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf('data.h5')
print(dataset2.info(memory_usage='deep'))

# Save the reloaded frame and load again: the size does not grow further
dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf('data2.h5')
print(dataset3.info(memory_usage='deep'))
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
It is happening because the RangeIndex is converted to an Int64Index on save: a RangeIndex stores only start, stop, and step, while an Int64Index materialises every value, which for one million rows adds 1,000,000 × 8 bytes ≈ 7.6 MB. Is there a way to optimise this? It looks like there's no way to drop the index:
https://github.com/pandas-dev/pandas/issues/8319
The best solution I have found so far is to save as a pickle:
dataset.to_pickle("datapkl.pkl")
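Reading it back is the mirror call, and the RangeIndex round-trips intact:

# Load the pickled frame; the index type is preserved
dataset = pd.read_pickle("datapkl.pkl")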
A less convenient option is to convert to NumPy and save with h5py, but then loading and converting back to pandas takes a lot of time:
import h5py

a = dataset.to_numpy()
h5f = h5py.File('datah5.h5', 'w')
h5f.create_dataset('dataset_1', data=a)
h5f.close()  # flush and release the file
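Since the original question asks about compression: a minimal sketch for measuring its effect, using to_hdf's standard complevel/complib options (file names here are arbitrary). Whether the materialised index compresses away in practice is something to measure rather than assume:

import os

# Write the same frame with zlib compression at the maximum level
dataset.to_hdf('data_zlib.h5', key='df', mode='w', complevel=9, complib='zlib')
print(os.path.getsize('data.h5') / 1e6, 'MB uncompressed')
print(os.path.getsize('data_zlib.h5') / 1e6, 'MB compressed')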
I have a data frame:
df = pd.DataFrame({'A': range(1, 10000)})
I can get a nice human-readable thing saying that it has a memory usage of 78.2 KB using df.info():
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
I can get an unhelpful statement to similar effect using df.memory_usage() (and this is how pandas itself calculates its own memory usage), but I would like to avoid having to roll my own formatting. I've looked at the df.info source and traced the string all the way to this line.
How is this specific string generated and how can I pull that out so I can print it to a log?
N.B. I can't parse the df.info() output because it prints directly to a buffer; calling str on it just returns None.
N.B. This line also does not help; what is initialised there is merely a boolean flag for whether memory usage should be printed at all.
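(For reference, df.info does accept a buf argument, so the raw text can be captured; but turning that string back into the one number I want is exactly the roll-my-own parsing I'd like to avoid:)

import io

buf = io.StringIO()
df.info(buf=buf)        # info() writes to the buffer instead of stdout
text = buf.getvalue()   # the whole summary as one string, still unparsed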
You can create an instance of pandas.io.formats.info.DataFrameInfo and read the memory_usage_string property, which is exactly what df.info() does:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
>>> pd.io.formats.info.DataFrameInfo(df).memory_usage_string.strip()
'78.2 KB'
If you're passing memory_usage to df.info, you can pass it directly to DataFrameInfo:
pd.io.formats.info.DataFrameInfo(df, memory_usage='deep').memory_usage_string.strip()
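A minimal sketch of pushing that string into a log with the standard logging module (DataFrameInfo lives in a private pandas namespace, so this assumes a pandas version where it exists with this signature):

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
df = pd.DataFrame({'A': range(1, 10000)})

# Same call as above, routed into a log record instead of stdout
mem = pd.io.formats.info.DataFrameInfo(df, memory_usage='deep').memory_usage_string.strip()
logging.info('DataFrame memory usage: %s', mem)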
When removing zero and null values from a pandas DataFrame, the datatype of the fields changes.
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 2 columns):
budget 10866 non-null int64
revenue 10866 non-null int64
dtypes: int64(2)
memory usage: 509.4+ KB
After running the code below to remove the zero and null values, the datatype changed.
temp_list_to_check_zero_values = ['budget', 'revenue']
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].replace(0, np.nan)
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3854 entries, 0 to 10848
Data columns (total 2 columns):
budget 3854 non-null float64
revenue 3854 non-null float64
dtypes: float64(2)
memory usage: 210.8+ KB
To preserve the datatype, we used applymap:
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].applymap(np.int64)
But got error :
ModuleNotFoundError: No module named 'pandas.core.apply'
Do we need to install any specific library to use applymap()?
Upgrading your pandas should solve the issue.
Try pip install pandas --upgrade
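As an aside, if the real goal is to hold missing values without falling back to float64, a sketch using pandas' nullable Int64 extension dtype (available since pandas 0.24) keeps the columns integer-typed:

# Cast to the nullable integer dtype first, then blank out the zeros;
# the dtype stays Int64 and missing entries show up as <NA>.
cols = ['budget', 'revenue']
df[cols] = df[cols].astype('Int64')
df[cols] = df[cols].replace(0, pd.NA)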
I have a dataframe (see link for image) and have listed its info below. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the resulting dataframe, the y-axis values range from 0 to 2.0 instead of spanning the minimum and maximum of the M and F columns.
To verify that it's not my environment, I created a simple dataframe with just a few values and plotted a line graph for it, and it works as expected (a minimal sketch of that test is included after the info output below). Does anyone know why this is happening? Setting the range with ylim or yticks does not work either. Ultimately I can fall back to other graphing utilities like matplotlib, but I'm curious why plotting fails for such a simple dataframe and dataset.
Visit my github page for a working example: <git#github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
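For reference, a minimal sketch of the control test described above, with values invented for illustration: a small frame of plain int64 columns plots with the expected y-axis range.

# Hypothetical frame mirroring the pivoted shape (made-up numbers)
test_df = pd.DataFrame({'F': [90000, 95000, 97000],
                        'M': [100000, 110000, 115000]},
                       index=[1880, 1881, 1882])
test_df.plot()  # y-axis spans the data, as expected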
My data:
In [30]: ml_full.columns
Out[30]: Index([u'coursetotal', u'email', u'assignment', u'grade'], dtype='object')
In [32]: ml_full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 4 columns):
coursetotal 496 non-null float64
email 496 non-null object
assignment 496 non-null object
grade 270 non-null float64
dtypes: float64(2), object(2)
memory usage: 15.6+ KB
Then I try to plot it and get this error:
In [33]: sns.lmplot(x='coursetotal', y='grade', data=ml_full, col='assignment')
mtrand.pyx in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:16157)()
ValueError: low >= high
I've reduced this to a small case. There are definitely NaNs in the data, by the way, in case that's the problem. I don't know how to diagnose this any further. Any tips?
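One quick diagnostic sketch, assuming the NaNs in grade are the culprit: drop the incomplete rows first and see whether the error goes away.

# Keep only rows with a non-null 'grade' before handing the frame to seaborn
clean = ml_full.dropna(subset=['grade'])
sns.lmplot(x='coursetotal', y='grade', data=clean, col='assignment')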
pandas data frames are in general represented in long (many rows) or wide (many columns) format.
I'm wondering which format is faster to read and occupies less memory when saved as an HDF file (df.to_hdf).
Is there a general rule, or are there cases where one of the formats should be preferred?
IMO the long format is much preferable, as you will have much less metadata overhead (information about column names, dtypes, etc.).
In terms of memory usage they are going to be more or less the same:
In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))
In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))
In [24]: long.shape
Out[24]: (10000, 4)
In [25]: wide.shape
Out[25]: (4, 10000)
In [26]: sys.getsizeof(long)
Out[26]: 160104
In [27]: sys.getsizeof(wide)
Out[27]: 160104
In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB
In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0 10000 non-null int32
1 10000 non-null int32
2 10000 non-null int32
3 10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB
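A short sketch for checking the on-disk side of this claim, assuming HDF5 output as in the question; the wide frame has to store 10,000 column labels as metadata where the long one stores only 4:

import os

long.to_hdf('long.h5', key='df', mode='w')
wide.to_hdf('wide.h5', key='df', mode='w')
print(os.path.getsize('long.h5'), 'bytes (long)')
print(os.path.getsize('wide.h5'), 'bytes (wide)')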