Extract human-readable memory usage for a pandas DataFrame

I have a data frame:
df = pd.DataFrame({'A': range(1, 10000)})
I can get a nice human-readable statement that it has a memory usage of 78.2 KB using df.info():
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
I can get the raw numbers with df.memory_usage() (which is how pandas itself calculates its memory usage), but I would like to avoid rolling my own human-readable formatting. I've looked at the df.info source and traced the source of the string all the way to this line.
How is this specific string generated and how can I pull that out so I can print it to a log?
NB: I can't simply parse the df.info() output because it prints directly to a buffer and returns None; calling str() on the return value just gives 'None'.
NB: This line also does not help; what is initialised there is merely a boolean flag for whether memory usage should be printed at all.

You can create an instance of pandas.io.formats.info.DataFrameInfo and read the memory_usage_string property, which is exactly what df.info() does:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
>>> pd.io.formats.info.DataFrameInfo(df).memory_usage_string.strip()
'78.2 KB'
If you're passing memory_usage to df.info, you can pass it directly to DataFrameInfo:
pd.io.formats.info.DataFrameInfo(df, memory_usage='deep').memory_usage_string.strip()
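If you'd rather not depend on pandas.io.formats.info (it is an internal module, so its location can change between versions), here is a rough alternative that only uses the public buf argument of df.info(); the line-prefix parsing is an assumption about the printed format:
import io

buf = io.StringIO()
df.info(buf=buf)
# df.info() writes its report into the buffer; the memory usage line
# looks like "memory usage: 78.2 KB".
mem_line = next(line for line in buf.getvalue().splitlines()
                if line.startswith('memory usage'))
print(mem_line.split(':', 1)[1].strip())  # e.g. '78.2 KB'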

Related

No module named 'pandas.core.apply'

While removing the zero and null values from a pandas dataframe, the datatype of the field gets changed.
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 2 columns):
budget 10866 non-null int64
revenue 10866 non-null int64
dtypes: int64(2)
memory usage: 509.4+ KB
After running the code below to remove zero and null values, the datatype changed.
temp_list_to_check_zero_values=['budget', 'revenue']
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].replace(0, np.NAN)
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3854 entries, 0 to 10848
Data columns (total 2 columns):
budget 3854 non-null float64
revenue 3854 non-null float64
dtypes: float64(2)
memory usage: 210.8+ KB
To preserve the datatype, we used applymap:
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].applymap(np.int64)
But got this error:
ModuleNotFoundError: No module named 'pandas.core.apply'
Do we need to install any specific library to use applymap()?
Upgrading your pandas should solve the issue.
Try pip install pandas --upgrade
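As a side note on the dtype change itself: a plain int64 column cannot hold NaN, which is why pandas upcasts to float64 after the replace. Here is a rough sketch using the nullable integer dtype instead, assuming pandas >= 1.0 and the column names and df from the question:
import numpy as np

cols = ['budget', 'revenue']
# Replace zeros with NaN as before, then cast to the nullable Int64 dtype;
# the columns stay integer-typed even though they now contain <NA>.
df[cols] = df[cols].replace(0, np.nan).astype('Int64')
df.info()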

When plotting a pandas dataframe, the y-axis values are not displayed correctly

I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of the minimum and maximum values from the M and F columns.
To verify that it's not my environment, I created a simple dataframe, with just a few values and plot the line graph for that dataframe and it works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my github page for a working example <git@github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB

Dataframe size increases after saving to .h5 the first time

A pandas dataframe's file size increases significantly after saving it as .h5 the first time. If I save the loaded dataframe again, the file size doesn't increase further. It makes me suspect that some kind of metadata is being saved the first time. What is the reason for this increase?
Is there an easy way to avoid it?
I can compress the file but I am making comparisons without compression. Would the problem scale differently with compression?
Example code below. The file size increases from 15.3 MB to 22.9 MB
import numpy as np
import pandas as pd
x = np.random.normal(0, 1, 1000000)
y = x * 2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
print(dataset.info(memory_usage='deep'))
dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf("data.h5")
print(dataset2.info(memory_usage='deep'))
dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf("data2.h5")
print(dataset3.info(memory_usage='deep'))
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
It is happening because the RangeIndex is converted to an Int64Index on save. Is there a way to optimise this? It looks like there's no way to drop the index:
https://github.com/pandas-dev/pandas/issues/8319
The best solution I have found so far is to save as a pickle:
dataset.to_pickle("datapkl.pkl")
A less convenient option is to convert to NumPy and save with h5py, but then loading and converting back to pandas takes a lot of time:
import h5py

# Write the raw array; the index and column labels are not stored.
a = dataset.to_numpy()
h5f = h5py.File('datah5.h5', 'w')
h5f.create_dataset('dataset_1', data=a)
h5f.close()
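For completeness, a rough sketch of the corresponding load path for the h5py variant; the column names are assumptions, since h5py stores only the raw array:
import h5py
import pandas as pd

# Read the raw array back and rebuild the DataFrame; column labels were not
# stored in the dataset, so they have to be supplied again by hand.
with h5py.File('datah5.h5', 'r') as h5f:
    arr = h5f['dataset_1'][:]
dataset_loaded = pd.DataFrame(arr, columns=['Column1', 'Column2'])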

merging two pandas data frames with modin.pandas gives ValueError

In an attempt to make my pandas code faster I installed modin and tried to use it. A merge of two data frames that had previously worked gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
printing event_df.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
printing other_df info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the version cited in this question Join two modin.pandas.DataFrame(s), but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df,other_df, on = ["id","other_id"])
It seems there is some problem with modin's merge functionality. Is there any workaround such as using pandas for the merge and using modin for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem and if so, is there a solution?
Your error reads like you were merging an instance of modin.pandas.dataframe.DataFrame with an instance of pandas.core.frame.DataFrame, which is not allowed.
If that's indeed the case, you could convert the pandas DataFrame into a modin DataFrame first; then you should be able to merge them, I believe.
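A minimal sketch of that conversion, assuming other_df is the frame that is still a plain pandas DataFrame (swap the two if it is the other way round):
import modin.pandas as mpd

# Wrap the plain pandas DataFrame in a modin DataFrame so that both
# operands of the merge are modin objects.
other_df = mpd.DataFrame(other_df)
joint_dataframe2 = mpd.merge(event_df, other_df, on=['id', 'other_id'])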

wide vs long format when saving data in pandas hdf5

Pandas data frames are in general represented in long (a lot of rows) or wide (a lot of columns) format.
I'm wondering which format is faster to read and occupies less memory when saved as an HDF file (df.to_hdf).
Is there a general rule, or are there cases where one of the formats should be preferred?
IMO the long format is preferable, as you will have much less metadata overhead (information about column names, dtypes, etc.).
In terms of memory usage they are going to be more or less the same:
In [22]: long = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))
In [23]: wide = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))
In [24]: long.shape
Out[24]: (10000, 4)
In [25]: wide.shape
Out[25]: (4, 10000)
In [26]: sys.getsizeof(long)
Out[26]: 160104
In [27]: sys.getsizeof(wide)
Out[27]: 160104
In [28]: wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 156.3 KB
In [29]: long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
0 10000 non-null int32
1 10000 non-null int32
2 10000 non-null int32
3 10000 non-null int32
dtypes: int32(4)
memory usage: 156.3 KB
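To look at the on-disk side as well, here is a rough comparison sketch; the default fixed format and the file names are assumptions, and the numbers will differ with format='table' or with compression:
import os
import numpy as np
import pandas as pd

long_df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 4)))
wide_df = pd.DataFrame(np.random.randint(0, 10**6, (4, 10**4)))

# Same number of cells in both frames, but the wide one stores metadata
# for 10000 columns instead of 4.
long_df.to_hdf('long.h5', key='df', mode='w')
wide_df.to_hdf('wide.h5', key='df', mode='w')
print(os.path.getsize('long.h5'), os.path.getsize('wide.h5'))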