Pandas HDFStore duplicate items Error

Hello All and thanks in advance.
I'm trying to do periodic storing of financial data to a database for later querying. I am using Pandas for almost all of the data handling. I want to append a DataFrame I have created to an HDF database. I read the CSV into a DataFrame and index it by timestamp, and the DataFrame looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 900 entries, 1378400701110 to 1378410270251
Data columns (total 23 columns):
....
...Columns with numbers of non-null values....
.....
dtypes: float64(19), int64(4)
store = pd.HDFStore('store1.h5')
store.append('df', df)
print store
<class 'pandas.io.pytables.HDFStore'>
File path: store1.h5
/df frame_table (typ->appendable,nrows->900,ncols->23,indexers->[index])
But when I then try to do anything with store,
print store['df']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 289, in __getitem__
return self.get(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 422, in get
return self._read_group(group)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 930, in _read_group
return s.read(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 3175, in read
mgr = BlockManager([block], [cols_, index_])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1007, in __init__
self._set_ref_locs(do_refs=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1117, in _set_ref_locs
"does not have _ref_locs set" % (block,labels))
AssertionError: cannot create BlockManager._ref_locs because block
[FloatBlock: [LastTrade, Bid1, Bid1Volume,....., Ask5Volume], 19 x 900, dtype float64]
with duplicate items
[Index([u'LastTrade', u'Bid1', u'Bid1Volume',..., u'Ask5Volume'], dtype=object)]
does not have _ref_locs set
I guess I am doing something wrong with the index; I'm quite new at this and have little know-how.
EDIT:
The data frame construction looks like:
columns = ['TimeStamp', 'LastTrade', 'Bid1', 'Bid1Volume', 'Bid1', 'Bid1Volume', 'Bid2', 'Bid2Volume', 'Bid3', 'Bid3Volume', 'Bid4', 'Bid4Volume',
'Bid5', 'Bid5Volume', 'Ask1', 'Ask1Volume', 'Ask2', 'Ask2Volume', 'Ask3', 'Ask3Volume', 'Ask4', 'Ask4Volume', 'Ask5', 'Ask5Volume']
df = pd.read_csv('/20130905.csv', names=columns, index_col=[0])
df.head() looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1378400701110 to 1378400703105
Data columns (total 21 columns):
LastTrade 5 non-null values
Bid1 5 non-null values
Bid1Volume 5 non-null values
Bid1 5 non-null values
.................values
Ask4 5 non-null values
Ask4Volume 5 non-null values
dtypes: float64(17), int64(4)
There are too many columns for it to print out the contents. But for example:
print df['LastTrade'].iloc[10]
LastTrade 1.31202
Name: 1378400706093, dtype: float64
and Pandas version:
>>> pd.__version__
'0.12.0'
Any ideas would be thoroughly appreciated, thank you again.

Do you really have duplicate 'Bid1' and 'Bid1Volume' columns?
Unrelated, but you should also set the index to a datetime index:
import pandas as pd
df.index = pd.to_datetime(df.index,unit='ms')
This is a bug: the duplicate columns cross dtypes (not a big deal, but it goes undetected). Easiest is just not to have duplicate columns.
Will be fixed in 0.13, see here: https://github.com/pydata/pandas/pull/4768
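A minimal sketch of that workaround, assuming a pandas version where Index.duplicated is available and that you want to keep only the first occurrence of each repeated label:
import pandas as pd

# Drop duplicate column labels; columns.duplicated() marks the second
# and later occurrences of each repeated name.
df = df.loc[:, ~df.columns.duplicated()]

# Convert the epoch-millisecond index to a DatetimeIndex, as suggested above.
df.index = pd.to_datetime(df.index, unit='ms')

store = pd.HDFStore('store1.h5')
store.append('df', df)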

Related

Trying to sum columns in Pandas dataframe, issue with index it seems

Part of the dataset looks like this, for multiple years: [snapshot of dataset]
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',columns='UNITSSTR', aggfunc='size')
The dataframe looks like this now: [screenshot of the pivoted dataframe and its properties]
I am trying to sum the multi-family units, so I am specifying the columns to sum:
cols = ['05', '06']
CA_HousingTrend['sum_stats'] = CA_HousingTrend[cols].sum(axis=1)
This is the error I get:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/code.py", line 90, in runcode
exec(code, self.locals)
File "", line 5, in
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in getitem
indexer = self.columns._get_indexer_strict(key, "columns")1
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['05', '06'], dtype='object', name='UNITSSTR')] are in the [columns]"
Not sure if you need the index, but the pivot probably created a multi-index. Try this.
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',columns='UNITSSTR', aggfunc='size')
# A new dataframe just so you have something new to play with.
new_df = CA_HousingTrend.reset_index()
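If the reset doesn't help, it may be worth checking which labels the pivot actually produced; the KeyError suggests the UNITSSTR values may have been read as numbers rather than the strings '05' and '06' (an assumption, not something the question confirms):
# Print the real column labels of the pivot result; if they show up as
# numbers (e.g. 5.0 and 6.0) rather than '05' and '06', select with
# numeric keys instead, e.g. cols = [5, 6].
print(CA_HousingTrend.columns.tolist())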

HDF5 bad performance for tables with non-numerical columns

I have pulled some big tables (typically 100M rows X 100 columns) from a database, with both numerical and textual columns. Now I need to save them to a remote machine that doesn't allow setting up databases. After the data are saved there, the use case would be occasional reads of these files by different collaborators, with in-place queries being favorable.
So I figured that HDF5 would be the file format to use. However, the performance seems really bad, both read and write, even worse than for .csv files. I used a random subset (1M rows) of a big table to test the saving process in Python. Some info about this test dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 1 to 1000000
Data columns (total 49 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 xid 1000000 non-null int64
1 yid 1000000 non-null int64
2 timestamp 1000000 non-null datetime64[ns]
...
7 content1 361892 non-null object
8 content2 168397 non-null object
...
dtypes: datetime64[ns](14), float64(14), int64(5), object(16)
memory usage: 381.5+ MB
>>> df['content2'].str.len().describe()
count 168397.000000
mean 111.846975
std 427.148959
min 4.000000
25% 72.000000
50% 73.000000
75% 73.000000
max 13320.000000
I allocated 128 GB of memory for this process. And when I saved this DataFrame:
>>> timeit(lambda: df.to_csv('test.csv', index=False), number=5)
157.54342390294187
>>> store = pd.HDFStore('test.h5')
>>> timeit(lambda: store.put(key='df', value=df, append=False, index=False, complevel=9, complib='blosc:zstd', format='table', min_itemsize={'content2': 65535}), number=5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 1030, in put
self._write_to_group(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 1697, in _write_to_group
s.write(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 4137, in write
table = self._create_axes(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 3806, in _create_axes
data_converted = _maybe_convert_for_string_atom(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 4812, in _maybe_convert_for_string_atom
data_converted = _convert_string_array(data, encoding, errors).reshape(data.shape)
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 4857, in _convert_string_array
data = np.asarray(data, dtype=f"S{itemsize}")
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError: Unable to allocate 61.0 GiB for an array with shape (1, 1000000) and data type |S65535
Even with smaller subsets (like 100K rows) that didn't lead to this MemoryError, saving to HDF5 was significantly slower than to_csv. This seems surprising to me given the touted advantages of HDF5 for handling large datasets. Is it because HDF5 is still bad at handling text data? Note that the textual columns do have long-tail distributions, and the content can be as long as 60K characters (which is why I set min_itemsize to 65535).
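That fixed width is exactly what blows up the allocation; a quick sanity check of the arithmetic, using only numbers from the traceback:
# PyTables stores the column as fixed-width S65535 byte strings, so the
# conversion materializes itemsize * nrows bytes in a single array.
itemsize = 65535
nrows = 1_000_000
print(itemsize * nrows / 2**30)  # ~61.04 GiB, matching the MemoryError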
I would appreciate any suggestion on the most appropriate file format to use, if not HDF5, or any fixes to my current approach. Thanks!

merging two pandas data frames with modin.pandas gives ValueError

In an attempt to make my pandas code faster I installed modin and tried to use it. A merge of two data frames that had previously worked gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
printing event_df.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
printing other_df info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the version cited in this question Join two modin.pandas.DataFrame(s), but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df,other_df, on = ["id","other_id"])
It seems there is some problem with modin's merge functionality. Is there any workaround, such as using pandas for the merge and modin for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem, and if so, is there a solution?
Your error reads like you were merging an instance of modin.pandas.dataframe.DataFrame with an instance of pandas.core.frame.DataFrame, which is not allowed.
If that's indeed the case, you could turn the pandas DataFrame into a Modin DataFrame first; then you should be able to merge them, I believe.
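A minimal sketch of that suggestion (assuming event_df is already a Modin frame and other_df is the plain pandas one; swap them if it's the other way around in your code):
import modin.pandas as mpd

# Wrap the plain pandas DataFrame in a Modin DataFrame, then merge two
# Modin frames as usual.
other_df = mpd.DataFrame(other_df)
joint_dataframe2 = mpd.merge(event_df, other_df, on=["id", "other_id"])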

convert data that is in the form of object in a csv to a pivot

I have a file that is not clean or searchable, so I downloaded it in CSV format. It contains 4 columns and 116424 rows.
I'm not able to plot three of its columns, namely Year, Age and Ratio, onto a heat map.
The link for the csv file is: https://gist.github.com/JustGlowing/1f3d7ff0bba7f79651b00f754dc85bf1
import numpy as np
import pandas as pd
from pandas import DataFrame
from numpy.random import randn
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('new_file.csv')
print(df.info())
print(df.shape)
couple_columns = df[['Year','Age','Ratio']]
print(couple_columns.head())
Error
C:\Users\Pranav\AppData\Local\Programs\Python\Python36-32\python.exe C:/Users/Pranav/PycharmProjects/takenmind/Data_Visualization/a1.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116424 entries, 0 to 116423
Data columns (total 4 columns):
AREA 116424 non-null object
YEAR 116424 non-null int64
AGE 116424 non-null object
RATIO 116424 non-null object
dtypes: int64(1), object(3)
memory usage: 2.2+ MB
None
(116424, 4)
Traceback (most recent call last):
File "C:/Users/Pranav/PycharmProjects/takenmind/Data_Visualization/a1.py", line 12, in <module>
couple_columns = df[['Year','Age','Ratio']]
File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\indexing.py", line 1327, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['Year' 'Age' 'Ratio'] not in index"
It seems from the info output that your columns are uppercase: YEAR 116424 non-null int64. You should be able to get, e.g., the year column with df[['YEAR']].
If you would rather use lowercase, you can use
df = pd.read_csv('new_file.csv').rename(columns=str.lower)
The csv has some text in the top 8 lines before your actual data begins. You can skip those by using the skiprows argument
df = pd.read_csv('f2m_ratios.csv', skiprows=8)
Let's say you want to plot the heatmap only for one Area:
df = df[df['Area'] == 'Afghanistan']
Before you plot a heatmap, you need the data in a certain format (a pivot table):
df = df.pivot('Year','Age','Ratio')
Now your dataframe is ready for a heatmap
sns.heatmap(df)
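Putting those steps together into one runnable sketch (the file name, skiprows value and column capitalization are taken from the snippets above and may need adjusting to your CSV):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('f2m_ratios.csv', skiprows=8)  # skip the 8 lines of preamble
df = df[df['Area'] == 'Afghanistan']            # heatmap for a single Area
pivoted = df.pivot('Year', 'Age', 'Ratio')      # rows=Year, cols=Age, values=Ratio
sns.heatmap(pivoted)
plt.show()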

DataFrame.ix() in pandas - is there an option to catch situations when requested columns do not exist?

My code reads CSV file into pandas DataFrame - and processes it.
The code relies on column names and uses df.ix[,] to get the columns.
Recently some column names in the CSV file were changed (without notice).
But the code was not complaining and was silently producing wrong results.
The ix[,] construct doesn't check whether a column exists. If it doesn't, ix simply creates it and populates it with NaN.
Here is the main idea of what was going on.
df1=DataFrame({'a':[1,2,3],'b':[4,5,6]}) # columns 'a' & 'b'
df2=df1.ix[:,['a','c']] # trying to get 'a' & 'c'
print df2
a c
0 1 NaN
1 2 NaN
2 3 NaN
So it doesn't produce an error or a warning.
Is there an alternative way to select specific columns with extra check that columns exist?
My current workaround is to use my own small utility function, something like this:
import sys, inspect
def validate_cols_or_exit(df, cols):
    """
    Exits with error message if pandas DataFrame object df
    doesn't have all columns from the provided list of columns
    Example of usage:
    validate_cols_or_exit(mydf, ['col1', 'col2'])
    """
    dfcols = list(df.columns)
    valid_flag = True
    for c in cols:
        if c not in dfcols:
            print "Error, non-existent DataFrame column found - ", c
            valid_flag = False
    if not valid_flag:
        print "Error, non-existent DataFrame column(s) found in function ", inspect.stack()[1][3]
        print "valid column names are:"
        print "\n".join(df.columns)
        sys.exit(1)
How about:
In [3]: df1[['a', 'c']]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-3-2349e89f1bb5> in <module>()
----> 1 df1[['a', 'c']]
/home/wesm/code/pandas/pandas/core/frame.py in __getitem__(self, key)
1582 if com._is_bool_indexer(key):
1583 key = np.asarray(key, dtype=bool)
-> 1584 return self._getitem_array(key)
1585 elif isinstance(self.columns, MultiIndex):
1586 return self._getitem_multilevel(key)
/home/wesm/code/pandas/pandas/core/frame.py in _getitem_array(self, key)
1609 mask = indexer == -1
1610 if mask.any():
-> 1611 raise KeyError("No column(s) named: %s" % str(key[mask]))
1612 result = self.reindex(columns=key)
1613 if result.columns.name is None:
KeyError: 'No column(s) named: [c]'
Not sure you can constrain a DataFrame, but your helper function could be a lot simpler, something like:
mismatch = set(cols).difference(set(dfcols))
if mismatch:
    raise SystemExit('Unknown column(s): {}'.format(','.join(mismatch)))
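Wrapped into a complete helper along those lines (just a sketch; select_cols is a hypothetical name, not part of pandas):
def select_cols(df, cols):
    # Fail fast with the offending names instead of letting ix silently
    # create NaN-filled columns.
    mismatch = set(cols).difference(df.columns)
    if mismatch:
        raise SystemExit('Unknown column(s): {}'.format(','.join(sorted(mismatch))))
    return df[cols]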