How do I combine multiple pandas dataframes into an HDF5 object under one key/group? - pandas

I am parsing data from a large csv of about 800 GB. For each line of data, I save this as a pandas dataframe.
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv_field: value pairs, "dictionary_line"
    # save as a pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this into HDF5 format, and query the h5 file as if it were the entire csv file.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
My approach so far has been:
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv_field: value pairs, "dictionary_line"
    # save as a pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to save each dataframe df into the HDF5 under one key. However, this fails:
Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv_field: value pairs, "dictionary_line"
    # save as a pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])  # accumulate one big dataframe
# and now store into HDF5 format
store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
However, I don't think I have the RAM to hold every csv line in total_df before writing it out in HDF5 format.
So, how do I append each "single-line" df to the HDF5 file so that the stored data ends up as one big dataframe (like the original csv)?
EDIT: Here's a concrete example of a csv file with different data types:
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....

Your code should work; can you try the following code?
import pandas as pd
import numpy as np
store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]
for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
store.close()
If this code works, then there is something wrong with your data.
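If the random-data test above succeeds, a more memory-friendly way to load the real 800 GB file is to let pandas do the chunking instead of building one single-row dataframe per line. A minimal sketch, assuming the csv has a header row (the input path and chunk size here are placeholders, and string columns like the categories in your example may need min_itemsize on the first append):
import pandas as pd
store = pd.HDFStore("pathname/file.h5", "w")
hdf5_key = "single_key"
# stream the csv in chunks and append every chunk under the same key
for chunk in pd.read_csv("pathname/data.csv", chunksize=100000):
    store.append(hdf5_key, chunk, data_columns=True, index=False)
store.close()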

Related

Pandas read csv using column names included in a list

I'm quite new to Pandas.
I'm trying to create a dataframe reading thousands of csv files.
The files are not all structured in the same way, but I want to extract only the columns I'm interested in, so I created a list which includes all the column names I want, but then I get an error because not all of them are present in every dataset.
import pandas as pd
import numpy as np
import os
import glob
# select the csv folder
csv_folder = r'myPath'
# select all csv files within the folder
all_files = glob.glob(csv_folder + "/*.csv")
# set the column names to include in the dataframe
columns_to_use = ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6']
# read the csv files one by one
for filename in all_files:
    df = pd.read_csv(filename,
                     header=0,
                     usecols=columns_to_use)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-0d9670495660> in <module>
1 for filename in all_files:
----> 2 df = pd.read_csv(filename,
3 header=0,
4 usecols = columns_to_use)
5
ValueError: Usecols do not match columns, columns expected but not found: ['Name1', 'Name2', 'Name4']
How could I handle this issue so that a column from the list is only used if it is present in the file?
Use a callable for usecols, i.e. df = pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use). From the docs of the usecols parameter:
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True.
Working example that will only read col1 and not throw an error on missing col3:
import pandas as pd
import io
s = """col1,col2
1,2"""
df = pd.read_csv(io.StringIO(s), usecols=lambda c: c in ['col1', 'col3'])
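Applied to the loop from the question, collecting the per-file frames into one dataframe (a sketch that reuses all_files and columns_to_use as defined above):
frames = []
for filename in all_files:
    # keep only the wanted columns that actually exist in this file
    frames.append(pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use))
combined = pd.concat(frames, ignore_index=True)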

Prepending to a dask dataframe in parquet storage

What is the recommended way to prepend data (a pandas dataframe) to an existing dask dataframe in parquet storage?
This test, for example, fails intermittently:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal
def test_dask_intermittent_error(tmp_path):
    df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
                      index=pd.date_range('20130101', periods=100, freq='T'))
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    _ = (dd1
         .append(dd.read_parquet(tmp_path))
         .to_parquet(tmp_path))
    assert_frame_equal(df,
                       dd.read_parquet(tmp_path).compute())
gives
.venv/lib/python3.7/site-packages/dask/dataframe/core.py:3812: in to_parquet
return to_parquet(self, path, *args, **kwargs)
...
fastparquet.util.ParquetException: Metadata parse failed: /private/var/folders/_1/m2pd_c9d3ggckp1c1p0z3v8r0000gn/T/pytest-of-jfaleiro/pytest-138/test_dask_intermittent_error0/part.0.parquet
We considered relying on a simple append and figuring out order after retrieval, but this seems to be hitting a different bug, i.e.:
def test_dask_prepend_as_append(tmp_path):
    df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
                      index=pd.date_range('20130101', periods=100, freq='T'))
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    dd1.to_parquet(tmp_path, append=True)
    assert_frame_equal(df,
                       dd.read_parquet(tmp_path).compute())
gives
ValueError: Appended divisions overlapping with previous ones.
If you avoid writing a "_metadata" file (which you will with the default settings and pyarrow), then you could simply rename your files to ensure that the prepended partition sorts before the rest when the directory is listed. Normally, Dask begins naming the part files with a serial number 0.
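A rough sketch of that renaming idea (assumptions: single-file partitions named part.N.parquet, and a dask version whose to_parquet accepts write_metadata_file; the helper name and scratch directory are made up for illustration):
import glob
import os
import shutil
import dask.dataframe as dd

def prepend_partition(new_pdf, path, scratch="_prepend_tmp"):
    # write the new data to a scratch directory, without a _metadata file
    dd.from_pandas(new_pdf, npartitions=1).to_parquet(scratch, write_metadata_file=False)
    # shift the existing part files up by one index, highest index first to avoid collisions
    indices = sorted((int(os.path.basename(p).split(".")[1])
                      for p in glob.glob(os.path.join(path, "part.*.parquet"))), reverse=True)
    for i in indices:
        os.rename(os.path.join(path, "part.%d.parquet" % i),
                  os.path.join(path, "part.%d.parquet" % (i + 1)))
    # the new partition becomes part.0, so it sorts before the rest on read
    os.rename(os.path.join(scratch, "part.0.parquet"), os.path.join(path, "part.0.parquet"))
    shutil.rmtree(scratch)
This only fixes the on-disk order; whether dask preserves it on read still depends on how your version sorts the part files.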

How to add/edit text in pandas.io.parsers.TextFileReader

I have a large CSV file. Since it is large (almost 7 GB), it cannot be loaded into a single pandas dataframe.
import pandas as pd
df1 = pd.read_csv('tblViewPromotionDataVolume_202004070600.csv', sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
df1 is of type pandas.io.parsers.TextFileReader
Now I want to edit/add/insert some text (a new row) into this file, and convert it back to a pandas dataframe. Please let me know of possible solutions. Thanks in advance.
Each iteration yields a DataFrame called chunk, so do your processing on it; then write it to a file with DataFrame.to_csv, using mode='a' for append mode:
import pandas as pd
import os
infile = 'tblViewPromotionDataVolume_202004070600.csv'
outfile = 'out.csv'
df1 = pd.read_csv(infile, sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
    # processing with chunk

    # https://stackoverflow.com/a/30991707/2901002
    # if the output file does not exist, write the header with the first chunk
    if not os.path.isfile(outfile):
        chunk.to_csv(outfile, sep='\t')
    else:  # else it exists, so append without writing the header
        chunk.to_csv(outfile, sep='\t', mode='a', header=False)
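For the "insert a new row" part of the question, one option is to append a one-row DataFrame to a chunk before writing it out (a sketch; the column names and values below are placeholders):
new_row = pd.DataFrame([{'col_a': 'value1', 'col_b': 'value2'}])
chunk = pd.concat([chunk, new_row], ignore_index=True)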

Merge multiple csv files in python

Need help with merging multiple csv files.
import pandas as pd
import glob
import csv
r1 = glob.glob("path/*.csv")
wr1 = csv.writer(open("path/merge.csv", 'wb'), delimiter=',')
for files in r1:
    rd = csv.reader(open(files, 'r'), delimiter=',')
    for row in rd:
        print(row)
        wr1.writerow(row)
I am getting a type error:
TypeError: a bytes-like object is required, not 'str'
Not sure how to resolve this.
Using pandas you can do it like this:
import glob
import pandas as pd
dfs = glob.glob('path/*.csv')
result = pd.concat([pd.read_csv(df) for df in dfs], ignore_index=True)
result.to_csv('path/merge.csv', index=False)
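For reference, the TypeError in the original code comes from opening the output file in binary mode: in Python 3 csv.writer expects a text-mode file, so with 'wb' every writerow(row) fails on str data. If you prefer to stay with the csv module, something along these lines (a sketch) avoids it:
import csv
import glob
r1 = glob.glob("path/*.csv")
with open("path/merge.csv", 'w', newline='') as out:  # text mode, not 'wb'
    wr1 = csv.writer(out, delimiter=',')
    for files in r1:
        with open(files, 'r', newline='') as src:
            for row in csv.reader(src, delimiter=','):
                wr1.writerow(row)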

Concatenate a pandas dataframe to CSV file without reading the entire file

I have a quite large CSV file. I also have a pandas dataframe that has exactly the same columns as the CSV file.
I checked on Stack Overflow and several answers suggest calling read_csv, concatenating the read dataframe with the current one, and then writing it back to a CSV file.
But for a large file I don't think that is the best way.
Can I concatenate a pandas dataframe to an existing CSV file without reading the whole file?
Update: Example
import pandas as pd
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df1.to_csv('my.csv')
df2 = pd.DataFrame({'a': 3, 'b': 4}, index=[1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv #Windows machine use 'type' command or on unix use 'cat'
Output:
,a,b
0,1,2
1,3,4
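One caveat: mode='a' simply appends rows as text, so the appended frame's columns must already be in the same order as the existing file; to_csv does not realign them against the header that is already on disk.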