Prepending to a dask dataframe in parquet storage - pandas

What is the recommended way to prepend data (a pandas dataframe) to an existing dask dataframe in parquet storage?
This test, for example, fails intermittently:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

def test_dask_intermittent_error(tmp_path):
    df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
                      index=pd.date_range('20130101', periods=100, freq='T'))
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    _ = (dd1
         .append(dd.read_parquet(tmp_path))
         .to_parquet(tmp_path))
    assert_frame_equal(df,
                       dd.read_parquet(tmp_path).compute())
gives
.venv/lib/python3.7/site-packages/dask/dataframe/core.py:3812: in to_parquet
return to_parquet(self, path, *args, **kwargs)
...
fastparquet.util.ParquetException: Metadata parse failed: /private/var/folders/_1/m2pd_c9d3ggckp1c1p0z3v8r0000gn/T/pytest-of-jfaleiro/pytest-138/test_dask_intermittent_error0/part.0.parquet
We considered relying on a simple append and figuring out order after retrieval, but this seems to be hitting a different bug, i.e.:
def test_dask_prepend_as_append(tmp_path):
    df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
                      index=pd.date_range('20130101', periods=100, freq='T'))
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    dd1.to_parquet(tmp_path, append=True)
    assert_frame_equal(df,
                       dd.read_parquet(tmp_path).compute())
gives
ValueError: Appended divisions overlapping with previous ones.

If you avoid writing a "_metadata" file (which is the case with the default settings and pyarrow), you can simply rename the existing files so that the prepended partition comes before the rest when the directory is listed by glob. Dask normally numbers the part files starting from 0.
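The renaming step could look roughly like the sketch below. It is an illustration only: the helper name `shift_part_files` is hypothetical, and it assumes the files follow the `part.N.parquet` naming seen in the error message above, that there is no "_metadata" file, and that the reader orders files by their number.

```python
from pathlib import Path
import re

# Assumed naming scheme: part.0.parquet, part.1.parquet, ...
PART_PATTERN = re.compile(r"^part\.(\d+)\.parquet$")

def shift_part_files(directory, offset=1):
    """Rename part.N.parquet to part.(N+offset).parquet and return the new names."""
    directory = Path(directory)
    numbered = []
    for path in directory.iterdir():
        match = PART_PATTERN.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    renamed = []
    # Rename the highest number first so no rename overwrites an existing file.
    for number, path in sorted(numbered, reverse=True):
        target = directory / f"part.{number + offset}.parquet"
        path.rename(target)
        renamed.append(target.name)
    return sorted(renamed, key=lambda name: int(PART_PATTERN.match(name).group(1)))
```

After the shift, the slot `part.0.parquet` is free, so the data to prepend can be written under that name.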

Related

adding pandas df to dask df

One of the recent problems with dask I encountered was encodings that take a lot of time and I wanted to speed them up.
Problem: given a dask df (ddf), encode it, and return ddf.
Here is some code to start with:
# !pip install feature_engine
import dask.dataframe as dd
import pandas as pd
import numpy as np
from feature_engine.encoding import CountFrequencyEncoder
df = pd.DataFrame(np.random.randint(1, 5, (100,3)), columns=['a', 'b', 'c'])
# make it object cols
for col in df.columns:
    df[col] = df[col].astype(str)
ddf = dd.from_pandas(df, npartitions=3)
x_freq = ddf.copy()
for col_idx, col_name in enumerate(x_freq.columns):
    freq_enc = CountFrequencyEncoder(encoding_method='frequency')
    col_to_encode = x_freq[col_name].to_frame().compute()
    encoded_col = freq_enc.fit_transform(col_to_encode).rename(columns={col_name: col_name + '_freq'})
    x_freq = dd.concat([x_freq, encoded_col], axis=1)
x_freq.head()
It will run fine as I would expect, adding pandas df to dask df - no problem.
But when I try another ddf, there is an error:
x_freq = x.copy()
# npartitions = x_freq.npartitions
# x_freq = x_freq.repartition(npartitions=npartitions).reset_index(drop=True)
for col_idx, col_name in enumerate(x_freq.columns):
    freq_enc = CountFrequencyEncoder(encoding_method='frequency')
    col_to_encode = x_freq[col_name].to_frame().compute()
    encoded_col = freq_enc.fit_transform(col_to_encode).rename(columns={col_name: col_name + '_freq'})
    x_freq = dd.concat([x_freq, encoded_col], axis=1)
    break
x_freq.head()
Error is happening during concat:
ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1
This is how I load "error" ddf:
ddf = dd.read_parquet(os.path.join(dir_list[0], '*.parquet'), engine='pyarrow').repartition(partition_size='100MB')
I read that I should try repartition and/or reset index and/or use assign. None of these worked.
x_freq = x.copy()
in the second example is similar to:
x_freq = ddf.copy()
in the first example in a sense that x is just some ddf I'm trying to encode but it would be a lot of code to define it here.
Can anyone help, please?
Here's what I think might be going on.
Your parquet file probably doesn't have divisions information within it. You thus cannot just dd.concat, since it's not clear how the partitions align.
You can check this by
x_freq.known_divisions # is likely False
x_freq.divisions # is likely (None, None, None, None)
Since unknown divisions are the problem, you can re-create the issue by using the synthetic data in the first example
x_freq = ddf.clear_divisions().copy()
You might solve this problem by re-setting the index:
x_freq.reset_index().set_index(index_column_name)
where index_column_name is the name of the index column.
Consider also saving the data with the correct index afterwards so that it doesn't have to be calculated each time.
Note 1: Parallelization
By the way, since you're computing each column before working with it, you're not really utilizing dask's parallelization abilities. Here is a workflow that might utilize parallelization a bit better:
def count_frequency_encoder(s):
    return s.replace(s.value_counts(normalize=True).compute().to_dict())

frequency_columns = {
    f'{col_name}_freq': count_frequency_encoder(x_freq[col_name])
    for col_name in x_freq.columns}
x_freq = x_freq.assign(**frequency_columns)
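To make it easier to see what the frequency encoding in Note 1 produces, here is a pandas-only sketch on a tiny series. It uses `Series.map` rather than `replace`, and the function name `frequency_encode` is made up for this illustration; it is not part of dask or feature_engine.

```python
import pandas as pd

def frequency_encode(s):
    """Replace each value in a pandas Series with its relative frequency."""
    # value_counts(normalize=True) gives the share of rows holding each value.
    frequencies = s.value_counts(normalize=True)
    # map looks every value up in that table, row by row.
    return s.map(frequencies)

example = pd.Series(['a', 'a', 'b', 'c'])
encoded = frequency_encode(example)
print(encoded.tolist())
```

Here 'a' fills half of the rows and 'b' and 'c' a quarter each, so the encoded values are 0.5, 0.5, 0.25, 0.25.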
Note 2: to_frame
A tiny tip:
x_freq[col_name].to_frame()
is equivalent to
x_freq[[col_name]]

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a file, but this results in a file that's about 4GB. I'd instead like this split into many ~100MB files.
I ended up using Dask:
import dask.dataframe as da
ddf = da.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust this to get files to the desired size.
One other option is to use the partition_cols option in pyarrow.parquet.write_to_dataset():
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
# df is your dataframe
n_partition = 100
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
# Table lives in the top-level pyarrow namespace
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])
Slice the dataframe and save each chunk to a folder, using only the pandas API (no dask, and no direct pyarrow calls).
You can pass extra params to the parquet engine if you wish.
import os

def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_wargs):
    """Writes pandas DataFrame to parquet format with pyarrow.
    Args:
        df: DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
        parquet_wargs: extra keyword arguments passed to DataFrame.to_parquet
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = i // chunk_size
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_wargs)
Keep each parquet size small, around 128MB. To do this:
import dask.dataframe as dd
# Get number of partitions required for nominal 128MB partition size
# "+ 1" for non full partition
size128MB = int(df.memory_usage().sum()/1e6/128) + 1
# Read
ddf = dd.from_pandas(df, npartitions=size128MB)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
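The partition count above can also be computed with a ceiling division, which avoids an extra partition when the size is an exact multiple of the target. This is a small sketch; the helper name `partitions_for_target_size` is hypothetical. Note that the in-memory size is only a proxy for the size of the compressed files on disk.

```python
def partitions_for_target_size(total_bytes, target_bytes=128 * 1024 ** 2):
    """Return how many equal partitions keep each one at or below target_bytes."""
    if total_bytes <= 0:
        # Always keep at least one partition, even for an empty frame.
        return 1
    # Integer ceiling division: rounds up without going through floats.
    return (total_bytes + target_bytes - 1) // target_bytes
```

With a pandas DataFrame this might be called as `partitions_for_target_size(int(df.memory_usage().sum()))` and the result passed as `npartitions`.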
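# Write all_df in slices of `chunk` rows, one parquet file per slice
chunk = 200000
i = 0
n = 0
# use "<" rather than "<=" so no empty file is written after the last slice
while i < len(all_df):
    j = i + chunk
    print((i, j))
    tmpdf = all_df[i:j]
    tmpdf.to_parquet(path=f"./append_data/part.{n}.parquet", engine='pyarrow', compression='snappy')
    i = j
    n = n + 1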

How to add/edit text in pandas.io.parsers.TextFileReader

I have a large CSV file. Since it is so large (almost 7 GB), it cannot be loaded into a single pandas dataframe.
import pandas as pd
df1 = pd.read_csv('tblViewPromotionDataVolume_202004070600.csv', sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print (chunk)
df1 is of type pandas.io.parsers.TextFileReader
Now I want to edit/add/insert some text (a new row) into this file, and convert it back to a pandas dataframe. Please let me know of possible solutions. Thanks in advance.
Each item yielded by the reader is a DataFrame (called chunk here), so do your processing on it, then write it to the output file with DataFrame.to_csv, using mode='a' to append:
import pandas as pd
import os
infile = 'tblViewPromotionDataVolume_202004070600.csv'
outfile = 'out.csv'
df1 = pd.read_csv(infile, sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print (chunk)
    #processing with chunk
    # https://stackoverflow.com/a/30991707/2901002
    # if file does not exist write header with first chunk
    if not os.path.isfile(outfile):
        chunk.to_csv(outfile, sep='\t')
    else: # else it exists so append without writing the header
        chunk.to_csv(outfile, sep='\t', mode='a', header=False)
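For the "add a new row" part of the question, one possible variation is sketched below. It copies the file chunk by chunk and then appends a single extra row. The function name `rewrite_csv_in_chunks` and the `new_row` dict are assumptions made for this illustration, and unlike the code above it writes no index column (`index=False`).

```python
import pandas as pd

def rewrite_csv_in_chunks(infile, outfile, new_row, chunksize=1000, sep='\t'):
    """Copy infile to outfile chunk by chunk, then append new_row (a dict)."""
    columns = None
    reader = pd.read_csv(infile, sep=sep, chunksize=chunksize)
    for n, chunk in enumerate(reader):
        columns = list(chunk.columns)
        # (any per-chunk processing would go here)
        # The first chunk creates or overwrites the file and writes the header;
        # later chunks are appended without a header.
        chunk.to_csv(outfile, sep=sep, index=False,
                     mode='w' if n == 0 else 'a', header=(n == 0))
    if columns is not None:
        # Reindex so the new row follows the column order of the file.
        extra = pd.DataFrame([new_row]).reindex(columns=columns)
        extra.to_csv(outfile, sep=sep, index=False, mode='a', header=False)
```

Because the first chunk opens the output with mode='w', a leftover output file from an earlier run is replaced rather than appended to.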

How do I combine multiple pandas dataframes into an HDF5 object under one key/group?

I am parsing data from a large csv sized 800 GB. For each line of data, I save this as a pandas dataframe.
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse create dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this into an HDF5 format, and query the h5 as if it was the entire csv file.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
My approach so far has been:
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse create dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to save each dataframe df into the HDF5 under one key. However, this fails:
Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse create dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df]) # creates one big DataFrame
and now store into HDF5 format
store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
However, I don't think I have the RAM/storage to save all csv lines into total_df into HDF5 format.
So, how do I append each "single-line" df into an HDF5 so that it ends up as one big dataframe (like the original csv)?
EDIT: Here's a concrete example of a csv file with different data types:
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....
Your code should work. Can you try the following code:
import pandas as pd
import numpy as np
store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]
for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
store.close()
If the code works, then there is something wrong with your data.
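This is not part of the answer above, but one possible way to reduce the number of append calls while keeping memory bounded is to collect parsed rows into fixed-size batches and append one DataFrame per batch. The helper below is a hypothetical, standard-library-only sketch of the batching step; the batch size is an arbitrary assumption to tune.

```python
from itertools import islice

def batched_rows(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        # islice pulls at most batch_size items without reading ahead further.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch (for example a list of per-line dictionaries) could then be turned into one DataFrame and passed to `store.append` under the same key.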

Reading variable column and row structure to Pandas by column amount

I need to create a Pandas DataFrame from a large file with space-delimited values, where the row structure depends on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
The values may be on one line or several; line breaks are ignored.
End result looks like this, if number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.
Currently I split the file into one big list and split it into rows:
from itertools import izip_longest  # Python 2; named zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
(code is from itertools examples)
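The recipe above is written for Python 2. On Python 3, `izip_longest` is named `zip_longest`; the sketch below is the same recipe with only that import changed, plus a small usage example.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length tuples of n items, padding the last one."""
    # The same iterator object repeated n times: each output tuple
    # consumes n consecutive items from the input.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

rows = list(grouper(3, 'ABCDEFG', 'x'))
print(rows)  # [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```

The result is lazy, so rows are only produced as they are consumed, but turning it into a list materializes every row at once.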
Problem is, I end up with multiple copies of the data in memory. With 500MB+ files this eats up the memory fast and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there another way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re
filename = 'raw_data.csv'

# Python 2 only: the built-in `file` type does not exist in Python 3
class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []
    def next(self):
        if not self.buffer:
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        if self.buffer:
            # pop from the front so groups keep their order within the line
            line = self.buffer.pop(0)
        return line

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
print(len(df))
When I try using FileLike on a 5.8M file (consisting of 200000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import sys
import re
filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace = True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to separate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv, where n is the number of columns:
s = pd.read_csv(file_name, lineterminator=' ', header=None)
df = pd.DataFrame(s.values.reshape(len(s) // n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s) // 3, 3))
Out[3]:
0 1 2
0 2008231 4891866 383842
1 2036693 4924388 375170
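The reshape step can also be illustrated without `read_csv`, using NumPy only. This is a sketch under the assumption that the whole text fits in memory; the function name `reshape_flat_values` is hypothetical. It still builds an intermediate list of strings, so it shows the reshape idea rather than a memory-optimal reader.

```python
import numpy as np

def reshape_flat_values(text, n_columns):
    """Parse whitespace-separated numbers and arrange them in rows of n_columns."""
    # split() treats spaces and line breaks alike, so line breaks are ignored.
    values = np.array(text.split(), dtype=float)
    if values.size % n_columns != 0:
        raise ValueError("number of values is not a multiple of n_columns")
    # -1 lets NumPy work out the number of rows.
    return values.reshape(-1, n_columns)

raw = "2008231.0 4891866.0 383842.0\n2036693.0 4924388.0 375170.0"
table = reshape_flat_values(raw, 3)
print(table.shape)  # (2, 3)
```

The resulting array can be passed directly to `pd.DataFrame(table)`.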