pandas df.to_parquet write to multiple smaller files - pandas

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a file, but this results in a file that's about 4GB. I'd instead like this split into many ~100MB files.

I ended up using Dask:
import dask.dataframe as da
ddf = da.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust this to get files to the desired size.

One other option is to use the partition_cols option in pyarrow.parquet.write_to_dataset():
import pyarrow.parquet as pq
import numpy as np
# df is your dataframe
n_partition = 100
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
table = pq.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])

Slice the dataframe and save each chunk to a folder, using just pandas api (without dask or pyarrow).
You can pass extra params to the parquet engine if you wish.
def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_wargs):
"""Writes pandas DataFrame to parquet format with pyarrow.
Args:
df: DataFrame
target_dir: local directory where parquet files are written to
chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
"""
for i in range(0, len(df), chunk_size):
slc = df.iloc[i : i + chunk_size]
chunk = int(i/chunk_size)
fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
slc.to_parquet(fname, engine="pyarrow", **parquet_wargs)

Keep each parquet size small, around 128MB. To do this:
import dask.dataframe as dd
# Get number of partitions required for nominal 128MB partition size
# "+ 1" for non full partition
size128MB = int(df.memory_usage().sum()/1e6/128) + 1
# Read
ddf = dd.from_pandas(df, npartitions=size128MB)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)

cunk = 200000
i = 0
n = 0
while i<= len(all_df):
j = i + cunk
print((i, j))
tmpdf = all_df[i:j]
tmpdf.to_parquet(path=f"./append_data/part.{n}.parquet",engine='pyarrow', compression='snappy')
i = j
n = n + 1

Related

How can I sort and replace the rank of each item instead of it's value before sorting in a merged final csv?

I have 30 different csv files and each row begin with date and some similar features measured for each of 30 items daily. The value of each feature is not important, but the rank they gain after sorting in each day is important. How can I have one merged csv from 30 separate csv with the rank of each feature?
If your files are all the same format, you can do a loop and consolidate in a single data frame. From there you can manipulate as needed:
import pandas as pd
import glob
path = r'C:\path_to_files\files' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Other examples of the same thing: Import multiple csv files into pandas and concatenate into one DataFrame.

Dask appropriate for my goal? ```Compute()``` taking very long

I am doing the following in Dask as the df dataframe has 7 million rows and 50 columns so pandas is extremely slow. However, I might not be using Dask correctly or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which is mainly creating some new columns. And then eventually saving the df (I am saving to csv but I have also tried parquet). However, before I save, I believe I have to do compute(). And compute() is taking very long -- I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processor and dask is using 24 partitions -- I have 128 GB of memory if this helps too)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match

from dask_ml.preprocessing import LabelEncoder



df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
df['actual__adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']

df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
df = df.sort_values('region')
df['first_established'] = 1 * (df['region_d']==df['region_d'].min())
df['last_established'] = 1 * (df['region_d']==df['region_d'].max())
df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
df['time_1'] = df['time_book'].shift(1, fill_value=0)
df['time_2'] = df['time_book'].shift(-1, fill_value=0)
df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
return df

df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.

Extracting column from Array in python

I am beginner in Python and I am stuck with data which is array of 32763 number, separated by comma. Please find the data here data
I want to convert this into two column 1 from (0:16382) and 2nd column from (2:32763). in the end I want to plot column 1 as x axis and column 2 as Y axis. I tried the following code but I am not able to extract the columns
import numpy as np
import pandas as pd
import matplotlib as plt
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
df = pd.DataFrame(data.flatten())
print(df)
and then I want to write the data in some file let us say data1 in the format as shown in attached pic
It is hard to answer without seeing the format of your data, but you can try
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
print(data.shape) # here we check we got something useful
# this should split data into x,y at position 16381
x = data[:16381]
y = data[16381:]
# now you can create a dataframe and print to file
df = pd.DataFrame({'x':x, 'y':y})
df.to_csv('data1.csv', index=False)
Try this.
#input as dataframe df, its chunk_size, extract output as list. you can mention chunksize what you want.
def split_dataframe(df, chunk_size = 16382):
chunks = list()
num_chunks = len(df) // chunk_size + 1
for i in range(num_chunks):
chunks.append(df[i*chunk_size:(i+1)*chunk_size])
return chunks
or
np.array_split

Prepending to a dask dataframe in parquet storage

What is the recommended way to prepend data (a pandas dataframe) to an existing dask dataframe in parquet storage?
This test, for example, fails intermittently:
import dask.dataframe as dd
import numpy as np
import pandas as pd
def test_dask_intermittent_error(tmp_path):
df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
index=pd.date_range('20130101', periods=100, freq='T'))
dfs = np.array_split(df, 2)
dd1 = dd.from_pandas(dfs[0], npartitions=1)
dd2 = dd.from_pandas(dfs[1], npartitions=1)
dd2.to_parquet(tmp_path)
_ = (dd1
.append(dd.read_parquet(tmp_path))
.to_parquet(tmp_path))
assert_frame_equal(df,
dd.read_parquet(tmp_path).compute())
gives
.venv/lib/python3.7/site-packages/dask/dataframe/core.py:3812: in to_parquet
return to_parquet(self, path, *args, **kwargs)
...
fastparquet.util.ParquetException: Metadata parse failed: /private/var/folders/_1/m2pd_c9d3ggckp1c1p0z3v8r0000gn/T/pytest-of-jfaleiro/pytest-138/test_dask_intermittent_error0/part.0.parquet
We considered relying on a simple append and figuring out order after retrieval, but this seems to be hitting a different bug, i.e.:
def test_dask_prepend_as_append(tmp_path):
df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
index=pd.date_range('20130101', periods=100, freq='T'))
dfs = np.array_split(df, 2)
dd1 = dd.from_pandas(dfs[0], npartitions=1)
dd2 = dd.from_pandas(dfs[1], npartitions=1)
dd2.to_parquet(tmp_path)
dd1.to_parquet(tmp_path, append=True)
assert_frame_equal(df,
dd.read_parquet(tmp_path).compute())
gives
ValueError: Appended divisions overlapping with previous ones.
If you avoid using a "_metadata" file when writing (which you will with the default settings and pyarrow), then you could simply rename your files, to assure that the prepended partition occurs before the rest, when listed by glob. Normally, Dask will begin naming with a serial number 0.

Reading variable column and row structure to Pandas by column amount

I need to create a Pandas DataFrame from a large file with space delimited values and row structure that is depended on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
On one line or several, line breaks are ignored.
End result looks like this, if number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows is depended on the number of columns which is stated in the meta part of the file.
Currently I split the file into one big list and split it into rows:
def grouper(n, iterable, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)
(code is from itertools examples)
Problem is, I end up with multiple copies of the data in memory. With 500MB+ files this eats up the memory fast and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there an other way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re
filename = 'raw_data.csv'
class FileLike(file):
""" Modeled after FileWrapper
http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
"""
def __init__(self, *args):
super(FileLike, self).__init__(*args)
self.buffer = []
def next(self):
if not self.buffer:
line = super(FileLike, self).next()
self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
if self.buffer:
line = self.buffer.pop()
return line
with FileLike(filename, 'r') as f:
df = pd.read_table(f, header=None, delimiter='\s+')
print(len(df))
When I try using FileLike on a 5.8M file (consisting of 200000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import sys
import re
filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace = True, backup='.bak'):
for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
df = pd.read_table(f, header=None, delimiter='\s+')
print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to seperate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None)
df = pd.DataFrame(s.values.reshape(len(s)/n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]:
0 1 2
0 2008231 4891866 383842
1 2036693 4924388 375170