How do I convert multiple Pandas DFs into a single Spark DF? - pandas

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()
for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)
def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])
processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).

This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DF very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches Spark to add a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (assumed to share the same columns) and returns a single Spark DF.

I solved this by writing a function like this:
from pyspark.sql import Row
def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]
    rows = list()
    for index, series in pd_df.iterrows():
        # Takes a row of a df, exports it as a dict, and then passes an unpacked-dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series-> dict-> Row operation, but this got me through.
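As the answer suggests, the Series -> dict -> Row round trip is probably not the cheapest option. One possibility, sketched here under assumptions, is to let pandas build all the row dictionaries at once with DataFrame.to_dict(orient="records") instead of looping with iterrows(). The name pd_df_to_records is made up for this illustration; the returned dictionaries could then be unpacked into Row(**record) inside the same flatMap.

```python
import pandas as pd

def pd_df_to_records(rdd_row):
    """Turn a (path, pandas DataFrame) tuple into a list of plain dicts.

    Hypothetical helper: every dict has string keys, so the caller can
    unpack it into pyspark.sql.Row(**record).
    """
    _path, pd_df = rdd_row
    # Column labels are cast to str because Row(**kwargs) needs string keys.
    renamed = pd_df.rename(columns=str)
    return renamed.to_dict(orient="records")

# Small local check with an in-memory frame (no Spark needed).
example = ("a.xlsx", pd.DataFrame({"x": [1, 2], 3: ["a", "b"]}))
records = pd_df_to_records(example)
```

Unlike iterrows(), this does not upcast each row to a single dtype, so mixed numeric and string columns keep their own types.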

Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]
sdf = None
for df in dfs:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]
sdf = None
for name in names:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
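The loop above already builds every pandas frame on the driver, so if they all fit in driver memory an alternative sketch is to concatenate in pandas first and call spark.createDataFrame only once. combine_frames is a hypothetical helper name, and the Spark call is left as a comment because it needs a live session.

```python
import pandas as pd

def combine_frames(frames):
    """Stack pandas DataFrames that share the same columns into one frame."""
    return pd.concat(frames, ignore_index=True)

df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pd.DataFrame({"a": [3], "b": ["z"]})
combined = combine_frames([df1, df2])

# With an active SparkSession named `spark` (assumed, not created here):
# sdf = spark.createDataFrame(combined)
```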

Related

using Dask to load many CSV files with different columns

I have many CSV files saved in AWS s3 with the same first set of columns and a lot of optional columns. I don't want to download them one by one and then use pd.concat to combine them, since this takes a lot of time and the result has to fit into the computer's memory. Instead, I'm trying to use Dask to load and sum up all of these files, where missing optional columns should be treated as zeros.
If all columns were the same I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
df.groupby(["index"]).sum().compute()
but it doesn't work with files that don't have the same number of columns, since Dask assumes it can use the columns of the first file for all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all headers in advance (for example by writing them as I produce and save all of the small CSVs) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder whether this solution is actually faster/better than using pandas directly. In general, I'd like to know the best (mainly fastest) way to tackle this issue.
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import pandas as pd
import dask.dataframe as dd
from dask import delayed
list_files = [...] # create a list of files inside s3 bucket
list_cols_to_keep = ['col1', 'col2']
@delayed
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[list_cols_to_keep]
    # add any other standardization routines, e.g. dtype conversion
    return df
ddf = dd.from_delayed([standard_csv(f) for f in list_files])
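The question asks for missing optional columns to be treated as zeros rather than dropped. A minimal pandas-only sketch of that standardization step, assuming the full column list is known up front (all_columns and the column names below are illustrative), uses DataFrame.reindex with fill_value=0. The same body could go inside the delayed function above.

```python
import io
import pandas as pd

# Illustrative column set: "index" and "base" are always present,
# "opt1" and "opt2" are optional. These names are assumptions.
all_columns = ["index", "base", "opt1", "opt2"]

def standardize(df, columns=all_columns):
    """Return df with exactly `columns`, adding absent ones filled with 0."""
    return df.reindex(columns=columns, fill_value=0)

# Two small CSVs with different optional columns, read from memory.
csv_a = io.StringIO("index,base,opt1\n1,10,5\n2,20,6\n")
csv_b = io.StringIO("index,base,opt2\n1,1,7\n")
frames = [standardize(pd.read_csv(f)) for f in (csv_a, csv_b)]

# Sum per index across files, as in the question's groupby.
total = pd.concat(frames).groupby("index").sum()
```

Note that reindex only fills columns that are entirely absent from a file; blank cells inside an existing column would still need fillna(0).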
I ended up giving up on Dask since it was too slow. Instead I used aws s3 sync to download the data and multiprocessing.Pool to read and concatenate the files:
import glob
import os
from multiprocessing import Pool
import pandas as pd
from tqdm import tqdm
# download:
def sync_outputs(out_path):
    # out_path is assumed to be the S3 location to sync from
    local_dir_path = "/tmp/outputs/"
    os.makedirs(local_dir_path, exist_ok=True)
    cmd = f'aws s3 sync {out_path} {local_dir_path} > /tmp/null' # the last part is to avoid prints
    os.system(cmd)
    return local_dir_path
# concat:
def read_csv(path):
    return pd.read_csv(path, index_col=0)
def read_csvs_parallel(local_paths):
    with Pool(os.cpu_count()) as p:
        csvs = list(tqdm(p.imap(read_csv, local_paths), desc='reading csvs', total=len(local_paths)))
    return csvs
# all together:
def concat_csvs_parallel(out_path):
    local_dir_path = sync_outputs(out_path)
    # list the downloaded files before handing them to the pool
    local_paths = glob.glob(os.path.join(local_dir_path, '*.csv'))
    csvs = read_csvs_parallel(local_paths)
    df = pd.concat(csvs)
    return df
aws s3 sync downloaded about 1000 files (~1KB each) in about 30 seconds, and reading them with multiprocessing (8 cores) took 3 seconds. This was much faster than also downloading the files using multiprocessing (almost 2 minutes for 1000 files).

Repeat the task of exporting multiple Pandas dataframes into multiple csv-files

I'm somewhat new to Pandas/Python (more into SAS), but my task is the following: I have four Pandas dataframes, and I would like to export each of them into a separate csv-file. The name of the csv should be the same as the original dataframe (forsyning.csv, inntak.csv etc).
So far I've made a list with the names of the dataframes, and then tried to put the list through a for-loop in order to generate one csv after another. But I've only made it half-way through. My code so far:
df_list = ['forsyning', 'inntak', 'behandling', 'transport']
for i in df_list:
    i.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
What I believe is missing is a proper reference where it says "i.to_csv" in my code above, as it now only gives me the error "'str' object has no attribute 'to_csv'". I just don't know how to twist this code the right way, and I'd appreciate any advice on this matter. Thanks.
If you need to write a list of DataFrames to files, you need 2 lists: the first with the DataFrame objects and the second with the new file names as strings:
df_list = [forsyning, inntak, behandling, transport]
names = ['forsyning', 'inntak', 'behandling', 'transport']
Then zip both lists and write each df:
for i, df in zip(names, df_list):
    df.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
Or use a dictionary of DataFrames and loop over dict.items():
df_dict = {'forsyning': forsyning, 'inntak': inntak,
           'behandling': behandling, 'transport': transport}
for i, df in df_dict.items():
    df.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
Your df_list should contain DataFrame objects, but it currently contains the DataFrame names as strings.
I believe your df_list should be:
df_list = [forsyning, inntak, behandling, transport]

Is there a way to export pandas dataframe info -- df.info() into an excel file?

I have a .csv file locally that I read with pandas. I want to write the df.info() result to an Excel file. It looks like df.info().to_excel does not work, as it is not supported. Is there any way to do this?
I tried df.info().to_excel
import pandas as pd
from openpyxl.workbook import Workbook
df = pd.read_csv("file.csv", sep='|', error_bad_lines=False)
writer = pd.ExcelWriter('output.xlsx')
df.info()
df.info().to_excel(writer, sheet_name='info')
I want to show the dataframe info output in a single sheet of the Excel file.
The easiest way for me is to get the same information in dataframes, but separately:
df_datatypes = pd.DataFrame(df.dtypes)
df_non_null_count = df.count()
Then write to Excel as usual.
to_excel is a method of DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html), and DataFrame.info() doesn't return a DataFrame: it prints the summary and returns None.
You can write the info to a text file like so:
import io
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()
with open("df_info.txt", "w", encoding="utf-8") as f:
    f.write(s)
You can modify this code by removing last two lines and parsing the s variable and creating a DataFrame out of it (in the way you would like this to appear in the excel file) and then use the to_excel() method.
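One simple way to follow that suggestion, sketched here, is to skip real parsing and store each line of the captured text as one row of a single-column DataFrame. info_to_frame and the column name "info" are made up for this example, and the to_excel call is commented out because it needs an Excel engine such as openpyxl to be installed.

```python
import io
import pandas as pd

def info_to_frame(df):
    """Capture df.info() output and return it as a one-column DataFrame."""
    buffer = io.StringIO()
    df.info(buf=buffer)
    lines = buffer.getvalue().splitlines()
    return pd.DataFrame({"info": lines})

sample = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"]})
info_df = info_to_frame(sample)

# info_df.to_excel("output.xlsx", sheet_name="info", index=False)
```

This keeps the text exactly as df.info() prints it; splitting the lines into separate columns would need extra parsing.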
I agree with @yl_low but you could have a more elegant solution as shown:
def get_dataframe_info(df):
    """
    input
       df -> DataFrame
    output
       df_null_counts -> DataFrame Info (sorted)
    """
    df_types = pd.DataFrame(df.dtypes)
    df_nulls = df.count()
    df_null_count = pd.concat([df_types, df_nulls], axis=1)
    df_null_count = df_null_count.reset_index()
    # Reassign column names
    col_names = ["features", "types", "non_null_counts"]
    df_null_count.columns = col_names
    # Add this to sort
    df_null_count = df_null_count.sort_values(by=["non_null_counts"], ascending=False)
    return df_null_count
You can do this in Python 3.
pd.DataFrame({"name": train.columns, "non-nulls": len(train)-train.isnull().sum().values, "nulls": train.isnull().sum().values, "type": train.dtypes.values}).to_excel("op.xlsx")
Just one line of code (without the non-null column):
df.dtypes.reset_index(name='Dtype').rename(columns={'index': 'Column'}).to_excel('Name.xlsx', sheet_name='info')

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is I have to aggregate the data into one format, and then dump into HDF5. This is ~1 TB-sized data, so I naturally cannot fit this into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data into a single pandas dataframe, I would do this:
import pandas as pd
import csv
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame() # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of table field:value pairs, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
    total_df = pd.concat([total_df, df]) # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"] # define columns
readcsvfile = csv.reader(csvfile) # read in file, if csv
# somehow define empty dask dataframe total_df = dd.Dataframe()?
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of table field:value pairs, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i]) # one line tabular data
    total_df = da.concatenate([total_df, df]) # creates one big dataframe
After creating a ~TB dataframe, I will save into hdf5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular if the concern is that your data is separated by tabs rather than commas this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
Consider dask.bag
If you want to operate on text files line by line, then consider using dask.bag to parse your data, starting from a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have a dask.dataframe, try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
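The parse function used in the dask.bag snippet is not defined in the answer. A hypothetical version, assuming each line has already been split on tabs and that the fields map positionally onto known column names, could look like this. The column names are placeholders.

```python
# Placeholder column names; the real table has its own fields.
COLUMNS = ["COL1", "COL2", "COL3"]

def parse(fields):
    """Map one already-split line (a list of strings) to a record dict.

    Missing trailing fields become None, and surrounding whitespace
    (including the trailing newline) is stripped.
    """
    cleaned = [f.strip() for f in fields]
    padded = cleaned + [None] * (len(COLUMNS) - len(cleaned))
    return dict(zip(COLUMNS, padded))

# Local check on a single raw line, mimicking b.str.split('\t').map(parse).
record = parse("a\tb\tc\n".split("\t"))
```

Because the function works on one line at a time, it can be tested locally on a few sample lines before running it over the full dataset.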

Slice Spark’s DataFrame SQL by row (pyspark)

I have a Spark DataFrame stored as a parquet file, which can be read by Spark as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000] etc. in a Pandas dataframe), since I want to convert each small chunk to a pandas dataframe to work on later. I only know how to do it by sampling a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all rows. Not the ideal way to do things, but it works. Consider filtering your Spark DataFrame down before converting it to Pandas.
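To reproduce iloc-style slices such as [0:4000] and [4000:8000], it helps to compute the chunk boundaries once. The helper below is a plain-Python sketch with a made-up name, chunk_bounds. Using the bounds to filter a Spark DataFrame assumes a consecutive row-number column, which monotonically_increasing_id() alone does not provide, since its values are increasing and unique but not consecutive.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return [start, end) pairs covering n_rows in steps of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]

bounds = chunk_bounds(10000, 4000)

# Hypothetical use, given a DataFrame `df` with a consecutive `row_num` column:
# for start, end in bounds:
#     chunk = df.filter((df.row_num >= start) & (df.row_num < end)).toPandas()
```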