Pandas Dataframe to RDD - pandas

Can I convert a Pandas DataFrame to RDD?
if isinstance(data2, pd.DataFrame):
    print 'is DataFrame'
else:
    print 'is NOT DataFrame'
is DataFrame
Here is the output when trying to use .rdd
dataRDD = data2.rdd
print dataRDD
AttributeError Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
2 print dataRDD
/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2148 return self[name]
2149 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150 (type(self).__name__, name))
2151
2152 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'rdd'
I would like to use a Pandas DataFrame rather than sqlContext, because I'm not sure whether all of the functions available on a Pandas DataFrame are also available in Spark. If this is not possible, can anyone provide an example of using a Spark DataFrame?

Can I convert a Pandas Dataframe to RDD?
Well, yes, you can do it. A Pandas DataFrame
pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print pdDF
## k v
## 0 foo 1
## 1 bar 2
can be converted to a Spark DataFrame
spDF = sqlContext.createDataFrame(pdDF)
spDF.show()
## +---+-+
## | k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+
and after that you can easily access the underlying RDD:
spDF.rdd.first()
## Row(k=u'foo', v=1)
Still, I think you have the wrong idea here. A Pandas DataFrame is a local data structure. It is stored and processed locally on the driver. There is no data distribution or parallel processing, and it doesn't use RDDs (hence no rdd attribute). Unlike a Spark DataFrame, it provides random access capabilities.
A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be accessed using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access, and it is immutable (there is no equivalent of Pandas inplace). Every transformation returns a new DataFrame.
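As an illustration, here is a minimal sketch of both access styles against the spDF built above (the temporary table name kv is arbitrary; registerTempTable matches the Spark 1.x sqlContext used in this answer):
from pyspark.sql.functions import col

# SQL-like API: filter and aggregate without writing raw SQL
spDF.where(col("k") == "foo").groupBy("k").count().show()

# raw SQL: register the DataFrame as a temporary table first
spDF.registerTempTable("kv")
sqlContext.sql("SELECT k, SUM(v) AS total FROM kv GROUP BY k").show()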
If this is not possible, is there anyone that can provide an example of using Spark DF
Not really. It is far too broad a topic for SO. Spark has really good documentation and Databricks provides some additional resources. For starters, check these:
Introducing DataFrames in Spark for Large Scale Data Science
Spark SQL and DataFrame Guide

Related

using Dask to load many CSV files with different columns

I have many CSV files saved in AWS S3 with the same first set of columns and a lot of optional columns. I don't want to download them one by one and then use pd.concat to read them, since this takes a lot of time and has to fit into the computer's memory. Instead, I'm trying to use Dask to load and sum up all of these files, where the optional columns should be treated as zeros.
If all columns were the same I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
df.groupby(["index"]).sum().compute()
but it doesn't work with files that don't have the same number of columns, since Dask assumes it can use the columns of the first file for all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all the headers in advance (for example by writing them out as I produce and save all of the small CSVs) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder if this solution is actually faster/better than just using pandas directly? In general, I'd like to know what is the best (mainly fastest) way to tackle this issue?
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import pandas as pd
import dask.dataframe as dd
from dask import delayed

list_files = [...]  # create a list of files inside the s3 bucket
list_cols_to_keep = ['col1', 'col2']

@delayed
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[list_cols_to_keep]
    # add any other standardization routines, e.g. dtype conversion
    return df

ddf = dd.from_delayed([standard_csv(f) for f in list_files])
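Since the question wants the optional columns treated as zeros, a variant of the same idea (a sketch; all_cols and its column names are placeholders for the full expected column list) can reindex each file instead of selecting columns:
all_cols = ['col1', 'col2', 'opt_col1', 'opt_col2']  # hypothetical full column list

@delayed
def standard_csv_fill(file_path):
    df = pd.read_csv(file_path)
    # columns missing from this file are created and filled with 0
    return df.reindex(columns=all_cols, fill_value=0)

ddf = dd.from_delayed([standard_csv_fill(f) for f in list_files])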
I ended up giving up on Dask since it was too slow, and instead used aws s3 sync to download the data and multiprocessing.Pool to read and concat the files:
import glob
import os
import pandas as pd
from multiprocessing import Pool
from tqdm import tqdm

# download:
def sync_outputs(job_output_dir):
    local_dir_path = "/tmp/outputs/"
    os.makedirs(local_dir_path, exist_ok=True)
    cmd = f'aws s3 sync {job_output_dir} {local_dir_path} > /tmp/null'  # the last part is to avoid prints
    os.system(cmd)
    return glob.glob(os.path.join(local_dir_path, "*.csv"))

# concat:
def read_csv(path):
    return pd.read_csv(path, index_col=0)

def read_csvs_parallel(local_paths):
    with Pool(os.cpu_count()) as p:
        csvs = list(tqdm(p.imap(read_csv, local_paths), desc='reading csvs', total=len(local_paths)))
    return csvs

# all together:
def concat_csvs_parallel(job_output_dir):
    local_paths = sync_outputs(job_output_dir)
    csvs = read_csvs_parallel(local_paths)
    return pd.concat(csvs)
aws s3 sync downloaded about 1000 files (~1KB each) in about 30 seconds, and reading them with multiprocessing (8 cores) took 3 seconds. This was much faster than also downloading the files using multiprocessing (almost 2 minutes for 1000 files).

Why does pySpark crash when using Apache Arrow for string types?

In an attempt to get some outlier plots on large datasets I need to convert a Spark DataFrame to pandas. Turning to Apache Arrow, a simple run crashes my pyspark console when casting a column to string (it works fine without the cast). Why?
Using Python version 3.8.9 (default, Apr 10 2021 15:47:22)
Spark context Web UI available at http://6d0b1018a45a:4040
Spark context available as 'sc' (master = local[*], app id = local-1621164597906).
SparkSession available as 'spark'.
>>> import time
>>> from pyspark.sql.functions import rand
>>> from pyspark.sql import functions as F
>>> spark = SparkSession.builder.appName("Console_Test").getOrCreate()
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/05/16 11:31:03 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> a_df = spark.range(1 << 25).toDF("id").withColumn("x", rand())
>>> a_df = a_df.withColumn("id", F.col("id").cast("string"))
>>> start_t = time.time()
>>> a_pd = a_df.toPandas()
Killed
#
Additionally, I noticed that options such as spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000") are seemingly without effect, as the web UI shows significantly more than 5000 records being assigned to the tasks.
Any indication on how to resolve the pyspark console crash, or how to more directly render large scatter plots, would be highly appreciated - I have (unsuccessfully) tried to find a way to apply Table.to_pandas(split_blocks=True, self_destruct=True) but did not get the Table structure from a Spark DataFrame.
You are trying to convert 33.5 million (2^25) rows into a Pandas dataframe. This will lead to an OutOfMemoryError, as all the data will be transferred to the Spark driver.
A way to find outliers would be to calculate the histogram for the column x and then filter down a_df to the relevant bins in Spark before creating the Pandas dataframe:
hist = a_df.select("x").rdd.flatMap(lambda x: x).histogram(10) #create 10 bins
hist is a tuple of two arrays: the first array contains the boundaries of the bins and the second array contains the numbers of elements in each bin:
([1.7855041778425118e-08,
0.1000000152099446,
0.20000001256484742,
0.30000000991975023,
0.40000000727465307,
0.5000000046295558,
0.6000000019844587,
0.6999999993393615,
0.7999999966942644,
0.8999999940491672,
0.99999999140407],
[3355812,
3356891,
3352364,
3352438,
3357564,
3356213,
3354933,
3355144,
3357241,
3355832])
rand creates uniformly distributed random numbers, so the histogram in this case is not very interesting. But for real-world distributions, the histogram will be useful.
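As a follow-up sketch, assuming the lowest bin is the region of interest, the boundaries in hist can be used to filter in Spark so that only a small slice is converted to pandas:
from pyspark.sql import functions as F

bin_edges, bin_counts = hist
low, high = bin_edges[0], bin_edges[1]  # boundaries of the first bin

# only rows falling into that bin are collected to the driver
outliers_pd = a_df.filter((F.col("x") >= low) & (F.col("x") < high)).toPandas()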

Create Spark DataFrame from Pandas DataFrames inside RDD

I'm trying to convert a Pandas DataFrame on each worker node (an RDD where each element is a Pandas DataFrame) into a Spark DataFrame across all worker nodes.
Example:
import pandas as pd

def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    """
    some additional operations using pandas functionality
    here the data is a pandas dataframe, and I am using some datetime
    indexing which isn't available for spark dataframes
    """
    return data
filelist = ['file1.csv','file2.csv','file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The previous operations work, so I have an RDD of Pandas DataFrames. How can I then convert this into a Spark DataFrame after I'm done with the Pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?
See this answer: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files using pandas, e.g. excel/pickle/any other non-spark format, and converting the resulting RDD into a spark dataframe)
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into arrow record batches which are then directly converted to a pyspark.DataFrame object
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
Pandas DataFrames cannot be converted directly to an RDD. You can, however, create a Spark DataFrame from a Pandas DataFrame:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)

def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])

processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DataFrame very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches Spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (assuming they all have the same columns) and returns a single Spark DF.
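A possible usage sketch against the RDD from this question (dropping the file path from each tuple first, since the method expects an RDD of bare pd.DataFrame objects):
# processed_excel_rdd holds (path, pandas_df) tuples; keep only the dataframes
pdf_rdd = processed_excel_rdd.map(lambda kv: kv[1])
spark_df = spark.createFromPandasDataframesRDD(pdf_rdd)  # method added by the gist's monkey-patch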
I solved this by writing a function like this:
from pyspark.sql import Row

def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]

    rows = list()
    for index, series in pd_df.iterrows():
        # Takes a row of the df, exports it as a dict, and then passes the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series-> dict-> Row operation, but this got me through.
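One possible shortcut along those lines (a sketch, not benchmarked here) is to let pandas build the per-row dicts in a single to_dict("records") call instead of iterating with iterrows:
from pyspark.sql import Row

def pd_df_to_rows(rdd_row):
    path, pd_df = rdd_row
    # to_dict("records") returns one dict per row, keyed by column name
    return [Row(**{str(k): v for k, v in rec.items()})
            for rec in pd_df.to_dict("records")]

processed_excel_rdd.flatMap(pd_df_to_rows).toDF()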
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]

sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]

sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
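The same loop can also be written as a fold with functools.reduce (purely a stylistic alternative; the result is identical):
from functools import reduce

# union all per-file Spark DataFrames into one
sdf = reduce(lambda a, b: a.union(b),
             (spark.createDataFrame(pd.read_excel(name)) for name in names))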

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is I have to aggregate the data into one format, and then dump it into HDF5. This is ~1 TB of data, so I naturally cannot fit it into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data to fit into one pandas dataframe, I would do this:
import pandas as pd
import csv

csv_columns = ["COL1", "COL2", "COL3", "COL4", ..., "COL55"]
readcsvfile = csv.reader(csvfile)

total_df = pd.DataFrame()  # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of field:value pairs, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])  # one line of tabular data
    total_df = pd.concat([total_df, df])  # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da

csv_columns = ["COL1", "COL2", "COL3", "COL4", ..., "COL55"]  # define columns
readcsvfile = csv.reader(csvfile)  # read in file, if csv

# somehow define an empty dask dataframe, total_df = dd.DataFrame()?
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of field:value pairs, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])  # one line of tabular data
    total_df = da.concatenate([total_df, df])  # creates one big dataframe
After creating a ~TB dataframe, I will save into hdf5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular if the concern is that your data is separated by tabs rather than commas this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
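For example (using a placeholder file name), a tab-delimited read with dask looks like this:
import dask.dataframe as dd

# 'myfile.tsv' is a placeholder; sep='\t' handles the tab delimiter
df = dd.read_csv('myfile.tsv', sep='\t')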
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have dask.dataframe try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
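Once written, the file can be read back lazily with the matching key (a quick usage note):
df2 = dd.read_hdf('myfile.hdf5', '/df')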