Slice Spark’s DataFrame SQL by row (pyspark)

I have a Parquet file that can be read into a Spark DataFrame as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000], etc. on a pandas DataFrame), since I want to convert each small chunk to a pandas dataframe and work on it later. The only thing I know how to do is sample a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.

You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you've processed all rows. It's not the ideal way to do things, but it works. Also consider filtering your Spark DataFrame down before converting chunks to pandas.
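Putting the pieces together, here is a minimal sketch of that rinse-and-repeat loop (my sketch, not from the original answer; it assumes each chunk fits in driver memory, and process_chunk is a hypothetical placeholder for your pandas work):
import pyspark.sql.functions as f

chunk_size = 4000
df = df.withColumn('id', f.monotonically_increasing_id())

remaining = df
while remaining.count() > 0:
    # Take the next chunk_size rows in id order and hand them to pandas
    working_set = remaining.sort('id').limit(chunk_size)
    df_pandas = working_set.toPandas()
    # process_chunk(df_pandas)  # hypothetical per-chunk processing
    remaining = remaining.subtract(working_set)
The per-iteration count() and subtract() are not cheap, so this is only practical for a moderate number of chunks.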

Related

Trying to drop values by column (I convert these values to NaN, but they could be anything), not working

I'm trying to drop NAs by column in Dask, given a certain threshold, and I receive the error below. This should be working. Please advise.
Reproducible example:
import pandas as pd
import numpy as np
import dask.dataframe as dd

# Create the pandas DataFrame
data = [['tom', 10], ['nick', 15], ['juli', 5]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df = df.replace(5, np.nan)

ddf = dd.from_pandas(df, npartitions=2)
ddf.dropna(axis='columns')
Passing axis is not supported for Dask DataFrames as of now. You can also print the function's docstring via ddf.dropna? and it will tell you the same:
Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.
This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. versionchanged:: 1.0.0

       Pass tuple or list to drop on multiple axes.
       Only a single axis is allowed.

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
    If True, do operation inplace and return None.

Returns
-------
DataFrame or None
    DataFrame with NA entries dropped from it or None if ``inplace=True``.
It's worth noting that the Dask documentation is copied from pandas in many instances like this one. But wherever it does that, it specifically states:
This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.
Therefore it's always best to check the docstring of Dask's pandas-derived functions instead of relying on the documentation alone.
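Outside IPython, where the ? syntax isn't available, the standard help()/__doc__ machinery shows the same Dask-specific signature and notes:
import dask.dataframe as dd

help(dd.DataFrame.dropna)                 # prints the Dask signature and docstring
# or: print(dd.DataFrame.dropna.__doc__)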
The reason this isn't supported in Dask is that it requires computing the entire dataframe for Dask to know the shape of the result. This is very different from the row-wise case, where the number of columns and partitions won't change, so the operation can be scheduled without doing any work.
Dask deliberately leaves out some parts of the pandas API that look like normal pandas operations but in reality can't be scheduled without triggering a compute on the current frame. You're running into this by design: while .dropna(axis=0) works just fine as a scheduled operation, .dropna(axis=1) has very different implications.
You can do this manually with the following:
ddf[ddf.columns[~ddf.isna().any(axis=0)]]
but the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the whole dataframe. It probably makes sense to persist prior to running this if you can fit the dataframe in your cluster's memory.
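For example, a minimal sketch of the persist-then-filter pattern (my sketch; it assumes the data fits in cluster memory and a scheduler/client is already set up):
ddf = ddf.persist()                          # materialize the frame in cluster memory first
has_na = ddf.isna().any(axis=0).compute()    # pandas Series: column name -> has missing values?
ddf_clean = ddf[[c for c in ddf.columns if not has_na[c]]]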

Save output in CSV without losing previous data on that CSV in pandas dataframe

I'm doing sentiment analysis of Twitter data. For this work I've made several datasets in CSV format, with a different month in each dataset. After preprocessing each dataset individually, I want to save them all into one single CSV file. But when I write the code below using a pandas dataframe:
df.to_csv('dataset.csv', index=False)
it removes the previous data (rows) from that file. Is there any way I can keep the previous data too, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes like df1 and df2. Then you can merge them:
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
In Python you can use the built-in open("file") function with the mode 'a':
open("file", 'a')
The 'a' means "append", so new lines will be added at the end of your file.
You can use the same mode with the pandas DataFrame.to_csv() method.
e.g.:
import pandas as pd
# ... code where you get the data and build df ...
df.to_csv("file", mode='a')
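One caveat worth adding (my note, not from the original answer): in append mode every to_csv() call writes a header row by default, so pass header=False on subsequent writes. A minimal sketch:
import os
import pandas as pd

df = pd.DataFrame({'text': ['...'], 'sentiment': [0]})    # hypothetical monthly data
write_header = not os.path.exists('dataset.csv')           # write the header only the first time
df.to_csv('dataset.csv', mode='a', header=write_header, index=False)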
@thehand0: Your code works, but it's inefficient, so your script will take longer to run.

Importing Numbers using DataFrame

I'm trying to import numbers from an xlsx file using a pandas DataFrame, but I'm getting the numbers in a slightly different format.
Let's say the number is: 9582*****4
The number I get using this code is 9582*****4.0
import pandas as pd

df = pd.read_excel("Contacts.xlsx")
for i in range(len(df)):
    print(df.iloc[i, 0])
It was working just fine till last night.
I guess you need to change the data type from float to int:
df = pd.read_excel("Contacts.xlsx")
df = df.astype(int)                           # for all columns
# or: df = df.astype({"Column_name": int})    # for a specific column
for i in range(len(df)):
    print(df.iloc[i, 0])
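A likely cause worth noting (my addition, not part of the original answer): pandas upcasts an integer column to float when it contains missing values, which is where the trailing .0 comes from. In that case astype(int) will fail on the NaNs, and pandas' nullable Int64 dtype is the safer choice:
import pandas as pd

df = pd.read_excel("Contacts.xlsx")
# "Int64" (capital I) is the nullable integer dtype, so empty cells stay as <NA>
df["Phone"] = df["Phone"].astype("Int64")     # "Phone" is a hypothetical column name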

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data formatted so that each row is a single instance of ID/time/data. This means the rows don't correspond one-to-one with the IDs; each ID has many rows across time.
I am trying to use dask.delayed to run a function on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time, since they don't affect each other). To do this, I first loop through each ID tag and pull/locate all the data for that ID (with .loc in pandas, so it is a separate "mini" df), then I delay the function call on the mini df, add a column with the delayed values, and append it to a list of all mini dfs. At the end of the for loop I want to call dask.compute() on all the mini dfs at once, but for some reason the mini dfs' values are still delayed. Below I will post some pseudocode about what I just tried to explain.
I have a feeling that this may not be the best way to go about this, but it's what made sense at the time, and I can't understand what's wrong, so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
    curr_df = big_df.loc[big_df['id'] == id]
    curr_df['new value 1'] = dask.delayed(myfunc)(args1)
    curr_df['new value 2'] = dask.delayed(myfunc)(args2)  # same func as previous line
    list_of_mini_dfs.append(curr_df)

list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
# Concat all mini dfs into a new big df.
As you can see by the code I have to reach into my big/overall dataframe to pull out each ID's sequence of data since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working: when I concat all the mini dataframes back together, the two values I delayed are still delayed, which leads me to think it is due to the way I am delaying a function within a df and then trying to compute the list of dataframes. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np

# generate big_df
dates = pd.date_range(start='2019-01-01', end='2020-01-01')
l = len(dates)

out = []
for i in range(1000):
    df = pd.DataFrame({"ID": [i] * l,
                       "date": dates,
                       "data0": np.random.randn(l),
                       "data1": np.random.randn(l)})
    out.append(df)

big_df = pd.concat(out, ignore_index=True)\
           .sample(frac=1)\
           .reset_index(drop=True)
Now you want to apply your function fun to the columns data0 and data1.
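The answer leaves fun unspecified; for the sake of a runnable sketch, here is a hypothetical group-wise function (anything that reduces each ID's data0/data1 values to one row per ID would work the same way):
def fun(x):
    # x is the per-ID sub-frame holding the data0 and data1 columns
    return x.mean()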
Pandas
out = big_df.groupby("ID")[["data0", "data1"]]\
            .apply(fun)\
            .reset_index()

df_pd = pd.merge(big_df, out, how="left", on="ID")
Dask
df = dd.from_pandas(big_df, npartitions=4)

out = df.groupby("ID")[["data0", "data1"]]\
        .apply(fun, meta={'data0': 'f8',
                          'data1': 'f8'})\
        .rename(columns={'data0': 'new_values0',
                         'data1': 'new_values1'})\
        .compute()  # Here you need to compute, otherwise you'll get NaNs

df_dask = dd.merge(df, out,
                   how="left",
                   left_on=["ID"],
                   right_index=True)
The Dask version is not necessarily faster than the pandas one, in particular if your df fits in RAM.

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)

def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])

processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DF in a very efficient manner.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches Spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (it assumes they all have the same columns) and returns a single Spark DF.
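A hypothetical usage sketch, assuming the gist has been executed so that the method is attached to your Spark session as the answer describes (it is not part of stock PySpark):
# drop the file-path key so the RDD contains only pandas DataFrames
pandas_df_rdd = processed_excel_rdd.map(lambda kv: kv[1])
spark_df = spark.createFromPandasDataframesRDD(pandas_df_rdd)  # method added by the gist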
I solved this by writing a function like this:
from pyspark.sql import Row

def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]

    rows = list()
    for index, series in pd_df.iterrows():
        # Take a row of the df, export it as a dict, and pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series -> dict -> Row operation, but this got me through.
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]

sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]

sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
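As a more compact variant of the same loop (my sketch, not part of the original answer), you can fold the unions with functools.reduce:
from functools import reduce
import pandas as pd

spark_dfs = [spark.createDataFrame(pd.read_excel(name)) for name in names]
sdf = reduce(lambda left, right: left.union(right), spark_dfs)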