I have a function that returns a pyspark.pandas object (and I cannot change it to return a plain pandas object). When I try to unit test this function, it only returns a null object.
I see that pytest could be used as an option, but if possible I would like to stick with the unittest framework. Is there any way to test pyspark.pandas code with unittest?
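For what it's worth, here is a minimal sketch of what such a test might look like, assuming a hypothetical my_module.build_frame function that returns a pyspark.pandas DataFrame; the SparkSession settings and the expected values are placeholders:
import unittest
import pandas as pd
from pyspark.sql import SparkSession
from my_module import build_frame  # hypothetical function under test

class BuildFrameTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # pyspark.pandas needs an active SparkSession before frames are created
        cls.spark = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_build_frame(self):
        result = build_frame()  # returns a pyspark.pandas DataFrame
        # convert to plain pandas only for the assertion
        pd.testing.assert_frame_equal(
            result.to_pandas(),
            pd.DataFrame({"a": [1, 2]}),  # placeholder expected output
        )

if __name__ == "__main__":
    unittest.main()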
I first substitute Modin for pandas to get the benefit of distributing work over multiple cores:
import modin.pandas as pd
from modin.config import Engine
Engine.put("dask")
After initializing my dataframe, I attempt to use:
df['bins'] = pd.cut(df[column],300)
I get this error:
TypeError: ('Could not serialize object of type function.', '<function PandasDataframe._build_mapreduce_func.<locals>._map_reduce_func at 0x7fbe78580680>')
I would be glad to get help. I can't seem to get Modin to work out of the box the way I expected.
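As a possible workaround sketch only (untested here, and assuming the single column fits comfortably in memory on one core): bin the materialized column with plain pandas and assign the result back to the Modin frame.
import pandas  # plain pandas, used only for the unsupported cut step
# materialize the column, bin it with plain pandas, then assign back to the Modin frame
bins = pandas.cut(df[column].to_numpy(), 300)
df['bins'] = bins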
I have a partition column in my Hive-style partitioned parquet dataset (written by PyArrow from Pandas Dataframe) with an entry like "TYPE=3860877578". When trying to read from this dataset, I get an error:
ArrowInvalid: error parsing '3760212050' as scalar of type int32
This is the first partition key that won't fit into an int32 (i.e. there are smaller integer values in other partitions - I think the inference must be done on the first one encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the pandas.read_parquet() interface, passing filters, columns, etc. down to PyArrow. I think I will need to use the PyArrow APIs directly, but I don't know where to start.
How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?
Example of dataset partition values that causes this problem:
/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet
Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")
The issue can't be avoided by filtering out the problematic value, because all partition values appear to be parsed before filters are applied, i.e.:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])
Since my original post, I have figured out that I can do what I want with the PyArrow API directly, like this:
from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset
import pyarrow as pa
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))
df = ParquetDataset("mydataset",
                    filters=filters,
                    partitioning=partitioning,
                    use_legacy_dataset=False).read_pandas().to_pandas()
I'd like to be able to pass that information down through the pandas read_parquet() interface, but it doesn't appear to be possible at this time.
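For reference, roughly the same thing can be expressed with the pyarrow.dataset API (PyArrow 1.x); this is a sketch only, and the filter value shown is just an example:
import pyarrow as pa
import pyarrow.dataset as ds

# declare the partition field as int64 instead of letting PyArrow infer int32
partitioning = ds.partitioning(pa.schema([("TYPE", pa.int64())]), flavor="hive")
dataset = ds.dataset("mydataset", format="parquet", partitioning=partitioning)
# example filter; adjust or drop as needed
table = dataset.to_table(filter=ds.field("TYPE") == 3760212050)
df = table.to_pandas()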
I have very large datasets in spark dataframes that are distributed across the nodes.
I can do simple statistics like mean, stdev, skewness, kurtosis etc using the spark libraries pyspark.sql.functions .
If I want to use advanced statistical tests like Jarque-Bera (JB) or Shapiro-Wilk(SW) etc, I use the python libraries like scipy since the standard apache pyspark libraries don't have them. But in order to do that, I have to convert the spark dataframe to pandas, which means forcing the data into the master node like so:
import scipy.stats as stats
pandas_df=spark_df.toPandas()
JBtest=stats.jarque_bera(pandas_df)
SWtest=stats.shapiro(pandas_df)
I have multiple features, and each feature ID corresponds to a dataset on which I want to perform the test statistic.
My question is:
Is there a way to apply these pythonic functions on a spark dataframe while the data is still distributed across the nodes, or do I need to create my own JB/SW test statistic functions in spark?
Thank you for any valuable insight
You should be able to define a vectorized user-defined function that wraps the Pandas function (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), like this:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats
@pandas_udf('double', PandasUDFType.SCALAR)
def vector_jarque_bera(x):
    return stats.jarque_bera(x)
# test:
spark_df.withColumn('y', vector_jarque_bera(spark_df['x']))
Note that the vectorized function takes a column as its argument and returns a column.
(N.B. The @pandas_udf decorator is what transforms the Pandas function defined right below it into a vectorized function. Each element of the returned vector is itself a scalar, which is why the argument PandasUDFType.SCALAR is passed.)
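Since jarque_bera collapses a whole series into a single statistic, a grouped-aggregate Pandas UDF may fit the per-feature-ID use case more naturally. The following is only a sketch, and the feature_id and x column names are assumptions:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def jb_statistic(x):
    # stats.jarque_bera returns (statistic, p-value); keep only the statistic
    return float(stats.jarque_bera(x)[0])

# one Jarque-Bera statistic per feature, computed on the workers
spark_df.groupBy('feature_id').agg(jb_statistic(spark_df['x']).alias('jb_stat'))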
Does anyone have an idea how to run a pandas program on a Spark standalone cluster machine (Windows)? The program was developed using PyCharm and pandas.
The issue is that I can run it from the command prompt with spark-submit --master spark://sparkcas1:7077 project.py and get results, but I don't see any activity in the Spark web UI at :7077: no workers, no Running Applications, and no Completed Applications.
In the pandas program I included only one Spark-related statement, "from pyspark import SparkContext":
import pandas as pd
from pyspark import SparkContext
# read an Excel workbook from a local path
workbook_loc = "c:\\2020\\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
print(df)
What could be the issue?
Pandas code runs only on the driver; no workers are involved, so there is no point in running plain pandas code through Spark.
If you are using Spark 3.0, you can run your pandas code in a distributed way by converting the Spark DataFrame to a Koalas DataFrame.
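A minimal sketch of that conversion, assuming the databricks.koalas package is installed and spark_df is an existing Spark DataFrame:
import databricks.koalas as ks

# convert the Spark DataFrame to a Koalas DataFrame (pandas-like API, but distributed)
kdf = spark_df.to_koalas()
# pandas-style operations now execute on the cluster instead of collecting to the driver
print(kdf.head())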
I have a function that converts non-numerical data in a dataframe to numerical data.
import numpy as np
import pandas as pd
from concurrent import futures
def convert_to_num(df):
    # do stuff
    return df
I am wanting to use the futures library to speed up this task. This is how I am using the library:
with futures.ThreadPoolExecutor() as executor:
    df_test = executor.map(convert_to_num, df_sample)
First, I do not see the variable df_test being created, and second, when I inspect df_test I get this message:
<generator object Executor.map.<locals>.result_iterator at >
What am I doing wrong that prevents me from using the futures library? Can I only use this library to iterate values into a function, as opposed to passing an entire dataframe to be edited?
The map method for the executor object, as per the documentation, takes the following arguments,
map(func, *iterables, timeout=None, chunksize=1)
In your example you only provide a single DataFrame (df_sample), but you could provide a list of DataFrames, which is passed in as the iterables parameter. (Note that passing a single DataFrame means map iterates over its column labels, so convert_to_num would be called once per column name rather than on the frame itself.)
For example,
Let us create a list of dataframes,
import concurrent.futures
import pandas as pd
df_samples = [pd.DataFrame({f"col{j}{i}": [j,i] for i in range(1,5)}) for j in range(1,5)]
which produces a list of four small DataFrames (shown as df_samples in the original post).
And now we add a function which will add an additional column to a df,
def add_x_column(df):
    df['col_x'] = ['a', 'b']
    return df
Now use the ThreadPoolExecutor to apply this function to the df_samples list concurrently. You also need to convert the generator object returned by map to a list in order to access the changed DataFrames:
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(add_x_column, df_samples))
results is then the list of resulting DataFrames, each with the extra col_x column (shown as df_results in the original post).
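If the goal is to parallelize a single large DataFrame rather than a list of them, one possible sketch (assuming convert_to_num works correctly on a row subset and df_sample is the original frame) is to split it into chunks, map over the chunks, and concatenate:
import numpy as np
import pandas as pd
import concurrent.futures

# split the frame into row chunks, convert each chunk in a thread, then reassemble
chunks = np.array_split(df_sample, 4)
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    converted = list(executor.map(convert_to_num, chunks))
df_test = pd.concat(converted)
Note that for CPU-bound conversion work a ThreadPoolExecutor may not give much speedup because of the GIL; ProcessPoolExecutor is the usual alternative, at the cost of pickling each chunk.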