Pandas: Making multiple HTTP requests

I have the code below, which reads a number of ticker symbols from a csv file into a dataframe.
For each ticker it calls the web API, which returns a dataframe df that is then appended to the previous results until all tickers are done. The code works, but with a large number of tickers it slows down tremendously. I understand I can use multiprocessing or threads to speed it up, but I don't know where to start or which approach would be best suited to my particular case.
What code should I use to get my data into a combined dataframe in the fastest possible manner?
import pandas as pd
import numpy as np
import json

tickers = pd.read_csv("D:/verhuizen/pensioen/MULTI.csv", names=['symbol', 'company'])

# first request is only used to get the column layout for the empty frame
read_str = 'https://financialmodelingprep.com/api/v3/income-statement/AAPL?limit=120&apikey=demo'
df = pd.read_json(read_str)
df = pd.DataFrame(columns=df.columns)

for ind in range(len(tickers)):
    read_str = 'https://financialmodelingprep.com/api/v3/income-statement/' + tickers['symbol'][ind] + '?limit=120&apikey=demo'
    df1 = pd.read_json(read_str)
    df = pd.concat([df, df1], ignore_index=True)

df.set_index(['date', 'symbol'], inplace=True)
df.sort_index(inplace=True)
df.to_csv('D:/verhuizen/pensioen/MULTI_out.csv')
The code provided works fine for smaller data sets, but when I use a large number of tickers (>4,000) I get the error below at some point. Is this because the web API gets overloaded, or is there another problem?
Traceback (most recent call last):
File "D:/Verhuizen/Pensioen/Equity_Extractor_2021.py", line 43, in <module>
data = pool.starmap(download_data, enumerate(TICKERS, start=1))
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 276, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00C33E30>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
It keeps giving the same error (for a larger number of tickers). The code is exactly as provided:
def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        # do stuff here
        read_str = BASEURL.format(symbol)
        df.append(pd.read_json(read_str))
        # df.append(pd.read_json(fake_data(symbol)))
    return pd.concat(df, ignore_index=True)
It failed again with pool.map, but I noticed one strange thing: each time it fails, it does so around ticker 12,500 (the total is around 23,000 tickers). Similar error:
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 21, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x078D1BF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
I also get the tickers from an API call, https://financialmodelingprep.com/api/v3/financial-statement-symbol-lists?apikey=demo (I noticed it does not work without a subscription). I wanted to attach the data as a csv file, but I don't have sufficient rights, and I don't think it's a good idea to paste the returned data here...
I tried adding time.sleep(0.2) before the return as suggested, but again I get the same error at ticker 12,510. Strangely, it's around the same location every time. Since multiple processes are running, I cannot see at what point it breaks.
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 24, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00F32C90>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
Something very strange is going on: I have split the data into chunks of 10,000 / 5,000 / 4,000 and 2,000, and each time the code breaks approximately 100 tickers from the end. Clearly something is not right.
import time
import pandas as pd
import multiprocessing

# get tickers from your csv
df = pd.read_csv('D:/Verhuizen/Pensioen/All_Symbols.csv', header=None)
# setting the Dataframe to a list (in total 23,000 tickers)
df = df[0]
TICKERS = df.tolist()
# Select how many tickers I want
TICKERS = TICKERS[0:2000]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def download_data(symbol):
    print(symbol)
    # do stuff here
    read_str = BASEURL.format(symbol)
    df = pd.read_json(read_str)
    # time.sleep(0.2)
    return df

if __name__ == "__main__":
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(download_data, TICKERS)

    df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
    df.to_csv('D:/verhuizen/pensioen/Income_2000.csv')
In this particular example the code breaks at position 1,903
RPAI
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_testing.py", line 27, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0793EAF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'

The first optimization is to avoid concatenating your dataframe at each iteration.
You can try something like this:
url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

df = []
for symbol in tickers["symbol"]:
    read_str = url.format(symbol)
    df.append(pd.read_json(read_str))

df = pd.concat(df, ignore_index=True)
If that's not sufficient, we can look at async, threading or multiprocessing.
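Since the work here is I/O-bound (waiting on the API), threads are often enough before reaching for processes. A minimal sketch with concurrent.futures, reusing the URL template and the tickers dataframe from your question (the max_workers value is an assumption to tune):
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def fetch(symbol):
    # one request per symbol, returned as a dataframe
    return pd.read_json(url.format(symbol))

with ThreadPoolExecutor(max_workers=8) as executor:
    frames = list(executor.map(fetch, tickers["symbol"]))

df = pd.concat(frames, ignore_index=True)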
Edit:
The code below can do the job:
import pandas as pd
import numpy as np
import multiprocessing
import time
import random

PROCESSES = 4  # number of parallel processes
CHUNKS = 6  # one process handles n symbols

# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
           "XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
           "TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
           "AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
           "MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
           "CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]

# create a list of n sublists
TICKERS = [TICKERS[i:i + CHUNKS] for i in range(0, len(TICKERS), CHUNKS)]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def fake_data(symbol):
    dti = pd.date_range("1985", "2020", freq="Y")
    df = pd.DataFrame({"date": dti, "symbol": symbol,
                       "A": np.random.randint(0, 100, size=len(dti)),
                       "B": np.random.randint(0, 100, size=len(dti))})
    time.sleep(random.random())  # to simulate network delay
    return df.to_json()

def download_data(pool_id, symbols):
    df = []
    for symbol in symbols:
        print("[{:02}]: {}".format(pool_id, symbol))
        # do stuff here
        # read_str = BASEURL.format(symbol)
        # df.append(pd.read_json(read_str))
        df.append(pd.read_json(fake_data(symbol)))
    return pd.concat(df, ignore_index=True)

if __name__ == "__main__":
    with multiprocessing.Pool(PROCESSES) as pool:
        data = pool.starmap(download_data, enumerate(TICKERS, start=1))

    df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
In this example, I split the list of tickers into sublists so that each process retrieves data for multiple symbols, which limits the overhead of creating and destroying processes.
The delay simulates the response time of the network connection and highlights the multiprocessing behaviour.
Edit 2: simpler but naive version for your needs
import pandas as pd
import multiprocessing

# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
           "XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
           "TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
           "AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
           "MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
           "CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]

BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"

def download_data(symbol):
    print(symbol)
    # do stuff here
    read_str = BASEURL.format(symbol)
    df = pd.read_json(read_str)
    return df

if __name__ == "__main__":
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        data = pool.map(download_data, TICKERS)

    df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
Note about pool.map: it distributes the symbols in TICKERS across the pool's worker processes, and each worker calls download_data for the symbols it receives.
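A note on the "cannot serialize '_io.BufferedReader' object" failures: that message usually means a worker raised an exception (for example an HTTP error from the API) and the pool could not pickle the exception object back to the parent, so the real cause stays hidden. A hedged sketch that traps errors inside the worker and returns None instead (the retry count and sleep are assumptions):
import time

def download_data(symbol, retries=3):
    # catch errors inside the worker so the real cause is printed here,
    # rather than an unpicklable exception crashing pool.map in the parent
    for attempt in range(retries):
        try:
            return pd.read_json(BASEURL.format(symbol))
        except Exception as err:
            print("{}: {} (attempt {})".format(symbol, err, attempt + 1))
            time.sleep(1)
    return None

With this, drop the failures before concatenating, e.g. data = [d for d in data if d is not None].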

Related

using pandas.read_csv, how can one process all errors, receive all non-error data?

Data which, for me, generates an exception instead of invoking the 'on_bad_lines' handler is at:
https://opencalaccess.org/misc/NAMES_CD.TSV
I have this:
import os
import traceback

import pandas as pd

bad_lines = list()

def bad_line_finder(x):
    bad_lines.append(str(x))
    return None

for file in os.listdir(dir):
    bad_lines = list()
    try:
        for df in pd.read_csv(f"{dir}/{file}",
                              sep='\t',
                              on_bad_lines=bad_line_finder,
                              engine='python',
                              chunksize=1000):
            print(f"\n{target}")
            df.info()
            print(f"Bad Lines: {bad_lines}")
            bad_lines = list()
    except:
        print("EXCEPTION:")
        traceback.print_exc()
and this works great. There are errors in the files, and the handler deals with them so that I can keep track of them. Except, why do I still see this:
EXCEPTION:
Traceback (most recent call last):
File "/home/ray/Projects/opencalaccess-data/import.py", line 41, in <module>
for df in pd.read_csv(f"{dir}/{file}",
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
return self.get_chunk()
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
return self.read(nrows=size)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
content = self._get_lines(rows)
File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
new_rows.append(next(self.data))
_csv.Error: ' ' expected after '"'
What is the "on_bad_lines" option doing if it does not handle all of the bad lines? Which of them will it handle and which will it not?
This is a government data source. There are format errors in the data that cannot be corrected by the agency, because they constitute the official record. So, I must fix them myself. But which of them throw exceptions and which do not?
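As far as I can tell, the on_bad_lines callable is only consulted for rows that tokenize but have the wrong number of fields; errors raised by the tokenizer itself, like the quote mismatch above, still propagate as exceptions. One possible workaround, assuming the TSV never uses quoting deliberately, is to turn off quote handling so those lines reach the handler instead of raising; a sketch against the file from the question:
import csv

import pandas as pd

bad_lines = list()

def bad_line_finder(x):
    bad_lines.append(str(x))
    return None

# quoting=csv.QUOTE_NONE treats '"' as an ordinary character, so a stray quote
# can no longer abort tokenization (assumes the file never quotes fields)
for df in pd.read_csv("NAMES_CD.TSV", sep='\t', engine='python',
                      quoting=csv.QUOTE_NONE, on_bad_lines=bad_line_finder,
                      chunksize=1000):
    df.info()
    print(f"Bad Lines: {bad_lines}")
    bad_lines = list()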

pyspark RDDs strip attributes of numpy subclasses

I've been fighting an unexpected behavior when attempting to construct a subclass of numpy ndarray within a map call to a pyspark RDD. Specifically, the attribute that I added within the ndarray subclass appears to be stripped from the resulting RDD.
The following snippets contain the essence of the issue.
import numpy as np

class MyArray(np.ndarray):
    def __new__(cls, shape, extra=None, *args):
        obj = super().__new__(cls, shape, *args)
        obj.extra = extra
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.extra = getattr(obj, "extra", None)

def shape_to_array(shape):
    rval = MyArray(shape, extra=shape)
    rval[:] = np.arange(np.product(shape)).reshape(shape)
    return rval
If I invoke shape_to_array directly (not under pyspark), it behaves as expected:
x = shape_to_array((2,3,5))
print(x.extra)
outputs:
(2, 3, 5)
But, if I invoke shape_to_array via a map to an RDD of inputs, it goes wonky:
from pyspark.sql import SparkSession
sc = SparkSession.builder.appName("Steps").getOrCreate().sparkContext
rdd = sc.parallelize([(2,3,5),(2,4),(2,5)])
result = rdd.map(shape_to_array).cache()
print(result.map(lambda t:type(t)).collect())
print(result.map(lambda t:t.shape).collect())
print(result.map(lambda t:t.extra).collect())
Outputs:
[<class '__main__.MyArray'>, <class '__main__.MyArray'>, <class '__main__.MyArray'>]
[(2, 3, 5), (2, 4), (2, 5)]
22/10/15 15:48:02 ERROR Executor: Exception in task 7.0 in stage 2.0 (TID 23)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
process()
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/Cellar/apache-spark/3.3.0/libexec/python/lib/pyspark.zip/pyspark/util.py", line 81, in wrapper
return f(*args, **kwargs)
File "/var/folders/w7/42_p7mcd1y91_tjd0jzr8zbh0000gp/T/ipykernel_94831/2519313465.py", line 1, in <lambda>
AttributeError: 'MyArray' object has no attribute 'extra'
What happened to the extra attribute of the MyArray instances?
Thanks much for any/all suggestions
EDIT: A bit of additional info. If I add logging inside the shape_to_array function just before the return, I can verify that the extra attribute does exist on the MyArray object that is being returned. But when I attempt to access the MyArray elements in the RDD from the main driver, they're gone.
After a night of sleeping on this, I remembered that I have often had issues with pyspark RDDs where the error message had to do with the return type not working with pickle.
I wasn't getting that error message this time because numpy.ndarray does work with pickle. BUT... the __reduce__ and __setstate__ methods of numpy.ndarray know nothing of the added extra attribute on the MyArray subclass. This is where extra was being stripped.
Adding the following two methods to MyArray solved everything.
def __reduce__(self):
    # append the extra attribute to the state tuple returned by ndarray.__reduce__
    mthd, cls, args = super().__reduce__()
    return mthd, cls, args + (self.extra,)

def __setstate__(self, args):
    super().__setstate__(args[:-1])
    self.extra = args[-1]
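A quick way to check the fix is to round-trip an instance through pickle, which is the same serialization path pyspark uses for RDD elements:
import pickle

x = shape_to_array((2, 3, 5))
y = pickle.loads(pickle.dumps(x))
print(y.extra)  # (2, 3, 5) -- the attribute now survives (de)serialization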
Thank you to anyone who took some time to think about my question.

How to convert Pandas dataframe to PyArrow table with a union type in the schema?

I have a Pandas dataframe with a column that contains a list of dict/structs. One of the keys (thing in the example below) can have a value that is either an int or a string. Is there a way to define a PyArrow type that will allow this dataframe to be converted into a PyArrow table, for eventual output to a Parquet file?
I tried using pa.union for this, but I seem to be doing something not supported/implemented.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={"id": [1, 2], "dict": [{"thing": 1}, {"thing": "two"}]})

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("dict", pa.struct([
        ("thing", pa.union([
            pa.field("int64", pa.int64()),
            pa.field("string", pa.string()),
        ], "sparse"))
    ]))
])

t = pa.Table.from_pandas(df, schema=schema)
Result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 1394, in pyarrow.lib.Table.from_pandas
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in dataframe_to_arrays
arrays = [convert_column(c, f)
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 587, in <listcomp>
arrays = [convert_column(c, f)
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 574, in convert_column
raise e
File "/usr/local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 568, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('sparse_union', 'Conversion failed for column dict with type object')
The help text for pa.union doesn't give an example of how to use it.
>>> help(pa.union)
Help on built-in function union in module pyarrow.lib:
union(...)
union(children_fields, mode, type_codes=None)
Create UnionType from children fields.
A union is defined by an ordered sequence of types; each slot in the union
can have a value chosen from these types.
Parameters
----------
fields : sequence of Field values
Each field must have a UTF8-encoded name, and these field names are
part of the type metadata.
mode : str
Either 'dense' or 'sparse'.
type_codes : list of integers, default None
Returns
-------
type : DataType
It looks like it's not implemented yet in pyarrow 2.0.0:
import pandas as pd
import pyarrow as pa

union = pa.union([
    pa.field("int64", pa.int64()),
    pa.field("string", pa.string()),
], 'sparse')

pa.array([1, 'two'], union)
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
<ipython-input-72-f7ec6792b124> in <module>
10 ], 'sparse')
11
---> 12 pa.array([1, 'two'], union)
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/nix/store/aagq4nyc9m4ikjda1mykgv125v792zk7-python3-3.7.7-env/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowNotImplementedError: sparse_union
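Until union conversion is supported, one possible workaround (assuming it is acceptable to lose the native int type) is to normalize the mixed-type field to a single type, such as string, before building the table:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={"id": [1, 2], "dict": [{"thing": 1}, {"thing": "two"}]})
# cast the mixed int/str values to str so the struct field has one type
df["dict"] = df["dict"].apply(lambda d: {**d, "thing": str(d["thing"])})

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("dict", pa.struct([("thing", pa.string())])),
])

t = pa.Table.from_pandas(df, schema=schema)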
PyArrow has a built-in method, .from_pandas():
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'int': [1, 2],
    'str': ['a', 'b']
})

pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>

concatenate results after multiprocessing

I have a function that creates a data frame by doing multiprocessing on a df.
Suppose I have 10 rows in my df; the function processor will then process all 10 rows separately. What I want is to concatenate all the outputs of processor into one data frame.
import sys
import multiprocessing
from multiprocessing import Pool

import numpy as np
import pandas as pd

def processor(dff):
    """
    reading data from a data frame and doing all sorts of data manipulation
    for multiprocessing
    """
    return dff  # placeholder: return the processed chunk

def main(infile, mdebug):
    global debug
    debug = mdebug
    try:
        lines = sum(1 for line in open(infile))
    except Exception as err:
        print("Error {} opening file: {}".format(err, infile))
        sys.exit(2000)
    if debug >= 2:
        print(infile)
    try:
        dff = pd.read_csv(infile)
    except Exception as err:
        print("Error {}, opening file: {}".format(err, infile))
        sys.exit(2000)
    df_split = np.array_split(dff, (lines + 1))
    cores = multiprocessing.cpu_count()
    cores = 64
    # pool = Pool(cores)
    pool = Pool(lines - 1)
    for n, frame in enumerate(pool.imap(processor, df_split), start=1):
        if frame is not None:
            frame.to_csv('{}'.format(n))
    pool.close()
    pool.join()

if __name__ == "__main__":
    args = parse_args()
    """
    print "Debug is: {}".format(args.debug)
    """
    if args.debug >= 1:
        print("Running in debug mode: {}".format(args.debug))
    main(infile=args.infile, mdebug=args.debug)
You can use either the DataFrame constructor or concat to solve your problem; the appropriate one depends on details of your code that you haven't included.
Here's a more complete example:
import numpy as np
import pandas as pd
from multiprocessing import Pool

# create dummy dataset
dff = pd.DataFrame(np.random.rand(101, 5), columns=list('abcde'))

# process data
with Pool() as pool:
    result = pool.map(processor, np.array_split(dff, 7))

# put it all back together in one dataframe
result = pd.concat(result)
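Applied to the main() from your question, that might look like the sketch below, collecting the chunks before writing a single file instead of one CSV per chunk (the output path is a placeholder):
with Pool(cores) as pool:
    frames = [f for f in pool.imap(processor, df_split) if f is not None]

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("combined_output.csv", index=False)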

Getting a sum by grouping on another column

I have a dataframe with the following columns:
Occupation, Genre, Rating
I have taken the sum of all ratings as totalRating. Now I want to create a new column w_rating which takes (rating > 3)/totalRating for a particular Occupation, Genre combination. My dataframe is joinedDF3, so I am writing the query below:
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).withColumn(wa_rating, sum(Rating>3)/totalRating).collect()
but it is showing this error:
AttributeError: 'GroupedData' object has no attribute 'withColumn'
So it is clear from the error that we cannot use withColumn with groupby.
So my question is: how do I do it?
Below is my updated code.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField,StructType,IntegerType,StringType)
from pyspark.sql import Row
from pyspark.sql.functions import sum
import pyspark.sql.functions as F
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Movielens Analysis").getOrCreate()
def refineMovieDF(row):
    genre = []
    movieData = row[0].split("|")
    for i in range(len(movieData) - 5):
        if int(movieData[i + 5]) == 1:
            genre.append((int(movieData[0]), i))
    return genre
ratingSchema =StructType(fields=[StructField("UserId",IntegerType(),True),StructField("MovieId",IntegerType(),True),StructField("Rating",IntegerType(),True),StructField("TimeStamp",IntegerType(),True)])
ratingsDF = spark.read.load("ml-100k/u.data", format="csv",sep="\t", inferSchema=True, header=False,schema=ratingSchema)
genreSchema =StructType(fields=[StructField("Genre",StringType(),True),StructField("GenreId",IntegerType(),True)])
genreDF = spark.read.load("ml-100k/u.genre",format="csv",sep="|",inferSchema=True, header=False,schema=genreSchema)
userSchema =StructType(fields=[StructField("UserId",IntegerType(),True),StructField("Age",IntegerType(),True),StructField("Gender",StringType(),True),StructField("Occupation",StringType(),True),StructField("ZipCode",IntegerType(),True)])
usersDF = spark.read.load("ml-100k/u.user",format="csv",sep="|",inferSchema=True, header=False,schema=userSchema)
movieSchema =StructType(fields=[StructField("MovieRow",StringType(),True)])
movieDF = spark.read.load("ml-100k/u.item",format="csv",inferSchema=True, header=False,schema=movieSchema)
movieRefinedRDD = movieDF.rdd.flatMap(refineMovieDF)
movieSchema =StructType(fields=[StructField("MovieId",IntegerType(),True),StructField("GenreId",IntegerType(),True)])
movieRefinedDf = spark.createDataFrame(movieRefinedRDD, movieSchema)
joinedDF1 = ratingsDF.join(usersDF,ratingsDF.UserId==usersDF.UserId).select(usersDF["Occupation"],ratingsDF["Rating"],ratingsDF["MovieId"])
joinedDF3 = joinedDF1.join(joinedDF2,joinedDF1.MovieId == joinedDF2.MovieId).select(joinedDF1["Occupation"],joinedDF1["Rating"],joinedDF2["Genre"])
totalRating = joinedDF3.groupBy().sum("Rating").collect()
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).agg((sum(joinedDF3["Rating"]>3)/totalRating).alias(wa_rating)).collect()
print(resultDF)
Now I am getting the error below.
2019-08-06 22:24:20 INFO BlockManagerInfo:54 - Removed broadcast_11_piece0 on 10.0.2.15:58903 in memory (size: 4.3 KB, free: 413.8 MB)
Traceback (most recent call last):
File "/home/cloudera/workspace/MovielensAnalysis.py", line 59, in <module>
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).agg((sum(joinedDF3["Rating"]>3)/totalRating).alias(wa_rating)).collect()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 116, in _
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o129.divide.: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[572536]]
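For what it's worth, the "Unsupported literal type class java.util.ArrayList" error comes from dividing by totalRating, which is a list of Rows returned by collect() rather than a number. A hedged sketch of one way to express the aggregation (counting ratings above 3 here; adjust if you meant the sum of those ratings):
from pyspark.sql import functions as F

# extract the scalar grand total instead of passing the collected Row list around
total_rating = joinedDF3.groupBy().sum("Rating").collect()[0][0]

resultDF = (joinedDF3
            .groupBy("Occupation", "Genre")
            .agg((F.sum(F.when(F.col("Rating") > 3, 1).otherwise(0)) / F.lit(total_rating))
                 .alias("wa_rating"))
            .collect())
print(resultDF)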