Not able to save large spark dataframe as pickle - pandas

I have large dataframe (little more than 20G), trying to save that as pickle object to be later used in the another process.
I have tried different configuration, below are the latest one.
executor_cores=4
executor_memory='20g'
driver_memory='40g'
deploy_mode='client'
max_executors_dynamic='spark.dynamicAllocation.maxExecutors=400'
num_executors_static=300
spark_driver_memoryOverhead='5g'
spark_executor_memoryOverhead='2g'
spark_driver_maxResultSize='8g'
spark_kryoserializer_buffer_max='1g'
Note:- I cannot increase spark_driver_maxResultSize more than 8G.
I have also tried saving dataframe as hdfs files and then tried to save it as pickel but getting same error messsage as earlier.
My understanding is, when we use pandas.pickle it brings all the data into one driver and then create pickle object. As data size is more than driver_max_result_size code is failing. (Code has worked earlier for 2G data).
Do we have any worksround to solve this problem?
big_data_frame.toPandas().to_pickle('{}/result_file_01.pickle'.format(result_dir))
big_data_frame.write.save('{}/result_file_01.pickle'.format(result_dir), format='parquet', mode='append')
df_to_pickel=sqlContext.read.format('parquet').load(file_path)
df_to_pickel.toPandas().to_pickle('{}/scoring__{}.pickle'.format(afs_dir, rd.strftime('%Y%m%d')))
Error message
Py4JJavaError: An error occurred while calling o1638.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 955 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

Saving as pickle file is an RDD function in Spark, not dataframe. To save your frame using pickle, run
big_data_frame.rdd.saveAsPickleFile(filename)
If you are working with big data, it is never a good idea to run either collect or toPandas in spark as it collects everything in memory, crashing the system. I would suggest you to use parquet or any other format for saving your data as RDD functions are in maintenance mode, which means spark is not introducing any new features to it rapidly.
To read the file, try
pickle_rdd = sc.pickleFile(filename).collect()
df = spark.createDataFrame(pickle_rdd)

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application, that reads something like 70K records from the DB to a data frame, each record has 2 fields.
After reading the data from the DB, we make minor mapping and load this as a broadcast for later usage.
Now, in local environment, there is an exception, timeout from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR
org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at
org.apache.spark.network.client.
TransportClientFactory.createClient(Transpor .tClientFactory.java:253)
at
org.apache.spark.network.client.
TransportClientFactory.createClient(TransportClientFactory.java:195)
at
org.apache.spark.network.netty.
NettyBlockTransferService$$anon$2.
createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the spark session with local "spark.master"
When I limit the max of records to 20K, it works well.
Can you please help? maybe I need to configure something in my local environment in order that the original code will work properly?
Update:
I tried to change a lot of Spark-related configurations in my local environment, both memory, a number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records, while trying to repartition with 2 or more partitions the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in the spark that can solve this instead of repartition?
Thanks!

Flink s3 read error: Data read has a different length than the expected

Using flink 1.7.0, but also seen on flink 1.8.0. We are getting frequent but somewhat random errors when reading gzipped objects from S3 through the flink .readFile source:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Data read has a different length than the expected: dataLength=9713156; expectedLength=9770429; includeSkipped=true; in.getClass()=class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client$2; markedSupported=false; marked=0; resetSinceLastMarked=false; markCount=0; resetCount=0
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:151)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:93)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:76)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.closeStream(S3AInputStream.java:529)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:490)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.flink.fs.s3.common.hadoop.HadoopDataInputStream.close(HadoopDataInputStream.java:89)
at java.util.zip.InflaterInputStream.close(InflaterInputStream.java:227)
at java.util.zip.GZIPInputStream.close(GZIPInputStream.java:136)
at org.apache.flink.api.common.io.InputStreamFSInputWrapper.close(InputStreamFSInputWrapper.java:46)
at org.apache.flink.api.common.io.FileInputFormat.close(FileInputFormat.java:861)
at org.apache.flink.api.common.io.DelimitedInputFormat.close(DelimitedInputFormat.java:536)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$SplitReader.run(ContinuousFileReaderOperator.java:336)
ys
Within a given job, we generally see many / most of the jobs read successfully, but there's pretty much always at least one failure (say out of 50 files).
It seems this error is actually originating from the AWS client, so perhaps flink has nothing to do with it, but I'm hopeful someone might have an insight as to how to make this work reliably.
When the error occurs, it ends up killing the source and canceling all the connected operators. I'm still new to flink, but I would think that this is something that could be recoverable from a previous snapshot? Should I expect that flink will retry reading the file when this kind of exception occurs?
Maybe you can try to add more connection for s3a like
flink:
...
config: |
fs.s3a.connection.maximum: 320

Memory error when running medium sized merge function ipython notebook jupyter

I'm trying to merge around 100 dataframes with a for loop and am getting a memory error. I'm using ipython jupyter notebook
Here is a sample of the data:
timestamp Namecoin_cap
0 2013-04-28 5969081
1 2013-04-29 7006114
2 2013-04-30 7049003
Each frame is around 1000 lines long
Here's the error in detail, I've also include my merge function.
My system is currently using up 64% of it memory
I have searched for similar issues but it seems most are for very large arrays >1GB, my data is relatively small in comparison.
EDIT: Something is suspicious. I wrote a beta program before, this was to test with 4 dataframes, i just exported that through pickle and it is 500kb. Now when i try to export the 100 frames one I get a memory error. It does however export a file that is 2GB. So i suspect somewhere down the line my code has created some kind of loop, creating a very large file. NB the 100 frames are stored in a dictionary
EDIT2: I have exported the scrypt to .py
http://pastebin.com/GqaHr7xc
This is a .xlsx that cointains asset names the script needs
The script fetches data regarding various assets, then cleans it up and saves each asset to a data frame in a dictionary
I'd be really appreciative if someone could have a look and see if there's anything immediately wrong. Other wise please advise on what tests I can run.
EDIT3: I'm finding it really hard to understand why this is happening, the code worked fine in the beta, all i have done now is add more assets.
EDIT4: I ran I size check on the object (dict of dfs) and it is 1,066,793 bytes
EDIT5: The problem is in the merge function for coin 37
for coin in coins[:37]:
data2['merged'] = pd.merge(left=data2['merged'],right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
This is when the error occurs. for coin in coins[:36]:' doesn't produce an error howeverfor coin in coins[:37]:' produces the error, any ideas ?
EDIT6: the 36th element is 'Syscoin', i did coins.remove('Syscoin') however the memory problem still occurs. So it seems to be a problem with the 36th element in coins no matter what the coin is
EDIT7: goCards suggestions seemed to work however the next part of the code:
merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp',axis=1).sum(axis=1)
Produces a memory error. I'm stumped
In regard to storage, I would recommend using a simple csv over pickle. Csv is a more generic format. It is human readable,and you can check your data quality easier especially as your data grows.
file_template_string='%s.csv'
for eachKey in dfDict:
filename = file_template_string%(eachKey)
dfDict[eachKey].to_csv(filename)
If you need to date the files you can also put a timestamp in the filename.
import time
from datetime import datetime
cur = time.time()
cur = datetime.fromtimestamp(cur)
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))
There are some obvious errors in your code.
for coin in coins: #line 61,89
for coin in data: #should be
df = data2['Namecoin'] #line 87
keys = data2.keys()
keys.remove('Namecoin')
for coin in keys:
df = pd.merge(left=df,right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
Same issue happened to me!
"MemoryError:" by notebook on execution of pandas. I have also screen printed quite lot of observations before issued happened.
Reinstalling Anaconda didn't help. Later realized that i was working with IPython notebook instead Jupyter notebook. Switched to Jupyter notebook. Everything worked fine!

File: 0: Unexpected from Google BigQuery load job

I've a compressed json file (900MB, newline delimited) and load into a new table via bq command and get the load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with the --max_bad_records, still not useful error message
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
And also cannot find any useful message in the console.
To BigQuery team, can you have a look using the job ID?
As far I know there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second is important as you could have errors in it, but the actual job might succeed.
Also you can set the --max_bad_records=3 on the BQ tool. Check here for more params https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that is for each line, so you should try a sample set from this big file first.
Also there is an open feature request to improve the error message, you can star (vote) this ticket https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing that: We need an endpoint where we can query based on a jobid, the state, or the stream of errors. It would help a lot to get a full list of errors, it would help debugging the BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.

Bigquery : Unexpected. Please try again when loading a 53GB CSV/ 1.4GB gZIP

I was trying to load 1.4Gb gZIP data in to my BigQuery table and i am getting the error Unexpected. Please try again consistently
job_7f1aa8d29ae641459c82243530eb1c65
I was trying to load a structure Row ID,Order Priority,Discount,Unit Price,Shipping Cost,Customer ID,Customer Name,Ship Mode,Product Category,Product Sub-Category,Product Base Margin,Region,State or Province,City,Postal Code,Order Date,Ship Date,Profit,Quantity ordered new,Sales,Order ID
the error is not clear on whats going wrong.
anyone else encountered this error?
Thanks.
It looks like your job ran out of time-- a 53 GB CSV file is a lot to process in one thread. Best practice is to either split your data in multiple chunks, or upload uncompressed data which can be processed in parallel.
I'm in the process of raising the allowed time somewhat, and we'll work on improving the error message when this happens.