Why is Spark reading more data than I expect when I specify a read schema?

In my Spark job, I'm reading a huge Parquet table with more than 30 columns. To limit the amount of data read, I specify a schema containing only the one column I need. Unfortunately, the Spark UI reports "size of files read" as 1123.8 GiB but "filesystem read data size total" as 417.0 GiB. I expected that reading one column out of 30 would bring the filesystem read data size total down to around 1/30 of the full size, not almost half.
Could you explain why this is happening?
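To put the two Spark UI figures side by side, here is a small arithmetic sketch. It uses only the numbers quoted in the question and treats the column count as exactly 30 (the question says "more than 30"); the function name is made up for illustration.

```python
def read_fraction(bytes_read_gib: float, total_gib: float) -> float:
    """Fraction of the total file size that was actually read from the filesystem."""
    return bytes_read_gib / total_gib

observed = read_fraction(417.0, 1123.8)  # figures reported by the Spark UI
naive = 1 / 30                           # one column of ~30, ASSUMING all columns are equally large

print(f"observed: {observed:.1%}, naive expectation: {naive:.1%}")
# observed: 37.1%, naive expectation: 3.3%
```

The 1/30 estimate only holds if every column occupies the same space on disk. Parquet stores each column separately, so one wide column (long strings, for example) can account for a large share of the file; that is one possible reason for the gap, not a confirmed diagnosis.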

Related

Writing a large PySpark DataFrame to S3 is very slow

This question is related to my previous question, "aggregate multiple columns in sql table as json or array".
I am posting some updates and follow-up questions here because I ran into a new problem.
I would like to query a table in a Presto database from PySpark via Hive and create a PySpark DataFrame from it. I need to save the DataFrame to S3 quickly and then read it back from S3 efficiently as Parquet (or any other format, as long as it can be read and written fast).
To keep the size as small as possible, I have aggregated some columns into a JSON object.
The original table (> 10^9 rows; some columns, e.g. obj_desc, may contain more than 30 English words):
id  cat_name   cat_desc       obj_name  obj_desc     obj_num
1   furniture  living office  desk      4 corners    1.5
1   furniture  living office  chair     4 legs       0.8
1   furniture  restroom       tub       white wide   2.7
1   cloth      fashion        T-shirt   black large  1.1
I aggregated the columns into a JSON object like this:
from pyspark.sql import functions as F
aggregation_cols = ['cat_name', 'cat_desc', 'obj_name', 'obj_desc', 'obj_num']  # all strings; names as in the table above
# Pack the columns into one JSON string per row, then drop the originals.
df_temp = df.withColumn("cat_obj_metadata", F.to_json(F.struct(*aggregation_cols))).drop(*aggregation_cols)
# Collect the JSON strings for each id into a list.
df_temp_agg = df_temp.groupBy('id').agg(F.collect_list('cat_obj_metadata').alias('cat_obj_metadata'))
df_temp_agg.cache()
df_temp_agg.printSchema()
# df_temp_agg.count()  # this ran for a very long time without returning, so I am not sure how large the result is
df_temp_agg = df_temp_agg.repartition(1024)  # repartition() returns a new DataFrame, so keep the result; not sure what the optimal number is
df_temp_agg.write.parquet(s3_path, mode='overwrite')  # this ran for a long time (> 12 hours) without returning
I am running on a cluster of 4 m4.4xlarge nodes, and none of the cores look busy.
I also checked the S3 bucket: no folder was created at "s3_path".
For other, smaller DataFrames, I can see "s3_path" being created when "write.parquet()" runs. For this large DataFrame, however, no folders or files are created at "s3_path".
Because
df_temp_agg.write.parquet()
never returns, I am not sure what errors could be happening on the Spark cluster or on S3.
Could anybody help me with this? Thanks.

Google BigQuery fails with "Resources exceeded during query execution: UDF out of memory" when loading Parquet file

We use the BigQuery Java API to upload data from a local data source as described here. When uploading a Parquet file with 18 columns (16 string, 1 float64, 1 timestamp) and 13 million rows (about 17 GB of data), the upload fails with the following exception:
Resources exceeded during query execution: UDF out of memory.; Failed
to read Parquet file . This might happen if the file contains a row
that is too large, or if the total size of the pages loaded for the
queried columns is too large.
However, when uploading the same data as CSV (17.5 GB of data), the upload succeeds. My questions are:
What is the difference when uploading Parquet or CSV?
What query is executed during upload?
Is it possible to increase the memory for this query?
Thanks
Tobias
Parquet is a columnar data format, so loading a file means reading every column in it. In Parquet, columns are divided into pages. BigQuery keeps the entire uncompressed pages for each column in memory while reading data from them. If the input file contains too many columns, BigQuery workers can hit out-of-memory errors.
Although no precise limit is enforced, as there is for other formats, it is recommended that records stay within about 50 MB; loading larger records may lead to resourcesExceeded errors.
Taking into account the above considerations, it would be great to clarify the following points:
What is the maximum size of rows in your Parquet file?
What is the maximum page size per column?
This information can be retrieved with a publicly available tool.
If you are thinking about increasing the memory allocated to queries, you need to read about BigQuery slots.
In my case, I ran bq load --autodetect --source_format=PARQUET ... which failed with the same error (resources exceeded during query execution). Finally, I had to split the data into multiple Parquet files so that they would be loaded in batches.
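As a rough sketch of the splitting step, the helper below only plans which row ranges would go into which file; it does not read or write Parquet. The 13,000,000 row count comes from the question, while the 1,000,000 rows per file and the function name are arbitrary assumptions for illustration.

```python
def plan_batches(total_rows: int, rows_per_file: int) -> list[tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, one range per output file."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    return [
        (start, min(start + rows_per_file, total_rows))
        for start in range(0, total_rows, rows_per_file)
    ]

# 13 million rows as in the question; 1 million rows per file is an assumed batch size.
batches = plan_batches(13_000_000, 1_000_000)
print(len(batches), batches[0], batches[-1])
# 13 (0, 1000000) (12000000, 13000000)
```

Each range would then be written to its own Parquet file and loaded as a separate batch; the right batch size depends on how wide the rows are.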

Unable to extract data in a single .csv file from Google Big Query (though data is smaller than 1GB)

I am only able to export the data as 4 separate files of about 90 MB each, which doesn't make sense to me.
I have read the limitations of Google BigQuery, which say that data larger than 1 GB cannot be downloaded as a single CSV file.
My data is about 250 - 300 MB in size.
This is what I usually do to export data from GBQ:
I save the result as a table in Google BigQuery (as it has more than 16000 rows).
Then I export it to the bucket using the following destination:
gs://[your_bucket]/file-name-*.csv
I think 2M rows of data is less than 1 GB (let me know if I am wrong).
Can I get this data in a single CSV file?
Thank you.
You should take the wildcard out of the name of the blob you want to write to. The wildcard is what tells BQ that you want to export to multiple files.
So you should rather export to gs://[your_bucket]/file-name.csv
As you noted, this won't work if your data is bigger than 1GB, but you should be fine if total is about 300MB.
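The rule in this answer can be written down as a small helper. This is only a sketch: the function name is hypothetical, and treating "1 GB" as 1024**3 bytes is an assumption, since the answer does not say which definition the limit uses.

```python
ONE_GB = 1024 ** 3  # ASSUMPTION: the 1 GB single-file limit counted in binary bytes

def export_uri(bucket: str, name: str, size_bytes: int) -> str:
    """Pick a destination URI: one file if the data fits the limit, a wildcard otherwise."""
    if size_bytes <= ONE_GB:
        return f"gs://{bucket}/{name}.csv"    # no wildcard: a single output file
    return f"gs://{bucket}/{name}-*.csv"      # wildcard: the export is split into several files

print(export_uri("your_bucket", "file-name", 300 * 1024 ** 2))
# gs://your_bucket/file-name.csv
```

With roughly 300 MB of data the helper returns the single-file form, which matches the advice above.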
Alternatively, you can get a Node.js readable stream that contains the result of your query (https://cloud.google.com/nodejs/docs/reference/bigquery/2.0.x/BigQuery#createQueryStream).
Each chunk of data in the stream is one row of the result set.
You can then write the data, row by row, to a CSV file (locally or in Cloud Storage).
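The answer refers to the Node.js client; the same row-by-row pattern is sketched here in Python using only the standard library. The `fake_stream` generator is a stand-in for a streamed query result and is not a BigQuery API; `write_rows` is likewise a hypothetical helper name.

```python
import csv
import io
from typing import Iterable, Mapping, TextIO

def write_rows(rows: Iterable[Mapping[str, object]], out: TextIO) -> int:
    """Write rows to CSV one at a time, taking the header from the first row."""
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count

def fake_stream():
    """Stand-in for a streamed query result (NOT a real BigQuery call)."""
    yield {"id": 1, "name": "a"}
    yield {"id": 2, "name": "b"}

buf = io.StringIO()
n = write_rows(fake_stream(), buf)
print(n)
print(buf.getvalue())
```

Because only one row is held at a time, memory use stays flat regardless of the result size; in practice `out` would be an open file rather than an in-memory buffer.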

BQ Load error : Avro parsing error in position 893786302. Size of data block 27406834 is larger than the maximum allowed value 16777216

To BigQuery experts,
I am working on a process that requires us to represent each customer's shopping history by concatenating the last 12 months of transactions into a single column, for Solr faceting using prefixes.
While trying to load this data into BigQuery, we get the row-limit error below. Is there any way around this? The actual tuple size is around 64 MB, whereas the Avro limit is 16 MB.
[ ~]$ bq load --source_format=AVRO --allow_quoted_newlines --max_bad_records=10 "syw-dw-prod":"MAP_ETL_STG.mde_golden_tbl" "gs://data/final/tbl1/tbl/part-m-00005.avro"
Waiting on bqjob_r7e84784c187b9a6f_0000015ee7349c47_1 ... (5s) Current status: DONE
BigQuery error in load operation: Error processing job 'syw-dw-prod:bqjob_r7e84784c187b9a6f_0000015ee7349c47_1': Avro parsing error in position 893786302. Size of data
block 27406834 is larger than the maximum allowed value 16777216.
Update: This is no longer true, the limit has been lifted.
BigQuery's limit on loaded Avro file's block size is 16MB (https://cloud.google.com/bigquery/quotas#import). Unless each row is actually greater than 16MB, you should be able to split up the rows into more blocks to stay within the 16MB block limit. Using a compression codec may reduce the block size.
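The idea of spreading rows over more blocks can be sketched as a simple greedy packing over row sizes. This is not Avro writer code: it ignores block framing overhead and compression, and the function name is hypothetical. The 16777216-byte limit is the value quoted in the error message.

```python
MAX_BLOCK_BYTES = 16_777_216  # limit quoted in the error message above

def pack_into_blocks(row_sizes, max_block=MAX_BLOCK_BYTES):
    """Greedily group row indices into blocks whose total size stays within max_block."""
    blocks, current, current_size = [], [], 0
    for i, size in enumerate(row_sizes):
        if size > max_block:
            # A single row larger than the limit cannot be fixed by re-blocking.
            raise ValueError(f"row {i} ({size} bytes) exceeds the block limit")
        if current and current_size + size > max_block:
            blocks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        blocks.append(current)
    return blocks

print(pack_into_blocks([10_000_000, 10_000_000, 5_000_000]))
# [[0], [1, 2]]
```

The `ValueError` branch mirrors the caveat in the answer: re-blocking only helps when each individual row is below the limit, which would not be the case for the 64 MB tuples mentioned in the question.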

Increasing Spark Read and Parquet Conversion Performance for Gzipped Text File

Use case:
A> Gzipped text files are stored in an AWS S3 location
B> A Hive table is created on top of the files, so that their data can be accessed as a table
C> A Spark DataFrame is used to read the table and convert it to Parquet with Snappy compression
D> The table has 25 fields, including 2 partition columns. All fields are of type String except for two, which are Decimal.
Spark options used: --executor-memory 37G --executor-cores 5 --num-executors 20
Cluster size: 10 data nodes of type r3.8xLarge
I found that the number of vCores in use on AWS EMR is always equal to the number of files, maybe because gzip files are not splittable. The gzipped files come from a different system and are around 8 GB in size.
The Parquet conversion takes more than 2 hours in total for 6 files with a total size of 29.8 GB.
Is there a way to improve the performance in Spark, using version 2.0.2?
Code snippet:
Code Snippet:
val srcDF = spark.sql(stgQuery)
srcDF.write
  .partitionBy("data_date", "batch_number")
  .options(Map(
    "compression" -> "snappy",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version" -> "2",
    "spark.speculation" -> "false"))
  .mode(SaveMode.Overwrite)
  .parquet(finalPath)
It doesn't matter how many nodes you ask for or how many cores there are: if you have 6 files, six threads will be assigned to work on them. Try one of the following:
save in a splittable format (snappy)
get the source to save their data as many smaller files
do some incremental conversion into a new format as you go along (e.g. a single spark-streaming core polling for new gzip files, then saving elsewhere into snappy files). Maybe try AWS Lambda as the trigger for this, to save dedicating a single VM to the task.
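The point about six files and six threads can be illustrated with a small back-of-the-envelope model. It is a simplification, not a description of Spark's scheduler: the function name is hypothetical, and the figure of 40 splits per file is an arbitrary assumption used only to show the contrast. The core count comes from the options in the question.

```python
def effective_parallelism(num_files: int, splittable: bool,
                          splits_per_file: int, total_cores: int) -> int:
    """Upper bound on concurrently running read tasks in this simplified model."""
    # A non-splittable file (such as gzip) yields one read task regardless of its size.
    tasks = num_files * splits_per_file if splittable else num_files
    return min(tasks, total_cores)

total_cores = 20 * 5  # --num-executors 20 --executor-cores 5

print(effective_parallelism(6, False, 1, total_cores))   # 6: one task per gzip file
print(effective_parallelism(6, True, 40, total_cores))   # 100: limited by cores instead (40 splits/file is assumed)
```

In this model, 94 of the 100 requested cores sit idle during the read when the input is six gzip files, which is consistent with the vCore usage described in the question.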