How to process data with Data Lake Analytics into multiple files with max size? - azure-data-lake

I am processing a huge number of small JSON files with Azure Data Lake Analytics and I want to save the result into multiple JSON files (if needed) with a maximum size (e.g. 128 MB).
Is this possible?
I know that there is an option to write a custom outputter, but it writes row by row only, so I have no information about the whole file size (I guess).
There is a FILE.LENGTH() property in U-SQL which gives me the size of each extracted file. Is it possible to use it to repeatedly call the output with different files, passing only the files that fit my size limit?
Thank you for your help.

Here is an example of what you can do with FILE.LENGTH():
@yourData =
    EXTRACT
        // ... columns to extract
        , file_size = FILE.LENGTH()
    FROM "/mydata/{*}" // input files path
    USING Extractors.Csv();

@res =
    SELECT *
    FROM @yourData
    WHERE file_size < 100000; // your file size limit
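The question also asks whether the size information can drive "repeatedly calling output with different files". That is essentially a batching problem: group the input files so that each group stays under the cap, then run one output per group. Below is a minimal, language-agnostic sketch of that grouping logic in Python (the local paths and the 128 MB cap are purely illustrative; this is not part of the answer above and would still need to be wired into separate U-SQL outputs or jobs).
import os

MAX_BATCH_BYTES = 128 * 1024 * 1024  # hypothetical 128 MB cap from the question

def batch_files_by_size(paths, max_bytes=MAX_BATCH_BYTES):
    """Greedily group files so each batch's total size stays under max_bytes."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Each batch could then be fed to its own output, e.g.:
# for i, batch in enumerate(batch_files_by_size(list_of_json_paths)):
#     write_batch(batch, f"/output/result_{i}.json")  # hypothetical helper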

Related

Very different file sizes when exporting data from BigQuery to GCS

I am exporting data from BQ to GCS with the following query:
export_query = f"""
EXPORT DATA
  OPTIONS(
    uri='{uri}',
    format='PARQUET',
    overwrite=true,
    compression='GZIP')
AS {query}"""
and I am seeing that the resulting files are of very different sizes; a few of them are 10x larger than the rest. I am wondering why this happens, and how I can make sure the files all have a similar size.
The maximum table size that BigQuery can export to a single file is 1 GB. For exporting more than 1 GB of data, a wildcard can be used to export the data into multiple files. When exporting data to multiple files, the file sizes vary, as mentioned in the documentation. You can check the possible options for the destinationUris property in this link.
When you export data to multiple files, the size of the files will vary because the number of files depends on the number of workers exporting the table/query to GCS in Parquet format. Combining the results into one file would require an additional shuffling step to ensure that all of the data ends up on the same partition, which is not something BigQuery currently does.
If you want to customize the number of files, you need to use Dataflow.
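If you want to see how much the shard sizes actually vary, a quick check is to list the exported objects and their sizes. A minimal sketch with the google-cloud-storage client follows; the bucket name and prefix are placeholders for wherever your wildcard URI pointed:
from google.cloud import storage

client = storage.Client()
# "my-bucket" / "exports/" are placeholders for the export destination
for blob in client.list_blobs("my-bucket", prefix="exports/"):
    print(f"{blob.name}: {blob.size / (1024 * 1024):.1f} MB")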

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
# set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'

# read file to dataframe
# entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)

# rawdf contains a single row of file_name, timestamp_created, and obj_array
# obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")
# this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to reframe the problem. Spark is choking on 2 GB in a column, and that's a pretty reasonable choke point. Why not write your own custom data reader (presentation layer) that emits records in the way that you deem reasonable? (Likely the best solution if you want to leave the files as they are.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks to try to expand and fill rows with window functions/lag.
You could do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with.
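One concrete way to do that file-level cleaning (a sketch, not the answerer's code, and it assumes the third-party ijson streaming parser; any streaming JSON parser would do) is to rewrite the single-array file as newline-delimited JSON outside Spark, so that spark.read.json can then produce one row per object instead of one >2GB array value:
import json
import ijson  # pip install ijson

with open("filename.json", "rb") as src, open("filename.ndjson", "w") as dst:
    # "obj_array.item" iterates over the objects inside the obj_array field
    for obj in ijson.items(src, "obj_array.item"):
        # ijson yields Decimal for numbers; default=float keeps json.dumps happy
        dst.write(json.dumps(obj, default=float) + "\n")

# The rewritten file can then be read without multiline mode, one object per row:
# spark.read.json("/mnt/datalake/jsonfiles/filename.ndjson")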

pyspark writing a lot of small files in output

I'm using pyspark to process some data and write the output to S3. I have created a table in Athena which will be used to query this data.
Data is in the form of JSON strings (one per line), and the Spark code reads the file, partitions it based on certain fields, and writes the result to S3.
For a 1.1 GB file, I see that Spark is writing 36 files of approximately 5 MB each. Reading the Athena documentation, I see that the optimal file size is ~128 MB: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
from pyspark.sql import SparkSession

sparkSess = SparkSession.builder\
    .appName("testApp")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()
sparkCtx = sparkSess.sparkContext
deltaRdd = sparkCtx.textFile(filePath)
df = sparkSess.createDataFrame(deltaRdd, schema)

try:
    df.write.partitionBy('field1', 'field2', 'field3')\
        .json(path, mode='overwrite', compression=compression)
except Exception as e:
    print(e)
Why is Spark writing such small files? Is there any way to control the file size?
Is there any way to control file size?
There are some control mechanisms; however, they are not explicit.
The S3 drivers are not part of Spark itself. They are part of the Hadoop installation which ships with Spark on EMR. The S3 block size can be set in the
/etc/hadoop/core-site.xml config file.
However, by default it should be around 128 MB.
Why is Spark writing such small files?
Spark will adhere to the Hadoop block size. However, you can use partitionBy before writing.
Let's say you use df.write.partitionBy("date").csv("s3://products/").
Spark will create a subfolder with the date for each partition. Within
each partitioned folder Spark will again try to create chunks and adhere to fs.s3a.block.size.
e.g.
s3://products/date=20191127/00000.csv
s3://products/date=20191127/00001.csv
s3://products/date=20200101/00000.csv
In the example above, a particular partition can simply be smaller than the 128 MB block size.
So just double check your block size in /etc/hadoop/core-site.xml and whether you need to partition the dataframe with partitionBy before writing.
Edit:
A similar post also suggests repartitioning the dataframe to match the partitionBy scheme:
df.repartition('field1', 'field2', 'field3') \
    .write.partitionBy('field1', 'field2', 'field3') \
    .json(path, mode='overwrite', compression=compression)
writer.partitionBy operates on the existing dataframe partitions. It will not repartition the original dataframe. Hence, if the overall dataframe is partitioned differently, nested partitioning happens.
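To actually land near the ~128 MB files that the Athena guidance recommends, one further knob (a sketch, not part of the answer above, assuming Spark 2.2+ where the maxRecordsPerFile write option is available) is to estimate how many rows fit in ~128 MB from the question's 1.1 GB input and cap each output file at that row count; df, path and compression are the variables from the question's code:
import math

total_rows = df.count()
input_size_bytes = 1.1 * 1024**3      # the ~1.1 GB input mentioned in the question
target_file_bytes = 128 * 1024**2     # ~128 MB per file, per the Athena guidance
rows_per_file = max(1, math.floor(total_rows * target_file_bytes / input_size_bytes))

df.repartition('field1', 'field2', 'field3') \
    .write.option("maxRecordsPerFile", rows_per_file) \
    .partitionBy('field1', 'field2', 'field3') \
    .json(path, mode='overwrite', compression=compression)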

Unable to extract data into a single .csv file from Google BigQuery (though data is smaller than 1 GB)

I am able to export the data, but only into 4 different files of about 90 MB each (which doesn't make sense).
I have read the limitations of Google BigQuery and it says that data larger than 1 GB cannot be downloaded in a single CSV file.
My data size is about 250 - 300 MB.
This is what I usually do to export data from GBQ:
I saved the table in Google BigQuery (as it has more than 16000 rows)
Then exported it to the bucket as follows:
gs://[your_bucket]/file-name-*.csv
I think 2M rows of data is less than 1 GB. (Let me know if I am wrong.)
Can I get this data in a single CSV file?
Thank you.
You should take the wildcard out of the name of the blob you want to write to; the wildcard is what tells BQ you want to export as multiple files.
So you should rather export to gs://[your_bucket]/file-name.csv
As you noted, this won't work if your data is bigger than 1 GB, but you should be fine if the total is about 300 MB.
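For reference, doing the same from code rather than the console could look like the following minimal sketch with the google-cloud-bigquery Python client (the project, dataset, table and bucket names are placeholders, not taken from the question):
from google.cloud import bigquery

client = bigquery.Client()
# no wildcard in the destination URI, so BigQuery writes a single CSV file
# (only allowed while the exported data stays under 1 GB)
extract_job = client.extract_table(
    "my_project.my_dataset.my_table",
    "gs://your_bucket/file-name.csv",
)
extract_job.result()  # wait for the export job to finish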
Alternatively, you can get a Node.js readable stream that contains the result of your query (https://cloud.google.com/nodejs/docs/reference/bigquery/2.0.x/BigQuery#createQueryStream).
Each chunk of data is a row of the result set.
You can then write the data (row by row) to CSV, either locally or to cloud storage.

How to limit the size of the files exported from BigQuery to GCS?

I used Python code to export data from BigQuery to GCS, and then gsutil to copy the files to S3. But after exporting to GCS, I noticed that some files are more than 5 GB, which gsutil cannot deal with. So I want to know a way to limit the file size.
So, following the issue tracker, the correct way to look at this is:
Single URI ['gs://[YOUR_BUCKET]/file-name.json']
Use a single URI if you want BigQuery to export your data to a single
file. The maximum exported data with this method is 1 GB.
Please note that the 1 GB maximum applies to the size of the data being exported, not to the size of the exported file.
Single wildcard URI ['gs://[YOUR_BUCKET]/file-name-*.json']
Use a single wildcard URI if you think your exported data set will be
larger than 1 GB. BigQuery shards your data into multiple files based
on the provided pattern. Exported file sizes may vary, and files won't
be equal in size.
So again, you need to use this method when your data size is above 1 GB; the resulting file sizes may vary and may go beyond 1 GB, so the 5 GB and 160 MB pair you mentioned can happen with this method.
Multiple wildcard URIs
['gs://my-bucket/file-name-1-*.json',
'gs://my-bucket/file-name-2-*.json',
'gs://my-bucket/file-name-3-*.json']
Use multiple wildcard URIs if you want to partition the export output.
You would use this option if you're running a parallel processing job
with a service like Hadoop on Google Cloud Platform. Determine how
many workers are available to process the job, and create one URI per
worker. BigQuery treats each URI location as a partition, and uses
parallel processing to shard your data into multiple files in each
location.
The same applies here as well: exported file sizes may vary and can go beyond 1 GB.
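Since the question mentions exporting with Python code, here is a minimal sketch of the single-wildcard-URI approach with the google-cloud-bigquery client (project, dataset, table and bucket names are placeholders). Note that this shards the export into multiple files but, as explained above, it does not let you cap the size of individual shards:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

extract_job = client.extract_table(
    "my_project.my_dataset.my_table",   # placeholder table
    "gs://my_bucket/file-name-*.json",  # single wildcard URI, placeholder bucket
    job_config=job_config,
)
extract_job.result()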
Try using a single wildcard URI.
See documentation for Exporting data into one or more files
Use a single wildcard URI if you think your exported data will be
larger than BigQuery's 1 GB per file maximum value. BigQuery shards
your data into multiple files based on the provided pattern. If you
use a wildcard in a URI component other than the file name, be sure
the path component does not exist before exporting your data.
Property definition:
['gs://[YOUR_BUCKET]/file-name-*.json']
Creates:
gs://my-bucket/file-name-000000000000.json
gs://my-bucket/file-name-000000000001.json
gs://my-bucket/file-name-000000000002.json ...
Property definition:
['gs://[YOUR_BUCKET]/path-component-*/file-name.json']
Creates:
gs://my-bucket/path-component-000000000000/file-name.json
gs://my-bucket/path-component-000000000001/file-name.json
gs://my-bucket/path-component-000000000002/file-name.json