How do I use awswrangler to read only the first N rows of a parquet file stored in S3? - pandas

I am trying to use awswrangler to read an arbitrarily large parquet file stored in S3 into a pandas dataframe, but limiting the read to the first N rows because of the file's size (and my poor bandwidth).
I cannot see how to do it, or whether it is even possible without relocating the file.
Could I use chunked=INTEGER and abort after reading the first chunk, say? If so, how?
I have come across this partial solution (for the last N rows) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me, and the accepted answer doesn't get all the way to the end of the story (helpful as it is).
Or is there another way without first downloading the file (which I could probably have done by now)?
Thanks!

You can do that with awswrangler using S3 Select. For example:
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s LIMIT 5",
    path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
This would return only 5 rows from the S3 object.
This is not possible with the other read methods, because the entire object would have to be pulled down locally before it could be read. With S3 Select, the filtering is done on the server side instead.
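If you would rather experiment with the chunked approach mentioned in the question, a minimal sketch might look like the following (assuming wr.s3.read_parquet with an integer chunked argument; the path is a placeholder). Note that the rows in the first chunk still have to be transferred locally, so S3 Select remains the better fit when bandwidth is the constraint.
import awswrangler as wr

# Sketch only: read the parquet file in chunks of 5 rows and stop after the
# first chunk. The path below is a placeholder.
chunks = wr.s3.read_parquet(
    path="s3://my-bucket/my-large-file.snappy.parquet",
    chunked=5,  # yield DataFrames of at most 5 rows each
)
df_first = next(iter(chunks))  # take the first chunk and abandon the iterator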

Related

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example:
# set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'

# read file to dataframe
# entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)

# rawdf contains a single row of file_name, timestamp_created, and obj_array
# obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")

# this column explosion fails because obj_array exceeds 2GB
When you hit limits like this, you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader that emits records in whatever shape you deem reasonable? That is likely the best solution if you want to leave the files as they are.
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks to try to expand and fill rows with window functions/lag.
You could also do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with.
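As a concrete illustration of the file-level cleaning idea, here is a minimal sketch that streams the big array into newline-delimited JSON before Spark ever reads it. It assumes the ijson streaming parser is installed, that the payload sits under a top-level obj_array key as in the question, and that the paths and the itemSchema name below are hypothetical.
import json
import ijson  # streaming JSON parser; assumed to be installed

src = "/dbfs/mnt/datalake/jsonfiles/filename.json"         # hypothetical FUSE path to the source file
dst = "/dbfs/mnt/datalake/jsonfiles/filename_ndjson.json"  # hypothetical output path

with open(src, "rb") as fin, open(dst, "w") as fout:
    # Stream one element of obj_array at a time instead of loading the whole file.
    for obj in ijson.items(fin, "obj_array.item"):
        # default=float converts the Decimal values ijson yields for numbers.
        fout.write(json.dumps(obj, default=float) + "\n")

# The flattened file can then be read without multiline mode, so no single
# value ever approaches the 2GB column limit:
# df = spark.read.schema(itemSchema).json("dbfs:/mnt/datalake/jsonfiles/filename_ndjson.json")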

Saving single parquet file on spark at only one partition while using multi partitions

I'm trying to save my dataframe as a single parquet file instead of multiple part files (I don't know the exact term for this, pardon my English).
here is how I made my df
val df = spark.read.format("jdbc").option("url","jdbc:postgresql://ipaddr:port/address").option("header","true").load()
and I have 2 IP addresses where I run different master / worker servers.
For example,
ip: 10.10.10.1 runs 1 master server
ip: 10.10.10.2 runs 2 worker servers (and these workers work for ip1)
and I'm trying to save my df as a parquet file only on the master server (ip1).
Normally I would save the file with (I use Spark on the master server with Scala):
df.write.parquet("originals.parquet")
Then the df gets saved as parquet on both servers (obviously, since it's Spark).
But now I'm trying to save the df as one single file, while keeping the multi-server processing for speed, yet saving the file on only one of the servers.
so I've tried using
df.coalesce(1).write.parquet("test.parquet")
and
df.coalesce(1).write.format("parquet").mode("append").save("testestest.parquet")
but the result for both is the same as with the original write.parquet: the df is saved on both servers.
I guess it's because of my lack of understanding of Spark and of what coalesce actually does on a df.
I was told that using coalesce or repartition would help me save the file on only one server, but I want to know why it is still being saved on both servers. Is this how it is supposed to work, and did I misunderstand coalesce while studying it? Or is it the way I wrote the Scala code that keeps coalesce from working effectively?
I've also found that pandas can save the df into one file, but my df is very large and I also want a fast result, so I don't think pandas is the right tool for a df this big (correct me if I'm wrong, please).
Also, I don't quite understand the usual explanation that 'repartition' and 'coalesce' are different because 'coalesce' minimizes the movement of the data. Can somebody explain it to me in an easier way, please?
To summarize: why is my use of coalesce / repartition to save a parquet file into only one partition not working? (Is saving the file on one partition not possible at all? I just realized that maybe coalesce/repartition only saves 1 parquet file in EACH partition AND NOT on the ONE partition I want.)
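For what it's worth, a minimal PySpark sketch of the coalesce/repartition distinction asked about above (illustration only; the question's own code is Scala, but the behaviour is the same, and the output path is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()
df = spark.range(1_000_000)

# coalesce(1) merges the existing partitions into one without a full shuffle,
# while repartition(1) shuffles all rows first; either way the write runs as a
# single task and produces a directory containing one part-*.parquet file
# (plus _SUCCESS), located on whatever filesystem the output path points to.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/single_file.parquet")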

pyspark writing lot of smaller files in output

I'm using pyspark to process some data and write the output to S3. I have created a table in Athena which will be used to query this data.
The data is in the form of JSON strings (one per line), and the Spark code reads the file, partitions it based on certain fields, and writes it to S3.
For a 1.1 GB file, I see that Spark is writing 36 files of approximately 5 MB each. Reading the Athena documentation, I see that the optimal file size is ~128 MB: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
sparkSess = SparkSession.builder\
    .appName("testApp")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()

sparkCtx = sparkSess.sparkContext
deltaRdd = sparkCtx.textFile(filePath)
df = sparkSess.createDataFrame(deltaRdd, schema)

try:
    df.write.partitionBy('field1', 'field2', 'field3')\
        .json(path, mode='overwrite', compression=compression)
except Exception as e:
    print(e)
Why is Spark writing such small files? Is there any way to control the file size?
Is there any way to control file size?
There are some control mechanisms; however, they are not explicit.
The S3 drivers are not part of Spark itself. They are part of the Hadoop installation that ships with Spark on EMR. The S3 block size can be set in the
/etc/hadoop/core-site.xml config file.
By default, however, it should be around 128 MB.
Why is Spark writing such small files?
Spark will adhere to the Hadoop block size. However, you can use partitionBy before writing.
Let's say you use df.write.partitionBy("date").csv("s3://products/").
Spark will create a subfolder with the date for each partition. Within
each partitioned folder, Spark will again create chunks and try to adhere to fs.s3a.block.size.
e.g.
s3://products/date=20191127/00000.csv
s3://products/date=20191127/00001.csv
s3://products/date=20200101/00000.csv
In the example above, a particular partition can simply be smaller than the block size of 128 MB.
So double-check your block size in /etc/hadoop/core-site.xml and whether you need to partition the dataframe with partitionBy before writing.
Edit:
A similar post also suggests repartitioning the dataframe to match the partitionBy scheme:
df.repartition('field1', 'field2', 'field3')\
    .write.partitionBy('field1', 'field2', 'field3')\
    .json(path, mode='overwrite', compression=compression)
writer.partitionBy operates on the existing dataframe partitions; it will not repartition the original dataframe. Hence, if the overall dataframe is partitioned differently, nested partitioning happens.
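A minimal, self-contained sketch of that suggestion (the field names come from the question; the S3 paths and the gzip codec are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
df = spark.read.json("s3://my-bucket/input/")  # placeholder input path

# Repartition on the same columns used by partitionBy, so each output
# partition directory is written by a single task and ends up as one larger
# file instead of many small ones.
(df.repartition('field1', 'field2', 'field3')
   .write.partitionBy('field1', 'field2', 'field3')
   .json("s3://my-bucket/output/", mode='overwrite', compression='gzip'))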

How to iterate through BigQuery query results and write it to file

I need to query a Google BigQuery table and export the results to a gzipped file.
This is my current code. The requirement is that each row's data should be newline (\n) delimited.
def batch_job_handler(args):
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)
    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # the result's total_rows is 1300000 records
    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but gets too slow if the result has 1,300,000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n", since I am constructing a huge string by concatenating the records with newlines?
As I am running this program in AWS Batch, it is consuming too much time. I need help iterating over the result and writing millions of records to a file in a faster way.
Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters. It supports a single comparison to a literal value.
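As a rough sketch of what a Storage-API-backed read can look like from Python (assuming the google-cloud-bigquery and google-cloud-bigquery-storage packages are installed; the query below is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT * FROM `my-project.my_dataset.big_table`")  # placeholder query

# to_dataframe() streams rows through the Storage API instead of paging over
# the REST API, which is typically much faster for millions of rows.
df = query_job.result().to_dataframe(create_bqstorage_client=True)
df.to_json("data/query_result.json.gz", orient="records", lines=True, compression="gzip")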
You should (in most cases) point the output of your BigQuery query to a temp table and export that temp table to a Google Cloud Storage bucket. From that bucket, you can download the files locally. This is the fastest route to having the results available locally. Everything else will be painfully slow, especially iterating over the results row by row, as BQ is not designed for that.
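A minimal sketch of that export route (the project, dataset, table, bucket and query names are placeholders; assumes a reasonably recent google-cloud-bigquery client library):
from google.cloud import bigquery

client = bigquery.Client()

# 1) Write the query result to a temp table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.tmp_query_result"  # placeholder temp table
)
client.query("SELECT * FROM `my-project.my_dataset.big_table`", job_config=job_config).result()

# 2) Export the temp table to GCS as gzipped newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format="NEWLINE_DELIMITED_JSON",
    compression="GZIP",
)
client.extract_table(
    "my-project.my_dataset.tmp_query_result",
    "gs://my-bucket/query_result-*.json.gz",  # wildcard lets BigQuery shard the output
    job_config=extract_config,
).result()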

Loading a spark DF from S3, multiple files. Which of these approaches is best?

I have an S3 bucket with partitioned data underlying Athena. Using Athena, I see there are 104 billion rows in my table. This is about 2 years of data.
Let's call it big_table.
Partitioning is by day, then by hour, so 07-12-2018-00, 01, 02 ... 23 for each day. The Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen - load directly from:
1. files
spark.read.load(['s3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',
                 's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet',
                 ...
                 's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet'])
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df.registerTempTable('tmp')
df = spark.sql("select * from tmp where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
#2 seems inefficient to me because all 104 billion rows (or, more accurately, all the partition_datetime fields) have to be traversed to satisfy the SELECT. I'm counseled that this really isn't an issue because of lazy execution, and that there is never a df with all 104 billion rows. I still say that at some point each partition must be visited by the SELECT, and therefore option #1 is more efficient.
I am interested in other opinions on this. Please chime in.
What you are saying might be true, but it is not efficient, as it will never scale. If you want data for three months, you cannot specify 90 lines of paths in your load command. It is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a Spark standalone or a YARN cluster.
You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')
Thom, you are right. #1 is more efficient and the way to do it. However, you can build a list of the files to read and then ask Spark to read only those files.
This blog might be helpful for your situation.
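A minimal sketch of building that list programmatically for the July 2018 window in the question (the bucket and prefix follow the question's paths; your partition layout may differ):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("july-read").getOrCreate()

base = "s3://my_bucket/my_schema/my_table_directory"
paths = [
    f"{base}/07-{day:02d}-2018-{hour:02d}/"
    for day in range(1, 32)
    for hour in range(24)
]

# load() accepts a list of paths, so only these partitions are scanned.
df = spark.read.format("parquet").load(paths)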