Loading a spark DF from S3, multiple files. Which of these approaches is best?

I have an S3 bucket with partitioned data underlying Athena. Using Athena I see there are 104 billion rows in my table. This is about 2 years of data.
Let's call it big_table.
Partitioning is by day, then by hour, so 07-12-2018-00, 01, 02 ... 23 for each day. The Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen - load directly from:
1. files
df = spark.read.load(
    ['s3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',
     's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet',
     # ... one entry per hour ...
     's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet'],
    format='parquet')
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df.createOrReplaceTempView('tmp')
df = spark.sql("select * from tmp where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
#2 seems inefficient to me because the entire 104 billion rows (or, more accurately, their partition_datetime fields) have to be traversed to satisfy the SELECT. I'm counseled that this really isn't an issue because of lazy execution and that there is never a df with all 104 billion rows. I still say that at some point each partition must be visited by the SELECT, and therefore option 1 is more efficient.
I am interested in other opinions on this. Please chime in.

What you are saying might work, but it is not efficient because it will never scale. If you want data for three months, you cannot specify 90 lines of paths in your load command. That is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a Spark standalone or YARN cluster.

You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')

Thom, you are right. #1 is more efficient and the way to do it. However, you can build a list of the files to read and then ask Spark to read only those files, roughly as sketched below.
This blog might be helpful for your situation.
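For example, a rough sketch of building such a list for the July 2018 range in the question, assuming the hourly directory layout described above and an existing SparkSession named spark:
from datetime import datetime, timedelta

start = datetime(2018, 7, 1)
end = datetime(2018, 8, 1)

paths = []
current = start
while current < end:
    # one hourly directory per entry, e.g. .../07-01-2018-00/
    paths.append('s3://my_bucket/my_schema/my_table_directory/' + current.strftime('%m-%d-%Y-%H') + '/')
    current += timedelta(hours=1)

df = spark.read.load(paths, format='parquet')  # DataFrameReader.load accepts a list of paths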

Related

How do I use awswrangler to read only the first few N rows of a parquet file stored in S3?

I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N rows due to the file's size (and my poor bandwidth).
I cannot see how to do it, or whether it is even possible without relocating the file.
Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?
I have come across this incomplete solution (last N rows ;) ) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me and the accepted solution doesn't even get to the end of the story (helpful as it is).
Or is there another way without first downloading the file (which I could probably have done by now)?
Thanks!
You can do that with awswrangler using S3 Select. For example:
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s LIMIT 5",
    path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
This would return only 5 rows from the S3 object.
This is not possible with the other read methods, because the entire object must be pulled locally before it can be read. With S3 Select, the filtering is done on the server side instead.
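As for the chunked=INTEGER idea from the question, a sketch along these lines should also work (the path is a placeholder; note that, unlike S3 Select, the filtering here happens client-side, so more than 5 rows' worth of data may still be transferred):
import awswrangler as wr

# chunked=N makes read_parquet return an iterator of dataframes with up to N rows each,
# so taking only the first chunk avoids materializing the whole file in memory.
chunks = wr.s3.read_parquet(path='s3://my_bucket/big_file.snappy.parquet', chunked=5)
first_rows = next(iter(chunks))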

write large pyspark dataframe to s3 very slow

This question is related to my previous question, aggregate multiple columns in sql table as json or array.
I am posting some updates/follow-up questions here because I ran into a new problem.
I would like to query a table in a Presto database from PySpark/Hive and create a PySpark dataframe from it. I then have to save the dataframe to S3 quickly and read it back from S3 efficiently as Parquet (or any other format, as long as it can be read/written fast).
In order to keep the size as small as possible, I have aggregated some columns into a json object.
The original table (> 10^9 rows, some columns (e.g. obj_desc) may have more than 30 English words):
id  cat_name   cat_desc       obj_name  obj_desc     obj_num
1   furniture  living office  desk      4 corners    1.5
1   furniture  living office  chair     4 legs       0.8
1   furniture  restroom       tub       white wide   2.7
1   cloth      fashion        T-shirt   black large  1.1
I have aggregated some columns into a JSON object:
from pyspark.sql import functions as F

aggregation_cols = ['cat_name', 'cat_desc', 'obj_name', 'obj_desc', 'obj_num']  # they are all strings
df_temp = df.withColumn("cat_obj_metadata", F.to_json(F.struct([x for x in aggregation_cols]))).drop(*aggregation_cols)
df_temp_agg = df_temp.groupBy('id').agg(F.collect_list('cat_obj_metadata').alias('cat_obj_metadata'))
df_temp_agg.cache()
df_temp_agg.printSchema()
# df_temp_agg.count()  # this takes a very long time and never returns, so I am not sure how large it is.
df_temp_agg = df_temp_agg.repartition(1024)  # not sure what the optimal number should be?
df_temp_agg.write.parquet(s3_path, mode='overwrite')  # this runs for a long time (> 12 hours) but never returns.
I work on an m4.4xlarge cluster with 4 nodes, and none of the cores look busy.
I also checked the S3 bucket; no folder is created at "s3_path".
For other, smaller dataframes, I can see "s3_path" being created when "write.parquet()" is run. But for this large dataframe, no folders or files are created at "s3_path".
Because
df_temp_agg.write.parquet()
never returns, I am not sure what errors might be happening on the Spark cluster or on S3.
Could anybody help me with this? Thanks.

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single-node Hadoop setup, and I wrote a job to output a sorted dataframe, with only one partition, to a single CSV file. I discovered several different outcomes when using repartition in different ways.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks rather than globally.
Then I tried dropping the repartition call, but the output contained only part of the records. I realized that without repartition, Spark outputs 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
So what I did next was place repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. Yet the output remained the same: 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single-partition dataframe requires repartition or coalesce, or how the processes above affect the output. I would be grateful if someone could elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like sort, as in your case, it triggers something called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe into spark.sql.shuffle.partitions partitions.
I also struggled with the same problem and did not find an elegant solution.
Spark generally doesn't have a great concept of ordered data, because all your data is split across multiple partitions, and every time you call an operation that requires a shuffle your ordering will be changed.
For this reason, you're better off only sorting your data in Spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
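For reference, the shuffle-partition setting mentioned above can be inspected and changed at runtime; a quick sketch, assuming an existing SparkSession named spark:
spark.conf.get('spark.sql.shuffle.partitions')       # '200' by default
spark.conf.set('spark.sql.shuffle.partitions', '1')  # later shuffles (sort, groupBy, join) will target 1 partition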
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config; the default is 200.
Calling repartition before a groupBy operation is kind of pointless, because Spark needs to repartition your data again to execute the groupBy.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked. Either that, or because you called it after the groupBy operation.
A good way to understand what's going on with your query is to start using the Spark UI; it's normally available at http://localhost:4040
More info here: https://spark.apache.org/docs/3.0.0-preview/web-ui.html
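Putting the above together, here is a hedged sketch of the sort-then-single-file pattern (the column name and output path are placeholders, and as noted above this only makes sense while the data fits comfortably in a single task):
# Sort globally, then merge the sorted partitions into one partition just before writing,
# so the ordering survives into a single CSV file.
single = df.orderBy('sort_column').coalesce(1)
single.write.mode('overwrite').option('header', True).csv('output/sorted_single_csv')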

Pyspark - identical dataframe filter operation gives different output

I’m facing a particularly bizarre issue while firing filter queries on a spark dataframe. Here's a screenshot of the filter command I'm trying to run:
As you can see, I'm trying to run the same command multiple times. Each time, it's giving a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, the underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But that is not always possible for me, because the dataframe might be huge and would therefore choke memory resources if cached.
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Pandas join is slow

Edit (Oct 16 2017): I think I found the problem; it seems to be a bug in pandas core. It can't merge/join anything over 145k rows; 144k rows it can do without an issue. Pandas version 0.20.3, running on Fedora 26.
----Original post----
I have a medium-sized amount of data to process (about 200k rows with about 40 columns). I've optimised a lot of the code, but the only trouble I have now is joining the columns.
I receive the data in an unfortunate structure and need to extract the data in a certain way, then put it all into a dataframe.
Basically I extract 2 arrays at a time (each 200k rows long). One array is the timestamp, the other array is the values.
Here I create a dataframe, and use the timestamp as the index.
When I extract the second block of data, I do the same and create a new dataframe using the new values + timestamp.
I need to join the two dataframes on the index. The timestamps can be slightly different, so I use a join with the 'outer' method to keep the new timestamps. Basically I follow the documentation below.
result = left.join(right, how='outer')
https://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index
This, however, is way too slow. I left it for about 15 minutes and it still hadn't finished processing, so I killed the process.
Can anyone help? Any hints/tips?
edit:
It's a work thing, so I can't share the data, sorry. But it's just two long dataframes, each with a timestamp as the index and a single column for the values.
The code is just as described above.
data_df.join(variable_df, how='outer')
I forgot to answer this. It's not really a bug in pandas.
The timestamps were nanosecond timestamps, and joining on them as the index like this was causing a massive slowdown. Basically it was better to join on a column instead; that made it all much faster, roughly as sketched below.
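A minimal sketch of that fix, assuming two frames indexed by nanosecond timestamps as described (the names here are placeholders):
import pandas as pd

# Move the timestamps out of the index into an ordinary column on both frames.
left = data_df.reset_index().rename(columns={data_df.index.name or 'index': 'timestamp'})
right = variable_df.reset_index().rename(columns={variable_df.index.name or 'index': 'timestamp'})

# Outer merge on the column rather than joining on the index.
result = pd.merge(left, right, on='timestamp', how='outer')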