I have a very simple Hive table with the below structure.
CREATE EXTERNAL TABLE table1(
col1 STRING,
col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://path/';
The directory this table points to contains just ONE file, 51 KB in size.
From the pyspark shell (with all default values):
df = spark.sql("SELECT * FROM table1")
df.rdd.getNumPartitions()
The number of partitions returned is odd: sometimes it is 64 and sometimes 81.
I expected to see 1 or 2 partitions at most. Any thoughts on why I see that many partitions?
Thanks.
As you stated, the number of partitions returned is sometimes 64 and sometimes 81. That is because it is up to Spark to decide how many partitions to store the data in. Even if you use the repartition command, it is only a request to Spark to shuffle the data into the given number of partitions; if Spark thinks that is not possible, it will make the decision itself and store the data in a number of partitions of its own choosing.
Hope this explanation answers your question.
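If you need a small, specific number of partitions downstream, you can reduce the count yourself after reading. A minimal sketch, assuming the table1 from the question and the default spark session of the pyspark shell:

df = spark.sql("SELECT * FROM table1")
print(df.rdd.getNumPartitions())        # whatever Spark chose, e.g. 64 or 81
df_single = df.coalesce(1)              # coalesce only reduces the partition count, so it avoids a full shuffle
print(df_single.rdd.getNumPartitions()) # 1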
In my spark job, I'm reading a huge table (parquet) with more than 30 columns. To limit the amount of data read, I specify a schema with one column only (I need only this one). Unfortunately, the Spark UI reports that the size of files read equals 1123.8 GiB, while the filesystem read data size total equals 417.0 GiB. I was expecting that if I read one of 30 columns, the filesystem read data size total would be around 1/30 of the initial size, not almost half.
Could you explain to me why that is happening?
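For reference, a minimal sketch of reading a single column with an explicit schema and then checking what Spark plans to read; the column and path names below are placeholders, not taken from the original job:

from pyspark.sql.types import StructType, StructField, StringType

# placeholder column and path names, for illustration only
schema = StructType([StructField("my_column", StringType(), True)])
df = spark.read.schema(schema).parquet("s3://my_bucket/my_table/")
df.explain()  # the ReadSchema entry in the physical plan shows which columns Spark requests from parquet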
This question is related to my previous question at aggregate multiple columns in sql table as json or array.
I am posting some updates/follow-up questions here because I ran into a new problem.
I would like to query a table in a Presto database from PySpark/Hive and create a PySpark dataframe based on it. I then have to save the dataframe to S3 quickly and read it back as parquet (or any other format, as long as it can be read and written fast) from S3 efficiently.
In order to keep the size as small as possible, I have aggregated some columns into a JSON object.
The original table (> 10^9 rows, some columns (e.g. obj_desc) may have more than 30 English words):
id | cat_name  | cat_desc      | obj_name | obj_desc    | obj_num
1  | furniture | living office | desk     | 4 corners   | 1.5
1  | furniture | living office | chair    | 4 legs      | 0.8
1  | furniture | restroom      | tub      | white wide  | 2.7
1  | cloth     | fashion       | T-shirt  | black large | 1.1
I aggregated those columns into a JSON object:
from pyspark.sql import functions as F

# columns to pack into a single JSON string (they are all strings)
aggregation_cols = ['cat_name', 'cat_desc', 'obj_name', 'obj_desc', 'obj_num']

# pack the selected columns into one JSON column, then drop the originals
df_temp = df.withColumn("cat_obj_metadata", F.to_json(F.struct(*aggregation_cols))).drop(*aggregation_cols)

# collect all JSON objects belonging to the same id into a list
df_temp_agg = df_temp.groupBy('id').agg(F.collect_list('cat_obj_metadata').alias('cat_obj_metadata'))

df_temp_agg.cache()
df_temp_agg.printSchema()
# df_temp_agg.count() # this takes a very long time and still does not return a result, so I am not sure how large the data is.
df_temp_agg = df_temp_agg.repartition(1024) # repartition returns a new dataframe, so it must be reassigned; not sure what the optimal number is.
df_temp_agg.write.parquet(s3_path, mode='overwrite') # this takes a long time (> 12 hours) and never returns.
I work on an m4.4xlarge cluster with 4 nodes, and none of the cores look busy.
I also checked the S3 bucket: no folder is created at "s3_path".
For other, smaller dataframes I can see "s3_path" being created when "write.parquet()" is run. But for this large dataframe, no folders or files are created at "s3_path".
Because
df_temp_agg.write.parquet()
never returns, I am not sure what errors could be happening on the Spark cluster or on S3.
Could anybody help me with this? Thanks.
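For the read-back step mentioned at the top, a minimal sketch, assuming the write to s3_path eventually completes:

# read the aggregated result back from S3; parquet preserves the schema, including the array of JSON strings
df_back = spark.read.parquet(s3_path)
df_back.printSchema()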
I have an S3 bucket with partitioned data underlying Athena. Using Athena I see there are 104 billion rows in my table. This is about 2 years of data.
Let's call it big_table.
Partitioning is by day, then by hour, so 07-12-2018-00, 01, 02 ... 23 for each day. The Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen: should I load directly from
1. the files
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',
                   's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet',
                   ...
                   's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet')
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df.createOrReplaceTempView('tmp')
df = spark.sql("select * from tmp where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
#2 seems inefficient to me because the entire 104 billion rows (or, more accurately, their partition_datetime fields) have to be traversed to satisfy the SELECT. I'm counseled that this really isn't an issue because of lazy execution and that there is never a df with all 104 billion rows. I still say that at some point each partition must be visited by the SELECT, and therefore option 1 is more efficient.
I am interested in other opinions on this. Please chime in.
What you are saying might be true, but it will never scale: if you want data for three months, you cannot specify 90 lines of paths in your load command. That is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a Spark standalone or YARN cluster.
You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')
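Assuming the same MM-DD-YYYY-HH layout, a three-month range could be expressed with a brace pattern on the month field:

spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/{07,08,09}-*-2018-*/')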
Thom, you are right: #1 is more efficient and the way to do it. However, you can also build a list of the files to read and then ask Spark to read only those files.
This blog might be helpful for your situation.
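A minimal sketch of that approach, assuming the MM-DD-YYYY-HH directory layout from the question (and that every hour directory actually exists, since missing paths would make the read fail):

# build the list of per-hour directories for July 2018 and read only those
paths = [
    "s3://my_bucket/my_schema/my_table_directory/07-%02d-2018-%02d/" % (day, hour)
    for day in range(1, 32)
    for hour in range(24)
]
df = spark.read.parquet(*paths)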
I am starting to work with pyspark and have run into a bottleneck created by my code:
I'm grouping a pyspark 2.2.0 dataframe into partitions by drive_id and writing each partition (group) to its own location on S3.
I need this in order to define an Athena table on the S3 location, partitioned by drive_id; this lets me read the data very efficiently when querying by drive_id.
from pyspark.sql.functions import col

# df is a spark dataframe
g = df.groupBy(df.drive_id)
rows = sorted(g.count().collect())

# each row is a partition
for row in rows:
    w = df.where(col("drive_id") == row.drive_id)
    w.write.mode('append').parquet("s3n://s3bucket/parquet/drives/" + str(table) + "/drive_id=" + str(row.drive_id))
The problem is that the loop makes processing serial and writes drive partitions only one by one.
Obviously this doesn't scale well, because a single-partition write task is quite small and parallelizing it doesn't gain much.
How do I replace the loop with a single write command that writes all partitions to different locations in a single operation?
This operation should be parallelized to run on the Spark workers, not on the driver.
I figured out the answer, and it is surprisingly simple.
dataframe.write has an optional partitionBy(names_of_partitioning_columns) method.
So there is no need for the "group by" and no need for the loop: the single line
df.write.partitionBy("drive_id").parquet("s3n://s3bucket/dir")
creates partitions in the standard Hive format, e.g. "s3n://s3bucket/dir/drive_id=123".
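Putting this together with the write path from the question, the whole loop above could be replaced by something along these lines (a sketch that reuses the table variable and bucket from the original code):

(df.write
    .mode('append')
    .partitionBy('drive_id')
    .parquet("s3n://s3bucket/parquet/drives/" + str(table)))

Spark then creates the drive_id=<value> subdirectories itself and writes them in parallel on the workers.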
When I unload a table from Amazon Redshift to S3, it always splits the table into two parts, no matter how small the table. I have read the Redshift documentation regarding unloading, and it says nothing beyond the fact that it sometimes splits the table (I've never seen it not do that). I have two questions:
Has anybody ever seen a case where only one file is created?
Is there a way to force Redshift to unload into a single file?
Amazon recently added support for unloading to a single file by using PARALLEL OFF in the UNLOAD statement. Note that you can still end up with more than one file if the data is bigger than 6.2 GB.
By default, each slice creates one file (explanation below). There is a known workaround: adding a LIMIT to the outermost query forces the leader node to process the whole response, so it creates only one file.
SELECT * FROM (YOUR_QUERY) LIMIT 2147483647;
This only works as long as your inner query returns no more than 2^31 - 1 records, since the LIMIT clause takes an integer argument with a maximum value of 2147483647.
How are the files created? http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
Amazon Redshift splits the results of a select statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data.
So now we know that at least one file per slice is created. But what is a slice? http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
The number of slices is equal to the number of processor cores on the node. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices.
It seems that the minimum number of slices is 2, and it grows larger as more nodes or more powerful nodes are added.
As of May 6, 2014, UNLOAD queries support a new PARALLEL option. Passing PARALLEL OFF will output a single file if your data is less than 6.2 GB (larger data is split into 6.2 GB chunks).