How to optimize Spark to write data to S3 when joining two tables of 20 TB each - dataframe

We are joining two tables, each 20 TB, in a Spark job.
We are running it on an r5.4xlarge EMR cluster.
How can we optimize it? Can anyone share a few parameters to fine-tune the job and help it run quicker?
More details:
Both input tables are timestamp-partitioned.
The files in each partition are around 128 MB each.
The input and output file format is Parquet.
The output is written to an external Glue table and stored on S3.
Thanks in advance.
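A minimal PySpark sketch of the kind of settings usually suggested for a large sort-merge join like this; the paths, column names, and configuration values below are hypothetical and would need tuning against the actual cluster and data.

from pyspark.sql import SparkSession

# All paths, column names, and values below are placeholders.
spark = (
    SparkSession.builder
    .appName("join-two-20tb-tables")
    # many shuffle partitions keep individual tasks small for a 20 TB x 20 TB join
    .config("spark.sql.shuffle.partitions", "20000")
    # let AQE coalesce small partitions and split skewed join keys at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

left = spark.read.parquet("s3://bucket/table_a/")
right = spark.read.parquet("s3://bucket/table_b/")

# Filtering on the timestamp partition column before the join prunes whole
# partitions instead of shuffling data that can never match.
joined = (
    left.filter("event_date >= '2023-01-01'")
        .join(right, on=["join_key", "event_date"], how="inner")
)

(
    joined
    .repartition("event_date")          # group output by partition to avoid many tiny files
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://bucket/output/")     # the external Glue table points at this location
)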

Related

AWS Athena: can we write query results in a distributed manner upon query?

I'm trying to compare the performance of SELECT vs. CTAS.
The reason CTAS is faster for bigger data is the data format and its ability to write query results in a distributed manner into multiple Parquet files.
All Athena query results are written to S3 and then read from there (I may be wrong). Is there a way to write the result of a regular SELECT in a distributed manner into a single file, i.e. without bucketing or partitioning?
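For reference, a hedged sketch of issuing the CTAS side of that comparison through boto3; the database, bucket, and table names are made up.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS writes its result as multiple Parquet files in parallel, which is the
# distributed-write behavior the question refers to; a regular SELECT result
# is delivered as a single CSV at the query output location.
ctas = """
CREATE TABLE results_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-results-bucket/results_parquet/'
) AS
SELECT *
FROM source_table
WHERE event_date >= DATE '2023-01-01'
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},   # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena-output/"},
)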

Querying Glue Partitions through Athena while being overwritten?

I have a Glue table on S3 where partitions are populated through Spark's overwrite save mode (the script is executed through a Glue job).
What is the expected behavior from Athena if we query such partitions while they are being overwritten?
If you rewrite files while queries are running, you may run into errors like "HIVE_FILESYSTEM_ERROR: Incorrect fileSize 1234567 for file".
The reason is that during query planning all the files are listed on S3, and among other things the file sizes are used to divide up the work between the worker nodes. If a file is splittable, which includes file formats like ORC and Parquet, as well as uncompressed text formats (e.g. JSON, CSV), parts of it (called splits) may be processed by different nodes.
If the file changes between query planning and query execution the plan is no longer valid and the query execution fails.
New partitions are picked up by Athena as long as you set enableUpdateCatalog = True when writing. If you just overwrite the content of existing partitions, Athena will be able to query the data, as long as you don't have a schema mismatch.
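A rough sketch of the getSink pattern that sets enableUpdateCatalog when writing partitioned Parquet from a Glue PySpark job; the database, table, path, and partition column names are placeholders.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Placeholder frame standing in for the real job's output.
df = spark.createDataFrame([("a1", "2023-01-01")], ["id", "event_date"])
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# getSink with enableUpdateCatalog registers new partitions in the Data Catalog
# as part of the write, which is what lets Athena see them without a separate
# crawler run or MSCK REPAIR.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/my-table/",        # placeholder output location
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["event_date"],
)
sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="my_table")
sink.setFormat("glueparquet")
sink.writeFrame(dyf)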

Moving bigquery data to Redshift

I need to move my BigQuery tables to Redshift.
Currently I have a Python job that fetches data from BigQuery and incrementally loads it into Redshift.
This Python job reads the BigQuery data, creates a CSV file on the server, drops it on S3, and the Redshift table reads the data from the file on S3. But now the data size will be very big, so the server won't be able to handle it.
Do you happen to know anything better than this?
The 7 new BigQuery tables I need to move are around 1 TB each, with repeated columns. (I am doing an UNNEST join to flatten them.)
You could actually move the data from BigQuery to a Cloud Storage bucket by following the instructions here. After that, you can easily move the data from the Cloud Storage bucket to the Amazon S3 bucket by running:
gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket
The documentation for this can be found here.
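A hedged outline of that export-then-sync route, extended with the Redshift COPY the original pipeline already relies on; every project, bucket, table, and IAM role name below is a placeholder, and exporting Parquet instead of CSV is an assumption made to avoid staging large files on the server.

import subprocess
from google.cloud import bigquery
import psycopg2

# 1) Export the BigQuery table to Cloud Storage as Parquet (runs inside GCP,
#    so nothing is written to the local server).
bq = bigquery.Client()
extract_job = bq.extract_table(
    "my-project.my_dataset.my_table",                    # placeholder table
    "gs://my-gs-bucket/my_table/part-*.parquet",
    job_config=bigquery.ExtractJobConfig(destination_format="PARQUET"),
)
extract_job.result()

# 2) Sync the bucket to S3 (same command as in the answer above).
subprocess.run(
    ["gsutil", "rsync", "-d", "-r",
     "gs://my-gs-bucket/my_table", "s3://my-s3-bucket/my_table"],
    check=True,
)

# 3) Load into Redshift with COPY, which reads the S3 files in parallel.
conn = psycopg2.connect(host="redshift-host", dbname="db",
                        user="user", password="...", port=5439)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_schema.my_table
        FROM 's3://my-s3-bucket/my_table/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """)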

Hive Table deletion and query processing

As per my understanding of Hive concepts, if we load a dataset into a Hive table, the data file is moved from the source path to the Hive warehouse within HDFS, and HDFS is set to three replicas for the data.
These questions might look silly, but as I am a beginner I want to clear my doubts.
My questions are:
1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?
2) If we run a query on the Hive table, will that query be executed as distributed processing?
Say one data file is 1 GB in size (in turn, 8 blocks x 128 MB); as we have a replication factor of three, there would be a total of 24 blocks for this file.
Will our Hive query be distributed among all the data blocks, or will it be processed on the Hive warehouse blocks only?
Thanks in advance.
If you do "load data inpath" from a HDFS path the data will be moved from source to destination HDFS path,
If you do "load data local inpath", it doesn't move data from local to HDFS path, instead it copies
For your question
If you delete file in HDFS all the replicas are deleted.
If you have a 1gb file (8 blocks) with 3 replication factor, when you trigger the query in hive CLI, it converts your query to MR. It process only 8 blocks, in case of the datanode failure of the triggered job, it accesses the 2nd replica on a different node and processes the data (speculative execution)
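To make the move-versus-copy distinction concrete, a small sketch issuing both LOAD DATA forms through spark.sql (the same statements work in the Hive CLI); the paths and table name are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "load data inpath": the file at the HDFS source path is MOVED into the
# table's warehouse directory; the source path no longer holds it afterwards.
spark.sql("""
    LOAD DATA INPATH '/user/etl/landing/events.csv'
    INTO TABLE events
""")

# "load data local inpath": the file on the local filesystem is COPIED into
# HDFS; the local copy stays where it was.
spark.sql("""
    LOAD DATA LOCAL INPATH '/home/etl/events.csv'
    INTO TABLE events
""")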

Using Hive with Pig

My Hive query has multiple outer joins and takes very long to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to do the transformations.
Is there a way I could query Hive tables or read Hive table data within a Pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.