Pyspark partition data by a column and write parquet - dataframe

I need to write parquet files to separate S3 keys based on the values in a column. The column city has thousands of values. Iterating with a for loop, filtering the dataframe by each column value, and then writing the parquet is very slow. Is there any way to partition the dataframe by the city column and write the parquet files?
What I am currently doing -
for city in cities:
    print(city)
    spark_df.filter(spark_df.city == city).write.mode('overwrite').parquet(f'reporting/date={date_string}/city={city}')

The partitionBy function solves the issue:
spark_df.write.partitionBy('date', 'city').parquet('reporting')
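For example (a rough sketch, assuming spark_df has a city column and that date_string holds the report date, as in the question), the date can be added as a literal column so that partitionBy can pick up both partition keys:

from pyspark.sql import functions as F

# Sketch only: 'date' is added from the question's date_string so that both
# partition columns exist on the DataFrame before the single write.
(spark_df
    .withColumn('date', F.lit(date_string))
    .write
    .mode('overwrite')
    .partitionBy('date', 'city')
    .parquet('reporting'))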

Related

Copy parquet file content into an SQL temp table and include partition key as column

I have multiple parquet files in S3 that are partitioned by date, like so:
s3://mybucket/myfolder/date=2022-01-01/file.parquet
s3://mybucket/myfolder/date=2022-01-02/file.parquet
and so on.
All of the files follow the same schema, except for some, which is why I am using FILLRECORD (to fill missing columns with NULL values). Now I want to load the content of all these files into an SQL temp table in Redshift, like so:
DROP TABLE IF EXISTS table;
CREATE TEMP TABLE table
(
var1 bigint,
var2 bigint,
date timestamp
);
COPY table
FROM 's3://mybucket/myfolder/'
access_key_id 'id' secret_access_key 'key'
PARQUET FILLRECORD;
The problem is that the date is not a column inside the parquet files, which is why the date column in the resulting table is NULL. I am trying to find a way to get the partition-key date inserted into the temp table.
Is there any way to do this?
I believe there are only 2 approaches to this:
Perform N COPY commands, one per S3 partition value, and populate the date column with the partition key value as a literal (sketched below). A simple script can issue the SQL to Redshift. The issue with this is that you are issuing many COPY commands, and if each partition in S3 has only 1 parquet file (or a few files) this will not take advantage of Redshift's parallelism.
Define the region of S3 with the partitioned parquet files as a partitioned Redshift external table and then run INSERT INTO <local table> (SELECT * FROM <external table>);. The external table knows about the partition key and can insert this information into the local table. The downside is that you need to define the external schema and table, and if this is a one-time process, you will want to tear these down afterwards.
There are some other ways to attack this, but none that are worth the effort or that won't be very slow.
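For the first approach, a rough sketch (assuming boto3 and psycopg2, the bucket layout from the question, and a hypothetical staging_table that holds only the parquet columns var1 and var2) could look like this:

import boto3
import psycopg2

# Enumerate the date= partition prefixes under the folder from the question.
s3 = boto3.client('s3')
prefixes = s3.list_objects_v2(
    Bucket='mybucket', Prefix='myfolder/', Delimiter='/'
).get('CommonPrefixes', [])

conn = psycopg2.connect('dbname=dev host=... user=... password=...')  # placeholder DSN
cur = conn.cursor()

for p in prefixes:
    prefix = p['Prefix']                              # e.g. 'myfolder/date=2022-01-01/'
    date_value = prefix.rstrip('/').split('date=')[-1]

    # Load one partition into the staging table (which has no date column) ...
    cur.execute(f"""
        COPY staging_table
        FROM 's3://mybucket/{prefix}'
        access_key_id 'id' secret_access_key 'key'
        PARQUET FILLRECORD;
    """)
    # ... then carry the partition key over as a literal.
    cur.execute(f"""
        INSERT INTO table
        SELECT var1, var2, TIMESTAMP '{date_value}'
        FROM staging_table;
    """)
    cur.execute("TRUNCATE staging_table;")

conn.commit()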

How can I rename the data of a pandas dataframe column?

To be specific: I have a pandas dataframe column called id, and its data is 10, 20, 30, etc.
How can I renumber the column's data to 1, 2, 3, 4, etc. in ascending order instead of 10, 20, 30?
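One possible sketch (assuming the goal is to replace the ids with 1, 2, 3, ... in ascending order of the original values):

import pandas as pd

# Minimal example with data shaped like the question describes.
df = pd.DataFrame({'id': [10, 20, 30, 40]})

# Dense rank maps the original ids to 1, 2, 3, ... in ascending order.
df['id'] = df['id'].rank(method='dense').astype(int)
print(df)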

Apache Hive Partition & Bucketing Structure

In Apache Hive, how does the directory structure look after a huge dataset has been partitioned and then bucketed?
For example: I have a customer dataset for a country; the data is partitioned by state and then bucketed by city. How do we know how many files will be present in a city bucket?
A partition is a directory, and each partition corresponds to a specific value of the partition column.
Within a table or a partition/directory, buckets are organized as files. The number of buckets is predefined when creating the table with CLUSTERED BY (sth) INTO K BUCKETS. There will be ONE file for each individual bucket. Hive assigns records to buckets based on the hash value calculated from the bucketing column, taken modulo the number of buckets K.
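As a conceptual illustration (Hive uses its own hash function internally, so this is only a sketch of the idea, not Hive's actual hashing):

# Conceptual sketch: a record goes to bucket hash(bucket_column) mod K,
# and bucket i is stored as a single file such as 000000_0, 000001_0, ...
K = 4  # number of buckets declared in CLUSTERED BY ... INTO K BUCKETS

def bucket_for(city: str, num_buckets: int = K) -> int:
    # Hive's real hash differs from Python's hash(); this just shows the mod-K idea.
    return hash(city) % num_buckets

for city in ['San Jose', 'Austin', 'Seattle']:
    print(city, '-> bucket file', f'{bucket_for(city):06d}_0')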
The maximum number of buckets is 256. For more details, refer to: What is the difference between partitioning and bucketing a table in Hive?

One file per partition (coalesce per partition) while inserting data into a Hive table

I have a table created in Hive, stored in an S3 location.
It has about 10 columns and is partitioned on 3 columns: month, year and city, in that order.
I am running a Spark job that creates a dataframe (2 billion rows) and writes into this table.
val partitions:Seq[Column] = Seq(col("month"),col("year"),col("city"))
df.repartition(partitions: _*).selectExpr(cs.map(_.name): _*).write.mode("overwrite").insertInto(s"$tableName")
selectExpr(cs.map(_.name): _*) reorders the columns in the dataframe to align with the ordering in the table.
When I run the above command to insert into the table, I see that many staging files and multiple small files are created under each city.
s3://s3Root/tableName/month/year/city/file1.csv
file2.csv
...
file200.csv
I am hoping to get a single file under each city per year per month, i.e. to coalesce per partition.
Expected:
s3://s3Root/tableName/month/year/city/file.csv
Any help is appreciated.
To achieve one file per partition, you should use .partitionBy():
val partitions:Seq[Column] = Seq(col("month"),col("year"),col("city"))
// partitionBy takes column names (String*), so pass the names rather than Column objects
df.repartition(partitions: _*).selectExpr(cs.map(_.name): _*).write.partitionBy("month", "year", "city").mode("overwrite").insertInto(s"$tableName")
I think you could avoid the repartition beforehand; if you do only the partitionBy, the files will be written one per partition.
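A rough PySpark sketch of the same idea, assuming you write to the table's S3 location directly instead of going through insertInto (path and column names are taken from the question; the table's partitions may still need to be registered afterwards, e.g. with MSCK REPAIR TABLE):

# Repartitioning on the partition columns means each (month, year, city)
# combination is written by a single task, so each partition directory
# usually ends up with one file.
(df.repartition('month', 'year', 'city')
   .write
   .mode('overwrite')
   .partitionBy('month', 'year', 'city')
   .parquet('s3://s3Root/tableName'))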

To populate bucketed tables in Hive

I have created a Hive table with gender as the bucket column.
create table userinfoBucketed(userid INT, age INT, gender STRING, occupation STRING, zipcode STRING)
CLUSTERED BY (gender) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
I am loading the following data from a text file into the table
(user id | age | gender | occupation | zip code):
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
I have set the hive.enforce.bucketing property to true:
set hive.enforce.bucketing=true;
1. When I inserted data into the table using the LOAD command, buckets were not created; all the data was stored in one bucket:
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
Question 1: Why is the data not split into 2 buckets?
2. When I inserted data into the table from another table, the data was stored in 2 buckets. Here is the command I executed:
insert into table userinfobucketed select * from userinfo where gender='M';
Now bucket 1 (000000_0) has the below data:
1|24|M|technician|85711
4|24|M|technician|43537
6|42|M|executive|98101
7|57|M|administrator|91344
Bucket 2 (000001_0) has the below data:
3|23|M|writer|32067
Question 2: I do not understand why the data got stored in 2 buckets even though all the records have the same gender.
Then I inserted data into the table again, using the below command:
insert into table userinfobucketed select * from userinfo where gender='F';
Now 2 more extra buckets (000000_0_copy_1, 000001_0_copy_1) were created and the data was stored in those instead of in the existing buckets. That makes 4 buckets in total, even though the table was created with 2 buckets.
Question 3: Why were the extra buckets created instead of the data being copied into the existing buckets?
Please clarify.
Thanks
Sean
Q1: Why doesn't this work to insert into a bucketed table?
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
A1: Take a look at this tutorial for inserting into bucketed tables. Hive does not support loading into bucketed tables directly from a flat file using LOAD DATA INPATH, so you have to LOAD the data into a regular table first and then INSERT OVERWRITE into your bucketed table (sketched below).
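A rough sketch of that pattern, issued through HiveServer2 with PyHive (connection details are placeholders; userinfo is assumed to be the plain, non-bucketed table from the question with the same columns):

from pyhive import hive

# Placeholder connection to HiveServer2.
cur = hive.connect(host='localhost', port=10000).cursor()

cur.execute("SET hive.enforce.bucketing=true")

# Load the flat file into the plain staging table first ...
cur.execute("LOAD DATA LOCAL INPATH '/home/mainnode/u.user' INTO TABLE userinfo")

# ... then let Hive hash every row into the bucketed table's 2 buckets.
cur.execute("INSERT OVERWRITE TABLE userinfobucketed SELECT * FROM userinfo")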
Q2: Why was the inserted data split into 2 buckets even though all records had the same value for the bucket column?
A2: Hmm. This is abnormal behavior. You should never see records with the same bucket column value getting hashed into different buckets. I suspect you did not drop the table and recreate it after trying the LOAD DATA INPATH method above in Q1. If that were the case, new buckets would be created on the insert, disregarding what's in the existing buckets, which leads us to the next question...
Q3: Why were extra buckets created instead of inserting into existing buckets?
A3: Hive does not append new data to files on insert. Even though you told Hive that your table is bucketed, it only hashes the data you are currently inserting; it does not take into account the data already in the table.
To maintain the number of buckets set in the table definition, you will have to hash all the data together every time you do an insert, and use INSERT OVERWRITE instead of INSERT INTO to overwrite the table.
Generally this is much easier to do if your table is partitioned, so you're not copying and re-hashing your whole table every time you have to do an insert. And speaking of partitioning: since it has such low cardinality, gender is much better suited as a partition column than a bucket column. This article does a pretty good job of explaining this concept.
Bucketing is driven by the hash of the column. Apparently M and F result in the same hash. You might consider making gender part of the partitioning key - to ensure they end up in different physical files.
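A rough sketch of that alternative via PyHive (userinfopartitioned is a hypothetical table name, connection details are placeholders, and the column names are reused from the question):

from pyhive import hive

cur = hive.connect(host='localhost', port=10000).cursor()

# Hypothetical table partitioned by gender instead of bucketed on it.
cur.execute("""
    CREATE TABLE userinfopartitioned (
        userid INT, age INT, occupation STRING, zipcode STRING
    )
    PARTITIONED BY (gender STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
""")

# Allow dynamic partitioning on the single (dynamic) partition column.
cur.execute("SET hive.exec.dynamic.partition=true")
cur.execute("SET hive.exec.dynamic.partition.mode=nonstrict")

# The dynamic partition column must come last in the SELECT list.
cur.execute("""
    INSERT OVERWRITE TABLE userinfopartitioned PARTITION (gender)
    SELECT userid, age, occupation, zipcode, gender FROM userinfo
""")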