Adding a timestamp to Redshift copy from HDFS/Hive

I am new to Redshift. I am using a connector written in Spark that exports data from a Hive table (via a select * query) to a Redshift table.
Since the data in Redshift is always appended, I was wondering whether it is possible to add a timestamp to the data appended to Redshift, without modifying the original Hive table data, so that a simple sort on the timestamp would surface the latest output.
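Something like the following is what I had in mind (a rough sketch only; the table names and connector options are placeholders, since the actual connector in use is not shown here): stamp each exported batch with a load timestamp inside the Spark job, so the Hive table itself stays untouched.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read from Hive exactly as before; the Hive table itself is not modified.
df = spark.sql("SELECT * FROM my_hive_table")  # hypothetical source table

# Stamp the exported copy with a load timestamp; sorting the Redshift table
# by load_ts then surfaces the most recently appended rows.
df_with_ts = df.withColumn("load_ts", current_timestamp())

# Hand df_with_ts to the connector's write path instead of df. The format and
# options below are placeholders for whichever spark-redshift connector is in use.
(df_with_ts.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/dev?user=user&password=password")
    .option("dbtable", "my_redshift_table")
    .option("tempdir", "s3a://my-temp-bucket/staging/")
    .mode("append")
    .save())
If the Redshift table's DDL can be changed instead, defining the timestamp column with DEFAULT GETDATE() and leaving it out of the load is another option, provided the connector issues a column list that skips it.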

Related

Copy parquet file content into an SQL temp table and include partition key as column

I have multiple parquet files in S3 that are partitioned by date, like so:
s3://mybucket/myfolder/date=2022-01-01/file.parquet
s3://mybucket/myfolder/date=2022-01-02/file.parquet
and so on.
All of the files follow the same schema, except for a few, which is why I am using FILLRECORD (to fill in NULL values in case a column is not present). Now I want to load the content of all these files into an SQL temp table in Redshift, like so:
DROP TABLE IF EXISTS table;
CREATE TEMP TABLE table
(
var1 bigint,
var2 bigint,
date timestamp
);
COPY table
FROM 's3://mybucket/myfolder/'
access_key_id 'id' secret_access_key 'key'
PARQUET FILLRECORD;
The problem is that the date is not a column inside the parquet files, which is why the date column in the resulting table is NULL. I am trying to find a way to get the partition date inserted into the temp table.
Is there any way to do this?
I believe there are only 2 approaches to this:
Perform N COPY commands, one per S3 partition value, and populate the date column with the partition key value as a literal (a rough script for this is sketched below). A simple script can issue the SQL to Redshift. The issue is that you are running many COPY commands, and if each partition in S3 has only one parquet file (or a few files), this will not take advantage of Redshift's parallelism.
Define the area of S3 with the partitioned parquet files as a partitioned Redshift external table, and then INSERT INTO the local table with a SELECT from the external table. The external table knows about the partition key and can insert this information into the local table. The downside is that you need to define the external schema and table, and if this is a one-time process, you will want to tear these down afterwards.
There are other ways to attack this, but none are worth the effort or they will be very slow.
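A rough Python sketch of the first approach (this assumes a psycopg2 connection and that the CREATE TEMP TABLE from the question has already been run in the same session; the partition dates are listed explicitly for brevity):
import psycopg2

# Placeholder connection details; in practice these come from your environment.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com", port=5439,
                        dbname="dev", user="user", password="password")
conn.autocommit = True
cur = conn.cursor()

partition_dates = ["2022-01-01", "2022-01-02"]  # one entry per date= partition in S3

for d in partition_dates:
    # COPY only the files under this partition prefix; the date column stays NULL.
    cur.execute(f"""
        COPY table
        FROM 's3://mybucket/myfolder/date={d}/'
        access_key_id 'id' secret_access_key 'key'
        PARQUET FILLRECORD;
    """)
    # Stamp the rows that were just loaded with the partition key value.
    cur.execute(f"UPDATE table SET date = '{d}' WHERE date IS NULL;")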

Scheduled query to append data to partitioned BigQuery table - incompatible table partitioning specification

I am trying to append data to a table partitioned by month using the BQ Console.
The SQL used to create the table and partition was:
CREATE TABLE xxxxxx
PARTITION BY DATE_TRUNC(Event_Dt, MONTH)
I used Event_Dt as the partitioning field in the BQ Console.
The scheduled query does not run and I get the following error message:
"Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:MONTH,field:Event_Dt), but transfer target partitioning specification is interval(type:DAY,field:Event_Dt). Please retry after updating either the destination table or the transfer partitioning specification."
How do I enter Event_Dt in the BQ Console to indicate that it is partitioned by month and not day?
I solved my problem. All I needed to do was remove Event_Dt from the Destination table partitioning field in the BQ Console. The partitioned table updated successfully when I left the field blank.

How can I migrate an ingestion time partitioned table between regions in BigQuery?

I want to migrate my ingestion time partitioned table from one region to another (using custom Python scripts), but when I extract and load them, they all fall into today's partition, as this is when they are ingested into the table.
How can I make sure the new table contains the same ingestion time partition structure as the original?
You can extract specific partitions within a table using the Python client library. So, instead of extracting the entire table, you can specify which partition you want with the decorator syntax (project.dataset.table$YYYYMMDD for daily or project.dataset.table$YYYYMMDDHH for hourly ingestion partitioned tables) and then simply load them per partition using the same decorator format.
Here is some code:
from google.cloud import bigquery

client = bigquery.Client()
# Reference a single ingestion-time partition using the $YYYYMMDD decorator.
table_partition = client.get_table("my-project.my_dataset.mytable$20200727")
# Export that partition to GCS as gzipped newline-delimited JSON.
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
extract_job = client.extract_table(table_partition, "gs://my-bucket/mytable_20200727.json.gz", location="US", job_config=job_config)
job_result = extract_job.result()
You can find a list of all your ingestion time partitions using this query:
SELECT DISTINCT _PARTITIONTIME AS pt
FROM `my-project.my_dataset.mytable`
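To complete the migration, each exported partition can then be loaded into the table in the destination region through the same decorator, so the rows keep their original partition instead of falling into today's. A rough sketch (this assumes the destination table already exists with daily ingestion-time partitioning; the bucket, dataset, and region names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Loading through the $YYYYMMDD decorator routes the rows into that partition.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replaces only this partition
)
load_job = client.load_table_from_uri(
    "gs://my-destination-bucket/mytable_20200727.json.gz",  # the exported file for this partition
    "my-project.my_new_dataset.mytable$20200727",
    location="EU",  # region of the destination dataset
    job_config=job_config,
)
load_job.result()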

Multiple Parquet files while writing to a Hive table (incremental)

I have a Hive table that's partitioned:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first load is done from Oracle to Hive via PySpark using:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates the partitions dynamically during the run. Now, loading data incrementally every day creates an individual file per record under the partition:
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a way to have the values appended to the existing parquet file under the partition until it reaches the block size, without smaller files being created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for Hive:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
This still doesn't help with daily inserts. Any alternative approach would be really helpful.
As far as I know, we can't keep a single file for the daily partition data, since the data will be stored in separate part files for each daily insert into the partition.
Since you mention that you are importing the data from an Oracle DB, you can import the entire data set each time and overwrite it into HDFS. This way you can maintain a single part file (a rough sketch of this follows).
Also, HDFS is not recommended for small amounts of data.
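A rough PySpark sketch of that full-reload idea (the JDBC options and the repartitioning step are assumptions, not part of the original answer):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .getOrCreate())

# Pull the full CUSTOMER table from Oracle (connection details are placeholders).
customer = (spark.read.format("jdbc")
            .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
            .option("dbtable", "CUSTOMER")
            .option("user", "user")
            .option("password", "password")
            .load())

# Shuffle by the partition key so each CUSTOMER_ID is written by a single task,
# then overwrite the Hive table in one shot.
(customer.select("NAME", "AGE", "YEAR", "CUSTOMER_ID")
    .repartition("CUSTOMER_ID")
    .write
    .insertInto("CUSTOMER_PART", overwrite=True))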
I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading the incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with the entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table by selecting from the temp_CUSTOMER_PART table.
In this case you end up with a final table that has no small files in it.
NOTE: you need to make sure that no new data is being inserted into the CUSTOMER_PART table after the temp table has been created.
Approach 2:
Using the input_file_name() function:
Check how many distinct filenames there are in each partition, then select only the partitions that have more than, say, 10 files (a sketch of this check follows after Approach 3).
Create a temporary table with these partitions and overwrite only the selected partitions in the final table.
NOTE: you need to make sure that no new data is being inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
Approach 3:
Hive (not Spark) allows overwriting and selecting from the same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any job that writes to the table will have to wait.
In addition: the ORC format offers CONCATENATE, which merges small ORC files into a new, larger file:
alter table <db_name>.<orc_table_name> [partition (partition_column='val')] concatenate;
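For the file-count check in Approach 2, a rough PySpark sketch (the threshold of 10 files is arbitrary):
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, countDistinct

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Count the distinct part files per partition and keep the partitions worth compacting.
file_counts = (spark.table("CUSTOMER_PART")
               .withColumn("file_name", input_file_name())
               .groupBy("CUSTOMER_ID")
               .agg(countDistinct("file_name").alias("num_files"))
               .filter("num_files > 10"))

fragmented_ids = [row["CUSTOMER_ID"] for row in file_counts.collect()]
print(fragmented_ids)  # partitions to rewrite into the temporary table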

How to append data to an existing partition in BigQuery table

We can create partitioning on a BigQuery table while creating the table.
I have some questions on the partition.
How to append data to an existing partition in the BigQuery table?
How to create a new partition in an existing BigQuery table if there is already a partition present in that BigQuery table?
How to truncate and load data to a partition in the BigQuery table (overwrite data in a partition in the BigQuery table)?
How to append data to an existing partition in the BigQuery table?
Whether you do this from the Web UI, with the API, or with any client of your choice, the approach is the same: you set your destination table with the respective partition decorator, for example:
yourProject.yourDataset.yourTable$20171010
Please note: to append your data, you need to use "Append to table" for the write preference.
How to create a new partition in an existing BigQuery table if there is already a partition present in that BigQuery table?
If the partition you set in the decorator of the destination table does not exist yet, it will be created for you.
How to truncate and load data to a partition in the BigQuery table (overwrite data in a partition in the BigQuery table)?
To truncate and load a specific partition, you should use "Overwrite table" for the write preference.
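The same thing with the Python client, for example (the project, dataset, table, and bucket names are placeholders): the partition decorator names the target partition, and the write disposition decides between appending to it and overwriting it.
from google.cloud import bigquery

client = bigquery.Client()

# $20171010 targets a single daily partition; the write disposition decides
# whether the load appends to that partition or replaces it.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # or WRITE_TRUNCATE to overwrite
)
load_job = client.load_table_from_uri(
    "gs://your-bucket/data.csv",
    "yourProject.yourDataset.yourTable$20171010",
    job_config=job_config,
)
load_job.result()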