How to update table in Athena - sql

I am creating a table 'A' from another table 'B' in Athena using a CREATE TABLE query. However, table 'B' is updated with new rows every hour. I want to know how I can update table A's data without dropping table A and creating it again.
I tried dropping the table and creating it again, but that seems to cause a performance issue, since a new table is created every time. I want to insert into table A only the new rows that get added to table B.

Amazon Athena is a query engine, not a database.
When a query runs on a table, Athena uses the location of a table to determine where the data is stored in an Amazon S3 bucket. It then reads all files in that location (including sub-directories) and runs the query on that data.
Therefore, the easiest way to add data to Amazon Athena tables is to create additional files in that location in Amazon S3. The next time Athena runs a query, those files will be included as part of the referenced table. Even running the INSERT INTO command creates new files in that location. ("Each INSERT operation creates a new file, rather than appending to an existing file.")
If you wish to copy data from Table-B to Table-A, and you know a way to identify which rows to add (e.g. there is a column with a timestamp), you could use something like:
INSERT INTO table_a
SELECT * FROM table_b
WHERE timestamp_field > (SELECT MAX(timestamp_field) FROM table_a)

Related

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads data onto its own disks or into memory, Athena works by scanning data in S3. This is how you enjoy the scale and low cost of the service.
What this means is that your data has to live in different folders with a meaningful structure, such as year=2019, year=2020, and you must make sure that the data for each year is all, and only, in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that copies the data and creates a new table that can be optimized for your analytical queries. You can choose the file format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example), as in the sketch below.
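A minimal CTAS sketch, assuming hypothetical column names (some_col, other_col); note that in Athena CTAS the partition columns must come last in the SELECT list:
CREATE TABLE table1_partitioned
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['year']  -- folders will be created as year=2019, year=2020, ...
) AS
SELECT some_col, other_col, year
FROM table1;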

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs (a sketch follows this list).
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
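A minimal sketch of the SQL-query option, assuming hypothetical column names old_name and new_name; write the result back over the table via the destination-table settings described further below:
SELECT * EXCEPT (old_name),
       old_name AS new_name
FROM `project.dataset.table`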
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove, and use the query result to overwrite the table or to create a new destination table (see the sketch after this list)
Alternatively, export your table data to Cloud Storage, delete the data corresponding to the column (or columns) you want to remove, and then load the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table
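A minimal sketch of the first option, assuming a hypothetical column name column_to_drop; run it with a destination table set (as in the console steps below) to overwrite the table or write a new one:
SELECT * EXCEPT (column_to_drop)
FROM `project.dataset.table`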
There is a guide published for Manually Changing Table Schemas.
edit
In order to change a partitioned table to a non-partitioned table, you can use the Console to query your data and overwrite your current table or copy it to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above code, you will query the data across all the table's partitions and create an extra column showing which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table:
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and write your new table's name > under "Destination table write preference" check "Write if empty".
Overwriting the current table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > write the same table name as the one you want to overwrite > under "Destination table write preference" check "Overwrite table".

Are new records added into Hive table (ORC format) appended into the last stripe or a brand new stripe?

Let's say I created a Hive table in ORC format and inserted 1M records into the table, which created a file with 17 stripes. The last stripe is not full.
Then I inserted another 100 records into this table. Will the new 100 records be appended to the last stripe, or will a new stripe be created?
I have tried to test it on an HDFS cluster; it seems that every time we insert new records, a new file is created (and of course, new stripes are created too). I was wondering why?
The reason is that HDFS doesn't support editing existing files.
So whenever we insert data into a Hive table, new files are created.
If you want to merge these files, you can use CONCATENATE:
ALTER TABLE <table_name> CONCATENATE;
(or)
You can INSERT OVERWRITE the same table that you are selecting from, to merge all the small files into bigger ones:
insert overwrite <db_table>.<table1> select * from <db_table>.<table1>
You can also use SORT BY and DISTRIBUTE BY to control the number of files created in the HDFS directory, as in the sketch below.
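A minimal Hive sketch, assuming hypothetical table and column names; DISTRIBUTE BY sends all rows with the same year to the same reducer, and each reducer writes its own output file, which gives you control over how many files are produced:
-- Hypothetical names; rewrites the table while clustering rows by year.
INSERT OVERWRITE TABLE mydb.mytable
SELECT * FROM mydb.mytable
DISTRIBUTE BY year
SORT BY year;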

Deduplication on Amazon Athena

We have streaming applications storing data on S3. The S3 partitions might have duplicated records. We query the data in S3 through Athena.
Is there a way to remove duplicates from S3 files so that we don't get them while querying from Athena?
You can write a small bash script that executes a Hive/Spark/Presto query to read the data, remove the duplicates, and then write the result back to S3.
I don't use Athena, but since it is just Presto, I will assume you can do whatever can be done in Presto.
The bash script does the following :
Read the data, apply a DISTINCT filter (or whatever logic you want to apply), and then insert it into another location.
For example:
CREATE TABLE mydb.newTable AS
SELECT DISTINCT *
FROM hive.schema.myTable
If it is a recurring task, then INSERT OVERWRITE would be better.
Don't forget to set the location of the Hive db to easily identify the data destination (see the sketch after these steps).
Syntax Reference : https://prestodb.io/docs/current/sql/create-table.html
Remove the old data directory using the aws s3 CLI.
Move the new data to the old directory.
Now you can safely read the same table but the records would be distinct.
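A minimal Presto sketch of setting that location up front, assuming a hypothetical bucket path; creating the schema with an explicit location means new tables' files land under a known S3 prefix, which makes the remove/move steps above straightforward:
-- Hypothetical bucket/prefix; tables created in mydb will be stored under it.
CREATE SCHEMA hive.mydb
WITH (location = 's3://my-bucket/mydb/');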
Please use CTAS:
CREATE TABLE new_table
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;
Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We cannot remove duplicates in Athena directly, as it operates on files; there are workarounds.
Somehow the duplicate records have to be deleted from the files in S3; the easiest way would be a shell script.
Or
Write a SELECT query with DISTINCT.
Note: both are costly operations.
Athena can create an EXTERNAL TABLE on data stored in S3. If you want to modify the existing data, then use Hive:
Create a table in Hive, then run:
INSERT OVERWRITE TABLE new_table_name SELECT DISTINCT * FROM old_table;

Hive doesn't identify manually created folders as partitions

I conducted an experiment. I have an external table, partitioned by year, month, day, and hour. If I use INSERT OVERWRITE and specify a certain partition for the data to go to, it ends up creating the appropriate folder structure, e.g.
INSERT OVERWRITE TABLE default.testtable PARTITION(year = 2016, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM (select 'Test' as c1) as tbl;
This table has only one string column, but that's not very important.
So the above statement creates the appropriate folder structure. But if I try to manually create a similar structure and run a SELECT query, Hive doesn't return the data in the manually created folders. In terms of structure, I made sure the manually created folders look exactly the same as the auto-created ones, with a 0-size file at every level of the hierarchy. Is it because, whenever we insert data into a specific partition, Hive creates that partition (if it doesn't exist) and also stores the partition information in its metastore? Because that is the only thing that would be bypassed if I created the folder structure manually.
I just figured out that merely creating a folder manually won't make Hive start treating it as a partition. I have to force Hive to treat it as a partition using an ALTER TABLE ADD PARTITION statement:
ALTER TABLE default.testtable ADD IF NOT EXISTS PARTITION (year = 2016, month = 7, day=29, hour = 18);
After this, if I run a SELECT statement on the table, I am able to see the manually created data at that folder location.