Deduplication on Amazon Athena - hive

We have streaming applications storing data on S3. The S3 partitions might have duplicated records. We query the data in S3 through Athena.
Is there a way to remove duplicates from S3 files so that we don't get them while querying from Athena?

You can write a small bash script that runs a Hive/Spark/Presto query to read the data, remove the duplicates, and write the result back to S3.
I don't use Athena, but since it is essentially Presto I will assume it can do whatever Presto can do.
The bash script does the following:
Read the data, apply a DISTINCT filter (or whatever logic you want to apply), and insert the result into another location.
For example:
CREATE TABLE mydb.newTable AS
SELECT DISTINCT *
FROM hive.schema.myTable
If it is a recurring task, INSERT OVERWRITE would be better.
Don't forget to set the location of the Hive database so that you can easily identify the data destination.
Syntax Reference : https://prestodb.io/docs/current/sql/create-table.html
Remove the old data directory using the aws s3 CLI.
Move the new data to the old directory.
Now you can safely read the same table, and the records will be distinct.
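The copy-then-swap idea in the steps above can be tried out locally. The following is only a minimal sketch: an in-memory SQLite database stands in for the Presto/Hive table, and dropping and renaming a table stands in for removing the old S3 directory and moving the new data into place. The table name events and the helper dedupe_by_swap are hypothetical.

```python
import sqlite3


def dedupe_by_swap(conn, table):
    """Copy the distinct rows into a new table, then swap it in for the old one."""
    tmp = f"{table}_dedup"
    # Step 1: write the distinct rows to a separate table (the CTAS step).
    conn.execute(f'CREATE TABLE "{tmp}" AS SELECT DISTINCT * FROM "{table}"')
    # Steps 2 and 3: remove the old data, then move the new data into its place.
    conn.execute(f'DROP TABLE "{table}"')
    conn.execute(f'ALTER TABLE "{tmp}" RENAME TO "{table}"')
    conn.commit()


# SQLite is an assumption made only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (3, "c")],
)

dedupe_by_swap(conn, "events")
rows = conn.execute("SELECT id, payload FROM events ORDER BY id").fetchall()
print(rows)
```

Note that SELECT DISTINCT * only collapses rows that are identical in every column; records that differ in any field are kept as separate rows.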

Please use CTAS:
CREATE TABLE new_table
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;
Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html

Athena cannot remove duplicates in place because it only reads the files in S3, but there are workarounds.
The duplicate records have to be deleted from the files in S3 somehow; the easiest way would be a shell script.
Or
Write a SELECT query with the DISTINCT option.
Note: Both are costly operations.

With Athena you can create an EXTERNAL TABLE on data stored in S3. If you want to modify existing data, use Hive.
Create a table in Hive, then overwrite it with the distinct rows:
INSERT OVERWRITE TABLE new_table_name SELECT DISTINCT * FROM old_table;

Related

How to update table in Athena

I am creating a table 'A' from another table 'B' in Athena using a CREATE query. However, table 'B' is updated with new rows every hour. I want to know how I can update the data in table A without dropping table A and creating it again.
I tried dropping the table and creating it again, but that seems to cause a performance issue because a new table is created every time. I want to insert into table A only the new rows that were added to table B.
Amazon Athena is a query engine, not a database.
When a query runs on a table, Athena uses the location of a table to determine where the data is stored in an Amazon S3 bucket. It then reads all files in that location (including sub-directories) and runs the query on that data.
Therefore, the easiest way to add data to Amazon Athena tables is to create additional files in that location in Amazon S3. The next time Athena runs a query, those files will be included as part of the referenced table. Even running the INSERT INTO command creates new files in that location. ("Each INSERT operation creates a new file, rather than appending to an existing file.")
If you wish to copy data from Table-B to Table-A, and you know a way to identify which rows to add (eg there is a column with a timestamp), you could use something like:
INSERT INTO table_a
SELECT * FROM table_b
WHERE timestamp_field > (SELECT MAX(timestamp_field) FROM table_a)
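The incremental-copy pattern can be checked locally before running it against real data. This is a hedged sketch that uses an in-memory SQLite database in place of Athena, with integer timestamps and a hypothetical sync helper. One addition that is not in the answer: when table_a is empty, MAX returns NULL and the comparison matches no rows, so the sketch wraps it in COALESCE with a sentinel of -1, which assumes timestamps are never negative.

```python
import sqlite3

# SQLite stands in for Athena here; table and column names follow the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_b (id INTEGER, timestamp_field INTEGER)")
conn.execute("CREATE TABLE table_a (id INTEGER, timestamp_field INTEGER)")

# COALESCE(..., -1) is an assumption: it handles the first run, when table_a
# is still empty and MAX(timestamp_field) would otherwise be NULL.
INCREMENTAL_INSERT = """
INSERT INTO table_a
SELECT * FROM table_b
WHERE timestamp_field > (SELECT COALESCE(MAX(timestamp_field), -1) FROM table_a)
"""


def sync(conn):
    """Copy rows newer than anything already in table_a; return its row count."""
    conn.execute(INCREMENTAL_INSERT)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM table_a").fetchone()[0]


conn.executemany("INSERT INTO table_b VALUES (?, ?)", [(1, 100), (2, 200)])
first = sync(conn)   # table_a was empty, so both rows are copied

conn.executemany("INSERT INTO table_b VALUES (?, ?)", [(3, 300)])
second = sync(conn)  # only the row with timestamp 300 is copied

third = sync(conn)   # nothing new in table_b, so nothing is copied
print(first, second, third)
```

Because the filter is a strict greater-than, a late-arriving row whose timestamp equals the current maximum in table_a would be skipped; whether that matters depends on how table_b is populated.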

Update table in Athena

I have a table in Athena created from S3. I wanted to update the column values using the update table command. Is the UPDATE Table command not supported in Athena?
Is there any other way to update the table ?
Thanks
Athena only supports External Tables, which are tables created on top of some data on S3. Since the S3 objects are immutable, there is no concept of UPDATE in Athena. What you can do is create a new table using CTAS or a view with the operation performed there, or maybe use Python to read the data from S3, then manipulate it and overwrite it.
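The "read, manipulate, overwrite" option can be sketched as follows. This is only an illustration under stated assumptions: a local CSV file stands in for the S3 object (no S3 calls are shown), and the file name, columns, and helpers rewrite_csv and mark_done are hypothetical. The point is that an "update" means rewriting the whole object, not changing rows in place.

```python
import csv
import os
import tempfile


def rewrite_csv(path, transform):
    """Read every row, apply transform, and replace the whole file."""
    with open(path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = [transform(dict(row)) for row in reader]
    # Write the full replacement next to the original, then swap it in,
    # so readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_path, path)


def mark_done(row):
    """Hypothetical 'UPDATE ... SET status = done WHERE status = pending'."""
    if row["status"] == "pending":
        row["status"] = "done"
    return row


# A local file is used purely as a stand-in for an object in S3.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "part-0000.csv")
with open(data_path, "w", newline="") as f:
    f.write("id,status\n1,pending\n2,shipped\n")

rewrite_csv(data_path, mark_done)

with open(data_path, newline="") as f:
    updated = list(csv.DictReader(f))
print(updated)
```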

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads the data onto its own disks or into memory, Athena scans the data where it sits in S3. This is how you get the scale and low cost of the service.
What this means is that your data has to be laid out in separate folders with a meaningful structure, such as year=2019 and year=2020, and that all of the data for each year, and only that data, must be in the matching folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
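To make the folder requirement concrete, here is a small sketch that splits rows into key=value folders on the local file system. It is not Athena code: a temporary directory stands in for the S3 prefix, CSV is used only for simplicity, and the helper write_partitioned and the sample columns are hypothetical. The layout, not the file format, is the point.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, root, partition_col):
    """Write rows under root/<partition_col>=<value>/data.csv, one folder per value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    for value, members in groups.items():
        part_dir = os.path.join(root, f"{partition_col}={value}")
        os.makedirs(part_dir, exist_ok=True)
        # The partition value is carried by the folder name,
        # so it is left out of the data file itself.
        columns = [c for c in members[0] if c != partition_col]
        with open(os.path.join(part_dir, "data.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(members)
    return sorted(os.listdir(root))


rows = [
    {"id": 1, "amount": 10, "year": 2019},
    {"id": 2, "amount": 20, "year": 2020},
    {"id": 3, "amount": 30, "year": 2019},
]
root = tempfile.mkdtemp()  # stand-in for the table's S3 location
layout = write_partitioned(rows, root, "year")
print(layout)

with open(os.path.join(root, "year=2019", "data.csv")) as f:
    lines_2019 = f.read().splitlines()
```

Each year's rows end up in exactly one folder, which is the property the answer describes as "all and only in that folder".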

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the ways I saw so far to merge multiple ORC files suggest using ALTER TABLE with CONCATENATE command.
But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on that so that my original table remains unchanged. But I can't do that either, for space and data redundancy reasons.
The thing I'm trying to achieve (ideally) is: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way that I can merge the ORCs on-the-go during the transfer process into cloud? Can this be achieved with/without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure merge task, see details here: https://stackoverflow.com/a/45266244/2700344
Alternatively, you can force a single reducer. This method works well for files that are not too big. You can overwrite the same table with ORDER BY, which forces a single reducer on the final ORDER BY stage. This will be slow, or may even fail, with big files because all the data passes through a single reducer:
INSERT OVERWRITE TABLE my_table -- my_table is a placeholder for your table name
SELECT * FROM my_table
ORDER BY some_col; --this will force single reducer
As a side effect, you will get a better-packed ORC file with an efficient index on the columns listed in ORDER BY.

Trying to copy data from Impala Parquet table to a non-parquet table

I am moving data around within Impala, not my design, and I have lost some data. I need to copy the data from the parquet tables back to their original non-parquet tables. Originally, the developers had done this with a simple one liner in a script. Since I don't know anything about databases and especially about Impala I was hoping you could help me out. This is the one line that is used to translate to a parquet table that I need to be reversed.
impalaShell -i <ipaddr> use db INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET TABLE;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.
Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your <text_table> does not exist, or use a table name that does not already exist, so that you do not accidentally overwrite other data.