Update table in Athena - sql

I have a table in Athena created from S3. I wanted to update the column values using the update table command. Is the UPDATE Table command not supported in Athena?
Is there any other way to update the table ?
Thanks

Athena only supports External Tables, which are tables created on top of some data on S3. Since the S3 objects are immutable, there is no concept of UPDATE in Athena. What you can do is create a new table using CTAS or a view with the operation performed there, or maybe use Python to read the data from S3, then manipulate it and overwrite it.
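The last option (read the data, change it, overwrite it) might look roughly like the sketch below. It assumes the table's data is CSV with a header row; `rewrite_csv` and the sample column names are hypothetical. Downloading the object from S3 and uploading the result (for example with boto3) is left out so the example stays self-contained.

```python
import csv
import io


def rewrite_csv(text, column, transform):
    """Return a new CSV string with `transform` applied to one column.

    S3 objects cannot be edited in place, so the "update" really means
    building a replacement object and overwriting the old one.
    """
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = transform(row[column])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data standing in for the contents of one S3 object.
original = "id,status\n1,new\n2,new\n"
updated = rewrite_csv(original, "status", lambda value: value.upper())
print(updated)
```

Once the rewritten object replaces the original under the same key, Athena picks up the new values on the next query, since it reads the files at query time.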

Related

Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR?

Are two Hive tables (native, external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table with CTAS (CREATE TABLE AS SELECT) from a Hive external table that is mapped to a DynamoDB table. Read queries against the external table are slow and consume the DynamoDB read throughput, whereas queries against the native table are fast and consume no read throughput.
My questions:
1. Is this standard or best practice, i.e. create an external table mapped to a DynamoDB table, then create a native table from it with CTAS and run all read queries against the CTAS table?
2. Where and how do DynamoDB GSIs come into the picture on the Hive side? Out of curiosity I tried mapping a column of my external Hive table to a DynamoDB GSI and, somewhat expectedly, got NULLs.
So, back to question 2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no.
However, from my observation, if a Hive native table is populated (via CTAS) from a Hive external table that references a DynamoDB table, then querying the native table from EMR does not count against the DynamoDB read throughput. You do need to take into account the periodic refresh of the native table's data.

AWS S3 - Inserting into bucketed ORC table

I'm looking at storing data in S3 in ORC format for querying with Athena.
I want to partition the data like so ...
.../year=2019/month=7/
... and bucket the data further by id (each id will have multiple records for each month, and there are lots of ids).
I want to be able to insert new data into this structure daily... I understand that I can't use the INSERT INTO statement from Athena because bucketed tables are not supported.
What would be the best way to insert data daily into a table of this structure? Is it even possible to do with bucketed data?
Cheers
Presto has allowed inserts into existing partitions of bucketed partitioned tables since Presto 312. If Athena does not support this, you can very easily run a Presto cluster yourself, e.g. using the Starburst Presto AWS integration. (I can recommend this for other reasons too, as it can be much cheaper than Athena if you run more than just a few queries. Disclaimer: I'm from Starburst.)

How does overwrite existing insert mode work in redshiftcopyactivity for aws data pipeline

I am new to AWS Data Pipeline. We have a use case where we copy updated data into Redshift. I wanted to know whether I can use the OVERWRITE_EXISTING insert mode for RedshiftCopyActivity. Also, please explain the internal working of OVERWRITE_EXISTING.
Data Pipeline can be used to move data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or merge data into an existing table.
"OVERWRITE_EXISTING" overwrites the data that already exists in the destination table, but it requires a unique identifier (primary key) on the table in the Redshift cluster.
You can use "TRUNCATE" if you don't want to change your table structure by adding a primary key.
More details are in the documentation: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
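To make the row-level effect of a key-based overwrite concrete, here is a small sketch using Python's built-in sqlite3 module. It is only an illustration of the semantics described above (rows with a matching primary key are replaced, other rows are added) and not a description of how Data Pipeline implements it internally. The table names `target` and `staging`, the columns, and the function `overwrite_existing` are all hypothetical.

```python
import sqlite3


def overwrite_existing(conn, rows):
    """Merge `rows` into `target`: replace rows whose key matches, add the rest."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("CREATE TEMP TABLE staging (id INTEGER, val TEXT)")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        # Remove the existing rows that the incoming data will replace.
        conn.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
        conn.execute("INSERT INTO target SELECT id, val FROM staging")
        conn.execute("DROP TABLE staging")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "keep")])
conn.commit()

overwrite_existing(conn, [(1, "new"), (3, "added")])
result = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(result)
```

Row 1 is overwritten, row 2 is left alone, and row 3 is added. Without a key to match on there is no way to tell which existing rows an incoming row should replace, which is why the primary key is required.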

unioning tables from ec2 with aws glue

I have two MySQL databases, each on its own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog, in a database called db1. Then I use AWS Glue to copy the tables from the EC2 instances into an S3 bucket, and I query the tables with Redshift. I get the external schema into Redshift from the crawler's catalog using the script below in the query editor. I would like to union the two tables into one table and add a column 'source' with a flag indicating which original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is an ETL pipeline that does this before the data gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';
You can create a view that unions the two tables using Athena; that view will then be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3 FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3 FROM db1.mysql_table_2
;
Run the above using Athena (not Redshift).
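The view above does not yet carry the 'source' flag the question asks for. One way to get it is to add a literal column to each SELECT. The sketch below generates such a statement in Python; `build_union_view` is a hypothetical helper, and the labels `db_a` and `db_b` as well as the column names are placeholders. Names are interpolated directly into the SQL, so only trusted values should be passed in.

```python
def build_union_view(view_name, tables, columns):
    """Return CREATE VIEW SQL that unions `tables` and tags each row's origin.

    `tables` maps a source label to a fully qualified table name.
    """
    col_list = ", ".join(columns)
    selects = [
        f"SELECT {col_list}, '{label}' AS source FROM {table}"
        for label, table in tables.items()
    ]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body}"


sql = build_union_view(
    "db1.combined_view",
    {"db_a": "db1.mysql_table_1", "db_b": "db1.mysql_table_2"},
    ["col1", "col2", "col3"],
)
print(sql)
```

The generated statement can then be run in Athena in the same way as the hand-written view.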

Deduplication on Amazon Athena

We have streaming applications storing data on S3. The S3 partitions might have duplicated records. We query the data in S3 through Athena.
Is there a way to remove duplicates from S3 files so that we don't get them while querying from Athena?
You can write a small bash script that executes a Hive/Spark/Presto query to read the data, remove the duplicates and write the result back to S3.
I don't use Athena, but since it is based on Presto I will assume you can do whatever can be done in Presto.
The bash script does the following:
Read the data, apply a distinct filter (or whatever logic you want to apply) and insert the result into another location.
For example:
CREATE TABLE mydb.newTable AS
SELECT DISTINCT *
FROM hive.schema.myTable
If it is a recurring task, then INSERT OVERWRITE would be better.
Don't forget to set the location of the Hive database so you can easily identify where the data is written.
Syntax Reference : https://prestodb.io/docs/current/sql/create-table.html
Remove the old data directory using the aws s3 CLI.
Move the new data to the old directory.
Now you can safely read the same table, and the records will be distinct.
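For small partitions, the "read, apply a distinct filter, write back" step can also be done in plain Python instead of a Hive/Spark/Presto query. The following is a minimal sketch, assuming the files are newline-delimited JSON and that one partition fits in memory; `dedupe_json_lines` and the sample records are hypothetical, and the S3 download and upload are omitted.

```python
import json


def dedupe_json_lines(lines):
    """Return the input lines without exact duplicate records, keeping first occurrences."""
    seen = set()
    unique = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Canonical form, so a different key order does not hide a duplicate.
        key = json.dumps(json.loads(line), sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# Hypothetical records; the second is the same record as the first.
records = [
    '{"id": 1, "event": "click"}',
    '{"event": "click", "id": 1}',
    '{"id": 2, "event": "view"}',
]
deduped = dedupe_json_lines(records)
print(deduped)
```

This treats two records as duplicates only when every field matches, which corresponds to SELECT DISTINCT *. Deduplicating on a subset of fields would need a different key.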
Please use CTAS:
CREATE TABLE new_table
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;
Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
Athena cannot remove the duplicates itself because it only reads the files in S3, but there are workarounds.
The duplicate records have to be deleted from the files in S3 somehow; the easiest way would be a shell script.
Or
Write the SELECT query with the DISTINCT option.
Note: Both are costly operations.
With Athena you can create an EXTERNAL TABLE on data stored in S3. If you want to modify existing data, use Hive.
Create a table in Hive, then run:
INSERT OVERWRITE TABLE new_table_name SELECT DISTINCT * FROM old_table;