Handling column names that start with a number in Amazon Athena - sql

I loaded a JSON file into an S3 location in which one of the keys starts with a number (3party_count). I created a table in Amazon Athena on top of this location using a crawler in AWS Glue, so a column named 3party_count was created.
But I can't run a SELECT query using this column.
Error: InvalidRequestException.
Can anyone help me with this?

Use double quotes:
CREATE OR REPLACE VIEW "123view" AS
SELECT column_name1, column_name2
FROM "234table"
Names for Tables, Databases, and Columns AWS Athena
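Applied to the original question, the column name itself also needs to be double-quoted in the SELECT, because the identifier starts with a digit. A minimal sketch, assuming the Glue crawler named the table my_table (substitute your actual table name):

```sql
-- "3party_count" must be double-quoted because it starts with a digit;
-- my_table is a placeholder for the table the crawler created.
SELECT "3party_count"
FROM my_table
LIMIT 10;
```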

Related

Drop AWS Athena table with `.` in the name

I had a client upload a malformed table with a name like foo.bar into an Athena instance. What syntax can I use to drop the table? If I try
drop table if exists `foo.bar`
The command silently fails, presumably because the parser interprets foo as the database name. If I try adding the database name explicitly as
drop table if exists dbname."foo.bar"
or
drop table if exists dbname.`foo.bar`
I get a parse error from Athena.
Unfortunately, I don't have access to the Glue console to remove the table from there so I was wondering if it's possible to drop such a table via Athena SQL. Thanks!
Even if you don't have access to the Glue console, you can use the AWS CLI to delete the table directly through the Glue API:
aws glue delete-table --database-name dbname --name foo.bar

How to update column in AWS Athena Table

I have a table in Athena with the following columns.
Describe my_table
row_id
icd9_code
linksto
The column icd9_code is empty, with an int data type. I want to insert some integer values into the icd9_code column of my table named my_table.
Those integer values are stored in an Excel sheet on my local PC. Does AWS Athena provide some way to do this?
Amazon Athena is primarily designed to run SQL queries across data stored in Amazon S3. It is not able to access data stored in Microsoft Excel files, nor is it able to access files stored on your computer.
To update a particular column of data for existing rows of data, you would need to modify the files in Amazon S3 that contain those rows of data.
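One hedged workaround, assuming the Excel sheet can be exported to CSV and uploaded to S3: define an external table over the CSV, then use CTAS to rewrite the table with the joined-in values (the lookup table name, bucket path, and join key below are illustrative, not from the original question):

```sql
-- External table over the CSV exported from Excel (path is a placeholder).
CREATE EXTERNAL TABLE icd9_lookup (
  row_id INT,
  icd9_code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/icd9-lookup/';

-- Athena cannot UPDATE rows in place, so write a new table instead,
-- pulling icd9_code from the lookup table by row_id.
CREATE TABLE my_table_updated AS
SELECT t.row_id,
       l.icd9_code,
       t.linksto
FROM my_table t
LEFT JOIN icd9_lookup l ON t.row_id = l.row_id;
```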

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns - id, name, position -
then I stored the data in S3 in ORC format using Spark.
When I query select * from person it returns everything.
But when I query from presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I found the answer to the problem: when I stored the data in S3, the file contained one more column than was defined in the Hive table metastore.
So when Presto tried to query the data, it found a varchar where an integer was expected.
This can also happen if one record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.
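In other words, the table definition must match the columns actually written to the ORC files, in order and type. A sketch of a matching external table, with a placeholder S3 location standing in for wherever the Spark job wrote the data:

```sql
-- The column list must match the ORC file schema exactly;
-- the LOCATION path is a placeholder.
CREATE EXTERNAL TABLE person (
  id INT,
  name STRING,
  position STRING
)
STORED AS ORC
LOCATION 's3://my-bucket/person/';
```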

Hive partitioning for data on s3

Our data is stored using the layout s3://bucket/YYYY/MM/DD/HH, and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data using AWS Athena just fine, however our Hive query cluster has trouble querying the data when partitioning is enabled.
This is what I am doing :
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
This doesn't seem to work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH,
however it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.
Given the rigid bucket paths produced by Firehose, I cannot modify the S3 paths. So my question is: what is the right partitioning scheme in the Hive DDL when you don't have explicitly named columns in your data path, like year= or month=?
Now you can specify an S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
If you can't obtain folder names that follow the Hive naming convention, you will need to map all the partitions manually:
ALTER TABLE tableName ADD PARTITION (year='YYYY') LOCATION 's3://bucket/YYYY'
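Since the path encodes all four levels, each partition needs values for all four keys, and several partitions can be registered in one statement. A sketch with placeholder dates and paths:

```sql
-- Each PARTITION clause maps one hour's folder; dates/paths are examples.
ALTER TABLE tableName ADD
  PARTITION (year='2018', month='09', day='19', hour='00')
    LOCATION 's3://bucket/2018/09/19/00/'
  PARTITION (year='2018', month='09', day='19', hour='01')
    LOCATION 's3://bucket/2018/09/19/01/';
```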

Deduplication on Amazon Athena

We have streaming applications storing data on S3. The S3 partitions might have duplicated records. We query the data in S3 through Athena.
Is there a way to remove duplicates from S3 files so that we don't get them while querying from Athena?
You can write a small bash script that executes a Hive/Spark/Presto query to read the data, remove the duplicates, and then write the result back to S3.
I don't use Athena, but since it is just Presto I will assume you can do whatever can be done in Presto.
The bash script does the following :
Read the data and apply a distinct filter (or whatever logic you want to apply) and then insert it to another location.
For Example :
CREATE TABLE mydb.newTable AS
SELECT DISTINCT *
FROM hive.schema.myTable
If it is a recurring task, then INSERT OVERWRITE would be better.
Don't forget to set the location of the hive db to easily identify the data destination.
Syntax Reference : https://prestodb.io/docs/current/sql/create-table.html
Remove the old data directory using aws s3 CLI command.
Move the new data to the old directory
Now you can safely read the same table but the records would be distinct.
Please use CTAS:
CREATE TABLE new_table
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;
Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
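If the rows are only partially duplicated (e.g. the same key arrives more than once with different timestamps), SELECT DISTINCT won't collapse them; a window function can keep one row per key instead. A sketch, assuming hypothetical id, payload, and updated_at columns:

```sql
-- Keep only the most recent row per id; column names are assumptions,
-- not from the original question.
CREATE TABLE deduped_table
WITH (
  format = 'Parquet',
  parquet_compression = 'SNAPPY')
AS
SELECT id, payload, updated_at
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM old_table
) AS t
WHERE rn = 1;
```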
We cannot remove duplicates within Athena itself, since it works directly on the files in S3, but there are workarounds.
Somehow the duplicate records have to be deleted from the files in S3; the easiest way would be a shell script.
Or:
Write a SELECT query with the DISTINCT option.
Note: both are costly operations.
Athena can create an EXTERNAL TABLE on data stored in S3. If you want to modify the existing data, use Hive.
Create a table in Hive, then:
INSERT OVERWRITE TABLE new_table_name SELECT DISTINCT * FROM old_table;