Query a file in S3 from Athena - amazon-s3

I am a complete newbie at this. I have a file in Amazon S3.
How can I query this .tar.gz from Athena?
I am assuming I have to somehow decompress it and ‘restore’ it into Athena? But I do not know how to do it.

You can directly query files in AWS Athena that are in .gz format, as well as any flat files. If your tar file contains multiple .gz files of the same format, you don't need to gunzip them to .tsv.
Since you have already converted to .tsv files, make sure files of the same format are put into one folder, e.g.
s3://bucketname/folder/file1.gz
s3://bucketname/folder/file2.gz
etc. file1 and file2 should have the same structure.
Then define your AWS Athena table on top of this. Sample script below -
CREATE EXTERNAL TABLE table_name (
  yr INT,
  quarter INT,
  month INT,
  dayofmonth INT,
  dayofweek INT,
  flightdate STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/folder/';
Keeping homogeneous files is not mandatory but recommended, so that you can add or remove files under the same folder and just update the partition information every time there is a change.
Run MSCK REPAIR TABLE to refresh the partition metadata each time a new partition is added to this table.
MSCK REPAIR TABLE table_name;
Reference - https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html#tsv-example
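If MSCK REPAIR TABLE gets slow or you prefer to register partitions explicitly, partitions can also be added one at a time; a minimal sketch, assuming the data for each year sits under a year=YYYY prefix beneath the table location (bucket, folder, and year value are illustrative):
-- Register a single partition explicitly instead of scanning the whole location
ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (year = '2016')
  LOCATION 's3://bucketname/folder/year=2016/';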

You can't query tarballs. Athena requires gzipped or uncompressed text files. Other options are ORC or Parquet files. You will need to untar the file and create a gzip file containing just the .txt.

Related

Specify compression type with Athena

I have S3 data that has GZIP compression. I'm trying to create a table in Athena using this file, and my CREATE TABLE statement succeeds, but when I query the table all rows are empty.
create external table mydatabase.table1 (
date date,
week_begin_date date,
week_end_date date,
value float
)
row format delimited fields terminated by ','
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://my-bucket/some/path/'
How can I insist that Athena read my files as GZIP?
While Athena supports TBLPROPERTIES metadata (you can set properties in CREATE TABLE or ALTER TABLE, and display them with SHOW TBLPROPERTIES), it does not respect the TBLPROPERTIES ('compressionType'='gzip') option.
There's no apparent way to force a compression / decompression algorithm. Athena attempts to identify compression based on the file extension. A GZIP file with a .gz suffix will be readable; a GZIP file without that suffix will not.
Similarly, an uncompressed file with a .gz suffix will fail. The reported error is
HIVE_CURSOR_ERROR: incorrect header check
Some investigation revealed the following:
The only known way to have Athena recognize a file as GZIP is to name it with a .gz suffix.
Other similar suffixes that do not work include .gzip, .zip, and [^.]gz.
GZIP and uncompressed files can live happily side by side in an Athena table or partition - the compression detection is done at the file level, not at the table level.
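In other words, once the objects themselves end in .gz, a plain table definition is enough; a minimal sketch reusing the column names from the question (database, table name, bucket, and path are illustrative), with no compression-related clause at all:
-- Athena detects GZIP from the .gz object suffix; nothing in the DDL refers to compression
create external table mydatabase.table1 (
  `date` date,           -- quoted with backticks since "date" is also a type keyword
  week_begin_date date,
  week_end_date date,
  value float
)
row format delimited fields terminated by ','
location 's3://my-bucket/some/path/';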

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need to get specific data from a gz file.
How do I write the SQL?
Can I just run SQL against it as if it were a database table? e.g.:
Select * from gz_File_Name where key = 'keyname' limit 10.
But it always comes back with an error.
You need to create a Hive external table over this file location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use a serde
LOCATION 's3://your_s3_path_to_the_folder_where_the_file_is_located';
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 under the hood does not store folders; a filename containing slashes is represented by tools such as Hive as a folder structure. See here: https://stackoverflow.com/a/42877381/2700344
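Once the table is defined, the query from the question can be run against it; a minimal sketch, with col_one standing in for the 'key' column the question refers to (table and column names are illustrative):
-- Hive decompresses the .gz file transparently while scanning the external table
select *
from hive_schema.your_table
where col_one = 'keyname'
limit 10;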

Loading zip csv file from S3 into Hive

I have a CSV file that's zipped in S3. For unzipped files, I would use the code below. Is there an option I can add so it unzips before loading?
I'm on Hive and am using a SQL editor (DbVisualizer). I googled and saw some Unix steps, but I've never used Unix before, so I'm wondering if there is another way within the SQL.
create external table abc (
  email string,
  value int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://path/'
TBLPROPERTIES ("skip.header.line.count"="1");
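As noted in the answers above, the text input format reads gzip (.gz) files transparently, but a .zip archive is not handled out of the box, so the usual workaround is to recompress the file as gzip and point the same kind of DDL at it; a minimal sketch under that assumption (abc_gz and the folder path are illustrative, with the folder holding the recompressed .csv.gz object):
create external table abc_gz (
  email string,
  value int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- the folder below holds the recompressed .csv.gz file instead of the .zip
LOCATION 's3://path/gzipped/'
TBLPROPERTIES ("skip.header.line.count"="1");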

HiveQL Where In Clause That Points to a Set of Files

I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files, e.g. for tab-delimited files:
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids'
Now you can use Hive to access this data.
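From there, the WHERE IN logic can be expressed as a join against this table; a minimal sketch, assuming a hypothetical main_table with an id column to filter on:
-- LEFT SEMI JOIN keeps only the rows of main_table whose id appears in my_ids,
-- which is Hive's idiomatic equivalent of WHERE id IN (SELECT id FROM my_ids)
select m.*
from main_table m
left semi join my_ids i on (m.id = i.id);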

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "TableDynamoDB",
  "dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt),
but the way it is working currently is that the above script creates a folder named 'MyData.txt'
and then generates a file with some random name under this folder.
Is it at all possible to specify a file name in S3 using Hive?
Thank you!
A few things:
There are 2 different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use something like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory for the reasons above.
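So the practical fix is to point LOCATION at a prefix rather than a file name and treat whatever Hive writes under it as the output; a minimal sketch of the S3 table from the question with a directory-style location (the prefix name is illustrative):
-- LOCATION names a prefix ("folder"); Hive writes one or more part files underneath it
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData/';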