Loading zip csv file from S3 into Hive - sql

I have a csv file that's zipped in S3. For unzipped files, I would use the code below. Is there an option I can add so it unzips before loading?
I'm on Hive and am using a SQL editor (DbVisualizer). I googled and saw some Unix steps, but I've never used Unix before, so I'm wondering if there is another way within the SQL.
create external table abc (
email string,
value int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://path/'
TBLPROPERTIES ("skip.header.line.count"="1");

Related

impala/hive show file format

How can I have impala or hive return the file format of the underlying files on HDFS for a table?
I tried:
SHOW FILES database.table_name
This lists the files, but the problem is that some people stored Parquet files as .parq and others as .parquet. Is there any way to return the file format, such that one could use it in a new create statement?
Use good old SHOW CREATE TABLE mytable.
The output clearly states the file format. It also shows the folder in which the files are stored. You should not rely on file names; let Impala decide them. Below is a sample result from Impala.
result
CREATE TABLE edh.mytable (
column1 STRING
)
STORED AS PARQUET --file format
LOCATION 's3a://cc-mys3/edh/user/hive/warehouse/edh.db/mytable' --folder location
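If you only need the format and not the whole DDL, DESCRIBE FORMATTED is another option that should work in both Hive and Impala. A minimal sketch, reusing the sample table name from above:

```sql
-- Prints table metadata, including the InputFormat, OutputFormat
-- and SerDe Library rows, which together identify the file format.
DESCRIBE FORMATTED edh.mytable;
```

Either way, the format is a property of the table, not of the file extension, so .parq and .parquet files are read the same way.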

query from athena a file in s3

I am a complete newbie at this. I have this file in Amazon S3.
How can I query this .tar.gz from Athena?
I am assuming I have to somehow decompress and ‘restore’ to ‘athena’? But I do not know how to do it.
You can directly query files in AWS Athena that are in .gz format as well as any flat files. If your tar file contains multiple .gz files and they are of the same file format then you don't need to gunzip them to .tsv.
Since you have already converted to .tsv files, make sure files of the same format are put into one folder, e.g.
s3://bucketname/folder/file1.gz
s3://bucketname/folder/file2.gz
etc. file1 and file2 should have the same structure.
Then define your AWS Athena table on top of this. Sample script below -
CREATE EXTERNAL TABLE table_name (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/folder/';
Keeping homogeneous files is not mandatory but recommended, so that you can add or remove files under the same folder and just update the partition information every time there is a change.
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.
MSCK REPAIR TABLE table_name ;
Reference - https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html#tsv-example
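One caveat: MSCK REPAIR TABLE only discovers partitions stored under Hive-style prefixes (for example a prefix named year=2015/). If your files are not laid out that way, a partition can be registered by hand instead. A rough sketch, where the year value and the S3 prefix are made-up examples and not taken from the question:

```sql
-- Register one partition explicitly (hypothetical value and prefix).
ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (year = '2015')
  LOCATION 's3://bucketname/folder/2015/';

-- Query only that partition.
SELECT *
FROM table_name
WHERE year = '2015'
LIMIT 10;
```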
You can't query tarballs. Athena requires gzipped or uncompressed text files. Other options are ORC or Parquet files. You will need to untar the file and create a gzip file with just the .txt in it.

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need to get specific data from a gz file.
How do I write the SQL?
Can I just query it like a database table?
Select * from gz_File_Name where key = 'keyname' limit 10.
But it always comes back with an error.
You need to create a Hive external table over the location (folder) of this file to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 does not store folders under the hood. File names (keys) containing slashes are presented as a folder structure by tools such as Hive. See here: https://stackoverflow.com/a/42877381/2700344
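Once the table exists, the query from the question can be pointed at the table name instead of the file name. A sketch, assuming col_one is the column that holds the key:

```sql
-- Query the external table, not the .gz file itself.
SELECT *
FROM hive_schema.your_table
WHERE col_one = 'keyname'
LIMIT 10;
```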

HiveQL Where In Clause That Points to a Set of Files

I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files. For example, for tab-delimited files:
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids';
Now you can use Hive to access this data.
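The "where in" part can then be written as a semi join against that table. A minimal sketch, where my_table and its id column are hypothetical stand-ins for the table you actually want to filter:

```sql
-- Keep only the rows of my_table whose id appears in the ID files.
SELECT t.*
FROM my_table t
LEFT SEMI JOIN my_ids i
  ON t.id = i.id;

-- Hive 0.13 and later also accept the subquery form:
-- SELECT * FROM my_table WHERE id IN (SELECT id FROM my_ids);
```

LEFT SEMI JOIN returns each matching row of my_table once, even if the same ID appears in several files.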

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt),
but currently the script above creates a folder named 'MyData.txt'
and then generates a file with a random name under this folder.
Is it at all possible to specify a file name in S3 using Hive?
Thank you!
A few things:
There are 2 different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory for the reasons above.