Amazon Athena set location to single csv file - sql

I would like to set the LOCATION value in my Athena SQL CREATE TABLE statement to a single CSV file, as I do not want to query every file in the path. I can set an S3 directory (object) path and successfully query all files in that path, but not a single file. Is setting a single file as the location supported?
Successfully queries CSV files in path:
LOCATION 's3://my_bucket/path/'
Returns zero results:
LOCATION 's3://my_bucket/path/filename.csv.gz'
Create table statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_db` (
`name` string,
`occupation` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ','
) LOCATION 's3://bucket-name/path/filename.csv.gz'
TBLPROPERTIES ('has_encrypted_data'='false');
I have read this Q&A and this one, but they don't seem to address the same issue.
Thank you.

You could try adding the path of that particular object in a WHERE condition while querying:
SELECT * FROM default.my_db
WHERE "$path" = 's3://bucket-name/path/filename.csv.gz'

Related

Create External Table pointing to S3

How do we create an external table using Snowflake SQL that points to a directory in S3? Below is the code I have tried so far, but it didn't work. Any help is highly appreciated.
create external table my_table
(
column1 varchar(4000),
column2 varchar(4000)
)
LOCATION 's3a://<externalbucket>'
Note: The file that I have in the S3 bucket is a CSV file (comma separated, enclosed in double quotes, and with a header).
You will need to update your location to be an external stage, include the file_format parameter, and include the proper expression for the columns.
The location Parameter:
Specifies the external stage where the files containing data to be read are staged.
Additionally, you'll need to define the file_format:
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html#required-parameters
So your statement should look more like this:
create external table my_table
(
column1 varchar as (value:c1::varchar),
column2 varchar as (value:c2::varchar)
)
location = @[namespace.]ext_stage_name[/path]
file_format = (type = CSV)
You may need to define additional parameters in the file format to handle your file appropriately.
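For the file described in the question (comma separated, double-quote enclosed, with a header), the inline file format could look roughly like the sketch below; the stage name my_ext_stage is a placeholder, and the option values are assumptions worth checking against the CREATE FILE FORMAT documentation:
create external table my_table
(
  column1 varchar as (value:c1::varchar),
  column2 varchar as (value:c2::varchar)
)
location = @my_ext_stage  -- placeholder external stage name
file_format = (
  type = CSV
  field_delimiter = ','
  field_optionally_enclosed_by = '"'
  skip_header = 1
);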
Finally, I sorted this out. Posting this answer to make it simple to understand, especially for beginners.
Say I have a CSV file in the S3 location in the format below.
Step 1 :
Create a file format, in which you define the type of file, the field delimiter, whether data is enclosed in double quotes, whether to skip the file header, etc.
create or replace file format schema_name.pipeformat
type = 'CSV'
field_delimiter = '|'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
skip_header = 1
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
Step 2 :
Create a Stage to specify the S3 details and file format.
create or replace stage schema_name.stage_name
url='s3://<path where file is kept>'
credentials=(aws_key_id='****' aws_secret_key='****')
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#required-parameters
Step 3 :
Create the external table based on the Stage name and file format.
create or replace external table schema_name.table_name
(
RollNumber INT as (value:c1::int),
Name varchar(20) as ( value:c2::varchar),
Marks int as (value:c3::int)
)
with location = @stage_name
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html
Step 4 :
Now you should be able to query from the external table.
select *
from schema_name.table_name

Zero results in Athena query of S3 object

I placed a comma-delimited text file in an S3 bucket. I am attempting to query the folder the file resides in, but it returns zero results.
Create table DDL:
CREATE EXTERNAL TABLE myDatabase.myTable (
`field_1` string,
`field_2` string,
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucket/files from boss/'
TBLPROPERTIES ('has_encrypted_data'='false');
The issue was the whitespace in the location:
LOCATION 's3://bucket/files from boss/'
I removed the whitespace from the folder name in S3 and I was able to query without issue:
LOCATION 's3://bucket/files_from_boss/'
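Once the table does return rows, the "$path" pseudo-column mentioned in the first answer is a quick way to verify which S3 objects Athena is actually reading for the table, for example:
-- List the distinct S3 objects backing the query results.
SELECT DISTINCT "$path"
FROM myDatabase.myTable;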

Hive select csv files only from directory

I have the following file structure
/base/{yyyy-mm-dd}/
folder1/
folderContainingCSV/
logs/
I want to load the data from my base directory for all dates, but the problem is that there are files in non-csv.gz formats in the logs/ directory. Is there a way to select only csv.gz files when querying from the base directory level?
Sample query:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION 's3://base/'
TBLPROPERTIES ('has_encrypted_data'='true');
You cannot do this at the table creation level. You need to copy all the *.gz files separately into another folder.
This can be done within the Hive script (containing the CREATE TABLE statement) itself. Just add the commands below at the beginning of the Hive script (just before the CREATE TABLE):
dfs -mkdir -p /new/path/folder
dfs -cp /regular/log/file/*.gz /new/path/folder
Now you can create the external table pointing to /new/path/folder.
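Putting it together, the Hive script might look roughly like the sketch below; the source and destination paths are placeholders (taken from the commands above), and the columns and SerDe settings are copied from the question:
-- Stage only the csv.gz files into a separate folder.
dfs -mkdir -p /new/path/folder;
dfs -cp /regular/log/file/*.csv.gz /new/path/folder;

-- Point the external table at the staging folder instead of the base directory.
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
  `col1` string,
  `col2` string,
  `col3` string,
  `col4` string,
  `col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = '|'
)
LOCATION '/new/path/folder';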

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing hive metastore?
Here's an example of my hive metastore create statement with the regex serde bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input. I also looked in both the Hive and SerDe documentation for help, but was unable to find a reference to adding to an existing store.
A possible solution using partitions:
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
.
.
.
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
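Once the partitions are added, all ten folders can be queried through the single table, and filtering on the partition column restricts a query to one folder. A small usage sketch:
-- Query across every registered folder.
SELECT * FROM test LIMIT 10;

-- Prune to the data registered under one folder.
SELECT field1, field2
FROM test
WHERE mypartcol = 'folder 1';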
For @TheProletariat (the OP):
It seems there is no need for RegexSerDe since the columns are delimited by space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
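With this layout, the first three space-separated tokens land in field1 through field3, and everything after the third space ends up in field4 because of serialization.last.column.takes.rest. Against the first sample row, a query like the sketch below should return the timestamp, the IP, the queried name, and the whole "TXT IN: NXDOMAIN/0/143" remainder as field4 (note the snippet above omits a LOCATION, which would still need to be added):
-- Expected split of the first sample row:
-- field1 -> 1488736466
-- field2 -> 199.199.199.199
-- field3 -> 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com
-- field4 -> TXT IN: NXDOMAIN/0/143
SELECT field1, field2, field3, field4
FROM test
LIMIT 10;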

How to create table using external key word in hive

I want to process data in HDFS. I am trying to create a table using the EXTERNAL keyword, but I am getting the following error. Can you please provide a solution?
hive> create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample.csv';
I am getting the following error; can you please provide a solution?
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.ipc.RemoteException java.io.FileNotFoundException: Parent path is not a directory: /home/siva/jobportal/sample.csv
Can you please confirm that this path is on HDFS?
More info on creating external tables in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
I use the following for the XML parsing SerDe in Hive:
CREATE EXTERNAL TABLE XYZ(
X STRING,
Y STRING,
Z ARRAY<STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.X"="/XX/#X",
"column.xpath.Y"="/YY/#Y"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/XXX'
TBLPROPERTIES (
"xmlinput.start"="<xml start",
"xmlinput.end"="</xml end>"
);
At the moment, Hive only allows you to set a directory as a partition location while adding a partition. What you are trying to do here is to set a file as the partition location. The workaround that I use is to first add a partition with a dummy/non-existent directory (Hive doesn't demand that the directory exist while it's being set as a partition location) and then use ALTER TABLE ... PARTITION ... SET LOCATION to change the partition location to your desired file. Surprisingly, Hive doesn't force the location to be a directory when setting the location of an existing partition the way it does when adding a new partition. So in your case, it will look like this:
alter table samplecv add partition (id='11', name='somename') location '/home/siva/jobportal/somedirectory'
alter table samplecv partition (id='11', name='somename') set location '/home/siva/jobportal/sample.csv'
Hive always expects a directory name in the LOCATION path, not a file name.
Put your file inside a directory, for example /home/siva/jobportal/sample/sample.csv, and then try running the command below to create your Hive table:
create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample';
If you still get an error, just put your file in HDFS and try again; it should work.
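For reference, a minimal sketch of copying the file into HDFS from the Hive CLI (the local source path is a placeholder; the target directory matches the example above):
-- Create the target directory in HDFS and copy the local CSV into it.
dfs -mkdir -p /home/siva/jobportal/sample;
dfs -put /local/path/sample.csv /home/siva/jobportal/sample/;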