Create table on Athena only from certain S3 files (depending on filename) - amazon-s3

I'm trying to create a table on Athena from S3 files.
In my bucket, I have different types of files (Activity, Epoch, BodyComp, etc.), and I'd like this table to contain only "Activity" files, assuming their filenames are like:
"Activity__xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx__yyyyyyyyyy.json"
where:
- x is a character or a digit
- y is a digit
I can do that after creating the table with this SELECT statement, but the query takes too much time:
SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') like 'Activity__%';
I'd like to do it directly in the CREATE TABLE statement.
I tried this with "input.regex", but it didn't work:
CREATE EXTERNAL TABLE IF NOT EXISTS runs(
  summaryId string,
  distanceInMeters float,
  maxHeartRateInBeatsPerMinute int,
  totalElevationGainInMeters float,
  userAccessToken string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  "input.regex" = "^Activity\_\_\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\_\_\d{10}\.json"
)
LOCATION 's3://com.connector/'
TBLPROPERTIES ('has_encrypted_data'='false');
I think the problem comes from the fact that "input.regex" is not the correct parameter to get the filenames.
Thank you for your help,
Max

There is no direct way of doing this ("input.regex" tells the RegexSerDe how to parse each row's contents, not which files to read). Either rename the unwanted files to start with an underscore (_), so that Athena ignores them, or use CTAS and pass it the SELECT query above.
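For the CTAS route, something like the following should work. This is a sketch: the destination table name and external_location are hypothetical placeholders, and the output location must be an empty prefix outside the source table's location.
CREATE TABLE activity_runs
WITH (
  format = 'PARQUET',
  external_location = 's3://com.connector-output/activity_runs/' -- hypothetical output location; must be empty
) AS
SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') LIKE 'Activity__%';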

Related

select row from orc snappy table in hive

I have created a table employee_orc, which is in ORC format with Snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have loaded data into the table using an INSERT statement.
The employee_orc table has 1000 records.
When I run the below query, it shows all the records:
select * from employee_orc;
But when I run the below query, it shows zero results even though the record exists:
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist. You may think they are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
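To confirm hidden whitespace, a quick diagnostic along these lines may help (a sketch; it simply compares the raw and trimmed lengths of emp_id):
-- rows that match after trim() but whose raw length differs have leading/trailing spaces
select emp_id, length(emp_id) as raw_len, length(trim(emp_id)) as trimmed_len
from employee_orc
where trim(emp_id) = 'EMP456';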
For my part, I don't understand why you want to specify a delimiter for ORC. Are you confusing CSV with ORC, or external with managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");

Is it possible to remove certain characters in a particular column while creating an Athena table?

I want to remove certain unnecessary characters in a column, so that the data can be split into an array.
The original data is in json format like this:
{
  "id": "xyz",
  "listL": "[\"N09jk\",\"KLpp1\"]",
  "timestamp": "2019-01-04 05:33:02"
}
I want to parse the listL attribute as an array like [N09jk, KLpp1].
However, given the current format, it takes the entire string as one element, like this:
[["N09jk","KLpp1"]]
I was wondering if removing the characters [, ], and " while parsing the file and then splitting into an array would work.
My create table query is:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
  `id` string,
  `listL` array<string>,
  `timestamp` timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
LOCATION 's3://path/'
TBLPROPERTIES ('has_encrypted_data'='false');
Create the table with the column listL as a string and use json_parse to parse it as an array when querying:
SELECT
id,
json_parse(listL) as listL,
timestamp
FROM table
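For reference, the adjusted table definition would declare listL as a plain string (a sketch reusing the question's path and SerDe):
CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
  `id` string,
  `listL` string, -- kept as string; parsed into an array at query time
  `timestamp` timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://path/'
TBLPROPERTIES ('has_encrypted_data'='false');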
You can also create a view so that you don't have to include the json_parse in every query:
CREATE VIEW table_with_list AS
SELECT
id,
json_parse(listL) as listL,
timestamp
FROM table
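If you need a true SQL array (for example to UNNEST or index elements), the parsed JSON can additionally be cast; a hedged variant of the same query:
SELECT
  id,
  CAST(json_parse(listL) AS ARRAY(VARCHAR)) AS listL, -- JSON value cast to a proper array of strings
  timestamp
FROM table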

Athena - Is there any way to create table pointing for specific filename format?

I'm using Athena to query data from multiple files partitioned on S3. I create the table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS testing_table (
  EventTime string,
  IpAddress string,
  Publisher string,
  Segmentname string,
  PlayDuration double,
  cost double
)
PARTITIONED BY (
  year string,
  month string,
  day string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'
LOCATION 's3://campaigns/testing/';
The location may contain multiple files with different filenames, such as "campaign_au_click.csv" and "campaign_au_impression.csv". These files may have different structures.
Is there any way for the above table to pick up data only from the click files?
Thanks
Your best bet is to partition them into different folders. Athena, like Hive, works at the folder level: any and all files in a folder will be taken in as the same schema.
The very first option should be to have those files in different folders. But given that we have this situation right now and want to query the table for specific files, there is a workaround.
You create your table over the root folder only, but while querying you add a WHERE clause on the filename. The filename is accessed through the pseudo-column "$path" (including the quotes).
For example, your query can be:
SELECT .....
FROM .....
WHERE
.....
AND
"$path" LIKE '%_click.csv'
Note: the WHERE clause above is just an example. You can explore regexp_like instead of LIKE.
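For instance, a regexp_like variant could look like this (a sketch; it reuses the question's testing_table and anchors the pattern at the end of the path):
SELECT *
FROM testing_table
WHERE regexp_like("$path", '_click\.csv$'); -- matches only files whose names end in _click.csv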

HIVE 2.1.1 Table creation CSV-Serde

So I did all the research and couldn't see the same issue anywhere in HIVE.
I followed the link below, and I have no issues with data in quotes:
https://github.com/ogrodnek/csv-serde
My external table creation has the below SerDe properties, but for some reason the default escapeChar ('\') is being replaced by the quoteChar, which is a double quote (") in my data.
CREATE EXTERNAL TABLE IF NOT EXISTS people_full(
`Unique ID` STRING,
.
.
.
.
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"escapeChar" = "\\"
)
STORED AS TEXTFILE
DATA ISSUE:
Sample HDFS Source data : "\"Robs business
Target HIVE Output : """Robs business
So the three double quotes seen in """Robs business after the replacement are causing unwanted data delimitation (the column is a very long string), maybe because HIVE cannot handle three double quotes inside data (the quote (") is also my default quote character)?
Why is this happening, and is there a solution? Please help. Many thanks.
Best,
Asha
To import your CSV file to HDFS with double quotes inside the data and create a Hive table for that file, use the query below to create the external table; it works fine and displays each record as it is in the file.
create external table tablename (datatype colname, datatype2 colname2)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile
location '/dir_name/';
Here, tablename is the name of the table, datatype is string, int, or another type, colname is the name of the column you are going to give, and dir_name is the location of the CSV or text file in HDFS.
Try it with the escape character set as well; it will work.
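A sketch of the original table creation with all three OpenCSVSerde properties set explicitly (the column list is elided as in the question, and the location is a placeholder):
CREATE EXTERNAL TABLE IF NOT EXISTS people_full(
  `Unique ID` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "escapeChar" = "\\" -- explicit escapeChar alongside separator and quote
)
STORED AS TEXTFILE
LOCATION '/dir_name/'; -- placeholder location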

Create a table in Hive and populate it with data

While trying to load data into a Hive table I encountered a behavior that looks strange to me. My data is made up of JSON objects loaded as records into a table called twitter_test containing a single column named "json".
Now I want to extract three fields from each JSON object and build a new table called "my_twitter". I thus issue the command:
CREATE TABLE my_twitter AS
SELECT
  regexp_replace(get_json_object(t.json, '$.body[0]'), '\n', '') AS text,
  get_json_object(t.json, '$.publishingdate[0]') AS created_at,
  get_json_object(t.json, '$.author_screen_name[0]') AS author
FROM twitter_test AS t;
The result is a table with three columns that contains no data. However, if I run the SELECT command alone it returns data as expected.
By trial and error I found out that I need to add LIMIT x at the end of the query for data to be inserted into the new table. The question is: why?
Furthermore, it seems strange that I need to know in advance the number x of rows returned by the SELECT statement for the CREATE to work correctly. Is there any workaround?
You could create a table over this JSON data using the JSON SerDe, which parses the JSON objects so that you can easily select each individual column.
Find below a sample Hive DDL for creating a JSON table using the JSON SerDe:
CREATE EXTERNAL TABLE `json_table`(
  A string,
  B string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'PATH';
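With the fields exposed as regular columns, the CTAS from the question becomes a plain projection. A sketch, where A and B stand in for the actual JSON fields:
CREATE TABLE my_twitter AS
SELECT
  regexp_replace(A, '\n', '') AS text, -- A stands in for the body field
  B AS created_at                      -- B stands in for the publishing date field
FROM json_table;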