Querying S3 Text files using Athena

I created a database and a table in Athena pointing to an S3 bucket where I have log files created using the UNLOAD command on a Redshift database. The files use the default pipe (|) delimiter for the columns.
While creating the table through the Athena interface, I set the field terminator to pipe (|) and left the collection and map key terminators at their defaults. Here is the DDL statement:
CREATE EXTERNAL TABLE IF NOT EXISTS testdb.worktable (
field1 string,
field2 string,
field3 int,
field4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '|',
'field.delim' = '|',
'collection.delim' = 'undefined',
'mapkey.delim' = 'undefined'
) LOCATION 's3://bucket_location'
TBLPROPERTIES ('has_encrypted_data'='false');
Problem :
Most of the rows align correctly with the fields mentioned as columns (delimited by pipe |). But when there are spaces in a particular field, say for example a space in the field2 column, the data shifts to the right, meaning the field3 column data shows up under the field4 column.
Could someone help me fix this error? Thank you!

Related

Error Message "HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns..."

I ran this in AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_test` (
`Data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'nina'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, when I try to select from the table using the syntax below:
SELECT * FROM "nina-nba-database"."nina_nba_table" limit 10;
It gives me this error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
This query ran against the "layla-nba-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: b96e4344-5bbe-4eca-9da4-70be11f8e87d
Would anyone be able to help?
The input.regex in your query doesn't look like a valid one. Each capturing group in the regex you specify at table creation becomes a column, so the number of groups must match the number of declared columns. If you want to read data inside a column as a new column, specify a valid regex; to understand more, you can refer to the Regex SerDe examples in the AWS documentation. Or, if your use case is just to read columnar data, create the table with the proper delimiter. For example, if your data is comma separated you can specify the delimiter as:
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
...
Have a look at this example for more details.
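For instance, here is a minimal sketch of a valid Regex SerDe table (the table name, columns, bucket, and regex are hypothetical, assuming rows like Nina,32); note there are two capturing groups for the two declared columns:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.players (
name string,
age int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^([^,]+),([0-9]+)$'
)
LOCATION 's3://my-example-bucket/players/';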

Create table on Athena only from certain S3 files (depending on filename)

I'm trying to create a table on Athena from S3 files.
In my bucket, I have different types of files (Activity, Epoch, BodyComp, etc.) and I'd like this table to contain only "Activity" files, assuming their filenames look like:
"Activity__xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx__yyyyyyyyyy.json"
where:
- x is a character or a digit
- y is a digit
I can do that after creating the table with this SELECT statement, but the query takes too much time:
SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') like 'Activity__%';
I'd like to do it directly in the CREATE TABLE statement.
I tried this with "input.regex" but it didn't work:
CREATE EXTERNAL TABLE IF NOT EXISTS runs(
summaryId string,
distanceInMeters float,
maxHeartRateInBeatsPerMinute int,
totalElevationGainInMeters float,
userAccessToken string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1',
"input.regex" = "^Activity\_\_\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\_\_\d{10}\.json")
LOCATION 's3://com.connector/'
TBLPROPERTIES ('has_encrypted_data'='false');
I think the problem comes from the fact that "input.regex" is not the correct parameter to get the filenames.
Thank you for your help,
Max
There is no direct way of doing this. Either rename the unwanted files to start with an underscore (_) so that Athena ignores them, or use CTAS (CREATE TABLE AS SELECT) and pass it the SELECT query above.
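For the CTAS route, a minimal sketch (the output table name and external_location are hypothetical; the SELECT is the one from the question):
CREATE TABLE activity_runs
WITH (
external_location = 's3://com.connector/activity_runs/',
format = 'PARQUET'
) AS
SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') LIKE 'Activity__%';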

Hive external table delimited by commas, but comma present in data

I have some data coming in from an external source of the format:
user_id, user_name, project_name, position
"111", "Tom Petty", "Heartbreakers", "Vocals"
"222", "Ringo Starr", "Beatles, The", "Drummer"
"333", "Tom Brady", "Patriots", "QB"
And I create my external table thusly:
CREATE EXTERNAL TABLE tab1 (
USER_ID String,
USER_NAME String,
PROJECT_NAME String,
POSITION String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/blah/foo'
The problem occurs when data in some of the columns has embedded commas, "Beatles, The" for instance. This results in Hive putting the word The into the next column (position) and dropping the data in the last column.
All the incoming data fields are wrapped in double quotes, but they are comma delimited even though they may contain commas. Unfortunately, having the sender clean the data is not an option.
How can I go about creating this table?
Try this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "\""
)
You can try using the OpenCSV SerDe in your Hive table creation with these SerDe properties:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
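Applied to the table from the question, the full statement would look something like this sketch (note that the OpenCSV SerDe reads every column as STRING):
CREATE EXTERNAL TABLE tab1 (
USER_ID String,
USER_NAME String,
PROJECT_NAME String,
POSITION String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/blah/foo';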

HIVE 2.1.1 Table creation CSV-Serde

So I did all the research and couldn't find the same issue reported anywhere for HIVE.
I followed the link below and have no issues with data in quotes:
https://github.com/ogrodnek/csv-serde
My external table creation has the SerDe properties below, but for some reason the default escapeChar ('\') is being replaced by the quoteChar, which is a double quote ("), in my data.
CREATE EXTERNAL TABLE IF NOT EXISTS people_full(
`Unique ID` STRING,
.
.
.
.
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"escapeChar" = "\\"
)
STORED AS TEXTFILE
DATA ISSUE:
Sample HDFS source data: "\"Robs business
Target HIVE output: """Robs business
So the three double quotes seen in """Robs business after the replacement are causing unwanted data delimitation (the column is a very long string), maybe because HIVE cannot handle three double quotes inside data (the quote (") is also my default quote character)?
Why is this happening, and is there a solution? Please help. Many thanks.
Best,
Asha
To import your CSV file to HDFS with double quotes in the data and create a Hive table for that file, use the following query in Hive to create an external table; it works fine and displays each record as it appears in the file.
create external table tablename (colname datatype, colname2 datatype2)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile location '/dir_name/';
Here, tablename is the name of the table, colname is the name of a column, datatype is string, int, or another type, and dir_name is the HDFS location of the CSV or text file.
Alternatively, try it with ESCAPED BY; it will work.
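For example, a minimal sketch with a delimited row format (the column and location are placeholders):
CREATE EXTERNAL TABLE people_escaped (
`Unique ID` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
STORED AS TEXTFILE
LOCATION '/path/to/people/';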

Hive MAP isn't reading input correctly

I am trying to create a table on this Mahout recommender system output data on S3:
703209355938578 [18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916]
828667482548563 [18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:1.0,18095:1.0]
1705358040772485 [18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:1.0,18386:1.0]
with this schema:
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxx/data/RS/output/m05/';
But when I read the data back with Hive,
hive >
select * from user_ad_reco limit 10;
it gives output like this:
703209355938578 {18519:1.5216354,18468:1.5127649,17962:null}
828667482548563 {18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:null}
1705358040772485 {18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:null}
So the last key:value entries of the map input are missing from the output, with a null in the last output pair :(.
Can anyone help regarding this?
Reason for the nulls: the brackets in the input data cause them. Because of the brackets, the row format is not being read properly; the last map entry 1.5075916 is being read as 1.5075916], so it gives null due to the data type mismatch.
703209355938578 [ 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916 ]
An input data format without brackets works cleanly (tested):
703209355938578 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916
Thanks @ramisetty, I have done it in a somewhat indirect way: first I got rid of the two brackets [,] around the map string, then created the schema on the bracket-free strings.
-- Step 1: regex table that captures the userid and the map string without brackets
CREATE EXTERNAL TABLE user_ad_reco_serde (
userid STRING,
reco_map STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]+)\\s\\[([^]]+)]"
)
STORED AS TEXTFILE
LOCATION
's3://xxxxxx/data/RS/output/6m/2014-01-2014-05/';

-- Step 2: plain table over the bracket-free copy of the data
CREATE external table user_ad_reco_plain(
userid bigint,
reco string)
LOCATION
's3://xxxxx/data/RS/output/6m_plain/2014-01-2014-05/';

-- Step 3: map-typed table over the same bracket-free data
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
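The step that copies the parsed rows into the bracket-free location isn't shown above; presumably something like the following sketch (an assumption on my part; it relies on Hive's default text output using '\001' as the field delimiter, which is what the final table expects):
INSERT OVERWRITE TABLE user_ad_reco_plain
SELECT CAST(userid AS bigint), reco_map
FROM user_ad_reco_serde;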
There might be some simpler way.