Hive external table delimited by commas, but comma present in data

I have some data coming in from an external source of the format:
user_id, user_name, project_name, position
"111", "Tom Petty", "Heartbreakers", "Vocals"
"222", "Ringo Starr", "Beatles, The", "Drummer"
"333", "Tom Brady", "Patriots", "QB"
And I create my external table thusly:
CREATE EXTERNAL TABLE tab1 (
USER_ID String,
USER_NAME String,
PROJECT_NAME String,
POSITION String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/blah/foo'
The problem occurs when the data in some columns contains embedded commas, "Beatles, The" for instance. This results in Hive putting the word The into the next column (position) and dropping the data in the last column.
All the incoming fields are wrapped in double quotes, but the file is comma delimited even though the fields themselves may contain commas. Unfortunately, having the sender clean the data is not an option.
How can I go about creating this table?

Try this (note the separator is a comma, since that is how the incoming data is delimited):
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)

You can use the OpenCSVSerde in your Hive table creation by setting the SerDe properties shown above.
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
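Putting the pieces together for the table in the question, a minimal sketch (one caveat worth knowing: OpenCSVSerde treats every column as STRING, so cast in your queries if you need other types):
CREATE EXTERNAL TABLE tab1 (
USER_ID String,
USER_NAME String,
PROJECT_NAME String,
POSITION String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/blah/foo';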

Related

Error Message "HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns..."

I ran this in AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_test` (
`Data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'nina'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');
However when I try to select the table using the syntax below:
SELECT * FROM "nina-nba-database"."nina_nba_table" limit 10;
It gives me this error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
This query ran against the "layla-nba-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: b96e4344-5bbe-4eca-9da4-70be11f8e87d
Would anyone be able to help?
The input.regex in your query doesn't look like a valid one. Each capturing group in the regex you specify while creating the table becomes a column, so the number of groups must match the number of declared columns. If you want to read data inside a column as a new column, specify a valid regex; to understand more about regexes here, refer to the Regex SerDe examples in the AWS documentation. Or, if your use case is just to read columnar data, you can create the table specifying the proper delimiter. For example, if your data is comma separated you can specify the delimiter as
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
...
Have a look at this example for more details.
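If you do want RegexSerDe, the rule is one capturing group per declared column. A hedged sketch, assuming three comma-separated fields (the table and column names here are illustrative, not from the original question):
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_regex` (
`col1` string,
`col2` string,
`col3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '^([^,]*),([^,]*),([^,]*)$'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');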

Is it possible to remove certain characters in a particular column while creating an Athena table?

I want to remove certain unnecessary characters in a column, so that the data can be split into an array.
The original data is in json format like this:
{
"id":"xyz",
"listL":"[\"N09jk\",\"KLpp1\"]",
"timestamp":"2019-01-04 05:33:02",
}
I want to parse the listL attribute as an array like [N09jk, KLpp1].
However, given the current format, it takes the entire string as one element, like this:
[["N09jk","KLpp1"]]
I was wondering if removing the characters [, ], and " while parsing the file, and then splitting into an array, would work.
My create table query is:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
`id` string,
`listL` array<string>,
`timestamp` timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://path/'
TBLPROPERTIES ('has_encrypted_data'='false');
Create the table with the column listL as string and use json_parse to parse it as an array during querying:
SELECT
id,
json_parse(listL) as listL,
timestamp
FROM table
You can also create a view so that you don't have to include the json_parse in every query:
CREATE VIEW table_with_list AS
SELECT
id,
json_parse(listL) as listL,
timestamp
FROM table
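Note that json_parse returns a value of Athena's json type. If you want a true array(varchar), for example to use cardinality or UNNEST, a hedged variant is to add a cast:
SELECT
id,
CAST(json_parse(listL) AS array(varchar)) as listL,
timestamp
FROM table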

Querying S3 Text files using Athena

I created a database and a table in Athena pointing to an S3 bucket, where I have log files created using the UNLOAD command on a Redshift database. The files use the default pipe (|) delimiter for the columns.
While creating the table through the Athena interface, I set the field terminator to pipe (|), and left the collection and map key terminators as default. Here is the DDL statement.
CREATE EXTERNAL TABLE IF NOT EXISTS testdb.worktable (
field1 string,
field2 string,
field3 int,
field4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '|',
'field.delim' = '|',
'collection.delim' = 'undefined',
'mapkey.delim' = 'undefined'
) LOCATION 's3://bucket_location'
TBLPROPERTIES ('has_encrypted_data'='false');
Problem:
Most of the rows align correctly with the fields defined as columns (delimited by pipe |). But when there is a space in a particular field, say a space in the field2 column, the data shifts to the right, meaning field3's data shows up under the field4 column.
Could someone help me fix this error? Thank you!

HIVE 2.1.1 Table creation CSV-Serde

So I did all the research and couldn't find the same issue reported anywhere for Hive.
I followed the link below, and I have no issues with data in quotes:
https://github.com/ogrodnek/csv-serde
My external table creation uses the SerDe properties below, but for some reason the default escapeChar ('\') is being replaced in my data by the quoteChar, which is a double quote (").
CREATE EXTERNAL TABLE IF NOT EXISTS people_full(
`Unique ID` STRING,
.
.
.
.
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"escapeChar" = "\\"
)
STORED AS TEXTFILE
DATA ISSUE:
Sample HDFS Source data : "\"Robs business
Target HIVE Output : """Robs business
The three double quotes seen in """Robs business after the replacement cause unwanted delimitation of the data (the column is a very long string), perhaps because Hive cannot handle three double quotes inside data (quote (") is also my default quote character)?
Why is this happening, and is there a solution? Please help. Many thanks.
Best,
Asha
To import your CSV file to HDFS with double quotes inside the data, and to create a Hive table for that file, use the following query to create an external table; it works fine and displays each record as it appears in the file:
CREATE EXTERNAL TABLE tablename (colname datatype, colname2 datatype2)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE
LOCATION '/dir_name/';
Here, tablename is the name of the table, datatype is string, int, or another type, colname is the name you give the column, and dir_name is the location of the CSV or text file in HDFS.
Try it with an ESCAPED BY clause; it will work. (The original answer illustrated this with a screenshot, not reproduced here.)
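For reference, a minimal sketch of what that ESCAPED BY variant might look like (the delimiter and single column here are assumptions, since the screenshot is not available):
CREATE EXTERNAL TABLE IF NOT EXISTS people_full (
`Unique ID` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
STORED AS TEXTFILE
LOCATION '/dir_name/';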

Hive MAP isn't reading input correctly

I am trying to create a table over this Mahout recommender system output data on S3.
703209355938578 [18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916]
828667482548563 [18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:1.0,18095:1.0]
1705358040772485 [18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:1.0,18386:1.0]
with this schema,
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxx/data/RS/output/m05/';
but when I read the data back with Hive,
hive >
select * from user_ad_reco limit 10;
it gives output like this:
703209355938578 {18519:1.5216354,18468:1.5127649,17962:null}
828667482548563 {18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:null}
1705358040772485 {18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:null}
So the last key:value pair of each map is missing from the output, and the final pair that does appear has a null value :(.
Can anyone help regarding this?
Reason for the nulls:
The brackets in the input data cause the nulls. Because of the brackets, the row format is not read properly: the last map entry 1.5075916 is read as 1.5075916], which fails to parse as a double, so it comes back as null due to the data type mismatch.
703209355938578 [ 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916 ]
The input data format without brackets works cleanly (tested):
703209355938578 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916
Thanks @ramisetty. I did it in a somewhat indirect way: first I got rid of the two brackets [ and ] around the map string, then created the schema on the bracket-free string.
CREATE EXTERNAL TABLE user_ad_reco_serde (
userid STRING,
reco_map STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]+)\\s\\[([^]]+)]"
)
STORED AS TEXTFILE
LOCATION
's3://xxxxxx/data/RS/output/6m/2014-01-2014-05/';
CREATE external table user_ad_reco_plain(
userid bigint,
reco string)
LOCATION
's3://xxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
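(The step that populates the plain location is not shown above; presumably something like the following hedged sketch, which copies the bracket-free strings from the regex table into the plain table:)
INSERT OVERWRITE TABLE user_ad_reco_plain
SELECT CAST(userid AS BIGINT), reco_map
FROM user_ad_reco_serde;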
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
There might be some simpler way.
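One candidate shortcut, as a hedged sketch: Hive's built-in str_to_map can split the captured string directly, skipping the intermediate plain table, though it returns map<string,string> rather than MAP<bigint, double>:
SELECT userid,
str_to_map(reco_map, ',', ':') AS reco
FROM user_ad_reco_serde;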