Need help reading complex data in Hive

I have data in a file like this:
[street#226 S Wabash Ave,city#Chicago,state#IL]
[street#227 B H Tower,city#San Diego,state#CA]
I created the table using this code:
create external table if not exists test (
address map<string,string>
)
row format delimited fields terminated by '\;'
COLLECTION ITEMS TERMINATED BY ','
map keys terminated by '#'
;
But when I read the file, it is parsed like this:
{"[street#939 W El Camino":null,"city#Chicago":null,"state#IL]":null}
{"[street#415 N Mary Ave":null,"city#Chicago":null,"state#IL]":null}
What should I do?

Solved. I first loaded each line into the table as a single string column, and then created another table using:
select str_to_map(regexp_replace(address,'\\[|\\]',''),',','#') as address from test;
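The two-step approach above could be sketched as follows. This is only a minimal sketch, assuming test has been redefined with a single STRING column named address; the table name test_parsed is hypothetical and chosen for illustration.

```sql
-- Assumption: test now holds each raw line in one STRING column named address.
-- test_parsed is a hypothetical name for the second table.
CREATE TABLE test_parsed AS
SELECT
  -- Strip the surrounding [ and ], then split on ',' (entries) and '#' (key/value).
  str_to_map(regexp_replace(address, '\\[|\\]', ''), ',', '#') AS address
FROM test;

-- Individual values can then be read by key.
SELECT address['street'], address['city'], address['state']
FROM test_parsed;
```

str_to_map returns a map<string,string>, which matches the column type in the original DDL. Note that a value containing a comma would be split into extra entries.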

Related

Error Message "HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns..."

I ran this in AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_test` (
`Data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'nina'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, when I try to query the table using the statement below:
SELECT * FROM "nina-nba-database"."nina_nba_table" limit 10;
It gives me this error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
This query ran against the "layla-nba-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: b96e4344-5bbe-4eca-9da4-70be11f8e87d
Would anyone be able to help?
The input.regex in your table definition doesn't look valid. With RegexSerDe, each capturing group in the regex becomes a column, so the number of groups must match the number of columns. The pattern 'nina' has no capturing groups, while the table declares one column. If you want to extract parts of each line as columns, specify a valid regex; to learn more, refer to the Regex SerDe examples in the AWS documentation. Or, if your use case is just to read columnar data, you can create the table with the proper delimiter. For example, if your data is comma-separated, you can specify the delimiter as
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
...
Have a look at this example for more details.
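To illustrate how capturing groups map to columns, here is a minimal sketch of the same table with one capturing group for its one column. The pattern (.*) is an assumption that simply captures the whole line; a real pattern would depend on the layout of the files under s3://nina-gray/.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_test` (
  `Data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capturing group for the one declared column.
  'input.regex' = '(.*)'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');
```

Adding a second column would require a second capturing group in the pattern, and so on.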

How to access individual elements of a blob in DynamoDB using a Hive script?

I am transferring data from DynamoDB to S3 using a Hive script in AWS Data Pipeline. I am using a script like this:
CREATE EXTERNAL TABLE dynamodb_table (
PROPERTIES STRING,
EMAIL STRING,
.............
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "${DYNAMODB_INPUT_TABLE}",
"dynamodb.column.mapping" = "PROPERTIES:Properties,EMAIL:EmailId...."
);
CREATE EXTERNAL TABLE s3_table (
PROPERTIES STRING,
EMAIL STRING,
......
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '${S3_OUTPUT_BUCKET}';
INSERT OVERWRITE TABLE s3_table SELECT * FROM dynamodb_table;
The Properties column in the DynamoDB table looks like this:
Properties : String
:{\"deal\":null,\"MinType\":null,\"discount\":null}
That is, it contains multiple attributes. I want each attribute in Properties to come out as a separate column (not just as a string in a single column). I want the output in this schema:
deal MinType discount EMAIL
How can I do this?
Is your Properties column in proper JSON format? If so, it looks like you can use get_json_object: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
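A minimal sketch of that suggestion, assuming the Properties string is valid JSON (without literal backslashes before the quotes). The table name s3_table_flat and the choice of STRING for every column are assumptions made for illustration; the other names come from the question.

```sql
-- Hypothetical output table with one column per JSON attribute.
CREATE EXTERNAL TABLE s3_table_flat (
  deal STRING,
  MinType STRING,
  discount STRING,
  EMAIL STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '${S3_OUTPUT_BUCKET}';

-- get_json_object extracts one attribute per call, using a JSONPath-style expression.
INSERT OVERWRITE TABLE s3_table_flat
SELECT
  get_json_object(PROPERTIES, '$.deal'),
  get_json_object(PROPERTIES, '$.MinType'),
  get_json_object(PROPERTIES, '$.discount'),
  EMAIL
FROM dynamodb_table;
```

get_json_object returns NULL when the input is not valid JSON, so a column full of NULLs is a hint that the stored string needs cleaning first.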

Not getting the desired output with a Hive query

I have two semicolon-delimited input files, which I loaded into two tables. Both tables contain information on books, and I joined them on the ISBN field. To create the tables, I used the query below, which skips the header and reads semicolon-delimited files:
Create table books (ISBN STRING,BookTitle STRING,BookAuthor STRING,YearOfPublication STRING,Publisher STRING,ImageURLS STRING,ImageURLM STRING,ImageURLL STRING) row format delimited fields terminated by '\;' lines terminated by '\n' tblproperties ("skip.header.line.count"="1");
Now I am trying the query below, but I am not getting the desired output:
SELECT a.BookRating, COUNT(BookTitle)
FROM Books b
JOIN Rating a
on (b.ISBN = a.ISBN)
WHERE b.YearOfPublication = 2002
GROUP BY a.BookRating;
I am not getting any rows back. The terminal just shows OK after the query finishes. Please let me know what can be done. Thanks in advance.
Your DDL is not correct.
You have written
row format delimited fields terminated by '\;'
but it should be
row format delimited fields terminated by ';'
Try this and let me know.
YearOfPublication is declared as a string, so you need to compare it with a string literal:
WHERE b.YearOfPublication = '2002'
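Before rerunning the join, it may help to check how the rows were actually split and stored. A small diagnostic sketch, using only the table and column names from the question:

```sql
-- If the delimiter was not applied, the whole line lands in ISBN
-- and the remaining columns come back NULL.
SELECT ISBN, BookTitle, YearOfPublication
FROM books
LIMIT 5;

-- Shows the stored year values exactly as Hive sees them
-- (for example, with stray quotes or whitespace).
SELECT YearOfPublication, COUNT(*) AS n
FROM books
GROUP BY YearOfPublication
LIMIT 20;
```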

Hive MAP isn't reading input correctly

I am trying to create a table on this Mahout recommender system output data on S3:
703209355938578 [18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916]
828667482548563 [18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:1.0,18095:1.0]
1705358040772485 [18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:1.0,18386:1.0]
with this schema,
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxx/data/RS/output/m05/';
But when I read the data back with Hive,
hive> select * from user_ad_reco limit 10;
it gives output like this:
703209355938578 {18519:1.5216354,18468:1.5127649,17962:null}
828667482548563 {18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:null}
1705358040772485 {18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:null}
So the last key:value pair of the input map is missing from the output, and the last pair shown has a null value.
Can anyone help regarding this?
Reason for the nulls:
With brackets in the input, the row format is not read properly. The last map value, 1.5075916, is read as 1.5075916], so it becomes null because of the data type mismatch.
Input data format with brackets (gives null):
703209355938578 [ 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916 ]
Input data format without brackets works cleanly (tested):
703209355938578 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916
Thanks @ramisetty. I did it in an indirect way: first I got rid of the two brackets, [ and ], in the map string, then I created the schema on the string without brackets.
CREATE EXTERNAL TABLE user_ad_reco_serde (
userid STRING,
reco_map STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]+)\\s\\[([^]]+)]"
)
STORED AS TEXTFILE
LOCATION
's3://xxxxxx/data/RS/output/6m/2014-01-2014-05/';
CREATE external table user_ad_reco_plain(
userid bigint,
reco string)
LOCATION
's3://xxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
There might be some simpler way.
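The step that copies the bracket-free rows from the RegexSerDe table into the plain table is not shown above. A minimal sketch of what it could look like, assuming a plain INSERT ... SELECT between the two tables defined above (the CAST is an assumption, added to match the bigint column):

```sql
-- Assumed intermediate step: write the regex-captured fields, now without
-- brackets, into the plain table using Hive's default '\001' field delimiter.
INSERT OVERWRITE TABLE user_ad_reco_plain
SELECT
  CAST(userid AS BIGINT) AS userid,
  reco_map AS reco
FROM user_ad_reco_serde;
```

Because user_ad_reco_plain declares no row format, its files use the default '\001' field delimiter, which would explain why the final user_ad_reco table declares FIELDS TERMINATED BY '\001'.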

Update a column in table using SQL*Loader

I have written an SQL script having the below query. The query works fine.
update partner set is_seller_buyer=1 where id in (select id from partner
where names in
(
'A','B','C','D','E',... -- Around 100 names.
));
But now, instead of writing around 100 names in the query itself, I want to fetch all the names from a CSV file. I read about SQL*Loader on the Internet, but I did not find much about using it for an update query.
My CSV file only contains names.
I have tried:
load data
infile 'c:\data\mydata.csv'
into table partner set is_wholesaler_reseller=1
where id in (select id from partner
where names in
(
'A','B','C','D','E',... -- Around 100 names.
));
fields terminated by "," optionally enclosed by '"'
( names, sal, deptno )
How can I achieve this?
SQL*Loader does not perform updates, only inserts. So, you should insert your names into a separate table, say names, and run your update from that:
update partner set is_seller_buyer=1 where id in (select id from partner
where names in
(
select names from names
));
Your loader script can be changed to:
load data
infile 'c:\data\mydata.csv'
into table names
fields terminated by "," optionally enclosed by '"'
( names )
An alternative is to use external tables, which allow Oracle to treat a flat file as if it were a table. An example to get you started can be found here.
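A minimal sketch of the external table route, assuming the CSV has a single names field per line. The directory object data_dir, the table name names_ext, and the VARCHAR2(100) size are all hypothetical; the directory path must be visible to the database server, not just to the client.

```sql
-- Hypothetical directory object pointing at the folder that holds the CSV.
CREATE OR REPLACE DIRECTORY data_dir AS 'c:\data';

-- Hypothetical external table; the column size is an assumption.
CREATE TABLE names_ext (
  names VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  )
  LOCATION ('mydata.csv')
);

-- Same update as above, reading the names straight from the file.
UPDATE partner
SET is_seller_buyer = 1
WHERE id IN (
  SELECT id FROM partner
  WHERE names IN (SELECT names FROM names_ext)
);
```

With this approach there is no separate load step: each query against names_ext rereads the file.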