How to flatten a JSON structure in Hive?

I have a JSON structure that looks like this.
There is a structure nested in a column called header:
{"user":{"location":"USA","id":1514008171,"name":"Auzzie Jet","screenname":"metalheadgrunge","geoenabled":false},"tweetmessage":"Anthrax - Black - Sunglasses hell","createddate":"2013-06-20T12:08:44","identifier":"1234","geolocation":null}
When I query this, this works:
SELECT *
FROM TBL a
WHERE header.identifier = '1234'
But when I want to find the location in the nested JSON structure, it does not work:
SELECT *
FROM TBL a
WHERE header.identifier = '1234'
and a.header.user.location LIKE '%USA%'
Does anyone know how to query this in Hive?

To flatten a JSON structure, you first need to create a lateral view using json_tuple; that is how you can achieve what you intend.
Here is the complete solution.
Step 1: Create an external table tweets with a single column tweet of type string.
CREATE EXTERNAL table tweets (tweet string);
Now put the JSON string
{"user":{"location":"USA","id":1514008171,"name":"Auzzie Jet","screenname":"metalheadgrunge","geoenabled":false},"tweetmessage":"Anthrax - Black - Sunglasses hell","createddate":"2013-06-20T12:08:44","identifier":"1234","geolocation":null}
in a text file named tweets.txt.
Step 2: Run the command below to load the data from the text file into the Hive table.
LOAD data local inpath 'tweets.txt' into table tweets;
Once that is done, we are ready to work with our JSON string.
What we are trying to achieve is querying on the identifier and location fields, which sit at different levels of the structure:
user
    location : USA
    id : 1514008171
    name : Auzzie Jet
    screenname : metalheadgrunge
    geoenabled : false
tweetmessage : Anthrax - Black - Sunglasses hell
createddate : 2013-06-20T12:08:44
identifier : 1234
geolocation : null
Level 1 fields are => user, tweetmessage, createddate, identifier, geolocation
Level 2 Fields are => location, id, name, screenname, geoenabled
First we need to create a lateral view on Level 1 so that we can query the Level 1 fields; in our example we need to query on identifier. Then, to query the Level 2 fields, we need to explode the user field, which another lateral view makes possible.
LATERAL VIEW json_tuple(t.tweet, 'user', 'identifier' ) t1 as `user`, `identifier`
Then, to query on location, we create another lateral view for the Level 2 fields:
LATERAL VIEW json_tuple(t1.`user`,'name', 'location') t2 as `name`, `location`
And that's it. Finally we can run our select on tweets with both lateral views.
Step 3 and Final Query:
SELECT t.*
FROM tweets t
LATERAL VIEW json_tuple(t.tweet, 'user', 'identifier') t1 AS `user`, `identifier`
LATERAL VIEW json_tuple(t1.`user`, 'name', 'location') t2 AS `name`, `location`
WHERE t1.`identifier` = '1234' AND t2.`location` = 'USA';
For more, see the Hive wiki pages on Lateral View and json_tuple.
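As a side note, if you only need a couple of fields, Hive's get_json_object can reach nested values directly with a JSONPath-style expression, without any lateral views. A minimal sketch against the same tweets table:
-- assumes the single-column tweets table loaded above
SELECT get_json_object(t.tweet, '$.identifier') AS identifier,
       get_json_object(t.tweet, '$.user.location') AS location
FROM tweets t
WHERE get_json_object(t.tweet, '$.identifier') = '1234'
  AND get_json_object(t.tweet, '$.user.location') LIKE '%USA%';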

Related

Postgresql JSON column check value exists in array of json

I have a database schema like the following, with a Children record table:
CREATE TABLE Children (
    name varchar(100),
    friends json[] NOT NULL
);
INSERT INTO Children (name,friends)
VALUES('Sam',
array['{"name":"Rakesh","country":"Africa"}',
'{"name":"Ramesh","country":"India"}']::json[]);
Now I need to query the data and display it only if the name of the friend is like '%Ra'. Structure of the JSON data is consistent.
If the data type is json[], you can use unnest and then write your query; if it is plain json, you can use json_array_elements.
The code below assumes the json[] data type:
select * from Children
where name in (
    select name from (
        select name, unnest(friends) as friend from Children
    ) i
    where i.friend->>'name' like '%Ra');
DBFiddle
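If the column is instead plain json holding a JSON array (rather than json[]), the same idea works with json_array_elements in place of unnest; a sketch under that assumption:
-- assumes friends is a json column containing a JSON array of objects
select * from Children
where name in (
    select name from (
        select name, json_array_elements(friends) as friend from Children
    ) i
    where i.friend->>'name' like '%Ra');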

copy data from 2 separate tables in postgresql

Just wondering, is it possible to extract data from 2 different tables at one time in PostgreSQL?
I have the following:
Blocks table - created as follows in order to fit a schema, so the JSON information has all been stored in an Information column containing 36 polygons per row:
UUID (UUID)                          | Name (TEXT) | Type (TEXT) | Information (TEXT)
815b2ce7-ce99-4d6c-b41a-bec512173f53 | C2          | Block       | 'stored JSON info'
7a9a03fc-8be6-47ca-b743-43715ebb5610 | D2          | Block       | 'stored JSON info'
9136dcda-2a55-4084-87c1-68ccde23aed8 | E3          | Block       | 'stored JSON info'
For a later query, I need to know the geometries of each of the polygons, so I created another table using a code which parsed them out:
CREATE TABLE blockc2_ AS
SELECT geom FROM (
    SELECT elem->>'type' AS type, elem->'properties' AS prop, elem->'geometry' AS geom
    FROM (SELECT json_array_elements(data) elem FROM block) f1
) f2;
A final table is created to hold just the geometries, which will be associated with the already created UUIDs, like below:
new_table

UUID (UUID)                          | Geometry (Geometry)
815b2ce7-ce99-4d6c-b41a-bec512173f53 | 01030000000100000005000000972E05A56D6851C084D91C434C6C32401C05D4886B6851C086D974FA4D6C324078F4DA916D6851C036BF7504766C3240F31D0CAE6F6851C035BF1D4D746C3240972E05A56D6851C084D91C434C6C3240
7a9a03fc-8be6-47ca-b743-43715ebb5610 | 01030000000100000005000000BB05694F726851C0CB2A87A8486C32403EDC3733706851C0CD2ADF5F4A6C32409ACB3E3C726851C07E10E069726C324017F56F58746851C07C1088B2706C3240BB05694F726851C0CB2A87A8486C3240
9136dcda-2a55-4084-87c1-68ccde23aed8 | 01030000000100000005000000972E05A56D6851C084D91C434C6C32401C05D4886B6851C086D974FA4D6C324078F4DA916D6851C036BF7504766C3240F31D0CAE6F6851C035BF1D4D746C3240972E05A56D6851C084D91C434C6C3240
Ideally, I need code like the query below (if it's possible), because if I insert them separately they don't associate with each other: instead of 3 rows of info, I get 6 (3 UUIDs and 3 geometries).
INSERT INTO new_table (uuid, geometry) SELECT UUID FROM blocks WHERE Name='C2' AND SELECT geometry FROM second_table WHERE Name='C2'
Is something like this possible?
create table C (select * from table B union all select * from table A)
This sounds like a join:
INSERT INTO new_table (uuid, geometry)
SELECT b.UUID, g.geometry
FROM blocks b
JOIN geometry g USING (name)
WHERE name = 'C2';
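If you would rather materialize the combined result as a new table in one step (closer to the create table C idea in the question), CREATE TABLE ... AS works with the same join. A sketch assuming the same table and column names as above:
-- creates new_table directly from the joined rows
CREATE TABLE new_table AS
SELECT b.UUID AS uuid, g.geometry
FROM blocks b
JOIN geometry g USING (name);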

Group on Map value in hive

I have a table in Hive:
CREATE TABLE IF NOT EXISTS user
(
    name STRING,
    creation_date DATE,
    cards map<STRING,STRING>
) STORED AS PARQUET;
Let's suppose that I want to count the number of Gobelin cards per user and group by them.
My query looks like this:
select cards["Gobelin"], COUNT(*) from user GROUP BY cards["Gobelin"];
I get an error on group by saying
FAILED: SemanticException [Error 10033]: Line 54:30 [] not valid on non-collection types '"Gobelin"': string
As far as I understand, you want to count map elements with key "Gobelin". You can explode the map, then filter out other keys and count the remaining values, e.g.
select count(*) as cnt
from (select explode(cards) as (key, val) from user) t
where key = 'Gobelin';
You can use lateral view as well, see Hive's manual for details.
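For instance, a sketch of the lateral-view variant that counts Gobelin cards per user, assuming the user table from the question:
-- explode the cards map into key/value columns, filter on the key, group per user
select u.name, count(*) as cnt
from user u
lateral view explode(cards) c as card_key, card_value
where c.card_key = 'Gobelin'
group by u.name;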

How to delete records in BigQuery based on values in an array?

In Google BigQuery, I would like to delete a subset of records, based on the value of a specific column. It's a query that I need to run repeatedly and that I would like to run automatically.
The problem is that this specific column is of the form STRUCT<column_1 ARRAY<STRING>, column_2 ARRAY<STRING>, ...>, and I don't know how to use such a column in the WHERE clause of a DELETE command.
Here is basically what I am trying to do (this code does not work):
DELETE
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
The error that I'm getting is: Syntax error: Expected end of input but got keyword LEFT at [3:1]
If I replace the DELETE with SELECT *, it does work:
SELECT *
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
Does somebody know how to use such a column to delete a subset of records?
EDIT:
Here is some code to create a reproducible example with some silly data (fill in your own dataset and table name in all queries):
Suppose you want to delete all rows where category.type contains the value 'food'.
1 - Create a table:
CREATE TABLE <DATASET>.<TABLE_NAME>
(
article STRING,
category STRUCT<
color STRING,
type ARRAY<STRING>
>
);
2 - Insert data into the new table:
INSERT <DATASET>.<TABLE_NAME>
SELECT "apple" AS article, STRUCT('red' AS color, ['fruit','food'] as type) AS category
UNION ALL
SELECT "cabbage" AS article, STRUCT('blue' AS color, ['vegetable', 'food'] as type) AS category
UNION ALL
SELECT "book" AS article, STRUCT('red' AS color, ['object'] as type) AS category
UNION ALL
SELECT "dog" AS article, STRUCT('green' AS color, ['animal', 'pet'] as type) AS category;
3 - Show that select works (return all rows where category.type contains the value 'food'; these are the rows I want to delete):
SELECT *
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Initial Result
4 - My attempt at deleting rows where category.type contains 'food' does not work:
DELETE
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Syntax error: Unexpected keyword LEFT at [3:1]
Desired Result
This is the code I used to delete the desired records (the records where category.type contains the value 'food'):
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS(SELECT 1 FROM UNNEST(t1.category.type) t2 WHERE t2 = 'food')
The embarrassing thing is that I've seen this kind of answer on similar questions (for example on UPDATE queries). But I come from Oracle SQL, where I believe you are required to connect your subquery with your main query in the subquery's WHERE clause (i.e. connect t1 with t2), so I didn't understand those answers. That's why I posted this question.
However, I learned that BigQuery automatically understands how to connect table t1 and 'table' t2; you don't have to explicitly connect them.
Now it is possible to still do this (perhaps even recommended?):
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS (
    SELECT 1
    FROM <DATASET>.<TABLE_NAME> t2
    LEFT JOIN UNNEST(t2.category.type) AS type
    WHERE type = 'food' AND t1.article = t2.article
)
but a second difficulty for me was that the ID in my actual data is hidden in an array<struct> construction, so I got stuck connecting t1 and t2. Fortunately this is not always an absolute necessity.
Since you did not provide any sample data I am going to explain using some dummy data. In case you add your sample data, I can update the answer.
Firstly, according to your description, you have only a STRUCT, not an ARRAY<STRUCT<col_1, col_2>>. For this reason, you do not need to use UNNEST to access the values within the data. Below is an example of how to access particular data within a STRUCT.
WITH data AS (
SELECT 1 AS id, STRUCT("Alex" AS name, 30 AS age, "NYC" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Leo" AS name, 18 AS age, "Sydney" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Robert" AS name, 25 AS age, "Paris" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Mary" AS name, 28 AS age, "London" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Ralph" AS name, 45 AS age, "London" AS city) AS info
)
SELECT * FROM data
WHERE info.city = "London"
Notice that the STRUCT is named info, and we accessed its city field and used it in the WHERE clause.
Now, in order to delete rows that contain a specific value within the STRUCT (in your case I assume it would be your_struct.column_1), you can use DELETE, or MERGE and DELETE. I have saved the above data in a table in order to execute the examples below, which have the same output.
First method: DELETE
DELETE FROM `project.dataset.table`
WHERE info.city = "Sydney"
Second method: MERGE and DELETE
MERGE `project.dataset.table` a
USING (SELECT * FROM `project.dataset.table` WHERE info.city = "Sydney") b
ON a.info.city = b.info.city
WHEN MATCHED AND b.id = 1 THEN
    DELETE
And the output for both queries:
Row | id | info.name | info.age | info.city
1   | 1  | Alex      | 30       | NYC
2   | 1  | Robert    | 25       | Paris
3   | 1  | Ralph     | 45       | London
4   | 1  | Mary      | 28       | London
As you can see the row where info.city = "Sydney" was deleted in both cases.
It is important to point out that the deleted data is permanently removed from your source table, so you should be careful.
Note: Since you want to run this process every day, you could use a scheduled query within the BigQuery console, appending or overwriting the results after each run. Also, it is good practice not to delete data from your source table, so consider creating a new table from your source table without the rows you do not want, as sketched below.
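A sketch of that safer pattern with the dummy data above, keeping every row except the ones you would otherwise delete (table names are placeholders):
-- writes the surviving rows to a new table instead of deleting in place
CREATE TABLE `project.dataset.table_filtered` AS
SELECT *
FROM `project.dataset.table`
WHERE info.city != "Sydney";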

Hive query reading a list of constants from a text file?

I want to extract data for a list of userids I am interested in. If the list is short, I can type the query directly:
SELECT * FROM mytable WHERE userid IN (100, 101, 102);
(this is an example; the real query might be more complex). But the list of userids might be long, and only available as a text file:
100
101
102
How can I run the same query with Hive reading from userids.txt directly?
One way is to put the data in another table and INNER JOIN to it, so that there has to be a match for the record to go through:
Create the table: CREATE TABLE users (userid INT);
Load the data file: LOAD DATA LOCAL INPATH 'userids.txt' INTO TABLE users;
Filter through the inner join: SELECT mytable.* FROM mytable INNER JOIN users ON mytable.userid = users.userid;
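For what it's worth, the same filter can also be written as a LEFT SEMI JOIN, which makes the intent (an existence check rather than join output) explicit; a sketch with the same table names:
-- returns each mytable row whose userid appears in users
SELECT m.*
FROM mytable m
LEFT SEMI JOIN users u ON m.userid = u.userid;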