Can we create several entries from one line? - hive

My logs look like this:
client_id;event_1;event_2;event3
And I would like to get an SQL table like this:
client_id | event
---------------------
... | event_1
... | event_2
... | event_3
I am new to Hive, and it seems that one log line always produces exactly one row in the resulting table.
I tried the following, without success:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
client_id String,
`event` String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\;]+);.*([^\;]+).*$" )
LOCATION 's3://myBucket/prefix/';
It captures only the first event and ignores the others...

Unfortunately, it is not possible to generate multiple rows from one input line using a SerDe in the table DDL. It is possible, however, to achieve the same result with a query in Hive.
(1) Read all user events as a single column:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
client_id String,
events String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\\;]+)\\;(.*)$" )
LOCATION 's3://myBucket/prefix/';
Check it: the query should read two columns, client_id and all events concatenated:
'client_id' and 'event_1;event_2;event3'
(2) Split events and explode to generate rows:
select t.client_id, e.event
from tablename t
lateral view outer explode(split(t.events,'\\;')) e as event;
See also the Hive documentation on Lateral View.
In Athena use UNNEST with CROSS JOIN:
select t.client_id, e.event
from tablename t
CROSS JOIN UNNEST(SPLIT(t.events,';')) AS e (event)
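As a quick sanity check (a sketch against the sample line above), count the generated rows per client; each log line should expand to one row per event:
select t.client_id, count(*) as event_count
from tablename t
lateral view outer explode(split(t.events, '\\;')) e as event
group by t.client_id;
-- for the sample line 'client_id;event_1;event_2;event3' expect event_count = 3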

Related

Selecting most recent timestamp row and getting values from a column with a Variant DataType

I hope the title makes some sense; I'm open to suggestions if I should make it more readable.
I have a temp table in Snowflake called BI_Table_Temp. It has 2 columns: Load_DateTime with datatype Timestamp_LTZ(9), and JSON_DATA, a Variant datatype that has nested records from a JSON file. I want to query this table, which I then plan to ingest into another table, but I want to make sure I always get the row with the most recent Load_DateTime.
I've tried this, which works, but it shows me the Load_DateTime column and I don't want that; I just want the values from the JSON_DATA row that has the max Load_DateTime timestamp:
SELECT
MAX(Load_DateTime),
transactions.value:id::string as id,
transactions.value:value2::string as account_value,
transactions.value:value3::string as new_account_value
FROM BI_Table_Temp,
LATERAL FLATTEN (JSON_DATA:transactions) as transactions
GROUP BY transactions.value
A simple option:
WITH data AS (
SELECT Load_DateTime
, transactions.value:id::string as id
, transactions.value:value2::string as account_value
, transactions.value:value3::string as new_account_value
FROM BI_Table_Temp,
LATERAL FLATTEN (JSON_DATA:transactions) as transactions
), max_load AS (
SELECT MAX(Load_DateTime) Load_DateTime, id
FROM data
GROUP BY id
)
SELECT id
, account_value
, new_account_value
FROM data
JOIN max_load
USING (id, Load_DateTime)
Since transactions.value is a variant, I'm guessing that for GROUP BY transactions.value you really mean GROUP BY transactions.value:id.
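Another option (a sketch, assuming you want exactly one latest row per id) is Snowflake's QUALIFY clause with ROW_NUMBER(), which avoids the second pass over the data:
SELECT transactions.value:id::string as id
, transactions.value:value2::string as account_value
, transactions.value:value3::string as new_account_value
FROM BI_Table_Temp,
LATERAL FLATTEN (JSON_DATA:transactions) as transactions
QUALIFY ROW_NUMBER() OVER (PARTITION BY transactions.value:id::string
ORDER BY Load_DateTime DESC) = 1;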

In Athena how do I query a member of a struct in an array in a struct?

I am trying to figure out how to query where I am checking the value of usage given the following table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS foo.test (
`id` string,
`foo` struct< usages:array< struct< usage:string,
method_id:int,
start_at:string,
end_at:string,
location:array<string> >>>
) PARTITIONED BY (
timestamp date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1')
LOCATION 's3://foo.bar/'
TBLPROPERTIES ('has_encrypted_data'='false');
I would like to have a query like:
SELECT * FROM "foo"."test" WHERE foo.usages.usage is null;
When I do that I get:
SYNTAX_ERROR: line 1:53: Expression "foo"."usages" is not of type ROW
If I directly index the array instead, as in the following query, it works.
SELECT * FROM "foo"."test" WHERE foo.usages[1].usage is null;
My overall goal though is to query across all items in the usages array and find any row where at least one item in the usages array has a member usage that is null.
Athena is based on Presto. In Presto 318 you can use any_match:
SELECT * FROM "foo"."test"
WHERE any_match(foo.usages, element -> element.usage IS NULL);
I think the function is not available in Athena yet, but you can emulate it using reduce.
SELECT * FROM "foo"."test"
WHERE reduce(
foo.usages, -- array to reduce over
false, -- initial state
(state, element) -> state OR element.usage IS NULL, -- combining function
state -> state); -- output function (identity in this case)
You can achieve this by unnesting the array into rows and then checking those for null values. This will result in one row per null-value entry.
select * from test
CROSS JOIN UNNEST(foo.usages) AS t(i)
where i.usage is null
So if you only need the unique set, you must run this through a select distinct.
select distinct id from test
CROSS JOIN UNNEST(foo.usages) AS t(i)
where i.usage is null
Another way to emulate any_match(<array>, <function>) is with cardinality(filter(<array>, <function>)) > 0.
SELECT * FROM "foo"."test"
WHERE any_match(foo.usages, element -> element.usage IS NULL);
Becomes:
SELECT * FROM "foo"."test"
WHERE cardinality(filter(foo.usages, element -> element.usage IS NULL)) > 0
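The same predicate also composes with the distinct-id variant shown earlier, avoiding the UNNEST entirely (a sketch):
select distinct id from test
where cardinality(filter(foo.usages, element -> element.usage IS NULL)) > 0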

I want to join two tables with a common column in BigQuery

To join the tables, I am using the following query.
SELECT *
FROM(select user as uservalue1 FROM [projectname.FullData_Edited]) as FullData_Edited
JOIN (select user as uservalue2 FROM [projectname.InstallDate]) as InstallDate
ON FullData_Edited.uservalue1=InstallDate.uservalue2;
The query works but the joined table only has two columns uservalue1 and uservalue2.
I want to keep all the columns present in both tables. Any idea how to achieve that?
#legacySQL
SELECT <list of fields to output>
FROM [projectname:datasetname.FullData_Edited] AS FullData_Edited
JOIN [projectname:datasetname.InstallDate] AS InstallDate
ON FullData_Edited.user = InstallDate.user
or (and preferably)
#standardSQL
SELECT <list of fields to output>
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
ON FullData_Edited.user = InstallDate.user
Note: using SELECT * in such cases leads to an Ambiguous column name error, so it is better to put an explicit list of the columns/fields you need in your output.
The way around it is to use the USING() syntax, as in the example below.
Assuming that user is the ONLY ambiguous field, it does the trick:
#standardSQL
SELECT *
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
USING (user)
For example:
#standardSQL
WITH `projectname.datasetname.FullData_Edited` AS (
SELECT 1 user, 'a' field1
),
`projectname.datasetname.InstallDate` AS (
SELECT 1 user, 'b' field2
)
SELECT *
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
USING (user)
returns
user field1 field2
1 a b
whereas using ON FullData_Edited.user = InstallDate.user gives below error
Error: Duplicate column names in the result are not supported. Found duplicate(s): user
Don't use subqueries if you want all columns:
SELECT *
FROM [projectname.FullData_Edited] as FullData_Edited JOIN
[projectname.InstallDate] as InstallDate
ON FullData_Edited.user = InstallDate.user;
You may have to list out the particular columns you want to avoid duplicate column names.
While you are at it, you should also switch to standard SQL.
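For instance, standard SQL's table.* EXCEPT modifier keeps every column while dropping the duplicated join key (a sketch, reusing the hypothetical dataset names from the previous answer):
#standardSQL
SELECT FullData_Edited.*, InstallDate.* EXCEPT (user)
FROM `projectname.datasetname.FullData_Edited` AS FullData_Edited
JOIN `projectname.datasetname.InstallDate` AS InstallDate
ON FullData_Edited.user = InstallDate.user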

Accessing BigQuery RECORD - Repeated in Tableau

I have a BigQuery table with a column of RECORD type & mode REPEATED. I have to query and use this table in Tableau. Using UNNEST or FLATTEN in BigQuery performs a CROSS JOIN on the table, which is impacting performance. Is there any other way to use this table in Tableau without flattening it? I have posted the table schema image link below.
[Schema of Table]
https://i.stack.imgur.com/T4jHg.png
Is there any other way to use ... ?
You should not be afraid of UNNEST just because it "does" a CROSS JOIN.
The trick is that although it is a cross join, it happens within each row only, not globally across all rows of the table. At the same time, there are always ways to do things differently.
So, Example 1 below presents a dummy example using UNNEST.
Example 2 then shows how to do the same without UNNEST, using a SQL UDF instead.
You have not presented specifics about your case, so the examples below are generic enough to show the 'other' way.
With Flattening via UNNEST
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, SUM(t.details) AS details
FROM yourTable, UNNEST(type) AS t
WHERE t.flag = 'y'
GROUP BY id
With SQL UDF
#standardSQL
CREATE TEMP FUNCTION do_something (
type ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
)
RETURNS INT64 AS ((
SELECT SUM(t.details) AS details
FROM UNNEST(type) AS t
WHERE t.flag = 'y'
));
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, do_something(type) AS details
FROM yourTable
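For the dummy rows above, both versions return the same result: details is summed over the array elements whose flag is 'y', i.e. 1 + 3 = 4 for id 1 and just 31 for id 2:
id details
1 4
2 31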

Hive Query Where Json Like or Equals

I'm new to learning about Hive and Hadoop. There is a table that I've created, which references a certain location containing files.
CREATE DATABASE IF NOT EXISTS <dbname>
LOCATION '/user/<username>/hive/<dbname>.db';
USE <dbname>;
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (json STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS Parquet
LOCATION '/my-data/my/files';
This table has four columns: year, month, day, and json.
The json would look something like:
{
"t_id":"user.login",
"e_time":"2014-11-30T23:59:52Z",
"user_email_address":"someemail#email.com",
"la_id":"10",
"dbnum":16,
"remote_ip":"171.154.1.8",
"server_name":"some.server",
"protocol":"IMAPS",
"secure":true,
"result":"success"
}
A basic query, that works, looks something like this:
SELECT json FROM my_table WHERE year=2015 AND month=12 LIMIT 10;
What I would like to do is have a where clause where I could filter on the json fields listed above. I imagine that it would look like the following, but it does not work:
SELECT get_json_object(my_table.json, '$.t_id') as whatever
FROM my_table
WHERE year=2015 AND month=12 AND json like '%user.login%' LIMIT 1;
Or better yet, be able to query based on the json like so:
SELECT COUNT(*)
FROM mytable
WHERE json.t_id = 'user.login'
AND json.someDate > ... and so on...
Any advice is appreciated.
Try this query:
select b.t_id
from my_table a
lateral view json_tuple(a.json, 't_id') b as t_id
where a.year = 2015 and a.month = 12
limit 10;
You can ask json_tuple for additional keys and use them in the where clause as well, e.g.:
select b.t_id
from my_table a
lateral view json_tuple(a.json, 't_id', 'result') b as t_id, result
where a.year = 2015 and a.month = 12 and b.result = 'success'
limit 10;
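If you prefer get_json_object over json_tuple, the equivalent filter (a sketch against the same table) looks like:
select get_json_object(a.json, '$.t_id') as t_id
from my_table a
where a.year = 2015 and a.month = 12
and get_json_object(a.json, '$.result') = 'success'
limit 10;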
You need a JSON SerDe to read data in JSON format. You can create the table with the JSON SerDe and then query it like a normal table.
-- Add jar file using "add jar /path-to/hive-json-serde-0.2.jar"
CREATE EXTERNAL TABLE states_json (state_short_name string, state_full_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/hduser/states.json';
states.json has data like {"state_short_name":"CA", "state_full_name":"California"}
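Once the SerDe jar is added, the JSON fields can be queried like ordinary columns, e.g.:
select state_full_name from states_json where state_short_name = 'CA';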