Group on map value in Hive SQL

I have a table in Hive:
CREATE TABLE IF NOT EXISTS user
(
name STRING,
creation_date DATE,
cards map<STRING,STRING>
) STORED AS PARQUET;
Let's suppose that I want to query the number of Gobelin cards per user and group by them.
My query looks like this:
select cards["Gobelin"], COUNT(*) from user GROUP BY cards["Gobelin"];
I get an error on the GROUP BY saying:
FAILED: SemanticException [Error 10033]: Line 54:30 [] not valid on non-collection types '"Gobelin"': string

As far as I understand, you want to count map elements with the key "Gobelin". You can explode the map, then filter out the other keys and count the remaining values, e.g.
select count(*) as cnt
from (select explode(cards) as (key, val) from user) t
where key = 'Gobelin';
You can use a lateral view as well; see Hive's manual for details.
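For instance, a minimal lateral view sketch, assuming you actually want the count per user (name and cards come from the table above):
SELECT name, COUNT(*) AS gobelin_cnt
FROM user
LATERAL VIEW explode(cards) c AS key, val -- one output row per map entry
WHERE key = 'Gobelin'
GROUP BY name;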

Related

Deduping data in BigQuery

I have a query that shows only the non-duplicate values, and I am looking for a way to use this deduped data in other queries.
I do not have permissions to create anything, so I need a solution that works without creating objects.
IDAN
EDIT (from "answer"):
These are the fields in my table "Purchases": user_id, purchase_amount, purchase_sku, source, device_type, and uuid (a unique identifier for each row).
A row is considered a duplicate when all fields except the uuid are identical. I need to return the deduplicated data and prepare it for use in other queries.
This is the basic data, with duplicated values in rows 5-6 and 7-8.
I want to show the non-duplicate rows, and for each group of duplicated rows show only one row, like this:
deduped data
Consider the generic solution below: you do not need to list all the column names (only uuid is referenced in the query). It groups the rows on a JSON serialization of every column except uuid, then keeps one arbitrary row per group:
select any_value(t).*
from `project.dataset.table` t
group by to_json_string((select as struct * except(uuid) from unnest([t])))
You can use qualify with row_number():
select p.*
from purchases p
where 1=1 -- QUALIFY must be accompanied by a WHERE, GROUP BY or HAVING clause
qualify row_number() over (partition by user_id, purchase_amount, purchase_sku, source, device_type order by uuid) = 1;
You can also use aggregation:
select user_id, purchase_amount, purchase_sku, source, device_type,
min(uuid) as uuid
from purchases
group by 1, 2, 3, 4, 5;
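To sanity-check the logic, here is a hedged, self-contained sketch with inline dummy data (column names as in the question); it applies the row_number() approach to three rows where the last two duplicate each other except for uuid:
with purchases as (
select 1 as user_id, 10.0 as purchase_amount, 'sku1' as purchase_sku, 'web' as source, 'ios' as device_type, 'a' as uuid union all
select 2, 20.0, 'sku2', 'app', 'android', 'b' union all
select 2, 20.0, 'sku2', 'app', 'android', 'c' -- duplicate of the previous row except uuid
)
select p.*
from purchases p
where 1=1
qualify row_number() over (partition by user_id, purchase_amount, purchase_sku, source, device_type order by uuid) = 1;
-- expected: two rows, uuids 'a' and 'b'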

How to delete records in BigQuery based on values in an array?

In Google BigQuery, I would like to delete a subset of records based on the value of a specific column. It's a query that I need to run repeatedly and would like to run automatically.
The problem is that this specific column is of the form STRUCT<column_1 ARRAY<STRING>, column_2 ARRAY<STRING>, ...>, and I don't know how to use such a column in the WHERE clause of a DELETE command.
Here is basically what I am trying to do (this code does not work):
DELETE
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
The error that I'm getting is: Syntax error: Expected end of input but got keyword LEFT at [3:1]
If I replace the DELETE with SELECT *, it does work:
SELECT *
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
AND type = 'some_value'
Does somebody know how to use such a column to delete a subset of records?
EDIT:
Here is some code to create a reproducible example with some silly data (fill in your own dataset and table name in all queries):
Suppose you want to delete all rows where category.type contains the value 'food'.
1 - create a table:
CREATE TABLE <DATASET>.<TABLE_NAME>
(
article STRING,
category STRUCT<
color STRING,
type ARRAY<STRING>
>
);
2 - Insert data into the new table:
INSERT <DATASET>.<TABLE_NAME>
SELECT "apple" AS article, STRUCT('red' AS color, ['fruit','food'] as type) AS category
UNION ALL
SELECT "cabbage" AS article, STRUCT('blue' AS color, ['vegetable', 'food'] as type) AS category
UNION ALL
SELECT "book" AS article, STRUCT('red' AS color, ['object'] as type) AS category
UNION ALL
SELECT "dog" AS article, STRUCT('green' AS color, ['animal', 'pet'] as type) AS category;
3 - Show that select works (return all rows where category.type contains the value 'food'; these are the rows I want to delete):
SELECT *
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Initial Result
4 - My attempt at deleting rows where category.type contains 'food' does not work:
DELETE
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'
Syntax error: Unexpected keyword LEFT at [3:1]
Desired Result
This is the code I used to delete the desired records (the records where category.type contains the value 'food'.)
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS(SELECT 1 FROM UNNEST(t1.category.type) t2 WHERE t2 = 'food')
The embarrassing thing is that I've seen this kind of answer on similar questions (for example on update queries). But I come from Oracle SQL, where I believe you are required to connect the subquery with the main query in the WHERE clause of the subquery (i.e. connect t1 with t2), so I didn't understand those answers. That's why I posted this question.
However, I learned that BigQuery automatically correlates table t1 and 'table' t2; you don't have to connect them explicitly.
Now it is possible to still do this (perhaps even recommended?):
DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS (SELECT 1 FROM <DATASET>.<TABLE_NAME> t2 LEFT JOIN UNNEST(t2.category.type) AS type WHERE type = 'food' AND t1.article=t2.article)
but a second difficulty for me was that the ID in my actual data is hidden inside an ARRAY<STRUCT<...>> construction, so I got stuck connecting t1 and t2. Fortunately, this is not always strictly necessary.
Since you did not provide any sample data, I am going to explain using some dummy data. If you add your sample data, I can update the answer.
Firstly, according to your description, you have only a STRUCT, not an ARRAY<STRUCT<col_1, col_2>>. For this reason, you do not need to use UNNEST to access the values within the data. Below is an example of how to access a particular field within a STRUCT.
WITH data AS (
SELECT 1 AS id, STRUCT("Alex" AS name, 30 AS age, "NYC" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Leo" AS name, 18 AS age, "Sydney" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Robert" AS name, 25 AS age, "Paris" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Mary" AS name, 28 AS age, "London" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Ralph" AS name, 45 AS age, "London" AS city) AS info
)
SELECT * FROM data
WHERE info.city = "London"
Notice that the STRUCT is named info, and the field we accessed, city, is used in the WHERE clause.
Now, in order to delete the rows that contain a specific value within the STRUCT (in your case, I assume it would be your_struct.column_1), you can use either DELETE or MERGE with a DELETE clause. I have saved the above data in a table to execute the examples below, which have the same output.
First method: DELETE
DELETE FROM `project.dataset.table`
WHERE info.city = "Sydney"
Second method: MERGE with DELETE
MERGE `project.dataset.table` a
USING (SELECT * FROM `project.dataset.table` WHERE info.city = "Sydney") b
ON a.info.city = b.info.city
WHEN MATCHED AND b.id = 1 THEN
DELETE
And the output for both queries:
Row  id  info.name  info.age  info.city
1    1   Alex       30        NYC
2    1   Robert     25        Paris
3    1   Ralph      45        London
4    1   Mary       28        London
As you can see the row where info.city = "Sydney" was deleted in both cases.
It is important to point out that DELETE permanently removes the data from your source table, so you should be careful.
Note: since you want to run this process every day, you could use a scheduled query within the BigQuery console, appending or overwriting the results after each run. Also, it is good practice not to delete data from your source table; consider instead creating a new table from your source table without the rows you do not want.
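For instance, a minimal sketch of that non-destructive approach, reusing the asker's example table and the EXISTS trick from above (dataset and table names are placeholders):
CREATE OR REPLACE TABLE <DATASET>.<TABLE_NAME>_filtered AS
SELECT *
FROM <DATASET>.<TABLE_NAME> t
-- keep only rows whose category.type array does not contain 'food'
WHERE NOT EXISTS (SELECT 1 FROM UNNEST(t.category.type) ct WHERE ct = 'food');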

Is there a way to check multiple columns using "IN" condition in Redshift Spectrum?

I have a Redshift Spectrum table named customer_details_table where the column id is not unique. I have another column, hierarchy, which determines which record should be given priority when rows share the same id. Here's an example:
Here, if we encounter the id 28846 multiple times, we choose John as the qualified one, since he has the maximum hierarchy.
I'm trying to create this eligibility column using a group by on id and then selecting the record corresponding to maximum hierarchy. Here's my SQL code:
SELECT *,
CASE WHEN (
(id , hierarchy) IN
(SELECT id , max(hierarchy)
FROM
customer_details_table
GROUP BY id
)
) THEN 'Qualified' ELSE 'Disqualified' END as eligibility
FROM
customer_details_table
Upon running this I get the following error:
SQL Error [500310] [XX000]: [Amazon](500310) Invalid operation: This type of IN/NOT IN query is not supported yet;
The above code works fine when my table (customer_details_table) is a regular Redshift table, but fails when the same table is an external spectrum table. Can anyone please suggest a good solution/alternative to achieve the same logic in spectrum tables?
You can use window functions to generate the eligibility column: partition the rows by id and rank by descending hierarchy within each group.
select
*,
case when row_number() over(partition by id order by hierarchy desc) = 1
then 'Qualified' else 'Disqualified'
end as eligibility
from customer_details_table
Alternatively, if you only need the qualified rows themselves, you can filter with a window function in a subquery:
select cdt.*
from (select cdt.*,
row_number() over (partition by id order by hierarchy desc) as seqnum
from customer_details_table cdt
) cdt
where seqnum = 1;
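A quick hedged sanity check with inline dummy data (the customer_name column is an assumption; only id and hierarchy come from the question):
with customer_details_table as (
select 28846 as id, 'John' as customer_name, 3 as hierarchy union all
select 28846 as id, 'Jane' as customer_name, 2 as hierarchy union all
select 99999 as id, 'Ann' as customer_name, 1 as hierarchy
)
select *,
case when row_number() over (partition by id order by hierarchy desc) = 1
then 'Qualified' else 'Disqualified'
end as eligibility
from customer_details_table;
-- expected: John and Ann are 'Qualified', Jane is 'Disqualified'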

Count values in Oracle table

I have this table which I want to use to store events.
CREATE TABLE EVENTS(
EVENTID INTEGER NOT NULL,
SOURCE VARCHAR2(50),
TYPE VARCHAR2(50),
EVENT_DATE DATE,
DESCRIPTION VARCHAR2(100)
)
I have four types of events: info, warning, error, Critical.
I need to count them in order to display the values in a bar chart.
Is it possible to create an SQL query which returns the four values? For example:
info 12,
warning 332,
error 442,
Critical 23
I need only the type and the count.
It looks like you want a simple aggregation:
SELECT type, count(*)
FROM events
GROUP BY type
ORDER BY (CASE type WHEN 'info' THEN 1
WHEN 'warning' THEN 2
WHEN 'error' THEN 3
WHEN 'Critical' THEN 4
END) ASC
It's not obvious to me whether (or how) you are sorting the data. I would expect you'd want to store a sort order somewhere, so that you don't end up with dozens of queries implementing the same ordering that all have to be changed when you add another type.
You can check the GROUP BY reference for additional information:
SELECT type, count(*)
FROM events
GROUP BY type
You can get the desired output in either of these two ways:
SELECT type,COUNT(*) FROM EVENTS GROUP BY type;
SELECT DISTINCT type, COUNT(*) over(partition BY type) FROM EVENTS;
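A quick hedged check of the first form with inline dummy rows (Oracle syntax; the column list on the WITH clause assumes a reasonably recent Oracle version):
WITH events (type) AS (
SELECT 'info' FROM dual UNION ALL
SELECT 'info' FROM dual UNION ALL
SELECT 'error' FROM dual
)
SELECT type, COUNT(*) AS cnt
FROM events
GROUP BY type;
-- expected: info 2, error 1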

Random sample table with Hive, but including matching rows

I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint that all other rows for those sampled userIDs are selected too.
You can use rand() to split the data randomly while keeping the proper percentage of userIDs in each category. I recommend seeding rand(), because setting the seed makes the results repeatable.
select c.*
from
(select userID
, if(rand(5555) < 0.1, 'test', 'train') as type -- roughly 10% of users land in 'test'
from
(select userID
from mytable
group by userID -- one row per user
) a
) b
right outer join
(select *
from mytable
) c
on b.userID = c.userID
where b.type = 'test'
;
This is set up for entity-level modeling purposes, which is why I have test and train as the types.
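For the original 1% use case, a minimal sketch along the same lines (the table name mytable is a placeholder): sample the distinct userIDs with a seeded rand(), then join back to pick up every row for the sampled users.
select t.*
from mytable t
join
(select userID
from (select distinct userID from mytable) u
where rand(1234) < 0.01 -- keep roughly 1% of users; the seed makes the sample repeatable
) s
on t.userID = s.userID
;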