BigQuery COUNT column per ARRAY element - google-bigquery

I have a table with the following schema
id: STRING, NULLABLE
values: STRING, REPEATED
Row sample:
+--------+--------------------+
| id     | values             |
+--------+--------------------+
| 123abc | [val1, val2, val3] |
+--------+--------------------+
I want to count the number of ids per value.
Output sample:
+-------+----------+
| value | id_count |
+-------+----------+
| val1  | 1        |
| val2  | 1        |
| val3  | 1        |
+-------+----------+
I've created the following query and it's working fine, but I'm looking for a better way of doing it:
SELECT value, COUNT(id) AS id_count
FROM (
  SELECT id, value
  FROM `myproject.mytable`, UNNEST(values) AS value
)
GROUP BY value
I'm trying to reduce the amount of data shuffled between the workers, so I'm looking for a way to get around the UNNEST function.

Consider below:
select value, count(distinct id) id_count
from `myproject.mytable`, unnest(values) AS value
group by value
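Note that the rewrite above drops the unnecessary subquery, but UNNEST itself cannot really be avoided: it is how a REPEATED field is flattened in standard SQL. Also note that COUNT(DISTINCT id) counts each id once per value, whereas the original COUNT(id) counts every occurrence; pick whichever matches your intent. If an exact count is not required, an approximate aggregate can reduce the work further. A minimal sketch, assuming approximate results are acceptable:

-- Sketch: APPROX_COUNT_DISTINCT trades exactness for less memory and shuffle
SELECT value, APPROX_COUNT_DISTINCT(id) AS id_count
FROM `myproject.mytable`, UNNEST(values) AS value
GROUP BY value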

Related

Making intervals on value changes out of an ordered table by value

I have a specific task in PostgreSQL for which I need to come up with a SQL query, but unfortunately I'm not able to get the right result.
Here is an example of the table I have:
+----+--------+------------+
| ID | value  | usage-date |
+----+--------+------------+
| 1  | value1 | 2020-09-10 |
| 1  | value1 | 2020-09-15 |
| 1  | value1 | 2020-09-20 |
| 1  | value1 | 2020-09-23 |
| 1  | value1 | 2020-09-25 |
| 1  | value1 | 2020-09-30 |
| 1  | value2 | 2020-09-15 |
| 1  | value2 | 2020-09-20 |
| 1  | value2 | 2020-09-23 |
| 1  | value2 | 2020-09-25 |
+----+--------+------------+
So the table is ordered by ID, value and usage-date. My task is to extract intervals that tell from which date to which date a certain value was active. So the output would look like the following:
+----+--------+------------+------------+
| ID | value  | start-date | end-date   |
+----+--------+------------+------------+
| 1  | value1 | 2020-09-10 | 2020-09-15 |
| 1  | value2 | 2020-09-15 | 2020-09-20 |
| 1  | value1 | 2020-09-20 | 2020-09-23 |
| 1  | value2 | 2020-09-23 | 2020-09-25 |
| 1  | value1 | 2020-09-25 | 2020-09-30 |
+----+--------+------------+------------+
Does anyone have an idea how to do this?
Here is an SQL fiddle so anyone can try.
Step-by-step demo: db<>fiddle
I guess that the two rows in your example are one for the "start" and one for the "end" of an activity. Because you didn't show any additional identifier, the query is more complex than it has to be: we first have to remove the "end" rows manually.
SELECT
  id,
  value,
  usagedate,
  COALESCE(                                                      -- 4
    lead(usagedate) OVER (PARTITION BY id ORDER BY usagedate),   -- 3
    first_value                                                  -- 4
  ) AS enddate
FROM (
  SELECT
    *,
    row_number() OVER (PARTITION BY id, value ORDER BY usagedate) AS row_number,          -- 1
    first_value(usagedate) OVER (PARTITION BY id ORDER BY usagedate DESC) AS first_value  -- 2
  FROM t
) s
WHERE row_number % 2 = 1 -- 1
1. Add a row number as an identifier to the table. The idea is to remove every second row (the "end" records) later; this can be done with the row_number() window function.
2. We need to keep the last date value (we need it later). This can be done by fetching the first value in descending order (== the last value in normal ascending order), again with an appropriate window function.
3. After removing every second row (see the WHERE clause), we can use the lead() window function to pull the value of the next ordered row into the current one. So we can use the start date of each new activity as the end date of the current one. This is why you don't actually need the "end" rows, and why we deleted them.
4. Only the very last record could be a problem: because there is no next record after the last one, the lead() function returns NULL. To avoid this, we fetched the very last date beforehand. This is what the COALESCE() function achieves: it takes the first non-NULL value in its parameter list, which is the lead value if there is a next row, and the first_value otherwise.
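For reference, a minimal setup to try the query against, assuming the table is named t as in the query (usage-date is spelled usagedate here, since a bare hyphen is not valid in a column name):

CREATE TABLE t (id int, value text, usagedate date);
INSERT INTO t (id, value, usagedate) VALUES
  (1, 'value1', '2020-09-10'), (1, 'value1', '2020-09-15'),
  (1, 'value1', '2020-09-20'), (1, 'value1', '2020-09-23'),
  (1, 'value1', '2020-09-25'), (1, 'value1', '2020-09-30'),
  (1, 'value2', '2020-09-15'), (1, 'value2', '2020-09-20'),
  (1, 'value2', '2020-09-23'), (1, 'value2', '2020-09-25');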
Your input does not coincide with the expected output; there is no clear distinction between value1 and value2 in terms of expected date range, so it seems you just want to brute-force the output.
Rather than making multiple entries for each value, if you want a closed range of dates for that value you can use the query below:
select distinct t.id, t.value, mn.start_date, mx.end_date
from mytable t
inner join (
  select value, min(usagedate) as start_date
  from mytable
  group by value
) mn on t.value = mn.value
inner join (
  select value, max(usagedate) as end_date
  from mytable
  group by value
) mx on t.value = mx.value

Oracle SQL: Counting how often an attribute occurs for a given entry and choosing the attribute with the maximum number of occurs

I have a table that has a number column and an attribute column like this:
1.
+-----+-----+
| num | att |
+-----+-----+
| 1   | a   |
| 1   | b   |
| 1   | a   |
| 2   | a   |
| 2   | b   |
| 2   | b   |
+-----+-----+
I want to make the number unique, and the attribute to be whichever attribute occurred most often for that number, like this (this is the end product I'm interested in):
2.
+-----+-----+
| num | att |
+-----+-----+
| 1   | a   |
| 2   | b   |
+-----+-----+
I have been working on this for a while and managed to write myself a query that looks up how many times an attribute occurs for a given number like this:
3.
+-----+-----+-------+
| num | att | count |
+-----+-----+-------+
| 1   | a   | 2     |
| 1   | b   | 1     |
| 2   | a   | 1     |
| 2   | b   | 2     |
+-----+-----+-------+
But I can't think of a way to only select those rows from the above table where the count is the highest (for each number of course).
So basically what I am asking is: given table 3, how do I select only the rows with the highest count for each number? (Of course, an answer providing a way to get from table 1 to table 2 directly also works. :) )
You can use aggregation and window functions:
select num, att
from (
  select num, att,
         row_number() over (partition by num order by count(*) desc, att) as rn
  from mytable
  group by num, att
) t
where rn = 1
For each num, this brings the most frequent att; if there are ties, the smaller att is retained.
Oracle has an aggregation function that does this, stats_mode():
select num, stats_mode(att)
from t
group by num;
In statistics, the most common value is called the mode -- hence the name of the function.
Here is a db<>fiddle.
You can use group by and count as below; this produces the per-(num, att) counts shown in table 3:
select num, att, count(att) as cnt
from mytable
group by num, att

Selecting unique rows in groups with missing values

I have a table with two columns, where the values of one of the columns can be missing. The first column is ID, the second column is value.
I want to select rows for unique IDs such that if there are multiple rows with the same ID but some of them have a missing value, then one of the rows with an existing value is returned. If all rows with that ID have an empty value, then any one of them is returned.
In other words, as long as two rows have the same ID they belong to the same group. But within each group, return the row that has a value if there is one.
For example,
Input table:
+----+-------+
| ID | VALUE |
+----+-------+
| x  | 1     |
| x  | 1     |
| y  | 2     |
| y  |       |
| z  |       |
| z  |       |
+----+-------+
Should return:
+----+-------+
| ID | VALUE |
+----+-------+
| x  | 1     |
| y  | 2     |
| z  |       |
+----+-------+
From your description, you can just use max():
select id, max(value)
from t
group by id;
If you have additional columns that you want, then use row_number():
select t.*
from (select t.*,
             -- rows with a value get 0 and sort first within each id
             row_number() over (partition by id order by (case when value is not null then 0 else 1 end)) as seqnum
      from t
     ) t
where seqnum = 1;
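In databases that support the standard NULLS LAST ordering (PostgreSQL and Oracle, for example; an assumption, since the question does not name the engine), the CASE expression can be replaced with a simpler sort. A sketch under that assumption:

select t.*
from (select t.*,
             -- non-NULL values sort first, so seqnum = 1 has a value when one exists
             row_number() over (partition by id order by value nulls last) as seqnum
      from t
     ) t
where seqnum = 1;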
You can use the DISTINCT keyword in Hive/SQL:
hive> select distinct id, value from <db_name>.<table_name>;
The above query returns the distinct combinations of the id and value columns.
hive> select distinct * from <db_name>.<table_name>;
The above statement returns only distinct (different) rows based on all columns.
You can easily divide your query into two queries:
A: find unique rows with DISTINCT ON (ID, VALUE) that have a non-empty VALUE
B: find unique rows with DISTINCT ON (ID) that have an empty VALUE and whose ID is not among A's IDs
The result is A U (B - A), as sketched below.
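A minimal sketch of that idea, assuming PostgreSQL (DISTINCT ON is PostgreSQL-specific) and a table t(id, value):

-- A: one row per id among the rows that have a value
(select distinct on (id) id, value
 from t
 where value is not null)
union all
-- B - A: ids that only ever appear with an empty value
(select distinct on (id) id, value
 from t
 where id not in (select id from t where value is not null))
order by id;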

Getting values from all JSON fields in postgres

I need a SQL query where I can select all JSON keys. The following query lets me get all the keys of the JSON field, but I'm a bit at a loss how I would go about making a query to also get all the values out too.
SELECT DISTINCT ON (key.*) key.*
FROM my_table,
jsonb_object_keys(my_table.json_field) as key
So the result of the above query would simply be:
key1
key2
With the following query you would get a result similar to this:
SELECT * FROM my_table
| id | json_field           |
| -- | -------------------- |
| 1  | '{"key1": "value1"}' |
| 2  | '{"key2": "value2"}' |
The result I'm looking for would be the following:
| id | key1   | key2   |
| -- | ------ | ------ |
| 1  | value1 | null   |
| 2  | null   | value2 |
What makes it difficult is that I don't know the names of all the keys, and there may be a lot of keys for a single row.
select distinct on (field_1, field_2) id, field_1, field_2
from
  my_table,
  jsonb_to_record(json_field) as jprs (field_1 int, field_2 text)
https://www.postgresql.org/docs/current/static/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE
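If the key names are not known in advance, another option (a minimal sketch, assuming the my_table/json_field schema above) is jsonb_each_text, which returns one (key, value) row per JSON field without having to name any columns. Pivoting that long format into one column per key, as in the desired output, would require dynamic SQL, since a SQL query needs a fixed column list:

-- One row per (id, key, value) triple; keys do not need to be known up front
select id, kv.key, kv.value
from my_table,
     jsonb_each_text(my_table.json_field) as kv;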

How to use DISTINCT ON (of PostgreSQL) in Firebird?

I have a TempTable with data:
+-------+-------+--------+---------+
| KEY_1 | KEY_2 | NAME   | VALUE   |
+-------+-------+--------+---------+
| 1     | 0001  | NAME 2 | VALUE 1 |
| 1     | 0002  | NAME 1 | VALUE 3 |
| 1     | 0003  | NAME 3 | VALUE 2 |
| 2     | 0001  | NAME 1 | VALUE 2 |
| 2     | 0001  | NAME 2 | VALUE 1 |
+-------+-------+--------+---------+
I want to get the following data:
+-------+-------+--------+---------+
| KEY_1 | KEY_2 | NAME   | VALUE   |
+-------+-------+--------+---------+
| 1     | 0001  | NAME 2 | VALUE 1 |
| 2     | 0001  | NAME 1 | VALUE 2 |
+-------+-------+--------+---------+
In PostgreSQL, I use a query with DISTINCT ON:
SELECT DISTINCT ON (KEY_1) KEY_1, KEY_2, NAME, VALUE
FROM TempTable
ORDER BY KEY_1, KEY_2
In Firebird, how can I get data like the above?
PostgreSQL's DISTINCT ON takes the first row per stated group key considering the ORDER BY clause. In other DBMS (including later versions of Firebird), you'd use ROW_NUMBER for this. You number the rows per group key in the desired order and stay with those numbered #1.
select key_1, key_2, name, value
from (
  select key_1, key_2, name, value,
         row_number() over (partition by key_1 order by key_2) as rn
  from temptable
) numbered
where rn = 1
order by key_1, key_2;
In your example you have a tie (key_1 = 2 / key_2 = 0001 occurs twice) and the DBMS picks one of the rows arbitrarily. (You'd have to extend the sortkey both in DISTINCT ON and ROW_NUMBER to decide which to pick.) If you want two rows, i.e. showing all tied rows, you'd use RANK (or DENSE_RANK) instead of ROW_NUMBER, which is something DISTINCT ON is not capable of.
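A sketch of the RANK variant, which keeps all tied rows (same shape as the query above):

select key_1, key_2, name, value
from (
  select key_1, key_2, name, value,
         -- rank() assigns the same number to ties, so both tied rows keep rank 1
         rank() over (partition by key_1 order by key_2) as rnk
  from temptable
) ranked
where rnk = 1
order by key_1, key_2;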
Firebird 3.0 supports window functions, so you can use:
select . . .
from (select t.*,
             row_number() over (partition by key_1 order by key_2) as seqnum
      from temptable t
     ) t
where seqnum = 1;
In earlier versions, you can use several methods. Here is a correlated subquery:
select t.*
from temptable t
where t.key_2 = (select min(t2.key_2)
                 from temptable t2
                 where t2.key_1 = t.key_1
                );
Note: This will still return duplicate values for key_1 because of the duplicates for key_2. Alas . . . getting just one row is tricky unless you have a unique identifier for each row.