BigQuery: fill null for missing column in table

I have a query that shows a daily aggregation for some metrics, something like:
select date(timestamp), metric, count(*) from aggs GROUP BY 1,2 ORDER BY 1,2;
The problem is that sometimes a certain metric is missing for a certain day, like:
date | metric | count
03/01 | B | 50
03/02 | A | 60
03/02 | B | 10
03/02 | C | 70
03/03 | C | 10
I want to fill in 0 or null for each missing date/metric pair, i.e. how can I get something like:
date | metric | count
03/01 | A | 0
03/01 | B | 50
03/01 | C | 0
03/02 | A | 60
03/02 | B | 10
03/02 | C | 70
03/03 | A | 0
03/03 | B | 0
03/03 | C | 10

You can generate the rows using a cross join and then fill in the values using a left join:
select date, metric, coalesce(t.count, 0)
from (select distinct date from t) d cross join
     (select distinct metric from t) m left join
     t
     using (date, metric);
If you don't have all of the dates you want in the data, you can generate them instead:
unnest(generate_date_array(<date1>, <date2>, interval 1 day)) as dte
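Putting the two pieces together, a minimal sketch (assuming the raw table is aggs as in the question; the hard-coded date range is a hypothetical stand-in for <date1>/<date2>):
WITH agg AS (
  SELECT DATE(timestamp) AS date, metric, COUNT(*) AS cnt
  FROM aggs
  GROUP BY 1, 2
)
SELECT d AS date, m.metric, COALESCE(a.cnt, 0) AS cnt
FROM UNNEST(GENERATE_DATE_ARRAY('2020-03-01', '2020-03-03', INTERVAL 1 DAY)) AS d
CROSS JOIN (SELECT DISTINCT metric FROM agg) m
LEFT JOIN agg a ON a.date = d AND a.metric = m.metric
ORDER BY 1, 2;
Every date/metric pair in the range comes back, with 0 where no rows were aggregated.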

There are several ways to fill gaps in a time series in BigQuery. If query performance is not an issue, the easiest is:
WITH original_result AS (
  SELECT date(timestamp) AS date, metric, count(*) AS cnt
  FROM aggs
  GROUP BY 1, 2
)
SELECT
  *
FROM
  UNNEST(
    GENERATE_DATE_ARRAY(<start_date>, <end_date>, INTERVAL 1 DAY)
  ) AS date
LEFT JOIN original_result USING (date)
ORDER BY 1, 2
Note that this fills in missing dates only; to also fill in missing metrics for each date, cross join the generated dates with the distinct metrics first, as in the answer above.

Related

BigQuery: Joining 2 tables, one having repeated records and one with count()

I want to join the tables after unnesting the arrays in Table 1, but the records get duplicated after the join because of the unnest.
Table:1
| a | d.b | d.c |
-----------------
| 1 | 5 | 2 |
- -------------
| | 3 | 1 |
-----------------
| 2 | 2 | 1 |
Table:2
| a | c | f |
-----------------
| 1 | 12 | 13 |
-----------------
| 2 | 14 | 15 |
I want to join tables 1 and 2 on a, but I also need the output to be:
| a | d.b | d.c | f | h | Sum(count(a))
---------------------------------------------
| 1 | 5 | 2 | 13 | 12 |
- ------------- - - 1
| | 3 | 1 | | |
---------------------------------------------
| 2 | 2 | 1 | 15 | 14 | 1
a can be repeated in table 2, so I need to count(a) and then select the sum after the join.
My problem: when joining, I need the nested and repeated records to stay the same as in the first table, but aggregation can't GROUP BY structs or arrays, so I UNNEST the records first and then use ARRAY_AGG. But then the sum came out wrong.
SELECT
  t1.a,
  t2.f,
  t2.h,
  ARRAY_AGG(DISTINCT t1.db) AS db,
  ARRAY_AGG(DISTINCT t1.dc) AS dc,
  SUM(t2.total) AS total
FROM (
  SELECT
    a,
    d.b AS db,
    d.c AS dc
  FROM
    `table1`,
    UNNEST(d) AS d
) AS t1
LEFT JOIN (
  SELECT
    a,
    f,
    h,
    COUNT(*) AS total
  FROM
    `table2`
  GROUP BY
    a, f, h
) AS t2
ON
  t1.a = t2.a
GROUP BY
  1, 2, 3
Note: the error is in the total number after the sum; it is much higher than expected. All other data are correct.
I guess your table 2 is not unique in column a.
Let's assume that table 2 looks like this:
| a | c   | f   |
|---|-----|-----|
| 1 | 12  | 13  |
| 2 | 14  | 15  |
| 1 | 100 | 101 |
There are two rows where a is 1. Since c and f are different, the grouping does not solve this (GROUP BY a, f, h in the t2 subquery), and COUNT(*) AS total is 1 for each row.
| a | c   | f   | total |
|---|-----|-----|-------|
| 1 | 12  | 13  | 1     |
| 2 | 14  | 15  | 1     |
| 1 | 100 | 101 | 1     |
In the next step you join this table to your table 1. The rows of table 1 with value 1 in column a are duplicated, because table 2 has two entries for it. This leads to the sum being too high.
Instead of unnesting the tables, I recommend the following approach:
-- Creating sample data as given:
with tbl_A as (select 1 a, [struct(5 as b, 2 as c), struct(3, 1)] d
               union all select 2, [struct(2, 1)]
               union all select null, [struct(50, 51)]),
     tbl_B as (select 1 as a, 12 b, 13 f
               union all select 2, 14, 15
               union all select 1, 100, 101
               union all select null, 500, 501)
-- Query:
select *
from tbl_A A
left join
     (select a, array_agg(struct(b, f)) as B, count(1) as counts
      from tbl_B
      group by 1) B
on ifnull(A.a, -9) = ifnull(B.a, -9)
The ifnull(..., -9) comparison makes the join NULL-safe, since a plain = never treats two NULLs as equal; the sentinel -9 is assumed not to occur as a real value of a.
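To reproduce the inflation described above, you can join the same sample CTEs the problematic way (a sketch reusing tbl_A and tbl_B; the true count for a = 1 in tbl_B is 2):
-- tbl_A unnests to two rows for a = 1, and the grouped tbl_B still has
-- two rows for a = 1, so the join yields 4 rows and SUM(total) = 4.
select t1.a, sum(t2.total) as inflated_total
from (select a, d.b as db, d.c as dc from tbl_A, unnest(d) as d) t1
left join (select a, f, count(*) as total from tbl_B group by a, f) t2
  on t1.a = t2.a
group by t1.a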

Psql - generate series with running total

I have the following table:
create table account_info(
id int not null unique,
creation_date date,
deletion_date date,
gather boolean)
Adding sample data to it:
insert into account_info(id,creation_date,deletion_date,gather)
values(1,'2019-09-10',null,true),
(2,'2019-09-12',null,true),
(3,'2019-09-14','2019-10-08',true),
(4,'2019-09-15','2019-09-18',true),
(5,'2019-09-22',null,false),
(6,'2019-09-27','2019-09-29',true),
(7,'2019-10-04','2019-10-17',false),
(8,null,'2019-10-20',true),
(9,'2019-10-12',null,true),
(10,'2019-10-18',null,true)
I would like to see how many accounts have been added grouped by week and how many accounts have been deleted grouped by week.
I have tried the following:
select dd, count(distinct ai.id) as created ,count(distinct ai2.id) as deleted
from generate_series('2019-09-01'::timestamp,
'2019-10-21'::timestamp, '1 week'::interval) dd
left join account_info ai on ai.creation_date::DATE <= dd::DATE
left join account_info ai2 on ai2.deletion_date::DATE <=dd::DATE
where ai.gather is true
and ai2.gather is true
group by dd
order by dd asc
This produces the following output:
| dd         | Created | Deleted |
+------------+---------+---------+
| 2019-09-22 | 4       | 1       |
| 2019-09-29 | 5       | 2       |
| 2019-10-06 | 5       | 2       |
| 2019-10-13 | 6       | 3       |
| 2019-10-20 | 7       | 4       |
This output shows me the running total of how many accounts have been created and how many have been deleted.
I would like to see however something like this:
+------------+---------+---------+-------------------+-------------------+
| dd | Created | Deleted | Total Sum Created | Total Sum Deleted |
+------------+---------+---------+-------------------+-------------------+
| 2019-09-22 | 4 | 1 | 4 | 1 |
| 2019-09-29 | 1 | 1 | 5 | 2 |
| 2019-10-06 | NULL | NULL | 5 | 2 |
| 2019-10-13 | 1 | 1 | 6 | 3 |
| 2019-10-20 | 1 | 1 | 7 | 4 |
I get an error message when trying to sum up the created and deleted columns in psql, as I cannot nest aggregate functions.
You could just turn your existing query into a subquery and use lag() to compute the difference between consecutive records:
select
dd,
created - coalesce(lag(created) over(order by dd), 0) created,
deleted - coalesce(lag(deleted) over(order by dd), 0) deleted,
created total_sum_created,
deleted total_sum_deleted
from (
select
dd,
count(distinct ai.id) as created ,
count(distinct ai2.id) as deleted
from
generate_series(
'2019-09-01'::timestamp,
'2019-10-21'::timestamp,
'1 week'::interval
) dd
left join account_info ai
on ai.creation_date::DATE <= dd::DATE and ai.gather is true
left join account_info ai2
on ai2.deletion_date::DATE <=dd::DATE and ai2.gather is true
group by dd
) x
order by dd asc
I moved the conditions ai.gather is true and ai2.gather is true to the ON side of the joins: putting these conditions in the WHERE clause basically turns your LEFT JOINs into INNER JOINs, because unmatched rows have NULL gather and get filtered out.
Demo on DB Fiddle:
| dd | created | deleted | total_sum_created | total_sum_deleted |
| ------------------------ | ------- | ------- | ----------------- | ----------------- |
| 2019-09-01T00:00:00.000Z | 0 | 0 | 0 | 0 |
| 2019-09-08T00:00:00.000Z | 0 | 0 | 0 | 0 |
| 2019-09-15T00:00:00.000Z | 4 | 0 | 4 | 0 |
| 2019-09-22T00:00:00.000Z | 0 | 1 | 4 | 1 |
| 2019-09-29T00:00:00.000Z | 1 | 1 | 5 | 2 |
| 2019-10-06T00:00:00.000Z | 0 | 0 | 5 | 2 |
| 2019-10-13T00:00:00.000Z | 1 | 1 | 6 | 3 |
| 2019-10-20T00:00:00.000Z | 1 | 1 | 7 | 4 |
Another option would be to use lag() in combination with generate_series() to generate a list of date ranges. Then you can do just one join on the original table, and do conditional aggregation in the outer query:
select
dd,
count(distinct case
when ai.creation_date::date <= dd::date and ai.creation_date::date > lag_dd::date
then ai.id
end) created,
count(distinct case
when ai.deletion_date::date <= dd::date and ai.deletion_date::date > lag_dd::date
then ai.id
end) deleted,
count(distinct case
when ai.creation_date::date <= dd::date
then ai.id
end) total_sum_created,
count(distinct case
when ai.deletion_date::date <= dd::date
then ai.id
end) total_sum_deleted
from
(
select dd, lag(dd) over(order by dd) lag_dd
from generate_series(
'2019-09-01'::timestamp,
'2019-10-21'::timestamp,
'1 week'::interval
) dd
) dd
left join account_info ai on ai.gather is true
group by dd
order by dd
Demo on DB Fiddle
A lateral join plus aggregation is very well suited to this problem. If you are content with the weeks that actually appear in the data:
select date_trunc('week', dte) as week,
sum(is_create) as creates_in_week,
sum(is_delete) as deletes_in_week,
sum(sum(is_create)) over (order by min(v.dte)) as running_creates,
sum(sum(is_delete)) over (order by min(v.dte)) as running_deletes
from account_info ai cross join lateral
(values (ai.creation_date, 1, 0), (ai.deletion_date, 0, 1)
) v(dte, is_create, is_delete)
where v.dte is not null and ai.gather
group by week
order by week;
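The cross join lateral (values ...) step is what makes this work: it unpivots each account row into two event rows, one for creation and one for deletion. Stripped of the aggregation, that step alone looks like this (a sketch over the same account_info table):
select ai.id, v.dte, v.is_create, v.is_delete
from account_info ai cross join lateral
     (values (ai.creation_date, 1, 0), (ai.deletion_date, 0, 1)
     ) v(dte, is_create, is_delete);
Each account contributes one row carrying its creation_date and one carrying its deletion_date, which a single GROUP BY can then count.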
If you want it for a specified set of weeks:
select gs.wk,
       sum(v.is_create) as creates_in_week,
       sum(v.is_delete) as deletes_in_week,
       sum(sum(v.is_create)) over (order by gs.wk) as running_creates,
       sum(sum(v.is_delete)) over (order by gs.wk) as running_deletes
from generate_series('2019-09-01'::timestamp,
                     '2019-10-21'::timestamp, '1 week'::interval) gs(wk) left join
     ( account_info ai cross join lateral
       (values (ai.creation_date, 1, 0), (ai.deletion_date, 0, 1)
       ) v(dte, is_create, is_delete)
     )
     on ai.gather and
        v.dte >= gs.wk and
        v.dte < gs.wk + interval '1 week'
group by gs.wk
order by gs.wk;
Note that the gather and date-range tests sit in the ON clause rather than in WHERE, so weeks with no events survive the left join.
Here is a db<>fiddle.
You can generate the results you want using a series of CTEs to build up the data tables:
with dd as
(select *
from generate_series('2019-09-01'::timestamp,
'2019-10-21'::timestamp, '1 week'::interval) d),
ddl as
(select d, coalesce(lag(d) over (order by d), '1970-01-01'::timestamp) as pd
from dd),
counts as
(select d, count(distinct ai.id) as created, count(distinct ai2.id) as deleted
from ddl
left join account_info ai on ai.creation_date::DATE > ddl.pd::DATE AND ai.creation_date::DATE <= ddl.d::DATE AND ai.gather is true
left join account_info ai2 on ai2.deletion_date::DATE > ddl.pd::DATE AND ai2.deletion_date::DATE <= ddl.d::DATE AND ai2.gather is true
group by d)
select d, created, deleted,
sum(created) over (order by d rows unbounded preceding) as "total created",
sum(deleted) over (order by d rows unbounded preceding) as "total deleted"
from counts
order by d asc
Note that the gather condition needs to be part of the left join to avoid turning those into inner joins.
Output:
d created deleted total created total deleted
2019-09-01 00:00:00 0 0 0 0
2019-09-08 00:00:00 0 0 0 0
2019-09-15 00:00:00 4 0 4 0
2019-09-22 00:00:00 0 1 4 1
2019-09-29 00:00:00 1 1 5 2
2019-10-06 00:00:00 0 0 5 2
2019-10-13 00:00:00 1 1 6 3
2019-10-20 00:00:00 1 1 7 4
Note this query gives the results for the week ending with d. If you want results for the week starting with d, the lag can be changed to lead. You can see this in my demo.
Demo on dbfiddle

PostgreSQL: SQL request with a GROUP BY and a percentage on two different tables

I'm currently stuck on a complex query (with a join):
I have this table "DATA":
order | product
----------------
1 | A
1 | B
2 | A
2 | D
3 | A
3 | C
4 | A
4 | B
5 | Y
5 | Z
6 | W
6 | A
7 | A
And this table "DICO":
order | couple | first | second
-------------------------------
1 | A-B | A | B
2 | A-D | A | D
3 | A-C | A | C
4 | A-B | A | B
5 | Y-Z | Y | Z
6 | W-A | W | A
I would like to obtain, on one line:
couple | count | total1stElem | %1stElem | total2ndElem | %2ndElem
------------------------------------------------------------------
A-B | 2 | 6 | 33% | 2 | 100%
A-D | 1 | 6 | 16% | 1 | 100%
A-C | 1 | 6 | 16% | 1 | 100%
Y-Z | 1 | 1 | 100% | 1 | 100%
W-A | 1 | 1 | 100% | 6 | 16%
Information (fields, using the first line as an example):
total1stElem : count of ALL 'A' in table DATA (all occurrences of A in DATA)
total2ndElem : count of ALL 'B' in table DATA (all occurrences of B in DATA)
count : the number of 'A-B' occurrences in table DICO
%1stElem = ( count / total1stElem ) * 100
%2ndElem = ( count / total2ndElem ) * 100
I'm starting from this query:
select couple, count(*),
sum(count(*)) over (partition by first) as total,
(count(*) * 1.0 / sum(count(*)) over (partition by first) ) as ratio
from dico1
group by couple, first ORDER BY ratio DESC;
And I want to do something like:
select couple, count(*) as COUNT,
count(*) over (partition by product #FROM DATA WHERE product = first#) as total1stElem,
(count(*) * 1.0 / sum(count(*)) over (partition by product #FROM DATA WHERE product = first#) as %1stElem
count(*) over (partition by product #FROM DATA WHERE product = second#) as total2ndElem,
(count(*) * 1.0 / sum(count(*)) over (partition by product #FROM DATA WHERE product = second#) as %2ndElem
from dico1
group by couple, first ORDER BY COUNT DESC;
I'm totally stuck on the join part of my query. Can somebody help me? I've been helped with this kind of request on Oracle, but unfortunately the UNPIVOT and PIVOT functions can't be carried over to PostgreSQL.
I'd create CTEs that aggregate each table and count the occurrences you listed, and join dico's aggregation on data's aggregation twice, once for first and once for second:
WITH data_count AS (
SELECT product, COUNT(*) AS product_count
FROM data
GROUP BY product
),
dico_count AS (
SELECT couple, first, second, COUNT(*) AS dico_count
FROM dico
GROUP BY couple, first, second
)
SELECT couple,
dico_count,
data1.product_count AS total1stElem,
TRUNC(dico_count * 100.0 / data1.product_count) AS percent1stElem,
data2.product_count AS total2ndElem,
TRUNC(dico_count * 100.0 / data2.product_count) AS percent2ndElem
FROM dico_count dico
JOIN data_count data1 ON dico.first = data1.product
JOIN data_count data2 ON dico.second = data2.product
ORDER BY 1
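With the sample data above, this should return exactly the expected figures (TRUNC drops the fractional part, hence 33 and 16):
couple | dico_count | total1stElem | percent1stElem | total2ndElem | percent2ndElem
------------------------------------------------------------------------------------
A-B    | 2          | 6            | 33             | 2            | 100
A-C    | 1          | 6            | 16             | 1            | 100
A-D    | 1          | 6            | 16             | 1            | 100
W-A    | 1          | 1            | 100            | 6            | 16
Y-Z    | 1          | 1            | 100            | 1            | 100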

sqlite self join query using max()

Given the following table:
| id | user_id | score | date |
|----|---------|-------|------------|
| 1 | 1 | 1 | 2017-08-31 |
| 2 | 1 | 1 | 2017-09-01 |
| 3 | 2 | 2 | 2017-09-01 |
| 4 | 1 | 2 | 2017-09-02 |
| 5 | 2 | 2 | 2017-09-02 |
| 6 | 3 | 1 | 2017-09-02 |
Need to find the user_ids that have the max score for any given date (there can be more than one), so I'm trying:
SELECT s1.user_id
FROM (
SELECT max(score) as max, user_id, date
FROM scores
) AS s2
INNER JOIN scores as s1
ON s1.date = '2017-08-31'
AND s1.score = s2.max
The query returns correct results for the last two dates but returns 0 records for the first date ('2017-08-31'); it should return the score of 1.
Why won't that first date return correctly and/or is there a more elegant way of writing this query?
Here is the version of the query that comes closest to working, though it does not work when there is only one test score. I do not understand how I am getting away with not using the GROUP BY clause in the aggregate.
SELECT s1.user_id
FROM (
SELECT max(score) as max, user_id, date
FROM scores
) AS s2
INNER JOIN scores as s1
ON s1.date = :date
AND s1.score = s2.max
A correct query option is:
SELECT user_id
FROM scores
WHERE date = '2017-08-31'
AND score = (SELECT MAX(score) FROM scores WHERE date = '2017-08-31')
Note that the core issue with your query (and probably your problem) is that the user_id and date in the subquery are not related to the row that contains MAX(score), since you don't have any GROUP BY clause to force grouping.
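If you want the top scorer(s) for every date at once, rather than one hard-coded date, a correlated subquery is a natural extension (a sketch against the same scores table):
SELECT s.date, s.user_id, s.score
FROM scores s
WHERE s.score = (SELECT MAX(score) FROM scores WHERE date = s.date)
ORDER BY s.date;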

Hive query selecting top 2 rows by percentage count and display as columns

I have a table something like the one below in my Hadoop cluster:
ID | CATEGORY | COUNT
101 | A | 40
101 | B | 40
101 | C | 20
102 | D | 10
102 | A | 20
102 | E | 30
102 | F | 40
I have to write a Hive query which will show IDs and top 2 categories by percentage count as columns. So my result table should look like
ID | CAT1 | % | CAT2 | %
101 | A | 40 | B | 40
102 | F | 40 | E | 30
Please keep in mind that this is only a sample table which I have kept very simple for explanation purposes.
To get the top 2 per ID, you can use the rank() window function.
To get the percentage of the overall count, you can join on ID with an aggregated table:
select ID, sum(`count`) as total from input_table group by ID
And finally, if you want to turn the (ID, Cat, %) rows into one row per ID, you can use collect_list for Cat and % in a subquery and then create a column for each array element:
SELECT ID, categories[0], pcts[0], categories[1], pcts[1]
FROM (
  SELECT a.ID, collect_list(Cat) AS categories, collect_list(`count` / total) AS pcts
  FROM (SELECT ID, Cat, `count`
        FROM (SELECT ID, Cat, `count`,
                     rank() OVER (PARTITION BY ID ORDER BY `count` DESC) AS rnk
              FROM input_table) ranked
        WHERE rnk <= 2) a,
       (SELECT ID, sum(`count`) AS total FROM input_table GROUP BY ID) b
  WHERE a.ID = b.ID
  GROUP BY a.ID
) t;
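One caveat: rank() gives ties the same rank (for ID 101, A and B both rank 1), so rnk <= 2 can return more than two categories per ID when three or more counts tie. If you need exactly two columns regardless of ties, a row_number() variant is safer (a sketch; drop-in replacement for the window function above):
row_number() OVER (PARTITION BY ID ORDER BY `count` DESC) AS rnk
Also be aware that collect_list does not guarantee element order across reducers, so if CAT1 must strictly be the larger category, sort the collected values explicitly before indexing into them.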