Generating value rows in between dates - SQL

I have a data table that lists id changes on a given date. Structure is the following (Table A):
+--------+------------+-------------+-----------------+------------+
| person | current_id | previous_id | action          | date       |
+--------+------------+-------------+-----------------+------------+
| A      | 1          | 0           | 'id assignment' | 2019-01-01 |
| B      | 2          | 1           | 'id change'     | 2019-01-03 |
| A      | 2          | 1           | 'id change'     | 2019-01-02 |
| C      | 4          | 2           | 'id change'     | 2019-01-03 |
| ...    | ...        | ...         | ...             | ...        |
+--------+------------+-------------+-----------------+------------+
However, Table A provides a date only if there was a change on that date.
For a traceability study, I am trying to create a data table (Table B below) from Table A, in which every day lists the corresponding id for each person present in the table (using Hive).
Something like this (Table B):
+------------+--------+-----+
| date       | person | id  |
+------------+--------+-----+
| 2019-01-01 | A      | 1   |
| 2019-01-01 | B      | 1   |
| 2019-01-01 | C      | 2   |
| 2019-01-02 | A      | 2   |
| 2019-01-02 | B      | 1   |
| 2019-01-02 | C      | 2   |
| 2019-01-03 | A      | 2   |
| 2019-01-03 | B      | 2   |
| 2019-01-03 | C      | 4   |
| ...        | ...    | ... |
+------------+--------+-----+
All I can do is get the time-independent current ids for the people mentioned. I have no idea where to start on generating the output table; I cannot establish the logic.
Thanks in advance for your help!

First, you need to generate the rows. Assuming that you have at least one change on each day, you can use a cross join.
Then you need to impute the value on each day. The simplest method would use lag() with the ignore nulls option, but I don't think Hive supports this (see the sketch at the end of this answer for engines that do).
Instead, two levels of window functions can work:
select person, date,
       coalesce(current_id,
                max(current_id) over (partition by person, id_date)
               ) as id
from (select p.person, d.date, a.current_id,
             max(case when a.current_id is not null then d.date end)
                 over (partition by p.person order by d.date) as id_date
      from (select distinct person from tablea a) p cross join
           (select distinct date from tablea a) d left join
           tablea a
           on p.person = a.person and d.date = a.date
     ) pd;
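Here the cross join builds one row per (person, date); id_date marks, for each of those rows, the most recent date on which that person actually had a change; and the outer max(...) copies the id recorded on that change date onto every later day sharing the same id_date.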
If you cannot use cross join, perhaps this will work:
from (select distinct person, 1 as joinkey from tablea a) p join
     (select distinct date, 1 as joinkey from tablea a) d
     on p.joinkey = d.joinkey left join
     tablea a
     on p.person = a.person and d.date = a.date
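For engines that do support skipping nulls in window functions (for example, BigQuery and Oracle), the imputation step collapses into a single window call. A minimal sketch against the same tablea, assuming last_value(... ignore nulls) is available:
select p.person, d.date,
       last_value(a.current_id ignore nulls) over (
           partition by p.person
           order by d.date
           rows between unbounded preceding and current row
       ) as id
from (select distinct person from tablea) p cross join
     (select distinct date from tablea) d left join
     tablea a
     on p.person = a.person and d.date = a.date;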

Related

BigQuery: Joining 2 tables, one having repeated records and one with count()

I want to join the tables after unnesting the arrays in Table 1, but the records are duplicated after the join because of the unnest.
Table 1:
| a | d.b | d.c |
-----------------
| 1 | 5   | 2   |
|   | 3   | 1   |
-----------------
| 2 | 2   | 1   |
-----------------
Table 2:
| a | c  | f  |
---------------
| 1 | 12 | 13 |
---------------
| 2 | 14 | 15 |
I want to join tables 1 and 2 on a, but I also need the output to look like this:
| a | d.b | d.c | f  | h  | sum(count(a)) |
-------------------------------------------
| 1 | 5   | 2   | 13 | 12 | 1             |
|   | 3   | 1   |    |    |               |
-------------------------------------------
| 2 | 2   | 1   | 15 | 14 | 1             |
a can be repeated in table 2; because of that I need to count(a) and then select the sum after the join.
My problem is that when I join, I need the nested and repeated records to stay the same as in the first table, but I can't group by structs or arrays when aggregating to get the sum. So I UNNEST the records first and then use ARRAY_AGG, but there was still an issue with the sum.
SELECT
  t1.a,
  t2.f,
  t2.h,
  ARRAY_AGG(DISTINCT t1.db) AS db,
  ARRAY_AGG(DISTINCT t1.dc) AS dc,
  SUM(t2.total) AS total
FROM (
  SELECT
    a,
    d.b AS db,
    d.c AS dc
  FROM
    `table1`,
    UNNEST(d) AS d
) AS t1
LEFT JOIN (
  SELECT
    a,
    f,
    h,
    COUNT(*) AS total
  FROM
    `table2`
  GROUP BY
    a, f, h
) AS t2
ON
  t1.a = t2.a
GROUP BY
  1, 2, 3
Note: the error is in the total number; after the sum it is much higher than expected. All other data are correct.
I guess your table 2 is not unique in column a.
Let's assume that table 2 looks like this:
| a | c   | f   |
-----------------
| 1 | 12  | 13  |
| 2 | 14  | 15  |
| 1 | 100 | 101 |
There are two rows where a is 1. Since c and f are different, the grouping (GROUP BY a, f, h in t2) does not collapse them, and COUNT(*) AS total is 1 for each row:
| a | c   | f   | total |
-------------------------
| 1 | 12  | 13  | 1     |
| 2 | 14  | 15  | 1     |
| 1 | 100 | 101 | 1     |
In the next step you join this table to your table 1. The rows of table 1 with value 1 in column a are duplicated, because table 2 has two entries for that key. This is what makes the sum too high.
Instead of unnesting the tables, I recommend the following approach:
-- Creating sample data as given:
with tbl_A as (
  select 1 a, [struct(5 as b, 2 as c), struct(3, 1)] d
  union all select 2, [struct(2, 1)]
  union all select null, [struct(50, 51)]
),
tbl_B as (
  select 1 as a, 12 b, 13 f
  union all select 2, 14, 15
  union all select 1, 100, 101
  union all select null, 500, 501
)
-- Query:
select *
from tbl_A A
left join
  (select a, array_agg(struct(b, f)) as B, count(1) as counts
   from tbl_B
   group by 1) B
on ifnull(A.a, -9) = ifnull(B.a, -9)
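Because tbl_B is collapsed to one row per key before the join, each row of tbl_A matches at most once: the matching tbl_B rows arrive packed in the B array, and counts already carries the per-key row count, so no join fan-out can inflate a later sum.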

Count IDs in Column A that are not repeated in Column B - SQL

I have been working on this query for two days and have read many posts, but I still can't figure out how to handle this situation.
My table is like this:
+------+------+
| Type | ID |
+------+------+
| A | 1339 |
| A | 1156 |
| B | 1156 |
| A | 1192 |
| B | 1214 |
| B | 1202 |
| C | 1202 |
| A | 1207 |
| B | 1207 |
| C | 1207 |
| B | 1241 |
+------+------+
I need to count how many type-B IDs there are whose ID is not repeated in A.
In detail, two criteria should be reflected:
Criterion 1: how many IDs appear ONLY in B?
Criterion 2: how many IDs appear in both A and B?
C does not matter in this situation.
My expected result should be something like this:
+----------------+----------+
| IDs in A and B | IDs in B |
+----------------+----------+
|       2        |    4     |
+----------------+----------+
It seems that it can be something like this:
select Count(Id)              -- or, for distinct Ids: Count(Distinct Id)
from MyTable
where Type = 'A'              -- Id has type 'A'
  and Id not in (select b.Id  -- and does not appear with type 'B'
                 from MyTable b
                 where b.Type = 'B')
Here we get all Ids which have type A but not type B; to find Ids which have type A only:
select Count(Id)  -- or, for distinct Ids: Count(Distinct Id)
from MyTable
where Type = 'A'
  and Id not in (select b.Id
                 from MyTable b
                 where b.Type <> 'A')
To get Ids that have both type A and B, just change not in into in (or do a self join; see the sketch after the next query):
select Count(Id)  -- or, for distinct Ids: Count(Distinct Id)
from MyTable
where Type = 'A'
  and Id in (select b.Id
             from MyTable b
             where b.Type = 'B')
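A sketch of the self-join variant mentioned above, against the same MyTable:
select Count(Distinct a.Id)  -- Ids that occur with both type 'A' and type 'B'
from MyTable a
join MyTable b
  on b.Id = a.Id
where a.Type = 'A'
  and b.Type = 'B'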
Try using COUNT and DISTINCT, but do not forget a WHERE condition to select type-B rows. This is the type of query you end up with:
SELECT COUNT(DISTINCT ID) FROM table WHERE Type = 'B'
How do the IDs in B equal 4? There are five IDs in B and three IDs in (B and not in A).
select COUNT(DISTINCT ID) as AandB
from tTable
where Type = 'B'
  and ID in (select id from tTable where Type = 'A')

select COUNT(DISTINCT ID) as B
from tTable
where Type = 'B'

select COUNT(DISTINCT ID) as Bnot_inA
from tTable
where Type = 'B'
  and ID not in (select id from tTable where Type = 'A')
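For what it's worth, both figures can also be computed in one pass with conditional aggregation. A sketch assuming the same tTable (it returns 2 and 5, matching the separate queries above):
select count(case when has_a = 1 then 1 end) as AandB,
       count(*) as B
from (select ID,
             -- flag each id by whether it ever appears with type 'A'
             max(case when Type = 'A' then 1 else 0 end) as has_a
      from tTable
      where Type in ('A', 'B')
      group by ID
      -- keep only ids that appear with type 'B'
      having max(case when Type = 'B' then 1 else 0 end) = 1
     ) b_ids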

LEFT JOIN ON most recent date in Google BigQuery

I've got two tables, both with timestamps and some more data:
Table A
| name | timestamp | a_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:10:00 | a |
| 2 | 2018-01-01 12:20:00 | b |
| 3 | 2018-01-01 13:30:00 | c |
Table B
| name | timestamp | b_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:00:00 | w |
| 2 | 2018-01-01 12:00:00 | x |
| 3 | 2018-01-01 13:00:00 | y |
| 3 | 2018-01-01 13:10:00 | y |
| 3 | 2018-01-01 13:10:00 | z |
What I want to do is:
For each row in Table A, LEFT JOIN the most recent record in Table B that predates it.
When there is more than one possibility, take the last one.
Target Result
| name | timestamp | a_data | b_data |
| ---- | ------------------- | ------ | ------ |
| 1 | 2018-01-01 11:10:00 | a | w |
| 2 | 2018-01-01 12:20:00 | b | x |
| 3 | 2018-01-01 13:30:00 | c | z | <-- note z, not y
I think this involves a subquery, but I cannot get this to work in BigQuery. What I have so far:
SELECT a.a_data, b.b_data
FROM `table_a` AS a
LEFT JOIN `table_b` AS b
ON a.name = b.name
WHERE a.timestamp = (
SELECT max(timestamp) from `table_b` as sub
WHERE sub.name = b.name
AND sub.timestamp < a.timestamp
)
On my actual dataset, which is a very small test set (under 2 MB), the query runs but never completes. Any pointers much appreciated 👍🏻
You can try to use a select subquery.
SELECT a.*,
       (SELECT MAX(b.b_data)
        FROM `table_b` AS b
        WHERE a.name = b.name
          AND b.timestamp < a.timestamp
       ) b_data
FROM `table_a` AS a
EDIT
Or you can try to use the ROW_NUMBER window function in a subquery.
SELECT name, timestamp, a_data, b_data
FROM (
  SELECT a.*, b.b_data,
         -- tie-break on b_data so that z beats y when timestamps are equal
         ROW_NUMBER() OVER (PARTITION BY a.name
                            ORDER BY b.timestamp DESC, b.b_data DESC) rn
  FROM `table_a` AS a
  LEFT JOIN `table_b` AS b ON a.name = b.name AND b.timestamp < a.timestamp
) t1
WHERE rn = 1
Below is for BigQuery Standard SQL and does not require specifying all columns on both sides, only name and timestamp. So it will work for any number of columns in both tables (assuming no naming ambiguity other than those two columns).
#standardSQL
SELECT a.*, b.* EXCEPT (name, timestamp)
FROM (
  SELECT
    ANY_VALUE(a) a,
    ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] b
  FROM `project.dataset.table_a` a
  LEFT JOIN `project.dataset.table_b` b
  USING (name)
  WHERE a.timestamp > b.timestamp
  GROUP BY TO_JSON_STRING(a)
)
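The trick here is GROUP BY TO_JSON_STRING(a): each whole table_a row, serialized to JSON, acts as the grouping key, so ANY_VALUE(a) can hand the full row back while ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1) keeps only the latest predating table_b row per group. Note that the WHERE filter removes table_a rows with no earlier table_b match despite the LEFT JOIN.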
In BigQuery, arrays are often an efficient way to solve such problems:
SELECT a.a_data, b.b_data
FROM `table_a` a LEFT JOIN
     (SELECT b.name,
             -- LIMIT 1 leaves a single element, so read it at OFFSET(0)
             ARRAY_AGG(b.b_data ORDER BY b.timestamp DESC LIMIT 1)[OFFSET(0)] as b_data
      FROM `table_b` b
      GROUP BY b.name
     ) b
     ON a.name = b.name;
This is a common case where you can't just GROUP BY and take the minimum. I suggest the following:
SELECT *
FROM table_a as a inner join
     (SELECT name, min(timestamp) as timestamp
      FROM table_b
      group by 1) as b
     on (a.timestamp = b.timestamp and a.name = b.name)
This way you limit it only to the minimum present in Table b, as you specified.
You can also achieve that in a more readable way using the WITH statement:
WITH min_b as (
  SELECT name,
         min(timestamp) as timestamp
  FROM table_b
  group by 1
)
SELECT *
FROM table_a as a inner join min_b
  on (a.timestamp = min_b.timestamp and a.name = min_b.name)
Let me know if it worked!

select column1 from table A based on unique value of another column2 in table B

I have table A and table B, and need to select column1 from table A based on a unique value of another column in table B.
table A
id | product |
 1 | A       |
 1 | B       |
 1 | A       |
 2 | A       |
 3 | B       |
 4 | A       |
table B
id | product | date
 1 | A       | 1/01/2017
 1 | B       | 1/02/2017
 1 | A       | 1/01/2017
 2 | A       | 1/01/2017
 3 | B       | 1/02/2017
 4 | A       | 1/01/2017
I want the output to be: 2, 3, 4
i.e. all the ids which have a unique value in the date column of table B.
Depending upon the actual restrictions in your tables, there are a couple of options.
Option 1 assumes that, for example, ID=1, Product=A, date=1/01/2017 together with ID=1, Product=B, date=1/01/2017 means that ID=1 is NOT included in your final result, as it has two entries for the date 1/01/2017 even though they are for different Products:
SELECT a.ID
FROM
(
    SELECT ID, COUNT(*) AS cnt
    FROM TableB
    GROUP BY ID
    HAVING COUNT(*) = 1
) a
Option 2 assumes that the same example means that ID=1 IS included in your final result, as it only has a single date for each ID/Product combination:
SELECT DISTINCT ID
FROM
(
    SELECT ID, Product, COUNT(*) AS cnt
    FROM TableB
    GROUP BY ID, Product
    HAVING COUNT(*) = 1
) a
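If "unique value in the date column" is instead read literally as exactly one distinct date per ID, a third variant (a sketch against the same TableB) avoids the derived table entirely:
SELECT ID
FROM TableB
GROUP BY ID
HAVING COUNT(DISTINCT date) = 1
With the sample data this also returns 2, 3, 4, since ID 1 spans two distinct dates.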

SQL to group by 2 IDs

So I have a table that's laid out like this:
table1:
ID | metric1 | metric2
A  | 1       | 1
A  | 1       | 1
B  | 2       | 3
C  | 3       | 2
And another table that maps an item's old ID to an alternate new ID (note that the new ID will also appear in the table above). Example:
conversions:
old_ID | new_ID
A      | C
So I'm trying to create a query that aggregates on both the new ID and the old ID, but also preserves the old ID if available. So basically the results I want look like this:
ID | potential_old_ID | metric1 | metric2
C  | A                | 5       | 4
B  | NULL             | 2       | 3
So far with my current strategy I've been able to get close with a query like this:
select
    (CASE WHEN new_ID is null then ID else new_ID END) as ID,
    (CASE WHEN new_ID is null then null else ID END) as potential_old_ID,
    SUM(metric1),
    SUM(metric2)
from table1
left join conversions on ID = old_ID
group by ID, new_ID
Which gets me close, but it still separates C and A into different rows, which doesn't work for my use case:
ID | potential_old_ID | metric1 | metric2
C  | A                | 2       | 2
B  | NULL             | 2       | 3
C  | NULL             | 3       | 2
If I remove new_ID from the group by, I get an error on the query. Is there any way I can get the results I'm looking for that I'm missing?
You need to make sure that the rows that already have C also get the same potential old ID as the ones that have A. Something like:
SELECT ID, potential_old_ID, SUM(metric1), SUM(metric2)
FROM
( select
      (CASE WHEN c1.new_ID is null then ID else c1.new_ID END) as ID,
      COALESCE(c1.old_ID, c2.old_ID) as potential_old_ID,
      metric1,
      metric2
  from table1
  left join conversions c1 on ID = c1.old_ID
  left join conversions c2 on ID = c2.new_ID
) AS data
GROUP BY ID, potential_old_ID
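The second join (c2 on ID = c2.new_ID) is what tags the pre-existing C rows with A as their potential_old_ID, so the C/A group collapses into a single row with metric sums 5 and 4.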
Hmmmm . . . I think this does what you want:
select coalesce(new_id, id) as id,
       max(old_id) as potential_old_id,  -- the old id, when one maps to this id
       SUM(metric1),
       SUM(metric2)
from table1 left join
     conversions
     on ID = old_ID
group by coalesce(new_id, id);