How do I coalesce NULLs across multiple rows in BigQuery?


I have the following table:
Date |event_number| customer_id1 | customer_age | customer_gender
10/01/2020 | 1 | abc | NULL | NULL
10/01/2020 | 2 | abc | NULL | male
10/01/2020 | 3 | abc | 45 | NULL
10/01/2020 | 1 | def | 30 | NULL
I want to run a SQL query each day to look for new combinations of customer_id1, customer_age, customer_gender.
Output should look like this:
query_run_time | customer_id1 | customer_age | customer gender
11/01/2020 | abc | 45 | male
11/01/2020 | def | 30 | NULL
Query run time is the date the query was run. If the combination (customer_id1, customer_age, customer_gender) is already in the table, I don't want to insert the row.
Thanks

You can use window functions to assign row numbers within each customer, then join the per-column subqueries on those row numbers, e.g. like this:
SELECT COALESCE(a.customer_id, b.customer_id) as customer_id
, customer_age
, customer_gender
FROM (
SELECT customer_id, customer_age
, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY customer_age ) AS row_no
FROM customer_event
WHERE customer_age IS NOT NULL
) a
FULL JOIN (
SELECT customer_id, customer_gender
, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY customer_gender ) AS row_no
FROM customer_event
WHERE customer_gender IS NOT NULL
) b ON b.customer_id = a.customer_id
AND b.row_no = a.row_no
ORDER BY COALESCE(a.customer_id, b.customer_id)
, COALESCE(a.row_no, b.row_no)
Schema and Test Data
CREATE TABLE customer_event (
event_number INT NOT NULL,
customer_id VARCHAR(10) NOT NULL,
customer_age INT,
customer_gender VARCHAR(10)
);
INSERT INTO customer_event VALUES
( 1, 'abc', NULL, NULL ),
( 2, 'abc', NULL, 'male' ),
( 3, 'abc', 45 , NULL ),
( 4, 'abc', 50 , 'female' ),
( 5, 'abc', 27 , NULL ),
( 1, 'def', 30 , NULL );
Output
customer_id customer_age customer_gender
abc 27 female
abc 45 male
abc 50 (null)
def 30 (null)
The above is from testing with PostgreSQL 9.6 on SQL Fiddle.

Use aggregate functions with GROUP BY (MAX ignores NULLs):
SELECT query_run_time, customer_id, MAX(customer_age) AS customer_age,
MAX(customer_gender) AS customer_gender
FROM tbl
GROUP BY query_run_time, customer_id
Output
query_run_time | customer_id1 | customer_age | customer gender
11/01/2020 | abc | 45 | male
11/01/2020 | def | 30 | NULL

I suspect that what you really want is the most recent value for each column. Here is one method:
select date, customer_id1,
array_agg(customer_age ignore nulls order by event_number desc limit 1)[safe_ordinal(1)] as age,
array_agg(customer_gender ignore nulls order by event_number desc limit 1)[safe_ordinal(1)] as gender
from t
group by date, customer_id1;
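To make the "most recent non-NULL value per column" idea concrete, here is a minimal Python sketch of the same logic. It is not BigQuery code: the tuple layout and the helper name `latest_non_null` are assumptions made for illustration, and the rows are the question's sample data.

```python
# Rows mirror the question's sample: (event_number, customer_id, age, gender).
events = [
    (1, "abc", None, None),
    (2, "abc", None, "male"),
    (3, "abc", 45, None),
    (1, "def", 30, None),
]


def latest_non_null(rows):
    """Return {customer_id: (age, gender)}, keeping the latest non-NULL value per column."""
    result = {}
    # Walk events in ascending event_number so later non-NULL values overwrite earlier ones.
    for _, customer_id, age, gender in sorted(rows, key=lambda r: r[0]):
        prev_age, prev_gender = result.get(customer_id, (None, None))
        result[customer_id] = (
            age if age is not None else prev_age,
            gender if gender is not None else prev_gender,
        )
    return result


coalesced = latest_non_null(events)
print(coalesced)
```

Each column is resolved independently, which is why customer abc ends up with an age from event 3 and a gender from event 2.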

Related

Get records having the same value in 2 columns but a different value in a 3rd column

I am having trouble writing a query that will return all records where 2 columns have the same value but a different value in a 3rd column. I am looking for the records where the Item_Type and Location_ID are the same, but the Sub_Location_ID is different.
The table looks like this:
+---------+-----------+-------------+-----------------+
| Item_ID | Item_Type | Location_ID | Sub_Location_ID |
+---------+-----------+-------------+-----------------+
| 1 | 00001 | 20 | 78 |
| 2 | 00001 | 110 | 124 |
| 3 | 00001 | 110 | 124 |
| 4 | 00002 | 3 | 18 |
| 5 | 00002 | 3 | 25 |
+---------+-----------+-------------+-----------------+
The result I am trying to get would look like this:
+---------+-----------+-------------+-----------------+
| Item_ID | Item_Type | Location_ID | Sub_Location_ID |
+---------+-----------+-------------+-----------------+
| 4 | 00002 | 3 | 18 |
| 5 | 00002 | 3 | 25 |
+---------+-----------+-------------+-----------------+
I have been trying to use the following query:
SELECT *
FROM Table1
WHERE Item_Type IN (
SELECT Item_Type
FROM Table1
GROUP BY Item_Type
HAVING COUNT (DISTINCT Sub_Location_ID) > 1
)
But it returns all records with the same Item_Type and a different Sub_Location_ID, not all records with the same Item_Type AND Location_ID but a different Sub_Location_ID.

This should do the trick...
-- some test data...
IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL
BEGIN DROP TABLE #TestData; END;
CREATE TABLE #TestData (
Item_ID INT NOT NULL PRIMARY KEY,
Item_Type CHAR(5) NOT NULL,
Location_ID INT NOT NULL,
Sub_Location_ID INT NOT NULL
);
INSERT #TestData (Item_ID, Item_Type, Location_ID, Sub_Location_ID) VALUES
(1, '00001', 20, 78),
(2, '00001', 110, 124),
(3, '00001', 110, 124),
(4, '00002', 3, 18),
(5, '00002', 3, 25);
-- adding a covering index will eliminate the sort operation...
CREATE NONCLUSTERED INDEX ix_indexname ON #TestData (Item_Type, Location_ID, Sub_Location_ID, Item_ID);
-- the actual solution...
WITH
cte_count_group AS (
SELECT
td.Item_ID,
td.Item_Type,
td.Location_ID,
td.Sub_Location_ID,
cnt_grp_2 = COUNT(1) OVER (PARTITION BY td.Item_Type, td.Location_ID),
cnt_grp_3 = COUNT(1) OVER (PARTITION BY td.Item_Type, td.Location_ID, td.Sub_Location_ID)
FROM
#TestData td
)
SELECT
cg.Item_ID,
cg.Item_Type,
cg.Location_ID,
cg.Sub_Location_ID
FROM
cte_count_group cg
WHERE
cg.cnt_grp_2 > 1
AND cg.cnt_grp_3 < cg.cnt_grp_2;

You can use EXISTS:
select t.*
from Table1 t
where exists (select 1
from Table1 t1
where t.Item_Type = t1.Item_Type and
t.Location_ID = t1.Location_ID and
t.Sub_Location_ID <> t1.Sub_Location_ID
);
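If you want to try the EXISTS approach without a SQL Server instance, the sketch below runs the same query shape against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here (an assumption, not the asker's engine), and the table name Table1 is taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (Item_ID INTEGER PRIMARY KEY, Item_Type TEXT, "
    "Location_ID INTEGER, Sub_Location_ID INTEGER)"
)
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [
        (1, "00001", 20, 78),
        (2, "00001", 110, 124),
        (3, "00001", 110, 124),
        (4, "00002", 3, 18),
        (5, "00002", 3, 25),
    ],
)

# Keep a row only if another row shares Item_Type and Location_ID
# but has a different Sub_Location_ID.
query = """
SELECT t.Item_ID, t.Item_Type, t.Location_ID, t.Sub_Location_ID
FROM Table1 AS t
WHERE EXISTS (
    SELECT 1
    FROM Table1 AS t1
    WHERE t1.Item_Type = t.Item_Type
      AND t1.Location_ID = t.Location_ID
      AND t1.Sub_Location_ID <> t.Sub_Location_ID
)
ORDER BY t.Item_ID
"""
matches = conn.execute(query).fetchall()
print(matches)
```

Items 2 and 3 share a type and location but also the same sub-location, so they are filtered out, which matches the result the question asks for.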

SQL Server has no multi-column (row-value) IN, so you can emulate it by concatenating the columns into a single key. This assumes '#' never appears in Item_Type:
SELECT *
FROM Table1
WHERE Item_Type+'#'+Cast(Location_ID as varchar(20)) IN (
SELECT Item_Type+'#'+Cast(Location_ID as varchar(20))
FROM Table1
GROUP BY Item_Type, Location_ID
HAVING COUNT (DISTINCT Sub_Location_ID) > 1
);
The downside is that the expression in the WHERE clause is non-sargable.

I think you can use exists:
select t1.*
from table1 t1
where exists (select 1
from table1 tt1
where tt1.Item_Type = t1.Item_Type and
tt1.Location_ID = t1.Location_ID and
tt1.Sub_Location_ID <> t1.Sub_Location_ID
);

Average of Days between ordered dates per group

+-------+-------+-----------+
| EmpID | PerID | VisitDate |
+-------+-------+-----------+
| 1 | 22 | 2/24/2017 |
| 1 | 22 | 3/25/2017 |
| 1 | 22 | 4/5/2017 |
| 2 | 33 | 5/6/2017 |
| 2 | 33 | 8/9/2017 |
| 2 | 33 | 6/7/2017 |
+-------+-------+-----------+
I am trying to find the latest visit date and the average number of days between visits per EmpID. For the average, I first have to order the dates consecutively and then average the gaps between them.
E.g., the average for EmpID=1 and PerID=22 would be [29 (days between 2/24 and 3/25) + 11 (days between 3/25 and 4/5)] / 2 = 20 days.
Desired Output:
+-------+-------+----------+----------+
| EmpID | PerID | MaxVDate | AvgVDays |
+-------+-------+----------+----------+
| 1 | 22 | 4/5/2017 | 20 |
| 2 | 33 | 8/9/2017 | 47.5 |
+-------+-------+----------+----------+
Attempt:
SELECT
EmpID
,PerID
,MAX(VisitDate) AS MaxVDate
,--Dunno how to find average AS AvgVDays
FROM
T1
GROUP BY
EmpID
,PerID

You can use lag to get the previous date and compute the date difference. Then use avg window function to get the average days.
Select distinct empid,perid,maxVdate,avg(diff_with_prev) OVER(Partition by empid) as avgVDays
from (
SELECT EmpID,PerID
,MAX(VisitDate) OVER(Partition BY EmpID) AS MaxVDate
,DATEDIFF(DAY,LAG(VisitDate) OVER(Partition BY EmpID order by VisitDate), VisitDate) as diff_with_prev
FROM T1
) t

Here's an option...
IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL
DROP TABLE #TestData;
CREATE TABLE #TestData (
EmpID INT NOT NULL,
PerID INT NOT NULL,
VisitDate DATE NOT NULL
);
INSERT #TestData (EmpID, PerID, VisitDate) VALUES
(1, 22, '2/24/2017'),
(1, 22, '3/25/2017'),
(1, 22, '4/5/2017'),
(2, 33, '5/6/2017'),
(2, 33, '8/9/2017'),
(2, 33, '6/7/2017');
-- SELECT * FROM #TestData td;
SELECT
db.EmpID,
db.PerID,
AvgDays = AVG(db.DaysBetween * 1.0)
FROM (
SELECT
*,
DaysBetween = DATEDIFF(dd, LAG(td.VisitDate, 1) OVER (PARTITION BY td.EmpID, td.PerID ORDER BY td.VisitDate), td.VisitDate)
FROM
#TestData td
) db
GROUP BY
db.EmpID,
db.PerID;
Results...
EmpID PerID AvgDays
----------- ----------- ---------------------------------------
1 22 20.000000
2 33 47.500000

The task is much easier than you think. You get the average with (last visit - first visit) / (count visits - 1).
select
empid,
perid,
max(VisitDate) as MaxVDate,
datediff(day, min(VisitDate), max(VisitDate)) * 1.0 / (count(*) - 1) as avgvdays
from mytable
group by empid, perid
having count(*) > 1
order by empid, perid;
The multiplication by 1.0 is necessary in order to avoid integer division. (You could also cast to decimal instead.)
As the calculation only makes sense for empid/perid pairs with more than one entry (and in order to avoid division by zero), I have added a corresponding HAVING clause.
Here is a test: http://rextester.com/AIFPA62612
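The shortcut works because consecutive gaps telescope: their sum is simply last visit minus first visit. A small Python check on the question's dates, offered as a sketch (the function names are made up for illustration, and both functions assume at least two visits per group):

```python
from datetime import date

# Visits per (EmpID, PerID), copied from the question; order is deliberately unsorted.
visits = {
    (1, 22): [date(2017, 2, 24), date(2017, 3, 25), date(2017, 4, 5)],
    (2, 33): [date(2017, 5, 6), date(2017, 8, 9), date(2017, 6, 7)],
}


def avg_gap_pairwise(dates):
    """Average of the gaps between consecutive visits (the LAG-style approach)."""
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def avg_gap_shortcut(dates):
    """(last visit - first visit) / (number of visits - 1)."""
    return (max(dates) - min(dates)).days / (len(dates) - 1)


for key, dates in visits.items():
    print(key, avg_gap_pairwise(dates), avg_gap_shortcut(dates))
```

Both functions agree on every group, so the aggregate query needs neither LAG nor a subquery.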

Update table in Postgresql by grouping rows

I want to update a table by grouping (or combining) some rows together based on a certain criteria. I basically have this table currently (I want to group by 'id_number' and 'date' and sum 'count'):
Table: foo
---------------------------------------
| id_number | date | count |
---------------------------------------
| 1 | 2001 | 1 |
| 1 | 2001 | 2 |
| 1 | 2002 | 1 |
| 2 | 2001 | 6 |
| 2 | 2003 | 12 |
| 2 | 2003 | 2 |
---------------------------------------
And I want to get this:
Table: foo
---------------------------------------
| id_number | date | count |
---------------------------------------
| 1 | 2001 | 3 |
| 1 | 2002 | 1 |
| 2 | 2001 | 6 |
| 2 | 2003 | 14 |
---------------------------------------
I know that I can easily create a new table with the pertinent info. But how can I modify an existing table like this without making a "temp" table? (Note: I have nothing against using a temporary table, I'm just interested in seeing if I can do it this way)

If you want to delete rows, you can add a primary key (to distinguish the rows) and use two statements: an UPDATE to store the sum and a DELETE to remove the extra rows.
You can do something like this:
create table foo (
id integer primary key,
id_number integer,
date integer,
count integer
);
insert into foo values
(1, 1 , 2001 , 1 ),
(2, 1 , 2001 , 2 ),
(3, 1 , 2002 , 1 ),
(4, 2 , 2001 , 6 ),
(5, 2 , 2003 , 12 ),
(6, 2 , 2003 , 2 );
select * from foo;
update foo
set count = count_sum
from (
select id, id_number, date,
sum(count) over (partition by id_number, date) as count_sum
from foo
) foo_added
where foo.id_number = foo_added.id_number
and foo.date = foo_added.date;
delete from foo
using (
select id, id_number, date,
row_number() over (partition by id_number, date order by id) as inner_order
from foo
) foo_ranked
where foo.id = foo_ranked.id
and foo_ranked.inner_order <> 1;
select * from foo;
You can try it here: http://rextester.com/PIL12447
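For a quick local experiment with the same two-step idea, here is a sketch that uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL. Note the swap: instead of UPDATE ... FROM and DELETE ... USING with window functions, it keeps the row with the lowest id in each group via plain subqueries on MIN(id). Treat it as an illustration of the idea, not as the PostgreSQL statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY, id_number INTEGER, "
    "date INTEGER, count INTEGER)"
)
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?, ?)",
    [(1, 1, 2001, 1), (2, 1, 2001, 2), (3, 1, 2002, 1),
     (4, 2, 2001, 6), (5, 2, 2003, 12), (6, 2, 2003, 2)],
)

with conn:  # run both statements in one transaction
    # Step 1: store the group total on the surviving row (lowest id per group).
    conn.execute("""
        UPDATE foo
        SET count = (
            SELECT SUM(f2.count) FROM foo AS f2
            WHERE f2.id_number = foo.id_number AND f2.date = foo.date
        )
        WHERE id IN (SELECT MIN(id) FROM foo GROUP BY id_number, date)
    """)
    # Step 2: remove every row that is not the survivor of its group.
    conn.execute("""
        DELETE FROM foo
        WHERE id NOT IN (SELECT MIN(id) FROM foo GROUP BY id_number, date)
    """)

final_rows = conn.execute(
    "SELECT id_number, date, count FROM foo ORDER BY id"
).fetchall()
print(final_rows)
```

Only the surviving row of each group is updated, so no sum is ever computed from a row that has already been overwritten.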
With only one UPDATE (but with a trigger), you can set count to NULL on the rows you want to discard and let the trigger DELETE them.
create table foo (
id integer primary key,
id_number integer,
date integer,
count integer
);
create function delete_if_count_is_null() returns trigger
language plpgsql as
$BODY$
begin
if new.count is null then
delete from foo
where id = new.id;
end if;
return new;
end;
$BODY$;
create trigger delete_if_count_is_null
after update on foo
for each row
execute procedure delete_if_count_is_null();
insert into foo values
(1, 1 , 2001 , 1 ),
(2, 1 , 2001 , 2 ),
(3, 1 , 2002 , 1 ),
(4, 2 , 2001 , 6 ),
(5, 2 , 2003 , 12 ),
(6, 2 , 2003 , 2 );
select * from foo;
update foo
set count = case when inner_order = 1 then count_sum else null end
from (
select id, id_number, date,
sum(count) over (partition by id_number, date) as count_sum,
row_number() over (partition by id_number, date order by id) as inner_order
from foo
) foo_added
where foo.id_number = foo_added.id_number
and foo.date = foo_added.date
and foo.id = foo_added.id;
select * from foo;
You can try it in: http://rextester.com/MWPRG10961

How to create a condition for this case?

Sample Table:
Id |Acc_Code|Description |Balance | Acclevel| Acctype| Exttype|
--- -------- ----------------- |-------- |-------- | -------| -------|
1 |SA |Sales | 0.00 | 1 | SA | |
2 |CS |Cost of Sales | 0.00 | 1 | CS | |
3 |5000/001|Revenue | 94.34 | 2 | SA | |
4 |5000/090|Sales(Local) | 62.83 | 2 | SA | |
5 |7000/000|Manufacturing Acc |-250.80 | 2 | CS | MA |
6 |7000/200|Manufacturing Acc | 178.00 | 2 | CS | |
This is sample data from a temporary table that will be inserted into another temporary table, which calculates the data for a Profit and Loss Statement (for manufacturing-related accounts only).
In this case, the acc_code values for manufacturing accounts start at 7000/000 and are partitioned by each following Exttype.
E.g., we start from the row with Exttype MA and, based on its Acclevel (which could be 2 or more), continue until the next Exttype.
The idea is to get the manufacturing accounts with SELECT FROM tmp_acc_list WHERE acc_code BETWEEN #start_acc_code (7000/000 in this case) AND #end_acc_code (the row before the next Exttype).
I don't know what the Exttype is; I'm still learning the tables.
How do we derive the #end_acc_code part from this sample table?

Here is an all-in-one script.
I created your table for testing:
create table #tmp_acc_list(
Id numeric,
Acc_Code nvarchar(100),
Acclevel numeric,
Acctype nvarchar(100),
Exttype nvarchar(100));
GO
insert into #tmp_acc_list(Id, Acc_Code, Acclevel, Acctype, Exttype)
select 1 , 'SA', 1,'SA', null union all
select 2 , 'CS', 1,'CS', null union all
select 3 , '5000/001', 2,'SA', null union all
select 4 , '5000/090', 2,'SA', null union all
select 5 , '7000/000', 2,'CS', 'MA' union all
select 6 , '7000/200', 2,'CS', null
;
Then comes the query:
with OrderedTable as -- number the rows, in case Id itself is not the intended order
(
select
t.*, ROW_NUMBER() over (
order by id asc --use any ordering You need here
)
as RowNum
from
#tmp_acc_list as t
),
MarkedTable as -- mark with common number
(
select
t.*,
Max(case when t.Exttype is null then null else t.RowNum end)
over (order by t.RowNum) as GroupRownum
from OrderedTable as t
),
GroupedTable as -- add group Exttype
(
select
t.Id, t.Acc_Code, t.Acclevel, t.Acctype, t.Exttype,
max(t.Exttype) over (partition by t.GroupRownum) as GroupExttype
from MarkedTable as t
)
select * from GroupedTable where GroupExttype = 'MA'
Is this what you need?
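The CTE chain boils down to carrying the last non-NULL Exttype forward and then filtering on it. A rough Python sketch of that logic, using the same test rows (the helper name `accounts_in_group` is hypothetical, and ordering by Id is an assumption, just as in the query):

```python
rows = [
    # (Id, Acc_Code, Acclevel, Acctype, Exttype)
    (1, "SA", 1, "SA", None),
    (2, "CS", 1, "CS", None),
    (3, "5000/001", 2, "SA", None),
    (4, "5000/090", 2, "SA", None),
    (5, "7000/000", 2, "CS", "MA"),
    (6, "7000/200", 2, "CS", None),
]


def accounts_in_group(rows, wanted_exttype):
    """Return the Acc_Codes from the row marked wanted_exttype up to the next marked row."""
    selected = []
    current = None  # most recent non-NULL Exttype seen so far
    for row in sorted(rows, key=lambda r: r[0]):  # assumed ordering: by Id
        exttype = row[4]
        if exttype is not None:
            current = exttype  # a new group starts at every non-NULL Exttype
        if current == wanted_exttype:
            selected.append(row[1])
    return selected


ma_codes = accounts_in_group(rows, "MA")
print(ma_codes)
```

The last element of the returned list corresponds to the end account code the question asks about.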

select *
from
(
select Id, Acc_Code
from tmp_acc_list
where Acc_Code = '7000/000'
) s
cross join tmp_acc_list a
cross apply
(
select top 1 x.Id, x.Acc_Code
from tmp_acc_list x
where x.Id >= a.Id
and x.AccLevel = a.AccLevel
and x.Acctype = a.Acctype
and x.Exttype = ''
order by Id desc
) e
where a.Id between s.Id and e.Id

Select last changed row in sub-query

I have a table product:
id | owner_id | last_activity | box_id
------------------------------------
1 | 2 | 12/19/2014 | null
2 | 2 | 12/13/2014 | null
3 | 2 | 08/11/2014 | null
4 | 2 | 12/11/2014 | 99
5 | 2 | null | 99
6 | 2 | 12/15/2014 | 99
7 | 2 | null | 105
8 | 2 | null | 105
9 | 2 | null | 105
The only variable that I have is owner_id.
I need to select all products of a user, but if a product is in a box, then only the latest one in that box should be selected.
The sample output for owner = 2 is the following:
id | owner_id | last_activity | box_id
------------------------------------
1 | 2 | 12/19/2014 | null
2 | 2 | 12/13/2014 | null
3 | 2 | 08/11/2014 | null
6 | 2 | 12/15/2014 | 99
7 | 2 | null | 105
I'm not able to find a way to select the latest product from a box.
My current query, which does not return correct value, but can be executed:
SELECT p.* FROM product p
WHERE p.owner_id = 2
AND (
p.box_id IS NULL
OR (
p.box_id IS NOT NULL
AND
p.id = ( SELECT MAX(pp.id) FROM product pp
WHERE pp.box_id = p.box_id )
)
)
I tried with dates:
SELECT p.* FROM product p
WHERE p.owner_id = 2
AND (
p.box_id IS NULL
OR (
p.box_id IS NOT NULL
AND
p.id = ( SELECT * FROM (
SELECT pp.id FROM product pp
WHERE pp.box_id = p.box_id
ORDER BY last_activity desc
) WHERE rownum = 1
)
)
)
Which gives error: p.box_id is undefined as it's inside 2nd subquery.
Do you have any ideas how can I solve it?

The ROW_NUMBER analytical function might help with such queries:
SELECT "owner_id", "id", "box_id", "last_activity" FROM
(
SELECT "owner_id", "id", "box_id", "last_activity",
ROW_NUMBER()
OVER (PARTITION BY "box_id" ORDER BY "last_activity" DESC NULLS LAST) rn
-- ^^^^^^^^^^^^^^^
-- descending order, with nulls sorted after non-null values
-- (in Oracle, DESC puts nulls first by default,
-- so NULLS LAST has to be spelled out here)
FROM T
WHERE "owner_id" = 2
) V
WHERE rn = 1 or "box_id" IS NULL
ORDER BY "id" -- <-- probably not necessary, but matches your example
See http://sqlfiddle.com/#!4/db775/8
"There can be nulls as a value. If there are nulls in all products inside a box, then MIN(id) should be returned."
Even though it is probably not a good idea to rely on id to order things, if you think you need that, you will have to change the ORDER BY clause to:
... ORDER BY "last_activity" DESC NULLS LAST, "id" ASC
-- ^^^^^^^^ the lowest id wins the tie, i.e. MIN(id)
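To sanity-check the selection rule against the sample data, here is a plain Python sketch of "one row per box: latest last_activity, nulls last, ties broken by the lowest id" (the MIN(id) requirement quoted above). The function names are hypothetical and the dates are read as MM/DD/YYYY, which is an assumption about the question's format.

```python
from datetime import date

products = [
    # (id, owner_id, last_activity, box_id)
    (1, 2, date(2014, 12, 19), None),
    (2, 2, date(2014, 12, 13), None),
    (3, 2, date(2014, 8, 11), None),
    (4, 2, date(2014, 12, 11), 99),
    (5, 2, None, 99),
    (6, 2, date(2014, 12, 15), 99),
    (7, 2, None, 105),
    (8, 2, None, 105),
    (9, 2, None, 105),
]


def rank_key(row):
    """Smaller key = preferred row: non-null activity first, newest first, lowest id."""
    pid, _, activity, _ = row
    if activity is None:
        return (True, 0, pid)
    return (False, -activity.toordinal(), pid)


def visible_products(rows, owner_id):
    owned = [r for r in rows if r[1] == owner_id]
    unboxed = [r for r in owned if r[3] is None]  # unboxed products are all kept
    best = {}
    for r in owned:
        if r[3] is None:
            continue
        if r[3] not in best or rank_key(r) < rank_key(best[r[3]]):
            best[r[3]] = r  # keep the preferred row for this box
    return sorted(unboxed + list(best.values()), key=lambda r: r[0])


selected_ids = [r[0] for r in visible_products(products, 2)]
print(selected_ids)
```

Box 99 resolves to product 6 (latest activity), and box 105, where every activity is null, falls back to the lowest id.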

Use exists
SELECT
p.*
FROM
product p
WHERE
p.owner_id = 2 AND
( p.box_id IS NULL OR
(
p.box_id IS NOT NULL AND
NOT EXISTS
(
SELECT
pp.id
FROM
product pp
WHERE
pp.box_id = p.box_id AND
pp.last_activity > p.last_activity
)
)
)

You can use UNION to first get all rows where box_id is null and then fetch the rows with the max id and date where box_id is not null:
SELECT * FROM
(
SELECT id,owner_id,last_activity,box_id FROM product WHERE owner_id = 2 AND box_id IS NULL
UNION
SELECT MAX(id),owner_id,MAX(last_activity),box_id FROM product WHERE owner_id = 2 AND box_id IS NOT NULL GROUP BY owner_id, box_id
) T1
ORDER BY
id
