Fill values by user in BigQuery - sql

I've got data on how a set of customers spend money on a daily basis, with the following structure in BigQuery:
CREATE TABLE if not EXISTS daily_spend (
user_id int,
created_at DATE,
value float
);
INSERT INTO daily_spend VALUES
(1, '2021-01-01', 0),
(1, '2021-01-02', 1),
(1, '2021-01-04', 1),
(1, '2021-01-05', 2),
(1, '2021-01-07', 5),
(2, '2021-01-01', 5),
(2, '2021-01-03', 0),
(2, '2021-01-04', 1),
(2, '2021-01-06', 2);
I'd like to complete the data for each user by putting 0's in the days on which the user didn't spend any money, only including days between their first and last days of spending.
So, the output table in this example would have the following values:
(1, '2021-01-01', 0),
(1, '2021-01-02', 1),
(1, '2021-01-03', 0),
(1, '2021-01-04', 1),
(1, '2021-01-05', 2),
(1, '2021-01-06', 0),
(1, '2021-01-07', 5),
(2, '2021-01-01', 5),
(2, '2021-01-02', 0),
(2, '2021-01-03', 0),
(2, '2021-01-04', 1),
(2, '2021-01-05', 0),
(2, '2021-01-06', 2)
What's the simplest way of doing this in BigQuery?

Use the query below:
select user_id, created_at, ifnull(value, 0) value
from (
select user_id, min(created_at) min_date, max(created_at) max_date
from daily_spend
group by user_id
), unnest(generate_date_array(min_date, max_date)) created_at
left join daily_spend
using(user_id, created_at)
Applied to the sample data in your question, this returns the expected output listed there.
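To try the same technique without creating a table first, here is a minimal self-contained sketch. The sample rows from the question are inlined as a CTE, and the names bounds and d are only illustrative, they are not part of the original answer.

```sql
-- Sample rows from the question, inlined so no table is needed.
WITH daily_spend AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, created_at DATE, value FLOAT64>>[
    (1, DATE '2021-01-01', 0), (1, DATE '2021-01-02', 1),
    (1, DATE '2021-01-04', 1), (1, DATE '2021-01-05', 2),
    (1, DATE '2021-01-07', 5), (2, DATE '2021-01-01', 5),
    (2, DATE '2021-01-03', 0), (2, DATE '2021-01-04', 1),
    (2, DATE '2021-01-06', 2)
  ])
),
-- First and last spending day per user (illustrative CTE name).
bounds AS (
  SELECT user_id, MIN(created_at) AS min_date, MAX(created_at) AS max_date
  FROM daily_spend
  GROUP BY user_id
)
SELECT b.user_id, d AS created_at, IFNULL(s.value, 0) AS value
FROM bounds AS b
-- One row per calendar day between the user's first and last day.
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(b.min_date, b.max_date)) AS d
-- Days without a matching row get NULL, which IFNULL turns into 0.
LEFT JOIN daily_spend AS s
  ON s.user_id = b.user_id AND s.created_at = d
ORDER BY b.user_id, d;
```

Writing the join with an explicit ON clause instead of USING is only a readability choice; the logic is the same.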

Related

Starting and Ending a row-count based on values in another column

There is a need to monitor the performance of a warehouse of goods. Please refer to the table containing data for one warehouse below:
WK_NO: Week number; Problem: Problem faced on that particular week. Empty cells are NULLs.
I need to create the 3rd column:
Weeks on list: A column indicating the number of weeks that a particular warehouse is being monitored as of that particular week.
Required Logic:
Initially the column's values are to be 0. If a warehouse is encountering problems continuously for 4 weeks, it is put onto a "list" and a counter starts, indicating the number of weeks the warehouse has been problematic. And if the warehouse is problem-free for 4 continuous weeks after facing problems, the counter resets to 0 and stays 0 until there is another 4 weeks of problems.
Code to generate data shown above:
CREATE TABLE warehouse (
WK_NO INT NOT NULL,
Problem STRING,
Weeks_on_list_ref INT
);
INSERT INTO warehouse
(WK_NO, Problem, Weeks_on_list_ref)
VALUES
(1, NULL, 0),
(2, NULL, 0),
(3, 'supply', 0),
(4, 'supply', 0),
(5, 'manpower', 0),
(6, 'supply', 0),
(7, 'manpower', 1),
(8, 'supply', 2),
(9, NULL, 3),
(10, NULL, 4),
(11, 'supply', 5),
(12, 'supply', 6),
(13, 'manpower', 7),
(14, NULL, 8),
(15, NULL, 9),
(16, NULL, 10),
(17, NULL, 11),
(18, NULL, 0),
(19, NULL, 0),
(20, NULL, 0);
Any help is much appreciated.
Update:
Some solutions are failing when bringing in data for multiple warehouses.
Updated the code generation script with W_NO which is the warehouse ID, for your consideration.
CREATE OR REPLACE TABLE warehouse (
W_NO INT NOT NULL,
WK_NO INT NOT NULL,
Problem STRING,
Weeks_on_list_ref INT
);
INSERT INTO warehouse
(W_NO, WK_NO, Problem, Weeks_on_list_ref)
VALUES
(1, 1, NULL, 0),
(1, 2, NULL, 0),
(1, 3, 'supply', 0),
(1, 4, 'supply', 0),
(1, 5, 'manpower', 0),
(1, 6, 'supply', 0),
(1, 7, 'manpower', 1),
(1, 8, 'supply', 2),
(1, 9, NULL, 3),
(1, 10, NULL, 4),
(1, 11, 'supply', 5),
(1, 12, 'supply', 6),
(1, 13, 'manpower', 7),
(1, 14, NULL, 8),
(1, 15, NULL, 9),
(1, 16, NULL, 10),
(1, 17, NULL, 11),
(1, 18, NULL, 0),
(1, 19, NULL, 0),
(1, 20, NULL, 0),
(2, 1, NULL, 0),
(2, 2, NULL, 0),
(2, 3, 'supply', 0),
(2, 4, 'supply', 0),
(2, 5, 'manpower', 0),
(2, 6, 'supply', 0),
(2, 7, 'manpower', 1),
(2, 8, 'supply', 2),
(2, 9, NULL, 3),
(2, 10, NULL, 4),
(2, 11, 'supply', 5),
(2, 12, 'supply', 6),
(2, 13, 'manpower', 7),
(2, 14, NULL, 8),
(2, 15, NULL, 9),
(2, 16, NULL, 10),
(2, 17, NULL, 11),
(2, 18, NULL, 0),
(2, 19, NULL, 0),
(2, 20, NULL, 0);
Consider the query below for the updated question:
SELECT W_NO, WK_NO, Problem, IF(MOD(div, 2) = 0, 0, RANK() OVER (PARTITION BY W_NO, div ORDER BY WK_NO)) AS Weeks_on_list
FROM (
SELECT *, COUNTIF(flag IS TRUE) OVER (PARTITION BY W_NO ORDER BY WK_NO) AS div FROM (
SELECT *,
LAG(Problem, 5) OVER w0 IS NULL AND COUNT(Problem) OVER w1 = 4 OR
LAG(Problem, 5) OVER w0 IS NOT NULL AND COUNT(Problem) OVER w1 = 0 AS flag
FROM warehouse
WINDOW w0 AS (PARTITION BY W_NO ORDER BY WK_NO), w1 AS (w0 ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING)
)
)
ORDER BY W_NO, WK_NO;
Consider the query below:
1. Using a window frame with a fixed size of 4, first find the boundaries where the warehouse turns into an abnormal state and vice versa (innermost query).
2. Partition the weeks using the boundaries found in step 1.
3. Since normal and abnormal states alternate, calculate RANK() only for the abnormal state (outermost query).
SELECT WK_NO, Problem, IF(MOD(div, 2) = 0, 0, RANK() OVER (PARTITION BY div ORDER BY WK_NO)) AS Weeks_on_list
FROM (
SELECT *, COUNTIF(flag IS TRUE) OVER (ORDER BY WK_NO) AS div FROM (
SELECT *,
LAG(Problem, 5) OVER w0 IS NULL AND COUNT(Problem) OVER w1 = 4 OR
LAG(Problem, 5) OVER w0 IS NOT NULL AND COUNT(Problem) OVER w1 = 0 AS flag
FROM warehouse
WINDOW w0 AS (ORDER BY WK_NO), w1 AS (w0 ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING)
)
)
ORDER BY WK_NO;

Fill table using last value per user

I've got data on customers' balances on a daily basis, with the following structure in BigQuery:
CREATE TABLE if not EXISTS balance (
user_id int,
updated_ag DATE,
value float
);
INSERT INTO balance VALUES
(1, '2021-01-01', 0),
(1, '2021-01-02', 1),
(1, '2021-01-05', 2),
(1, '2021-01-07', 5),
(2, '2021-01-01', 5),
(2, '2021-01-03', 0),
(2, '2021-01-04', 1),
(2, '2021-01-06', 2);
I have one row for a user on a given day if the balance on that day changed.
I'd like to complete the data for each user by putting the balance of the last day where there was an update, with the dates being between the first day with balance and last date in the table.
So, the output table in this example would have the following values:
(1, '2021-01-01', 0),
(1, '2021-01-02', 1),
(1, '2021-01-03', 1),
(1, '2021-01-04', 1),
(1, '2021-01-05', 2),
(1, '2021-01-06', 2),
(1, '2021-01-07', 5),
(2, '2021-01-01', 5),
(2, '2021-01-02', 5),
(2, '2021-01-03', 0),
(2, '2021-01-04', 1),
(2, '2021-01-05', 1),
(2, '2021-01-06', 2),
(2, '2021-01-07', 2)
What's the simplest way of doing this in BigQuery?
Try this one:
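select user_id, generated_date, value
from (
select
*,
-- updated_ag is the date column as named in the question's table definition
-- each balance is valid until the day before the user's next update, or until the last date in the table
ifnull(lead(updated_ag) over(partition by user_id order by updated_ag) - 1, max(updated_ag) over()) date_to
from balance
), unnest(generate_date_array(updated_ag, date_to, interval 1 day)) generated_date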
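As an alternative, here is a rough sketch of the more common "date spine plus LAST_VALUE ... IGNORE NULLS" pattern. It assumes the column is named updated_ag as in the question's table definition; the CTE names bounds, last_update and spine, and the alias calendar_day, are made up for illustration.

```sql
-- Sample rows from the question, inlined so no table is needed.
WITH balance AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<user_id INT64, updated_ag DATE, value FLOAT64>>[
    (1, DATE '2021-01-01', 0), (1, DATE '2021-01-02', 1),
    (1, DATE '2021-01-05', 2), (1, DATE '2021-01-07', 5),
    (2, DATE '2021-01-01', 5), (2, DATE '2021-01-03', 0),
    (2, DATE '2021-01-04', 1), (2, DATE '2021-01-06', 2)
  ])
),
-- First update per user.
bounds AS (
  SELECT user_id, MIN(updated_ag) AS first_day
  FROM balance
  GROUP BY user_id
),
-- Last date in the whole table (single row).
last_update AS (
  SELECT MAX(updated_ag) AS last_day FROM balance
),
-- One row per user per day from that user's first update to the global last date.
spine AS (
  SELECT b.user_id, calendar_day
  FROM bounds AS b
  CROSS JOIN last_update AS l
  CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(b.first_day, l.last_day)) AS calendar_day
)
SELECT
  s.user_id,
  s.calendar_day,
  -- Carry the most recent non-NULL balance forward to days without an update.
  LAST_VALUE(bal.value IGNORE NULLS) OVER (
    PARTITION BY s.user_id ORDER BY s.calendar_day
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS value
FROM spine AS s
LEFT JOIN balance AS bal
  ON bal.user_id = s.user_id AND bal.updated_ag = s.calendar_day
ORDER BY s.user_id, s.calendar_day;
```

This is longer than the lead() version above, but it separates "which days should exist" from "which value applies", which can be easier to adapt if the date range rules change.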

Forward fill since (possibly non existent) date in BigQuery

I have data from two different sources. On one hand, I have user data from our app, with a primary key of ID and UTC date. There are rows only for UTC dates on which a user used the app. On the other hand, I have advertisement campaign attribution data for the users (a user can have multiple advertisement campaigns). This table has a primary key of ID and campaign, plus a metric containing an advertisement attribution timestamp. I want to combine the two data sources so that I can compute whether a campaign generates more revenue than it costs, among other campaign statistics.
App data example:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, UTC_Date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0)])
Advertisement campaign attribution data example:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, Attribution_Timestamp Timestamp, campaign_name STRING>>
[(1, TIMESTAMP('2021-01-01 09:54:31'), "A"),
(1, TIMESTAMP('2021-01-09 22:32:51'), "B"),
(2, TIMESTAMP('2021-01-03 19:12:11'), "A")])
The end result I would like to get is:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, UTC_Date DATE, Revenue FLOAT64, campaign_name STRING>>
[(1, DATE('2021-01-01'), 0, "A"),
(1, DATE('2021-01-05'), 5, "A"),
(1, DATE('2021-01-10'), 0, "B"),
(2, DATE('2021-01-03'), 10, "A"),
(2, DATE('2021-01-08'), 0, "A"),
(2, DATE('2021-01-09'), 0, "A")])
This can be achieved by somehow joining the campaign attribution data to the app data and then forward filling.
The problem is that the advertisement attribution timestamp does not necessarily match any UTC date in the app data table. This means I cannot use a plain left join, as it would not assign campaign_name B to ID 1. Does anyone know an elegant way to solve this problem?
Found a solution! Here is what I did (and a little bit more sample data):
WITH app_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(1, DATE('2021-01-12'), 0),
(1, DATE('2021-01-15'), 0),
(1, DATE('2021-01-16'), 15),
(1, DATE('2021-01-18'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0),
(2, DATE('2021-01-15'), 4),
(2, DATE('2021-02-01'), 0),
(2, DATE('2021-02-08'), 8),
(2, DATE('2021-02-15'), 0),
(2, DATE('2021-03-04'), 0),
(2, DATE('2021-03-06'), 12),
(3, DATE('2021-02-15'), 10),
(3, DATE('2021-02-23'), 5),
(3, DATE('2021-03-25'), 0),
(3, DATE('2021-03-30'), 0)])
),
advertisment_attribution_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, campaign_name STRING>>
[(1, DATE(TIMESTAMP('2021-01-01 09:54:31')), "A"),
(1, DATE(TIMESTAMP('2021-01-09 22:32:51')), "B"),
(1, DATE(TIMESTAMP('2021-01-17 14:30:05')), "C"),
(2, DATE(TIMESTAMP('2021-01-03 19:12:11')), "A"),
(1, DATE(TIMESTAMP('2021-01-15 18:17:57')), "B"),
(3, DATE(TIMESTAMP('2021-03-14 22:32:51')), "C")])
)
SELECT
t1.*,
IFNULL(LAST_VALUE(t2.campaign_name IGNORE NULLS) OVER (PARTITION BY t1.adid ORDER BY t1.utc_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), "Organic") as campaign_name
FROM
app_data t1
LEFT JOIN
advertisment_attribution_data t2
ON t1.adid = t2.adid
AND t1.utc_date = (SELECT MIN(t3.utc_date) FROM app_data t3 WHERE t2.adid=t3.adid AND t2.utc_date <= t3.utc_date)
EDIT
It doesn't work when I select a real table in app_data. It says: Unsupported subquery with table in join predicate.
EDIT 2
Found a way around the restriction on subqueries in join predicates (apparently they are allowed when the tables are not selected from an existing table...). This version works in any case:
WITH app_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(1, DATE('2021-01-12'), 0),
(1, DATE('2021-01-15'), 0),
(1, DATE('2021-01-16'), 15),
(1, DATE('2021-01-18'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0),
(2, DATE('2021-01-15'), 4),
(2, DATE('2021-02-01'), 0),
(2, DATE('2021-02-08'), 8),
(2, DATE('2021-02-15'), 0),
(2, DATE('2021-03-04'), 0),
(2, DATE('2021-03-06'), 12),
(3, DATE('2021-02-15'), 10),
(3, DATE('2021-02-23'), 5),
(3, DATE('2021-03-25'), 0),
(3, DATE('2021-03-30'), 0)])
),
advertisment_attribution_data AS
(
SELECT
*,
(
SELECT
MIN(t2.utc_date)
FROM app_data t2
WHERE t1.adid=t2.adid
AND t1.utc_date <= t2.utc_date
) as attribution_join_date -- the closest app_data date on or after the attribution date for this adid, so that the join later on finds a match
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, campaign_name STRING>>
[(1, DATE(TIMESTAMP('2021-01-01 09:54:31')), "A"),
(1, DATE(TIMESTAMP('2021-01-09 22:32:51')), "B"),
(1, DATE(TIMESTAMP('2021-01-17 14:30:05')), "C"),
(2, DATE(TIMESTAMP('2021-01-03 19:12:11')), "A"),
(1, DATE(TIMESTAMP('2021-01-15 18:17:57')), "B"),
(3, DATE(TIMESTAMP('2021-03-14 22:32:51')), "C")]) t1
)
SELECT
t1.*,
IFNULL(LAST_VALUE(t2.campaign_name IGNORE NULLS) OVER (PARTITION BY t1.adid ORDER BY t1.utc_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 'Organic') as campaign_name
FROM
app_data t1
LEFT JOIN
advertisment_attribution_data t2
ON t1.adid = t2.adid
AND t1.utc_date = t2.attribution_join_date

SQL : how to exclude rows from count based on a public/private category?

I have two tables (products and categories) with a many-to-many link table (products_categories). The query that I want to build should return only the products that belong to 5 or more public categories. Private categories have a '0' in the 'public' column of the 'categories' table, public categories have a '1'.
I can't find a way to ignore the private categories in the count. From my testing data, only Shovel and Lighter should make the cut. For the moment I get Motorbike, Shovel, Basketball, Football, Tennisball, Pickaxe and Lighter, because they belong to 5 or more categories (public and private).
The tables :
CREATE TABLE products(
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(25),
price INT,
created_at DATE,
PRIMARY KEY(id)
);
CREATE TABLE categories(
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(25),
public BIT NOT NULL DEFAULT 0,
PRIMARY KEY (id)
);
CREATE TABLE products_categories(
product_id INT,
category_id INT,
FOREIGN KEY (product_id) REFERENCES products(id),
FOREIGN KEY (category_id) REFERENCES categories(id)
);
The query :
SELECT products.id, products.name, COUNT(categories.id)
FROM products
INNER JOIN products_categories ON products.id = products_categories.product_id
INNER JOIN categories ON categories.id = products_categories.category_id
GROUP BY products.id
HAVING COUNT(categories.id) >= 5
Data for testing :
INSERT INTO categories VALUES
(1, 'Small', b'1'),
(2, 'Medium', b'1'),
(3, 'Large', b'1'),
(4, 'One-size', b'1'),
(5, 'Tool', b'1'),
(6, 'Sport', b'1'),
(7, 'Ball', b'1'),
(8, 'Camping', b'1'),
(9, 'Food', b'1'),
(10, 'Non-food', b'1'),
(11, 'High-return', b'0'),
(12, 'Low-return', b'0'),
(13, 'Dangerous', b'0');
INSERT INTO products VALUES
(1, 'Bicycle', 50, '2021-03-02'),
(2, 'Motorbike', 100, '2021-03-02'),
(3, 'Shovel', 10, '2021-03-02'),
(4, 'Skis', 20, '2021-03-02'),
(5, 'Tent-S', 20, '2021-03-02'),
(6, 'Tent-M', 30, '2021-03-02'),
(7, 'Tent-L', 30, '2021-03-02'),
(8, 'Basketball', 5, '2021-03-02'),
(9, 'Football', 5, '2021-03-02'),
(10, 'Tennisball', 2, '2021-03-02'),
(11, 'Pickaxe', 15, '2021-03-02'),
(12, 'Lighter', 1, '2021-03-02'),
(13, 'Bottle-S', 2, '2021-03-02'),
(14, 'Bottle-M', 3, '2021-03-02'),
(15, 'Bottle-L', 4, '2021-03-02');
INSERT INTO products_categories VALUES
(1, 4),
(1, 6),
(1, 10),
(1, 11),
(2, 4),
(2, 6),
(2, 10),
(2, 11),
(2, 13),
(3, 4),
(3, 5),
(3, 8),
(3, 10),
(3, 12),
(3, 13),
(4, 4),
(4, 6),
(4, 10),
(4, 11),
(5, 1),
(5, 8),
(5, 10),
(5, 12),
(6, 2),
(6, 8),
(6, 10),
(6, 12),
(7, 3),
(7, 8),
(7, 10),
(7, 12),
(8, 4),
(8, 6),
(8, 7),
(8, 10),
(8, 12),
(9, 4),
(9, 6),
(9, 7),
(9, 10),
(9, 12),
(10, 4),
(10, 6),
(10, 7),
(10, 10),
(10, 12),
(11, 4),
(11, 5),
(11, 10),
(11, 12),
(11, 13),
(12, 4),
(12, 5),
(12, 8),
(12, 10),
(12, 11),
(12, 13),
(13, 1),
(13, 8),
(13, 9),
(13, 12),
(14, 2),
(14, 8),
(14, 9),
(14, 12),
(15, 3),
(15, 8),
(15, 9),
(15, 12);
You would seem to want:
SELECT p.id, p.name, COUNT(*)
FROM products p JOIN
products_categories pc
ON p.id = pc.product_id JOIN
categories c
ON c.id = pc.category_id
WHERE c.public = 1
GROUP BY p.id, p.name
HAVING COUNT(*) >= 5
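If you also want to keep the total number of categories in the result, a variation is conditional aggregation instead of the WHERE filter. This is only a sketch against the tables from the question; the column aliases public_categories and all_categories are made up for illustration.

```sql
SELECT p.id, p.name,
       -- Count only the rows whose category is public.
       SUM(CASE WHEN c.public = 1 THEN 1 ELSE 0 END) AS public_categories,
       -- Count every linked category, public or private.
       COUNT(*) AS all_categories
FROM products p
JOIN products_categories pc ON p.id = pc.product_id
JOIN categories c ON c.id = pc.category_id
GROUP BY p.id, p.name
-- Keep products with 5 or more public categories.
HAVING SUM(CASE WHEN c.public = 1 THEN 1 ELSE 0 END) >= 5;
```

Because private rows are not filtered out before grouping, both counts are available side by side, which makes it easy to check why a given product does or does not pass the threshold.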

Subqueries With Multiple Tables

Good day, I need your help with my Vehicle Inspection Database. The structure is shown below and is also available at http://sqlfiddle.com/#!3/4ab7e . What I need is the number of vehicles with at least one (1) defect or violation per PROJECT. In the schema below, the total for Project 4 = two (2) vehicles and Project 9 = 1 vehicle.
Columns Needed are [Project_Name],[Vehicle_Type],[yy],[mm],[Total]
-- Vehicle Inspection Database --
-- Vehicle_Type Table
CREATE TABLE VehicleType
([VehicleTypeId] int,
[Type] varchar (36));
INSERT INTO VehicleType ([VehicleTypeId],[Type])
VALUES (1, 'Light Vehicle'),
(2, 'Tanker'),
(3, 'Goods');
-- Car Table
CREATE TABLE Vehicle
([VehicleID] varchar(36),
[PlateNo] varchar(36),
[VehicleTypeId] int,
[Project] int);
INSERT INTO Vehicle ([VehicleID], [PlateNo],[VehicleTypeId], [Project])
VALUES('A57D4151-BD49-4B44-AF10-000F1C298E05', '8112AG', 1, 4),
('C7095628-AE88-4DD0-A4FD-00363EAB767F', '60070 AD2', 2, 9),
('E714CCD7-E56C-46A8-89D5-003CA5BF6094', '68823 AD1', 3, 9);
-- Event Table
CREATE TABLE Event
([EventID] int,
[VehicleID] varchar(36),
[EventTime] smalldatetime,
[TicketStatus] varchar (10)) ;
INSERT INTO Event([EventID], [VehicleID], [EventTime], TicketStatus)
VALUES (1, 'A57D4151-BD49-4B44-AF10-000F1C298E05', '20130701', 'Open'),
(2, 'A57D4151-BD49-4B44-AF10-000F1C298E05', '20130702', 'Close'),
(3, 'A57D4151-BD49-4B44-AF10-000F1C298E05', '20130703', 'Close'),
(4, 'C7095628-AE88-4DD0-A4FD-00363EAB767F','20130705', 'Open'),
(5, 'C7095628-AE88-4DD0-A4FD-00363EAB767F','20130710', 'Open');
-- Event_Defects Table
CREATE TABLE EventDefects
([EventDefectsID] int,
[EventID] int,
[Status] varchar(15),
[DefectID] int) ;
INSERT INTO EventDefects ([EventDefectsID], [EventID], [Status], [DefectID])
VALUES
-- 1st Inspection for PlateNo. 8112AG
(1, 1, 'YES', 1),
(2, 1, 'NO', 2),
(3, 1, 'YES',3),
(4, 1, 'N/A', 4),
(5, 1, 'N/A', 5),
-- 2nd Inspection for PlateNo. 8112AG
(6, 2, 'NO', 1),
(7, 2, 'NO', 2),
(8, 2, 'NO', 3),
(9, 2, 'N/A', 4),
(10,2, 'N/A', 5),
-- 3rd Inspection for PlateNo. 8112AG
(11, 3, 'NO', 1),
(12, 3, 'NO', 2),
(13, 3, 'NO', 3),
(14, 3, 'NO', 4),
(15, 3, 'NO', 5),
-- 1st Inspection for PlateNo. 60070 AD2
(16, 3, 'NO', 1),
(17, 3, 'NO', 2),
(18, 3, 'NO', 3),
(19, 3, 'N/A', 4),
(20, 3, 'N/A', 5);
-- Defects Table
CREATE TABLE Defects
([DefectID] int,
[DefectsName] varchar (36),
[DefectClassID] int) ;
INSERT INTO Defects ([DefectID], [DefectsName], [DefectClassID])
VALUES (1, 'TYRE', 1),
(2, 'BRAKING SYSTEM', 1),
(3, 'MIRRORS AND WINDSCREEN', 2),
(4, 'OVER SPEEDING', 3),
(5, 'NOT WEARING SEATBELTS', 3);
-- Defect_Class Table
CREATE TABLE DefectClass
([Description] varchar (15),
[DefectClassID] int) ;
INSERT INTO DefectClass ([DefectClassID], [Description])
VALUES (1, 'CATEGORY A'),
(2, 'CATEGORY B'),
(3, 'CATEGORY C');
Do all the joins as inner joins; that will eliminate vehicles that have no matching events or defects.
Can you check if this works for you?
SELECT Vehicle.VehicleID from Vehicle
INNER JOIN Event ON Vehicle.VehicleID = Event.VehicleID
INNER JOIN EventDefects ON EventDefects.EventID = Event.EventID
INNER JOIN Defects ON EventDefects.DefectID = Defects.DefectID
GROUP BY Vehicle.VehicleID;
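To get closer to the per-project total with the requested columns, the same inner joins can be extended roughly as follows. This is a sketch under assumptions that the question does not confirm: a Status of 'YES' in EventDefects is taken to mean a defect was recorded, and since the schema has no project name table, the numeric Project column stands in for Project_Name.

```sql
SELECT v.Project,
       vt.[Type] AS Vehicle_Type,
       YEAR(e.EventTime) AS yy,
       MONTH(e.EventTime) AS mm,
       -- Each vehicle counts once per group, however many defects it has.
       COUNT(DISTINCT v.VehicleID) AS Total
FROM Vehicle AS v
INNER JOIN VehicleType AS vt ON vt.VehicleTypeId = v.VehicleTypeId
INNER JOIN [Event] AS e ON e.VehicleID = v.VehicleID
INNER JOIN EventDefects AS ed ON ed.EventID = e.EventID
WHERE ed.[Status] = 'YES'  -- assumption: 'YES' marks a recorded defect
GROUP BY v.Project, vt.[Type], YEAR(e.EventTime), MONTH(e.EventTime);
```

COUNT(DISTINCT ...) matters here because the join to EventDefects produces one row per defect, so a plain COUNT(*) would count defects rather than vehicles.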