In BigQuery, can I add rows of missing data? [duplicate]

This question already has answers here:
How to add records for each user based on another existing row in BigQuery?
(3 answers)
Closed 2 years ago.
I have a table where each row represents the number of transactions a user has per day. If a user had no transactions on a given day, there is no row for that date. How can I add these 'missing rows' and set the number of transactions to 0?
My table:
Date | User | numTransactions
2020-01-01 | anna | 2
2020-01-01 | john | 3
2020-01-02 | anna | 1
2020-01-04 | anna | 1
2020-01-05 | john | 2
Anna had transactions on Jan 1, 2, and 4, but not on Jan 3 and 5.
John had transactions on Jan 1 and 5, but not on Jan 2, 3, and 4.
I want to add rows for the dates on which a user had 0 transactions:
Date | User | numTransactions
2020-01-01 | anna | 2
2020-01-01 | john | 3
2020-01-02 | anna | 1
2020-01-04 | anna | 1
2020-01-05 | john | 2
2020-01-02 | john | 0
2020-01-03 | anna | 0
2020-01-03 | john | 0
2020-01-04 | john | 0
2020-01-05 | anna | 0

You can join with GENERATE_DATE_ARRAY:
WITH test_table AS (
SELECT DATE '2020-01-01' AS Date, 'anna' AS User, 2 AS numTransactions UNION ALL
SELECT '2020-01-01', 'john', 3 UNION ALL
SELECT '2020-01-02', 'anna', 1 UNION ALL
SELECT '2020-01-04', 'anna', 1 UNION ALL
SELECT '2020-01-05', 'john', 2
),
clients_list AS (
SELECT DISTINCT User FROM test_table
)
SELECT
Date,
User,
IFNULL(numTransactions, 0) AS numTransactions
FROM UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2020-01-05')) AS Date
CROSS JOIN clients_list
LEFT JOIN test_table USING(Date, User)
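To check the logic outside BigQuery, here is a minimal Python sketch of the same idea: build every (date, user) combination, then look up the existing count and default to 0. The function name `fill_missing` is hypothetical, and the sample rows are copied from the question.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, user, numTransactions).
rows = [
    (date(2020, 1, 1), "anna", 2),
    (date(2020, 1, 1), "john", 3),
    (date(2020, 1, 2), "anna", 1),
    (date(2020, 1, 4), "anna", 1),
    (date(2020, 1, 5), "john", 2),
]


def fill_missing(rows, start, end):
    """Return one row per (date, user) in [start, end], with 0 where no row exists."""
    # Lookup table that plays the role of the LEFT JOIN.
    counts = {(d, u): n for d, u, n in rows}
    # Equivalent of SELECT DISTINCT User.
    users = sorted({u for _, u, _ in rows})
    # Equivalent of GENERATE_DATE_ARRAY(start, end).
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # CROSS JOIN of days and users, with IFNULL(..., 0) via dict.get.
    return [(d, u, counts.get((d, u), 0)) for d in days for u in users]


filled = fill_missing(rows, date(2020, 1, 1), date(2020, 1, 5))
for d, u, n in filled:
    print(d.isoformat(), u, n)
```

The sketch produces 5 days x 2 users = 10 rows, which is also why this approach can get expensive: the number of generated rows grows with users times days, regardless of how sparse the real data is.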

I recommend writing the code in this fashion:
WITH t AS (
SELECT DATE '2020-01-01' AS Date, 'anna' AS User, 2 AS numTransactions UNION ALL
SELECT '2020-01-01', 'john', 3 UNION ALL
SELECT '2020-01-02', 'anna', 1 UNION ALL
SELECT '2020-01-04', 'anna', 1 UNION ALL
SELECT '2020-01-05', 'john', 2
)
SELECT u.user, COALESCE(dte, u.date) as date,
(CASE WHEN dte = u.date THEN u.numTransactions ELSE 0 END) as numTransactions
FROM (SELECT user, date, numTransactions,
COALESCE(DATE_ADD(LEAD(DATE) OVER (PARTITION BY user ORDER BY date), INTERVAL -1 DAY), DATE '2020-01-05') as end_date
FROM t
) u LEFT JOIN
UNNEST(GENERATE_DATE_ARRAY(date, end_date, INTERVAL 1 DAY)) dte
ON 1=1
ORDER BY user, date;
This is slightly simpler than generating all the dates up-front (there is no need to collect the distinct user names and then join back to the same table).
Much more important are the performance characteristics, which in my experience matter a great deal for making this scale. The CROSS JOIN that generates all user/date combinations uses a lot of resources. This version keeps the operations "local" to a given user (although some data movement is needed to co-locate each user's rows on the same node).
Specifically, I have seen queries that run out of resources or literally take hours to complete using the CROSS JOIN method finish within a minute using this method.
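As a rough illustration of the per-user approach, the following Python sketch mirrors the LEAD-based query: each row is extended with zero rows up to the day before the same user's next row, or up to a fixed last day for the user's final row. The function name `fill_gaps_per_user` and the `last_day` parameter are hypothetical; the sample rows come from the question.

```python
from datetime import date, timedelta
from itertools import groupby

# Sample rows from the question: (date, user, numTransactions).
rows = [
    (date(2020, 1, 1), "anna", 2),
    (date(2020, 1, 1), "john", 3),
    (date(2020, 1, 2), "anna", 1),
    (date(2020, 1, 4), "anna", 1),
    (date(2020, 1, 5), "john", 2),
]


def fill_gaps_per_user(rows, last_day):
    """Emit each existing row, followed by zero rows for the gap after it."""
    out = []
    # Equivalent of PARTITION BY user ORDER BY date.
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    for user, grp in groupby(ordered, key=lambda r: r[1]):
        grp = list(grp)
        for i, (d, _, n) in enumerate(grp):
            # end_date: the day before this user's next row (LEAD), else last_day.
            if i + 1 < len(grp):
                end = grp[i + 1][0] - timedelta(days=1)
            else:
                end = last_day
            out.append((user, d, n))
            day = d + timedelta(days=1)
            while day <= end:
                out.append((user, day, 0))
                day += timedelta(days=1)
    return out


result = fill_gaps_per_user(rows, date(2020, 1, 5))
for user, d, n in result:
    print(user, d.isoformat(), n)
```

One property worth noting: this sketch, like the query it mirrors, only fills gaps after a user's first row. Dates before a user's first transaction are not generated. In the sample data both users have a row on the first day, so the result is the same 10 rows.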

Related

Oracle SQL - Get difference between dates based on check in and checkout records

Assume I have the following table data.
# | USER | Entrance | Transaction Date Time
-----------------------------------------------
1 | ALEX | INBOUND | 2020-01-01 10:20:00
2 | ALEX | OUTBOUND | 2020-01-02 10:00:00
3 | ALEX | INBOUND | 2020-01-04 11:30:00
4 | ALEX | OUTBOUND | 2020-01-07 15:00:00
5 | BEN | INBOUND | 2020-01-08 08:00:00
6 | BEN | OUTBOUND | 2020-01-09 09:00:00
I would like to know how many days the user stayed on each trip, and to count the trips by length of stay.
Each INBOUND/OUTBOUND pair is considered one trip; every trip that exceeds 24 hours is considered 2 days.
Below is my desired output:
No. of Days | Trips Count
----------------------------------
Stay < 1 day | 1
Stay 1 day | 1
Stay 2 days | 0
Stay 3 days | 0
Stay 4 days | 1

I would use lead() and aggregation. Assuming that the rows are properly interlaced:
select floor( (next_dt - dt) ) as num_days, count(*)
from (select t.*,
lead(dt) over (partition by user order by dt) as next_dt
from trips t
) t
where entrance = 'INBOUND'
group by floor( (next_dt - dt) )
order by num_days;
Note: This does not include the 0 rows. That does not seem central to your question and is a significant complication.
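For readers who want to verify the pairing logic, here is a small Python sketch of the same lead-and-floor idea, using the rows from the question. It assumes, as the query does, that INBOUND and OUTBOUND rows alternate per user; the function name `whole_days_per_trip` is hypothetical. It counts whole elapsed days only and does not attempt the labels of the desired output.

```python
from collections import Counter
from datetime import datetime

# Rows from the question: (user, entrance, timestamp).
events = [
    ("ALEX", "INBOUND", datetime(2020, 1, 1, 10, 20)),
    ("ALEX", "OUTBOUND", datetime(2020, 1, 2, 10, 0)),
    ("ALEX", "INBOUND", datetime(2020, 1, 4, 11, 30)),
    ("ALEX", "OUTBOUND", datetime(2020, 1, 7, 15, 0)),
    ("BEN", "INBOUND", datetime(2020, 1, 8, 8, 0)),
    ("BEN", "OUTBOUND", datetime(2020, 1, 9, 9, 0)),
]


def whole_days_per_trip(events):
    """Count trips by whole days between an INBOUND row and the user's next row."""
    by_user = {}
    for user, entrance, ts in sorted(events, key=lambda e: (e[0], e[2])):
        by_user.setdefault(user, []).append((entrance, ts))

    counts = Counter()
    for rows in by_user.values():
        # zip(rows, rows[1:]) pairs each row with the next one, like lead(dt).
        for (entrance, ts), (_, next_ts) in zip(rows, rows[1:]):
            if entrance == "INBOUND":
                # timedelta.days is the floor of the difference for positive spans.
                counts[(next_ts - ts).days] += 1
    return counts


counts = whole_days_per_trip(events)
for days in sorted(counts):
    print(days, counts[days])
```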

I still don't know what you mean by < 1 day, but this is how far I got:
Setup
create table trips (id number, name varchar2(10), entrance varchar2(10), ts TIMESTAMP);
insert into trips values( 1 , 'ALEX','INBOUND', TIMESTAMP '2020-01-01 10:20:00');
insert into trips values(2 , 'ALEX','OUTBOUND',TIMESTAMP '2020-01-02 10:00:00');
insert into trips values(3 , 'ALEX','INBOUND',TIMESTAMP '2020-01-04 11:30:00');
insert into trips values(4 , 'ALEX','OUTBOUND',TIMESTAMP '2020-01-07 15:00:00');
insert into trips values(5 , 'BEN','INBOUND',TIMESTAMP '2020-01-08 08:00:00');
insert into trips values(6 , 'BEN','OUTBOUND',TIMESTAMP '2020-01-09 07:00:00');
Query
select decode (t.days, 0 , 'Stay < 1 day', 1, 'Stay 1 day', 'Stay ' || t.days || ' days') Days , count(d.days) Trips_count
FROM (Select Rownum - 1 days From dual Connect By Rownum <= 6) t left join
(select extract (day from b.ts - a.ts) + 1 as days from trips a
inner join trips b on a.name = b.name
and a.entrance = 'INBOUND'
and b.entrance = 'OUTBOUND'
and a.ts < b.ts
and not exists (select ts from trips where entrance = 'OUTBOUND' and ts > a.ts and ts < b.ts)) d
on t.days = d.days
group by t.days order by t.days
Result
DAYS | TRIPS_COUNT
----------------|------------
Stay < 1 day | 0
Stay 1 day | 2
Stay 2 days | 0
Stay 3 days | 0
Stay 4 days | 1
Stay 5 days | 0
You could replace the 6 with a SELECT MAX(...) over the second subquery, repeated.

SQL query to answer: If <event 1> occurs in timepoint A, does <event 2> occur in time period B-C?

I'm querying a large data set to figure out whether campaign events (event 1, 2, ...) at different points in time result in user activity (active, inactive) during the 3 days following each event (but not on the same day as the campaign event itself).
I'm merging two tables to do this, and merged they look like this:
| date | user | events | day_activity |
| 2020-01-01 | 1 | event1 | active |
| 2020-01-01 | 2 | event1 | inactive |
| 2020-01-02 | 1 | null | inactive |
| 2020-01-02 | 2 | null | active |
| 2020-01-03 | 1 | null | inactive |
| 2020-01-03 | 2 | null | active |
| 2020-01-04 | 1 | null | active |
| 2020-01-04 | 2 | null | active |
What I am trying to achieve is, for each user/date/event combination (= row) where an event occurred, to add another column called 3_day_activity, containing the activity not on the event day (= current row) but on the following 3 days only (scoring 1 per active day). Here is how the table would look afterwards (I mark with * the activity days counted in the added column for user 1, and with # those counted for user 2):
| date | user | events | day_activity | 3_day_activity
| 2020-01-01 | 1 | event1 | active | 1
| 2020-01-01 | 2 | event1 | inactive | 3
| 2020-01-02 | 1 | null | inactive * (0)| null (bco no event)
| 2020-01-02 | 2 | null | active # (1) | null (bco no event)
| 2020-01-03 | 1 | null | inactive * (0)| null (bco no event)
| 2020-01-03 | 2 | null | active # (1) | null (bco no event)
| 2020-01-04 | 1 | null | active * (1) | null (bco no event)
| 2020-01-04 | 2 | null | active # (1) | null (bco no event)
I tried solving this with a window function. It runs, but I think I misunderstood some important idea on how to design it, because the result contains a ton of repetitions...
SELECT
cm.date,
cm.user,
event,
day_activity,
COUNTIF(active_today = 'active') OVER 3d_later AS 3_day_activity
FROM `customer_message` cm
INNER JOIN `customer_day` ud
ON cm.user = ud.user
AND cm.date = ud.date
WHERE
cm.date > '2019-12-25'
WINDOW 3d_later AS (PARTITION BY user ORDER BY UNIX_DATE(cm.date) RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING)
EDIT:
I was asked to supply an example of how this repetition might look. Here's what I see if I add an "ORDER BY 3_day_activity" clause at the end of the query:
Row date user day_activity 3_day_activity
1 2020-01-01 2 active 243
2 2020-01-01 2 active 243
3 2020-01-01 2 active 243
4 2020-01-01 2 active 243
5 2020-01-01 2 active 243
6 2020-01-01 2 active 243
7 2020-01-02 2 active 243
8 2020-01-02 2 active 243
EDIT2 :
This remains unsolved. I tried asking another question, as one commenter suggested, but I am blocked from doing so even though the problem is not identical (I suppose due to its similarity to this one). I have tested grouping by user and date, but it then throws an error because the COUNTIF expression is not aggregated.
This is the attempt mentioned: SQL: Error demanding aggregation when counting, grouping and windowing

Below example is for BigQuery Standard SQL
#standardSQL
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
You can test and experiment with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 2, NULL, 'active'
)
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user
with output
Row date user events day_activity three_day_activity
1 2020-01-01 1 event1 active 1
2 2020-01-01 2 event1 inactive 3
3 2020-01-02 1 null inactive 0
4 2020-01-02 2 null active 0
5 2020-01-03 1 null inactive 0
6 2020-01-03 2 null active 0
7 2020-01-04 1 null active 0
8 2020-01-04 2 null active 0
Update, in response to: how to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)?
If you want to count a user's activity only once per day, use the adjusted version below (note the extra entry in the sample data, which introduces multiple activity rows for one user on the same day):
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-01-01' DATE , 1 user, 'event1' events, 'active' day_activity UNION ALL
SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 2, NULL, 'active'
)
SELECT *,
IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM (
SELECT date, user, MAX(events) events, MIN(day_activity) day_activity
FROM `project.dataset.table`
GROUP BY date, user
)
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user

You are almost there. A RANGE frame is the right way to go. BigQuery only supports numeric ordering expressions in such a frame, so we need to convert the date to a number; since your dates have no time component, UNIX_DATE() comes to mind:
WINDOW 3d_later AS (
PARTITION BY user
ORDER BY UNIX_DATE(cm.date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
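To make the frame semantics concrete, here is a minimal Python sketch of what RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING computes per user, using the sample rows from the question. The function name `three_day_activity` is hypothetical. Rows without an event get None here, which is what the question asks for.

```python
from datetime import date

# Sample rows from the question: (date, user, events, day_activity).
rows = [
    (date(2020, 1, 1), 1, "event1", "active"),
    (date(2020, 1, 1), 2, "event1", "inactive"),
    (date(2020, 1, 2), 1, None, "inactive"),
    (date(2020, 1, 2), 2, None, "active"),
    (date(2020, 1, 3), 1, None, "inactive"),
    (date(2020, 1, 3), 2, None, "active"),
    (date(2020, 1, 4), 1, None, "active"),
    (date(2020, 1, 4), 2, None, "active"),
]


def three_day_activity(rows):
    """For each event row, count the same user's active rows 1 to 3 days later."""
    out = []
    for d, user, event, _ in rows:
        if event is None:
            out.append((d, user, None))
            continue
        score = sum(
            1
            for d2, user2, _, activity in rows
            # Same partition (user) and a date offset inside the frame [1, 3].
            if user2 == user and 1 <= (d2 - d).days <= 3 and activity == "active"
        )
        out.append((d, user, score))
    return out


scores = three_day_activity(rows)
for d, user, score in scores:
    print(d.isoformat(), user, score)
```

Because the frame counts rows rather than days, duplicate rows for the same user and date would each be counted, which is how a sum can grow far beyond 3 when the join multiplies rows.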

Self join next timestamp

I want to combine the timestamps from two different rows, matched on employee and punch card. MAX or a limit does not work in the join condition, and if I only use > I get every subsequent row for each row. I want only the next higher value from the self join. I also have to use SQL Server 2008, so LAG and LEAD are not available.
Please help me.
SELECT Det.name
,Det.[time]
,Det2.[time]
,Det.[type]
,det2.type
,Det.[detail]
FROM [detail] Det
join [detail] Det2 on
Det2.name = Det.name
and
Det2.time > Det.time Max 1
where det.type <>3
Table detail
NAME | Time | Type | detail
john | 10:30| 1 | On
steve| 10:32| 1 | On
john | 10:34| 2 | break
paul | 10:35| 1 | On
steve| 10:45| 3 | Off
john | 10:49| 2 | on
paul | 10:55| 3 | Off
john | 11:12| 3 | Off
Wanted result
John | 10:30 | 10:34 | 1 | 2 | On
John | 10:34 | 10:49 | 2 | 1 | Break
John | 10:49 | 11:12 | 1 | 3 | on
Steve| 10:32 | 10:45 | 1 | 3 | on
Paul | 10:35 | 10:55 | 1 | 3 | On
Thank you in advance!

You can do it with cross apply:
SELECT Det.name
,Det.[time]
,ca.[time]
,Det.[type]
,ca.type
,Det.[detail]
FROM [detail] Det
Cross Apply(Select Top 1 * From detail det2 Where det2.Name = det.Name And det2.[Time] > det.[Time] Order By det2.[Time]) ca
Where det.Type <> 3

As you said, LAG and LEAD won't work for you, but you could use ROW_NUMBER() OVER (PARTITION BY name ORDER BY time) on both copies of the table and then JOIN on rn2 = rn1 + 1.
This is just an idea, but I don't see why it shouldn't work.
Query:
;WITH Data (NAME, TIME, type, detail)
AS (
SELECT 'john', CAST('10:30' AS DATETIME2), 1, 'On'
UNION ALL
SELECT 'steve', '10:32', 1, 'On'
UNION ALL
SELECT 'john', '10:34', 2, 'break'
UNION ALL
SELECT 'paul', '10:35', 1, 'On'
UNION ALL
SELECT 'steve', '10:45', 3, 'Off'
UNION ALL
SELECT 'john', '10:49', 2, 'on'
UNION ALL
SELECT 'paul', '10:55', 3, 'Off'
UNION ALL
SELECT 'john', '11:12', 3, 'Off'
)
SELECT t.NAME, LTRIM(RIGHT(CONVERT(VARCHAR(25), t.TIME, 100), 7)) AS time, LTRIM(RIGHT(CONVERT(VARCHAR(25), t2.TIME, 100), 7)) AS time, t.type, t2.type, t.detail
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY TIME) rn, *
FROM Data
) AS t
INNER JOIN (
SELECT ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY TIME) rn, *
FROM Data
) AS t2
ON t2.NAME = t.NAME
AND t2.rn = t.rn + 1;
Result:
NAME time time type type detail
----------------------------------------------
john 10:30AM 10:34AM 1 2 On
john 10:34AM 10:49AM 2 2 break
john 10:49AM 11:12AM 2 3 on
paul 10:35AM 10:55AM 1 3 On
steve 10:32AM 10:45AM 1 3 On
Any comments, concerns - let me know. :)
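The row-number join boils down to pairing each row with the next row of the same name. A short Python sketch of that pairing, using the rows from the question (times kept as HH:MM strings, which sort correctly here), can help confirm the expected result. The function name `pair_with_next` is hypothetical.

```python
from itertools import groupby

# Rows from the question: (name, time, type, detail).
detail = [
    ("john", "10:30", 1, "On"),
    ("steve", "10:32", 1, "On"),
    ("john", "10:34", 2, "break"),
    ("paul", "10:35", 1, "On"),
    ("steve", "10:45", 3, "Off"),
    ("john", "10:49", 2, "on"),
    ("paul", "10:55", 3, "Off"),
    ("john", "11:12", 3, "Off"),
]


def pair_with_next(rows):
    """Pair every row with the next row (by time) that has the same name."""
    pairs = []
    # Equivalent of ROW_NUMBER() OVER (PARTITION BY name ORDER BY time).
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, grp in groupby(ordered, key=lambda r: r[0]):
        grp = list(grp)
        # zip(grp, grp[1:]) is the join on t2.rn = t.rn + 1.
        for cur, nxt in zip(grp, grp[1:]):
            pairs.append((cur[0], cur[1], nxt[1], cur[2], nxt[2], cur[3]))
    return pairs


pairs = pair_with_next(detail)
for p in pairs:
    print(*p)
```

The last row of each name has no successor and drops out, just as it does with the INNER JOIN.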

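As @evaldas-buinauskas said, the OVER clause will work for you. Note that LAG and LEAD were introduced in SQL Server 2012, so they are an option only after an upgrade from 2008.
Here is an article on those functions: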
http://www.databasejournal.com/features/mssql/lead-and-lag-functions-in-sql-server-2012.html

How to display SQL query using Group by like this?

I'll list only the one table I need to query:
lodgings_Contract:
id_contract identity primary,
id_person int,
id_room varchar(4),
day_begin datetime,
day_end datetime,
day_register datetime,
money_per_month money
These are the values in lodgings_Contract (this data is for example purposes only):
id_contract | id_person | id_room | day_begin | day_end | day_register | money_per_month
3 | 2 | 101 | 1/12/2014 | 27/2/2015 | 1/12/2015 | 100
2 | 1 | 102 | 1/1/2014 | 27/4/2014 | 1/1/2014 | 200
1 | 3 | 103 | 1/1/2014 | 27/3/2014 | 1/1/2014 | 300
Person 1 rents room 102 for 4 months in 2014 at 200/month. Person 2 rents room 101 for 3 months (1 month in 2014 and 2 months in 2015) at 100/month. Person 3 rents room 103 for 3 months in 2014 at 300/month.
I want my result to display 3 fields: Month | Year | Incomes
Result :
Month | Year | Incomes
1 |2014| 500
2 |2014| 500
3 |2014| 500
4 |2014| 200
12 |2014| 100
1 |2015| 100
2 |2015| 100
Can I do that? Please help!
I posted another question before this one, but it was complicated and required 3 tables, so I made this one with only 1 table.
This is my code:
select month(day_begin) as 'Month', year(day_begin) as 'Year', money_per_month as 'Incomes'
from lodgings_Contract
group by day_begin, money_per_month
It only lists the first month of each contract (the month of day_begin). I have no idea how to do it right.

To get the results you first need a calendar table; in the following query it is created on the fly with a CTE.
That said, what is the purpose of the column day_register? It seems to be a copy of day_begin, probably with a typo for the contract with ID 3.
WITH Months(N) AS (
SELECT 1 UNION ALL Select 2 UNION ALL Select 3 UNION ALL Select 4
UNION ALL Select 5 UNION ALL Select 6 UNION ALL Select 7 UNION ALL Select 8
UNION ALL Select 9 UNION ALL Select 10 UNION ALL Select 11 UNION ALL Select 12
), Calendar(N) As (
SELECT CAST(2010 + y.N AS VARCHAR) + RIGHT('00' + Cast(m.N AS VARCHAR), 2)
FROM Months m
CROSS JOIN Months y
)
SELECT RIGHT(c.N, 2) [Month]
, LEFT(c.N, 4) [Year]
, SUM(money_per_month) Incomes
FROM lodgings_Contract lc
INNER JOIN Calendar c
ON c.N BETWEEN CONVERT(VARCHAR(6), lc.day_begin, 112)
AND CONVERT(VARCHAR(6), lc.day_end, 112)
GROUP BY c.N
The calendar CTE is small because I don't know how many years the real data covers. If there are many years, it is better to create a calendar table in your DB and use it instead of calculating it every time.
The calendar CTE returns a list of months in the format yyyyMM (here for the years 2011 to 2022).
In the main query, CONVERT(VARCHAR(6), lc.day_begin, 112) converts day_begin to the ISO format yyyyMMdd and keeps only the first six characters, so again yyyyMM; for example, for id_contract 3 we get 201412. The same applies to day_end.
If the contract actually begins on day_register, change lc.day_begin to lc.day_register.
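The join against the calendar effectively expands each contract into one row per month it covers and then sums per month. A minimal Python sketch of that expansion, using the three contracts from the question, reproduces the expected result. The function name `monthly_income` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Contracts from the question: (day_begin, day_end, money_per_month).
contracts = [
    (date(2014, 12, 1), date(2015, 2, 27), 100),
    (date(2014, 1, 1), date(2014, 4, 27), 200),
    (date(2014, 1, 1), date(2014, 3, 27), 300),
]


def monthly_income(contracts):
    """Sum money_per_month over every (year, month) each contract touches."""
    totals = defaultdict(int)
    for begin, end, per_month in contracts:
        year, month = begin.year, begin.month
        # Walk month by month from the begin month to the end month, inclusive,
        # which is what the BETWEEN on yyyyMM strings selects.
        while (year, month) <= (end.year, end.month):
            totals[(year, month)] += per_month
            year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return dict(sorted(totals.items()))


income = monthly_income(contracts)
for (year, month), amount in income.items():
    print(month, year, amount)
```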

SQL duration between two dates in different rows

I would really appreciate it if somebody could help me construct an MS SQL Server 2000 query that returns the duration between a customer's A entry and their B entry.
Not all customers are expected to have a B record; for those customers no result should be returned.
Customers Audit
+---+---------------+---+----------------------+
| 1 | Peter Griffin | A | 2013-01-01 15:00:00 |
| 2 | Martin Biggs | A | 2013-01-02 15:00:00 |
| 3 | Peter Griffin | C | 2013-01-05 09:00:00 |
| 4 | Super Mario | A | 2013-01-01 15:00:00 |
| 5 | Martin Biggs | B | 2013-01-03 18:00:00 |
+---+---------------+---+----------------------+
I'm hoping for results similar to:
+--------------+----------------+
| Martin Biggs | 1 day, 3 hours |
+--------------+----------------+

Something like the below (don't know your schema, so you'll need to change names of objects) should suffice.
SELECT ABS(DATEDIFF(HOUR, CA.TheDate, CB.TheDate)) AS HoursBetween
FROM dbo.Customers CA
INNER JOIN dbo.Customers CB
ON CB.Name = CA.Name
AND CB.Code = 'B'
WHERE CA.Code = 'A'
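The query above returns whole hours; turning that into the "1 day, 3 hours" text from the question is a matter of splitting the difference into days and remaining hours. Here is a small Python sketch of the full logic on the question's sample rows. The function names `a_to_b_durations` and `plural` are hypothetical, and it assumes at most one A and one B entry per customer.

```python
from datetime import datetime

# Rows from the question: (id, customer, code, entry time).
audit = [
    (1, "Peter Griffin", "A", datetime(2013, 1, 1, 15, 0)),
    (2, "Martin Biggs", "A", datetime(2013, 1, 2, 15, 0)),
    (3, "Peter Griffin", "C", datetime(2013, 1, 5, 9, 0)),
    (4, "Super Mario", "A", datetime(2013, 1, 1, 15, 0)),
    (5, "Martin Biggs", "B", datetime(2013, 1, 3, 18, 0)),
]


def plural(n, word):
    """Format a count with a singular or plural unit, e.g. '1 day', '3 hours'."""
    return f"{n} {word}{'' if n == 1 else 's'}"


def a_to_b_durations(audit):
    """Return {customer: 'X days, Y hours'} for customers with both an A and a B entry."""
    entries = {}
    for _, name, code, ts in audit:
        if code in ("A", "B"):
            # Keep the first entry seen per (customer, code).
            entries.setdefault((name, code), ts)

    durations = {}
    for (name, code), a_ts in entries.items():
        # Inner-join behaviour: customers without a B entry are skipped.
        if code == "A" and (name, "B") in entries:
            delta = abs(entries[(name, "B")] - a_ts)
            hours = delta.seconds // 3600
            durations[name] = f"{plural(delta.days, 'day')}, {plural(hours, 'hour')}"
    return durations


durations = a_to_b_durations(audit)
for name, text in durations.items():
    print(name, "|", text)
```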

SELECT A.CUSTOMER, DATEDIFF(HOUR, A.ENTRY_DATE, B.ENTRY_DATE) DURATION
FROM CUSTOMERSAUDIT A, CUSTOMERSAUDIT B
WHERE B.CUSTOMER = A.CUSTOMER AND B.ENTRY_DATE > A.ENTRY_DATE

This is an Oracle query. Note that LAG and EXTRACT are not available in SQL Server 2000, so it would need to be adapted there (for example with a self-join and DATEDIFF). The values in the output are in separate columns (days, hours, and so on), which you can concatenate to get the desired result:
SELECT id, name, grade
, NVL(EXTRACT(DAY FROM day_time_diff), 0) days
, NVL(EXTRACT(HOUR FROM day_time_diff), 0) hours
, NVL(EXTRACT(MINUTE FROM day_time_diff), 0) minutes
, NVL(EXTRACT(SECOND FROM day_time_diff), 0) seconds
FROM
(
SELECT id, name, grade
, (begin_date-end_date) day_time_diff
FROM
(
SELECT id, name, grade
, CAST(start_date AS TIMESTAMP) begin_date
, CAST(end_date AS TIMESTAMP) end_date
FROM
(
SELECT id, name, grade, start_date
, LAG(start_date, 1, to_date(null)) OVER (ORDER BY id) end_date
FROM stack_test
)
)
)
/
Output:
ID NAME GRADE DAYS HOURS MINUTES SECONDS
------------------------------------------------------------
1 Peter Griffin A 0 0 0 0
2 Martin Biggs A 1 1 0 0
3 Peter Griffin C 2 17 0 0
4 Super Mario A -3 -18 0 0
5 Martin Biggs A 2 3 0 0
The table structure/columns I used - it would be great if you took care of this and data in advance:
CREATE TABLE stack_test
(
id NUMBER
,name VARCHAR2(50)
,grade VARCHAR2(3)
,start_date DATE
)
/
