Calculate probability of previous occurrences in BigQuery - google-bigquery

I have data in the following format:
(Table A)

Site    Day  Event Type
-----------------------
Site A  Mon  Event 1
Site A  Mon  Event 2
Site A  Tue  Event 1
Site A  Tue  Event 3
Site A  Wed  Event 3
Site A  Wed  Event 1
Site B  Wed  Event 1
I now want to create a "full" table equivalent of the above (i.e. every permutation of Site, Day and Event Type present in the new table), with the calculated probability of each event happening across Mon - Wed. For example, in Site A, Event 1 has a probability of 1.00 because it occurred on every day, while Event 2 has a probability of only 0.33 because it happened only on Mon (and Event 3, which occurred on two of the three days, gets 0.66). Similarly, for Site B only Event 1 has a probability of 0.33, while Events 2 and 3 have probabilities of 0.
(Table B)

Site    Day  Event Type  P
---------------------------
Site A  Mon  Event 1     1.00
Site A  Mon  Event 2     0.33
Site A  Mon  Event 3     0.66
Site A  Tue  Event 1     1.00
Site A  Tue  Event 2     0.33
Site A  Tue  Event 3     0.66
Site A  Wed  Event 1     1.00
Site A  Wed  Event 2     0.33
Site A  Wed  Event 3     0.66
Site B  Mon  Event 1     0.33
Site B  Mon  Event 2     0.00
Site B  Mon  Event 3     0.00
Site B  Tue  Event 1     0.33
Site B  Tue  Event 2     0.00
Site B  Tue  Event 3     0.00
Site B  Wed  Event 1     0.33
Site B  Wed  Event 2     0.00
Site B  Wed  Event 3     0.00
How can I do this efficiently in BigQuery? I have not been able to implement it successfully so far. Thanks

Consider the approach below:
select Site, Day, EventType,
  -- fraction of days on which this (Site, EventType) combination actually occurred
  round(countif(format('%t', t) != '(NULL, NULL, NULL)') over win / count(*) over win, 2) p
from (select distinct Site from your_table),         -- all sites
     (select distinct Day from your_table),          -- cross joined with all days
     (select distinct EventType from your_table)     -- cross joined with all event types
left join your_table t
using (Site, Day, EventType)
window win as (partition by Site, EventType)
# order by Site, Day, EventType
If applied to the sample data in your question, the output matches Table B above.
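The format('%t', t) != '(NULL, NULL, NULL)' test is a compact way of checking whether the left join found a matching row (the whole row struct t prints as nulls when it did not). If you prefer something more explicit, an equivalent formulation is possible -- a sketch, assuming the same your_table schema as above:

select s.Site, d.Day, e.EventType,
  -- a non-null t.Site means this (Site, Day, EventType) combination exists in the data
  round(countif(t.Site is not null) over win / count(*) over win, 2) as p
from (select distinct Site from your_table) s
cross join (select distinct Day from your_table) d
cross join (select distinct EventType from your_table) e
left join your_table t
  on t.Site = s.Site and t.Day = d.Day and t.EventType = e.EventType
window win as (partition by s.Site, e.EventType)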

Related

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month. I'm not sure if I have found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but there is a part of me that does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views which means it will get run all the time.
To illustrate, my data is of this form:
mytable

userId  loginDate   year  month  value
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
    SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
    FROM mytable
    GROUP BY "userId", "year", "month"   -- one row per user per month
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and a window function (ROW_NUMBER):
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER (PARTITION BY userId, year, month
                           ORDER BY loginDate DESC) = 1
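QUALIFY filters on the window function directly, so the table is scanned once and no join back is needed. If you would rather avoid ranking altogether, an aggregate-only variant may also work -- a sketch, assuming your Snowflake account supports the MIN_BY/MAX_BY aggregates (quoted identifiers as in the question):

SELECT "userId", "year", "month",
       MAX("loginDate") AS "loginDate",
       -- take the value from the row with the latest loginDate in each group
       MAX_BY("value", "loginDate") AS "value"
FROM mytable
GROUP BY "userId", "year", "month";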

Reverse-track forced record relationships based on user-defined tagging

I have this table where the tagging [Tag_To] is updated by an algorithm based on Year and Period of coverage. My current task (in question) is to update the Status given the Year.
ID Year Method Period_From Period_To SeqNo Tag_To Status
-----------------------------------------------------------------------------------
10 2019 A 2019-01-01 2019-12-31 1
11 2019 B 2019-01-01 2019-06-30 2 1
12 2019 B 2019-07-01 2019-12-31 3 1
13 2019 C 2019-01-01 2019-06-30 4 2
14 2020 A 2020-01-01 2020-12-31 1
15 2020 B 2020-01-01 2020-06-30 2 1
16 2020 B 2020-07-01 2020-12-31 3 1
17 2020 C 2020-01-01 2020-12-31 4 2,3
18 2021 A 2021-01-01 2021-12-31 1
19 2021 B 2021-01-01 2021-12-31 2 1
20 2021 C 2021-07-01 2021-12-31 3 2
The SeqNo is applied per Year, and the Tag_To is assigned based on period of coverage.
11 and 12 are tagged to 10 since B follows A and their periods fall within 10's period of coverage.
13 is tagged to 11 since C follows B and the period...
15 and 16 are tagged to 14.
Also note that 17 is tagged to both 15 and 16 (2,3) because 17's coverage spans the two periods of 15 and 16 combined,
and so on...
The objective is to update the Status by Year such that a path is considered Closed if it already contains Methods A, B and C (there are actually more methods, but I am simplifying). Status should be Open for paths that haven't completed all the methods.
From the example above, there are 5 paths:
10(A)-->11(B)-->13(C) = Closed
10(A)-->12(B)-->??? = Open
14(A)-->15(B)-->17(C) = Closed
14(A)-->16(B)-->17(C) = Closed
18(A)-->19(B)-->20(C) = Closed
Therefore the status update should be:
ID Year Method Period_From Period_To SeqNo Tag_To Status
-----------------------------------------------------------------------------------
10 2019 A 2019-01-01 2019-12-31 1 Open
11 2019 B 2019-01-01 2019-06-30 2 1 Closed
12 2019 B 2019-07-01 2019-12-31 3 1 Open
13 2019 C 2019-01-01 2019-06-30 4 2 Closed
14 2020 A 2020-01-01 2020-12-31 1 Closed
15 2020 B 2020-01-01 2020-06-30 2 1 Closed
16 2020 B 2020-07-01 2020-12-31 3 1 Closed
17 2020 C 2020-01-01 2020-12-31 4 2,3 Closed
18 2021 A 2021-01-01 2021-12-31 1 Closed
19 2021 B 2021-01-01 2021-12-31 2 1 Closed
20 2021 C 2021-07-01 2021-12-31 3 2 Closed
I hope I have explained everything clearly. Would really appreciate if anyone could help.
Just to update viewers: I have managed to solve this on my own. Although the solution is super non-dynamic and quite inefficient, it pretty much did the job for me. Here's what I did.
UPDATE Table SET
    Status =
        CASE WHEN Method = 'B'
                  AND NOT EXISTS (SELECT * FROM Table P INNER JOIN
                                  (
                                      SELECT VALUE AS Tag_To
                                      FROM Table AV
                                      CROSS APPLY STRING_SPLIT(AV.Tag_To, ',')
                                      WHERE AV.Method = 'C'
                                  ) C ON P.Sequence_No = C.Tag_To
                                  WHERE P.ID = AValue.ID)
             THEN 'Open'
             WHEN Method = 'A'
                  AND NOT EXISTS (SELECT * FROM Table P INNER JOIN
                                  (
                                      SELECT VALUE AS Tag_To
                                      FROM Table AV
                                      CROSS APPLY STRING_SPLIT(AV.Tag_To, ',')
                                      WHERE AV.Method = 'B'
                                  ) C ON P.Sequence_No = C.Tag_To
                                  WHERE P.ID = AValue.ID)
             THEN 'Open'
             ELSE 'Closed'
        END
FROM Table AValue
WHERE Year = #Year
;WITH CTE AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY A.Method ORDER BY A.Sequence_No ASC) SN,
        A.ID,
        A.Method,
        A.Sequence_No,
        A.Tag_To,
        A.Period_From,
        A.Period_To,
        A.Status
    FROM Table A
    LEFT JOIN
    (
        SELECT VALUE AS Tag_To
        FROM Table AV
        CROSS APPLY STRING_SPLIT(AV.Tag_To, ',')
        WHERE Year = #Year
    ) B ON A.Sequence_No = B.Tag_To
    WHERE Year = #Year
),
CTE2 AS
(
    SELECT DISTINCT SN FROM CTE
    WHERE Status = 'Open'
)
UPDATE Table SET
    Status = 'Open'
FROM Table
INNER JOIN CTE ON Table.ID = CTE.ID
INNER JOIN CTE2 ON CTE.SN = CTE2.SN
Yeah, it's ugly but, hey, it did the job! :)
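If this ever needs to become more dynamic, one direction worth considering is to expand the comma-separated Tag_To column into one row per parent/child edge first, and then test paths against that edge set instead of re-splitting inside every predicate. A sketch, assuming SQL Server 2016+ for STRING_SPLIT and the table/column names used above (the bracketed [Table] name is a placeholder):

-- Hypothetical edge expansion: one row per (child ID, parent SeqNo) pair.
SELECT t.ID   AS Child_ID,
       t.Year,
       TRY_CAST(LTRIM(RTRIM(s.value)) AS int) AS Parent_SeqNo
FROM [Table] AS t
CROSS APPLY STRING_SPLIT(t.Tag_To, ',') AS s
WHERE t.Tag_To IS NOT NULL;

Joining this edge set back to the base table on Parent_SeqNo = Sequence_No (per Year) gives a proper parent/child relation that a recursive CTE could traverse to check each path for the full A-B-C method sequence.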

Calculate Average between columns by comparing two rows in SQL Server

I have the below table
BidID  AppID  AppStatus  StatusTime
-----------------------------------------
1      1      In Review  2019-01-02 12:00:00
1      1      Approved   2019-01-02 13:00:00
1      2      In Review  2019-01-04 13:00:00
1      2      Approved   2019-01-04 14:00:00
2      2      In Review  2019-01-07 15:00:00
2      2      Approved   2019-01-07 17:00:00
3      1      In Review  2019-01-09 13:00:00
4      1      Approved   2019-01-09 13:00:00
What I am trying to do is calculate the average of the StatusTime differences using the following logic:
first group by BidID, then by AppID, and then calculate the time difference between the In Review and Approved StatusTime values.
E.g. first group by BidID, then group by AppID; then check for the In Review status, find the next Approved status, and calculate the difference in minutes between the two dates.
BidID  AppID  AppStatus (time difference)                                           BidAverage
1      1, 2   App 1: (2019-01-02 15:48:42.000 - 2019-01-02 12:33:36.000) = 1 hour   1.5
              App 2: (2019-01-04 10:33:12.000 - 2019-01-04 10:33:12.000) = 2 hours
2      2      App 2: (2019-01-04 10:33:12.000 - 2019-01-04 10:33:12.000) = 1 hour   1
3      1      No calculation since no Approved
4      1      No calculation since no In Review before Approved

Final Average: (1.5 + 1) / 2 = 1.25 for the table
I have already figured out the time difference excluding Saturdays (see Time Difference Excluding Weekend, using David's suggestion).
I am not sure how to check that AppStatus is first In Review and then Approved, and only calculate the time difference in that case; if there is no Approved (as in BidID 3), that pair should not be used in the average calculation. The result then needs to be averaged across the AppIDs and then across the BidIDs.
Thanks
I think you can just use min() and max() for simplicity to get the times for the bid/app pairs. The rest is just aggregation and more aggregation.
The processing you describe seems to be:
select avg(avg_bid_diff)
from (select bid, avg(diff * 1.0) as avg_bid_diff
      from (select bid, appid,
                   -- min = 'In Review' time, max = 'Approved' time (given the assumptions below)
                   datediff(second, min(statustime), max(statustime)) as diff
            from t
            where appstatus in ('In Review', 'Approved')
            group by bid, appid
            having count(*) = 2
           ) ba
      group by bid
     ) b;
This makes assumptions that are consistent with the provided data -- that the statuses don't have duplicates for the bid/app pairs and that approval always comes after review.
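Since the desired BidAverage in the question is expressed in hours, the difference may be worth converting before averaging -- a sketch, keeping the answer's placeholder table name t and assuming the question's column names:

select avg(avg_bid_diff) as final_average
from (select bidid, avg(diff_hours) as avg_bid_diff
      from (select bidid, appid,
                   -- minutes between 'In Review' and 'Approved', as fractional hours
                   datediff(minute, min(statustime), max(statustime)) / 60.0 as diff_hours
            from t
            where appstatus in ('In Review', 'Approved')
            group by bidid, appid
            having count(*) = 2
           ) ba
      group by bidid
     ) b;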

how to group by date and unique each group and count each group with pandas

Count number of unique MAC address each day
pd.concat([df[['date','Client MAC']], df8[['date',"MAC address"]].rename(columns={"MAC address":"Client MAC"})]).groupby(["date"])
One of the columns, with example data:
Association Time
Mon May 14 19:41:20 HKT 2018
Mon May 14 19:43:22 HKT 2018
Tue May 15 09:24:57 HKT 2018
Mon May 14 19:53:33 HKT 2018
I use:
starttime = datetime.datetime.now()
dff4 = (df4[['Association Time', 'Client MAC Address']]
        .groupby(pd.to_datetime(df4["Association Time"]).dt.date
                 .apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d')))
        .nunique())
print datetime.datetime.now() - starttime
It runs for 2 minutes, but it also groups by Association Time, which is wrong;
there is no need to group by Association Time:
                  Association Time  Client MAC Address
Association Time
2017-06-21        1                 3
2018-02-21        2                 8
2018-02-27        1                 1
2018-03-07        3                 3
I believe you need to add ['Client MAC'].nunique():
df = (pd.concat([df[['date','Client MAC']],
df8[['date',"MAC address"]].rename(columns={"MAC address":"Client MAC"})])
.groupby(["date"])['Client MAC']
.nunique())
If dates are datetimes:
df = (pd.concat([df[['date','Client MAC']],
df8[['date',"MAC address"]].rename(columns={"MAC address":"Client MAC"})]))
df = df['Client MAC'].groupby(df["date"].dt.date).nunique()
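If most of the two minutes is spent parsing strings like Mon May 14 19:41:20 HKT 2018, supplying an explicit format instead of letting pandas infer it row by row usually helps -- a sketch in the question's terms (df4 and its column names are taken from the post; stripping the constant 'HKT' token first is an assumption about the data):

import pandas as pd

# Drop the fixed timezone token, then parse with an explicit format (no per-row inference).
ts = pd.to_datetime(
    df4['Association Time'].str.replace(' HKT', '', regex=False),
    format='%a %b %d %H:%M:%S %Y')

# Count unique client MACs per calendar day.
counts = df4.groupby(ts.dt.date)['Client MAC Address'].nunique()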

How to perform multiple table calculation with joins and group by

I have two tables client and grouping. They look like this:
Client
    C_id
    C_grouping_id
    Month
    Profit

Grouping
    Grouping_id
    Month
    Profit
The client table contains monthly profit for every client and every client belongs to a specific grouping scheme specified by C_grouping_id.
The grouping table contains all the groups and their monthly profits.
I'm struggling with a query that essentially calculates the monthly residual for every subscriber:
Residual = (subscriber monthly profit - grouping monthly profit) * (average subscriber monthly profit across all months / average profit across all months for the grouping the subscriber belongs to)
I have come up with the following query so far but the results seem to be incorrect:
SELECT client.C_id, client.C_grouping_Id, client.Month,
       ((client.Profit - grouping.profit) * (avg(client.Profit) / avg(grouping.profit))) as "residual"
FROM client
INNER JOIN grouping
    ON "C_grouping_id" = "Grouping_id"
GROUP BY client.C_id, client.C_grouping_Id, client.Month, grouping.profit
I would appreciate it if someone can shed some light on what I'm doing wrong and how to correct it.
EDIT: Adding sample data and desired results
Client
C_id C_grouping_id Month Profit
001 aaa jul 10$
001 aaa aug 12$
001 aaa sep 8$
016 abc jan 25$
016 abc feb 21$
Grouping
Grouping_id Month Profit
aaa Jul 30$
aaa aug 50$
aaa Sep 15$
abc Jan 21$
abc Feb 27$
Query Result:
C_ID C_grouping_id Month Residual
001 aaa Jul (10-30)*(10/31.3)=-6.38
... and so on for every month for every client.
This can be done in a pretty straightforward way.
The main difficulty is obviously that you are dealing with different levels of aggregation at once (the group average and the client average, as well as the current record).
This is rather difficult and clumsy with plain SELECT ... GROUP BY SQL.
But with analytic functions, a.k.a. window functions, it is very easy.
Start with combining the tables and calculating the base numbers:
select c.c_id          as client_id,
       c.c_grouping_id as grouping_id,
       c.month,
       c.profit        as client_profit,
       g.profit        as group_profit,
       avg(c.profit) over (partition by c.c_id)        as avg_client_profit,
       avg(g.profit) over (partition by g.grouping_id) as avg_group_profit
from client c
inner join grouping g
    on c."C_GROUPING_ID" = g."GROUPING_ID"
   and c."MONTH" = g."MONTH";
With this you already get the average profits by client and by grouping_id.
Be aware that I changed the data type of the currency column to DECIMAL(10,3), as a VARCHAR with a $ sign in it is just hard to convert.
I also fixed the data for MONTHS, as the test data contained different upper/lower case spellings which prevented the join from working.
Finally, I turned all column names into upper case too, in order to make typing easier.
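If the stored type cannot be changed, stripping the currency sign at query time is an alternative -- a sketch (exact REPLACE/CAST behavior can vary slightly between SQL engines, so treat it as an assumption):

-- Convert '10$'-style strings to numbers on the fly.
select cast(replace(Profit, '$', '') as decimal(10,3)) as profit_num
from client;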
Anyhow, running this provides you with the following result set:
CLIENT_ID GROUPING_ID MONTH CLIENT_PROFIT GROUP_PROFIT AVG_CLIENT_PROFIT AVG_GROUP_PROFIT
16 abc JAN 25 21 23 24
16 abc FEB 21 27 23 24
1 aaa JUL 10 30 10 31.666
1 aaa AUG 12 50 10 31.666
1 aaa SEP 8 15 10 31.666
From here it's only one step further to the residual calculation.
You can either put this SQL into a view to make it reusable for other queries, or use it as an inline view.
I chose to use it as a common table expression (CTE), a.k.a. WITH clause, because it's nice and easy to read:
with p as
 (select c.c_id          as client_id,
         c.c_grouping_id as grouping_id,
         c.month,
         c.profit        as client_profit,
         g.profit        as group_profit,
         avg(c.profit) over (partition by c.c_id)        as avg_client_profit,
         avg(g.profit) over (partition by g.grouping_id) as avg_group_profit
  from client c
  inner join grouping g
      on c."C_GROUPING_ID" = g."GROUPING_ID"
     and c."MONTH" = g."MONTH")
select client_id, grouping_id, month,
       client_profit, group_profit,
       avg_client_profit, avg_group_profit,
       round( (client_profit - group_profit)
            * (avg_client_profit / avg_group_profit), 2) as residual
from p
order by grouping_id, month, client_id;
Notice how easy the whole statement is to read, and how straightforward the residual calculation is.
The result is then this:
CLIENT_ID GROUPING_ID MONTH CLIENT_PROFIT GROUP_PROFIT AVG_CLIENT_PROFIT AVG_GROUP_PROFIT RESIDUAL
1 aaa AUG 12 50 10 31.666 -12
1 aaa JUL 10 30 10 31.666 -6.32
1 aaa SEP 8 15 10 31.666 -2.21
16 abc FEB 21 27 23 24 -5.75
16 abc JAN 25 21 23 24 3.83
Cheers,
Lars