BigQuery SQL: determine the number of daily transactions given a moving counter - sql

I've been stuck for hours trying to write a SQL query that solves the following:
Given the history of a daily customer transaction counter, is it possible to determine exactly how many transactions were made each day?
Each data point is the sum of all transactions made in the last 30 days (ignore the missing dates).
The counter decrements if the number of transactions made on the current day is smaller than the number of transactions that drop out of the window because they were made 31 days ago. Otherwise it increments (or stays the same when the two are equal).
The complete history of the counter is unavailable, so we don't know how the numbers evolved from the beginning, only from a certain point in time.
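The counter described above can be modelled as a rolling 30-day sum, which gives a recurrence for the daily counts. The following is a minimal Python sketch rather than BigQuery SQL; the function names and the toy data are made up for illustration, and it assumes one row per calendar day with no gaps and a window that includes the current day. It also shows why the missing early history matters: the recurrence needs the true daily counts of one full window as a starting point.

```python
def rolling_counter(daily, window=30):
    """Counter on day t: sum of the last `window` daily counts, day t included."""
    return [sum(daily[max(0, t - window + 1): t + 1]) for t in range(len(daily))]


def recover_daily(counter, first_window, window=30):
    """Rebuild daily counts from the counter.

    Uses counter[t] - counter[t-1] = daily[t] - daily[t-window], so the true
    daily counts of the first `window` days must be known (hypothetical input).
    """
    daily = list(first_window)
    for t in range(window, len(counter)):
        daily.append(counter[t] - counter[t - 1] + daily[t - window])
    return daily


# Toy data with a 3-day window instead of 30, to keep the example short.
true_daily = [3, 0, 5, 2, 7, 1, 4, 0, 6]
counter = rolling_counter(true_daily, window=3)
recovered = recover_daily(counter, true_daily[:3], window=3)
print(recovered == true_daily)
```

Without that first window, each chain of days spaced 30 apart is only determined up to an unknown starting value, so the counter alone may not pin down a unique non-negative solution.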
Please refer to the following table (for one offer_id):
transaction_date num_transactions
0 21/05/2022 25
1 22/05/2022 26
2 23/05/2022 25
3 24/05/2022 28
4 25/05/2022 30
5 26/05/2022 32
6 27/05/2022 33
7 28/05/2022 34
8 29/05/2022 33
9 30/05/2022 33
10 31/05/2022 34
11 01/06/2022 35
12 02/06/2022 35
13 03/06/2022 59
14 04/06/2022 73
15 07/06/2022 87
16 08/06/2022 98
17 09/06/2022 109
18 10/06/2022 120
19 11/06/2022 123
20 12/06/2022 122
21 13/06/2022 127
22 14/06/2022 142
23 15/06/2022 145
24 16/06/2022 148
25 17/06/2022 156
26 18/06/2022 162
27 19/06/2022 164
28 20/06/2022 167
29 21/06/2022 173
30 22/06/2022 185
31 23/06/2022 194
32 24/06/2022 206
33 25/06/2022 206
34 26/06/2022 208
35 28/06/2022 227
36 29/06/2022 237
37 30/06/2022 241
38 01/07/2022 248
39 02/07/2022 237
40 03/07/2022 230
41 04/07/2022 217
42 05/07/2022 208
43 06/07/2022 214
44 07/07/2022 216
45 08/07/2022 211
46 09/07/2022 203
47 10/07/2022 194
48 11/07/2022 192
49 12/07/2022 195
50 13/07/2022 193
51 14/07/2022 181
52 15/07/2022 174
53 16/07/2022 169
54 17/07/2022 162
55 18/07/2022 162
56 19/07/2022 164
57 20/07/2022 160
58 21/07/2022 163
59 22/07/2022 155
60 23/07/2022 144
61 24/07/2022 134
62 25/07/2022 139
63 26/07/2022 154
For each day (at least starting with 23/06) I'd like to be able to tell what were the numbers of transactions day-by-day in the preceding 30 days that sum up to the transactions counter on that day.
My current code in BigQuery SQL is below. It is obviously wrong - although the calculated counter evolution history does sum up to the right numbers when negative numbers are included, I'm interested in finding out only the actual transaction counts (thus only positive numbers and 0 are in question) for each last 30-days window.
When I add a simple condition that clamps the value to 0 whenever it goes negative:
WHEN IFNULL(transactions_diff_yesterday + transaction_reference, 0) < 0
THEN 0
the sum for the last 30 days never matches the counter.
WITH outer_base AS (
  WITH base AS (
    SELECT
      *,
      -- Counter values 31, 30 and 1 rows earlier (LAG counts rows, not calendar days)
      LAG(num_transactions, 31) OVER (PARTITION BY offer_id ORDER BY offer_id, transaction_date) AS transactions_31_days_ago,
      IFNULL(LAG(num_transactions, 30) OVER (PARTITION BY offer_id ORDER BY offer_id, transaction_date), 0) AS transactions_30_days_ago,
      IFNULL(LAG(num_transactions, 1) OVER (PARTITION BY offer_id ORDER BY offer_id, transaction_date), 0) AS transactions_yesterday
    FROM
      `my_table`
    ORDER BY
      offer_id,
      transaction_date
  )
  SELECT
    *,
    -- Day-over-day change of the counter, and the same change 30 rows earlier
    IFNULL(num_transactions - transactions_yesterday, 0) AS transactions_diff_yesterday,
    IFNULL(transactions_30_days_ago - transactions_31_days_ago, 0) AS transaction_reference
  FROM
    base
)
SELECT
  *,
  -- Clamp negative results to 0
  CASE
    WHEN IFNULL(transactions_diff_yesterday + transaction_reference, 0) < 0
      THEN 0
    ELSE IFNULL(transactions_diff_yesterday + transaction_reference, 0)
  END AS real_transactions
FROM
  outer_base;

Related

How can I get an aggregate sum of the average number of products between two weeks, output for each week as shown below, in pandas?

StartWeek  End Week  Numberof Week  Number of Product  Avg number of product per week
39         41        3              99                 33
40         45        5              150                30
40         42        3              60                 20
39         40        2              40                 20
39         41        3              99                 33
So that the output looks like this:
Week  Sum Average Product per week
39    86
40    70
41    66
42    20
45    30
First, for each row we create the list of weeks it applies to and put it in a new column 'weeks' (this needs import numpy as np):
df['weeks'] = df.apply(lambda r: np.arange(r['StartWeek'], r['EndWeek'] + 1), axis=1)
df now looks like this:
StartWeek EndWeek NumberofWeek NumberofProduct Av weeks
-- ----------- --------- -------------- ----------------- ---- -------------------
0 39 41 3 99 33 [39 40 41]
1 40 45 5 150 30 [40 41 42 43 44 45]
2 40 42 3 60 20 [40 41 42]
3 39 40 2 40 20 [39 40]
4 39 41 3 99 33 [39 40 41]
Then we explode weeks which duplicates each row for each week it is applied to, and then aggregate by the exploded week and sum:
df.explode('weeks').groupby('weeks', as_index = False)['Av'].sum()
output:
weeks Av
-- ------- ----
0 39 86
1 40 136
2 41 116
3 42 50
4 43 30
5 44 30
6 45 30
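The explode-then-sum logic can also be traced by hand. The following is a small sketch in plain Python (no pandas) that mirrors what explode and groupby do above; the tuple layout (start week, end week, average) and the variable names are chosen for illustration only.

```python
from collections import defaultdict

# (StartWeek, EndWeek, Av) for the five rows of the example table
rows = [
    (39, 41, 33),
    (40, 45, 30),
    (40, 42, 20),
    (39, 40, 20),
    (39, 41, 33),
]

totals = defaultdict(int)
for start_week, end_week, avg in rows:
    # "Explode": the row contributes its average to every week it covers
    for week in range(start_week, end_week + 1):
        totals[week] += avg

result = dict(sorted(totals.items()))
print(result)
# {39: 86, 40: 136, 41: 116, 42: 50, 43: 30, 44: 30, 45: 30}
```

The totals agree with the pandas output shown above.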
You can use the groupby method in pandas. Note that this sums by StartWeek only and does not spread each row across the weeks it covers:
df = df.groupby(["StartWeek"])["Avg number of product per week"].sum()

SQL Server : create new column category price according to price column

I have a SQL Server table with a column price looking like this:
10
96
64
38
32
103
74
32
67
103
55
28
30
110
79
91
16
71
36
106
89
87
59
41
56
89
68
32
80
47
45
77
64
93
17
88
13
19
83
12
76
99
104
65
83
95
Now my aim is to create a new column giving a category from 1 to 10 to each of those values.
For instance, the max value in my column is 110 and the min is 10, so max - min = 100. If I want 10 categories, each one spans 100 / 10 = 10. That gives the following ranges:
10-20 1
21-30 2
31-40 3
41-50 4
51-60 5
61-70 6
71-80 7
81-90 8
91-100 9
101-110 10
Desired output:
my new column called cat should look like this:
price cat
-----------------
10 1
96 9
64 6
38 3
32 3
103 10
74 7
32 3
67 6
103 10
55 5
28 2
30 3
110 10
79 7
91 9
16 1
71 7
36 3
106 10
89 8
87 8
59 5
41 4
56 5
89 8
68 6
32 3
80 7
47 4
45 4
77 7
64 6
93 9
17 1
88 8
13 1
19 1
83 8
12 1
76 7
99 9
104 10
65 6
83 8
95 9
Is there a way to do this with T-SQL? Sorry if this question is too easy. I searched the web for a long time, so either the problem is not as simple as I imagine or I entered the wrong keywords.
Yes, almost exactly as you describe the calculation:
select price,
       1 + (price - min_price) * 10 / (max_price - min_price + 1) as decile
from (select price,
             min(price) over () as min_price,
             max(price) over () as max_price
      from t
     ) t;
The 1 + is because you want the values from 1 to 10, rather than 0 to 9.
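To see how the formula maps prices to categories, here is a small Python sketch of the same arithmetic. The function name price_category is made up for illustration, and it assumes the price column is an integer type so that the SQL division truncates; Python's // gives the same result for these non-negative values.

```python
def price_category(price, min_price, max_price, buckets=10):
    """Same arithmetic as the query: 1 + (price - min) * 10 / (max - min + 1)."""
    return 1 + (price - min_price) * buckets // (max_price - min_price + 1)


# Boundaries from the question, with min = 10 and max = 110
for price in (10, 20, 21, 30, 31, 100, 101, 110):
    print(price, price_category(price, 10, 110))
```

With these inputs, 10 and 20 fall in category 1, 21 and 30 in category 2, 31 in category 3, 100 in category 9, and 101 and 110 in category 10, which matches the ranges listed in the question.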
Yes - a case statement can do that.
select
    price
    ,case
        when price between 10 and 20 then 1
        when price between 21 and 30 then 2
        when price between 31 and 40 then 3
        when price between 41 and 50 then 4
        when price between 51 and 60 then 5
        when price between 61 and 70 then 6
        when price between 71 and 80 then 7
        when price between 81 and 90 then 8
        when price between 91 and 100 then 9
        when price between 101 and 110 then 10
        else null
    end as cat
from [<enter your table name here>]

SQL I need the highest number from column + count duplicate values

I'm looking for a query that lists the total RepairCost for each BikeNumber,
where the costs of rows with the same BikeNumber are added together. So BikeNumber 18 costs a total of 22 + 58 = 80.
Id  RepairCost  BikeNumber
16  82          23
88  51          20
12  20          19
33  22          18
40  58          18
69  41          17
10  2           16
66  35          15
If I understand the question correctly, the query is pretty simple:
SELECT BikeNumber, SUM(RepairCost) AS TotalRepairCost
FROM YourTable
GROUP BY BikeNumber
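As a quick check of what GROUP BY with SUM produces on the sample rows, here is a short Python sketch of the same aggregation. The list of tuples stands in for the table and the variable names are illustrative.

```python
from collections import defaultdict

# (Id, RepairCost, BikeNumber) rows from the question
repairs = [
    (16, 82, 23), (88, 51, 20), (12, 20, 19), (33, 22, 18),
    (40, 58, 18), (69, 41, 17), (10, 2, 16), (66, 35, 15),
]

total_by_bike = defaultdict(int)
for _id, repair_cost, bike_number in repairs:
    # One running total per BikeNumber, like GROUP BY BikeNumber
    total_by_bike[bike_number] += repair_cost

print(dict(total_by_bike))
```

BikeNumber 18 ends up with 22 + 58 = 80, and every other bike keeps its single cost.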

Strange results with VAR and STDEV

This query
SELECT
    AVG(s.Amount/100) [Avg],
    STDEV(s.Amount/100) [StDev],
    VAR(s.Amount/100) [Var]
returns this:
Avg StDev Var
133 550.82021581146 303402.910146583
Statistics aren't my strongest suit, but how is it possible that the standard deviation and the variance are larger than the average? Not only that, but the variance is about 65x larger than the largest sample in the set.
Here is the entire sample set, with the above replaced with
SELECT s.Amount/100
while the rest of the query is identical
Amount
4645
3182
422
377
359
298
278
242
230
213
182
180
174
166
150
130
116
113
109
107
102
96
84
78
78
76
66
64
61
60
60
60
59
59
56
49
46
41
41
39
38
36
29
27
26
25
25
25
24
24
24
22
22
22
20
20
19
19
19
19
19
18
17
17
17
16
14
13
12
12
12
11
11
10
10
10
10
9
9
9
8
8
8
7
7
6
6
6
3
3
3
3
2
2
2
2
2
1
1
1
1
1
1
You need to read a book on statistics, or at least start with the Wikipedia pages that describe the concepts.
The standard deviation and the variance are closely related. The variance is the square of the standard deviation (up to floating-point rounding). You can check that this is true of your numbers: 550.82 squared is about 303,403.
There is not really a relationship between the standard deviation and the average. The standard deviation measures the dispersion of the data around the average, and the data can be arbitrarily dispersed around an average.
You might be confused because there are estimates on standard deviation/standard error when you assume a particular distribution of the data. However, those estimates are about the distribution and not about the data.
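Both points can be checked numerically. The following Python sketch uses the standard statistics module, whose stdev and variance are sample statistics like SQL Server's STDEV and VAR; the five-value sample is invented for illustration and is not the data from the question.

```python
import statistics

# A skewed toy sample: one large value dominates, as in the question's data
sample = [1, 1, 1, 1, 100]

mean = statistics.mean(sample)
stdev = statistics.stdev(sample)      # sample standard deviation
var = statistics.variance(sample)     # sample variance

print(stdev > mean)                   # the spread can exceed the average
print(abs(stdev ** 2 - var) < 1e-9)   # variance is the square of the stdev

# The figures reported in the question obey the same relation
reported_stdev = 550.82021581146
reported_var = 303402.910146583
print(abs(reported_stdev ** 2 - reported_var) < 1e-3)
```

All three checks print True: a few large outliers are enough to push the standard deviation above the mean.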

Using SQL SMS How do I return a list of numbers using low and high columns

I have a table SUB_Inst with columns id, low and high. How would I query the low and high numbers returning a new column with a record for each number from low to high?
Current table SUB_Inst
id low High
1 55 63
2 232 234
3 4 7
etc.
Desired Results
id low High Num_list
1 55 63 55
1 55 63 56
1 55 63 57
1 55 63 58
1 55 63 59
1 55 63 60
1 55 63 61
1 55 63 62
1 55 63 63
2 232 234 232
2 232 234 233
2 232 234 234
3 4 7 4
3 4 7 5
3 4 7 6
3 4 7 7
etc.
I tried something like this:
SELECT Low, HIGH,
(SELECT CAST(number as varchar)+','
FROM NUMBERS
WHERE number >= Low and number <= High
FOR XML PATH(''))
FROM SUB_Inst
but it returned all the numbers in one field like this which won't work:
Low High Num_List
24 27 24,25,26,27,
34 36 34,35,36,
10 17 10,11,12,13,14,15,16,17,
34 36 34,35,36,
65 67 65,66,67,
502 504 502,503,504,
56 59 56,57,58,59,
Thank you.
I think you want this:
SELECT s.id, s.low, s.high, n.number AS Num_List
FROM SUB_Inst s
JOIN NUMBERS n
  ON n.number BETWEEN s.low AND s.high
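The join against the numbers table produces one output row per number in each range. Here is a minimal Python sketch of that expansion, where the range called numbers is a stand-in for the NUMBERS table (its upper bound of 1000 is an assumption) and the sample rows are the ones from the question.

```python
# (id, low, high) rows from SUB_Inst
sub_inst = [(1, 55, 63), (2, 232, 234), (3, 4, 7)]

# Stand-in for the NUMBERS table; the 0..999 extent is assumed
numbers = range(0, 1000)

# One row per (SUB_Inst row, number) pair where low <= number <= high
expanded = [
    (id_, low, high, n)
    for id_, low, high in sub_inst
    for n in numbers
    if low <= n <= high
]

for row in expanded:
    print(row)
```

This yields 9 rows for id 1, 3 rows for id 2 and 4 rows for id 3, as in the desired results.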