How can I match a sequence of numbers with specific columns based on their names in BigQuery, with the SQL query result grouped by this sequence? - sql

I am working with two tables in BigQuery and I need to achieve the following result:
ret_d  ad_arpu  pur_arpu
d1     x        x
d2     y        y
dx     etc      etc
A bit of clarification: ARPU is average revenue per user, which in this case consists of revenue from purchases (pur_arpu) and from ads (ad_arpu).
I need to return pur_arpu and ad_arpu next to the corresponding day the customer has been retained, i.e. retention day #x (ret_dx). ARPU is revenue divided by the number of customers over a certain period of time. I need ARPU per user per retained day.
In the table I am working with, I have the following relevant columns:
C_id  ret_d1  ret_d2  ret_dx  ad_rev_d1  ad_rev_d2  ad_rev_dx  pur_rev_d1  pur_rev_d2  pur_rev_dx
id1   0 OR 1  0 OR 1  0 OR 1  x          x          x          x           x           x
id2   0 OR 1  0 OR 1  0 OR 1  y          y          y          y           y           y
idx   0 OR 1  0 OR 1  0 OR 1  etc        etc        etc        etc         etc         etc
c_id - customer id,
ret_dx - retained day #x, 1 stands for user retaining on that day,
ad_rev_dx - revenue from ads on dx,
pur_rev_dx - revenue from purchases on dx.
My idea for the SQL query was to return a sequence of numbers, ad_arpu, and pur_arpu from a subquery. So something along these lines:
SELECT
  ["d1", "d2", "dx"] AS ret_d,
  ad_arpu AS ad_arpu,
  pur_arpu AS pur_arpu
FROM
  (SELECT
    SUM(ad_rev_d1)/SUM(ret_d1) AS ad_arpu_d1,
    SUM(ad_rev_d2)/SUM(ret_d2) AS ad_arpu_d2,
    SUM(ad_rev_dx)/SUM(ret_dx) AS ad_arpu_dx,
    SUM(pur_rev_d1)/SUM(ret_d1) AS pur_arpu_d1,
    SUM(pur_rev_d2)/SUM(ret_d2) AS pur_arpu_d2,
    SUM(pur_rev_dx)/SUM(ret_dx) AS pur_arpu_dx
    --somehow match ad_arpu_dx and pur_arpu_dx to the corresponding days from the ret_d sequence to get ad_arpu and pur_arpu
  FROM tablex)
But so far my efforts have gotten me nowhere :( I am missing how to match the arpu numbers to the day sequence numbers.
I really appreciate any tips that could help with this!
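What the question describes is essentially an unpivot: turn each per-day column triplet (ret_dx, ad_rev_dx, pur_rev_dx) into one output row keyed by the day label. In BigQuery itself that shape can be produced with a UNION ALL of one SELECT per day, or with the UNPIVOT operator. As a sketch of the arithmetic only, here is the same reshaping in pandas, with made-up data and the column names from the question:

```python
import pandas as pd

# Toy version of the wide table described in the question (data invented)
df = pd.DataFrame({
    'c_id':       ['id1', 'id2', 'id3'],
    'ret_d1':     [1, 1, 0],
    'ret_d2':     [1, 0, 1],
    'ad_rev_d1':  [2.0, 4.0, 0.0],
    'ad_rev_d2':  [1.0, 0.0, 3.0],
    'pur_rev_d1': [5.0, 1.0, 0.0],
    'pur_rev_d2': [0.0, 0.0, 2.0],
})

rows = []
for day in ['d1', 'd2']:  # one output row per retained day
    retained = df[f'ret_{day}'].sum()  # number of customers retained that day
    rows.append({
        'ret_d':    day,
        'ad_arpu':  df[f'ad_rev_{day}'].sum() / retained,
        'pur_arpu': df[f'pur_rev_{day}'].sum() / retained,
    })
arpu = pd.DataFrame(rows)
```

The point is that each day's SUM(rev)/SUM(ret) pair becomes a row, rather than a column, so the day label and the two ARPU values stay matched by construction.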

Related

Pandas matching algorithm with itself

I'm trying to create a matching algo in pandas that does the following with a given table:
A table contains purchases and sales of products by date, item, quantity (+ for purchases and - for sales) and price.
Conditions:
Create an algorithm that matches purchases and sales per item and computes the corresponding average profit for each item in total.
Matches can only be on the same date, otherwise they are not matched at all.
Remaining positive or negative inventories per day are ignored
Negative inventories are allowed.
Example with a single product:
date product quantity price
1 X +2 1
1 X -1 2
1 X -2 4
2 X +1 1
2 X +1 2
3 X -1 4
Answer:
The result would be that only on day 1 the 3 trades are matched, with a profit of -2+2+4=4, because the matched quantities are +2, -1, and then again -1. The remaining inventory of -1 is ignored. Days 2 and 3 have no matches because the trades are not closed on the same day.
Correct output:
product Profit
X +4
Is there any elegant way to get to this result without having to loop over the table multiple times with iterrows()?
For reproducing the df:
df = pd.DataFrame({'date':[1,1,1,2,2,3],'product': ['X']*6,'quantity':[2,-1,-2,1,1,-1],'price':[1,2,4,1,2,4]})
The process that you are describing could use groupby & aggregate, something like this:
df.groupby('date').sum()
But I don't fully understand your rules for matching. So in Day 1, I got a different total profit. Price * quantity is (+2*1)+(-1*2)+(-2*4)=-8, so profit seems to be 8.
Using iterrows() is rather bad practice. Not only are you writing excessive code, it's also likely much slower (check a comparison here).
Most of these types of jobs can be accomplished by combining groupby(), aggregate() and apply(). Check out this great tutorial.
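A minimal sketch of that groupby approach, using the price-times-quantity cash-flow interpretation from the comment above (so day 1 comes out as 8, not 4):

```python
import pandas as pd

df = pd.DataFrame({'date': [1, 1, 1, 2, 2, 3],
                   'product': ['X'] * 6,
                   'quantity': [2, -1, -2, 1, 1, -1],
                   'price': [1, 2, 4, 1, 2, 4]})

# Cash flow per trade: purchases (+qty) cost money, sales (-qty) bring it in
df['cash'] = -df['quantity'] * df['price']

# Total cash flow per product and date, without any row-by-row looping
per_day = df.groupby(['product', 'date'])['cash'].sum()
```

If the asker's same-day matching rule (ignore unmatched inventory) is the one that should apply, the per-group lambda would need to cap the matched quantity first; the groupby skeleton stays the same.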
I hope this helps you or future answers :)

SPSS: search for repeated occurrences at previous time points

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |RcvdDrugXTimeA
------|------|------|------|------|--------|---------|--------|--
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 0
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0
Each case has many records of receiving drugs at different dates. My endpoint is receiving Drug X at Time A, where Time is calculated based on the difference between DateA and DateB for each receipt of a drug.
If a case receives Drug X at Time A, then I want to know if they also received Drug X at a previous date (to determine if them receiving Drug X at Time A is a new occurrence or a repeat occurrence).
For each case that received Drug X at least once during Time A, I have variable RcvdDrugXTimeA = 1.
If they did receive Drug X before Time A, then I want RcvdPriorDrugX = 2. If they did not, then I want RcvdPriorDrugX = 1. If they never received Drug X, then ideally RcvdPriorDrugX = 0.
In the example above, Case 1 received Drug X but not within Time A, whereas Cases 2 and 3 both received Drug X during Time A, thus RcvdDrugXTimeA = 1 for these cases. Say that this occurred for both of them at DateB.3. The difference between these cases is that Case 2 also received Drug X previously, making the receipt of Drug X at Time A a repeat occurrence. On the other hand, Case 3 did not receive Drug X previously, making the receipt of Drug X at Time A a new occurrence for this case.
I would like to create a function that looks back across the records for those cases where the criterion of receiving Drug X during Time A is met, and produces an output telling me whether this has happened previously or is a new occurrence. I would like to exclude from the output variable those cases that never received Drug X at Time A.
Not completely sure I understand what you're trying to do, so the following syntax covers the example you gave and produces the output you described. If the example (and my syntax) don't cover other possibilities or considerations you need to include in the process, please let me know.
First, I recreate your example dataset:
data list list/Case#(f1) DateA Drug.1 Drug.2 Drug.3 DateB.1 DateB.2 DateB.3 (7a10) RcvdDrugXTimeA(f1).
begin data
1 "DateA1" "X" "Y" "X" "DateB1.1" "DateB1.2" "DateB1.3" 0
2 "DateA2" "X" "Y" "X" "DateB2.1" "DateB2.2" "DateB2.3" 1
3 "DateA3" "Y" "Z" "X" "DateB3.1" "DateB3.2" "DateB3.3" 1
4 "DateA4" "Z" "Z" "Z" "DateB4.1" "DateB4.2" "DateB4.3" 0
end data.
Now I count the number of times Drug X was administered. If RcvdDrugXTimeA is 0, the output is 0; if it is 1, the output is 1 when X was administered only once, or 2 when it was administered more than once:
count NofX=Drug.1 Drug.2 Drug.3("X").
compute RcvdPriorDrugX=RcvdDrugXTimeA + (NofX > 1).
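For comparison, here is the same count-and-flag logic sketched in pandas (the date columns are omitted, since only the drug columns and RcvdDrugXTimeA enter the computation):

```python
import pandas as pd

# Recreate the example dataset from the question (dates omitted)
df = pd.DataFrame({
    'Case': [1, 2, 3, 4],
    'Drug.1': ['X', 'X', 'Y', 'Z'],
    'Drug.2': ['Y', 'Y', 'Z', 'Z'],
    'Drug.3': ['X', 'X', 'X', 'Z'],
    'RcvdDrugXTimeA': [0, 1, 1, 0],
})

# Equivalent of the SPSS COUNT: how often Drug X appears across the columns
n_of_x = (df[['Drug.1', 'Drug.2', 'Drug.3']] == 'X').sum(axis=1)

# Equivalent of the SPSS COMPUTE: add 1 when X was given more than once
df['RcvdPriorDrugX'] = df['RcvdDrugXTimeA'] + (n_of_x > 1).astype(int)
```

Note this reproduces the SPSS arithmetic exactly, including Case 1 (received X, but not in Time A), which comes out as 1.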

Calculating Weekly Returns from Daily Time Series of Prices

I want to calculate weekly returns of a mutual fund from a time series of daily prices. My data looks like this:
A B C D E
DATE WEEK W.DAY MF.PRICE WEEKLY RETURN
02/01/12 1 1 2,7587
03/01/12 1 2 2,7667
04/01/12 1 3 2,7892
05/01/12 1 4 2,7666
06/01/12 1 5 2,7391 -0,007
09/01/12 2 1 2,7288
10/01/12 2 2 2,6707
11/01/12 2 3 2,7044
12/01/12 2 4 2,7183
13/01/12 2 5 2,7619 0,012
16/01/12 3 1 2,7470
17/01/12 3 2 2,7878
18/01/12 3 3 2,8156
19/01/12 3 4 2,8310
20/01/12 3 5 2,8760 0,047
The date is (dd/mm/yy) format and "," is decimal separator. This would be done by using this formula: (Price for last weekday - Price for first weekday)/(Price for first weekday). For example the return for the first week is (2,7391 - 2,7587)/2,7587 = -0,007 and for the second is (2,7619 - 2,7288)/2,7288 = 0,012.
The problem is that the list goes on for a year, and some weeks have less than five working days due to holidays or other reasons. So I can't simply copy and paste the formula above. I added the extra two columns for week number and week day using WEEKNUM and WEEKDAY functions, thought it might help. I want to automate this with a formula or using VBA and hoping to get a table like this:
WEEK RETURN
1 -0,007
2 0,012
3 0,047
.
.
.
As I said some weeks have less than five weekdays, some start with weekday 2 or end with weekday 3 etc. due to holidays or other reasons. So I'm thinking of a way to tell excel to "find the prices that correspond to the max and min weekday of each week and apply the formula (Price for last weekday - Price for first weekday)/(Price for first weekday)".
Sorry for the long post, I tried to be as clear as possible; I would appreciate any help! (I have 5 separate worksheets for consecutive years, each with daily prices of 20 mutual funds.)
To do it in one formula:
=(INDEX(D:D,AGGREGATE(15,6,ROW($D$2:$D$16)/(($C$2:$C$16=AGGREGATE(14,6,$C$2:$C$16/($B$2:$B$16=G2),1))*($B$2:$B$16=G2)),1))-INDEX(D:D,MATCH(G2,B:B,0)))/INDEX(D:D,MATCH(G2,B:B,0))
You may need to change all the , to ; per your local settings.
I would solve it using some lookup formulas to get the values for each week and then do a simple calculation for each week.
Resulting table:
H I J K L M
first last first val last val return
1 02.01.2012 06.01.2012 2,7587 2,7391 -0,007
2 09.01.2012 13.01.2012 2,7288 2,7619 0,012
3 16.01.2012 20.01.2012 2,747 2,876 0,047
Formula in column I:
=MINIFS($A:$A;$B:$B;$H2)
Formula in column J:
=MAXIFS($A:$A;$B:$B;$H2)
Formula in column K:
=VLOOKUP($I2;$A:$D;4;FALSE)
Formula in column L:
=VLOOKUP($J2;$A:$D;4;FALSE)
Formula in column M:
=(L2-K2)/K2
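If moving the data out of Excel is an option, the same "first vs. last price per week" calculation is a short pandas groupby. This sketch uses the three sample weeks from the question (decimal commas converted to points); it works regardless of how many trading days each week has:

```python
import pandas as pd

# Daily prices from the question's sample data
df = pd.DataFrame({
    'week':  [1]*5 + [2]*5 + [3]*5,
    'price': [2.7587, 2.7667, 2.7892, 2.7666, 2.7391,
              2.7288, 2.6707, 2.7044, 2.7183, 2.7619,
              2.7470, 2.7878, 2.8156, 2.8310, 2.8760],
})

# (last price - first price) / first price, per week; rows must be in
# date order so iloc[0] and iloc[-1] are the first and last trading days
weekly = df.groupby('week')['price'].agg(
    lambda p: (p.iloc[-1] - p.iloc[0]) / p.iloc[0])
```

Because the group boundaries come from the WEEKNUM column rather than from row positions, holiday-shortened weeks need no special handling.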

SQL query to find unique values

I need to write a query that truncates based on a selection and outputs another table. The selection criteria are as follows: for each common ID, loop through the AGREE column to find a Y; if no Y, output 0; if a single Y, output that year; if multiple Ys, output the most recent year.
Input table:
ID AGREE YEAR
1 N 2003
2 Y 2005
2 N 2015
3 N 2005
3 N 2007
3 Y 2011
3 Y 1999
4 N 2005
4 N 2010
Output table:
ID AGREE YEAR
1 N 0
2 Y 2005
3 Y 2011
4 N 0
Here is my solution:
Select id, max(agree), max(case when agree = 'Y' then year else 0 end)
from [input table]
group by id
It rests on grouping by the id field and using max statements to return a "Y" if it is present for the group, and then return the largest number for year when agree is "Y". Note that you say "most recent" - if this table contains years in the future it would not return the most recent but instead the furthest into the future.
Note: There is an alternate way of doing this that is often faster that involves using sub-queries. If you run into performance issues it would be worth pursuing.
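The same grouping logic, sketched in pandas for anyone reproducing this outside SQL (data from the input table above):

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 2, 2, 3, 3, 3, 3, 4, 4],
    'agree': ['N', 'Y', 'N', 'N', 'N', 'Y', 'Y', 'N', 'N'],
    'year':  [2003, 2005, 2015, 2005, 2007, 2011, 1999, 2005, 2010],
})

# The CASE expression: a year only counts when agree == 'Y', otherwise 0
df['agreed_year'] = df['year'].where(df['agree'] == 'Y', 0)

# GROUP BY id with MAX on both columns ('Y' > 'N' lexicographically)
out = df.groupby('id').agg(agree=('agree', 'max'),
                           year=('agreed_year', 'max')).reset_index()
```

As in the SQL version, this relies on 'Y' sorting after 'N' and on 0 losing to any real year under MAX.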

How do I remove contiguous sequences of almost identical records from database

I have a SQL Server database containing real-time stock quotes.
There is a Quotes table containing what you would expect-- a sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.
The sequence number corresponds to a message that was received containing data for a set of ticker symbols being tracked. A new message (with a new, incrementing sequence number) is received whenever anything changes for any of the symbols being tracked. The message contains data for all symbols (even for those where nothing changed).
When the data was put into the database, a record was inserted for every symbol in each message, even for symbols where nothing changed since the prior message. So a lot of records contain redundant information (only the sequence number changed) and I want to remove these redundant records.
This is not the same as removing all but one record from the entire database for a combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for sequence number) into a single record. When finished, there may be duplicate records but with differing records between them.
My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.
In the following sample data I simplify things by showing only Sequence, Symbol, and Price. The compound primary key would be Sequence+Symbol (each symbol appears only once in a message). I want to remove records where Price is the same as the prior record (for a given ticker symbol). For ticker X it means I want to remove the range [1, 6], and for ticker Y I want to remove the ranges [1, 2], [4, 5] and [7, 7]:
Before:
Sequence Symbol Price
0 X $10
0 Y $ 5
1 X $10
1 Y $ 5
2 X $10
2 Y $ 5
3 X $10
3 Y $ 6
4 X $10
4 Y $ 6
5 X $10
5 Y $ 6
6 X $10
6 Y $ 5
7 X $11
7 Y $ 5
After:
Sequence Symbol Price
0 X $10
0 Y $ 5
3 Y $ 6
6 Y $ 5
7 X $11
Note that (Y, $5) appears twice but with (Y, $6) between.
The following generates the ranges I need. The left outer join ensures I select the first group of records (where there is no earlier record that is different), and the BETWEEN is intended to reduce the number of records that need to be searched to find the next-earlier different record (the results are the same without the BETWEEN, but slower). I would need only to add something like "DELETE FROM Quotes WHERE Sequence BETWEEN StartOfRange AND EndOfRange".
SELECT
GroupsOfIdenticalRecords.Symbol,
MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
(
SELECT
Q1.Symbol,
Q1.Sequence,
MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
FROM
Quotes AS Q1
LEFT OUTER JOIN
Quotes AS Q2
ON
Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
AND Q2.Symbol=Q1.Symbol
AND Q2.Price<>Q1.Price
GROUP BY
Q1.Sequence,
Q1.Symbol
) AS GroupsOfIdenticalRecords
GROUP BY
GroupsOfIdenticalRecords.Symbol,
GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord
The problem is that this is way too slow and runs out of memory (crashing SSMS, remarkably) for the 2+ million records in the database. Even if I change "-100" to "-2" it is still slow and runs out of memory. I expected the "ON" clause of the LEFT OUTER JOIN to limit the processing and memory usage (2 million iterations, processing about 100 records each, which should be tractable), but it seems like SQL Server may first be generating all combinations of the 2 instances of the table, Q1 and Q2 (about 4e12 combinations), before selecting based on the criteria specified in the ON clause.
If I run the query on a smaller subset of the data (for example, by using "(SELECT TOP 100000 FROM Quotes) AS Q1", and similar for Q2), it completes in a reasonable amount time. I was trying to figure out how to automatically run this 20 or so times using "WHERE Sequence BETWEEN 0 AND 99999", then "...BETWEEN 100000 AND 199999", etc. (actually I would use overlapping ranges such as [0,99999], [99900, 199999], etc. to remove ranges that span boundaries).
The following generates sets of ranges to split the data into 100000 record blocks ([0,99999], [100000, 199999], etc). But how do I apply the above query repeatedly (once for each range)? I keep getting stuck because you can't group these using "BETWEEN" without applying an aggregate function. So instead of selecting blocks of records, I only know how to get MIN(), MAX(), etc. (single values), which does not work with the above query (as Q1 and Q2). Is there a way to do this? Is there a totally different (and better) approach to the problem?
SELECT
CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
Quotes
GROUP BY
CONVERT(INTEGER, Sequence / 100000)*100000
You can do this with a nice little trick. The groups that you want can be defined as the difference between two sequences of numbers. One is assigned for each symbol in order by sequence. The other is assigned for each symbol and price. This is what it looks like for your data:
Sequence Symbol Price seq1 seq2 diff
0 X $10 1 1 0
0 Y $ 5 1 1 0
1 X $10 2 2 0
1 Y $ 5 2 2 0
2 X $10 3 3 0
2 Y $ 5 3 3 0
3 X $10 4 4 0
3 Y $ 6 4 1 3
4 X $10 5 5 0
4 Y $ 6 5 2 3
5 X $10 6 6 0
5 Y $ 6 6 3 3
6 X $10 7 7 0
6 Y $ 5 7 4 3
7 X $11 8 1 7
7 Y $ 5 8 5 3
You can stare at this and figure out that the combination of symbol, diff, and price define each group.
The following puts this into a SQL query to return the data you want:
select min(q.sequence) as sequence, symbol, price
from (select q.*,
(row_number() over (partition by symbol order by sequence) -
row_number() over (partition by symbol, price order by sequence)
) as grp
from quotes q
) q
group by symbol, grp, price;
If you want to replace the data in the original table, I would suggest that you store the results of the query in a temporary table, truncate the original table, and then re-insert the values from the temporary table.
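For anyone who wants to sanity-check the difference-of-row-numbers trick outside SQL Server, the same idea can be sketched in pandas with the sample data from the question:

```python
import pandas as pd

# The "Before" table from the question
quotes = pd.DataFrame({
    'sequence': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7],
    'symbol':   ['X', 'Y'] * 8,
    'price':    [10, 5, 10, 5, 10, 5, 10, 6, 10, 6, 10, 6, 10, 5, 11, 5],
}).sort_values(['symbol', 'sequence'])

# seq1: row number per symbol; seq2: row number per symbol+price.
# Their difference is constant within a contiguous block of equal prices.
seq1 = quotes.groupby('symbol').cumcount()
seq2 = quotes.groupby(['symbol', 'price']).cumcount()
quotes['grp'] = seq1 - seq2

# Keep the first record of each contiguous block
result = (quotes.groupby(['symbol', 'price', 'grp'], as_index=False)
                ['sequence'].min()
                .sort_values(['sequence', 'symbol'])
                [['sequence', 'symbol', 'price']])
```

Running this reproduces the "After" table above, including the two separate (Y, $5) blocks.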
Answering my own question. I want to add some additional comments to complement the excellent answer by Gordon Linoff.
You're right. It is a nice little trick. I had to stare at it for a while to understand how it works. Here are my thoughts, for the benefit of others.
The numbering by Sequence/Symbol (seq1) always increases, whereas the numbering by Symbol/Price (seq2) only increases sometimes (within each group, only when a record for Symbol contains the group's Price). Therefore seq1 either remains in lock step with seq2 (i.e., diff remains constant, until either Symbol or Price changes), or seq1 "runs away" from seq2 (while it is busy "counting" other Prices and other Symbols-- which increases the difference between seq1 and seq2 for a given Symbol and Price). Once seq2 falls behind, it can never "catch up" to seq1, so a given value of diff is never seen again once diff moves to the next larger value (for a given Price). By taking the minimum value within each Symbol/Price group, you get the first record in each contiguous block, which is exactly what I needed.
I don't use SQL a lot, so I wasn't familiar with the OVER clause. I just took it on faith that the first clause generates seq1 and the second generates seq2. I can kind of see how it works, but that's not the interesting part.
My data contained more than just Price. It was a simple thing to add the other fields (Bid, Ask, etc.) to the second OVER clause and the final GROUP BY:
row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)
group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time
Also, I was able to use > MIN(...) and <= MAX(...) to define the ranges of records to delete.