SQL query to find unique values

I need to write a query that reduces a table based on a selection and outputs another table. The selection criteria are as follows: for each common ID, look through the AGREE column to find a Y; if there is no Y, output 0; if there is a single Y, output that year; if there are multiple Y, output the most recent year.
Input table:
ID AGREE YEAR
1 N 2003
2 Y 2005
2 N 2015
3 N 2005
3 N 2007
3 Y 2011
3 Y 1999
4 N 2005
4 N 2010
Output table:
ID AGREE YEAR
1 N 0
2 Y 2005
3 Y 2011
4 N 0

Here is my solution:
Select id, max(agree), max(case when agree = 'Y' then year else 0 end)
from [input table]
group by id
It rests on grouping by the id field and using max statements to return a "Y" if it is present for the group, and then return the largest number for year when agree is "Y". Note that you say "most recent" - if this table contains years in the future it would not return the most recent but instead the furthest into the future.
Note: An alternative approach using subqueries is often faster. If you run into performance issues, it would be worth pursuing.
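To check the grouped query against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. The table name input_table and the column aliases are assumptions made for the example; the query itself is the one from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_table (id INTEGER, agree TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO input_table VALUES (?, ?, ?)",
    [
        (1, "N", 2003),
        (2, "Y", 2005), (2, "N", 2015),
        (3, "N", 2005), (3, "N", 2007), (3, "Y", 2011), (3, "Y", 1999),
        (4, "N", 2005), (4, "N", 2010),
    ],
)

# MAX(agree) yields 'Y' when any row in the group has it ('Y' sorts after 'N').
# The CASE expression turns non-Y years into 0 before MAX picks the largest.
rows = conn.execute("""
    SELECT id,
           MAX(agree) AS agree,
           MAX(CASE WHEN agree = 'Y' THEN year ELSE 0 END) AS year
    FROM input_table
    GROUP BY id
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

For the sample data this prints one row per id, matching the output table in the question.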

Related

How can I match a sequence of numbers with the specific columns based on their names in BigQuery with the SQL query result groupped by this sequence?

I am working with two tables in BigQuery and I need the achieve the following result:
ret_d ad_arpu pur_arpu
d1 x x
d2 y y
dx etc etc
A bit of clarification: ARPU is average revenue per user, which in this case consists of revenue from purchases (pur_arpu) and from ads (ad_arpu).
I need to return pur_arpu and ad_arpu next to the corresponding day that customer has been retained, i.e. retention day #x (ret_dx). Arpu is revenue divided by number of customers at a certain period of time. I need arpu per user per retained day.
In the table I am working with I have the following relevant columns
C_id  ret_d1  ret_d2  ret_dx  ad_rev_d1  ad_rev_d2  ad_rev_dx  pur_rev_d1  pur_rev_d2  pur_rev_dx
id1   0 OR 1  0 OR 1  0 OR 1  x          x          x          x           x           x
id2   0 OR 1  0 OR 1  0 OR 1  y          y          y          y           y           y
idx   0 OR 1  0 OR 1  0 OR 1  etc        etc        etc        etc         etc         etc
c_id - customer id,
ret_dx - retained day #x, 1 stands for user retaining on that day,
ad_rev_dx - revenue from ads on dx,
pur_rev_dx - revenue from purchases on dx.
My idea for the SQL query was to return a sequence of numbers, ad_arpu, and pur_arpu from a subquery. So something along these lines:
SELECT
["d1", "d2", "dx"] AS ret_d,
ad_arpu AS ad_arpu,
pur_arpu AS pur_arpu
FROM
(SELECT
SUM(ad_rev_d1)/SUM(ret_d1) AS ad_arpu_d1,
SUM(ad_rev_d2)/SUM(ret_d2) AS ad_arpu_d2,
SUM(ad_rev_dx)/SUM(ret_dx) AS ad_arpu_dx,
SUM(pur_rev_d1)/SUM(ret_d1) AS pur_arpu_d1,
SUM(pur_rev_d2)/SUM(ret_d2) AS pur_arpu_d2,
SUM(pur_rev_dx)/SUM(ret_dx) AS pur_arpu_dx
--somehow match ad_arpu_dx and pur_arpu_dx to the corresponding days from the ret_d sequence to get ad_arpu and pur_arpu
FROM tablex)
But so far my efforts have been getting me nowhere :( Missing how to match arpu numbers and the day sequence numbers.
I really appreciate any tips that could help with this!
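One possible way to line up each day label with its ARPU values is to compute one aggregated row per day and stack the rows with UNION ALL. The sketch below is only an illustration under stated assumptions: it runs in SQLite from Python rather than in BigQuery, it covers only d1 and d2, and the two sample rows in tablex are hypothetical values invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tablex (
        c_id TEXT,
        ret_d1 INTEGER, ret_d2 INTEGER,
        ad_rev_d1 REAL, ad_rev_d2 REAL,
        pur_rev_d1 REAL, pur_rev_d2 REAL
    )
""")
# Hypothetical sample rows, not taken from the question.
conn.executemany(
    "INSERT INTO tablex VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("id1", 1, 1, 0.5, 0.25, 2.0, 1.0),
        ("id2", 1, 0, 1.5, 0.0, 0.0, 0.0),
    ],
)

# One SELECT per retention day; each produces a single labelled row,
# and UNION ALL stacks them into the ret_d / ad_arpu / pur_arpu shape.
arpu_rows = conn.execute("""
    SELECT 'd1' AS ret_d,
           SUM(ad_rev_d1) / SUM(ret_d1) AS ad_arpu,
           SUM(pur_rev_d1) / SUM(ret_d1) AS pur_arpu
    FROM tablex
    UNION ALL
    SELECT 'd2',
           SUM(ad_rev_d2) / SUM(ret_d2),
           SUM(pur_rev_d2) / SUM(ret_d2)
    FROM tablex
    ORDER BY ret_d
""").fetchall()

for row in arpu_rows:
    print(row)
```

With many days the repeated SELECT blocks would likely be generated rather than typed by hand, and a day on which nobody is retained would need a guard against division by zero.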

SQL Max Consecutive Values in a number set using recursion

The following SQL query is supposed to return the max consecutive numbers in a set.
WITH RECURSIVE Mystery(X,Y) AS ((SELECT A AS X, A AS Y FROM R)
UNION (SELECT m1.X, m2.Y
FROM Mystery m1, Mystery m2
WHERE m2.X = m1.Y + 1))
SELECT MAX(Y-X) + 1 FROM Mystery;
This query on the set {7, 9, 10, 14, 15, 16, 18} returns 3, because {14 15 16} is the longest chain of consecutive numbers and there are three numbers in that chain. But when I try to work through this manually I don't see how it arrives at that result.
For example, given the number set above I could create two columns:
m1.x m2.y
7 7
9 9
10 10
14 14
15 15
16 16
18 18
If we are working on rows and columns, not the actual data, as I understand it WHERE m2.X = m1.Y + 1 takes the value from the next row in Y and puts it in the current row of X, like so
m1.X m2.Y
9 7
10 9
14 10
15 14
16 15
18 16
18 Null?
The main part on which I am uncertain is where in the SQL recursion actually happens. According to Denis Lukichev recursion is the R part - or in this case the RECURSIVE Mystery(X,Y) - and stops when the table is empty. But if the above is true, how would the table ever empty?
Since I don't know how to proceed with the above, let me try a different direction. If WHERE m2.X = m1.Y + 1 is actually a comparison, the result should be:
m1.X m2.Y
14 14
15 15
16 16
But at this point, it seems that it should continue recursively on this until only two rows are left (nothing else to compare). If it stops here to get the correct count of 3 rows (2 + 1), what is actually stopping the recursion?
I understand that for the above example the MAX(Y-X) + 1 effectively returns the actual number of recursion steps and adds 1.
But if I have 7 consecutive numbers and the recursion flows down to 2 rows, should this not end up with an incorrect 3 as the result? I understand recursion in C++ and other languages, but this is confusing to me.
Full disclosure, yes it appears this is a common university question, but I am retired, discovered this while researching recursion for my use, and need to understand how it works to use similar recursion in my projects.
Based on this db<>fiddle shared previously, you may find it instructive to alter the CTE to include an iteration number as follows, and then to show the content of the CTE rather than the output of final SELECT. Here's an amended CTE and its content after the recursion is complete:
Amended CTE
WITH RECURSIVE Mystery(X,Y,Z) AS ((SELECT A AS X, A AS Y, 1 as Z FROM R)
UNION (SELECT m1.X, m2.A, Z+1
FROM Mystery m1
JOIN R m2 ON m2.A = m1.Y + 1))
CTE Content
x y z
7 7 1
9 9 1
10 10 1
14 14 1
15 15 1
16 16 1
18 18 1
9 10 2
14 15 2
15 16 2
14 16 3
The Z field holds the iteration count. Where Z = 1 we simply have the rows from the table R: the values X and Y are both taken from the field A. In terms of what we are attempting to achieve, each row represents a sequence of consecutive numbers that starts at X and continues to (at least) Y.
Where Z = 2, the second iteration, we find all the rows of the first iteration for which there is a value in R that is one higher than our Y value, that is, one higher than the last member of our sequence of consecutive numbers. That value becomes the new highest number, and we add one to the number of iterations. As only three numbers in our original data set have successors within the set, only three rows are output in the second iteration.
Where Z = 3, the third iteration, we find all the rows of the second iteration (note we are not considering all the rows of the first iteration again), where there is, again, a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That, again, becomes the new highest number, and we add one to the number of iterations.
The process will attempt a fourth iteration, but as there are no rows in R where the value is one more than the Y values from our third iteration, no extra data gets added to the CTE and recursion ends.
Going back to the original db<>fiddle, the process then searches our CTE content to output MAX(Y-X) + 1, which is the maximum difference between the first and last values in any consecutive sequence, plus one. This finds its value from the record produced in the third iteration, using ((16-14) + 1), which has a value of 3.
For this specific piece of code, the output always equals the largest value in the Z field, because every row added by the recursion adds one to Z and one to Y.
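The iteration-by-iteration behaviour described above can be reproduced locally. The following is a minimal sketch that runs the amended CTE in an in-memory SQLite database from Python; using SQLite instead of the engine behind the db<>fiddle is an assumption, as is the column list Mystery(X, Y, Z).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?)",
    [(n,) for n in (7, 9, 10, 14, 15, 16, 18)],
)

# Seed rows (Z = 1) come from R. Each recursive step extends a chain X..Y
# by one when Y + 1 exists in R. Recursion stops when a step adds no rows.
CTE = """
WITH RECURSIVE Mystery(X, Y, Z) AS (
    SELECT A, A, 1 FROM R
    UNION
    SELECT m1.X, m2.A, m1.Z + 1
    FROM Mystery m1
    JOIN R m2 ON m2.A = m1.Y + 1
)
"""

cte_rows = conn.execute(
    CTE + "SELECT X, Y, Z FROM Mystery ORDER BY Z, X"
).fetchall()
longest = conn.execute(
    CTE + "SELECT MAX(Y - X) + 1 FROM Mystery"
).fetchone()[0]

for row in cte_rows:
    print(row)
print("longest run:", longest)
```

Printing the CTE content rather than only the final aggregate makes it visible that each iteration works only on the rows added by the previous one.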

Working of Merge in SAS (with IN=)

I have two datasets, data1 and data2:
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them with the code below:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result: (I was expecting an Inner Join result which is not the case)
1 a 10 x
2 a 20 y
2 a 30 z
2 a 40 w
Result from proc sql inner join.
proc sql;
select data1.id,sn,sales,x from data2 inner join data1 on data1.id = data2.id;
quit;
Result: (As expected from an inner join)
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 w
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 w
b 3 10 x
b 3 20 y
b 3 30 z
b 3 40 w
I want to know the concept and STEP BY STEP working of merge statement in SAS with In= and proving the above result.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge'
will occur, using if statements. For example, if
ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS
only include rows that match on the by variables from both input data
sets (like an inner join).
which I guess, (like an Inner Join) is not always the case.
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
The SAS data step, however, processes merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time: it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process. That would require random access, which the SAS data step doesn't normally do.
What it does:
For each unique BY value
Take the next record from the left side dataset, if one exists with that BY value
Take the next record from the right side dataset, if one exists with that BY value
Output a row
Continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4
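The pairing above can be imitated outside SAS. The sketch below is a rough Python model of the match-merge steps listed in the answer, not SAS itself: the function name sas_merge is hypothetical, the inputs are assumed to be sorted by the BY variable (as MERGE requires), and the data is taken from the datalines in the question as written.

```python
from itertools import groupby


def sas_merge(left, right, key):
    """Rough model of MERGE left(in=a) right(in=b); BY key; IF a AND b;"""

    def by_groups(rows):
        # Inputs must already be sorted by the BY variable.
        return {k: list(g) for k, g in groupby(rows, key=lambda r: r[key])}

    left_groups, right_groups = by_groups(left), by_groups(right)
    merged = []
    # `if a and b` keeps only BY values that occur in both datasets.
    for value in sorted(set(left_groups) & set(right_groups)):
        l_rows, r_rows = left_groups[value], right_groups[value]
        for i in range(max(len(l_rows), len(r_rows))):
            # Side-by-side pairing; when one side runs out, its last row
            # is carried forward (the retained values in the PDV).
            l_row = l_rows[min(i, len(l_rows) - 1)]
            r_row = r_rows[min(i, len(r_rows) - 1)]
            merged.append({**l_row, **r_row})
    return merged


data1 = [{"sn": 1, "id": "a"}, {"sn": 2, "id": "a"}, {"sn": 3, "id": "a"}]
data2 = [
    {"id": "a", "sales": 10, "x": "x"},
    {"id": "a", "sales": 20, "y": None, "x": "y"},
    {"id": "a", "sales": 30, "x": "z"},
    {"id": "a", "sales": 40, "x": "q"},
]
# Drop the stray helper key so every row has the same columns.
data2 = [{k: v for k, v in row.items() if k != "y"} for row in data2]

joined = sas_merge(data1, data2, "id")
for row in joined:
    print(row["sn"], row["id"], row["sales"], row["x"])
```

The result has four rows, not the twelve an SQL inner join would give for three left rows and four right rows with the same key.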

MS Access - Roll-up Time-phased data (VBA or SQL?)

I have a set of time-phased data in an Access (2010) table. There are 3 levels: Account (1), Package (2), Element (3). Each row has the Account, Package, and Element along with a time Period and dollar amount. I want to be able to roll this up so I can see what the current period and totals are at each level (one output for Account, one for Package, and one for Element) and save those different levels as their own tables (or just output back to Excel).
So if I have this data:
Account Package Element Period Dollars
A 11 X 2010 5
A 11 O 2010 5
A 11 X 2011 5
B 44 X 2010 5
B 52 O 2010 5
B 44 L 2011 5
C 24 X 2011 5
C 14 L 2011 5
C 14 L 2011 5
C 14 L 2010 5
I want to roll it up by element to get this table (if current is 2010)
Account Package Element Current Total
A 11 X 5 5
A 11 O 5 0
B 44 X 5 5
B 52 O 5 0
C 24 X 0 5
C 14 L 5 10
and then roll it up by package to get this:
Account Package Current Total
A 11 10 5
B 44 5 5
B 52 5 0
C 24 0 5
C 14 5 10
An obvious problem is that the one table isn't normalized, but I'm importing this data from an Excel file given by a customer. I did create this successfully in Excel using a lot of SUMIFs, but I'm close to 500k rows and it just starts locking up on me.
I thought I'd see if Access would work quicker. So with just the one table, I tried looping through Account, then Package, then Element, comparing Period to Current and calculating sums.
Is there a better way, such as using creative SQL queries, instead of opening a bunch of recordsets?
Simply run aggregate GROUP BY queries using the one table. The only challenge is that the other descriptive columns will need to be removed or run with an aggregate. As an example, below I used Max().
By Element
SELECT Max(Account) As MaxOfAccount, Max(Package) As MaxOfPackage,
Element, Sum(IIF(Period=2010,1,0)) As Current, Sum(Dollars) As TotalDollars
FROM TimePhasedData
GROUP BY Element
By Element for only 2010:
SELECT Max(Account) As MaxOfAccount, Max(Package) As MaxOfPackage,
Element, Count(Period) As Current, Sum(Dollars) As TotalDollars
FROM TimePhasedData
WHERE Period = 2010
GROUP BY Element
Purely by Element
SELECT Element, Sum(IIF(Period=2010,1,0)) As Current, Sum(Dollars) As TotalDollars
FROM TimePhasedData
GROUP BY Element
By Account
SELECT Account, Max(Package) As MaxOfPackage, Max(Element) As MaxOfElement,
Sum(IIF(Period=2010,1,0)) As Current, Sum(Dollars) As TotalDollars
FROM TimePhasedData
GROUP BY Account
By Package
SELECT Max(Account) As MaxOfAccount, Package, Max(Element) As MaxOfElement,
Sum(IIF(Period=2010,1,0)) As Current, Sum(Dollars) As TotalDollars
FROM TimePhasedData
GROUP BY Package
Finally, many Excel functions have their SQL counterparts including SumIf(), CountIf(), VLookup(), Index(), Match(). And with 500K rows, consider the robustness of using Access' default SQL engine.
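As a way to try the roll-up idea on the sample rows, here is a minimal sketch using SQLite from Python instead of Access. Several things are assumptions for the example: it groups by the full level path (Account, then Package, then Element), it sums Dollars for the current period rather than counting rows, it uses CASE WHEN where Access would use IIF, and the names rollup, CurrentDollars and TotalDollars are hypothetical.

```python
import sqlite3

LEVELS = ("Account", "Package", "Element")

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TimePhasedData (
        Account TEXT, Package INTEGER, Element TEXT,
        Period INTEGER, Dollars INTEGER
    )
""")
conn.executemany(
    "INSERT INTO TimePhasedData VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 11, "X", 2010, 5), ("A", 11, "O", 2010, 5), ("A", 11, "X", 2011, 5),
        ("B", 44, "X", 2010, 5), ("B", 52, "O", 2010, 5), ("B", 44, "L", 2011, 5),
        ("C", 24, "X", 2011, 5), ("C", 14, "L", 2011, 5), ("C", 14, "L", 2011, 5),
        ("C", 14, "L", 2010, 5),
    ],
)


def rollup(conn, depth, current_period):
    """Aggregate to the first `depth` levels (1 = Account ... 3 = Element)."""
    cols = ", ".join(LEVELS[:depth])  # fixed column names, not user input
    sql = f"""
        SELECT {cols},
               SUM(CASE WHEN Period = ? THEN Dollars ELSE 0 END) AS CurrentDollars,
               SUM(Dollars) AS TotalDollars
        FROM TimePhasedData
        GROUP BY {cols}
        ORDER BY {cols}
    """
    return conn.execute(sql, (current_period,)).fetchall()


by_account = rollup(conn, 1, 2010)
by_package = rollup(conn, 2, 2010)
by_element = rollup(conn, 3, 2010)

for row in by_package:
    print(row)
```

Here TotalDollars is the sum over all periods, following the Sum(Dollars) used in the answer. Each result could be written to its own table with a CREATE TABLE ... AS SELECT, or the equivalent make-table query in Access.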

How do I remove contiguous sequences of almost identical records from database

I have a SQL Server database containing real-time stock quotes.
There is a Quotes table containing what you would expect-- a sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.
The sequence number corresponds to a message that was received containing data for a set of ticker symbols being tracked. A new message (with a new, incrementing sequence number) is received whenever anything changes for any of the symbols being tracked. The message contains data for all symbols (even for those where nothing changed).
When the data was put into the database, a record was inserted for every symbol in each message, even for symbols where nothing changed since the prior message. So a lot of records contain redundant information (only the sequence number changed) and I want to remove these redundant records.
This is not the same as removing all but one record from the entire database for a combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for sequence number) into a single record. When finished, there may be duplicate records but with differing records between them.
My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.
In the following sample data I simplify things by showing only Sequence, Symbol, and Price. The compound primary key would be Sequence+Symbol (each symbol appears only once in a message). I want to remove records where Price is the same as the prior record (for a given ticker symbol). For ticker X it means I want to remove the range [1, 6], and for ticker Y I want to remove the ranges [1, 2], [4, 5] and [7, 7]:
Before:
Sequence Symbol Price
0 X $10
0 Y $ 5
1 X $10
1 Y $ 5
2 X $10
2 Y $ 5
3 X $10
3 Y $ 6
4 X $10
4 Y $ 6
5 X $10
5 Y $ 6
6 X $10
6 Y $ 5
7 X $11
7 Y $ 5
After:
Sequence Symbol Price
0 X $10
0 Y $ 5
3 Y $ 6
6 Y $ 5
7 X $11
Note that (Y, $5) appears twice but with (Y, $6) between.
The following generates the ranges I need. The left outer join ensures I select the first group of records (where there is no earlier record that is different), and the BETWEEN is intended to reduce the number of records that need to be searched to find the next-earlier different record (the results are the same without the BETWEEN, but slower). I would need only to add something like "DELETE FROM Quotes WHERE Sequence BETWEEN StartOfRange AND EndOfRange".
SELECT
GroupsOfIdenticalRecords.Symbol,
MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
(
SELECT
Q1.Symbol,
Q1.Sequence,
MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
FROM
Quotes AS Q1
LEFT OUTER JOIN
Quotes AS Q2
ON
Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
AND Q2.Symbol=Q1.Symbol
AND Q2.Price<>Q1.Price
GROUP BY
Q1.Sequence,
Q1.Symbol
) AS GroupsOfIdenticalRecords
GROUP BY
GroupsOfIdenticalRecords.Symbol,
GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord
The problem is that this is way too slow and runs out of memory (crashing SSMS- remarkably) for the 2+ million records in the database. Even if I change "-100" to "-2" it is still slow and runs out of memory. I expected the "ON" clause of the LEFT OUTER JOIN to limit the processing and memory usage (2 million iterations, processing about 100 records each, which should be tractable), but it seems like SQL Server may first be generating all combinations of the 2 instances of the table, Q1 and Q2 (about 4e12 combinations) before selecting based on the criteria specified in the ON clause.
If I run the query on a smaller subset of the data (for example, by using "(SELECT TOP 100000 * FROM Quotes) AS Q1", and similar for Q2), it completes in a reasonable amount of time. I was trying to figure out how to automatically run this 20 or so times using "WHERE Sequence BETWEEN 0 AND 99999", then "...BETWEEN 100000 AND 199999", etc. (actually I would use overlapping ranges such as [0,99999], [99900, 199999], etc. to remove ranges that span boundaries).
The following generates sets of ranges to split the data into 100000 record blocks ([0,99999], [100000, 199999], etc). But how do I apply the above query repeatedly (once for each range)? I keep getting stuck because you can't group these using "BETWEEN" without applying an aggregate function. So instead of selecting blocks of records, I only know how to get MIN(), MAX(), etc. (single values), which does not work with the above query (as Q1 and Q2). Is there a way to do this? Is there a totally different (and better) approach to the problem?
SELECT
CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
Quotes
GROUP BY
CONVERT(INTEGER, Sequence / 100000)*100000
You can do this with a nice little trick. The groups that you want can be defined as the difference between two sequences of numbers. One is assigned for each symbol in order by sequence. The other is assigned for each symbol and price. This is what it looks like for your data:
Sequence Symbol Price seq1 seq2 diff
0 X $10 1 1 0
0 Y $ 5 1 1 0
1 X $10 2 2 0
1 Y $ 5 2 2 0
2 X $10 3 3 0
2 Y $ 5 3 3 0
3 X $10 4 4 0
3 Y $ 6 4 1 3
4 X $10 5 5 0
4 Y $ 6 5 2 3
5 X $10 6 6 0
5 Y $ 6 6 3 3
6 X $10 7 7 0
6 Y $ 5 7 4 3
7 X $11 8 1 7
7 Y $ 5 8 5 3
You can stare at this and figure out that the combination of symbol, diff, and price defines each group.
The following puts this into a SQL query to return the data you want:
select min(q.sequence) as sequence, symbol, price
from (select q.*,
(row_number() over (partition by symbol order by sequence) -
row_number() over (partition by symbol, price order by sequence)
) as grp
from quotes q
) q
group by symbol, grp, price;
If you want to replace the data in the original table, I would suggest that you store the results of the query in a temporary table, truncate the original table, and then re-insert the values from the temporary table.
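To see the row-number difference at work on the sample data, here is a minimal sketch that runs the query in SQLite from Python. It assumes an SQLite build with window function support (3.25 or later), stores prices as plain integers, and adds an ORDER BY so the output order is stable; the query is otherwise the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotes (sequence INTEGER, symbol TEXT, price INTEGER, "
    "PRIMARY KEY (sequence, symbol))"
)
x_prices = [10, 10, 10, 10, 10, 10, 10, 11]  # ticker X, sequence 0..7
y_prices = [5, 5, 5, 6, 6, 6, 5, 5]          # ticker Y, sequence 0..7
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [(s, "X", p) for s, p in enumerate(x_prices)]
    + [(s, "Y", p) for s, p in enumerate(y_prices)],
)

# grp is constant within each contiguous run of one price for one symbol,
# so MIN(sequence) per (symbol, grp, price) is the first record of each run.
first_of_each_run = conn.execute("""
    SELECT MIN(q.sequence) AS sequence, symbol, price
    FROM (SELECT q.*,
                 (ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY sequence) -
                  ROW_NUMBER() OVER (PARTITION BY symbol, price ORDER BY sequence)
                 ) AS grp
          FROM quotes q
         ) q
    GROUP BY symbol, grp, price
    ORDER BY sequence, symbol
""").fetchall()

for row in first_of_each_run:
    print(row)
```

The rows returned correspond to the "After" table in the question, including the two separate (Y, 5) runs.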
Answering my own question. I want to add some additional comments to complement the excellent answer by Gordon Linoff.
You're right. It is a nice little trick. I had to stare at it for a while to understand how it works. Here are my thoughts for the benefit of others.
The numbering by Sequence/Symbol (seq1) always increases, whereas the numbering by Symbol/Price (seq2) only increases sometimes (within each group, only when a record for Symbol contains the group's Price). Therefore seq1 either remains in lock step with seq2 (i.e., diff remains constant, until either Symbol or Price changes), or seq1 "runs away" from seq2 (while it is busy "counting" other Prices and other Symbols-- which increases the difference between seq1 and seq2 for a given Symbol and Price). Once seq2 falls behind, it can never "catch up" to seq1, so a given value of diff is never seen again once diff moves to the next larger value (for a given Price). By taking the minimum value within each Symbol/Price group, you get the first record in each contiguous block, which is exactly what I needed.
I don't use SQL a lot, so I wasn't familiar with the OVER clause. I just took it on faith that the first clause generates seq1 and the second generates seq2. I can kind of see how it works, but that's not the interesting part.
My data contained more than just Price. It was a simple thing to add the other fields (Bid, Ask, etc.) to the second OVER clause and the final GROUP BY:
row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)
group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time
Also, I was able to use >MIN(...) and <=MAX(...) to define ranges of records to delete.
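The exact DELETE statement is not shown here, so the following is only one way the >MIN(...) and <=MAX(...) idea might look. It is a sketch in SQLite from Python (window functions require SQLite 3.25 or later), restricted to the simplified Sequence, Symbol, Price columns; the temporary table ranges and its columns first_seq and last_seq are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotes (sequence INTEGER, symbol TEXT, price INTEGER, "
    "PRIMARY KEY (sequence, symbol))"
)
x_prices = [10, 10, 10, 10, 10, 10, 10, 11]  # ticker X, sequence 0..7
y_prices = [5, 5, 5, 6, 6, 6, 5, 5]          # ticker Y, sequence 0..7
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [(s, "X", p) for s, p in enumerate(x_prices)]
    + [(s, "Y", p) for s, p in enumerate(y_prices)],
)

# Step 1: store the first and last sequence number of every contiguous run.
conn.execute("""
    CREATE TEMP TABLE ranges AS
    SELECT symbol, MIN(sequence) AS first_seq, MAX(sequence) AS last_seq
    FROM (SELECT q.*,
                 ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY sequence) -
                 ROW_NUMBER() OVER (PARTITION BY symbol, price ORDER BY sequence)
                 AS grp
          FROM quotes q) AS g
    GROUP BY symbol, grp, price
""")

# Step 2: delete everything after the first record of each run,
# that is, sequence > first_seq and sequence <= last_seq for that symbol.
conn.execute("""
    DELETE FROM quotes
    WHERE EXISTS (SELECT 1
                  FROM ranges r
                  WHERE r.symbol = quotes.symbol
                    AND quotes.sequence > r.first_seq
                    AND quotes.sequence <= r.last_seq)
""")

remaining = conn.execute(
    "SELECT sequence, symbol, price FROM quotes ORDER BY sequence, symbol"
).fetchall()
for row in remaining:
    print(row)
```

Materializing the ranges first keeps the DELETE from reading the table it is modifying through the window query, which is a design choice of this sketch and not something stated in the thread.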