SQL: The second oldest date - sql

Imagine you've got a table similar to this:
|email | purchase_date |
|:--------------|:---------------------|
|stan#gmail.com | Jun 30 2020 12:00AM |
|stan#gmail.com | Aug 05 2020 5:00PM |
|stan#gmail.com | Mar 22 2018 3:00AM |
|eric#yahoo.com | Aug 05 2020 5:00PM |
|eric#yahoo.com | Mar 22 2018 3:00PM |
|kyle#gmail.com | Mar 22 2018 3:00PM |
|kyle#gmail.com | Jun 30 2020 12:00AM |
|kyle#gmail.com | Aug 05 2020 5:00PM |
|kenny#gmail.com| Aug 05 2020 5:00PM |
Totally random. The actual database I work with is more complex, with many more columns.
Both columns are STRING type, which is not convenient; purchase_date should really be a DATE type. Kenny made only one purchase, so there shouldn't be any row for him in the result table.
Also notice that there are a lot of identical dates.
I'd like to select the email and the 2nd oldest purchase date (named 'second_purchase') for each email address, so that the result looks like this:
|email | second_purchase |
|:--------------|:-------------------- |
|stan#gmail.com | Jun 30 2020 12:00AM |
|eric#yahoo.com | Aug 05 2020 5:00PM |
|kyle#gmail.com | Jun 30 2020 12:00AM |
I can't seem to get the logic or syntax right. I don't want to put all my code in here, because I've tried many variations of my idea and none of them worked.
I'd love to see example code from someone skilled in SQL. My idea is maybe not that great. :-)
This version is actually SOQL (Salesforce Object Query Language). That could be important.
Sorry for not styling the table properly. It didn't work for me even with the recommended styling, and I wasn't able to post. That was quite frustrating.
Anyway, thank you for any help!

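You could try the following SQL, which applies DENSE_RANK within each user's email and orders by purchase_date cast to a timestamp. Because DENSE_RANK gives identical dates the same rank, dr=2 is always the second-oldest distinct date.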
Query #1
WITH date_converted_table AS (
    SELECT
        email,
        purchase_date,
        DENSE_RANK() OVER (
            PARTITION BY email
            ORDER BY CAST(purchase_date AS timestamp) ASC
        ) dr
    FROM
        mytable
)
SELECT
    email,
    purchase_date AS second_purchase
FROM
    date_converted_table
WHERE dr=2;
|email | second_purchase |
|:--------------|:--------------------|
|eric#yahoo.com | Aug 05 2020 5:00PM |
|kyle#gmail.com | Jun 30 2020 12:00AM |
|stan#gmail.com | Jun 30 2020 12:00AM |
Query #2
SELECT
    email,
    purchase_date AS second_purchase
FROM (
    SELECT
        email,
        purchase_date,
        DENSE_RANK() OVER (
            PARTITION BY email
            ORDER BY CAST(purchase_date AS timestamp) ASC
        ) dr
    FROM
        mytable
) tb
WHERE dr=2;
|email | second_purchase |
|:--------------|:--------------------|
|eric#yahoo.com | Aug 05 2020 5:00PM |
|kyle#gmail.com | Jun 30 2020 12:00AM |
|stan#gmail.com | Jun 30 2020 12:00AM |
Update 1
Regarding the follow-up question in the comments:
Is it possible to upgrade the result so that there are first_purchase
dates (where dr=1) and second_purchase dates (where dr=2) in separate
columns?
A CASE expression combined with aggregation can do this, as shown below. The HAVING clause ensures that each returned email has both a first and a second purchase date.
SELECT
    email,
    MAX(CASE
        WHEN dr=1 THEN purchase_date
    END) AS first_purchase,
    MAX(CASE
        WHEN dr=2 THEN purchase_date
    END) AS second_purchase
FROM (
    SELECT
        email,
        purchase_date,
        DENSE_RANK() OVER (
            PARTITION BY email
            ORDER BY CAST(purchase_date AS timestamp) ASC
        ) dr
    FROM
        mytable
) tb
GROUP BY email
HAVING
    SUM(
        CASE WHEN dr=1 THEN 1 ELSE 0 END
    ) > 0 AND
    SUM(
        CASE WHEN dr=2 THEN 1 ELSE 0 END
    ) > 0;
|email | first_purchase | second_purchase |
|:--------------|:-------------------|:--------------------|
|eric#yahoo.com | Mar 22 2018 3:00PM | Aug 05 2020 5:00PM |
|kyle#gmail.com | Mar 22 2018 3:00PM | Jun 30 2020 12:00AM |
|stan#gmail.com | Mar 22 2018 3:00AM | Jun 30 2020 12:00AM |
Let me know if this works for you.
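To make the ranking logic easier to follow outside a database, here is a minimal Python sketch of the same idea: parse the string dates, collapse identical dates per email (the "dense" part of DENSE_RANK), and keep the second-oldest one. The date format string and the function name `second_purchases` are assumptions made for this illustration; they are not part of the original query and this is not SOQL.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question; both columns are strings.
rows = [
    ("stan#gmail.com", "Jun 30 2020 12:00AM"),
    ("stan#gmail.com", "Aug 05 2020 5:00PM"),
    ("stan#gmail.com", "Mar 22 2018 3:00AM"),
    ("eric#yahoo.com", "Aug 05 2020 5:00PM"),
    ("eric#yahoo.com", "Mar 22 2018 3:00PM"),
    ("kyle#gmail.com", "Mar 22 2018 3:00PM"),
    ("kyle#gmail.com", "Jun 30 2020 12:00AM"),
    ("kyle#gmail.com", "Aug 05 2020 5:00PM"),
    ("kenny#gmail.com", "Aug 05 2020 5:00PM"),
]

# Assumed layout of the string dates, e.g. "Jun 30 2020 12:00AM".
DATE_FORMAT = "%b %d %Y %I:%M%p"


def second_purchases(rows):
    """Return {email: second-oldest distinct purchase_date string}."""
    by_email = defaultdict(dict)
    for email, raw in rows:
        # Keying by the parsed timestamp collapses duplicate dates,
        # which is what DENSE_RANK does with ties.
        by_email[email][datetime.strptime(raw, DATE_FORMAT)] = raw
    result = {}
    for email, dates in by_email.items():
        ordered = sorted(dates)  # ascending by real timestamp, not by text
        if len(ordered) >= 2:    # mirrors WHERE dr = 2
            result[email] = dates[ordered[1]]
    return result


result = second_purchases(rows)
for email, second in sorted(result.items()):
    print(email, second)
```

Kenny has a single purchase, so he never reaches rank 2 and is absent from the result, matching the table in the question.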

Related

Calculate running sum of previous 3 months from monthly aggregated data

I have a dataset that I have aggregated at monthly level. The next part needs me to take, for every block of 3 months, the sum of the data at monthly level.
So essentially my input data (after aggregated to monthly level) looks like:
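|month | year | status | count_id |
|:-----|:-----|:-------|:---------|
|08 | 2021 | stat_1 | 1 |
|09 | 2021 | stat_1 | 3 |
|10 | 2021 | stat_1 | 5 |
|11 | 2021 | stat_1 | 10 |
|12 | 2021 | stat_1 | 10 |
|01 | 2022 | stat_1 | 5 |
|02 | 2022 | stat_1 | 20 |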
and I want my output data to look like:
|month | year | status | count_id | 3m_sum |
|:-----|:-----|:-------|:---------|:-------|
|08 | 2021 | stat_1 | 1 | 1 |
|09 | 2021 | stat_1 | 3 | 4 |
|10 | 2021 | stat_1 | 5 | 8 |
|11 | 2021 | stat_1 | 10 | 18 |
|12 | 2021 | stat_1 | 10 | 25 |
|01 | 2022 | stat_1 | 5 | 25 |
|02 | 2022 | stat_1 | 20 | 35 |
i.e. 3m_sum for Feb = Feb + Jan + Dec. I tried to do this using a self join and wrote a query along the lines of
WITH CTE AS (
    SELECT date_part('month', date_col) AS month
         , date_part('year', date_col) AS year
         , status
         , count(distinct id) AS count_id
    FROM (date_col, status, transaction_id) AS a
)
SELECT a.month, a.year, a.status, sum(b.count_id) AS 3m_sum
FROM cte AS a
LEFT JOIN cte AS b ON a.status = b.status
    AND b.month >= a.month - 2 AND b.month <= a.month
GROUP BY 1, 2, 3
This query NEARLY works. Where it falls apart is in Jan and Feb. My data runs from August 2021 to April 2022. That means the value for Jan should be Nov + Dec + Jan, and similarly for Feb it should be Dec + Jan + Feb.
As I am joining on the MONTH only, the months Aug - Nov are all treated as being greater than the month numbers of Jan/Feb, so the query isn't doing the correct sum.
How can I adjust this bit to give the correct sum?
I did think of using a LAG function, but (even though I'm 99% sure a month won't ever be missed) I can't guarantee we will never have a month with 0 values, in which case LAG would sum the wrong rows.
I also tried doing the same join at individual date level (without aggregating in my nested query), but this gave vastly different numbers. I want the sum of the monthly aggregates, and I think summing from the individual rows counted many ids multiple times that my COUNT DISTINCT would otherwise remove.
You can use a SUM with a window frame of 2 PRECEDING. To ensure you don't miss rows, use a calendar table and left-join all the results to it.
SELECT *,
    SUM(a.count_id) OVER (ORDER BY c.year, c.month
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Calendar c
LEFT JOIN a ON a.year = c.year AND a.month = c.month
WHERE c.year >= 2021 AND c.year <= 2022;
You could also use LAG but you would need it twice.
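The reasoning behind the calendar table plus window frame can be sketched in plain Python: build every month between the first and last data point, treat months missing from the data as 0, then sum the current entry and the two before it. The helper names `month_range` and `rolling_3m` are made up for this illustration, and the calendar here is generated on the fly rather than read from a table.

```python
# Monthly counts from the question, keyed by (year, month).
counts = {
    (2021, 8): 1, (2021, 9): 3, (2021, 10): 5, (2021, 11): 10,
    (2021, 12): 10, (2022, 1): 5, (2022, 2): 20,
}


def month_range(start, end):
    """Yield every (year, month) from start to end inclusive: the 'calendar'."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def rolling_3m(counts):
    """Return {(year, month): sum of that month and the two before it}."""
    calendar = list(month_range(min(counts), max(counts)))
    # LEFT JOIN equivalent: months absent from the data contribute 0.
    filled = [counts.get(ym, 0) for ym in calendar]
    # ROWS BETWEEN 2 PRECEDING AND CURRENT ROW equivalent.
    return {
        ym: sum(filled[max(0, i - 2): i + 1])
        for i, ym in enumerate(calendar)
    }


sums = rolling_3m(counts)
for (year, month), total in sums.items():
    print(f"{month:02d} {year} {total}")
```

Because the ordering key is the (year, month) pair rather than the month number alone, January 2022 correctly follows December 2021, which is exactly where the month-only self join goes wrong.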
This is essentially #Charlieface's answer, except that I get one value that differs from the expected result table you posted (9 rather than 8 for October 2021):
WITH
-- your input - and I avoid keywords like "MONTH" or "YEAR"
-- and also identifiers starting with digits are forbidden -
indata(mm,yy,status,count_id,sum_3m) AS (
            SELECT 08,2021,'stat_1',1,1
  UNION ALL SELECT 09,2021,'stat_1',3,4
  UNION ALL SELECT 10,2021,'stat_1',5,8
  UNION ALL SELECT 11,2021,'stat_1',10,18
  UNION ALL SELECT 12,2021,'stat_1',10,25
  UNION ALL SELECT 01,2022,'stat_1',5,25
  UNION ALL SELECT 02,2022,'stat_1',20,35
)
SELECT
  *
, SUM(count_id) OVER(
    ORDER BY yy,mm
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS sum_3m_calc
FROM indata;
-- out  mm |  yy  | status | count_id | sum_3m | sum_3m_calc
-- out ----+------+--------+----------+--------+-------------
-- out   8 | 2021 | stat_1 |        1 |      1 |           1
-- out   9 | 2021 | stat_1 |        3 |      4 |           4
-- out  10 | 2021 | stat_1 |        5 |      8 |           9
-- out  11 | 2021 | stat_1 |       10 |     18 |          18
-- out  12 | 2021 | stat_1 |       10 |     25 |          25
-- out   1 | 2022 | stat_1 |        5 |     25 |          25
-- out   2 | 2022 | stat_1 |       20 |     35 |          35

How to display the oldest date for a unique user who has multiple dates in a database?

Let's say that my output looks like this (simplified example):
|UserName | ProfileCreation | PurchasePrice | PurchaseDate |
|:--------|:-------------------|:--------------|:--------------------|
|Alice | Dec 21 2019 6:00AM | 120.00 | Dec 21 2019 8:00AM |
|Alice | Dec 21 2019 6:00AM | 90.00 | Dec 25 2019 9:00AM |
|Alice | Dec 21 2019 6:00AM | 150.00 | Jan 02 2020 10:00AM |
|Bob | Jan 01 2020 9:00PM | 50.00 | Jan 03 2020 11:00PM |
|Bob | Jan 01 2020 9:00PM | 70.00 | Jan 07 2020 11:00PM |
The code for this output would look like this, I guess (not that important):
SELECT
UserName, ProfileCreation, PurchasePrice, PurchaseDate
FROM Some_Random_Database
But my desired output should look like this:
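|UserName | ProfileCreation | PurchasePrice | FirstPurchaseDate | NumberOfPurchases | AvgOfPurchasePrice |
|:--------|:----------------|:--------------|:------------------|:------------------|:-------------------|
|Alice | Dec 21 2019 | 120.00 | Dec 21 2019 | 3 | 120.00 |
|Bob | Jan 01 2020 | 50.00 | Jan 03 2020 | 2 | 60.00 |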
Hopefully my goal is clear: one row per unique user, with the date of their oldest purchase and some metrics calculated over all of their purchases. The price of the first purchase can stay, but it is not necessary.
I'm writing in SOQL dialect - Salesforce Marketing Cloud.
Obviously, I've got some ideas how to do some of the intended tweaks in my code, but I'd like to see a solution from any expert who is willing to show me the best way possible. I'm really just a noob :-)
I appreciate any help, guys!
Note: I know nothing about Salesforce Marketing Cloud, but...
There are a few ways to achieve that:
#1 - standard SQL
SELECT UserName, ProfileCreation
     , MIN(PurchaseDate) FirstPurchaseDate
     , COUNT(PurchasePrice) NoOfPurchases
     , AVG(PurchasePrice) AvgPurchasePrice
FROM Foo
GROUP BY UserName, ProfileCreation;
#2 - window functions
SELECT DISTINCT UserName, ProfileCreation
     , MIN(PurchaseDate) OVER(PARTITION BY UserName ORDER BY UserName) FirstPurchaseDate
     , COUNT(PurchasePrice) OVER(PARTITION BY UserName ORDER BY UserName) NoOfPurchases
     , AVG(PurchasePrice) OVER(PARTITION BY UserName ORDER BY UserName) AvgPurchasePrice
FROM Foo;
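As a quick way to check the GROUP BY version (#1) against the sample data, here is a small sketch that runs it in an in-memory SQLite database through Python's `sqlite3` module. This is SQLite, not Salesforce Marketing Cloud, so treat it only as a demonstration of the aggregation logic. The dates are stored as ISO-8601 text, which is an assumption made here so that MIN compares them chronologically; the original "Dec 21 2019 8:00AM" strings would not sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Foo (UserName TEXT, ProfileCreation TEXT,"
    " PurchasePrice REAL, PurchaseDate TEXT)"
)
# Sample rows from the question, with dates rewritten as ISO-8601 text
# (an assumption for this sketch) so that text comparison is chronological.
conn.executemany(
    "INSERT INTO Foo VALUES (?, ?, ?, ?)",
    [
        ("Alice", "2019-12-21 06:00", 120.0, "2019-12-21 08:00"),
        ("Alice", "2019-12-21 06:00", 90.0, "2019-12-25 09:00"),
        ("Alice", "2019-12-21 06:00", 150.0, "2020-01-02 10:00"),
        ("Bob", "2020-01-01 21:00", 50.0, "2020-01-03 23:00"),
        ("Bob", "2020-01-01 21:00", 70.0, "2020-01-07 23:00"),
    ],
)

# One row per user: oldest purchase date, purchase count, average price.
summary = conn.execute(
    """
    SELECT UserName, ProfileCreation,
           MIN(PurchaseDate)    AS FirstPurchaseDate,
           COUNT(PurchasePrice) AS NoOfPurchases,
           AVG(PurchasePrice)   AS AvgPurchasePrice
    FROM Foo
    GROUP BY UserName, ProfileCreation
    ORDER BY UserName
    """
).fetchall()

for row in summary:
    print(row)
```

The two printed rows correspond to the desired output in the question: Alice with 3 purchases averaging 120.0, and Bob with 2 purchases averaging 60.0.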
SELECT
    UserName, ProfileCreation, PurchasePrice, PurchaseDate
FROM
    Some_Random_Database
WHERE
    (UserName, PurchaseDate) IN
    -- MIN picks each user's oldest purchase, as the question asks
    (SELECT UserName, MIN(PurchaseDate) FROM Some_Random_Database GROUP BY UserName);

Adding set lists of future dates to rows in a SQL query

So I am doing a cohort analysis for customers, where a cohort is a group of people who started using the product in the same month. I then keep track of each cohort's total use for every subsequent month up till present time.
For example, the first "cohort month" is January 2012, then I have "use months" January 12, Feb 12, March 12, ..., March 17(current month). One column is "cohort month", and another is "use month". This process repeats for every subsequent cohort month. The table looks like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
...
Feb 17 | Feb 17
Feb 17 | Mar 17
Mar 17 | Mar 17
The problem arises because I want to do forecasting for one year out for both existing and future cohorts.
That means for the Jan 12 cohort, I want to do prediction for April 17 to Mar 18.
I also want to do predictions for the April 17 cohort (which doesn't exist yet) from April 17 to Mar 18. And so on till predictions for the Mar 18 cohort in Mar 18.
I can handle the predictions, don't worry about that.
My issue is that I cannot figure out how to add this list of (April 17 ... Mar 18) to the "use month" column at the end of every cohort, before the next cohort starts.
I also need to add the cohorts April 17 to Mar 18, each with the applicable part of this list of (April 17 ... Mar 18).
So I want the table to look like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Jan 12 | Apr 17
..
Jan 12 | Mar 18
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
Feb 12 | Apr 17
...
Feb 12 | Mar 18
...
...
Feb 17 | Feb 17
Feb 17 | Mar 17
...
Feb 17 | Mar 18
Mar 17 | Mar 17
...
Mar 17 | Mar 18
I know the first solution that comes to mind is to create a list of all dates from Jan 12 to Mar 18, cross join it to itself, and then left outer join it to the table I currently have (where cohort / use months range from Jan 12 to Mar 17). However, this is not scalable.
Is there a way I can just iteratively add in this list of the months of the next year?
I am using HP Vertica, could use Presto or Hive if absolutely necessary
I think you should use the query below to create a temporary table out of nothing, and join it with the rest of your query. You can't do anything in a procedural manner in SQL, I'm afraid, so you won't be able to get away without a CROSS JOIN. But here, the CROSS JOIN is limited to generating the first-of-month pairs that you need.
Here goes:
WITH
-- create a list of integers from 0 to 100 using the TIMESERIES clause
i(i) AS (
  SELECT dt::DATE - '2000-01-01'::DATE
  FROM (
              SELECT '2000-01-01'::DATE + 0
    UNION ALL SELECT '2000-01-01'::DATE + 100
  ) d(d)
  TIMESERIES dt AS '1 day' OVER(ORDER BY d::TIMESTAMP)
)
,
-- limits are Jan-2012 to the first of the current month plus one year
month_limits(month_limit) AS (
            SELECT '2012-01-01'::DATE
  UNION ALL SELECT ADD_MONTHS(TRUNC(CURRENT_DATE,'MONTH'),12)
)
-- create the list of possible months as a CROSS JOIN of the i table
-- containing the integers and the month_limits table, using ADD_MONTHS()
-- and the smallest and greatest month of the month limits
,month_list AS (
  SELECT
    ADD_MONTHS(MIN(month_limit),i) AS month_first
  FROM month_limits CROSS JOIN i
  GROUP BY i
  HAVING ADD_MONTHS(MIN(month_limit),i) <= (
    SELECT MAX(month_limit) FROM month_limits
  )
)
-- finally, CROSS JOIN the obtained month list with itself with the
-- filters needed.
SELECT
  cohort.month_first AS cohort_month
, use.month_first AS use_month
FROM month_list AS cohort
CROSS JOIN month_list AS use
WHERE use.month_first >= cohort.month_first
ORDER BY 1,2
;
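To illustrate what the final SELECT produces, here is a rough Python sketch of the same pairing logic: build the list of first-of-month dates, cross it with itself, and keep only pairs where the use month is not earlier than the cohort month. The function names are invented for this example, and the end date is fixed at March 2018 (the horizon from the question) instead of being derived from the current date, so that the result is reproducible.

```python
from datetime import date


def month_list(first, last):
    """Return the first day of every month from first to last inclusive."""
    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def cohort_pairs(first, last):
    """Cross join the month list with itself, keeping use_month >= cohort_month."""
    months = month_list(first, last)
    return [(cohort, use) for cohort in months for use in months if use >= cohort]


# Assumed limits: Jan 2012 (first cohort) to Mar 2018 (one year of forecast).
pairs = cohort_pairs(date(2012, 1, 1), date(2018, 3, 1))
print(len(pairs))           # number of (cohort_month, use_month) rows
print(pairs[0], pairs[-1])  # first and last pair in sorted order
```

With 75 months in the range, the filtered cross join yields 75 * 76 / 2 = 2850 rows, which stays small even though it is technically a self cross join.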

How to change start date in a table to a pair of start date and end date using SQL

The title must be confusing, but the thing I am trying to do is very easy to understand with an example. I have a table like this:
|Code | Date_ | Ratio |
|:-----|:--------------------|:------|
|73245 | Jan 1 1975 12:00AM | 10 |
|73245 | Apr 18 2006 12:00AM | 4 |
|73245 | Dec 26 2007 12:00AM | 10 |
|73245 | Jan 30 2009 12:00AM | 4 |
|73245 | Apr 21 2011 12:00AM | 2 |
Basically for each security it gives some ratio for it with a date when the ratio starts to be effective. This table will be much easier to use if instead of just having a start date, it has a pair of start date and end date, like the following:
|Code | StartDate_ | EndDate_ | Ratio |
|:-----|:--------------------|:--------------------|:------|
|73245 | Jan 1 1975 12:00AM | Apr 18 2006 12:00AM | 10 |
|73245 | Apr 18 2006 12:00AM | Dec 26 2007 12:00AM | 4 |
|73245 | Dec 26 2007 12:00AM | Jan 30 2009 12:00AM | 10 |
|73245 | Jan 30 2009 12:00AM | Apr 21 2011 12:00AM | 4 |
|73245 | Apr 21 2011 12:00AM | Dec 31 2049 12:00AM (or some random date in the far future) | 2 |
How do I transform the original table to the table I want using SQL statements? I have little experience with SQL and I could not figure how.
Please help! Thanks!
In SQL Server 2012:
SELECT code,
       date_ AS startDate,
       LEAD(date_) OVER (PARTITION BY code ORDER BY date_) AS endDate,
       ratio
FROM mytable
In SQL Server 2005 and 2008:
WITH q AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY code ORDER BY date_) AS rn
    FROM mytable
)
SELECT q1.code, q1.date_ AS startDate, q2.date_ AS endDate, q1.ratio
FROM q q1
LEFT JOIN
    q q2
ON q2.code = q1.code
    AND q2.rn = q1.rn + 1
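Both queries above pair each row with the next row of the same code when ordered by date. A minimal Python sketch of that pairing follows; the function name `with_end_dates` and the use of a far-future default for the last row are choices made for this illustration (the SQL above returns NULL for the last end date instead).

```python
from datetime import datetime
from itertools import groupby

# Placeholder end date for the most recent ratio, as suggested in the question.
FAR_FUTURE = datetime(2049, 12, 31)

# (code, start date, ratio) rows from the question.
rows = [
    (73245, datetime(1975, 1, 1), 10),
    (73245, datetime(2006, 4, 18), 4),
    (73245, datetime(2007, 12, 26), 10),
    (73245, datetime(2009, 1, 30), 4),
    (73245, datetime(2011, 4, 21), 2),
]


def with_end_dates(rows):
    """Return (code, start, end, ratio), where end is the next start per code."""
    out = []
    # PARTITION BY code ORDER BY date_ equivalent.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for code, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        for i, (_, start, ratio) in enumerate(group):
            # LEAD(date_) equivalent, with a default for the last row.
            end = group[i + 1][1] if i + 1 < len(group) else FAR_FUTURE
            out.append((code, start, end, ratio))
    return out


periods = with_end_dates(rows)
for code, start, end, ratio in periods:
    print(code, start.date(), end.date(), ratio)
```

Each end date equals the following row's start date, and only the latest ratio for a code receives the placeholder end date.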
Maybe it would also be possible to use OUTER APPLY, something like:
SELECT t1o.Code, t1o.Date_ AS StartDate_,
       ISNULL(t2.EndDate_, CAST('20491231' AS DATETIME)) AS EndDate_,
       t1o.Ratio
FROM t1 AS t1o
OUTER APPLY
(
    -- the next start date for the same code becomes this row's end date
    SELECT TOP 1 t1.Date_ AS EndDate_
    FROM t1
    WHERE t1.Code = t1o.Code AND t1.Date_ > t1o.Date_
    ORDER BY t1.Date_ ASC
) AS t2

datetime order by

I have a website that contains news data, and I want to show the most recently updated data first, ordered by time.
I have a field, column_time, that contains 8 rows. Why, when I use this SQL:
select * from table_name order by waktu desc
is the result this:
28 Jan 2013 | 15:36:47
28 Jan 2013 | 15:30:48
27 Jan 2013 | 21:38:36
27 Jan 2013 | 21:38:32
27 Jan 2013 | 21:38:29
11 Feb 2013 | 20:41:05
11 Feb 2013 | 20:40:37
11 Feb 2013 | 20:36:11
and not this?
11 Feb 2013 | 20:41:05
11 Feb 2013 | 20:40:37
11 Feb 2013 | 20:36:11
28 Jan 2013 | 15:36:47
28 Jan 2013 | 15:30:48
27 Jan 2013 | 21:38:36
27 Jan 2013 | 21:38:32
27 Jan 2013 | 21:38:29
The column is sorted like character data, type varchar or text.
You probably want to use timestamp or datetime as data type, depending on your secret RDBMS.
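The effect is easy to reproduce outside the database. The short Python sketch below sorts three of the values from the question twice: once as plain text, and once after parsing them as dates. The format string is an assumption based on how the values are displayed in the question.

```python
from datetime import datetime

values = [
    "28 Jan 2013 | 15:36:47",
    "27 Jan 2013 | 21:38:36",
    "11 Feb 2013 | 20:41:05",
]

# Assumed display format of the stored strings.
FORMAT = "%d %b %Y | %H:%M:%S"

# Text sort: compares character by character, so "28" > "27" > "11"
# regardless of the month that follows.
as_text = sorted(values, reverse=True)

# Date sort: compares the actual points in time.
as_dates = sorted(
    values,
    key=lambda s: datetime.strptime(s, FORMAT),
    reverse=True,
)

print(as_text)
print(as_dates)
```

The text sort puts 28 Jan first and 11 Feb last, reproducing the order from the question, while the date sort puts 11 Feb first as intended.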
Try this to order by the latest record (date and time together, not just time) by converting the value to DATETIME:
Select * from table_name
Order by convert(Datetime, replace(your_column, ' | ', ' ')) desc
Or, if you just need to order by time regardless of date, use the following (on SQL Server 2008 or later you can also convert to TIME):
Order by convert(Datetime, substring(your_column,
    charindex('|', your_column, 1) + 2, len(your_column))) desc