Using MIN(TIMESTAMP) in WHERE - sql

First off, I am a beginner in SQL, just learning, and I am stuck on one problem. I have looked everywhere but have not been able to find an answer for it.
SCHEMA: time_ts TIMESTAMP,id BYTES,sale_amount FLOAT,client STRING.
The report I am trying to export is the clients who were newly acquired within the last 12 months and who have also made 2 or 3 purchases over those same 12 months.
DATA SAMPLE:
Row time_ts id sale_amount client
1 2011-12-02 16:17:01.280 UTC James 97.67 104795
2 2010-03-29 19:43:07.723 UTC Mark 90.0 106186
EXPECTED RESULT:
Number_of_Orders Revenue_Total Year Total_Num_of_orders
1, 100$ 2010 60
2, 150$ 2010 65
What I have so far (which returns 0 results):
SELECT client, COUNT(id) AS sales, MIN(time_ts),
FROM [bigquery-public-data:hacker_news.comments]
WHERE time_ts >= TIMESTAMP(time_ts) > DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -12, 'MONTH')
GROUP BY client
HAVING COUNT(id) = 2;

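You are close, but the condition on MIN(time_ts) needs to go in the HAVING clause instead of the WHERE clause. WHERE filters individual rows before they are grouped, so it cannot test an aggregate such as MIN():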
SELECT client, COUNT(id) AS sales, MIN(time_ts),
FROM [bigquery-public-data:hacker_news.comments]
GROUP BY client
HAVING COUNT(id) = 2 AND
MIN(time_ts) > DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -12, 'MONTH');
I am assuming that the date arithmetic is right. I stopped using legacy SQL a while ago and you should use standard SQL as well.
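For reference, a rough standard SQL (BigQuery) equivalent might look like the sketch below. The table name `my_project.my_dataset.sales` is a placeholder for wherever the sales rows actually live, and the first_sale alias is added here only for readability. The cutoff is built with DATE_SUB on a DATE because, as far as I know, TIMESTAMP_SUB does not accept MONTH as a unit.

```sql
-- Sketch only: `my_project.my_dataset.sales` is a hypothetical table
-- with the schema from the question (time_ts, id, sale_amount, client).
SELECT
  client,
  COUNT(id)    AS sales,
  MIN(time_ts) AS first_sale   -- alias added for illustration
FROM `my_project.my_dataset.sales`
GROUP BY client
HAVING COUNT(id) = 2
   -- first purchase falls within the last 12 months (newly acquired)
   AND MIN(time_ts) > TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH));
```

To cover clients with 2 or 3 purchases in one pass, the count test could be changed to COUNT(id) IN (2, 3).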

Related

Calculating Datediff of two days based on when the sum of a column hits a number cap

I tried to see if this had been asked anywhere else, but it doesn't seem like it. I am trying to create a SQL query that gives me the date difference in days between '2022-10-01' and the date when our impression sum hits our cap of 5.
For context, we may see duplicate dates because someone revisits our website that day, so we'll get a different session number to pair with that count. Here's an example table of one individual and how many impressions were logged.
My goal is to get the number of days it takes to hit an impression cap of 5. So for this individual, they would hit the cap on '2022-10-07' and the days between '2022-10-01' and '2022-10-07' is 6. I am also calculating the difference before/after '2023-01-01' since I need this count for Q4 of '22 and Q1 of '23 but will not include in the example table. I have other individuals to include but for the purpose of asking here, I kept it to one.
Current Query:
select
click_date,
case
when date(click_date) < date('2023-01-01') and sum(impression_cnt = 5) then datediff('day', '2022-10-01', click_date)
when date(click_date) >= date('2023-01-01') and sum(impression_cnt = 5) then datediff('day', '2023-01-01', click_date)
else 0
end days_to_capped
from table
group by customer, click_date, impression_cnt
customer  click_date  impression_cnt
------------------------------------
123456    2022-10-05  2
123456    2022-10-05  1
123456    2022-10-06  1
123456    2022-10-07  1
123456    2022-10-11  1
123456    2022-10-11  3
Result Table
customer  days_to_cap
---------------------
123456    6
I'm currently only getting 0 days and then 81 days once it hits 2022-12-21 (the last date) for this individual, so I know I need to fix my query. Any help would be appreciated!
Edited: This is in Snowflake!
So, the issue with your query is that the sum is being calculated at the level that you are grouping by, which is every field, so it will always just be the value of the impressions field every time.
What you need is a running sum, which is a SUM() OVER (PARTITION BY ... ORDER BY ...) expression. You then qualify the results of that:
First, just to get the data that you have:
with x as (
select *
from values
(123456,'2022-10-05'::date,2),
(123456,'2022-10-05'::date,1),
(123456,'2022-10-06'::date,1),
(123456,'2022-10-07'::date,1),
(123456,'2022-10-11'::date,1),
(123456,'2022-10-11'::date,3) x (customer,click_date,impression_cnt)
)
Then, I query the CTE to do the running sum, with a QUALIFY clause to choose the record that actually has the value I'm looking for:
select
customer,
case
when click_date < '2023-01-01'::date and sum(impression_cnt) OVER (partition by customer order by click_date) = 5 then datediff('day', '2022-10-01', click_date)
when click_date >= '2023-01-01'::date and sum(impression_cnt) OVER (partition by customer order by click_date) = 5 then datediff('day', '2023-01-01', click_date)
else 0
end days_to_capped
from x
qualify days_to_capped > 0;
The qualify filters your results to just the record that you cared about.
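One caveat with the query above: the = 5 test only matches when the running total lands exactly on 5. If a single day's impressions carry the total from 4 straight to 6, no row qualifies. A possible variation, sketched here under the assumption that "hitting the cap" means reaching 5 or more, takes the earliest date at which the running total is at least 5. The names running, running_cnt and days_to_cap are illustrative, and the Q4/Q1 split is left out to keep the sketch short.

```sql
with x as (
    select *
    from values
    (123456,'2022-10-05'::date,2),
    (123456,'2022-10-05'::date,1),
    (123456,'2022-10-06'::date,1),
    (123456,'2022-10-07'::date,1),
    (123456,'2022-10-11'::date,1),
    (123456,'2022-10-11'::date,3) x (customer,click_date,impression_cnt)
),
running as (
    -- running total of impressions per customer, in date order
    select
        customer,
        click_date,
        sum(impression_cnt) over (partition by customer order by click_date) as running_cnt
    from x
)
select
    customer,
    -- earliest date on which the running total has reached the cap
    datediff('day', '2022-10-01'::date, min(click_date)) as days_to_cap
from running
where running_cnt >= 5   -- assumption: cap is "5 or more", not "exactly 5"
group by customer;
```

For the sample rows the running total reaches 5 on 2022-10-07, so this gives 6 for customer 123456, matching the result table in the question.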

SQL/ Return MIN values of multiple rows

I'm trying to get the minimum value of open, across multiple rows of year. This is from app.mode.com and the site only says SQL, not sure which version
SELECT year, open
FROM tutorial.aapl_historical_stock_price
WHERE open =
(
select MIN(open)
FROM tutorial.aapl_historical_stock_price
)
When I use the code above, the result is:
Year  Open
----------
2000  0
2000  0
2000  0
What I'm trying to get is:
Year  Open
----------
2002  0
2001  0
2000  0
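Can someone help point out what I'm doing wrong?
Your subquery returns the single minimum over the whole table, so only rows matching that one value come back. Instead, select year and take the minimum per year by grouping on year, as follows: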
select
year
, min(open) as <desired_alias>
from your_table
group by 1
order by 1 desc;

Can I query an aggregated query and a specific row's query when using subqueries?

I am new to SQL and I want to return the results for a specific value along with the average of similar values. I have gotten the average part working, but I'm not sure how to do the specific value part.
For more context, I have a list of carbon emissions by company. I want the average for an industry based on a company's industry (working perfectly below), but I am not sure how to add the specific company's info.
Here's my query:
SELECT
year, AVG(carbon) AS AVG_carbon,
-- carbon as CompanyCarbon, <--my not working attempt
FROM
"company"."carbon" c
WHERE
LOWER(c.ticker) IN (SELECT LOWER(g4.ticker)
FROM "company"."General" g4
WHERE industry = (SELECT industry
FROM "company"."General" g3
WHERE LOWER(g3.ticker) = 'ibm.us'))
GROUP BY
c.year
ORDER BY
year ASC;
The current result is:
year avg_carbon
--------------------------------
1998 7909.0000000000000000
1999 19465.500000000000
2000 19478.000000000000
2001 182679.274509803922
2002 179821.156862745098
My desired output is:
year       avg_carbon             carbon
---------------------------------------
1998 7909.0000000000000000 343
1999 19465.500000000000 544
2000 19478.000000000000 653
2001 182679.274509803922 654
2002 179821.156862745098 644
(adding the carbon column based on "IBM" carbon)
Here's my Carbon table:
ticker year carbon
-----------------------
hurn.us 2016 6282
hurn.us 2015 6549
hurn.us 2014 5897
hurn.us 2013 5300
hurn.us 2012 5340
ibm.us 2019 1496520
ibm.us 2018 1438365
Based on my limited knowledge, I think my WHERE statement is causing the problem. Right now I look at a company, get a list of tickers/identifiers in the same industry, then create an average for each year.
I tried to just call the carbon column but I think because it's processing the list of tickers, it's not outputting the result I want.
What can I do? Also if I'm making any other mistakes you see above please let me know.
The sample data and output do not match, so I can't say for sure, but this might be the answer you are looking for.
select year, AVG(carbon) AS AVG_carbon,
max(case when lower(ticker) = 'ibm.us' then carbon else 0 end) as CompanyCarbon
from "company"."carbon" c
GROUP BY c.year
order by year ASC;
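For each year, this returns the carbon value of the 'ibm.us' row as CompanyCarbon (or 0 if that ticker has no row for the year). The average is calculated as you did; keep your original WHERE clause if it should cover only that company's industry.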
To select only rows having positive value in CompanyCarbon column:
select year, AVG_carbon, CompanyCarbon
from
(
select year, AVG(carbon) AS AVG_carbon,
max(case when lower(ticker) = 'ibm.us' then carbon else 0 end) as CompanyCarbon
from "company"."carbon" c
GROUP BY c.year
) t
where CompanyCarbon > 0
order by year ASC;
Similar to the answer that Kazi provided, you can use the FILTER syntax on an aggregate, which makes it a bit more readable than the CASE/WHEN, IMO.
SELECT
year,
AVG(carbon) as avg_carbon,
MAX(carbon) FILTER (WHERE ticker = 'ibm.us') as company_carbon
FROM company_carbon
GROUP BY year
ORDER by year;
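Both answers above drop the industry filter from the question, so their average runs over every company. A sketch that keeps the question's WHERE clause and adds the FILTER aggregate could look like this, assuming PostgreSQL (which the quoted identifiers and FILTER syntax suggest) and the table names from the question. The alias company_carbon is illustrative.

```sql
SELECT
    c.year,
    AVG(c.carbon) AS avg_carbon,
    -- carbon of the one company of interest; NULL in years where it has no row
    MAX(c.carbon) FILTER (WHERE LOWER(c.ticker) = 'ibm.us') AS company_carbon
FROM "company"."carbon" c
WHERE LOWER(c.ticker) IN (
    -- every ticker in the same industry as ibm.us (same subquery as the question)
    SELECT LOWER(g.ticker)
    FROM "company"."General" g
    WHERE g.industry = (
        SELECT g2.industry
        FROM "company"."General" g2
        WHERE LOWER(g2.ticker) = 'ibm.us'
    )
)
GROUP BY c.year
ORDER BY c.year;
```

Because ibm.us belongs to its own industry, its rows survive the WHERE filter and remain visible to the FILTER aggregate.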

Data aggregation by sliding time periods

[Query and question edited and fixed thanks to comments from @Gordon Linoff and @shawnt00]
I recently inherited a SQL query that calculates the number of some events in time windows of 30 days from a log database. It uses a CTE (Common Table Expression) to generate the 30-day ranges from '2019-01-01' to now, and then it counts the cases in those 30/60/90-day intervals. I am not sure this is the best method. All I know is that it takes a long time to run and I do not understand 100% how exactly it works. So I am trying to rebuild it in an efficient way (maybe the way it is now is already the most efficient, I do not know).
I have several questions:
One of the things I notice is that instead of using DATEDIFF the query simply subtracts a number of days from the date. Is that a good practice at all?
Is there a better way of doing the time comparisons?
Is there a better way to do the whole thing? The bottom line is: I need to aggregate data by number of occurrences in time periods of 30, 60 and 90 days.
Note: LogDate original format is like 2019-04-01 18:30:12.000.
DECLARE @dt1 Datetime='2019-01-01'
DECLARE @dt2 Datetime=getDate();
WITH ctedaterange
AS (SELECT [Dates]=@dt1
UNION ALL
SELECT [dates] + 30
FROM ctedaterange
WHERE [dates] + 30<= @dt2)
SELECT
[dates],
lt.Activity, COUNT(*) as Total,
SUM(CASE WHEN lt.LogDate <= dates and lt.LogDate > dates - 90 THEN 1 ELSE 0 END) AS Activity90days,
SUM(CASE WHEN lt.LogDate <= dates and lt.LogDate > dates - 60 THEN 1 ELSE 0 END) AS Activity60days,
SUM(CASE WHEN lt.LogDate <= dates and lt.LogDate > dates - 30 THEN 1 ELSE 0 END) AS Activity30days
FROM ctedaterange AS cte
JOIN (SELECT Activity, CONVERT(DATE, LogDate) as LogDate FROM LogTable) AS lt
ON cte.[dates] = lt.LogDate
group by [dates], lt.Activity
OPTION (maxrecursion 0)
Sample dataset (LogTable):
LogDate, Activity
2020-02-25 01:10:10.000, Activity01
2020-04-14 01:12:10.000, Activity02
2020-08-18 02:03:53.000, Activity02
2019-10-29 12:25:55.000, Activity01
2019-12-24 18:11:11.000, Activity03
2019-04-02 03:33:09.000, Activity01
Expected Output (the output does not reflect the data shown above for I would need too many lines in the sample set to be shown in this post)
As I said above, the bottom line is: I need to aggregate data by number of occurrences in time periods of 30, 60 and 90 days.
Activity, Activity90days, Activity60days, Activity30days
Activity01, 3, 0, 1
Activity02, 1, 10, 2
Activity03, 5, 1, 3
Thank you for any suggestion.
SQL Server doesn't yet have the option to range over values of the window frame of an analytic function. Since you've generated all possible dates though and you've already got the counts by date, it's very easy to look back a specific number of (aggregated) rows to get the right totals. Here is my suggested expression for 90 days:
sum(count(LogDate)) over (
partition by Activity order by [dates]
rows between 89 preceding and current row
)
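On the question of subtracting days directly: [dates] - 90 works here because the variables are datetime, but as far as I know the + and - operators are not accepted on the date and datetime2 types, so DATEADD is the safer habit. If only the windows ending at one reference date are needed (which is the shape of the expected output in the question), a simpler sketch is plain conditional aggregation without the recursive CTE. The variable @asof is a name introduced here for illustration.

```sql
-- Sketch: counts per activity in the 30/60/90 days up to a single reference date.
DECLARE @asof date = CAST(GETDATE() AS date);   -- hypothetical reference date

SELECT
    Activity,
    SUM(CASE WHEN LogDate > DATEADD(day, -90, @asof) THEN 1 ELSE 0 END) AS Activity90days,
    SUM(CASE WHEN LogDate > DATEADD(day, -60, @asof) THEN 1 ELSE 0 END) AS Activity60days,
    SUM(CASE WHEN LogDate > DATEADD(day, -30, @asof) THEN 1 ELSE 0 END) AS Activity30days
FROM LogTable
WHERE LogDate < DATEADD(day, 1, @asof)          -- include all of the reference day
GROUP BY Activity;
```

This scans LogTable once. It does not reproduce one row per 30-day step the way the original query does.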

Calculate the number of clients (new and returning) for each year

Hi everyone,
I am trying to form a query that shows the number of clients for a specific year. The clients table contains a field, client_since, all the clients info, active_client, date_deleted (that's used as a flag when clients unsubscribe to communications).
Every record's client_since shows the year they became clients (char(4)).
I get the record of the new clients when I query by year, however, I am trying to form the query to show me the number of clients (new and returning). For argument's sake, all clients are returning.
Say that I had 1 client in 2008, 6 clients signed up in 2009, another 6 clients signed up in 2010, another 10 clients signed up in 2011, and so on. I need the query to sum all the clients by year.
I got as far as:
select count (id) as [New Clients],client_since
from tax_clients
where client_since >= 2008
group by client_since
and the result is:
New Clients  client_since
1 2009
8 2010
6 2011
6 2012
11 2013
6 2014
9 2015
17 2016
20 2017
13 2018
26 2019
41 2020
7 2021
So, the calculation would be adding up all the new clients, year over year.
Can anyone give me some direction as to how to structure the query?
Thanks
"I am trying to form the query to show me the number of clients (new and returning). For argument's sake, all clients are returning."
The "returning clients" logic looks like a window sum:
select
client_since,
count(*) as new_clients,
sum(count(*)) over(order by client_since) returning_clients
from tax_clients
where client_since >= 2008
group by client_since
order by client_since
Maybe very old versions of SQL Server would prefer a subquery rather than mixing aggregation and window functions in the same scope:
select t.*,
sum(new_clients) over(order by client_since) returning_clients
from (
select client_since, count(*) as new_clients
from tax_clients
where client_since >= 2008
group by client_since
) t
order by client_since
Assuming you have one record per year when the client is active, then you can use lag():
select client_since, count(*) as num_active_clients,
sum(case when prev_cs = client_since - 1 then 1 else 0 end) as num_returning_clients
from (select t.*,
lag(client_since) over (partition by id order by client_since) as prev_cs
from tax_clients t
) t
group by client_since