PostgreSQL: how do I build a table with an ever-increasing daily total? - sql

Assuming this data:
ID DATE
44 2019-12-31
45 2020-01-01
46 2020-01-01
47 2020-01-02
48 2020-01-03
48 2020-01-03
48 2020-01-03
How do I make a query that returns something like this?
TOTAL DATE
2 2020-01-01
3 2020-01-02
6 2020-01-03
I want all entries after a certain date, but with a counter that starts with the number of entries on the first day and then, for every following day, adds the number of entries on that day. I want to plot them on a chart that shows the speed of the growth.
Is this possible? I'm on PostgreSQL 10.
Thanks!

You can use aggregation and window functions:
select date,
count(*) as count_on_date,
sum(count(*)) over (order by min(date)) as running_count
from t
where date >= '2020-01-01'
group by date
order by date;
If you wanted a count going back in time, then you would use:
select greatest(date, '2020-01-01'::date) as date,
count(*) as count_on_date,
sum(count(*)) over (order by min(date)) as running_count
from t
group by greatest(date, '2020-01-01'::date)
order by greatest(date, '2020-01-01'::date);
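The first query above can be tried out on the question's sample rows. This is a minimal sketch, assuming an in-memory SQLite database (through Python's sqlite3 module, SQLite 3.25 or newer for window functions) as a stand-in for PostgreSQL. The table name t and the columns id and date follow the answer; everything else is illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table "t" (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO t (id, date) VALUES (?, ?)",
    [
        (44, "2019-12-31"),
        (45, "2020-01-01"),
        (46, "2020-01-01"),
        (47, "2020-01-02"),
        (48, "2020-01-03"),
        (48, "2020-01-03"),
        (48, "2020-01-03"),
    ],
)

# Aggregate per day first, then sum the daily counts with a window function.
running_rows = conn.execute(
    """
    SELECT date,
           COUNT(*) AS count_on_date,
           SUM(COUNT(*)) OVER (ORDER BY date) AS running_count
    FROM t
    WHERE date >= '2020-01-01'
    GROUP BY date
    ORDER BY date
    """
).fetchall()

for row in running_rows:
    print(row)  # (date, count_on_date, running_count)
```

The last column reproduces the TOTAL values asked for in the question (2, 3, 6). The 2019-12-31 row is excluded by the WHERE clause, so it does not contribute to the running count.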

Related

Count distinct customers who bought in previous period and not in next period Bigquery

I have a dataset in bigquery which contains order_date: DATE and customer_id.
order_date | CustomerID
2019-01-01 | 111
2019-02-01 | 112
2020-01-01 | 111
2020-02-01 | 113
2021-01-01 | 115
2021-02-01 | 119
I am trying to count distinct customer_id values in the 12 months leading up to each month (for example, from 2019-01-01 to 2020-01-01, then from 2019-02-01 to 2020-02-01), and then count the customers who did not buy in the same period of the following year (2020-01-01 to 2021-01-01, then 2020-02-01 to 2021-02-01).
The output I expect:
order_date| count distinct CustomerID|who not buy in the next period
2020-01-01| 5191 |250
2020-02-01| 4859 |500
2020-03-01| 3567 |349
..........| .... |......
The next periods should not include the previous ones.
I tried the code below, but it does not work the way I want:
with customers as (
select distinct date_trunc(date(order_date),month) as dates,
CUSTOMER_WID
from t
where date(order_date) between '2018-01-01' and current_date()-1
)
select
dates,
customers_previous,
customers_next_period
from
(
select dates,
count(CUSTOMER_WID) as customers_previous,
count(case when customer_wid_next is null then 1 end) as customers_next_period,
from (
select prev.dates,
prev.CUSTOMER_WID,
next.dates as next_dates,
next.CUSTOMER_WID as customer_wid_next
from customers as prev
left join customers
as next on next.dates=date_add(prev.dates,interval 1 year)
and prev.CUSTOMER_WID=next.CUSTOMER_WID
) as t2
group by dates
)
order by 1,2
Thanks in advance.
If I understand correctly, you are trying to count values over a window of time, and for that I recommend using window functions.
That said, my recommendation would be:
SELECT DISTINCT
periods,
COUNT(DISTINCT customer_id) OVER last_12_mos AS count_customers_last_12_mos
FROM (
SELECT
order_date,
FORMAT_DATE('%Y%m', order_date) AS periods,
customer_id
FROM dataset
)
WINDOW last_12_mos AS ( # window of last 12 months without current month
PARTITION BY periods ORDER BY periods DESC
ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
)
I believe from this you can build some customizations to improve the aggregations you want.
You can generate the periods using unnest(generate_date_array()). Then use joins to bring in the customers from the previous 12 months and the next 12 months. Finally, aggregate and count the customers:
select period,
count(distinct c_prev.customer_wid),
count(distinct c_next.customer_wid)
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period join
customers c_prev
on c_prev.order_date <= period and
c_prev.order_date > date_add(period, interval -12 month) left join
customers c_next
on c_next.customer_wid = c_prev.customer_wid and
c_next.order_date > period and
c_next.order_date <= date_add(period, interval 12 month)
group by period;
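The join logic of the last query can be traced by hand in a few lines of Python. This is only a sketch of the idea, not BigQuery code: it assumes the six sample orders from the question, treats every period as a first-of-month date, and uses two hypothetical helpers (add_months and customers_between) that are not part of the original answer. It also adds a "did not return" count, which the question asks for.

```python
from datetime import date

# Sample orders from the question: (order_date, customer_id).
orders = [
    (date(2019, 1, 1), 111),
    (date(2019, 2, 1), 112),
    (date(2020, 1, 1), 111),
    (date(2020, 2, 1), 113),
    (date(2021, 1, 1), 115),
    (date(2021, 2, 1), 119),
]


def add_months(day, months):
    """Shift a first-of-month date by a number of months (hypothetical helper)."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def customers_between(start, end):
    """Customers with at least one order in the range (start, end]."""
    return {customer for day, customer in orders if start < day <= end}


retention = []
period = date(2020, 1, 1)
while period <= date(2021, 1, 1):
    # Customers in the 12 months up to and including the period.
    previous = customers_between(add_months(period, -12), period)
    # Of those, the ones who ordered again in the following 12 months.
    returned = previous & customers_between(period, add_months(period, 12))
    retention.append((period, len(previous), len(returned), len(previous - returned)))
    period = add_months(period, 1)

for row in retention:
    print(row)  # (period, previous customers, returned, did not return)
```

With the sample data, no customer from a previous window orders again in the following one, so the "returned" count is 0 for every period.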

Is there a method to write a SQL query that returns cumulative results based on the count of another column?

I have a query where I am counting the total number of new users signed up to a particular service each day since the service started.
So far I have:
SELECT DISTINCT CONVERT(DATE, Account_Created) AS Date_Created,
COUNT(ID) OVER (PARTITION BY CONVERT(DATE, Account_Created)) AS New_Users
FROM My_Table.dbo.NewAccts
ORDER BY Date_Created
This returns:
Date_Created | New_Users
--------------------------
2020-01-01 1
2020-01-03 3
2020-01-04 2
2020-01-06 5
2020-01-07 9
What I would like is to return a third column with a cumulative total for each day starting from the beginning until the present. So the first day there was only one new user. On January 3rd, three new users signed up for a total of four since the beginning--so on and so forth.
Date_Created | New_Users | Cumulative_Tot
------------------------------------------
2020-01-01 1 1
2020-01-03 3 4
2020-01-04 2 6
2020-01-06 5 11
2020-01-07 9 20
My thought process was to involve the ROW_NUMBER() function so that I can separate and add each consecutive row together, though I am not sure if that is correct. My feeling is that I am probably thinking about this too hard and the logic is simply just escaping me at the moment. Thank you for any help.
As a starter: I would recommend aggregation rather than DISTINCT and a window count. This makes the intent clearer, and is likely more efficient.
Then, you can make use of a window sum to compute the cumulative count.
SELECT
CONVERT(DATE, Account_Created) AS Date_Created,
COUNT(*) AS New_Users,
SUM(COUNT(*)) OVER(ORDER BY CONVERT(DATE, Account_Created)) Cumulative_New_Users
FROM My_Table.dbo.NewAccts
GROUP BY CONVERT(DATE, Account_Created)
ORDER BY Date_Created

SQL Sum and self join?

I have data organised as below in table T_FORECAST
Location Sublocation Delivery_forecast Delivery_Date Forecast_date
----------------------------------------------------------------------------------
1 1 100 2020-01-01 2019-01-01
1 2 50 2020-01-01 2019-05-01
1 1 90 2020-01-01 2019-06-01
1 2 70 2020-01-01 2019-10-01
. . .
I am trying to write a query that would output the sum of Delivery_forecast per location, Delivery_date, and Forecast_date.
For the example above, I would expect:
Location Delivery_forecast Delivery_Date Forecast_date
----------------------------------------------------------------------------------
1 100 2020-01-01 2019-01-01
1 150 2020-01-01 2019-05-01
1 140 2020-01-01 2019-06-01
1 160 2020-01-01 2019-10-01
I could find the list of rows I need using the query below, but I cannot find a way to get the right sum. I believe I have to do a self join:
SELECT DISTINCT f.Location, f.Delivery_Date, f.Forecast_date
FROM T_FORECAST f
Use a cumulative sum:
select f.*,
sum(delivery_forecast) over (partition by location, delivery_date order by forecast_date) as running_delivery_forecast
from T_FORECAST f;
First you want an aggregation to get the sums per location and dates (SUM(delivery_forecast) / GROUP BY LOCATION, DELIVERY_DATE, FORECAST_DATE). Then you want to show a running total (SUM OVER). Per location, I guess. The two combined:
select
location,
delivery_date,
forecast_date,
sum(delivery_forecast) as forecast_for_day,
sum(sum(delivery_forecast)) over (partition by location
order by delivery_date, forecast_date
) as forecast_cumulated
from t_forecast
group by location, delivery_date, forecast_date
order by location, delivery_date, forecast_date;
The key aspect of your question is that you always want to add the values from two consecutive rows.
To do this you can use a "window frame" according to your specific ordering. See SQL Server - Over clause.
For example:
select
location,
sum(sum(delivery_forecast)) over(
partition by location, delivery_date
order by forecast_date
rows between 1 preceding and current row -- the magic is here!
) as delivery_forecast,
delivery_date,
forecast_date
from t_forecast
group by location, delivery_date, forecast_date
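To see what a two-row window frame does with the sample rows, here is a minimal sketch. It assumes an in-memory SQLite database (Python's sqlite3 module, SQLite 3.25 or newer) as a stand-in for SQL Server, and a result alias two_row_sum that is not in the original. Because the query groups by location, delivery_date and forecast_date, the window sum is taken over the grouped sum, written as SUM(SUM(...)).

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_forecast ("
    "location INTEGER, sublocation INTEGER, delivery_forecast INTEGER, "
    "delivery_date TEXT, forecast_date TEXT)"
)
conn.executemany(
    "INSERT INTO t_forecast VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 100, "2020-01-01", "2019-01-01"),
        (1, 2, 50, "2020-01-01", "2019-05-01"),
        (1, 1, 90, "2020-01-01", "2019-06-01"),
        (1, 2, 70, "2020-01-01", "2019-10-01"),
    ],
)

# The frame limits each sum to the previous row plus the current row.
frame_rows = conn.execute(
    """
    SELECT location,
           SUM(SUM(delivery_forecast)) OVER (
               PARTITION BY location, delivery_date
               ORDER BY forecast_date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS two_row_sum,
           delivery_date,
           forecast_date
    FROM t_forecast
    GROUP BY location, delivery_date, forecast_date
    ORDER BY location, delivery_date, forecast_date
    """
).fetchall()

for row in frame_rows:
    print(row)
```

On these four rows the sums come out as 100, 150, 140 and 160, matching the expected output in the question. That agreement relies on the two sublocations alternating in the sample; with a different reporting pattern the previous row might belong to the same sublocation.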

Sum of item count in an SQL query based on DATE_TRUNC

I've got a table which contains event status data, similar to this:
ID Time Status
------ -------------------------- ------
357920 2019-12-25 09:31:38.854764 1
362247 2020-01-02 09:31:42.498483 1
362248 2020-01-02 09:31:46.166916 1
362249 2020-01-02 09:31:47.430933 1
362300 2020-01-03 09:31:46.932333 1
362301 2020-01-03 09:31:47.231288 1
I'd like to construct a query which returns the number of successful events each day, so:
Time Count
-------------------------- -----
2019-12-25 00:00:00.000000 1
2020-01-02 00:00:00.000000 3
2020-01-03 00:00:00.000000 2
I've stumbled across this SO answer to a similar question, but the answer there is for all the data returned by the query, whereas I need the sum grouped by date range.
Also, I cannot use BETWEEN to select a specific date range, since this query is for a Grafana dashboard, and the date range is determined by the dashboard's UI. I'm using Postgres for the SQL dialect, in case that matters.
You need to remove the time component from the timestamp. In most databases, you can do this by converting to a date:
select cast(time as date) as dte,
sum(case when status = 1 then 1 else 0 end) as num_successful
from t
group by cast(time as date)
order by dte;
This assumes that 1 means "successful".
The cast() does not work in all databases. Other alternatives are things like trunc(time), date_trunc('day', time), date_trunc(time, day) -- and no doubt many others.
In Postgres, I would phrase this as:
select date_trunc('day', time) as dte,
count(*) filter (where status = 1) as num_successful
from t
group by dte
order by dte;
How about like this:
SELECT date(Time), sum(status)
FROM table
GROUP BY date(Time)
ORDER BY min(Time)
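The truncate-then-count approach can be checked against the sample events. This is a minimal sketch, assuming an in-memory SQLite database (Python's sqlite3 module) in place of Postgres, a hypothetical table name events, and SQLite's date() function as the truncation step. One extra row with status 0 is added purely for illustration, to show that failed events are not counted.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, time TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (357920, "2019-12-25 09:31:38.854764", 1),
        (362247, "2020-01-02 09:31:42.498483", 1),
        (362248, "2020-01-02 09:31:46.166916", 1),
        (362249, "2020-01-02 09:31:47.430933", 1),
        (362300, "2020-01-03 09:31:46.932333", 1),
        (362301, "2020-01-03 09:31:47.231288", 1),
        # Hypothetical failed event, not in the question's data.
        (362302, "2020-01-03 09:31:48.000000", 0),
    ],
)

# Strip the time part, then count only rows whose status is 1.
daily_rows = conn.execute(
    """
    SELECT date(time) AS dte,
           SUM(CASE WHEN status = 1 THEN 1 ELSE 0 END) AS num_successful
    FROM events
    GROUP BY date(time)
    ORDER BY dte
    """
).fetchall()

for row in daily_rows:
    print(row)  # (day, number of successful events)
```

The conditional sum is used instead of a plain COUNT(*) so that the result stays correct when rows with other status values are present.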

Get MAX count but keep the repeated calculated value if highest

I have the following table, I am using SQL Server 2008
BayNo FixDateTime FixType
1 04/05/2015 16:15:00 tyre change
1 12/05/2015 00:15:00 oil change
1 12/05/2015 08:15:00 engine tuning
1 04/05/2016 08:11:00 car tuning
2 13/05/2015 19:30:00 puncture
2 14/05/2015 08:00:00 light repair
2 15/05/2015 10:30:00 super op
2 20/05/2015 12:30:00 wiper change
2 12/05/2016 09:30:00 denting
2 12/05/2016 10:30:00 wiper repair
2 12/06/2016 10:30:00 exhaust repair
4 12/05/2016 05:30:00 stereo unlock
4 17/05/2016 15:05:00 door handle repair
On any given day, I need to find the highest number of fixes made on a given bay number, and if that highest number occurs on more than one day, each of those days should also appear in the result set.
So I would like to see the result set as follows:
BayNo FixDateTime noOfFixes
1 12/05/2015 00:15:00 2
2 12/05/2016 09:30:00 2
4 12/05/2016 05:30:00 1
4 17/05/2016 15:05:00 1
I managed to get the counts for each, but I am struggling to get the max while keeping repeated occurrences of the highest value. Can someone help, please?
Use window functions.
Get the count for each day by bayno and also find the min fixdatetime for each day per bayno.
Then use dense_rank to compute the highest ranked row for each bayno based on the number of fixes.
Finally get the highest ranked rows.
select distinct bayno,minfixdatetime,no_of_fixes
from (
select bayno,minfixdatetime,no_of_fixes
,dense_rank() over(partition by bayno order by no_of_fixes desc) rnk
from (
select t.*,
count(*) over(partition by bayno,cast(fixdatetime as date)) no_of_fixes,
min(fixdatetime) over(partition by bayno,cast(fixdatetime as date)) minfixdatetime
from tablename t
) x
) y
where rnk = 1
You are looking for rank() or dense_rank(). I would write the query like this:
select bayno, thedate, numFixes
from (select bayno, cast(fixdatetime as date) as thedate,
count(*) as numFixes,
rank() over (partition by bayno order by count(*) desc) as seqnum
from t
group by bayno, cast(fixdatetime as date)
) b
where seqnum = 1;
Note that this returns the date in question. The date does not have a time component.
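The count-then-rank idea from the answers can be run against the question's rows. This is a minimal sketch, assuming an in-memory SQLite database (Python's sqlite3 module, SQLite 3.25 or newer) instead of SQL Server 2008, a hypothetical table name fixes, and timestamps rewritten from dd/mm/yyyy to ISO format so that SQLite's date() can parse them. The aggregation is placed in a CTE and ranked in a second step.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fixes (bayno INTEGER, fixdatetime TEXT, fixtype TEXT)")
conn.executemany(
    "INSERT INTO fixes VALUES (?, ?, ?)",
    [
        (1, "2015-05-04 16:15:00", "tyre change"),
        (1, "2015-05-12 00:15:00", "oil change"),
        (1, "2015-05-12 08:15:00", "engine tuning"),
        (1, "2016-05-04 08:11:00", "car tuning"),
        (2, "2015-05-13 19:30:00", "puncture"),
        (2, "2015-05-14 08:00:00", "light repair"),
        (2, "2015-05-15 10:30:00", "super op"),
        (2, "2015-05-20 12:30:00", "wiper change"),
        (2, "2016-05-12 09:30:00", "denting"),
        (2, "2016-05-12 10:30:00", "wiper repair"),
        (2, "2016-06-12 10:30:00", "exhaust repair"),
        (4, "2016-05-12 05:30:00", "stereo unlock"),
        (4, "2016-05-17 15:05:00", "door handle repair"),
    ],
)

top_days = conn.execute(
    """
    WITH daily AS (
        -- One row per bay and day, with the first fix time and the fix count.
        SELECT bayno,
               MIN(fixdatetime) AS first_fix,
               COUNT(*) AS no_of_fixes
        FROM fixes
        GROUP BY bayno, date(fixdatetime)
    ),
    ranked AS (
        -- Rank the days of each bay by fix count; ties share the same rank.
        SELECT daily.*,
               DENSE_RANK() OVER (PARTITION BY bayno ORDER BY no_of_fixes DESC) AS rnk
        FROM daily
    )
    SELECT bayno, first_fix, no_of_fixes
    FROM ranked
    WHERE rnk = 1
    ORDER BY bayno, first_fix
    """
).fetchall()

for row in top_days:
    print(row)
```

Bay 4 appears twice because both of its days tie at one fix, which is the behaviour the question asks to keep.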