SQL Rolling Count in Time Series

I have a SQLite table of companies and how many news articles were written about each company each day for 10 years (about 3,000 companies). I want to do a "rolling" count where, for each company, I count the total number of news articles in a 3-day window, conditional on there being a positive number of articles. For example, starting on day 1: if the number of articles is 0, skip to day 2, and so on, until we hit a day (say day 4) with at least 1 article; then count the total number of articles over the next 3 days (days 4, 5, 6). After that, I go to day 7 and keep scanning until I find the next day with an article, repeat the 3-day sum, and continue scanning from there. I repeat this for each company.
I've thought of doing a rolling sum with window functions, but with 3,000 companies times 365×10 days of data, the rolling sum may take too long to compute. Besides, I wouldn't need the sums on days that are skipped over (days with 0 articles, or days that are not the first day of a 3-day interval).
For example, the time series for each company may be (Day #:Number of Articles)
Day 1:0
Day 2:0
Day 3:0
Day 4:1
Day 5:3
Day 6:2
Day 7:0
Day 8:0
Day 9:20
Day 10:2
Day 11:0
Then the output would be
Day 4:6 (1 from Day 4, 3 from Day 5, and 2 from Day 6)
Day 9:22 (20 from Day 9, 2 from Day 10, and 0 from Day 11).

In more recent versions of SQLite (3.25 added window functions), you would use row_number():
select company, min(date), max(date), sum(num_articles)
from (select t.*,
             row_number() over (partition by company order by date) as seqnum
      from t
      where num_articles > 0
     ) t
group by company, (seqnum - 1) / 3;
Since seqnum is an integer, (seqnum - 1) / 3 does the truncating division directly; floor() is only available in SQLite builds that include the math functions.
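Note that grouping the positive-article rows three at a time is not quite the same as the 3-calendar-day window described in the question: a window may contain zero-article days (Day 11 in the example), while the groups above always contain three positive days. A recursive CTE can follow the scan-and-jump logic literally, and it only touches the days it actually visits. A minimal sketch, assuming a table t(company, day, num_articles) where day is a consecutive integer index (with real dates, julianday() arithmetic would replace day + 3):
WITH RECURSIVE walk(company, day) AS (
  -- anchor: each company's first day with a positive article count
  SELECT company, MIN(day)
  FROM t
  WHERE num_articles > 0
  GROUP BY company
  UNION ALL
  -- jump past the 3-day window, then scan to the next positive day
  SELECT w.company,
         (SELECT MIN(t2.day)
          FROM t t2
          WHERE t2.company = w.company
            AND t2.day >= w.day + 3
            AND t2.num_articles > 0)
  FROM walk w
  WHERE EXISTS (SELECT 1
                FROM t t2
                WHERE t2.company = w.company
                  AND t2.day >= w.day + 3
                  AND t2.num_articles > 0)
)
SELECT w.company,
       w.day AS window_start,
       (SELECT SUM(t3.num_articles)
        FROM t t3
        WHERE t3.company = w.company
          AND t3.day BETWEEN w.day AND w.day + 2) AS window_total
FROM walk w
ORDER BY w.company, w.day;
Each recursive step jumps straight from one window start to the next, so skipped days are never aggregated, which addresses the performance concern.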

Related

How to count input throughout the day averaged by weekdays

I want to get the average number of tasks solved by hour throughout the day, with the average for each hour taken across the days of the week.
number of tasks
7AM (avg of all 7-8AM on Monday to Friday)
8AM
9AM
etc
select count(*),
       date_trunc('hour', tasks) as "dateAxis"
FROM tasks
group by "dateAxis"
date_trunc('hour', ...) only "cuts off" the minutes and seconds, so to speak. You are normalizing your datetime value to the hour, but the date is kept: for example, 2010-01-02 12:23:42 is converted to 2010-01-02 12:00:00. Because the date is kept, you cannot group across different dates, only within the related hour of each date. Instead you can use extract(hour from ...), which extracts the hour of your datetime as a separate value that can be used for grouping across different days.
The following query groups all given dates by their hour:
select count(*),
       extract(hour from tasks) as "dateAxis"
FROM tasks
group by "dateAxis"
If you only want to consider all days of a certain week, you'll need to include the week number in your grouping as well:
SELECT
count(*),
extract('week' from mydate) as week,
extract('hour' from mydate) as hour
FROM mytable
GROUP BY week, hour
Of course, if you are dealing with data sets which include different years, you should add the year component into your group as well.
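For example, a sketch extending the query above with the year component:
SELECT
count(*),
extract('year' from mydate) as year,
extract('week' from mydate) as week,
extract('hour' from mydate) as hour
FROM mytable
GROUP BY year, week, hour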

Write a SQL Query in Google Big Query which pulls all values from last week, all values from 2 weeks ago, and calculate percent change between them

I'm trying to query a table comparing order numbers from last week (Sunday to Saturday) vs. 2 weeks ago, and calculate the percent change between the two. My thought process so far has been to group my date column by week, then use a lag function to pull last week and the previous week into the same row, and from there use basic arithmetic to calculate the percent change. In practice, I haven't been able to get a working query, but I picture the table looking as follows:
Week        Orders  Orders - Previous Week  % Change
2023-02-05  5       10                      -0.5
2023-01-29  10      2                       +4.0
2023-01-22  2
Important to note that the days in last week should not change regardless of what day it is today (i.e. not use today -7 days to calculate last week, and today -14 days to calculate 2 weeks ago).
My query so far:
SELECT
min(date) as date,
orders,
coalesce(lag(order) over (order by (date), 0)) as Orders - Previous Week
FROM `table`
WHERE date BETWEEN '2023-01-01' AND current_date()
group by date_trunc(date, WEEK)
ORDER BY date desc
I realize I'm not using coalesce and my lag function correctly, but a bit lost on how to correct it
To calculate the percent change, you can use the following query:
SELECT
min(date) as Week,
sum(orders) as Orders,
coalesce(lag(sum(orders)) over (order by date_trunc(date, WEEK)), 0) as Orders_Previous_Week,
safe_divide(sum(orders) - lag(sum(orders)) over (order by date_trunc(date, WEEK)),
            lag(sum(orders)) over (order by date_trunc(date, WEEK))) as Pct_Change
FROM `table`
WHERE date BETWEEN '2023-01-01' AND current_date()
group by date_trunc(date, WEEK)
ORDER BY Week desc
In this query, sum() aggregates the orders by week, and lag(sum(orders)) pulls the previous week's total onto the current row. coalesce() handles the case where there is no previous week, defaulting to 0, and safe_divide() returns NULL instead of failing when the previous week's total is 0. The percent change calculation uses the same formula you described.
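To honor the requirement that "last week" not move with today's date, the WHERE clause can be pinned to the two most recent complete Sunday-to-Saturday weeks instead of a fixed start date. A sketch, assuming date is a DATE column:
WHERE date >= DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK(SUNDAY)), INTERVAL 2 WEEK)
  AND date < DATE_TRUNC(CURRENT_DATE(), WEEK(SUNDAY))
This selects exactly the two full weeks that precede the current (partial) week.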

Count distinct customers, active within a year, for every week of the year

I am working with an existing E-commerce database. Actually, this process is usually done in Excel, but we want to try it directly with a query in PostgreSQL (version 10.6).
We define as an active customer a person who has bought at least once within 1 year. This means, if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
I want the output for each week of the year (2020). Basically what I need is ...
select
email,
orderdate,
id
from
orders_table
where
paid = true;
|------------------|---------------------|------------|
| email            | orderdate           | id         |
|------------------|---------------------|------------|
| email1#email.com | 2020-06-02 05:04:32 | Order-2736 |
|------------------|---------------------|------------|
I can't create new tables. And I would like to see the output like this:
Year| Week | Active customers
2020| 25 | 6978
2020| 24 | 3948
Depending on whether there is a year and week column, you can use OVER (PARTITION BY ...) with extract:
SELECT
extract(year from orderdate) as year,
extract(week from orderdate) as week,
count(*) OVER (PARTITION BY extract(year from orderdate),
               extract(week from orderdate)) as order_count_in_week
FROM orders_table
WHERE paid = true;
This should bucket all orders by year and week, showing the total order count per week in a year where paid is true.
references:
https://www.postgresql.org/docs/9.1/tutorial-window.html
https://www.postgresql.org/docs/8.1/functions-datetime.html
if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
Problems on your side
This method has some corner case ambiguities / issues:
Do you include or exclude "week 22 in 2020"? (I exclude it below to stay closer to "a year".)
A year can have 52 or 53 full weeks. Depending on the current date, the calculation is based on 52 or 53 weeks, causing a possible bias of almost 2 %!
If you start the time range on "the same date last year", then the margin of error is only 1 / 365 or ~ 0.3 %, due to leap years.
A fixed "period of 365 days" (or 366) would eliminate the bias altogether.
Problems on the SQL side
Unfortunately, window functions do not currently allow the DISTINCT keyword (for good reasons). So something of the form:
SELECT count(DISTINCT email) OVER (ORDER BY year, week
GROUPS BETWEEN 52 PRECEDING AND 1 PRECEDING)
FROM ...
.. triggers:
ERROR: DISTINCT is not implemented for window functions
The GROUPS keyword has only been added in Postgres 11 and would otherwise be just what we need.
What's more, your odd frame definition wouldn't even work exactly, since the number of weeks to consider is not always 52, as discussed above.
So we have to roll our own.
Solution
The following simply generates all weeks of interest, and computes the distinct count of customers for each. Simple, except that date math is never entirely simple. But, depending on details of your setup, there may be faster solutions. (I had several other ideas.)
The time range for which to report may change. Here is an auxiliary function to generate weeks of a given year:
CREATE OR REPLACE FUNCTION f_weeks_of_year(_year int)
RETURNS TABLE(year int, week int, week_start timestamp)
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
ROWS 52 COST 10 AS
$func$
SELECT _year, d.week::int, d.week_start
FROM generate_series(date_trunc('week', make_date(_year, 01, 04)::timestamp) -- first day of first week
, LEAST(date_trunc('week', localtimestamp), make_date(_year, 12, 28)::timestamp) -- latest possible start of week
, interval '1 week') WITH ORDINALITY d(week_start, week)
$func$;
Call:
SELECT * FROM f_weeks_of_year(2020);
It returns 1 row per week, but stops at the current week for the current year. (Empty set for future years.)
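For 2020, the first rows look like this (ISO week 1 of 2020 contains January 04 and starts on Monday, 2019-12-30):
year | week |     week_start
-----+------+---------------------
2020 |    1 | 2019-12-30 00:00:00
2020 |    2 | 2020-01-06 00:00:00
2020 |    3 | 2020-01-13 00:00:00
...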
The calculation is based on these facts:
The first ISO week of the year always contains January 04.
The last ISO week cannot start after December 28.
Actual week numbers are computed on the fly using WITH ORDINALITY. See:
PostgreSQL unnest() with element number
Aside, I stick to timestamp and avoid timestamptz for this purpose. See:
Generating time series between two dates in PostgreSQL
The function also returns the timestamp of the start of the week (week_start), which we don't need for the problem at hand. But I left it in to make the function more useful in general.
Makes the main query simpler:
WITH weekly_customer AS (
SELECT DISTINCT
EXTRACT(YEAR FROM orderdate)::int AS year
, EXTRACT(WEEK FROM orderdate)::int AS week
, email
FROM orders_table
WHERE paid
AND orderdate >= date_trunc('week', timestamp '2019-01-04') -- max range for 2020!
ORDER BY 1, 2, 3 -- optional, might improve performance
)
SELECT d.year, d.week
, (SELECT count(DISTINCT email)
FROM weekly_customer w
WHERE (w.year, w.week) >= (d.year - 1, d.week) -- row values, see below
AND (w.year, w.week) < (d.year , d.week) -- exclude current week
) AS active_customers
FROM f_weeks_of_year(2020) d; -- (year int, week int, week_start timestamp)
The CTE weekly_customer folds to unique customers per calendar week once, as duplicate entries are just noise for our calculation. It's used many times in the main query. The cut-off condition is based on Jan 04 once more. Adjust to your actual reporting period.
The actual count is done with a lowly correlated subquery. Could be a LEFT JOIN LATERAL ... ON true instead. See:
What is the difference between LATERAL and a subquery in PostgreSQL?
Using row value comparison to make the range definition simple. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
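Spelled out, the LEFT JOIN LATERAL variant mentioned above might look like this (a sketch; weekly_customer is the same CTE as in the main query):
WITH weekly_customer AS (
   SELECT DISTINCT
          EXTRACT(YEAR FROM orderdate)::int AS year
        , EXTRACT(WEEK FROM orderdate)::int AS week
        , email
   FROM   orders_table
   WHERE  paid
   AND    orderdate >= date_trunc('week', timestamp '2019-01-04')
   )
SELECT d.year, d.week, a.active_customers
FROM   f_weeks_of_year(2020) d
LEFT   JOIN LATERAL (
   SELECT count(DISTINCT w.email) AS active_customers
   FROM   weekly_customer w
   WHERE  (w.year, w.week) >= (d.year - 1, d.week)
   AND    (w.year, w.week) <  (d.year    , d.week)
   ) a ON true;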

SQLite - Determine average sales made for each day of week

I am trying to produce a query in SQLite where I can determine the average sales made each weekday in the year.
As an example, I'd say like to say
"The average sales for Monday are $400.50 in 2017"
I have a sales table - each row represents a sale you made. You can have multiple sales for the same day. Columns that would be of interest here:
Id, SalesTotal, DayCreated, MonthCreated, YearCreated, CreationDate, PeriodOfTheDay
Day/Month/Year are integers that represent the day/month/year of the sale. CreationDate is a unix timestamp that represents the date/time it was created (and is obviously consistent with the day/month/year columns).
PeriodOfTheDay is 0 or 1 (day or night). You can have multiple records for a given day (typically at most 2, but some people like to add each of their sales individually, so you could have 5 or more for a day).
Where I am stuck
Because you can have two records on the same day (a day sale and a night sale, or multiple of each), I can't just group by day of the week (i.e. group all records by Saturday).
This is because the number of sales made does not equal the number of days worked: I could have worked 10 Saturdays but had 30 sales, so grouping by 'Saturday' would produce 30 sales, since 30 records exist for Saturday (some just happen to share the same day).
Furthermore, if I group by DayCreated, MonthCreated, YearCreated, it works in the sense that it produces x rows (where x is the number of days worked), but that means I have to return this result set to the back end and do a row count. I'd rather do it all in the query, so I can take the sales total and divide it by the number of days worked.
Would anyone be able to assist?
Thanks!
UPDATE
I think I got it - I would love someone to tell me if I'm right:
SELECT COUNT(DISTINCT CAST(( julianday((datetime(CreationDate / 1000, 'unixepoch', 'localtime'))) ) / 7 AS INT))
FROM Sales
WHERE strftime('%w', datetime(CreationDate / 1000, 'unixepoch'), 'localtime') = '6'
AND YearCreated = 2017
This would produce the number for saturday, and then I'd just put this in as an inner query, dividing the sale total by this number of days.
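That should work: each fixed 7-day julianday bucket contains exactly one Saturday, so counting distinct buckets over Saturday rows counts distinct Saturdays worked. An equivalent, slightly simpler denominator is the number of distinct calendar dates, which also lets the whole average fold into a single grouped query. A sketch, assuming the Sales columns described above:
SELECT strftime('%w', datetime(CreationDate / 1000, 'unixepoch', 'localtime')) AS weekday, -- 0=Sunday .. 6=Saturday
       SUM(SalesTotal) * 1.0
         / COUNT(DISTINCT date(CreationDate / 1000, 'unixepoch', 'localtime')) AS avg_sales_per_day_worked
FROM Sales
WHERE YearCreated = 2017
GROUP BY weekday;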
Buddy,
You can group your query by the day of week and the week number of the creation date.
In MSSQL
DATEPART(WEEK,'2017-08-14') // Will give you week 33
DATEPART(WEEKDAY,'2017-08-14') // Will give you day 2
In MYSQL
WEEK('2017-08-14') // Will give you week 33
DAYOFWEEK('2017-08-14') // Will give you day 2
See these figures:
Day of Week
1-Sunday, 2-Monday, 3-Tuesday, 4-Wednesday, 5-Thursday, 6-Friday, 7-Saturday
Week Number
1 - 53 (weeks in a year)
This will be the key, so that every Saturday in the year is counted separately, as in the sketch below.
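Putting those pieces together, the whole average could be computed in one query (a sketch in MSSQL syntax, with SaleDate as a hypothetical datetime column standing in for your creation date):
SELECT DATEPART(WEEKDAY, SaleDate) AS weekday,
       SUM(SalesTotal) * 1.0
         / COUNT(DISTINCT DATEPART(WEEK, SaleDate)) AS avg_sales_per_weekday
FROM Sales
WHERE YEAR(SaleDate) = 2017
GROUP BY DATEPART(WEEKDAY, SaleDate);
Within each weekday group, COUNT(DISTINCT week number) counts only the weeks in which that weekday actually had a sale, which is exactly the number of days worked.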
Hope this can help in building your query.

Bigquery SQL for sliding window aggregate

Hi, I have a table that looks like this:
Date Customer Pageviews
2014/03/01 abc 5
2014/03/02 xyz 8
2014/03/03 abc 6
I want to get pageview aggregates grouped by week, but showing aggregates for the past 30 days (sliding-window aggregates with a window size of 30 days, for every week).
I am using google bigquery
EDIT: Gordon, re your comment about "Customer": what I actually need is slightly more complicated, and that's why I included customer in the table above. I am looking to get the number of customers who had >n pageviews in a 30-day window, every week. Something like this:
Date Customers>10 pageviews in 30day window
2014/02/01 10
2014/02/08 5
2014/02/15 6
2014/02/22 15
However, to keep it simple, I can work my way there if I can just get a sliding-window aggregate of pageviews, ignoring customers altogether. Something like this:
Date count of pageviews in 30day window
2014/02/01 50
2014/02/08 55
2014/02/15 65
2014/02/22 75
How about this:
SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
SELECT changes,
LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
login,
week
FROM (
SELECT SUM(payload_pull_request_changed_files) changes,
UTC_USEC_TO_WEEK(created_at, 1) week,
actor_attributes_login login,
FROM [publicdata:samples.github_timeline]
WHERE payload_pull_request_changed_files > 0
GROUP BY week, login
))
HAVING changes28days > 0
For each user it counts how many changes they submitted per week. Then with LAG() we can peek at the preceding rows: how many changes they submitted 1, 2, and 3 weeks earlier. Adding those 4 weekly values gives the number of changes submitted over the last 28 days.
Now you can wrap everything in a new query to filter users with changes>X, and count them.
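The query above is written in BigQuery's legacy SQL. In standard SQL, a RANGE frame expresses the sliding window directly, without chaining LAG() calls. A minimal sketch, assuming the weekly per-login totals have already been computed into weekly_changes(login, week DATE, changes):
SELECT login,
       week,
       -- total changes in the 28 days ending with this week's bucket
       SUM(changes) OVER (
         PARTITION BY login
         ORDER BY UNIX_DATE(week)
         RANGE BETWEEN 27 PRECEDING AND CURRENT ROW
       ) AS changes28days
FROM weekly_changes;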
I have created the following "Times" table:
Table Details: Dim_Periods
Schema
Date TIMESTAMP
Year INTEGER
Month INTEGER
day INTEGER
QUARTER INTEGER
DAYOFWEEK INTEGER
MonthStart TIMESTAMP
MonthEnd TIMESTAMP
WeekStart TIMESTAMP
WeekEnd TIMESTAMP
Back30Days TIMESTAMP -- the date 30 days before "Date"
Back7Days TIMESTAMP -- the date 7 days before "Date"
and I use a query like this to handle "running sums":
SELECT Date, Count(*) as MovingCNT
FROM
(SELECT Date,
Back7Days
FROM DWH.Dim_Periods
where Date < timestamp(current_date()) AND
Date >= (DATE_ADD (CURRENT_TIMESTAMP(), -5, 'month'))
)P
CROSS JOIN EACH
(SELECT repository_url,repository_created_at
FROM publicdata:samples.github_timeline
) L
WHERE timestamp(repository_created_at)>= Back7Days
AND timestamp(repository_created_at)<= Date
GROUP EACH BY Date
Note that it can be used for "Month to Date", "Week to Date", "30 days back", etc. aggregations as well.
However, performance is not the best and the query can take a while on larger data sets due to the Cartesian join.
Hope this helps