Postgres aggregate function by date range - sql

I have two tables. The first, page, has the columns id and name. The second, page_counts, has the columns id, page_id (a foreign key to page), views, and date. Basically, I'm tracking how many views some pages get every single day. Views for each day are cumulative, so a day's count is always greater than or equal to the previous day's.
I want to be able to track how many views a page gets per week. This comes down to taking the most recent day's views and subtracting the total number of views from a week before that day. I want to be able to do this over multiple weeks as well: the total number of views for the past week, the total for the week before that, and so on.
I looked into the Postgres date functions, but not much is making sense. Thanks for the help.

It is hard to do without data to test against, but it should be something like this:
select page_id, week,
       views - lag(views, 1, 0) over (partition by page_id order by week) as views
from (
    select page_id, date_trunc('week', "date") as week, max(views) as views
    from page_counts pc
    group by 1, 2
) s
order by 1, 2 desc
The subquery groups by page and week, taking the maximum (i.e. latest cumulative) number of views for each. The lag window function then fetches the previous week's figure for that page, so the difference is the number of views gained during that week.
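As a rough sanity check, here is a self-contained version with inline sample data (the table and column names follow the question; the dates and counts are made up). Note that, because of the 0 default in lag, the earliest week reports the full cumulative total rather than a delta:

-- Hypothetical sample: cumulative views for one page across three weeks.
with page_counts (page_id, "date", views) as (
    values (1, date '2023-01-02', 10),
           (1, date '2023-01-09', 25),
           (1, date '2023-01-16', 40)
)
select page_id, week,
       views - lag(views, 1, 0) over (partition by page_id order by week) as views
from (
    select page_id, date_trunc('week', "date") as week, max(views) as views
    from page_counts
    group by 1, 2
) s
order by 1, 2 desc;
-- Expected weekly views: 15 for the week of Jan 16, 15 for Jan 9, 10 for Jan 2.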

Related

How to get records month wise on rollover calendar

I am using the following query to get record counts by month, and it is working fine:
SELECT MONTH(dte_cycle_count) MONTH, COUNT(*) COUNT
FROM inventory
WHERE YEAR(dte_cycle_count)='2021' --OR (MONTH(dte_cycle_count) = '1' OR MONTH(dte_cycle_count) = '12')
GROUP BY MONTH(dte_cycle_count);
Problem:
Now I need to bind a rollover calendar, so that when the user scrolls or clicks the next or previous button, the adjacent 12 months of records become visible.
E.g. the current month is MARCH, so the default records will be from APR 2020 to MAR 2021. If the user clicks previous, the records will run from MAR 2020 to FEB 2021.
How can I achieve this?
Please let me know if you need more information; I will do my best to provide it.
I think what you are after is a date list to join to your inventory table.
Like a numbers table, build a static calendar table with columns for date, year, and month, populated from the earliest date you need out to as far in the future as required.
You then select from this, applying your filtering range criteria, and join to your inventory table.
For an efficient query, your inventory table should ideally store the matching date portions, e.g. year and month, as their own columns.
You don't want to apply functions to a datetime to extract the year or month: that is not sargable and will not allow any index to be used for a seek lookup.
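A minimal sketch of that shape, assuming SQL Server syntax and hypothetical names (calendar holds one row per month with cal_date as the first of the month; cycle_year and cycle_month are the stored date portions recommended above; @anchor is the first day of the latest month in the window, moved by the next/previous buttons):

-- Rolling 12-month window ending at @anchor, driven by the calendar table.
SELECT c.cal_year, c.cal_month, COUNT(i.dte_cycle_count) AS cnt
FROM calendar c
LEFT JOIN inventory i
  ON i.cycle_year = c.cal_year
 AND i.cycle_month = c.cal_month
WHERE c.cal_date >= DATEADD(MONTH, -11, @anchor)
  AND c.cal_date <  DATEADD(MONTH, 1, @anchor)
GROUP BY c.cal_year, c.cal_month
ORDER BY c.cal_year, c.cal_month;

The LEFT JOIN also means months with no inventory rows still appear, with a count of 0.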

Use SQL to ensure I have data for each day of a certain time period

I'm looking to select only one data point from each date in my report, to ensure each day is accounted for and has at least one row of information. We had to do a few different things to move a large data file into our data warehouse (importing one large Google Sheet for some of the data, and using Python for daily pulls of the rest), and the data runs from last summer through now, so I want to make sure no date was left out. I could use a COUNT DISTINCT to check that the number of distinct dates matches the number of days between the first data point and yesterday (the latest data point), but I want to verify that each individual day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at style is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
(I know FIRST is probably not the right function for this case; it's currently just grabbing the very first created_at value, but it's a jumping-off point.)
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date, or at least that it only has one value per date. Otherwise, you might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
The BigQuery equivalent of the query in your question is:
SELECT created_at
FROM 'large_table'
ORDER BY created_at
LIMIT 1
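To verify day coverage explicitly rather than by counting, one hedged option is to generate the full calendar and anti-join it. This sketch assumes BigQuery Standard SQL, that created_at is a TIMESTAMP, and a made-up start date of 2020-07-01 for "last summer":

-- List any calendar dates that have no rows at all.
SELECT d AS missing_date
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2020-07-01', CURRENT_DATE())) AS d
LEFT JOIN (
  SELECT DISTINCT DATE(created_at) AS dt
  FROM large_table
) t ON t.dt = d
WHERE t.dt IS NULL
ORDER BY d;

An empty result means every day in the range has at least one row.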

Redshift - How to SUM number over last 4 weeks as a window function per row?

Is it possible to SUM a number over a specific time period in Amazon Redshift with a window function?
As an example, I'm counting login numbers for different companies per day.
What I now want, per row, is the sum of logins over the last 4 weeks relative to the date of that row (the field that was highlighted yellow in the original screenshot).
Thanks in advance for your help.
If you have data for each day, then you can use rows:
select t.*,
sum(logs) over (partition by company
order by date
rows between 27 preceding and current row
) as logins_4_weeks
from t;
Redshift does not yet support range for the window frame, so this is your best bet.
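If some days can be missing, a row-count frame of 27 preceding would reach back too far. A hedged alternative, under the same assumed table and column names, is a self-join constrained by date arithmetic:

-- Sum logins over the 28 calendar days ending at each row's date.
select t.company, t.date, sum(t2.logs) as logins_4_weeks
from t
join t t2
  on  t2.company = t.company
  and t2.date between t.date - 27 and t.date
group by t.company, t.date
order by t.company, t.date;

This scans more data than the window function, but it is correct regardless of gaps.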

Aggregating 15-minute data into weekly values

I'm currently working on a project in which I want to aggregate data (resolution = 15 minutes) to weekly values.
I have 4 weeks and the view should include a value for each week AND every station.
My dataset includes more than 50 stations.
What I have is this:
select name, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name
order by name
But this only displays the average value across all the weeks combined. What I need is an average for each week and each station.
Thanks for your help!
The problem is that when you GROUP BY just name, the weeks are flattened together and can only appear inside aggregate functions.
Your best option is to GROUP BY both name and week, so something like:
select name, week, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name, week
order by name
PS: It's not entirely clear whether you need one set of results for stations and a separate set for weeks, or a set of results for every week at every station (which this answer provides). If you need the former, separate queries are the way to go.
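If you did want the per-station and per-week subtotals alongside the detailed rows in a single pass, GROUPING SETS is a hedged option where the database supports it (e.g. Postgres 9.5+; the question doesn't name the engine):

-- Detail rows plus per-name and per-week subtotals in one query.
select name, week, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by grouping sets ((name, week), (name), (week))
order by name, week;

In the subtotal rows, the column that was rolled up comes back as NULL.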

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then, I could issue a select statement and join on the max effective date:
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what exact query you are trying to run. It looks like you are trying to return the most recent sale time of every individual item, and then JOIN this information with the SUM of sales amt per month for each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username,
       UTC_USEC_TO_MONTH(timestamp * 1000000) AS month,
       SUM(num_characters) AS total_characters_used
FROM [publicdata:samples.wikipedia]
WHERE (contributor_username != '' OR contributor_username IS NOT NULL)
  AND timestamp > 1133395200 AND timestamp < 1157068800
GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below): a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM (
  SELECT contributor_username, num_characters, timestamp AS time
  FROM [publicdata:samples.wikipedia]
  WHERE contributor_username != '' AND contributor_username IS NOT NULL
) AS current
JOIN (
  SELECT contributor_username, MAX(timestamp) AS time
  FROM [publicdata:samples.wikipedia]
  WHERE contributor_username != '' AND contributor_username IS NOT NULL
    AND timestamp > 1265073722
  GROUP BY contributor_username
) AS latest
ON current.contributor_username = latest.contributor_username
AND current.time = latest.time;
If your query requires you to first build a large aggregate (for example, you need to run what is essentially an accurate COUNT DISTINCT), another option is to break it up into two queries. The first query could provide the max effective date, along with a count, and save this result as a new table. Then you could run a sum query on the resulting table.
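One possible shape for that two-step approach, hedged and with a hypothetical destination table name foo.latest_orders (in BigQuery the destination table is set on the query job, not in the SQL itself):

-- Step 1: materialize the latest effective date per id into foo.latest_orders.
SELECT id, MAX(effdt) AS max_effdt FROM foo.orders GROUP BY id;
-- Step 2: join the (now small) lookup table back to compute monthly sales.
SELECT UTC_USEC_TO_MONTH(o.timestamp) AS month, SUM(o.amt)/100 AS sales
FROM foo.orders AS o
JOIN foo.latest_orders AS latest
  ON o.id = latest.id AND o.effdt = latest.max_effdt
GROUP BY month
ORDER BY month;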
You could also store monthly sales records in separate tables and only query the particular tables for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need aggregates across all of them, you can run your queries with multiple tables listed after the FROM clause.
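For example, a hedged sketch in legacy BigQuery SQL with hypothetical per-month table names, where a comma in the FROM clause unions the tables:

-- Comma-separated tables in legacy SQL act as a UNION ALL.
SELECT UTC_USEC_TO_MONTH(timestamp) AS month, SUM(amt)/100 AS sales
FROM [foo.orders_201301], [foo.orders_201302], [foo.orders_201303]
GROUP BY month
ORDER BY month;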