I'm trying to calculate the difference between two weeks but I'm getting a weird peak when plotting the results ( SQL / BigQuery ) - sql

I have a daily table that contains the number of visitors per store for every day.
My table's columns are:
Date
Store
Number_of_Visitors
Views: the number of views of the stores' ads.
I first aggregated the daily table into a weekly table so that I can calculate the variance between one week and the next.
Here is how I defined variance:
Variance = Number of Visitors in WEEK N+1 / Number of Visitors in WEEK N
I wrote the following query to do that (the new table is called weekly):
SELECT
year_week,
min(date) as date,
Store,
SUM(Number_Of_Visitors) AS TOTAL_VISITORS
FROM (
SELECT
*,
CONCAT(CAST(EXTRACT(YEAR FROM date) AS STRING), LPAD(CAST(EXTRACT(WEEK FROM date) AS STRING), 2, '0')) AS year_week
FROM `my-project`)
GROUP BY
year_week, Store
ORDER BY year_week
Then, in order to calculate the variance, I used the following query as well:
SELECT
base.*,
((base.TOTAL_VISITORS-lw.TOTAL_VISITORS)/lw.TOTAL_VISITORS) AS VAR_FF,
FROM
`weekly` base
JOIN (
SELECT
* EXCEPT (date),
DATE_ADD(DATE(TIMESTAMP(date)), INTERVAL 1 WEEK) AS n_date
FROM
`weekly` ) lw
ON
base.date = lw.n_date
AND base.Store= lw.Store
When I plot the variance (VAR_FF) in Data Studio, I get the following plot, which doesn't seem to make sense because of the high peak in the middle:

I think your code should look like this:
SELECT date_trunc(date, week) as year_week,
Store,
SUM(Number_Of_Visitors) AS TOTAL_VISITORS,
(1 -
(LAG(SUM(Number_Of_Visitors)) OVER (PARTITION BY Store ORDER BY MIN(date)) /
SUM(Number_Of_Visitors)
)
) as VAR_FF
FROM `my-project`
GROUP BY year_week, Store
ORDER BY year_week;
I'm not sure what your week calculations are really doing. This version compares each week with the previous week present in the data for the same store.
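To see what the LAG approach returns on a few rows, here is a minimal sketch that runs the same idea in SQLite through Python's sqlite3 module instead of BigQuery. It assumes the data is already aggregated into a hypothetical weekly table; the table name, the column names and the weekly_change helper are made up for illustration.

```python
import sqlite3

def weekly_change(rows):
    """rows: (week_start, store, total_visitors) tuples, one per store and week."""
    con = sqlite3.connect(":memory:")
    # Hypothetical pre-aggregated table, standing in for the weekly table in the question.
    con.execute(
        "CREATE TABLE weekly (week_start TEXT, store TEXT, total_visitors INTEGER)"
    )
    con.executemany("INSERT INTO weekly VALUES (?, ?, ?)", rows)
    query = """
        SELECT week_start, store, total_visitors,
               -- LAG reads the previous row of the same store; it is NULL for the first week
               1.0 - 1.0 * LAG(total_visitors) OVER (PARTITION BY store ORDER BY week_start)
                     / total_visitors AS var_ff
        FROM weekly
        ORDER BY store, week_start
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        ("2023-01-01", "A", 100),
        ("2023-01-08", "A", 200),
        ("2023-01-15", "A", 100),
        ("2023-01-01", "B", 50),
    ]
    for row in weekly_change(sample):
        print(row)
```

Because LAG works on row order rather than on a date join, a missing or partial week does not drop rows; it only means the comparison is made against the last week that exists for that store.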

Related

How to conditional SQL select

My table consists of user_id, revenue, publish_month columns.
Right now I use group_by user_id and sum(revenue) to get revenue for all individual users.
Is there a single SQL query I can use to query for user revenue across a time period conditionally? If for a specific user, there is a row for this month, I want to query for this month, last month and the month before. If there is not yet a row for this month, I want to query for last month and the two months before.
Any advice on which approach to take would be helpful: should I be using CASE expressions or if-else logic with EXISTS, or is this doable with a single SQL query?
UPDATE: since I did a bad job of describing the question, I've added some example data and expected results.
Where current month is not present for user 33
Where current month is present
Assuming publish_month is a DATE datatype, this should get the most recent three months of data per user...
SELECT
user_id, SUM(revenue) as s_revenue
FROM
(
SELECT
user_id, revenue, publish_month,
MAX(publish_month) OVER (PARTITION BY user_id) AS user_latest_publish_month
FROM
yourtableyoudidnotname
)
summarised
WHERE
publish_month >= DATEADD(month, -2, user_latest_publish_month)
GROUP BY
user_id
If you want to limit that to the most recent 3 months out of the last 4 calendar months, just add AND publish_month >= DATEADD(month, -3, DATE_TRUNC(month, GETDATE()))
The ambiguity here is why it is important to include a Minimal Reproducible Example.
With input data and required results, we could test our code against your requirements.
If you're using strings for the publish_month, you shouldn't be, and should fix that with utmost urgency.
You can use a windowing function to "number" the months. In this way the most recent one will have a value of 1, the prior 2, and the one before 3. Then you can only select the items with a number of 3 or less.
Here is how:
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
Now you just select the items with an RN of 3 or less and do your sum:
SELECT user_id, SUM(revenue) as s_revenue
FROM (
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
) X
WHERE RN <= 3
GROUP BY user_id
You could also do this without a sub query if you use the windowing function for SUM and a range, but I think this is easier to understand.
From the comment: there could be an issue if you have months from more than one year. To solve this, make the biggest number in the ORDER BY always the most recent. So instead of
ORDER BY publish_month DESC
you would have
ORDER BY (100*publish_year)+publish_month DESC
This means more recent years will always have a higher number, so January 2023 will be 202301 while December 2022 will be 202212. Since January 2023 is the bigger number, it will get a row number of 1 and December 2022 will get a row number of 2.
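As a quick check of the ROW_NUMBER approach, here is a small sketch that runs it in SQLite via Python's sqlite3 module. The table name revenue_by_month and the helper revenue_last_three are hypothetical, and publish_month is assumed to be stored as an ISO date string (which sorts chronologically across years), with at most one row per user and month.

```python
import sqlite3

def revenue_last_three(rows):
    """rows: (user_id, revenue, publish_month) tuples; publish_month as 'YYYY-MM-DD'."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table name; the question does not give one.
    con.execute(
        "CREATE TABLE revenue_by_month (user_id INTEGER, revenue INTEGER, publish_month TEXT)"
    )
    con.executemany("INSERT INTO revenue_by_month VALUES (?, ?, ?)", rows)
    query = """
        SELECT user_id, SUM(revenue) AS s_revenue
        FROM (
            SELECT user_id, revenue,
                   -- 1 = most recent month for this user, 2 = the one before, ...
                   ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY publish_month DESC) AS rn
            FROM revenue_by_month
        ) AS x
        WHERE rn <= 3
        GROUP BY user_id
        ORDER BY user_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        (33, 5, "2022-09-01"), (33, 10, "2022-10-01"),
        (33, 20, "2022-11-01"), (33, 30, "2022-12-01"),   # no row for 2023-01
        (44, 1, "2022-10-01"), (44, 2, "2022-11-01"),
        (44, 4, "2022-12-01"), (44, 8, "2023-01-01"),     # has a row for 2023-01
    ]
    print(revenue_last_three(sample))
```

Note that this picks the three most recent rows each user actually has, so a user with a gap in the data would reach further back than three calendar months.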

Separate by month, with each month having data from beginning of time to end of month in question - SQL

I currently have a window function that takes all my data and finds the latest value (amount) for each account and then averages this across all accounts.
Now I want to segment by month. The problem is that if there has been no data for an account in the specified month, we need to use the last available value. Therefore each month's segment needs to run from the beginning of time to the end of the chosen month. Currently the query provides a single value, 'average amount'. Ideally I would like this average value for each month from inception.
SELECT AVG(amount) as "average amount"
FROM (
SELECT *
FROM(
SELECT account_no, amount, _date, row_number() over (partition by account_no order by _date desc) as rn, source
FROM ('another subquery too long to write out fully') k
) j
WHERE j.rn = 1
) l

SQL Pivot table, with multiple pivots on criteria

Here is my dataset,
It has a reservation (unique ID), a reservation_dt, a fiscal year (all the same year for the most part), the month both as a number and as a name, a reservation status, the total number reserved, and a counter (basically 1 for each reservation row).
These are my guidelines (they need to be turned into columns by month):
Requested - Count of All Distinct reservations
Num_Requested (sum total_number_requested by month)
Booked (count of All Distinct reservations status is order created)
Num_Booked (sum total_number_requested by month) where status is order created
Not_Booked (count of All Distinct reservations where status unfulfilled)
Not_Num_Booked, (sum total_number_requested by month where status is unfulfilled)
I am looking to translate this into a pivot table. This is what I've got so far, and I can't figure out why it's not working.
I figured I would turn each of the above guidelines into a column, using either sum(total_number_requested) or count(total_requested) where the reservation status is a given value, and so on.
I'm open to any other ideas of how to make this simpler and make it work.
SELECT [month_name],
fyear AS fyear,
Requested,
Num_Requested
FROM (SELECT reservation,
reservation_status,
total_number_requested,
fyear,
[month_name],
[month],
total_requested
FROM #temp2) SourceTable
PIVOT (SUM(total_number_requested)
FOR reservation_status IN ([Requested])) PivotNumbRequested
PIVOT (COUNT(reservation)
FOR total_requested IN ([Num_Requested])) PivotCountRequested
WHERE [month] = 7
ORDER BY fyear,
[month];
Use conditional expressions to emulate a pivot. Example:
SELECT fyear, [month], month_name, Count(*) AS CountALL, Sum(total_number_requested) AS TotNum,
Sum(IIf(reservation_status = 'Order Created', total_number_requested, Null)) AS SumCreated
FROM tablename
GROUP BY fyear, [month], month_name
More info:
SQLServer - Multiple PIVOT on same columns
Crosstab Query on multiple data points
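To show how conditional aggregation could cover all six guideline columns, here is a minimal sketch run in SQLite through Python's sqlite3 module (so it uses CASE rather than IIf). The table name reservations, the helper monthly_summary, and the exact status strings 'Order Created' and 'Unfulfilled' are assumptions, not taken from the real data.

```python
import sqlite3

def monthly_summary(rows):
    """rows: (reservation, fyear, month, month_name, reservation_status, total_number_requested)."""
    con = sqlite3.connect(":memory:")
    # Hypothetical stand-in for #temp2 with only the columns the summary needs.
    con.execute(
        "CREATE TABLE reservations (reservation INTEGER, fyear INTEGER, month INTEGER, "
        "month_name TEXT, reservation_status TEXT, total_number_requested INTEGER)"
    )
    con.executemany("INSERT INTO reservations VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT fyear, month, month_name,
               COUNT(DISTINCT reservation) AS requested,
               SUM(total_number_requested) AS num_requested,
               -- CASE yields NULL for other statuses, and COUNT ignores NULLs
               COUNT(DISTINCT CASE WHEN reservation_status = 'Order Created'
                                   THEN reservation END) AS booked,
               SUM(CASE WHEN reservation_status = 'Order Created'
                        THEN total_number_requested ELSE 0 END) AS num_booked,
               COUNT(DISTINCT CASE WHEN reservation_status = 'Unfulfilled'
                                   THEN reservation END) AS not_booked,
               SUM(CASE WHEN reservation_status = 'Unfulfilled'
                        THEN total_number_requested ELSE 0 END) AS not_num_booked
        FROM reservations
        GROUP BY fyear, month, month_name
        ORDER BY fyear, month
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        (1, 2023, 7, "July", "Order Created", 3),
        (2, 2023, 7, "July", "Unfulfilled", 2),
        (3, 2023, 7, "July", "Order Created", 5),
        (4, 2023, 8, "August", "Unfulfilled", 1),
    ]
    for row in monthly_summary(sample):
        print(row)
```

Each guideline becomes one aggregate with its own condition, which avoids chaining several PIVOT clauses over the same source rows.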

SQL/HSQLDB query and sub-query in Aggregate Function

My database looks like this (very simple) and is called "RideDate":
BikeDate Bike Miles
What I am looking to achieve is a query that gives, for each month, a total (SUM) across all years, an average (AVG) across all years, and a total for a specific year
(WHERE YEAR("BikeDate") = '2014'). (I don't have my exact code in front of me due to power fluctuations, pushing me onto an iPad (high winds and wet/heavy snow)).
My attempt goes something like this:
SELECT MONTH("BikeDate") AS "Month", SUM("Miles") AS "SMiles", AVG("Miles") AS "Average",
(SELECT SUM("Miles") FROM "RideDate" WHERE YEAR("BikeDate") = '2014') AS "2014"
FROM "RideDate"
GROUP BY MONTH("BikeDate")
ORDER BY MONTH("BikeDate") ASC
The results should be:
(month) (sum of month over all years) (avg of month over all years) (sum of month for '14)
The last column will not collate by the 'group by month' and gives a sum for the whole year.
How can I write the sub-query to sum across the iterated month of the main query for the selected year? Is there another way of solving this?
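You can try it with a CROSS JOIN:
SELECT a.*, b."YearSum"
FROM
(SELECT MONTH("BikeDate") AS "Month", SUM("Miles") AS "SMiles", AVG("Miles") AS "Average"
FROM "RideDate"
GROUP BY MONTH("BikeDate")) a
CROSS JOIN
(SELECT SUM("Miles") AS "YearSum"
FROM "RideDate"
WHERE YEAR("BikeDate") = '2014') b
Note that "YearSum" is the total for all of 2014 and is repeated on every row. For a per-month 2014 total, the second subquery would also need GROUP BY MONTH("BikeDate") and a join on the month instead of a CROSS JOIN.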
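Another way to get the chosen year's total per month, swapping the join for conditional aggregation, is sketched below. It runs in SQLite through Python's sqlite3 module rather than HSQLDB, so it uses strftime instead of MONTH() and YEAR(); the table name ride_date, the lower-case column names and the helper monthly_miles are hypothetical.

```python
import sqlite3

def monthly_miles(rows, year):
    """rows: (bike_date, bike, miles) tuples with bike_date as 'YYYY-MM-DD'."""
    con = sqlite3.connect(":memory:")
    # Hypothetical stand-in for the "RideDate" table.
    con.execute("CREATE TABLE ride_date (bike_date TEXT, bike TEXT, miles INTEGER)")
    con.executemany("INSERT INTO ride_date VALUES (?, ?, ?)", rows)
    query = """
        SELECT CAST(strftime('%m', bike_date) AS INTEGER) AS month,
               SUM(miles) AS s_miles,
               AVG(miles) AS average,
               -- only rows from the chosen year contribute; NULL if the month has none
               SUM(CASE WHEN strftime('%Y', bike_date) = ? THEN miles END) AS year_miles
        FROM ride_date
        GROUP BY month
        ORDER BY month
    """
    result = con.execute(query, (str(year),)).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        ("2013-01-05", "road", 10),
        ("2014-01-10", "road", 20),
        ("2014-01-20", "road", 30),
        ("2013-02-01", "road", 8),
    ]
    for row in monthly_miles(sample, 2014):
        print(row)
```

Because the year filter sits inside the aggregate, it is evaluated within each month group, so the last column follows the GROUP BY instead of summing the whole year. The average here is per ride row, as with AVG in the original query.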

How can I select one row for each week in a date range that spans more than a year?

In my PostgreSQL database, I have a table with columns of dates and prices ('transdate' and 'price').
I would like to form a query which selects one row for each week over a date range which spans more than one year.
From another question/answer here, I implemented this code which works for date ranges of less than a year:
;with cte as
(
select *,
row_number() over (partition by Extract (week from transdate) order by transdate desc) as rn
from "tablename" where transdate between '06-01-1999' and '06-01-1999'::timestamp + '50 week'::interval
)
select transdate, price from cte where rn = 1 order by transdate;
However, when I extend the interval greater than 50 weeks, it still only selects a max of 12 months.
How can I re-write this code to select one date/price from every week in the range?
Your problem is that week numbers wrap around at year boundaries but you want to look at the week number and the year at the same time. Lucky for you, you can PARTITION BY several things at once:
row_number() over (
partition by extract(week from transdate),
extract(year from transdate)
order by transdate desc
) as rn
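The effect of adding the year to the partition can be seen in a small sketch run in SQLite through Python's sqlite3 module. SQLite has no extract(), so strftime('%W') and strftime('%Y') stand in for the week and year here; the table name prices and the helper one_row_per_week are made up for illustration.

```python
import sqlite3

# Partition expressions: week number alone versus year plus week number.
WEEK_ONLY = "strftime('%W', transdate)"
YEAR_AND_WEEK = "strftime('%Y', transdate), strftime('%W', transdate)"

def one_row_per_week(rows, partition_by):
    """rows: (transdate, price) tuples with transdate as 'YYYY-MM-DD'."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table name; the question only calls it "tablename".
    con.execute("CREATE TABLE prices (transdate TEXT, price INTEGER)")
    con.executemany("INSERT INTO prices VALUES (?, ?)", rows)
    # partition_by is one of the fixed constants above, never user input.
    query = f"""
        WITH cte AS (
            SELECT transdate, price,
                   ROW_NUMBER() OVER (PARTITION BY {partition_by}
                                      ORDER BY transdate DESC) AS rn
            FROM prices
        )
        SELECT transdate, price FROM cte WHERE rn = 1 ORDER BY transdate
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        ("1999-06-01", 10),
        ("1999-06-03", 11),  # same week as the row above
        ("2000-06-01", 12),  # same week number, one year later
    ]
    print(one_row_per_week(sample, WEEK_ONLY))      # the 1999 week is swallowed
    print(one_row_per_week(sample, YEAR_AND_WEEK))  # one row per week and year
```

In PostgreSQL, extract(week from ...) follows ISO 8601 week numbering, so pairing it with extract(isoyear from ...) rather than extract(year from ...) may be safer for dates close to 1 January.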