SQL dynamic WHERE clause for every row - sql

I need your expertise and guidance on the approach I should use to solve the task below. I have skimmed through existing answers but have not found a good match (or, if one was there, it required more SQL knowledge and experience than I have).
TASK
Calculate the total sales forecast per product over a period of time (from the start of sales until the end of sales).
DATA
Table_1: Product - Calendar Week - Sales Forecast
Table_2: Product - Sales Start (Date_1, individual for every product)
Table_3: Product - Sales End (Date_2, individual for every product)
SOLUTION
My initial thought was to use a simple WHERE ... BETWEEN clause, but SQL crashes every time I execute the query below:
SELECT
T1.Product,
SUM(T1.Sales_Forecast) AS Total_Sales_Forecast
FROM Table_1 as T1
LEFT JOIN Table_2 as T2
ON T1.Product = T2.Product
LEFT JOIN Table_3 as T3
ON T1.Product = T3.Product
WHERE T1.Calendar_Week BETWEEN T2.Sales_Start AND T3.Sales_End
GROUP BY T1.Product
The time period I am looking for is therefore not static; it varies per product based on its start and end dates of sales.
How should I better approach this task?
Thanks in advance,
f

Related

Adding column based on dynamic criteria that changes for every row in snowflake

Trying to add a column that counts distinct customers in Snowflake based on criteria that change for every row, i.e. it needs to count customers from 52 weeks before the current week_ending date up to the current week_ending date.
The query goes something like this:
select week_ending, sales, last_year_cust_count
from table where year = 2022
Now I want last_year_cust_count to hold the distinct customers from 52 weeks before week_ending up to the current week_ending, and it needs to show results like the following:
Week_ending | Sales | last_year_cust_count
02/01/22    | $300  | 3479
09/01/22    | $350  | 3400
16/01/22    | $450  | 3500
... and so on
The optimal way to solve this over a complex structure is to use a bitmap, and then roll that up to the projections you want.
You should read Using Bitmaps to Compute Distinct Values for Hierarchical Aggregations
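As a rough illustration of that pattern (not production code): build one bitmap of customers per week, OR the bitmaps over the trailing 52 weeks, and count the set bits. This sketch assumes customer is a numeric id and reuses the table_a / week_ending / year names from your query:
with weekly as (
    -- one bitmap of customer ids per week and bucket
    select
        week_ending,
        bitmap_bucket_number(customer) as bucket,
        bitmap_construct_agg(bitmap_bit_position(customer)) as cust_bitmap
    from table_a
    group by week_ending, bucket
), rolled as (
    -- merge the per-week bitmaps across the trailing 52-week window
    select
        cur.week_ending,
        prior.bucket,
        bitmap_or_agg(prior.cust_bitmap) as merged_bitmap
    from (select distinct week_ending from weekly) cur
    join weekly prior
      on prior.week_ending > dateadd(week, -52, cur.week_ending)
     and prior.week_ending <= cur.week_ending
    group by cur.week_ending, prior.bucket
)
-- count set bits per bucket and sum across buckets
select week_ending, sum(bitmap_count(merged_bitmap)) as last_year_cust_count
from rolled
group by week_ending
order by week_ending;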
The simple, non-performant way is to self join and throw processing power at it.
select a.week_ending, a.sales, count(distinct b.customer) as last_year_cust_count
from table_a as a
join table_a as b
on <filter that I can't be bothered writing to select the last 52 weeks based on years and weeks>
where year = 2022
group by a.week_ending, a.sales
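For completeness, one way that filter might look, assuming week_ending is a DATE column and that year is a column on the table (treat all names as placeholders):
select a.week_ending, a.sales, count(distinct b.customer) as last_year_cust_count
from table_a as a
join table_a as b
  -- trailing window: strictly after week_ending minus 52 weeks, up to and including week_ending
  on b.week_ending > dateadd(week, -52, a.week_ending)
 and b.week_ending <= a.week_ending
where a.year = 2022
group by a.week_ending, a.sales
order by a.week_ending;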

Cohort retention with SQL BigQuery

I am trying to create a retention table like the following using SQL in BigQuery, but with MONTHLY cohorts.
I have the following columns to use in my dataset. I am only using one table, and its name is 'curious-furnace-341507.TEST.Test_Dataset_-_Orders':
order_date | order_id | customer_id
2020-01-02 | 12345    | 6789
I do not need the new-user column, and the data goes through June 2020. Ideally I would have a cohort month column that lists the January-June cohorts and then 5 periods across.
I have tried so many different things and keep getting errors in BigQuery; I think I am approaching it all wrong. The online queries I am trying to pull from seem to use dates rather than months, which is also causing some confusion, as I think I need to truncate my date column to months in the query?
Does anyone have a go-to query that will work in BigQuery for a retention table or can help me approach this? Thanks!
This may help you:
With cohorts AS (
SELECT
customer_id,
MIN(DATE(order_date)) AS cohort_date
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders`
GROUP BY 1)
SELECT
FORMAT_DATE("%Y-%m", c.cohort_date) AS cohort_mth,
t.customer_id AS cust_id,
DATE_DIFF(t.order_date, c.cohort_date, month) AS order_period
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders` t
JOIN cohorts c ON t.customer_id = c.customer_id
WHERE c.cohort_date >= '2020-01-01'
AND DATE_DIFF(t.order_date, c.cohort_date, month) <= 5
GROUP BY 1, 2, 3
I typically do pivots and % calcs in Excel/Sheets, so this will just give you the input data you need for that.
NOTE:
This will give you the unique customers who ordered in period X (repeat orders within a period are ignored).
It also includes period 0 (ordered again in the cohort_mth), which you may wish to keep or exclude.
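If you want the counts in SQL before pivoting, a small follow-on sketch like this should work (cohort_orders is just a placeholder name for the query above wrapped as a subquery):
SELECT
cohort_mth,
order_period,
COUNT(DISTINCT cust_id) AS customers
FROM (
-- paste the query above here
) AS cohort_orders
GROUP BY 1, 2
ORDER BY 1, 2
Dividing each period's count by the cohort's period 0 count then gives the retention percentages.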

I need to calculate running total in Azure SQL Datawarehouse. Tried recursive CTEs but not supported

Azure SQL Data Warehouse doesn't support recursive CTEs, and I need a solution that works in Azure SQL DW for my problem.
I have a table with inventory details for various products: the quantities Produced, Sold, and Returned per day. The initial quantity for the present day is the final quantity from the previous day (I have it in another table; refer to the Stocklevel table in the image). I have to calculate the final quantity for the present day using the day's Initial, Produced, Sold, and Returned quantities, and that final quantity should then be passed as the initial quantity for the next day for the same product, and so on.
I tried using recursive CTEs and got the error “Recursive common table expressions are not supported in this version.”
Please help if you have any other ideas. Thanks in Advance.
Final=Initial+Produced-Sold+Returned
In the image are the details I have.
Recursive CTEs are not supported in Synapse at this time, hence the message you got. I think we can play around with a self join and achieve the ask here.
select a.productid, a.initial, a.produced, a.sold, a.Returned, a.Final, convert(date, a.date) as date,
convert(date, DATEADD(dd, -1, a.Date)) as yesterday,
final = isnull(a.initial, 0) + a.produced - a.sold + a.Returned
from Inventary a
join Inventary b
on a.Productid = b.Productid
and a.Date = DATEADD(dd, -1, b.Date)
group by a.productid, a.initial, a.produced, a.sold, a.Returned, a.Final, convert(date, a.date), convert(date, DATEADD(dd, -1, a.Date))
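Another route, offered only as a sketch: Synapse does support window functions, so the running total can be computed directly, assuming Inventary has one row per product per day and that Initial on a product's first day holds the opening stock.
select
a.productid,
convert(date, a.date) as date,
a.produced,
a.sold,
a.Returned,
-- opening stock from the product's first day, plus the cumulative daily movement
FIRST_VALUE(a.initial) OVER (PARTITION BY a.productid ORDER BY a.date)
+ SUM(a.produced - a.sold + a.Returned) OVER (PARTITION BY a.productid ORDER BY a.date
                                              ROWS UNBOUNDED PRECEDING) as final
from Inventary a;
Each day's final then carries forward implicitly, which matches Final = Initial + Produced - Sold + Returned chained day over day.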

Joining a second instance of Sales table to get last weeks Sales

I have a Sales table showing product number, sales value, and sales volume per week. I need to build a report to display these values and volumes along with the equivalent values from the previous week. I also have a Weeks table which gives me the previous week number for the current week (for instance if current week is 2013-01, then the previous week value is 2012-52).
I therefore assumed it would be simple enough to join to another instance of Sales on product number and the previous week number from the Weeks table. However, Teradata is not letting me do this: initially it threw an "Improper column reference in the search condition of a joined table" error, and when I re-ordered the query to reference Weeks before the second instance of Sales it now tries to run but gives me a "No more spool space" error, so I assume my approach is incorrect. My SQL is as follows:
select s.Week_Number,
s.Product_Number,
s.Sales_Value,
s.Sales_Volume,
s_lw.Sales_Value,
s_lw.Sales_Volume
from SALES s
inner join WEEKS w
on s.Week_Number = w.Week_Number
left join SALES s_lw
on s.Product_Number = s_lw.Product_Number
and s_lw.Week_Number = w.Last_week_Number
Could anyone please suggest what I'm doing wrong here? It seems like this should be achievable.
I would suggest using a Window Aggregate Function to accomplish this with a single pass of the SALES table:
SELECT DISTINCT
s.Week_Number,
s.Product_Number,
s.Sales_Value,
s.Sales_Volume,
MAX(s.Sales_Value) OVER (PARTITION BY s.Product_Number
ORDER BY s.Week_Number ASC
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS LW_Sales_Value,
MAX(s.Sales_Volume) OVER (PARTITION BY s.Product_Number
ORDER BY s.Week_Number ASC
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS LW_Sales_Volume
FROM SALES s;

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then I could issue a select statement and join on the max effective date:
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
Yes, adding a timestamp for each record (or in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "Small Join" can actually return at least 8MB (this value is compressed on our end, so is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me exactly what query you are trying to run... it looks like you are trying to return the most recent sales times of every individual item, and then JOIN this information with the SUM of sales amt per month of each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
SUM(num_characters) as total_characters_used FROM
[publicdata:samples.wikipedia] WHERE (contributor_username != '' or
contributor_username IS NOT NULL) AND timestamp > 1133395200
AND timestamp < 1157068800 GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below): a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
(SELECT contributor_username, num_characters, timestamp as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL)
AS current
JOIN
(SELECT contributor_username, MAX(timestamp) as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL AND timestamp > 1265073722 GROUP BY contributor_username) AS latest
ON
current.contributor_username = latest.contributor_username
AND
current.time = latest.time;
If your query requires you to first build a large aggregate (for example, you essentially need to run an accurate COUNT DISTINCT), another option is to break this up into two queries. The first query could provide the max effective date by id, and you would save this result as a new table. Then you could run a sum query against the resulting table.
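A rough sketch of that split, reusing the foo.orders names from your example (foo.latest_orders is an assumed name for the saved result):
-- Query 1: run with a destination table, e.g. foo.latest_orders
SELECT id, MAX(effdt) AS max_effdt
FROM foo.orders
GROUP BY id;

-- Query 2: join the saved table back and aggregate by month
SELECT UTC_USEC_TO_MONTH(o.timestamp) AS month, SUM(o.amt)/100 AS sales
FROM foo.orders AS o
JOIN foo.latest_orders AS latest
ON o.id = latest.id AND o.effdt = latest.max_effdt
GROUP BY month
ORDER BY month;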
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.
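For instance, in the older BigQuery SQL dialect a comma-separated FROM list queries the union of the listed tables, so per-month tables (hypothetical names below) can be combined like this:
-- commas in a legacy SQL FROM clause union the tables rather than joining them
SELECT UTC_USEC_TO_MONTH(timestamp) AS month, SUM(amt)/100 AS sales
FROM foo.orders_201201, foo.orders_201202
GROUP BY month
ORDER BY month;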