I have a list of evaluation dates stored in a table, datelist. It's technically two columns, start_date and end_date, for each evaluation period. The end_date will definitely need to be used, but the start_date may not. I only care about periods that are completed, so, for example, the period from 2016-01-01 to 2016-07-01 is in progress but not complete. So, it's not in the table.
start_date end_date
2012-01-01 2012-07-01
2012-07-01 2013-01-01
2013-01-01 2013-07-01
2013-07-01 2014-01-01
2014-01-01 2014-07-01
2014-07-01 2015-01-01
2015-01-01 2015-07-01
2015-07-01 2016-01-01
I have a separate table that lists cumulative sales by customer, sales_table with three columns, customer_ID, cumul_sales, transaction_date. For example, let's say customer 4793 bought $100 worth of stuff on 2/14/2014 and $200 worth of stuff on 3/30/2014 and $75 on 7/27/2014, the table will have the following rows:
customer_ID cumul_sales transaction_date
4793 100 2014-02-14
4793 300 2014-03-30
4793 375 2014-07-27
Now, for each evaluation date and for each customer, I want to know the customer's cumulative sales as of that evaluation date. If a customer hadn't purchased anything by an evaluation date, then I wouldn't want a row for that customer at all for that evaluation date. This would be stored in a new table, called sales_by_eval, with columns customer_ID, cumul_sales, eval_date. For the example customer above, I'd have the following rows:
customer_ID cumul_sales eval_date
4793 300 2014-07-01
4793 375 2015-01-01
4793 375 2015-07-01
4793 375 2016-01-01
I can do this, but I'm looking to do it in an efficient way so I don't have to read through the data once for each evaluation date. If there are a lot of rows in the sales_table and 40 evaluation dates, that would be a large waste to read through the data 40 times, once for each evaluation date. Would it be possible with only one read through the data, for example?
The basic idea of the current process is a macro loop that loops once per evaluation period. Each loop has a data step that creates a new table (one table per loop) to check each transaction to see if it has occurred before or on the end_date of that corresponding evaluation period. That is, each table has all the transactions that occur before or on that evaluation date but none of the ones that occur after. Then, a later data step uses "last." to get only the last transaction for each customer before that evaluation date. And, finally, all the various tables created are put back together in another data step where they are all listed in the SET statement.
This is in SAS, so anything SAS can do, including SQL and macros, is fine with me.
In SAS, when you use a GROUP BY clause, you can still select variables that are not grouping variables, like this:
proc sql;
create table sales_by_eval as
select s.customer_ID, s.cumul_sales, d.end_date as eval_date
from datelist d
join sales_table s
on d.end_date >= s.transaction_date
group by s.customer_ID, d.end_date
having max(s.transaction_date) = s.transaction_date
;
quit;
This means that for each combination of the selected variables, SAS returns a record with the measures summarized within the defined group (SAS remerges the summary statistic back onto the detail rows). To limit the result to the latest transaction value, use a HAVING condition that keeps only those records whose transaction_date equals max(transaction_date) within the s.customer_ID, d.end_date group.
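For readers who want to try the idea outside SAS, here is a minimal sketch using Python's built-in sqlite3 module. It is not the SAS query itself: SQLite does not remerge summary statistics the way PROC SQL does, so the "latest transaction on or before the evaluation date" condition is written as a correlated subquery instead of a HAVING clause. The table and column names follow the question; only part of the date list is loaded, and the `<=` comparison is an assumption based on the "before or on the end_date" wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table datelist (start_date text, end_date text);
    create table sales_table (customer_ID integer, cumul_sales integer,
                              transaction_date text);

    -- A subset of the evaluation periods from the question.
    insert into datelist values
        ('2013-07-01', '2014-01-01'),
        ('2014-01-01', '2014-07-01'),
        ('2014-07-01', '2015-01-01'),
        ('2015-01-01', '2015-07-01'),
        ('2015-07-01', '2016-01-01');

    insert into sales_table values
        (4793, 100, '2014-02-14'),
        (4793, 300, '2014-03-30'),
        (4793, 375, '2014-07-27');
""")

# For every (customer, evaluation date) pair, keep only the row holding the
# latest transaction on or before that evaluation date. ISO-formatted date
# strings compare correctly as text.
rows = conn.execute("""
    select s.customer_ID, s.cumul_sales, d.end_date as eval_date
    from datelist d
    join sales_table s
      on s.transaction_date <= d.end_date
    where s.transaction_date = (
        select max(s2.transaction_date)
        from sales_table s2
        where s2.customer_ID = s.customer_ID
          and s2.transaction_date <= d.end_date
    )
    order by s.customer_ID, d.end_date
""").fetchall()

for row in rows:
    print(row)
```

The 2014-01-01 evaluation date produces no row for customer 4793, since the customer had not purchased anything by then.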
Related
I have the following SQL Server database table with a starting and end date for every dataset:
Task StartDate EndDate
FirstTask 2022-12-02 2022-12-06
SecondTask 2022-12-03 2022-12-06
ThirdTask 2022-12-06 2022-12-07
Now I am looking for a query which gives me for every date between the lowest start and the highest end date the number of active tasks for every day:
Day NumberOfActiveTasks
2022-12-02 1
2022-12-03 2
2022-12-04 2
2022-12-05 2
2022-12-06 3
2022-12-07 1
Can someone please point me in the right direction? I guess with the standard SQL functions I can not do this :-(
As I mentioned in the comment, use your calendar table. If you don't have one, invest in one. There are hundreds (possibly thousands) of articles out there on how to both create and populate one, so I'm not going to cover that here, and every person's or business's calendar table tends to be a little different for bespoke needs.
Once you have your calendar table, you can just JOIN to it and then aggregate on the calendar date:
SELECT CT.CalendarDate,
COUNT(*) AS ActiveTasks
FROM (VALUES('FirstTask','2022-12-02','2022-12-06'),
('SecondTask','2022-12-03','2022-12-06'),
('ThirdTask','2022-12-06','2022-12-07'))V(Task,StartDate,EndDate)
JOIN dbo.CalendarTable CT ON CT.CalendarDate BETWEEN V.StartDate AND V.EndDate
GROUP BY CT.CalendarDate
ORDER BY CT.CalendarDate;
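To see the join-and-aggregate pattern run end to end, here is a rough sketch using Python's built-in sqlite3 module in place of SQL Server. The one-column CalendarTable filled with December 2022 and the Tasks table name are assumptions made for the illustration; a real calendar table would usually carry many more columns and years.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("create table CalendarTable (CalendarDate text primary key)")
conn.execute("create table Tasks (Task text, StartDate text, EndDate text)")

# Hypothetical calendar: one row per day of December 2022.
first_day = date(2022, 12, 1)
conn.executemany(
    "insert into CalendarTable values (?)",
    [((first_day + timedelta(days=i)).isoformat(),) for i in range(31)],
)

# Sample tasks from the question.
conn.executemany(
    "insert into Tasks values (?, ?, ?)",
    [
        ("FirstTask", "2022-12-02", "2022-12-06"),
        ("SecondTask", "2022-12-03", "2022-12-06"),
        ("ThirdTask", "2022-12-06", "2022-12-07"),
    ],
)

# Each task is matched to every calendar day it covers; counting the
# matches per day gives the number of active tasks.
active = conn.execute("""
    select CT.CalendarDate, count(*) as ActiveTasks
    from Tasks T
    join CalendarTable CT
      on CT.CalendarDate between T.StartDate and T.EndDate
    group by CT.CalendarDate
    order by CT.CalendarDate
""").fetchall()

for day, n in active:
    print(day, n)
```

Because this is an inner join, calendar days with no active task do not appear at all; a LEFT JOIN starting from the calendar table would be needed to show them with a count of zero.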
I have a growing table of orders which looks something like this:
units_sold timestamp
1 2021-03-02 10:00:00
2 2021-03-02 11:00:00
4 2021-03-02 12:00:00
3 2021-03-03 13:00:00
9 2021-03-03 14:00:00
I am trying to partition the table into each day, and gather statistics on units sold on the day, and on the day before. I can pretty easily get the units sold today and yesterday for just today, but I need to cross apply a date range for every date in my orders table.
The expected result would look like this:
units_sold_yesterday units_sold_today date_measured
12 7 2021-03-02
NULL 12 2021-03-03
One way of doing it is to append the order data every day to a new table. However, this table could grow very large, and I need historical data as well.
In my mind's eye, I know I have to cascade the data, so that BigQuery compares the data to "today's date", which would shift across all the dates in the table.
I'm thinking this shift could come from a cross apply of all the distinct dates in the table, so I would get a copy of the orders table for each date, but with a different "today's date" column that I can date-diff against the sale date to derive units_sold_today.
This would still, however, create a massive amount of data to process, and I guess there may be a simple function for this in BigQuery or standard SQL syntax.
This sounds like aggregation and lag():
select timestamp_trunc(timestamp, day) as date_measured, sum(units_sold) as sold_today,
lag(sum(units_sold)) over (order by min(timestamp)) as sold_yesterday
from t
group by 1
order by 1;
Note: This assumes that you have data for every day.
Consider below
select date_measured, units_sold_today,
lag(units_sold_today) over(order by date_measured) units_sold_yesterday,
from (
select date(timestamp) date_measured,
sum(units_sold) units_sold_today
from `project.dataset.table`
group by date_measured
)
if applied to sample data in your question - output is
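As a rough way to check this pattern on the sample data, the sketch below runs the same aggregate-then-lag query in SQLite through Python's built-in sqlite3 module rather than in BigQuery. The table name `orders` is an assumption, and window functions require SQLite 3.25 or newer, which recent Python builds normally bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (units_sold integer, timestamp text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        (1, "2021-03-02 10:00:00"),
        (2, "2021-03-02 11:00:00"),
        (4, "2021-03-02 12:00:00"),
        (3, "2021-03-03 13:00:00"),
        (9, "2021-03-03 14:00:00"),
    ],
)

# Inner query: one row per day with that day's total units.
# Outer query: lag() pulls the previous row's total alongside it.
daily = conn.execute("""
    select date_measured, units_sold_today,
           lag(units_sold_today) over (order by date_measured)
               as units_sold_yesterday
    from (
        select date(timestamp) as date_measured,
               sum(units_sold) as units_sold_today
        from orders
        group by date_measured
    )
    order by date_measured
""").fetchall()

for row in daily:
    print(row)
```

Note that lag() looks at the previous row, not the previous calendar day, so a day with no orders would cause "yesterday" to refer to the last day that did have orders.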
I have a dataset from which I am trying to get the total number of times a customer has left within the same day (basically a refund).
If a customer has a new business and a churn with the same transaction time, then it is considered a refund. I was trying to use the lag function for this, but I am getting a result whenever there is a change from new_business to churn. What I need is a change from new_business to churn that also happens on the same day.
Data looks like:
user_id time transaction_type
1234 2020-01-10 new_business
1234 2020-01-10 churn
5678 2020-01-10 new_business
5678 2020-05-01 churn
1011 2020-01-10 new_business
In the above example, user_id 1234 would be a refund but 5678 would not be. user 1011 is still a customer. I am trying to get the total count of refund customers
My query:
select count(*)
lag(time) over (partition by user_id order by time)
from data
where transaction_type in('churn','new_business')
However, what's happening with this query is that I am getting every case where there is a change between the two types. So I am getting user_id 1234 and 5678. What am I missing in order to limit this to only user_id 1234?
If you want people who have the two types on the same date, then you can use aggregation:
select user_id, time
from data
where transaction_type in ('churn', 'new_business')
group by user_id, time
having count(distinct transaction_type) = 2;
If you want a count of these, you can use a subquery.
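A minimal sketch of that subquery approach, run here in SQLite through Python's built-in sqlite3 module (the original question does not state which database is used, so the choice of engine is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table data (user_id integer, time text, transaction_type text)"
)
conn.executemany(
    "insert into data values (?, ?, ?)",
    [
        (1234, "2020-01-10", "new_business"),
        (1234, "2020-01-10", "churn"),
        (5678, "2020-01-10", "new_business"),
        (5678, "2020-05-01", "churn"),
        (1011, "2020-01-10", "new_business"),
    ],
)

# The inner query finds (user, day) pairs that have both transaction types;
# the outer query simply counts those pairs.
refund_count = conn.execute("""
    select count(*)
    from (
        select user_id, time
        from data
        where transaction_type in ('churn', 'new_business')
        group by user_id, time
        having count(distinct transaction_type) = 2
    )
""").fetchone()[0]

print(refund_count)
```

This counts refund events; if one customer can be refunded on several different days and should be counted once, count(distinct user_id) in the outer query would be the variant to use.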
I have a dataset of parts, price per part, and month. I am accessing this data via a live connection to a SQL Server database. This database gets updated monthly with new prices for each part. What I would like to do is graph one year of price data for the ten parts whose prices changed the most over the last month (either as a percentage of last month's price or as a total change in dollars.)
Since my database connection is live, ideally Tableau would grab the new price data each month, updating the top ten parts whose prices changed for the new period. I don't want to manually have to change the months or use a stored procedure if possible.
part price date
110 167.66 2018-12-01 00:00:00.000
113 157.82 2018-12-01 00:00:00.000
121 99.16 2018-12-01 00:00:00.000
133 109.82 2018-12-01 00:00:00.000
137 178.66 2018-12-01 00:00:00.000
138 154.99 2018-12-01 00:00:00.000
143 67.32 2018-12-01 00:00:00.000
149 103.82 2018-12-01 00:00:00.000
113 167.34 2018-11-01 00:00:00.000
121 88.37 2018-11-01 00:00:00.000
133 264.02 2018-11-01 00:00:00.000
1. Create a calculated field called Recent_Price as: if DateDiff('month', [date], Today()) <= 1 then [price] end. This returns the price for recent records and null for older records. You might need to tweak the condition based on details, or use an LOD calc to always get the last 2 values regardless of today's date.
2. Create a calculated field called Price_Change as Max([Recent_Price]) - Min([Recent_Price]). Note that you can't tell from this whether the change was positive or negative, just its magnitude.
3. Make sure part is a discrete dimension. Drag it to the Filter Shelf. Set the filter to show the Top N part by Price_Change.
4. It's not hard to extend this to include the sign of the price change, or to convert it to a percentage. Hint: you'll probably need a pair of calcs like the one in step 1 to select prices for specific months.
You haven't provided any sample data, but you could follow something like this,
;WITH top_parts AS (
-- select the top 10 parts based on some criteria
SELECT TOP 10 parts.[id], parts.[name] FROM parts
ORDER BY <most changed>
)
SELECT price.[date], p.[name], price.[price] FROM top_parts p
INNER JOIN part_price price ON p.[id] = price.[part_id]
ORDER BY price.[date]
Use a CTE to get your top parts.
Select from the CTE, join to the price table to get the prices for each part.
Order the prices or bucketize them into months.
Feed it to your graph.
It will be something like this for just one month. If you need the whole year you have to specify clearly what exactly you want to see:
;WITH cte as (
SELECT TOP 10 m0.Part
, Diff = ABS(m0.Price - m1.Price)
, DiffPrc = ABS(m0.Price - m1.Price) / m1.Price
FROM Parts as m0
INNER JOIN (SELECT MaxDate = MAX([Date]) FROM Parts) as md
ON md.MaxDate = m0.[Date]
INNER JOIN Parts as m1 ON m0.Part = m1.Part and DATEADD(MONTH,-1,md.MaxDate) = m1.[Date]
ORDER BY ABS(m0.Price - m1.Price) DESC
-- Top 10 by percentage:
-- ORDER BY ABS(m0.Price - m1.Price) / m1.Price DESC
)
SELECT * FROM Parts as p
INNER JOIN cte ON cte.Part = p.Part
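As a rough illustration of the same month-over-month self-join, here is a sketch that runs against the question's sample rows in SQLite via Python's built-in sqlite3 module. The column name `price_date`, the date-only storage, and the use of LIMIT in place of TOP are assumptions made to fit SQLite; the logic otherwise mirrors the CTE above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table parts (part integer, price real, price_date text)")
conn.executemany(
    "insert into parts values (?, ?, ?)",
    [
        (110, 167.66, "2018-12-01"), (113, 157.82, "2018-12-01"),
        (121, 99.16, "2018-12-01"), (133, 109.82, "2018-12-01"),
        (137, 178.66, "2018-12-01"), (138, 154.99, "2018-12-01"),
        (143, 67.32, "2018-12-01"), (149, 103.82, "2018-12-01"),
        (113, 167.34, "2018-11-01"), (121, 88.37, "2018-11-01"),
        (133, 264.02, "2018-11-01"),
    ],
)

# m0 = latest month, m1 = the month before it. Parts without a price in
# both months drop out because of the inner join.
changes = conn.execute("""
    with md as (select max(price_date) as max_date from parts)
    select m0.part, abs(m0.price - m1.price) as diff
    from parts m0
    join md on m0.price_date = md.max_date
    join parts m1
      on m1.part = m0.part
     and m1.price_date = date(md.max_date, '-1 month')
    order by diff desc
    limit 10
""").fetchall()

top_parts = [part for part, _ in changes]
print(top_parts)
```

Because the latest month is derived with max(price_date), nothing has to be edited by hand when a new month of prices arrives.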
-- Input from the user; you decide in which format the last-month date is passed.
-- In other words, @InputLastMonth is a parameter of the procedure.
-- Suppose it is passed in yyyy-MM-dd format.
DECLARE @InputLastMonth date = '2018-12-31';
-- Local variable (not passed in) marking the start of the last one year of data.
DECLARE @From date = DATEADD(day, 1, DATEADD(month, -12, @InputLastMonth));
DECLARE @TopN int = 10; -- requirement
-- SELECT @InputLastMonth, @From
SELECT TOP (@TopN) parts, ChangePrice
FROM
(
SELECT parts, ABS(MAX(price) - MIN(price)) AS ChangePrice
FROM dbo.Table1
WHERE dates >= @From AND dates <= @InputLastMonth
GROUP BY parts
) t4
ORDER BY ChangePrice DESC;
By "changed the most", I understand the following: suppose there is a part, Part1, whose price was 100 in the first month and changed to 1000 in the last month.
On the other hand, Part2 changed several times during the same period, but the final change was only 12.
In other words, Part1 changed only twice but the difference was huge, while Part2 changed several times but the difference was small.
So Part1 will be preferred.
The second thing is that the change can be negative as well as positive.
Correct me if I have not understood your requirement.
I have a table for a ticket system that has the following cols;
order<string> startDate<datetime> endDate<datetime>
The majority of rows are not duplicated based on a column 'order' however in the scenario where an 'order' crosses from one day to the next it is split into 2 orders, the 1st has an endDate of 5pm (end of day) and the 2nd has a startDate of 8am (start of day). Their corresponding start and end dates are as needed.
Some orders can be >2 days long and so will be split into >2 rows.
example
order startDate endDate
1 2016-03-29 11:00:53.000 2016-03-29 17:00:53.000
1 2016-03-30 08:00:53.000 2016-03-30 12:48:53.000
2 2016-03-30 10:17:53.000 2016-03-30 13:08:53.000
would transform to
1 2016-03-29 11:00:53.000 2016-03-30 12:48:53.000
2 2016-03-30 10:17:53.000 2016-03-30 13:08:53.000
I need to combine all rows to give me a table of unique 'order' ids with start and ends. i.e. a row with the lowest start date of its duplicates and highest enddate of its duplicates.
I plan to do this by creating a new table and populating it. I can choose one of the duplicate rows based on a certain value, but I'm not sure how to create a new row based on values from multiple rows.
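-- order is a reserved word, so it must be quoted
-- ("order" in standard SQL, [order] in SQL Server, `order` in MySQL)
SELECT "order", MIN(startDate) AS startDate, MAX(endDate) AS endDate
FROM your_table_name
GROUP BY "order"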
There may be no need to create a new table for this. GROUP BY queries are extremely common in production usage, and there's no inherent harm in simply running that query to get the results you need when you need them.
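A small sketch of that query on the example rows, run in SQLite through Python's built-in sqlite3 module. The table name `tickets` is hypothetical, and the timestamps are stored as text, which still sorts correctly because they share one fixed format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "order" is a reserved word, so it is quoted wherever it appears.
conn.execute(
    'create table tickets ("order" integer, startDate text, endDate text)'
)
conn.executemany(
    "insert into tickets values (?, ?, ?)",
    [
        (1, "2016-03-29 11:00:53.000", "2016-03-29 17:00:53.000"),
        (1, "2016-03-30 08:00:53.000", "2016-03-30 12:48:53.000"),
        (2, "2016-03-30 10:17:53.000", "2016-03-30 13:08:53.000"),
    ],
)

# One output row per order: earliest start and latest end across its pieces.
merged = conn.execute("""
    select "order", min(startDate) as startDate, max(endDate) as endDate
    from tickets
    group by "order"
    order by "order"
""").fetchall()

for row in merged:
    print(row)
```

If the combined rows are needed repeatedly, wrapping the same SELECT in a view is one way to avoid maintaining a second table.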