Creating additional date rows for non-existent data (Redshift SQL)

I am looking at agent data and want to create an overview of their sales performance over the last 6 months. Some agents have only just started (for example, 3 months ago), but I want to create a view where there are always 6 rows for each agent, no matter when she/he started; there just won't be any sales listed in the extra rows. This view is important because I want to be able to average the values and show different granularities at some point (agent level, team level, etc.).
I am working with Redshift SQL and have the agent data. That is my query:
select date, id, name, team, country, sum(sales)
from agent
where date >= date_trunc('month', dateadd(month, -6, current_date)) and date <= current_date
group by 1,2,3,4,5
order by 1
Which gives me the output below (without the green rows). How could I add additional rows/months for Roman, an agent who started in February? Any ideas or suggestions?

Assuming that all dates are available in the table (as shown in your sample data), one option is to cross join the available dates with the list of agents to generate all possible combinations, then bring the original table with a left join:
select d.date, n.id, n.name, n.team, n.country, sum(a.sales)
from (select distinct date from agent) d
cross join (select distinct id, name, team, country from agent) n
left join agent a on a.date = d.date and a.id = n.id
group by 1, 2, 3, 4, 5
order by 1, 2
This assumes that id uniquely identifies an agent; otherwise, you would need additional join conditions in the left join (on name, team, country).
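As a rough illustration of the cross join / left join pattern above, here is a small sketch that runs the same query shape against an in-memory SQLite database from Python. The sample rows and the agent "Anna" are made up for the demo (only Roman comes from the question), and SQLite stands in for Redshift, so treat it as a sketch rather than a Redshift-tested solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: Anna has sales in all three months, Roman starts in February.
conn.executescript("""
CREATE TABLE agent (date TEXT, id INTEGER, name TEXT, team TEXT, country TEXT, sales INTEGER);
INSERT INTO agent VALUES
  ('2021-01-01', 1, 'Anna',  'A', 'DE', 10),
  ('2021-02-01', 1, 'Anna',  'A', 'DE', 20),
  ('2021-03-01', 1, 'Anna',  'A', 'DE', 30),
  ('2021-02-01', 2, 'Roman', 'B', 'DE', 5),
  ('2021-03-01', 2, 'Roman', 'B', 'DE', 7);
""")

# Every available date is paired with every agent, then the sales are left-joined in.
query = """
SELECT d.date, n.id, n.name, n.team, n.country, SUM(a.sales) AS sales
FROM (SELECT DISTINCT date FROM agent) d
CROSS JOIN (SELECT DISTINCT id, name, team, country FROM agent) n
LEFT JOIN agent a ON a.date = d.date AND a.id = n.id
GROUP BY d.date, n.id, n.name, n.team, n.country
ORDER BY d.date, n.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Roman gets a January row whose sales value is NULL (None in Python). Wrapping the sum in COALESCE(..., 0) would turn that into 0, at the cost of pulling averages down if the empty months should not count.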

Related

How to have different restrictions to calculate max(Date) and min(Date) in one SELECT statement

I need a query that will return the earliest and latest hour of the transaction for a specific day.
The issue is that I often get the earliest transaction before 5 AM, where I want to include them only if they are later than 5 AM. But with the latest transaction, I want to include every transaction, also ones that happened earlier than 5 AM (due to some shops being open overnight).
Below is what my script looks like. Is there any possibility of giving different restrictions to how I calculate max(s.Date) and min(s.Date)? I thought of creating two select statements but not sure how to connect them within one FROM.
from (
select l.Name,
s.ShopID,
Day,
Time,
s.Date,
max(s.Date) over (partition by s.Day) as max_date ,
min(s.Date) over (partition by s.Day) as min_date
from [Shops].[Transaction].[Transactions] s
INNER JOIN [Shops].[Location].[Locations] l ON s.ShopID= l.ShopID
WHERE s.ShopID IN (1, 2, 3, 4, 5) AND Day > 20210131 AND Time <> 4
) t
You can implement this in several ways. The easiest in my opinion is to set the condition directly in the aggregate. Like:
from (
select l.Name,
s.ShopID,
Day,
Time,
s.Date,
max(s.Date) over (partition by s.Day) as max_date ,
min(IIF(DATEPART(HOUR, s.Date) > 5, s.date, NULL)) over (partition by s.Day) as min_date
from [Shops].[Transaction].[Transactions] s
INNER JOIN [Shops].[Location].[Locations] l ON s.ShopID= l.ShopID
WHERE s.ShopID IN (1, 2, 3, 4, 5) AND Day > 20210131 AND Time <> 4
) t
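UPDATE: A little clarification. Only rows whose hour, DATEPART(HOUR, s.Date), is greater than 5 take part in the MIN aggregation. For all other rows the value is set to NULL, which MIN ignores. Note that > 5 also drops transactions between 05:00 and 05:59; use >= 5 if those should count.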
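The conditional-aggregate idea can be tried out in isolation. The sketch below is an assumption-laden stand-in: it uses SQLite from Python, a hypothetical transactions table with made-up rows, CASE WHEN instead of IIF, strftime('%H', ...) instead of DATEPART, and a plain GROUP BY rather than the window functions in the answer. It mirrors the hour > 5 test from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: one shop, three days, some transactions before 5 AM.
conn.executescript("""
CREATE TABLE transactions (shop_id INTEGER, day INTEGER, ts TEXT);
INSERT INTO transactions VALUES
  (1, 20210201, '2021-02-01 03:15:00'),
  (1, 20210201, '2021-02-01 08:30:00'),
  (1, 20210201, '2021-02-01 22:45:00'),
  (1, 20210202, '2021-02-02 01:10:00'),
  (1, 20210202, '2021-02-02 09:00:00'),
  (1, 20210203, '2021-02-03 02:00:00');
""")

# MAX sees every row; MIN only sees rows whose hour is greater than 5,
# because the CASE expression yields NULL for the others and MIN skips NULLs.
query = """
SELECT day,
       MAX(ts) AS max_date,
       MIN(CASE WHEN CAST(strftime('%H', ts) AS INTEGER) > 5 THEN ts END) AS min_date
FROM transactions
GROUP BY day
ORDER BY day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the third day the only transaction is at 02:00, so max_date is still filled in while min_date comes back as NULL.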

Running Total - Create row for months that don't have any sales in the region (1 row for each region in each month)

I am working on the below query that I will use inside Tableau to create a line chart that will be color-coded by year and will use the region as a filter for the user. The query works, but I found there are months in regions that don't have any sales. These sections break up the line chart and I am not able to fill in the missing spaces (I am using a non-date dimension on the X-Axis - Number of months until the end of its fiscal year).
I am looking for some help to alter my query to create a row for every month and every region in my dataset so that my running total will have a value to display in the line chart. if there are no values in my table, then = 0 and update the running total for the region.
I have a dimDate table and also a Regions table I can use in the query.
My query's results now (sorted in Excel for easier viewing): [image: Results Table Now]
What I want to do, with the new rows highlighted in yellow: [image: What I want to do]
My Code using SQL Server:
SELECT b.gy,
       b.sales_month,
       b.region,
       b.gs_year_total,
       b.months_away,
       Sum(b.gs_year_total) OVER (
           PARTITION BY b.gy, b.region
           ORDER BY b.months_away DESC) RT_by_Region_GY
FROM (SELECT a.gy,
             a.region,
             a.sales_month,
             Sum(a.gy_total) Gs_Year_Total,
             a.months_away
      FROM (SELECT g.val_id,
                   g.[gs year] AS GY,
                   g.sales_month AS Sales_Month,
                   g.gy_total,
                   Datediff(month, g.sales_month, dt.lastdayofyear) AS months_away,
                   g.value_type,
                   val.region
            FROM uv_sales g
            JOIN dbo.dimdate AS dt
              ON g.[gs year] = dt.gsyear
            JOIN dimvalsummary val
              ON g.val_id = val.val_id
            WHERE g.[gs year] IN ( 2017, 2018, 2019, 2020, 2021 )
            GROUP BY g.valuation_id,
                     g.[gs year],
                     val.region,
                     g.sales_month,
                     dt.lastdayofyear,
                     g.gy_total,
                     g.value_type) a
      WHERE a.months_away >= 0
        AND sales_month < Dateadd(month, -1, Getdate())
      GROUP BY a.gy,
               a.region,
               a.sales_month,
               a.months_away) b
It's tough to envision the best method without sample data or knowing the meaning of all those fields, so here's a rough, untested sketch of how one might approach it.
Create a table called all_months and insert all the months from oldest to whatever date in the future you need.
01/01/2017
02/01/2017
...
12/01/2049
You may need one query per region, unioned together. Select the year and month from the all_months table and left join it to your other table on month, then coalesce your dollar values.
select 'East' as region,
year(m.month) as gy_year, -- YEAR() rather than EXTRACT(), which SQL Server does not support
m.month as sales_month,
coalesce(g.gy_total, 0) as gy_total,
datediff(month, m.month, dt.lastdayofyear) as months_away
from all_months m
left join uv_sales g on g.sales_month = m.month
--and so on
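Since the question mentions that a Regions table is available, one alternative to a query per region is to cross join the month list with the regions. The sketch below tries that idea in SQLite from Python (window functions need SQLite 3.25 or later). The table layouts, the sample figures, and the simplified sales table are all assumptions for the demo; it also leaves out the fiscal-year partition and the months_away column from the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables: a month list, a region list, and sparse sales.
conn.executescript("""
CREATE TABLE all_months (month TEXT);
INSERT INTO all_months VALUES ('2021-01-01'), ('2021-02-01'), ('2021-03-01');
CREATE TABLE regions (region TEXT);
INSERT INTO regions VALUES ('East'), ('West');
CREATE TABLE sales (region TEXT, sales_month TEXT, gy_total INTEGER);
INSERT INTO sales VALUES
  ('East', '2021-01-01', 100),
  ('East', '2021-03-01', 50),
  ('West', '2021-02-01', 70);
""")

# Step 1: one row per (region, month), with 0 where there were no sales.
# Step 2: running total per region over those complete rows.
query = """
WITH monthly AS (
  SELECT r.region,
         m.month AS sales_month,
         COALESCE(SUM(s.gy_total), 0) AS month_total
  FROM all_months m
  CROSS JOIN regions r
  LEFT JOIN sales s ON s.sales_month = m.month AND s.region = r.region
  GROUP BY r.region, m.month
)
SELECT region, sales_month, month_total,
       SUM(month_total) OVER (PARTITION BY region ORDER BY sales_month) AS running_total
FROM monthly
ORDER BY region, sales_month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Months without sales now appear with a month_total of 0 and carry the previous running total forward, which is what keeps the line chart unbroken.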

SQL Retention Cohort Analysis

I am trying to write a query for monthly retention, to calculate percentage of users returning from their initial start month and moving forward.
TABLE: customer_order
fields
id
date
store_id
TABLE: customer
id
person_id
job_id
first_time (bool)
This gets me the initial monthly cohorts based on the first dates
SELECT first_job_month, COUNT( DISTINCT person_id) user_counts
FROM
( SELECT DATE_TRUNC(MIN(CAST(date AS DATE)), month) first_job_month, person_id
FROM customer_order cd
INNER JOIN consumer co ON co.job_id = cd.id
GROUP BY 2
ORDER BY 1 ) first_d GROUP BY 1 ORDER BY 1
first_job_month user_counts
2018-04-01 36
2018-05-01 37
2018-06-01 39
2018-07-01 45
2018-08-01 38
I have tried a bunch of things, but I can't figure out how to keep track of the original cohorts/users from the first month onwards
1. Get the first order month for every customer.
2. Join orders to the previous subquery to find the difference in months between each order and the first order.
3. Use conditional aggregates to count customers that still order by month X.
There are some alternative options like using window functions to do (1) and (2) in the same subquery but the easiest option is this one:
WITH
cohorts as (
SELECT person_id, DATE_TRUNC(MIN(CAST(date AS DATE)), month) as first_job_month
FROM customer_order cd
JOIN consumer co
ON co.job_id = cd.id
GROUP BY 1
)
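,orders as (
-- BigQuery syntax assumed, as in DATE_TRUNC(..., month) above;
-- customer_order has no person_id, so the join goes through consumer
SELECT
c.person_id
,c.first_job_month
,DATE_DIFF(CAST(cd.date AS DATE), c.first_job_month, month) as months_since_first_order
FROM cohorts c
JOIN consumer co
ON co.person_id = c.person_id
JOIN customer_order cd
ON cd.id = co.job_id
)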
SELECT
first_job_month as cohort
,count(distinct person_id) as size
,count(distinct case when months_since_first_order>=1 then person_id end) as m1
,count(distinct case when months_since_first_order>=2 then person_id end) as m2
,count(distinct case when months_since_first_order>=3 then person_id end) as m3
-- hardcode up to the number of months you want and the history you have
FROM orders
GROUP BY 1
ORDER BY 1
As you can see, you can use CASE expressions inside aggregate functions like COUNT to pick out different subsets of rows to aggregate within the same group. This is one of the most important BI techniques in SQL.
Note that >= rather than = is used in the conditional aggregates, so a customer who buys in m1 and m3 but not in m2 is still counted in m2. If you want the actual retention for each month (customers who ordered in exactly that month), and you are fine with later months possibly showing higher values than earlier ones, use = instead.
Also, if you don't want the "triangle" view like one you get from this query or you don't want to hardcode the "mX" part you would just group by first_job_month and months_since_first_order and count distinct. Some visualization tools might consume this simple format and make a triangle view out of it.
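The non-hardcoded "long" format mentioned above can be sketched as follows. This is only an illustration in SQLite from Python: the orders table is a hypothetical, already-joined simplification of customer_order and consumer, the rows are made up, and the month difference is computed by hand from year and month numbers because SQLite has no DATE_DIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical simplified table: one row per order, already carrying person_id.
conn.executescript("""
CREATE TABLE orders (person_id INTEGER, order_date TEXT);
INSERT INTO orders VALUES
  (1, '2018-04-03'), (1, '2018-05-10'), (1, '2018-06-01'),
  (2, '2018-04-20'), (2, '2018-06-15'),
  (3, '2018-05-05');
""")

# cohorts: first order date per person.
# activity: for each order, how many calendar months after the first order it fell.
query = """
WITH cohorts AS (
  SELECT person_id, MIN(order_date) AS first_order
  FROM orders
  GROUP BY person_id
),
activity AS (
  SELECT o.person_id,
         strftime('%Y-%m-01', c.first_order) AS cohort,
         (CAST(strftime('%Y', o.order_date) AS INTEGER) * 12
            + CAST(strftime('%m', o.order_date) AS INTEGER))
       - (CAST(strftime('%Y', c.first_order) AS INTEGER) * 12
            + CAST(strftime('%m', c.first_order) AS INTEGER)) AS months_since_first_order
  FROM orders o
  JOIN cohorts c ON c.person_id = o.person_id
)
SELECT cohort, months_since_first_order, COUNT(DISTINCT person_id) AS users
FROM activity
GROUP BY cohort, months_since_first_order
ORDER BY cohort, months_since_first_order
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row is one cell of the retention triangle, counted with "=" semantics (users active in exactly that month), so a visualization tool can pivot it without any mX columns being hardcoded.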

Average Group size per month Over previous ten years

I need to find the average size (average number of employees) of all the groups (employers) that we do business with per month for the last ten years.
So I have no problem getting the average group size for each month. For the Current month I can use the following:
Select count(*)
from Employees EE
join Employers ER on EE.employerid = ER.employerid
group by ER.EmployerName
This will give me a list of how many employees are in each group. I can then copy and paste the column into Excel to get the average for the current month.
For the previous month, I want to exclude any employees that were added after that month. I have a query for this too:
Select count(*)
from Employees EE
join Employers ER on EE.employerid = ER.employerid
where EE.dateadded <= DATEADD(month, -1,GETDATE())
group by ER.EmployerName
That will exclude all employees that were added this month. I could continue doing this all the way back ten years, running the query 120 times and copying and pasting the results into Excel to compute the averages, but I know there is a better way and I'd rather learn a more efficient way to do this.
Another Question, I can't do the following, anyone know a way around it:
Select avg(count(*))
Thanks in advance guys!!
Edit: Employees that have been terminated can be found like this. NULL are employees that are currently employed.
Select count(*)
from Employees EE
join Employers ER on EE.employerid = ER.employerid
join Gen_Info gne on gne.id = EE.newuserid
where EE.dateadded <= DATEADD(month, -1,GETDATE())
and (gne.TerminationDate is NULL OR gne.TerminationDate < DATEADD(day, -14,GETDATE()))
group by ER.EmployerName
Are you after a query that shows the count by the year and month they were added? If so, this seems pretty straightforward.
This uses the YEAR and MONTH date functions (as in MySQL; SQL Server has them too).
Select AVG(cnt) FROM (
Select count(*) cnt, Year(dateAdded), Month(dateAdded)
from System_Users su
join system_Employers se on se.employerid = su.employerid
group by Year(dateAdded), Month(dateAdded)) B
The inner query breaks the counts out by year and month. We then wrap that in an outer query to show the average.
--2nd attempt but I'm Brain FriDay'd out.
This uses a Common table Expression (CTE) to generate a set of data for the count by Year, Month of the employees, and then averages out by month.
If this isn't what you're after, sample data with expected results would help frame the question better, so that I can avoid making assumptions about what you need/want.
With CTE AS (
Select Year(dateAdded) YR , Month(DateAdded) MO, count(*) over (partition by Year(dateAdded), Month(dateAdded) order by DateAdded Asc) as RunningTotal
from System_Users su
join system_Employers se on se.employerid = su.employerid)
-- ORDER BY is not allowed inside the CTE; the outer query needs a GROUP BY for mo
Select mo, avg(RunningTotal) from cte group by mo order by mo;
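For the original goal (average group size per month without running 120 separate queries), one possible approach is to join the employees against a list of month-end dates and nest the count inside a CTE, which also works around avg(count(*)) not being allowed. The sketch below uses SQLite from Python with hypothetical tables and made-up rows. It ignores termination dates and only shows three months; in practice the month_ends list would hold 120 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: employer 10 grows over time, employer 20 has one employee.
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, employerid INTEGER, dateadded TEXT);
INSERT INTO employees VALUES
  (1, 10, '2020-12-15'),
  (2, 10, '2021-01-20'),
  (3, 10, '2021-02-10'),
  (4, 20, '2021-01-05');
CREATE TABLE month_ends (month_end TEXT);
INSERT INTO month_ends VALUES ('2021-01-31'), ('2021-02-28'), ('2021-03-31');
""")

# group_sizes: head count per employer as of each month end.
# Outer query: average of those head counts per month (the "avg of count").
query = """
WITH group_sizes AS (
  SELECT m.month_end, e.employerid, COUNT(*) AS group_size
  FROM month_ends m
  JOIN employees e ON e.dateadded <= m.month_end
  GROUP BY m.month_end, e.employerid
)
SELECT month_end, AVG(group_size) AS avg_group_size
FROM group_sizes
GROUP BY month_end
ORDER BY month_end
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Employers with no employees yet at a given month end do not appear in group_sizes, so they are left out of that month's average rather than counted as zero.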

How to join two queries with different GROUP BY levels, leaving some records null

In MS Access, using SQL, I've combined two queries with inner join that both require the user to input a Start Date and End Date range. The first query (query 1) lists the count of how many people have left the program, the situation with which they left, the month, and the year they left. It is grouping the Count function first on the year, then month, then the leaving situation (which can be any of 5 options), which means that there are multiple records for each month (but not necessarily the same number of records for each month). The second query (query 2) counts the number of people we've admitted to the program, the month, and the year. It is grouped first on the year, then month. So, with this query, there is only one record per month. My inner join combines the queries correctly, except that it repeats query 2's values multiple times, depending on how many records each of query 1's months have. Is there a way to have query 2's values only listed once per month, therefore leaving the rest of the records for that month null?
Here's query 1:
SELECT Count(clients.ssn) AS CountOfDepartures, clients.[leaving situation], a.monthname, a.year1, a.month1
FROM clients INNER JOIN (SELECT month(clients.[departure date]) AS Month1, year(clients.[departure date]) AS Year1, months.monthname, clients.ssn FROM clients
INNER JOIN months ON month(clients.[departure date])=months.monthnumber WHERE clients.[departure date] BETWEEN [Enter Start Date] AND [Enter End Date]) AS A
ON clients.ssn=a.ssn
GROUP BY a.year1, a.monthname, clients.[leaving situation], a.month1
ORDER BY a.year1 DESC , a.month1 DESC;
Here's query 2
SELECT Count(clients.ssn) AS CountofIntakes, b.monthname, b.year2, b.month2
FROM clients
INNER JOIN (SELECT month(clients.prog_start) AS Month2, year(clients.prog_start) AS Year2, months.monthname, clients.ssn
FROM clients INNER JOIN months ON month(clients.prog_start)=months.monthnumber WHERE clients.prog_start BETWEEN [Enter Start Date] AND [Enter End Date]) AS B
ON clients.ssn=b.ssn
GROUP BY b.monthname, b.year2, b.month2
ORDER BY b.year2 DESC , b.month2 DESC;
Here's how I combined them, but it gives me the repeating values:
SELECT countofdeparturesbyleavingsituationmonth.countofdepartures, countofdeparturesbyleavingsituationmonth.[leaving situation], countofdeparturesbyleavingsituationmonth.monthname, countofdeparturesbyleavingsituationmonth.year1, countofdeparturesbyleavingsituationmonth.month1, countofintakesbymonth.countofintakes
FROM countofdeparturesbyleavingsituationmonth
INNER JOIN countofintakesbymonth ON (countofdeparturesbyleavingsituationmonth.monthname=countofintakesbymonth.monthname) AND (countofdeparturesbyleavingsituationmonth.year1=countofintakesbymonth.year2) AND (countofdeparturesbyleavingsituationmonth.monthname=countofintakesbymonth.monthname)
ORDER BY year1 DESC , month1 DESC;
The CLIENTS table has a record for each client with a bunch of columns for different clinical data (I work for a non-profit drug/alcohol rehabilitation center). The MONTHS table just has the twelve months written out in one column with a corresponding number in the second column. I use the inner join with the MONTHS table in order to list the month name rather than the number (though I just realized that I could probably do this in my report with the MonthName function). Any advice is very much appreciated!
Also, I'm asking this question because when I make a report on this query the SUM value for the total report (and per calendar year) for the intakes is incorrect because, in the query, each month has multiple intake values. So the SUM is much larger than it should be. Beyond that issue, this query generates the correct report.
You could use the DISTINCT clause:
SELECT DISTINCT
queryA.people,
Null as situation,
queryA.[month],
queryA.[year]
FROM
queryA INNER JOIN queryB
ON (queryA.people=queryB.people)
AND (queryA.[month]=queryB.[month])
AND (queryA.[year]=queryB.[year])