SQL Server Active Record counts by month - sql

I've created a database storing Incident tickets.
I have created a fact and a number of dimension tables.
Here is some sample data
+---------------------+--------------+--------------+-------------+------------+
| LastModifiedDateKey | TicketNumber | Status | factCurrent | Date |
+---------------------+--------------+--------------+-------------+------------+
| 2774 | T:9992260 | Open | 1 | 4/12/2017 |
| 2777 | T:9992805 | Open | 1 | 7/12/2017 |
| 2777 | T:9993068 | Open | 1 | 7/12/2017 |
| 2777 | T:9993098 | Open | 0 | 7/12/2017 |
| 2793 | T:9993098 | Acknowledged | 0 | 23/12/2017 |
| 2928 | T:9993098 | Closed | 1 | 5/01/2018 |
| 2777 | T:9993799 | Open | 0 | 7/12/2017 |
| 2928 | T:9993799 | Closed | 1 | 5/01/2018 |
| 2778 | T:9994729 | Open | 1 | 8/12/2017 |
| 2774 | T:9994791 | Open | 0 | 4/12/2017 |
| 2928 | T:9994791 | Closed | 1 | 5/01/2018 |
| 2777 | T:9994912 | Open | 1 | 7/12/2017 |
| 2778 | T:9995201 | Open | 0 | 8/12/2017 |
| 2793 | T:9995201 | Closed | 1 | 23/12/2017 |
| 2931 | T:9718629 | Open | 1 | 8/01/2018 |
| 2933 | T:9718629 | Closed | 1 | 10/01/2018 |
| 2932 | T:9855664 | Open | 1 | 9/01/2018 |
| 2931 | T:9891975 | Open | 1 | 8/01/2018 |
+---------------------+--------------+--------------+-------------+------------+
I want a query that will give me the total of tickets open at the end of each month.
In the data January should have 8 and Feb 2.
Note: that a ticket can have multiple rows with same status because a dimension key has changed or multiple rows with different status all in the same month. e.g. T:9993098.

This approach first uses ROW_NUMBER to identify the most recent record for each ticket, for each month/year. It is assumed that the most recent record in a month will contain the status in which a ticket ended for that month. Then, it aggregates over this modified table, counting only tickets which ended the month in an open status.
SELECT
YEAR(Date) + "-" + MONTH(Date) AS date,
COUNT(*) AS num_open_tickets
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARITION BY YEAR(Date), MONTH(Date), TicketNumber
ORDER BY BY Date DESC) rn
FROM yourTable
) t
WHERE t.rn = 1 AND t.Status = 'Open'
GROUP BY
YEAR(Date) + "-" + MONTH(Date);

First, I would generate the months. Then do a cumulative count of the opens minus the closes. Alas, that is a bit tricky because of the repeated rows for a ticket and because you are using an old version of SQL Server.
But . . . you can do this:
with months as (
select dateadd(day, 1 - day(min(date)), min(date)) as mon_start,
max(date) as max_date
from sample
union all
select dateadd(month, 1, mon_start), max_date
from months
where dateadd(month, 1, mon_start) < max_date
)
select m.mon_end,
(select count(distinct case when status = 'Open' then ticket end) -
count(distinct case when status = 'Closed' then ticket end)
from sample s
where s.date <= m.mon_end
) as open_tickets
from (select dateadd(day, -1, mon_start) as mon_end
from months
) m;
This uses a recursive CTE to generate the months. It is easier to generate the first day of the months and then subtract one day afterwards (what is the date when you add 1 month to the last day of February?)
The rest uses a correlated subquery to count the number of open tickets on that date.

Related

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, from this, I need to have a record for each user for each day where if a day is missing for a user, then the last score recorded should be maintained then I would have something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to to this in BigQuery using StandardSQL. I have an idea of how to keep the same score across following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
if applied to sample data from your question - output is
One option uses generate_date_array() to create the series of dates of each user, then brings the table with a left join.
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select t.user_id, d.date
from mytable t
cross join unnest(generate_date_array(min(date), max(date), interval 1 day)) d(date)
group by t.user_id
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join join
unnest(generate_date_array(date,
coalesce(next_date, date)
interval 1 day
)
) dte;

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
it is also relatively simple to calculate the daily unique uids over a date range.
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then I tried several ways to do a rolling 30 day unique count over a time interval
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.
I am writing this query from Metabase using the SQL query editor. the underlying tables are in redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME |
+-----+-------------------------------+
| | |
| 10 | 2020-01-13T01:46:07.000-05:00 |
| | |
| 5 | 2020-01-13T01:46:07.000-05:00 |
| | |
| 3 | 2020-01-18T02:49:23.000-05:00 |
| | |
| 9 | 2020-03-06T18:18:28.000-05:00 |
| | |
| 2 | 2020-03-06T18:18:28.000-05:00 |
| | |
| 8 | 2020-03-31T23:13:33.000-04:00 |
| | |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 2 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 9 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| | |
| 8 | 2020-09-15T16:40:29.000-04:00 |
| | |
| 3 | 2020-09-21T20:49:09.000-04:00 |
| | |
| 1 | 2020-11-05T21:31:48.000-05:00 |
| | |
| 6 | 2020-11-05T21:31:48.000-05:00 |
| | |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| | |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| | |
| 5 | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
bellow is what the result I would like looks like:
+------------+---------------------+
| DATE | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| | |
| 2020-01-13 | 3 |
| | |
| 2020-01-18 | 1 |
| | |
| 2020-03-06 | 3 |
| | |
| 2020-03-31 | 1 |
| | |
| 2020-08-28 | 4 |
| | |
| 2020-09-15 | 2 |
| | |
| 2020-09-21 | 1 |
| | |
| 2020-11-05 | 2 |
| | |
| 2020-12-12 | 2 |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters count" and "leaves" count for each session.
Accumulate the count so on each day for each user you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine the 30 day session.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users, the key here is the expanded join criteria that matches a range of dates to a single date via a system of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join
dates
on users.dt >= dates.dt - 29
and users.dt <= dates.dt
group by dates.dt
order by dt desc
;

SQL Query to Find Min and Max Values between Values, dates and companies in the same Query

This is to find the historic max and min price of a stock in the same query for every past 10 days from the current date. below is the data. I've tried the query but getting the same high and low for all the rows. The high and low needs to be calculated per stock for a period of 10 days.
RDBMS -- SQL Server 2014
Note: also duration might be past 30 to 2months if required ie... 30 days. or 60 days.
for example, the output needs to be like ABB,16-12-2019,1480 (MaxClose),1222 (MinClose) (test data) in last 10 days.
+------+------------+-------------+
| Name | Date | Close |
+------+------------+-------------+
| ABB | 26-12-2019 | 1272.15 |
| ABB | 24-12-2019 | 1260.15 |
| ABB | 23-12-2019 | 1261.3 |
| ABB | 20-12-2019 | 1262 |
| ABB | 19-12-2019 | 1476 |
| ABB | 18-12-2019 | 1451.45 |
| ABB | 17-12-2019 | 1474.4 |
| ABB | 16-12-2019 | 1480.4 |
| ABB | 13-12-2019 | 1487.25 |
| ABB | 12-12-2019 | 1484.5 |
| INFY | 26-12-2019 | 73041.66667 |
| INFY | 24-12-2019 | 73038.33333 |
| INFY | 23-12-2019 | 73036.66667 |
| INFY | 20-12-2019 | 73031.66667 |
| INFY | 19-12-2019 | 73030 |
| INFY | 18-12-2019 | 73028.33333 |
| INFY | 17-12-2019 | 73026.66667 |
| INFY | 16-12-2019 | 73025 |
| INFY | 13-12-2019 | 73020 |
| INFY | 12-12-2019 | 73018.33333 |
+------+------------+-------------+
The query I tried but no luck
select max([close]) over (PARTITION BY name) AS MaxClose,
min([close]) over (PARTITION BY name) AS MinClose,
[Date],
name
from historic
where [DATE] between [DATE] -30 and [DATE]
and name='ABB'
group by [Date],
[NAME],
[close]
order by [DATE] desc
If you just want the highest and lowest close per name, then simple aggregation is enough:
select name, max(close) max_close, min(close) min_close
from historic
where close >= dateadd(day, -10, getdate())
group by name
order by name
If you want the entire corresponding records, then rank() is a solution:
select name, date, close
from (
select
h.*,
rank() over(partition by name order by close) rn1,
rank() over(partition by name order by close desc) rn2
from historic h
where close >= dateadd(day, -10, getdate())
) t
where rn1 = 1 or rn2 = 1
order by name, date
Top and bottom ties will show up if any.
You can add a where condition to filter on a given name.
If you are looking for a running min/max
Example
Select *
,MinClose = min([Close]) over (partition by name order by date rows between 10 preceding and current row)
,MaxClose = max([Close]) over (partition by name order by date rows between 10 preceding and current row)
From YourTable
Returns

Duplicate records upon joining table

I am still very new to SQL and Tableau however I am trying to work myself towards achieving a personal project of mine.
Table A; shows a table which contains the defect quantity per product category and when it was raised
+--------+-------------+--------------+-----------------+
| Issue# | Date_Raised | Category_ID# | Defect_Quantity |
+--------+-------------+--------------+-----------------+
| PCR12 | 11-Jan-2019 | Product#1 | 14 |
| PCR13 | 12-Jan-2019 | Product#1 | 54 |
| PCR14 | 5-Feb-2019 | Product#1 | 5 |
| PCR15 | 5-Feb-2019 | Product#2 | 7 |
| PCR16 | 20-Mar-2019 | Product#1 | 76 |
| PCR17 | 22-Mar-2019 | Product#2 | 5 |
| PCR18 | 25-Mar-2019 | Product#1 | 89 |
+--------+-------------+--------------+-----------------+
Table B; shows the consumption quantity of each product by month
+-------------+--------------+-------------------+
| Date_Raised | Category_ID# | Consumed_Quantity |
+-------------+--------------+-------------------+
| 5-Jan-2019 | Product#1 | 100 |
| 17-Jan-2019 | Product#1 | 200 |
| 5-Feb-2019 | Product#1 | 100 |
| 8-Feb-2019 | Product#2 | 50 |
| 10-Mar-2019 | Product#1 | 100 |
| 12-Mar-2019 | Product#2 | 50 |
+-------------+--------------+-------------------+
END RESULT
I would like to create a table/bar chart in tableau that shows that Defect_Quantity/Consumed_Quantity per month, per Category_ID#, so something like this below;
+----------+-----------+-----------+
| Month | Product#1 | Product#2 |
+----------+-----------+-----------+
| Jan-2019 | 23% | |
| Feb-2019 | 5% | 14% |
| Mar-2019 | 89% | 10% |
+----------+-----------+-----------+
WHAT I HAVE TRIED SO FAR
Unfortunately i have not really done anything, i am struggling to understand how do i get rid of the duplicates upon joining the tables based on Category_ID#.
Appreciate all the help I can receive here.
I can think of doing left joins on both product1 and 2.
select to_char(to_date(Date_Raised,'d-mon-yyyy'),'mon-yyyy')
, (p2.product1 - sum(case when category_id='Product#1' then Defect_Quantity else 0 end))/p2.product1 * 100
, (p2.product2 - sum(case when category_id='Product#2' then Defect_Quantity else 0 end))/p2.product2 * 100
from tableA t1
left join
(select to_char(to_date(Date_Raised,'d-mon-yyyy'),'mon-yyyy') Date_Raised
, sum(Comsumed_Quantity) as product1 tableB
where category_id = 'Product#1'
group by to_char(to_date(Date_Raised,'d-mon-yyyy'),'mon-yyyy')) p1
on p1.Date_Raised = t1.Date_Raised
left join
(select to_char(to_date(Date_Raised,'d-mon-yyyy'),'mon-yyyy') Date_Raised
, sum(Comsumed_Quantity) as product2 tableB
where category_id = 'Product#2'
group by to_char(to_date(Date_Raised,'d-mon-yyyy'),'mon-yyyy')) p2
on p2.Date_Raised = t1.Date_Raised
group by to_char(to_date(Date_Raised,'d-mon-yyyy'),'mon-yyyy')
By using ROW_NUMBER() OVER (PARTITION BY ORDER BY ) as RN, you can remove duplicate rows. As of your end result you should extract month from date and use pivot to achieve.
I would do this as:
select to_char(date_raised, 'YYYY-MM'),
(sum(case when product = 'Product#1' then defect_quantity end) /
sum(case when product = 'Product#1' then consumed_quantity end)
) as product1,
(sum(case when product = 'Product#2' then defect_quantity end) /
sum(case when product = 'Product#2' then consumed_quantity end)
) as product2
from ((select date_raised, product, defect_quantity, 0 as consumed_quantity
from a
) union all
(select date_raised, product, 0 as defect_quantity, consumed_quantity
from b
)
) ab
group by to_char(date_raised, 'YYYY-MM')
order by min(date_raised);
(I changed the date format because I much prefer YYYY-MM, but that is irrelevant to the logic.)
Why do I prefer this method? This will include all months where there is a row in either table. I don't have to worry that some months are inadvertently filtered out, because there are missing production or defects in one month.

T-SQL - Turn table with current page and previous pages into a sequential order per session

I'm trying to create a table to show the activy per session on a website.
Should look like something like that
Prefered table:
+------------+---------+--------------+-----------+
| SessionID | PageSeq| Page | Duration |
+------------+---------+--------------+-----------+
| 1 | 1 | Home | 5 |
| 1 | 2 | Sales | 10 |
| 1 | 3 | Contact | 9 |
| 2 | 1 | Sales | 5 |
| 3 | 1 | Home | 30 |
| 3 | 2 | Sales | 5 |
+------------+---------+--------------+-----------+
Unfortunetly my current dataset doesn't have information about the session_id, but can be deducted based on the time and the path.
Current table:
+------------------+---------+------------+---------------+----------+
| DATE_HOUR_MINUTE | Page | Prev_page | Total_session | Duration |
+------------------+---------+------------+---------------+----------+
| 201801012020 | Home | (entrance) | 24 | 5 |
| 201801012020 | Sales | Home | 24 | 10 |
| 201801012020 | Contact | Sales | 24 | 9 |
| 201801012020 | Sales | (entrance) | 5 | 5 |
| 201801012020 | Home | (entrance) | 35 | 30 |
| 201801012020 | Sales | Home | 35 | 5 |
+------------------+---------+------------+---------------+----------+
What is the best way to turn the current table into the prefered table format?
I've tried searching for nested tables, looped tables, haven't found a something related to this problem yet.
So if you can risk sessions starting at the same time with the same duration, should be easy enough to do using a recursive query.
;WITH sessionTree AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) as sessionId
, 1 AS PageSeq
, *
FROM Session
WHERE PrevPage = '(entrance)'
UNION ALL
SELECT prev.sessionId
, prev.PageSeq + 1
, next.*
FROM sessionTree prev
JOIN Session next
ON next.TotalDuration = prev.TotalDuration
AND next.PrevPage = prev.Page
AND next.date_hour_minute >= prev.date_hour_minute
)
SELECT * FROM sessionTree
ORDER BY sessionId, PageSeq
sessionId is generated for each entry with (entrance) as prevPage, with PageSeq = 1. Then in the recursive part visits with the timestamp later than the previous page and with the same duration are joined on prev.page = next.PrevPage condition.
Here's a working example on dbfiddle