find every n-th date in a continuous date stream - sql

I would like to find/mark every 4th day in a continuous date stream inserted into my table for each user in a given date range.
CREATE TABLE mytable (
id INTEGER,
myuser INTEGER,
day DATE NOT NULL,
PRIMARY KEY (id)
);
The problem is that only 3 continuous days are valid per user; after that, there has to be a one-day "break":
id | myuser | day |
-----+--------+------------+
0 | 200 | 2012-01-12 | }
1 | 200 | 2012-01-13 | }--> 3 continuous days
2 | 200 | 2012-01-14 | }
3 | 200 | 2012-01-15 | <-- not ok, user 200 should get warned and delete this
4 | 200 | 2012-01-16 | }
5 | 200 | 2012-01-17 | }--> 3 continuous days
6 | 200 | 2012-01-18 | }
7 | 200 | 2012-01-19 | <-- not ok, user 200 should get warned and delete this
8 | 201 | 2012-01-12 | }
9 | 201 | 2012-01-13 | }--> 3 continuous days
10 | 201 | 2012-01-14 | }
11 | 201 | 2012-01-16 | <-- ok, there is a one day gap here
12 | 201 | 2012-01-17 |
The main goal is to look at a given date range (usually a month) and identify days which are not allowed. I also have to take care that overlapping dates are handled correctly: for example, if I look at a date range from 2012-02-01 to 2012-02-29, 2012-02-01 could be a "break" day if 2012-01-29 to 2012-01-31 are present in the table for the same user.

I don't have access to PostgreSQL, but hopefully this works...
WITH grouped_data AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY myuser ORDER BY day) - (day - start_date) AS user_group_id,
        myuser,
        day
    FROM
        mytable
    WHERE
        day >= start_date - 3
        AND day <= end_date
),
sequenced_data AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY myuser, user_group_id ORDER BY day) AS sequence_id,
        myuser,
        day
    FROM
        grouped_data
)
SELECT
    myuser,
    day,
    CASE WHEN sequence_id % 4 = 0 THEN 1 ELSE 0 END AS should_be_a_break_day
FROM
    sequenced_data
WHERE
    day >= start_date
Sorry I didn't explain the workings, I had to jump into a meeting :)
Example with start_date = '2012-01-14'...
id | myuser | day | ROW_NUMBER() | day - start_date | user_group_id
----+--------+------------+--------------+------------------+---------------
0 | 200 | 2012-01-12 | 1 | -2 | 1 - -2 = 3
1 | 200 | 2012-01-13 | 2 | -1 | 2 - -1 = 3
2 | 200 | 2012-01-14 | 3 | 0 | 3 - 0 = 3
3 | 200 | 2012-01-15 | 4 | 1 | 4 - 1 = 3
4 | 200 | 2012-01-16 | 5 | 2 | 5 - 2 = 3
----+--------+------------+--------------+------------------+---------------
5 | 201 | 2012-01-12 | 1 | -2 | 1 - -2 = 3
6 | 201 | 2012-01-13 | 2 | -1 | 2 - -1 = 3
7 | 201 | 2012-01-14 | 3 | 0 | 3 - 0 = 3
8 | 201 | 2012-01-16 | 4 | 2 | 4 - 2 = 2
Any sequential dates will have the same user_group_id. Each missing day in a 'gap' makes that user_group_id decrease by 1 (see row 8: if the record were for the 17th, a two-day gap, the id would have been 1).
Once you have a group_id, row_number() can easily be used to say which day in the sequence it is. A max of 3 days is the same as "every 4th day should be a gap", and "x % 4 = 0" identifies every 4th day.
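To see the grouping trick in isolation, here is a minimal, self-contained sketch (PostgreSQL syntax; the inline VALUES rows and the fixed start date of 2012-01-12 are made up for illustration):
WITH sample(myuser, day) AS (
    VALUES (200, DATE '2012-01-12'),
           (200, DATE '2012-01-13'),
           (200, DATE '2012-01-15')  -- one missing day before this row
)
SELECT myuser,
       day,
       ROW_NUMBER() OVER (PARTITION BY myuser ORDER BY day)
           - (day - DATE '2012-01-12') AS user_group_id
FROM sample;
-- the first two rows share user_group_id = 1; the row after the gap drops to 0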

Much simpler and faster with the window function lag():
SELECT myuser
,day
,COALESCE(lag(day, 3) OVER (PARTITION BY myuser ORDER BY day) = (day - 3)
,FALSE) AS break_overdue
FROM mytable
WHERE day BETWEEN ('2012-01-12'::date - 3) AND '2012-01-16'::date;
Result:
myuser | day | break_overdue
--------+------------+---------------
200 | 2012-01-12 | f
200 | 2012-01-13 | f
200 | 2012-01-14 | f
200 | 2012-01-15 | t
200 | 2012-01-16 | t
201 | 2012-01-12 | f
201 | 2012-01-13 | f
201 | 2012-01-14 | f
201 | 2012-01-16 | f
Major points:
The query marks all days as break_overdue after three consecutive days. It is unclear whether you want all of them marked after the rule has been broken or just every 4th day.
I include 3 days before the start date (not just two) to determine whether the first day is already in violation of the rule.
The test is simple: if the 3rd row before the current row within the partition equals the current day - 3 then the rule has been broken. I wrap it all in COALESCE to fold NULL values to FALSE for cosmetic reasons only. Guaranteed to work as long as (myuser, day) is unique.
In PostgreSQL you can subtract integers from a date, effectively subtracting days.
Can be done in a single query level, no CTE or subquery needed. Should be much faster.
You need PostgreSQL 8.4 or later for window functions.
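If you only want rows from the requested range in the output (the query above scans, and also returns, the 3 lead-in days), one option is to wrap it in a subquery and filter afterwards, just like the first answer does. A sketch with the same example window:
SELECT myuser, day, break_overdue
FROM (
    SELECT myuser
          ,day
          ,COALESCE(lag(day, 3) OVER (PARTITION BY myuser ORDER BY day) = (day - 3)
          ,FALSE) AS break_overdue
    FROM mytable
    WHERE day BETWEEN ('2012-01-12'::date - 3) AND '2012-01-16'::date
    ) sub
WHERE day >= '2012-01-12'::date;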

Related

Calculating user retention on daily basis between the dates in SQL

I have a table that has data about user_ids and all their log_in dates to the app.
Table:
|----------|--------------|
| User_Id | log_in_dates |
|----------|--------------|
| 1 | 2021-09-01 |
| 1 | 2021-09-03 |
| 2 | 2021-09-02 |
| 2 | 2021-09-04 |
| 3 | 2021-09-01 |
| 3 | 2021-09-02 |
| 3 | 2021-09-03 |
| 3 | 2021-09-04 |
| 4 | 2021-09-03 |
| 4 | 2021-09-04 |
| 5 | 2021-09-01 |
| 6 | 2021-09-01 |
| 6 | 2021-09-09 |
|----------|--------------|
From the above table, I'm trying to understand the users' log-in behavior from the present day back over the past 90 days.
Num_users_no_log_in is the count of users who haven't logged in to the app from the present_date back to a previous day (the last_log_in_date).
I want the table like below:
|---------------|------------------|--------------------|-------------------------|
| present_date | days_difference | last_log_in_date | Num_users_no_log_in |
|---------------|------------------|--------------------|-------------------------|
| 2021-09-01 | 0 | 2021-09-01 | 0 |
| 2021-09-02 | 1 | 2021-09-01 | 3 |->(Id = 1,5,6)
| 2021-09-02 | 0 | 2021-09-02 | 3 |->(Id = 1,5,6)
| 2021-09-03 | 2 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-03 | 1 | 2021-09-02 | 1 |->(Id = 2)
| 2021-09-03 | 0 | 2021-09-03 | 3 |->(Id = 2,5,6)
| 2021-09-04 | 3 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-04 | 2 | 2021-09-02 | 0 |
| 2021-09-04 | 1 | 2021-09-03 | 1 |->(Id= 1)
| 2021-09-04 | 0 | 2021-09-04 | 3 |->(Id = 1,5,6)
| .... | .... | .... | ....
|---------------|------------------|--------------------|-------------------------|
I was able to get the first three columns Present_date | days_difference | last_log_in_date using the following query:
with dts as
(
    select distinct log_in_dates from users_table
)
select x.log_in_dates as present_date,
       DATEDIFF(DAY, y.log_in_dates, x.log_in_dates) as days_difference,
       y.log_in_dates as last_log_in_date
from dts x, dts y
where x.log_in_dates >= y.log_in_dates
I don't understand how I can get the fourth column Num_users_no_log_in
I do not really understand your need: are these values based on users or on dates? If it's based on dates, as it looks like (otherwise you would probably have user_id as the first column), what does it mean to have the same date appear multiple times? I understand that you would like to have a recap for all dates from the beginning until the current date, but in my opinion it does not really make sense (imagine your dashboard in 1 year!!)
Once this is said, let's go to the approach.
In such cases, I develop step by step using common table expressions. For your example, it requires 3 steps:
prepare the time series
integrate the connections' dates and perform the first calculation (time difference)
finally, calculate the number of users with no connection per day
Then, the final query will display the desired result.
Here is the query I propose, developed with PostgreSQL (you did not specify your DBMS, but converting should not be such a big deal here):
with init_calendar as (
    -- Prepare the date series and count total users
    select generate_series(min(log_in_dates), now(), interval '1 day') as present_date,
           count(distinct user_id) as nb_users
    from users
),
calendar as (
    -- Add the connections' dates for each period from the beginning to the current date
    -- and calculate the difference in days for each of them
    -- Syntax may vary depending on the dbms used
    select distinct present_date, log_in_dates as last_date,
           extract(day from present_date - log_in_dates) as days_difference,
           nb_users
    from init_calendar
    join users on log_in_dates <= present_date
),
usr_con as (
    -- Identify each user's last connection date as of the running date
    -- Tag the line to be counted as no connection
    select c.present_date, c.last_date, c.days_difference, c.nb_users,
           u.user_id, max(log_in_dates) as last_con,
           case when max(log_in_dates) = present_date then 0 else 1 end as to_count
    from calendar c
    join users u on u.log_in_dates <= c.last_date
    group by c.present_date, c.last_date, c.days_difference, c.nb_users, u.user_id
)
select present_date, last_date, days_difference,
       nb_users - sum(to_count) as Num_users_no_log_in
from usr_con
group by present_date, last_date, days_difference, nb_users
order by present_date, last_date
Please note that there is a difference with your own expected result as you forgot user_id = 3 in your calculation.
If you want to play with the query, you can with dbfiddle
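As a side note, if generate_series is unfamiliar: it is what builds the calendar in the first CTE. Here it is in isolation, with hypothetical fixed bounds in place of min(log_in_dates) and now():
select generate_series(date '2021-09-01', date '2021-09-04', interval '1 day') as present_date;
-- one row per day: 2021-09-01, 2021-09-02, 2021-09-03, 2021-09-04 (as timestamps)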

How do I use a historic value as at a particular month when there are no values for the given month?

I have 2 SQL Server tables.
PurchaseOrderReceivingLine (PORL) is a table that contains every receipt from a purchase order. This has hundreds of entries per month.
PartyRelationshipScore (PRS) is a table with a party (supplier) reference number (that is used to join to the PORL table) and a score out of 10 for relationship and price. It also has a date field for when the score is updated so we have a history of the updates.
What I want to achieve is a supplier summary for each month. So I would have Supplier #, TotalValue, LateParts etc. I'm fine with creating the code for that. What I'm struggling with is getting the score for the given month if there are no values for that month.
So, for example I might have a value of 5 on the 1st August. Then it doesn't change until the 1st October when it is increased to 6.
On the grouping, September will have a TotalValue and a LateParts value, but because there are no records for September in the PRS table, it will return a NULL score. I need it to get the last value recorded and return that (in this case August's 5). So it will return:
Aug 2019 - 5
Sep 2019 - 5
Oct 2019 - 6
Thanks in advance.
PORL Table
+-------+----------------+-------+-------+
| PORL# | Date (UK) | Value | Party |
+-------+----------------+-------+-------+
| 1 | 1/8/2019 | 100 | 6 |
| 2 | 1/8/2019 | 250 | 6 |
| 3 | 1/9/2019 | 1000 | 6 |
| 4 | 1/10/2019 | 2000 | 6 |
+-------+----------------+-------+-------+
PRS Table
+------------------+-------+-------------------+------------+
| DateChanged (UK) | Party | RelationShipScore | PriceScore |
+------------------+-------+-------------------+------------+
| 1/8/2019         | 6     | 5                 | 5          |
| 1/10/2019        | 6     | 6                 | 7          |
+------------------+-------+-------------------+------------+
Preferred outcome
+----------+-------+------+------------+-------------------+------------+
| Supplier | Month | Year | TotalValue | RelationshipScore | PriceScore |
+----------+-------+------+------------+-------------------+------------+
| 6 | 8 | 2019 | 350 | 5 | 5 |
| 6 | 9 | 2019 | 1000 | 5 | 5 |
| 6 | 10 | 2019 | 2000 | 6 | 7 |
+----------+-------+------+------------+-------------------+------------+
The relationshipscore & pricescore for month 9 are based on it not changing from month 8.
I think this helps
select Supplier = T.Party
     , Month = DATEPART(MONTH, T.[Date])
     , Year = DATEPART(YEAR, T.[Date])
     , T.TotalValue
     , R.RelationShipScore
     , R.PriceScore
from ( -- receipts totalled per supplier and date
       select P.[Party], P.[Date], [TotalValue] = sum(P.[Value])
       from PurchaseOrderReceivingLine P
       group by P.[Party], P.[Date] ) T
outer apply ( -- latest score on or before that date;
              -- outer apply keeps the row even when no score exists yet
              select top 1 RelationShipScore, PriceScore
              from PartyRelationshipScore
              where Party = T.Party
              and DateChanged <= T.[Date]
              order by DateChanged desc ) R
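Since the preferred outcome is one row per month, a possible variant is to aggregate the derived table by month and look the scores up as of each month's last receipt date. This is only a sketch against the tables above (EOMONTH requires SQL Server 2012+), not tested code:
select Supplier = T.Party
     , T.[Month]
     , T.[Year]
     , T.TotalValue
     , R.RelationShipScore
     , R.PriceScore
from ( select P.[Party]
            , [Month] = DATEPART(MONTH, P.[Date])
            , [Year] = DATEPART(YEAR, P.[Date])
            , [MonthEnd] = EOMONTH(MAX(P.[Date]))  -- last day of that month
            , [TotalValue] = sum(P.[Value])
       from PurchaseOrderReceivingLine P
       group by P.[Party], DATEPART(YEAR, P.[Date]), DATEPART(MONTH, P.[Date]) ) T
outer apply ( select top 1 RelationShipScore, PriceScore
              from PartyRelationshipScore
              where Party = T.Party
              and DateChanged <= T.MonthEnd
              order by DateChanged desc ) R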

Weekly Average Reports: Redshift

My sales data for the first two weeks of June (the Monday dates being 1st Jun and 8th Jun) are below:
date | count
2015-06-01 03:25:53 | 1
2015-06-01 03:28:51 | 1
2015-06-01 03:49:16 | 1
2015-06-01 04:54:14 | 1
2015-06-01 08:46:15 | 1
2015-06-01 13:14:09 | 1
2015-06-01 16:20:13 | 5
2015-06-01 16:22:13 | 1
2015-06-01 16:27:07 | 1
2015-06-01 16:29:57 | 1
2015-06-01 19:16:45 | 1
2015-06-08 10:54:46 | 1
2015-06-08 15:12:10 | 1
2015-06-08 20:35:40 | 1
I need to find the weekly average of sales that happened in a given range.
Complex Query:
(some_manipulation_part), ifact as
(
    select date, sales_count from final_result_set
)
select date_part('h', date) as h,
       date_part('dow', date) as day_of_week,
       count(sales_count)
from final_result_set
group by h, day_of_week
Output :
h | day_of_week | count
3 | 1 | 3
4 | 1 | 1
8 | 1 | 1
10 | 1 | 1
13 | 1 | 1
15 | 1 | 1
16 | 1 | 8
19 | 1 | 1
20 | 1 | 1
If I try to apply avg on the above final result, it is not actually fetching the correct answer!
(some_manipulation_part), ifact as
(
    select date, sales_count from final_result_set
)
select date_part('h', date) as h,
       date_part('dow', date) as day_of_week,
       avg(sales_count)
from final_result_set
group by h, day_of_week
h | day_of_week | count
3 | 1 | 1
4 | 1 | 1
8 | 1 | 1
10 | 1 | 1
13 | 1 | 1
15 | 1 | 1
16 | 1 | 1
19 | 1 | 1
20 | 1 | 1
So I have two Mondays in the given range, but it is not actually dividing by that. I am not even sure what is happening inside Redshift.
To get "weekly averages" use date_trunc():
SELECT date_trunc('week', my_date_column) as week
, avg(sales_count) AS avg_sales
FROM final_result_set
GROUP BY 1;
I hope you are not actually using date as the name for your date column. It's a reserved word in SQL and a basic type name; don't use it as an identifier.
If you group by the day of the week (DOW) you get averages per weekday, and Sunday is 0. (Use ISODOW to get 7 for Sunday.)
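For completeness, per-weekday averages would look like the sketch below. date_part('dow', ...) works in Redshift as well; ISODOW is a PostgreSQL datepart, so treat that part as an assumption for Redshift:
SELECT date_part('dow', my_date_column) AS day_of_week  -- 0 = Sunday ... 6 = Saturday
     , avg(sales_count) AS avg_sales
FROM final_result_set
GROUP BY 1;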

Distinct lists on dates where an ID is present (i.e. intersects) on consecutive dates

I'm trying to make an MSSQL query that produces lists of apartment prices. The ultimate goal of the query is to calculate the percentage change in average prices of apartments. However, this final calculation (namely taking averages) is something I can fix in code provided that the list(s) of prices that are retrieved are correct.
What makes this tricky is that apartments are sold and new ones added all the time, so when comparing prices from week to week (I have weekly data), I only want to compare prices for apartments that have a recorded price in weeks (t-1, t), (t, t+1), (t+1,t+2) etc. In other words, some apartments that had a recorded price in time (t-1) might not be there at time t, and some apartments may have been added at time t (and thus weren't there at time t-1). I only want to select prices in week t-1 and t where some ApartmentID exists in both week t-1 and t to calculate the average change in week t.
Example data
-------------------------------------------------------------
| RegistrationID | Date | Price | ApartmentID |
-------------------------------------------------------------
| 1 | 2014-04-04 | 5 | 1 |
| 2 | 2014-04-04 | 6 | 2 |
| 3 | 2014-04-04 | 4 | 3 |
| 4 | 2014-04-11 | 5.2 | 1 |
| 5 | 2014-04-11 | 4 | 3 |
| 6 | 2014-04-11 | 7 | 4 |
| 7 | 2014-04-19 | 5.1 | 1 |
| 8 | 2014-04-19 | 4.1 | 3 |
| 9 | 2014-04-19 | 7.1 | 4 |
| 10 | 2014-04-26 | 4.1 | 3 |
| 11 | 2014-04-26 | 7.2 | 4 |
-------------------------------------------------------------
Solutions thoughts
I think it makes sense to produce two different lists, one for odd-numbered weeks and one for even-numbered weeks. List 1 would then contain Date, Price and ApartmentID that are valid for the tuples (t-1,t), (t+1,t+2), (t+3,t+4) etc. while list 2 would contain the same for the tuples (t,t+1),(t+2,t+3),(t+4,t+5) etc. The reason I think two lists are needed is that for any given week t, there are two sets of apartments and corresponding prices that need to be produced - one that is "forward compatible" and one that is "backwards compatible".
If two such lists can be produced, then the rest is simply an exercise in taking averages over each distinct date.
I'm not really sure to begin here. I played a little around with Intersect, but I'm pretty sure I need to nest queries to get this to work.
Result
Using the methodology described above would yield two lists.
List 1:
Notice how RegistrationID 2 and 6 disappear because they don't exist on both dates 2014-04-04 and 2014-04-11. The same goes for RegistrationID 7, as this apartment doesn't exist on both 2014-04-19 and 2014-04-26.
-------------------------------------------------------------
| RegistrationID | Date | Price | ApartmentID |
-------------------------------------------------------------
| 1 | 2014-04-04 | 5 | 1 |
| 3 | 2014-04-04 | 4 | 3 |
| 4 | 2014-04-11 | 5.2 | 1 |
| 5 | 2014-04-11 | 4 | 3 |
| 8 | 2014-04-19 | 4.1 | 3 |
| 9 | 2014-04-19 | 7.1 | 4 |
| 10 | 2014-04-26 | 4.1 | 3 |
| 11 | 2014-04-26 | 7.2 | 4 |
-------------------------------------------------------------
List 2:
Here, nothing disappears because every apartment is present in the tuples within the scope of this list.
-------------------------------------------------------------
| RegistrationID | Date | Price | ApartmentID |
-------------------------------------------------------------
| 4 | 2014-04-11 | 5.2 | 1 |
| 5 | 2014-04-11 | 4 | 3 |
| 6 | 2014-04-11 | 7 | 4 |
| 7 | 2014-04-19 | 5.1 | 1 |
| 8 | 2014-04-19 | 4.1 | 3 |
| 9 | 2014-04-19 | 7.1 | 4 |
-------------------------------------------------------------
Here's a solution. First, I get all the records from the table (I named it "ApartmentPrice"), computing WeekOf (the Monday used to label the week that date falls in), PreviousWeek (the label of the previous week), and NextWeek (the label of the following week). I store that in a table variable (you could also put it in a CTE or a temp table).
declare @tempTable table (RegistrationId int, PriceDate date, Price decimal(8,2), ApartmentId int, WeekOf date, PreviousWeek date, NextWeek date)
insert @tempTable
select ap.RegistrationId,
       ap.PriceDate,
       ap.Price,
       ap.ApartmentId,
       DATEADD(ww, DATEDIFF(ww, 0, ap.PriceDate), 0) WeekOf,
       DATEADD(ww, DATEDIFF(ww, 0, dateadd(wk, -1, ap.PriceDate)), 0) PreviousWeek,
       DATEADD(ww, DATEDIFF(ww, 0, dateadd(wk, 1, ap.PriceDate)), 0) NextWeek
from ApartmentPrice ap
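If the DATEADD/DATEDIFF pair looks opaque: DATEDIFF(ww, 0, d) counts the week boundaries between day 0 (1900-01-01, a Monday) and d, and DATEADD(ww, n, 0) adds those weeks back onto day 0, snapping every date in a week to the same label. A quick way to see it (the literal date is just an example):
select DATEADD(ww, DATEDIFF(ww, 0, '2014-04-04'), 0) as WeekOf
-- returns 2014-03-31, the label shared by every date in that week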
Then I join that table variable to itself where WeekOf equals either NextWeek or PreviousWeek. This gives the apartments that have a record in the adjoining week.
select distinct t.RegistrationId, t.PriceDate, t.Price, t.ApartmentId
from @tempTable t
join @tempTable t2 on t.ApartmentId = t2.ApartmentId and (t.WeekOf = t2.PreviousWeek or t.WeekOf = t2.NextWeek)
order by t.RegistrationId, t.ApartmentId, t.PriceDate
I'm using distinct because an apartment will appear more than once in the results if it has an adjoining record in both the previous and the next week.
You can also find the average prices for each week like this:
select t.WeekOf, avg(distinct t.Price)
from @tempTable t
join @tempTable t2 on t.ApartmentId = t2.ApartmentId and (t.WeekOf = t2.PreviousWeek or t.WeekOf = t2.NextWeek)
group by t.WeekOf
order by t.WeekOf
Here's a SQL Fiddle. I added a few more rows to the test data to show that it handles dates that cross the end of the year boundary.

Oracle CONNECT Operations

I've read through the Oracle documentation concerning the CONNECT operations, but I can't seem to get my head around a database query we have in an existing application. Below is a simplified version of the query.
SELECT LEVEL,
CONNECT_BY_ROOT MY_MONTH MY_LABEL,
b.*
FROM (
SELECT ROWNUM AS ORDERING,
MY_AREA,
TRUNC (THE_MONTH, 'MONTH') AS MY_MONTH
FROM MY_TABLE
ORDER BY MY_AREA, MY_MONTH DESC
) b
WHERE LEVEL <= 3
START WITH 1 = 1
CONNECT BY PRIOR MY_AREA = MY_AREA
AND PRIOR ORDERING = ORDERING - 1
AND PRIOR MY_MONTH <= ADD_MONTHS(MY_MONTH, 6);
While I have a basic understanding of the CONNECT functionalities, this combination has me lost. Can anyone explain what is going on in this query?
I think the end says to get all of the rows that have the same area and a row number 1 less than the current row number and a date before 6 months in the future from the current date. I would guess this would only return 1 row (due to the row number criteria) or 0 rows if the other criteria weren't met. And then maybe the first CONNECT_BY_ROOT says to get that row's MY_MONTH value?
Start with b, which is a table of MY_AREA (a number?), MY_MONTH (a month-truncated date, i.e. the days are all set to 01), and an aliased ROWNUM, whose ordering is determined by the ORDER BY clause, ORDER BY MY_AREA, MY_MONTH DESC, e.g.:
+----------+---------+-----------+
| ORDERING | MY_AREA | MY_MONTH |
+----------+---------+-----------+
| 1 | 10 | 01-SEP-12 |
| 2 | 10 | 01-JAN-12 |
| 3 | 12 | 01-AUG-12 |
| 4 | 12 | 01-JUN-12 |
| 5 | 12 | 01-MAY-12 |
| 6 | 12 | 01-JAN-12 |
| 7 | 12 | 01-JAN-10 |
+----------+---------+-----------+
The WHERE clause doesn't come into play until later, so move on to START WITH, which says only 1 = 1. This means that every row in b will be used in the query; if you had had another condition here, e.g. my_area < 5 or whatever, only a certain set of rows would have been used.
Now, the CONNECT BY, which determines how the hierarchy should be built. This works like a WHERE clause, except for the special PRIOR keyword which tells the DB to look at the previous level in the hierarchy. So:
PRIOR MY_AREA = MY_AREA just means that the child node has to have the same value for MY_AREA.
PRIOR ORDERING = ORDERING - 1 means that the child should come one row after its parent in b's ordering.
PRIOR MY_MONTH <= ADD_MONTHS(MY_MONTH, 6) means that, in order to be joined into the hierarchy, the parent's MY_MONTH must be no more than 6 months after that of the current node.
The whole hierarchy is then created. LEVEL (special for CONNECT BY...) is set to the level in the hierarchy, CONNECT_BY_ROOT gives the MY_MONTH value for the root of that hierarchy and aliases it to MY_LABEL. After this, the table would look something like the following table. I've added separators for each hierarchy for clarity.
+-------+-----------+----------+---------+-----------+
| LEVEL | MY_LABEL | ORDERING | MY_AREA | MY_MONTH |
+-------+-----------+----------+---------+-----------+
| 1 | 01-SEP-12 | 1 | 10 | 01-SEP-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JAN-12 | 2 | 10 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-AUG-12 | 3 | 12 | 01-AUG-12 |
| 2 | 01-AUG-12 | 4 | 12 | 01-JUN-12 |
| 3 | 01-AUG-12 | 5 | 12 | 01-MAY-12 |
| 4 | 01-AUG-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JUN-12 | 4 | 12 | 01-JUN-12 |
| 2 | 01-JUN-12 | 5 | 12 | 01-MAY-12 |
| 3 | 01-JUN-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-MAY-12 | 5 | 12 | 01-MAY-12 |
| 2 | 01-MAY-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JAN-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JAN-10 | 7 | 12 | 01-JAN-10 |
+-------+-----------+----------+---------+-----------+
So, as you can see, each of the rows appears at the top of its own hierarchy, with all nodes meeting the CONNECT BY criteria under it.
Finally, the WHERE clause is applied; this chops off all of the levels > 3 in every hierarchy, so you're left with a maximum of 3 levels. This affects only one row in the middle hierarchy, the one with LEVEL = 4.
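To see the mechanics without the real table, here is a minimal, self-contained sketch of the same shape; the row generator and the numbers are made up, but the structure (START WITH 1 = 1, a PRIOR condition on consecutive row numbers, and a LEVEL cap) mirrors the original query:
SELECT LEVEL,
       CONNECT_BY_ROOT n AS root_n,
       n
FROM (SELECT LEVEL AS n FROM DUAL CONNECT BY LEVEL <= 4)  -- rows n = 1..4
WHERE LEVEL <= 3
START WITH 1 = 1
CONNECT BY PRIOR n = n - 1;
-- every n starts its own hierarchy, each chain climbs n, n+1, n+2,
-- and WHERE LEVEL <= 3 trims every chain to at most 3 rows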