SQL with nested condition - sql

EDIT: added third requirement after playing with solution from Tim Biegeleisen
EDIT2: modified Robbie's DOB to be before his parent's marriage date
I am trying to create a query that will look at two tables and determine the difference in dates based on a percentage. I know, super confusing... Let me try and explain using the tables below:
Bob and Mary are married on 2010-01-01 and expect 4 kids (Parent table)
I want to know how many years it took until they met 50% of their expected kids (i.e. 2/4 kids). Using the Child table to see the DOB of their 4 kids, we know that Frankie is the second child which meets our 50% threshold so we use Frankie's DOB and subtract it from Frankie's parent's marriage date and end up with 3 years!
If the goal isn't reached then display no value e.g. Mick and Jo only had 1 child so far so they haven't yet reached their goal
Hoping this is doable using BigQuery standard SQL.
Parent table
id married_couple married_at expected_kids
--------------------------------------
1 Bob and Mary 2010-01-01 4
2 Mick and Jo 2010-01-01 4
Child table
id child_name parent_id date_of_birth
--------------------------------------
1 Eddie 1 2012-01-01
2 Frankie 1 2013-01-01
3 Robbie 1 2005-01-01
4 Duncan 1 2015-01-01
5 Rick 2 2014-01-01
Expected SQL result
parent_id half_goal_reached(years)
--------------------------------------
1 3
2

Below both soluthions for BigQuery Standard SQL
First one is more in classic sql way, the second one is more of BigQuery style (I think)
First Solution: with analytics function
#standardSQL
SELECT
parent_id,
IF(
MAX(pos) = MAX(CAST(expected_kids / 2 AS INT64)),
MAX(DATE_DIFF(date_of_birth, married_at, YEAR)),
NULL
) AS half_goal_reached
FROM (
SELECT c.parent_id, c.date_of_birth, expected_kids, married_at,
ROW_NUMBER() OVER(PARTITION BY c.parent_id ORDER BY c.date_of_birth) AS pos
FROM `child` AS c
JOIN `parent` AS p
ON c.parent_id = p.id
)
WHERE pos <= CAST(expected_kids / 2 AS INT64)
GROUP BY parent_id
Second Solution: with use of ARRAY
#standardSQL
SELECT
parent_id,
DATE_DIFF(dates[SAFE_ORDINAL(CAST(expected_kids / 2 AS INT64))], married_at, YEAR) AS half_goal_reached
FROM (
SELECT
parent_id,
ARRAY_AGG(date_of_birth ORDER BY date_of_birth) AS dates,
MAX(expected_kids) AS expected_kids,
MAX(married_at) AS married_at
FROM `child` AS c
JOIN `parent` AS p
ON c.parent_id = p.id
GROUP BY parent_id
)
Dummy Data
You can test / play with both solutions using below dummy data
#standardSQL
WITH `parent` AS (
SELECT 1 id, 'Bob and Mary' married_couple, DATE '2010-01-01' married_at, 4 expected_kids UNION ALL
SELECT 2, 'Mick and Jo', DATE '2010-01-01', 4
),
`child` AS (
SELECT 1 id, 'Eddie' child_name, 1 parent_id, DATE '2012-01-01' date_of_birth UNION ALL
SELECT 2, 'Frankie', 1, DATE '2013-01-01' UNION ALL
SELECT 3, 'Robbie', 1, DATE '2014-01-01' UNION ALL
SELECT 4, 'Duncan', 1, DATE '2015-01-01' UNION ALL
SELECT 5, 'Rick', 2, DATE '2014-01-01'
)

Try the following query, whose logic is too verbose to explain it well. I join the parent and child tables, bringing into line the parent id, number of years elapsed since marriage, running number of children, and expected number of children. With this information in hand, we can easily find the first row whose running number of children matches or exceeds half of the expected number.
SELECT parent_id, num_years AS half_goal_reached
FROM
(
SELECT parent_id, num_years, cnt, expected_kids,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY num_years) rn
FROM
(
SELECT
t2.parent_id,
YEAR(t2.date_of_birth) - YEAR(t1.married_at) AS num_years,
(SELECT COUNT(*) FROM child c
WHERE c.parent_id = t2.parent_id AND
c.date_of_birth <= t2.date_of_birth) AS cnt,
t1.expected_kids
FROM parent t1
INNER JOIN child t2
ON t1.id = t2.parent_id
) t
WHERE
cnt >= expected_kids / 2
) t
WHERE t.rn = 1;
Note that there may be issues with how I computed the yearly differences, or how I compute the threshhold for half the number of expected children. Also, if we were using a recent enterprise database we could have used an analytic function to get the running number of children instead of a correlated subquery, but I was unsure if Big Query would support that, so I used the latter.

Related

How to show data that's not in a table. SQL ORACLE

I've a data base with two tables.
Table Players Table Wins
ID Name ID Player_won
1 Mick 1 2
2 Frank 2 1
3 Sarah 3 4
4 Eva 4 5
5 Joe 5 1
I need a SQL query which show "The players who have not won any game".
I tried but I don't know even how to begin.
Thank you
You need all the rows from players that don't have corresponding rows in wins. For this you need a left join, filtering for rows that don't join:
select
p.id,
p.name
from Players p
left join Wins w on w.Player_won = p.id
where w.Player_won is null
You can also use not in:
select
id,
name
from Players
where id not in (select Player_won from Wins)
How about the MINUS set operator?
Sample data:
SQL> with players (id, name) as
2 (select 1, 'Mick' from dual union all
3 select 2, 'Ffrank' from dual union all
4 select 3, 'Sarah' from dual union all
5 select 4, 'Eva' from dual union all
6 select 5, 'Joe' from dual
7 ),
8 wins (id, player_won) as
9 (select 1, 2 from dual union all
10 select 2, 1 from dual union all
11 select 3, 4 from dual union all
12 select 4, 5 from dual union all
13 select 5, 1 from dual
14 )
Query begins here:
15 select id from players
16 minus
17 select player_won from wins;
ID
----------
3
SQL>
So, yes ... player 3 didn't win any game so far.
I think you should provide your attempts next time, but here you go:
select p.name
from players p
where not exists (select * from wins w where p.id = w.player_won);
MINUS is not the best option here because of not using indexes and instead performing a full-scan of both tables.
I've a data base with two tables.
You don't show the names or any definition of the tables, leaving me to make an educated guess about their structure.
I tried but I don't know even how to begin.
What exactly did you try? Possibly what you are missing here is the concept of a LEFT OUTER JOIN.
Assuming the tables are named player_table and wins_table, and have column names exactly as you showed, and that the player_won column is intended to express the number of games won by the player of that row's ID, and without knowing whether or not wins_table will have rows for players with zero wins… this should cover it:
select Name
from players_table pt
left join wins_table wt on (pt.ID = wt.ID)
-- Either this player is explicitly specified to have Player_won=0
-- or there is no row for this player ID in the wins table
-- (but excluding the possibility of an explicit NULL value, since its meaning would be unclear)
where Player_won = 0 or wt.ID is null;
As you can see from the variety of answers you've gotten, there are many ways to accomplish this.
One additional way to do this is to use COUNT in a correlated subquery, as in:
SELECT *
FROM PLAYERS p
WHERE 0 = (SELECT COUNT(*)
FROM WINS w
WHERE w.PLAYER_WON = p.ID)
db<>fiddle here
SELECT *
FROM Players p
INNER JOIN Wins w
ON p.ID = w.ID
WHERE w.players_won = 0
I have not done SQL in awhile but I think this might be right if you are looking for players with 0 wins

What's the most efficient way to match values between 2 tables based on most recent prior date?

I've got two tables in MS SQL Server:
dailyt - which contains daily data:
date val
---------------------
2014-05-22 10
2014-05-21 9.5
2014-05-20 9
2014-05-19 8
2014-05-18 7.5
etc...
And periodt - which contains data coming in at irregular periods:
date val
---------------------
2014-05-21 2
2014-05-18 1
Given a row in dailyt, I want to adjust its value by adding the corresponding value in periodt with the closest date prior or equal to the date of the dailyt row. So, the output would look like:
addt
date val
---------------------
2014-05-22 12 <- add 2 from 2014-05-21
2014-05-21 11.5 <- add 2 from 2014-05-21
2014-05-20 10 <- add 1 from 2014-05-18
2014-05-19 9 <- add 1 from 2014-05-18
2014-05-18 8.5 <- add 1 from 2014-05-18
I know that one way to do this is to join the dailyt and periodt tables on periodt.date <= dailyt.date and then imposing a ROW_NUMBER() (PARTITION BY dailyt.date ORDER BY periodt.date DESC) condition, and then having a WHERE condition on the row number to = 1.
Is there another way to do this that would be more efficient? Or is this pretty much optimal?
I think using APPLY would be the most efficient way:
SELECT d.Val,
p.Val,
NewVal = d.Val + ISNULL(p.Val, 0)
FROM Dailyt AS d
OUTER APPLY
( SELECT TOP 1 Val
FROM Periodt p
WHERE p.Date <= d.Date
ORDER BY p.Date DESC
) AS p;
Example on SQL Fiddle
If there relatively very few periodt rows, then there is an option that may prove quite efficient.
Convert periodt into a From/To ranges table using subqueries or CTEs. (Obviously performance depends on how efficiently this initial step can be done, which is why a small number of periodt rows is preferable.) Then the join to dailyt will be extremely efficient. E.g.
;WITH PIds AS (
SELECT ROW_NUMBER() OVER(ORDER BY PDate) RN, *
FROM #periodt
),
PRange AS (
SELECT f.PDate AS FromDate, t.PDate as ToDate, f.PVal
FROM PIds f
LEFT OUTER JOIN PIds t ON
t.RN = f.RN + 1
)
SELECT d.*, p.PVal
FROM #dailyt d
LEFT OUTER JOIN PRange p ON
d.DDate >= p.FromDate
AND (d.DDate < p.ToDate OR p.ToDate IS NULL)
ORDER BY 1 DESC
If you want to try the query, the following produces the sample data using table variables. Note I added an extra row to dailyt to demonstrate no periodt entries with a smaller date.
DECLARE #dailyt table (
DDate date NOT NULL,
DVal float NOT NULL
)
INSERT INTO #dailyt(DDate, DVal)
SELECT '20140522', 10
UNION ALL SELECT '20140521', 9.5
UNION ALL SELECT '20140520', 9
UNION ALL SELECT '20140519', 8
UNION ALL SELECT '20140518', 7.5
UNION ALL SELECT '20140517', 6.5
DECLARE #periodt table (
PDate date NOT NULL,
PVal int NOT NULL
)
INSERT INTO #periodt
SELECT '20140521', 2
UNION ALL SELECT '20140518', 1

SQL Query to generate an extra field from data in the table

I have a table with 3 fields like this sample table Tbl1
Person Cost FromDate
1 10 2009-1-1
1 20 2010-1-1
2 10 2009-1-1
I want to query it and get back the 3 fields and a generated field called ToDate that defaults to 2099-1-1 unless there is an actual ToDate implied from another entry for the person in the table.
select Person,Cost,FromDate,ToDate From Tbl1
Person Cost FromDate ToDate
1 10 2009-1-1 2010-1-1
1 20 2010-1-1 2099-1-1
2 10 2009-1-1 2099-1-1
You can select the minimum date from all dates that are after the record's date. If there is none you get NULL. With COALESCE you change NULL into the default date:
select
Person,
Cost,
FromDate,
coalesce((select min(FromDate) from Tbl1 later where later.FromDate > Tbl1.FromDate), '2099-01-01') as ToDate
From Tbl1
order by Person, FromDate;
Although Thorsten's answer is perfectly fine, it would be more efficient to use window-functions to match the derived end-dates.
;WITH nbrdTbl
AS ( SELECT Person, Cost, FromDate, row_nr = ROW_NUMBER() OVER (PARTITION BY Person ORDER BY FromDate ASC)
FROM Tbl1)
SELECT t.Person, t.Cost, t.FromDate, derived_end_date = COALESCE(nxt.FromDate, '9991231')
FROM nbrdTbl t
LEFT OUTER JOIN nbrdTbl nxt
ON nxt.Person = t.Person
AND nxt.row_nr = t.row_nr + 1
ORDER BY t.Person, t.FromDate
Doing a test on a 2000-records table it's about 3 times as efficient according to the Execution plan (78% vs 22%).

Find duplicates within a specific period

I have a table with the following structure
ID Person LOG_TIME
-----------------------------------
1 1 2012-05-21 13:03:11.550
2 1 2012-05-22 13:09:37.050 <--- this is duplicate
3 1 2012-05-28 13:09:37.183
4 2 2012-05-20 15:09:37.230
5 2 2012-05-22 13:03:11.990 <--- this is duplicate
6 2 2012-05-24 04:04:13.222 <--- this is duplicate
7 2 2012-05-29 11:09:37.240
I have some application job that fills this table with data.
There is a business rule that each person should have only 1 record in every 7 days.
From the above example, records # 2,5 and 6 are considered duplicates while 1,3,4 and 7 are OK.
I want to have a SQL query that checks if there are records for the same person in less than 7 days.
;WITH cte AS
(
SELECT ID, Person, LOG_TIME,
DATEDIFF(d, MIN(LOG_TIME) OVER (PARTITION BY Person), LOG_TIME) AS diff_date
FROM dbo.Log_time
)
SELECT *
FROM cte
WHERE diff_date BETWEEN 1 AND 6
Demo on SQLFiddle
Please see my attempt on SQLFiddle here.
You can use a join based on DATEDIFF() to find records which are logged less than 7 days apart:
WITH TooClose
AS
(
SELECT
a.ID AS BeforeID,
b.ID AS AfterID
FROM
Log a
INNER JOIN Log b ON a.Person = b.Person
AND a.LOG_TIME < b.LOG_TIME
AND DATEDIFF(DAY, a.LOG_TIME, b.LOG_TIME) < 7
)
However, this will include records which you don't consider "duplicates" (for instance, ID 3, because it is too close to ID 2). From what you've said, I'm inferring that a record isn't a "duplicate" if the record it is too close to is itself a "duplicate".
So to apply this rule and get the final list of duplicates:
SELECT
AfterID AS ID
FROM
TooClose
WHERE
BeforeID NOT IN (SELECT AfterID FROM TooClose)
Please take a look at this sample.
Reference: SQLFIDDLE
Query:
select person,
datediff(max(log_time),min(log_time)) as diff,
count(log_time)
from pers
group by person
;
select y.person, y.ct
from (
select person,
datediff(max(log_time),min(log_time)) as diff,
count(log_time) as ct
from pers
group by person) as y
where y.ct > 1
and y.diff <= 7
;
PERSON DIFF COUNT(LOG_TIME)
1 1 3
2 8 3
PERSON CT
1 3
declare #Count int
set #count=(
select COUNT(*)
from timeslot
where (( (TimeFrom<#Timefrom and TimeTo >#Timefrom)
or (TimeFrom<#Timeto and TimeTo >#Timeto))
or (TimeFrom=#Timefrom or TimeTo=#Timeto)))

How to write Oracle query to find a total length of possible overlapping from-to dates

I'm struggling to find the query for the following task
I have the following data and want to find the total network day for each unique ID
ID From To NetworkDay
1 03-Sep-12 07-Sep-12 5
1 03-Sep-12 04-Sep-12 2
1 05-Sep-12 06-Sep-12 2
1 06-Sep-12 12-Sep-12 5
1 31-Aug-12 04-Sep-12 3
2 04-Sep-12 06-Sep-12 3
2 11-Sep-12 13-Sep-12 3
2 05-Sep-12 08-Sep-12 3
Problem is the date range can be overlapping and I can't come up with SQL that will give me the following results
ID From To NetworkDay
1 31-Aug-12 12-Sep-12 9
2 04-Sep-12 08-Sep-12 4
2 11-Sep-12 13-Sep-12 3
and then
ID Total Network Day
1 9
2 7
In case the network day calculation is not possible just get to the second table would be sufficient.
Hope my question is clear
We can use Oracle Analytics, namely the "OVER ... PARTITION BY" clause, in Oracle to do this. The PARTITION BY clause is kind of like a GROUP BY but without the aggregation part. That means we can group rows together (i.e. partition them) and them perform an operation on them as separate groups. As we operate on each row we can then access the columns of the previous row above. This is the feature PARTITION BY gives us. (PARTITION BY is not related to partitioning of a table for performance.)
So then how do we output the non-overlapping dates? We first order the query based on the (ID,DFROM) fields, then we use the ID field to make our partitions (row groups). We then test the previous row's TO value and the current rows FROM value for overlap using an expression like: (in pseudo code)
max(previous.DTO, current.DFROM) as DFROM
This basic expression will return the original DFROM value if it doesnt overlap, but will return the previous TO value if there is overlap. Since our rows are ordered we only need to be concerned with the last row. In cases where a previous row completely overlaps the current row we want the row then to have a 'zero' date range. So we do the same thing for the DTO field to get:
max(previous.DTO, current.DFROM) as DFROM, max(previous.DTO, current.DTO) as DTO
Once we have generated the new results set with the adjusted DFROM and DTO values, we can aggregate them up and count the range intervals of DFROM and DTO.
Be aware that most date calculations in database are not inclusive such as your data is. So something like DATEDIFF(dto,dfrom) will not include the day dto actually refers to, so we will want to adjust dto up a day first.
I dont have access to an Oracle server anymore but I know this is possible with the Oracle Analytics. The query should go something like this:
(Please update my post if you get this to work.)
SELECT id,
max(dfrom, LAST_VALUE(dto) OVER (PARTITION BY id ORDER BY dfrom) ) as dfrom,
max(dto, LAST_VALUE(dto) OVER (PARTITION BY id ORDER BY dfrom) ) as dto
from (
select id, dfrom, dto+1 as dto from my_sample -- adjust the table so that dto becomes non-inclusive
order by id, dfrom
) sample;
The secret here is the LAST_VALUE(dto) OVER (PARTITION BY id ORDER BY dfrom) expression which returns the value previous to the current row.
So this query should output new dfrom/dto values which dont overlap. It's then a simple matter of sub-querying this doing (dto-dfrom) and sum the totals.
Using MySQL
I did haves access to a mysql server so I did get it working there. MySQL doesnt have results partitioning (Analytics) like Oracle so we have to use result set variables. This means we use #var:=xxx type expressions to remember the last date value and adjust the dfrom/dto according. Same algorithm just a little longer and more complex syntax. We also have to forget the last date value any time the ID field changes!
So here is the sample table (same values you have):
create table sample(id int, dfrom date, dto date, networkDay int);
insert into sample values
(1,'2012-09-03','2012-09-07',5),
(1,'2012-09-03','2012-09-04',2),
(1,'2012-09-05','2012-09-06',2),
(1,'2012-09-06','2012-09-12',5),
(1,'2012-08-31','2012-09-04',3),
(2,'2012-09-04','2012-09-06',3),
(2,'2012-09-11','2012-09-13',3),
(2,'2012-09-05','2012-09-08',3);
On to the query, we output the un-grouped result set like above:
The variable #ld is "last date", and the variable #lid is "last id". Anytime #lid changes, we reset #ld to null. FYI In mysql the := operators is where the assignment happens, an = operator is just equals.
This is a 3 level query, but it could be reduced to 2. I went with an extra outer query to keep things more readable. The inner most query is simple and it adjusts the dto column to be non-inclusive and does the proper row ordering. The middle query does the adjustment of the dfrom/dto values to make them non-overlapped. The outer query simple drops the non-used fields, and calculate the interval range.
set #ldt=null, #lid=null;
select id, no_dfrom as dfrom, no_dto as dto, datediff(no_dto, no_dfrom) as days from (
select if(#lid=id,#ldt,#ldt:=null) as last, dfrom, dto, if(#ldt>=dfrom,#ldt,dfrom) as no_dfrom, if(#ldt>=dto,#ldt,dto) as no_dto, #ldt:=if(#ldt>=dto,#ldt,dto), #lid:=id as id,
datediff(dto, dfrom) as overlapped_days
from (select id, dfrom, dto + INTERVAL 1 DAY as dto from sample order by id, dfrom) as sample
) as nonoverlapped
order by id, dfrom;
The above query gives the results (notice dfrom/dto are non-overlapping here):
+------+------------+------------+------+
| id | dfrom | dto | days |
+------+------------+------------+------+
| 1 | 2012-08-31 | 2012-09-05 | 5 |
| 1 | 2012-09-05 | 2012-09-08 | 3 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-13 | 5 |
| 2 | 2012-09-04 | 2012-09-07 | 3 |
| 2 | 2012-09-07 | 2012-09-09 | 2 |
| 2 | 2012-09-11 | 2012-09-14 | 3 |
+------+------------+------------+------+
How about constructing an SQL which merges intervals by removing holes and considering only maximum intervals. It goes like this (not tested):
SELECT DISTINCT F.ID, F.From, L.To
FROM Temp AS F, Temp AS L
WHERE F.From < L.To AND F.ID = L.ID
AND NOT EXISTS (SELECT *
FROM Temp AS T
WHERE T.ID = F.ID
AND F.From < T.From AND T.From < L.To
AND NOT EXISTS ( SELECT *
FROM Temp AS T1
WHERE T1.ID = F.ID
AND T1.From < T.From
AND T.From <= T1.To)
)
AND NOT EXISTS (SELECT *
FROM Temp AS T2
WHERE T2.ID = F.ID
AND (
(T2.From < F.From AND F.From <= T2.To)
OR (T2.From < L.To AND L.To < T2.To)
)
)
with t_data as (
select 1 as id,
to_date('03-sep-12','dd-mon-yy') as start_date,
to_date('07-sep-12','dd-mon-yy') as end_date from dual
union all
select 1,
to_date('03-sep-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('05-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('06-sep-12','dd-mon-yy'),
to_date('12-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('31-aug-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('04-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('11-sep-12','dd-mon-yy'),
to_date('13-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('05-sep-12','dd-mon-yy'),
to_date('08-sep-12','dd-mon-yy') from dual
),
t_holidays as (
select to_date('01-jan-12','dd-mon-yy') as holiday
from dual
),
t_data_rn as (
select rownum as rn, t_data.* from t_data
),
t_model as (
select distinct id,
start_date
from t_data_rn
model
partition by (rn, id)
dimension by (0 as i)
measures(start_date, end_date)
rules
( start_date[for i
from 1
to end_date[0]-start_date[0]
increment 1] = start_date[0] + cv(i),
end_date[any] = start_date[cv()] + 1
)
order by 1,2
),
t_network_days as (
select t_model.*,
case when
mod(to_char(start_date, 'j'), 7) + 1 in (6, 7)
or t_holidays.holiday is not null
then 0 else 1
end as working_day
from t_model
left outer join t_holidays
on t_holidays.holiday = t_model.start_date
)
select id,
sum(working_day) as network_days
from t_network_days
group by id;
t_data - your initial data
t_holidays - contains list of holidays
t_data_rn - just adds unique key (rownum) to each row of t_data
t_model - expands t_data date ranges into a flat list of dates
t_network_days - marks each date from t_model as working day or weekend based on day of week (Sat and Sun) and holidays list
final query - calculates number of network day per each group.