How to select rows with conditional values of one column in SQL - sql

Say I have this table:
id | timeline
---|---------
1 | BASELINE
1 | MIDTIME
1 | ENDTIME
2 | BASELINE
2 | MIDTIME
3 | BASELINE
4 | BASELINE
5 | BASELINE
5 | MIDTIME
5 | ENDTIME
6 | MIDTIME
6 | ENDTIME
7 | RISK
7 | RISK
This is what the data looks like, except that the real table has many more rows (a few thousand).
How do I get output that looks like this:
id | timeline
---|---------
1 | BASELINE
1 | MIDTIME
2 | BASELINE
2 | MIDTIME
5 | BASELINE
5 | MIDTIME
How do I select the first two rows of each id that has both of two specific timeline values (BASELINE and MIDTIME)? Note that id 6 has MIDTIME and ENDTIME, and id 7 has two RISK rows; I don't want these two ids.
I used
SELECT *
FROM df
WHERE id IN (SELECT id FROM df GROUP BY id HAVING COUNT(*)=2)
and got IDs with two timeline values (output below) but don't know how to get rows with only BASELINE and MIDTIME.
id | timeline |
---|----------|
1 | BASELINE |
1 | MIDTIME |
2 | BASELINE |
2 | MIDTIME |
5 | BASELINE |
5 | MIDTIME |
6 | MIDTIME | ---- don't want this
6 | ENDTIME | ---- don't want this
7 | RISK | ---- don't want this
7 | RISK | ---- don't want this
Many Thanks.

You can try using EXISTS:
select *
from t t1
where t1.timeline in ('BASELINE','MIDTIME')
  and exists (select 1 from t t2
              where t2.id = t1.id and t2.timeline in ('BASELINE','MIDTIME')
              group by t2.id
              having count(distinct t2.timeline) = 2)
OUTPUT:
id timeline
1 BASELINE
1 MIDTIME
2 BASELINE
2 MIDTIME
5 BASELINE
5 MIDTIME
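The EXISTS query can be checked against the question's sample rows with a small script. This is only a sketch: it assumes Python's built-in sqlite3 module as a stand-in for whatever database is actually in use, and keeps the table name t from the answer.

```python
import sqlite3

# Sample rows copied from the question.
rows = [
    (1, "BASELINE"), (1, "MIDTIME"), (1, "ENDTIME"),
    (2, "BASELINE"), (2, "MIDTIME"),
    (3, "BASELINE"),
    (4, "BASELINE"),
    (5, "BASELINE"), (5, "MIDTIME"), (5, "ENDTIME"),
    (6, "MIDTIME"), (6, "ENDTIME"),
    (7, "RISK"), (7, "RISK"),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite used only for illustration
conn.execute("CREATE TABLE t (id INTEGER, timeline TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT t1.id, t1.timeline
FROM t AS t1
WHERE t1.timeline IN ('BASELINE', 'MIDTIME')
  AND EXISTS (
      -- the id qualifies only if it has both distinct values
      SELECT 1
      FROM t AS t2
      WHERE t2.id = t1.id
        AND t2.timeline IN ('BASELINE', 'MIDTIME')
      GROUP BY t2.id
      HAVING COUNT(DISTINCT t2.timeline) = 2
  )
ORDER BY t1.id, t1.timeline
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Ids 3 and 4 drop out because they have only BASELINE, and id 6 drops out because it has MIDTIME but no BASELINE.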

I think this query should give you the result you want.
NOTE: As I understand it, you don't want any id that has an "ENDTIME" row, yet in your sample data there is an "ENDTIME" row for id 1. I assumed this was an error, so I wrote a query that excludes every id that has an "ENDTIME" (or "RISK") row.
WITH CTE AS
(
SELECT
id
FROM
df
WHERE
timeline IN ('ENDTIME', 'RISK')
)
SELECT
id,
timeline
FROM
df
WHERE
id NOT IN (SELECT id FROM CTE);

There are probably a number of ways to do this. Here is one that picks up the BASELINE and MIDTIME rows only for ids that have both, and ensures there are only 2 rows per returned id. Without knowing how the timeline values are ordered, I don't think it's possible to go further:
SELECT
id
, timeline
FROM (
SELECT
*
, SUM(CASE WHEN timeline = 'BASELINE' THEN 1 ELSE 0 END) OVER (PARTITION BY id) AS BaselineCount
, SUM(CASE WHEN timeline = 'MIDTIME' THEN 1 ELSE 0 END) OVER (PARTITION BY id) AS MidtimeCount
FROM df
WHERE df.timeline IN ('BASELINE', 'MIDTIME')
) subquery
WHERE subquery.BaselineCount > 0
AND subquery.MidtimeCount > 0
GROUP BY
id
, timeline
;
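A quick way to see the effect of the final GROUP BY is to run the query on the sample rows plus one duplicate. The sketch below assumes Python's sqlite3 module with SQLite 3.25 or later (needed for window functions); the extra (2, 'MIDTIME') row is hypothetical and is not part of the question's data.

```python
import sqlite3

rows = [
    (1, "BASELINE"), (1, "MIDTIME"), (1, "ENDTIME"),
    (2, "BASELINE"), (2, "MIDTIME"),
    (2, "MIDTIME"),  # hypothetical duplicate, added only to exercise the GROUP BY
    (3, "BASELINE"),
    (4, "BASELINE"),
    (5, "BASELINE"), (5, "MIDTIME"), (5, "ENDTIME"),
    (6, "MIDTIME"), (6, "ENDTIME"),
    (7, "RISK"), (7, "RISK"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (id INTEGER, timeline TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

query = """
SELECT id, timeline
FROM (
    SELECT id, timeline,
           SUM(CASE WHEN timeline = 'BASELINE' THEN 1 ELSE 0 END)
               OVER (PARTITION BY id) AS baseline_count,
           SUM(CASE WHEN timeline = 'MIDTIME' THEN 1 ELSE 0 END)
               OVER (PARTITION BY id) AS midtime_count
    FROM df
    WHERE timeline IN ('BASELINE', 'MIDTIME')
) AS sub
WHERE baseline_count > 0
  AND midtime_count > 0
GROUP BY id, timeline   -- collapses duplicate (id, timeline) rows
ORDER BY id, timeline
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Even with the duplicate row, id 2 comes back with exactly one BASELINE row and one MIDTIME row.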

Related

How to combine and sum consequent values until new value in column

I need some help summing consecutive values of a column, grouped by the category in another column, until that category changes to a new value. Here's what my data looks like:
id | site_id | date_id | hour_id | location_id | status | status_minutes
1 1 20210101 1 1 Offline 60
2 1 20210101 2 1 Offline 57
3 1 20210101 2 1 Available 3
4 1 20210101 3 1 Available 20
5 1 20210101 3 1 Offline 40
... ... ... ... ... ... ...
25 1 20210101 23 1 Offline 60
26 1 20210102 0 1 Offline 23
As you can see, the data above is at the hourly level, so if status_minutes equals 60 there is just one row for that hour. Otherwise, the hour is split across several rows whose status_minutes add up to 60, as in rows 2 and 3, and in rows 4 and 5.
My goal is to find how long each stretch of a status lasted before the next status kicked in. The output for the example above would be:
site_id | date_id | location_id | status | status_minutes
1 20210101 1 Offline 117
1 20210101 1 Available 23
1 20210101 1 Offline 40
... ... ... ... ...
1 20210101 1 Offline 60
1 20210102 1 Offline 23
The important part is that this operation should be confined to each day, as the last two rows of the example and of the output show. The summing happens only within a given day and starts again with hour 0 of the next day.
This is a gaps-and-islands problem. The computed section_num identifies each group of consecutive rows with the same status, and status_minutes is then summed per group.
You may try the following:
SELECT
site_id,
date_id,
location_id,
status,
SUM(status_minutes) as status_minutes
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY site_id,date_id,location_id
ORDER BY hour_id, id -- id breaks ties when one hour is split across rows
) - ROW_NUMBER() OVER (
PARTITION BY site_id,date_id,location_id,status
ORDER BY hour_id, id
) as section_num
FROM
my_table
) t
GROUP BY
site_id,
date_id,
location_id,
status,
section_num
ORDER BY
site_id,
date_id,
location_id,
MIN(hour_id), -- section_num alone is not chronological across different statuses
MIN(id)
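The row-number difference trick can be tried on the question's rows with a short script. This is a sketch under stated assumptions: it uses Python's sqlite3 module (SQLite 3.25 or later for window functions), adds id as a tie-breaker because an hour can be split across two rows, and loads only rows 1 to 5 and 26. The elided rows are left out, and so is row 25, since without them it would sit directly after row 5 and merge into the same island.

```python
import sqlite3

# (id, site_id, date_id, hour_id, location_id, status, status_minutes)
rows = [
    (1, 1, 20210101, 1, 1, "Offline", 60),
    (2, 1, 20210101, 2, 1, "Offline", 57),
    (3, 1, 20210101, 2, 1, "Available", 3),
    (4, 1, 20210101, 3, 1, "Available", 20),
    (5, 1, 20210101, 3, 1, "Offline", 40),
    (26, 1, 20210102, 0, 1, "Offline", 23),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER, site_id INTEGER, date_id INTEGER, "
    "hour_id INTEGER, location_id INTEGER, status TEXT, status_minutes INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

query = """
SELECT site_id, date_id, location_id, status,
       SUM(status_minutes) AS total_minutes
FROM (
    SELECT *,
           -- position within the day minus position within the day and status:
           -- constant for consecutive rows that share a status
           ROW_NUMBER() OVER (PARTITION BY site_id, date_id, location_id
                              ORDER BY hour_id, id)
         - ROW_NUMBER() OVER (PARTITION BY site_id, date_id, location_id, status
                              ORDER BY hour_id, id) AS section_num
    FROM my_table
) AS t
GROUP BY site_id, date_id, location_id, status, section_num
ORDER BY site_id, date_id, location_id, MIN(hour_id), MIN(id)
"""
islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```

Because date_id is part of every PARTITION BY, an island never crosses midnight, which is the constraint the question asks for.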

BigQuery: Flattening all repeated fields in nested schema

I am having a lot of trouble querying BigQuery's nested schema.
I have the following fields.
I want to flatten the table and get something like this.
user | question_id | user_choices
123 | 1 | 1
123 | 1 | 2
123 | 1 | 3
123 | 1 | 4
From other resources, I got to a point where I can query from one of the records in the repeated columns. Such as the following:
SELECT user, dat.question_id FROM tablename, UNNEST(data) dat
It gives me this result.
But when I do this, I get a repeated column again.
SELECT user, dat.question_id, dat.user_choices FROM tablename, UNNEST(data) dat
Can anyone help me UNNEST this table properly so that I get a flattened schema for all data items?
Thanks!
Below is for BigQuery Standard SQL
#standardSQL
SELECT user, question_id, choice
FROM `project.dataset.table`,
UNNEST(data) question,
UNNEST(user_choices) choice
You can test and play with the above using dummy data modeled on your question, as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 user,
[STRUCT<question_id INT64, user_choices ARRAY<INT64>>
(1,[1,2,3]),
(2,[2,5]),
(3,[1,3])
] data UNION ALL
SELECT 2 user,
[STRUCT<question_id INT64, user_choices ARRAY<INT64>>
(1,[2,3]),
(2,[4,5]),
(3,[2,6])
] data
)
SELECT user, question_id, choice
FROM `project.dataset.table`,
UNNEST(data) question,
UNNEST(user_choices) choice
ORDER BY user, question_id, choice
with result
Row user question_id choice
1 1 1 1
2 1 1 2
3 1 1 3
4 1 2 2
5 1 2 5
6 1 3 1
7 1 3 3
8 2 1 2
9 2 1 3
10 2 2 4
11 2 2 5
12 2 3 2
13 2 3 6
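Conceptually, each comma-joined UNNEST behaves like one more nested loop over the array found in the current row. The sketch below mimics that in plain Python using the same dummy data; it illustrates the semantics only and is not BigQuery code.

```python
# Same dummy data as above: each user row holds a repeated "data" record,
# and each record holds a repeated "user_choices" field.
table = [
    {"user": 1, "data": [
        {"question_id": 1, "user_choices": [1, 2, 3]},
        {"question_id": 2, "user_choices": [2, 5]},
        {"question_id": 3, "user_choices": [1, 3]},
    ]},
    {"user": 2, "data": [
        {"question_id": 1, "user_choices": [2, 3]},
        {"question_id": 2, "user_choices": [4, 5]},
        {"question_id": 3, "user_choices": [2, 6]},
    ]},
]

# One loop per UNNEST: rows -> questions -> choices.
flattened = sorted(
    (row["user"], question["question_id"], choice)
    for row in table
    for question in row["data"]
    for choice in question["user_choices"]
)

for user, question_id, choice in flattened:
    print(user, question_id, choice)
```

Unnesting only data corresponds to stopping after the second loop, which is why user_choices still comes back as a repeated column in the question's second query.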

Query to find Cumulative while subtracting other counts

Here is my table structure
Id INT
RecId INT
Dated DATETIME
Status INT
and here is my data.
Status table (contains different statuses)
Id Status
1 Created
2 Assigned
Log table (contains logs for the different statuses that a record went through (RecId))
Id RecId Dated Status
1 1 2013-12-09 14:16:31.930 1
2 7 2013-12-09 14:27:26.620 1
3 1 2013-12-09 14:27:26.620 2
3 8 2013-12-10 11:14:13.747 1
3 9 2013-12-10 11:14:13.747 1
3 8 2013-12-10 11:14:13.747 2
I need to generate a report from this data in the following format.
Dated Created Assigned
2013-12-09 2 1
2013-12-10 3 1
Here the rows are calculated per date. Created is calculated as (previous date's Created count - previous date's Assigned count) + today's Created count.
For example, on 2013-12-10 three entries were made to the log table, of which two have the status Created and one has the status Assigned. In the view that I want to build for the report, Created for 2013-12-10 is therefore 2 + 1 = 3, where 2 is the number of newly inserted Created records and 1 is the previous day's remaining count (Created - Assigned = 2 - 1).
I hope the scenario is clear; please ask if further information is required.
Please help me with the SQL to construct the above view.
This matches the expected result for the provided sample, but may require more testing.
with CTE as (
select
*
, row_number() over(order by dt ASC) as rn
from (
select
cast(created.dated as date) as dt
, count(created.status) as Created
, count(Assigned.status) as Assigned
, count(created.status)
- count(Assigned.status) as Delta
from LogTable created
left join LogTable assigned
on created.RecId = assigned.RecId
and created.status = 1
and assigned.Status = 2
and created.Dated <= assigned.Dated
where created.status = 1
group by
cast(created.dated as date)
) x
)
select
dt.dt
, dt.created + coalesce(prev.delta,0) as created -- add the preceding date's unassigned count
, dt.assigned
from CTE dt
left join CTE prev on dt.rn = prev.rn+1 -- prev is the row for the preceding date
;
Result:
| DT | CREATED | ASSIGNED |
|------------|---------|----------|
| 2013-12-09 | 2 | 1 |
| 2013-12-10 | 3 | 1 |
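If the rule is read as a running carry-over (everything created so far minus everything assigned before the current date), it can also be written with cumulative window sums. The sketch below is an alternative formulation, not the query above. It assumes Python's sqlite3 module with SQLite 3.25 or later, uses SQLite's date() function in place of the CAST, and counts Assigned by the date of the log entry.

```python
import sqlite3

# Log rows copied from the question (Id values as given there).
rows = [
    (1, 1, "2013-12-09 14:16:31.930", 1),
    (2, 7, "2013-12-09 14:27:26.620", 1),
    (3, 1, "2013-12-09 14:27:26.620", 2),
    (3, 8, "2013-12-10 11:14:13.747", 1),
    (3, 9, "2013-12-10 11:14:13.747", 1),
    (3, 8, "2013-12-10 11:14:13.747", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LogTable (Id INTEGER, RecId INTEGER, Dated TEXT, Status INTEGER)")
conn.executemany("INSERT INTO LogTable VALUES (?, ?, ?, ?)", rows)

query = """
WITH daily AS (
    SELECT date(Dated) AS dt,
           SUM(CASE WHEN Status = 1 THEN 1 ELSE 0 END) AS n_created,
           SUM(CASE WHEN Status = 2 THEN 1 ELSE 0 END) AS n_assigned
    FROM LogTable
    GROUP BY date(Dated)
)
SELECT dt,
       -- running created minus running assigned, with today's assigned added
       -- back so that only earlier dates' assignments are subtracted
       SUM(n_created) OVER (ORDER BY dt)
         - SUM(n_assigned) OVER (ORDER BY dt)
         + n_assigned AS created,
       n_assigned AS assigned
FROM daily
ORDER BY dt
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

On the two sample dates both readings agree. They would diverge once unassigned records accumulate over more than one day, so which one is right depends on the intended report.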

Implementing Hierarchy in SQL

Suppose I have a table which has a "CDATE" representing the date when I retrieved the data, a "SECID" identifying the security I retrieved data for, a "SOURCE" designating where I got the data, and the "VALUE" which I got from the source. My data might look like the following:
CDATE | SECID | SOURCE | VALUE
--------------------------------
1/1/2012 1 1 23
1/1/2012 1 5 45
1/1/2012 1 3 33
1/4/2012 2 5 55
1/5/2012 1 5 54
1/5/2012 1 3 99
Suppose I have a HIERARCHY table like the following ("SOURCE" with greatest HIERARCHY number takes precedence):
SOURCE | NAME | HIERARCHY
---------------------------
1 ABC 10
3 DEF 5
5 GHI 2
Now suppose I want my results to be picked according to the hierarchy above. Applying the hierarchy and selecting the source with the greatest HIERARCHY number, I would like to end up with the following:
CDATE | SECID | SOURCE | VALUE
---------------------------------
1/1/2012 1 1 23
1/4/2012 2 5 55
1/5/2012 1 3 99
This joins on your hierarchy and selects the top-ranked source for each date and security.
SELECT CDATE, SECID, SOURCE, VALUE
FROM (
SELECT t.CDATE, t.SECID, t.SOURCE, t.VALUE,
ROW_NUMBER() OVER (PARTITION BY t.CDATE, t.SECID
ORDER BY h.HIERARCHY DESC) as nRow
FROM table1 t
INNER JOIN table2 h ON h.SOURCE = t.SOURCE
) A
WHERE nRow = 1
You can get the results you want with the query below. It joins your data to your hierarchy and ranks the rows for each date and security by HIERARCHY, highest first. Note that if the same source is repeated for a date and security, only one of those rows is returned, chosen arbitrarily.
;with rankMyData as (
select
d.CDATE
, d.SECID
, d.SOURCE
, d.VALUE
, row_number() over(partition by d.CDate, d.SECID order by h.HIERARCHY desc) as ranking
from DATA d
inner join HIERARCHY h
on h.source = d.source
)
SELECT
CDATE
, SECID
, SOURCE
, VALUE
FROM rankMyData
where ranking = 1
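Both answers rely on the same ROW_NUMBER ranking, which can be checked against the sample rows. This sketch assumes Python's sqlite3 module with SQLite 3.25 or later; the table names sec_data and source_hierarchy are hypothetical stand-ins for the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table names; columns follow the question.
conn.execute("CREATE TABLE sec_data (CDATE TEXT, SECID INTEGER, SOURCE INTEGER, VALUE INTEGER)")
conn.execute("CREATE TABLE source_hierarchy (SOURCE INTEGER, NAME TEXT, HIERARCHY INTEGER)")

conn.executemany("INSERT INTO sec_data VALUES (?, ?, ?, ?)", [
    ("1/1/2012", 1, 1, 23),
    ("1/1/2012", 1, 5, 45),
    ("1/1/2012", 1, 3, 33),
    ("1/4/2012", 2, 5, 55),
    ("1/5/2012", 1, 5, 54),
    ("1/5/2012", 1, 3, 99),
])
conn.executemany("INSERT INTO source_hierarchy VALUES (?, ?, ?)", [
    (1, "ABC", 10),
    (3, "DEF", 5),
    (5, "GHI", 2),
])

query = """
SELECT CDATE, SECID, SOURCE, VALUE
FROM (
    SELECT d.CDATE, d.SECID, d.SOURCE, d.VALUE,
           -- rank 1 is the source with the greatest HIERARCHY per date and security
           ROW_NUMBER() OVER (PARTITION BY d.CDATE, d.SECID
                              ORDER BY h.HIERARCHY DESC) AS ranking
    FROM sec_data AS d
    INNER JOIN source_hierarchy AS h ON h.SOURCE = d.SOURCE
) AS ranked
WHERE ranking = 1
ORDER BY CDATE, SECID
"""
picked = conn.execute(query).fetchall()
for row in picked:
    print(row)
```

The dates are stored as text here, so the final ORDER BY is only reliable for this small sample; a real table would normally use a proper date type.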

SQL - Aggregating data in a result set with identical rows and eliminating multiple rows based on one column's value

I have a table of transactions by EmployeeID and TransactionTime. Each employee may have multiple transactions at the same time; for example, EmployeeID 1 has 2 transactions at time 12. I need to count the transactions per EmployeeID at each time, so for EmployeeID 1 at time 12 the new column (TotalTransactionsByTime) would be 2. Next, if any transaction at a given TransactionTime has a CODE of BAD, I need to exclude all transactions at that time. So for EmployeeID 2, I need to exclude all three transactions from the result set, because one of them has a CODE of 'BAD', which nullifies all transactions at that time.
MY TABLE
EmployeeID | TransactionTime | CODE
1 | 12 | GOOD
1 | 12 | GOOD
1 | 5 | GOOD
2 | 1 | BAD --need to omit all 3 transactions for employeeID 2
2 | 1 | GOOD
2 | 1 | GOOD
3 | 3 | GOOD
3 | 3 | GOOD
A correct result would look like:
EmployeeID | TransactionTime | CODE | NUMBERTRNS
1 | 12 | GOOD | 2
1 | 5 | GOOD | 1
3 | 3 | GOOD | 2
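-- Count per employee and time, skipping any (EmployeeID, TransactionTime) slot that contains a BAD row
select mt1.EmployeeID, mt1.TransactionTime, mt1.CODE, count(*) as NUMBERTRNS
from MyTable mt1
where not exists (select 1 from MyTable mt2
                  where mt2.EmployeeID = mt1.EmployeeID
                    and mt2.TransactionTime = mt1.TransactionTime
                    and mt2.CODE = 'BAD')
group by mt1.EmployeeID, mt1.TransactionTime, mt1.CODE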
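The requirement to drop only the time slot that contains a BAD row, rather than the whole employee, can be checked with a small script. The sketch below assumes Python's sqlite3 module and uses a NOT EXISTS correlated on both EmployeeID and TransactionTime. The row (2, 4, 'GOOD') is hypothetical, added only to show that an employee's other time slots survive.

```python
import sqlite3

rows = [
    (1, 12, "GOOD"), (1, 12, "GOOD"), (1, 5, "GOOD"),
    (2, 1, "BAD"), (2, 1, "GOOD"), (2, 1, "GOOD"),
    (2, 4, "GOOD"),  # hypothetical extra row, not in the question's data
    (3, 3, "GOOD"), (3, 3, "GOOD"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (EmployeeID INTEGER, TransactionTime INTEGER, CODE TEXT)")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", rows)

query = """
SELECT mt1.EmployeeID, mt1.TransactionTime, mt1.CODE, COUNT(*) AS NUMBERTRNS
FROM MyTable AS mt1
WHERE NOT EXISTS (
    -- any BAD row for the same employee at the same time voids that slot
    SELECT 1
    FROM MyTable AS mt2
    WHERE mt2.EmployeeID = mt1.EmployeeID
      AND mt2.TransactionTime = mt1.TransactionTime
      AND mt2.CODE = 'BAD'
)
GROUP BY mt1.EmployeeID, mt1.TransactionTime, mt1.CODE
ORDER BY mt1.EmployeeID, mt1.TransactionTime
"""
counts = conn.execute(query).fetchall()
for row in counts:
    print(row)
```

A filter on EmployeeID alone would also discard the hypothetical time-4 row for employee 2, because it removes every row of any employee who has a BAD code at any time.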