To simplify the problem, let's say I have 2 tables:
user
    id: int
ticket
    id: int
    user_id: int
    marked: bool
With the given example data:
user
id
----
1
2
3
4
5
ticket
id | user_id | marked
---+---------+-------
1  | 1       | false
2  | 1       | true
3  | 1       | true
4  | 2       | true
5  | 2       | false
6  | 2       | false
7  | 3       | false
8  | 5       | false
9  | 5       | false
Users 1 and 2 have marked tickets.
User 3 has 1 unmarked ticket.
User 4 has no tickets.
User 5 has 2 unmarked tickets.
And I need a query that returns tickets with id 7, 8 and 9 - the tickets of users who don't have marked tickets.
I've written the following query:
SELECT * FROM ticket t
INNER JOIN user u ON t.user_id=u.id
INNER JOIN ticket tt ON u.id = tt.user_id
WHERE tt.marked = false;
But it doesn't work as expected. I don't want to use subqueries to exclude users with marked tickets. Can this be done fully with JOINs? As it happens, I'm not that familiar with JOIN clauses.
This assumes marked is an int. You may need to adjust your query to convert your bool data type.
with a as (
    select t.user_id
         , max(t.marked) as marked
    from ticket t
    group by t.user_id
)
select t.*
from ticket t
inner join a on a.user_id = t.user_id
where a.marked = 0
I deliberately omitted user since it adds no value.
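To try this against the sample data, here is a minimal sketch in Python using the standard sqlite3 module. It assumes marked is stored as 0/1 (SQLite has no separate boolean type). It runs the CTE query above and, for comparison, a join-only variant (a LEFT JOIN anti-join). That variant is not part of the answer above; it is only one possible way to meet the "JOINs only" request.

```python
import sqlite3

# In-memory database loaded with the ticket rows from the question.
# Assumption: marked is stored as 0/1 instead of a bool.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket (id INTEGER PRIMARY KEY, user_id INTEGER, marked INTEGER);
    INSERT INTO ticket (id, user_id, marked) VALUES
        (1, 1, 0), (2, 1, 1), (3, 1, 1),
        (4, 2, 1), (5, 2, 0), (6, 2, 0),
        (7, 3, 0), (8, 5, 0), (9, 5, 0);
""")

# The CTE approach: one row per user with the "highest" marked value.
cte_query = """
    WITH a AS (
        SELECT t.user_id, MAX(t.marked) AS marked
        FROM ticket t
        GROUP BY t.user_id
    )
    SELECT t.id
    FROM ticket t
    INNER JOIN a ON a.user_id = t.user_id
    WHERE a.marked = 0
    ORDER BY t.id
"""

# Join-only alternative (anti-join): keep a ticket only when no marked
# ticket exists for the same user.
anti_join_query = """
    SELECT t.id
    FROM ticket t
    LEFT JOIN ticket m ON m.user_id = t.user_id AND m.marked = 1
    WHERE m.id IS NULL
    ORDER BY t.id
"""

cte_ids = [row[0] for row in conn.execute(cte_query)]
anti_join_ids = [row[0] for row in conn.execute(anti_join_query)]
print(cte_ids, anti_join_ids)
```

Both queries return the tickets of users 3 and 5 on this data.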
I'm hesitant to answer without more info, but subqueries may be helpful.
Set up and post a repl for us to try it out.
If you're not familiar, here's a brief video on subqueries.
https://www.youtube.com/watch?v=GpC0XyiJPEo&t=417s
Looking for some Oracle SQL theoretical help on the best way to handle a grouped result set. I understand why it groups the way it does, but I'm trying to figure out if there's a way to
I have a table that lists the activity of some cost centers. It looks like this:
Company | Object | Sub | July | August
--------+--------+-----+------+-------
A       | 1      |     | 20   | 50
A       | 1      |     | 10   | 0
A       | 1      | 10  | 0    | 20
B       | 1      |     | 0    | 0
I then need to flag whether or not there was activity in August. So I'm writing a CASE statement where if August = 0 THEN 'FALSE' ELSE 'TRUE'. Then I need to group all records by Company, Object, and Sub. The Cumulative column is a SUM of both July and August. However, my output looks like this:
Company | Object | Sub | SUM | ActivityFlag
--------+--------+-----+-----+-------------
A       | 1      |     | 70  | TRUE
A       | 1      |     | 10  | FALSE
A       | 1      | 10  | 20  | TRUE
B       | 1      |     | 0   | FALSE
What I need is this:
Company | Object | Sub | SUM | ActivityFlag
--------+--------+-----+-----+-------------
A       | 1      |     | 80  | TRUE
A       | 1      | 10  | 20  | TRUE
B       | 1      |     | 0   | FALSE
Obviously, this is a simplified example of a much larger issue, but I'm trying to think through this problem theoretically so I can apply similar logic to my actual issue.
Is there a good SQL method for adding the August amount for rows 1 and 2, and then selecting TRUE so that this appears on a single row? I hope this makes sense.
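Use aggregation:
select company, object, sub, sum(july + august),
       -- keep both literals in the same letter case: max() compares strings,
       -- and 'false' would sort above 'True' in a binary comparison
       max(case when august <> 0 then 'TRUE' else 'FALSE' end)
from table_name group by company, object, sub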
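As a quick check of the aggregation idea on the sample rows, here is a small sketch using Python's sqlite3 module rather than Oracle. The table name activity is hypothetical, and the blank Sub values are assumed to be NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumptions: the table is called "activity" and a blank Sub is NULL.
conn.executescript("""
    CREATE TABLE activity (
        company TEXT, object INTEGER, sub INTEGER,
        july INTEGER, august INTEGER
    );
    INSERT INTO activity VALUES
        ('A', 1, NULL, 20, 50),
        ('A', 1, NULL, 10, 0),
        ('A', 1, 10,   0,  20),
        ('B', 1, NULL, 0,  0);
""")

# MAX over the per-row flag returns 'TRUE' if any row in the group
# had August activity, because 'TRUE' sorts above 'FALSE'.
rows = conn.execute("""
    SELECT company, object, sub,
           SUM(july + august) AS total,
           MAX(CASE WHEN august <> 0 THEN 'TRUE' ELSE 'FALSE' END) AS activity_flag
    FROM activity
    GROUP BY company, object, sub
    ORDER BY company, sub
""").fetchall()

for row in rows:
    print(row)
```

The first two detail rows collapse into a single group with a total of 80 and a flag of 'TRUE', which is the shape the question asks for.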
If you are flagging your detail rows with the CASE statement, you can wrap the CASE in an aggregate, similar to:
MAX(CASE WHEN August <> 0 THEN 1 ELSE 0 END)
Another way is to aggregate the flag upward in an inner query:
SELECT MAX(IsAugust) AS IsAugust FROM
(
...
CASE WHEN August <> 0 THEN 1 ELSE 0 END AS IsAugust
...
) X
GROUP BY...
I have a problem that should be solved outside of SQL, but due to business constraints needs to be solved within SQL.
So, please don't tell me to do this at data ingestion, outside of SQL, I want to, but it's not an option...
I have a stream of events, with 4 principal properties:
The source device
The event's timestamp
The event's "type"
The event's "payload" (a dreaded VARCHAR representing various data-types)
What I need to do is break the stream up in to pieces (that I will refer to as "sessions").
Each session is specific to a device (effectively, PARTITION BY device_id)
No one session may contain more than one event of the same type
To shorten the examples, I'll limit them to include just the timestamp and the event_type...
timestamp | event_type desired_session_id
-----------+------------ --------------------
0 | 1 0
1 | 4 0
2 | 2 0
3 | 3 0
4 | 2 1
5 | 1 1
6 | 3 1
7 | 4 1
8 | 4 2
9 | 4 3
10 | 1 3
11 | 1 4
12 | 2 4
An idealised final output may be to pivot the final results...
device_id | session_id | event_type_1_timestamp | event_type_1_payload | event_type_2_timestamp | event_type_2_payload ...
(That is not yet set in stone, but I will need to "know" which events make up a session, what their timestamps are, and what their payloads are. It is possible that just appending the session_id column to the input is sufficient, as long as I don't "lose" the other properties.)
There are:
12 discrete event types
hundreds of thousands of devices
hundreds of thousands of events per device
a "norm" of around 6-8 events per "session"
but sometimes a session may have just 1 or all 12
These factors mean that half-cartesian products and the like are, umm, less than desirable, but possibly may be "the only way".
I've played (in my head) with analytic functions and gaps-and-islands type processes, but can never quite get there. I always fall back to a place where I "want" some flags that I can carry forward from row to row and reset them as needed...
Pseudo-code that doesn't work in SQL...
flags = [0,0,0,0,0,0,0,0,0,0,0,0]    // one flag per event type
session_id = 0
for each row in stream
    if flags[row.event_type] == 0 then
        flags[row.event_type] = 1
    else
        session_id++
        flags = [0,0,0,0,0,0,0,0,0,0,0,0]
        flags[row.event_type] = 1    // the colliding event opens the new session
    row.session_id = session_id
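For reference, the pseudo-code can be sketched as runnable Python (outside SQL, so it only pins down the intended behaviour). It uses a set instead of a flag array, and the function name assign_sessions is made up for this illustration. The event that collides is treated as the first event of the new session.

```python
def assign_sessions(event_types):
    """Return one session id per event, in input order.

    A new session starts whenever an event type is already present
    in the current session.
    """
    seen = set()          # event types already in the current session
    session_id = 0
    session_ids = []
    for event_type in event_types:
        if event_type in seen:
            # Collision: start a new session containing only this event.
            session_id += 1
            seen = set()
        seen.add(event_type)
        session_ids.append(session_id)
    return session_ids


# The event_type column from the example table, ordered by timestamp.
sample = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
print(assign_sessions(sample))
```

On the sample stream this reproduces the desired_session_id column from the table above.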
Any SQL solution to that is appreciated, but you get "bonus points" if you can also take account of events "happening at the same time"...
If multiple events happen at the same timestamp
    If ANY of those events are in the "current" session
        ALL of those events go in to a new session
    Else
        ALL of those events go in to the "current" session
    If such a group of events includes the same event type multiple times
        Do whatever you like
        I'll have had enough by that point...
        But set the session as "ambiguous" or "corrupt" with some kind of flag?
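The "same timestamp" rules can be sketched in Python as well. This is only one reading of them: events are assumed to arrive as (timestamp, event_type) pairs already sorted by timestamp, and the "ambiguous" marker is attached to each row of the offending group rather than to the session as a whole. The function name is hypothetical.

```python
from itertools import groupby


def assign_sessions_with_ties(events):
    """events: list of (timestamp, event_type), sorted by timestamp.

    Returns a list of (timestamp, event_type, session_id, ambiguous).
    """
    seen = set()          # event types in the current session
    session_id = 0
    result = []
    for ts, group in groupby(events, key=lambda e: e[0]):
        types = [event_type for _, event_type in group]
        # Same event type more than once at one timestamp: flag it.
        ambiguous = len(set(types)) != len(types)
        if seen & set(types):
            # ANY event of the group is already in the current session,
            # so ALL events of the group go in to a new session.
            session_id += 1
            seen = set()
        seen |= set(types)
        for event_type in types:
            result.append((ts, event_type, session_id, ambiguous))
    return result


tied = [(0, 1), (1, 2), (1, 3), (2, 2), (2, 4), (3, 4), (3, 4)]
for row in assign_sessions_with_ties(tied):
    print(row)
```

In this sketch the two events at timestamp 2 both move to session 1 because type 2 was already present, and the duplicated type 4 at timestamp 3 is marked ambiguous.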
I'm not 100% sure this can be done in SQL. But I have an idea for an algorithm that might work:
enumerate the counts for each event
take the maximum count up to each point as the "grouping" for the events (this is the session)
So:
select t.*,
(max(seqnum) over (partition by device order by timestamp) - 1) as desired_session_id
from (select t.*,
row_number() over (partition by device, event_type order by timestamp) as seqnum
from t
) t;
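To see how close this idea gets, the same calculation can be imitated in plain Python (a per-type running count standing in for row_number(), and a running maximum standing in for the windowed max()). This is a sketch for checking against the sample data, not a replacement for the query.

```python
from collections import defaultdict


def max_seqnum_sessions(event_types):
    """Running max of the per-type occurrence count, minus 1."""
    counts = defaultdict(int)   # seqnum per event type
    running_max = 0
    sessions = []
    for event_type in event_types:
        counts[event_type] += 1
        running_max = max(running_max, counts[event_type])
        sessions.append(running_max - 1)
    return sessions


sample = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
desired = [0, 0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]

result = max_seqnum_sessions(sample)
# Positions where the heuristic disagrees with the desired session ids.
mismatches = [i for i, (a, b) in enumerate(zip(result, desired)) if a != b]
print(result)
print(mismatches)
```

On the question's sample the two agree up to timestamp 10 and then diverge on the last two rows, so the heuristic appears to under-count sessions in some cases.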
EDIT:
This is too long for a comment. I have a sense that this requires a recursive CTE (RBAR). This is because you cannot land at a single row and look at the cumulative information or neighboring information to determine if the row should start a new session.
Of course, there are some situations where it is obvious (say, the previous row has the same event). And, it is also possible that there is some clever method of aggregating the previous data that makes it possible.
EDIT II:
I don't think this is possible without recursive CTEs (RBAR). This isn't quite a mathematical proof, but this is where my intuition comes from.
Imagine you are looking back 4 rows from the current and you have:
1
2
1
2
1 <-- current row
What is the session for this? It is not determinate. Consider:
e  s    vs    e  s
1  1          2  1   <-- row not in look back
1  2          1  1
2  2          2  2
1  3          1  2
2  3          2  3
1  4          1  3
The value depends on going further back. Obviously, this example can be extended all the way back to the first event. I don't think there is a way to "aggregate" the earlier values to distinguish between these two cases.
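The two cases can be reproduced with a few lines of Python. The helper below is a hypothetical sequential sessioniser (session ids start at 1 to match the table above), and the two input lists are the "e" columns of the two cases.

```python
def sessions_from(event_types):
    """Sequential session assignment; session ids start at 1."""
    seen = set()
    session_id = 1
    out = []
    for event_type in event_types:
        if event_type in seen:
            session_id += 1
            seen = set()
        seen.add(event_type)
        out.append(session_id)
    return out


history_a = [1, 1, 2, 1, 2, 1]
history_b = [2, 1, 2, 1, 2, 1]

# Both histories end with the same five events...
same_tail = history_a[-5:] == history_b[-5:]
# ...but the sessions of those five rows differ.
tail_a = sessions_from(history_a)[-5:]
tail_b = sessions_from(history_b)[-5:]
print(same_tail, tail_a, tail_b)
```

The last five rows are identical in both inputs, yet their session ids differ, which is the point being made: a fixed look-back window does not carry enough information.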
The problem is solvable if you can deterministically say that a given event is the start of a new session. That seems to require complete prior knowledge, at least in some cases. There are obviously cases where this is easy -- such as two events in a row. I suspect, though, that these are the "minority" of such sequences.
That said, you are not quite stuck with RBAR through the entire table, because you have device_id for parallelization. I'm not sure if your environment can do this, but in BQ or Postgres, I would:
Aggregate along each device to create an array of structs with the time and event information.
Loop through the arrays once, perhaps using custom code.
Reassign the sessions by joining back to the original table or unnesting the logic.
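A rough Python sketch of those three steps follows. Plain dicts and lists stand in for the array-of-structs aggregation, and the row layout (device_id, ts, event_type) is an assumption made for the illustration.

```python
from collections import defaultdict


def sessionise_by_device(rows):
    """rows: iterable of (device_id, ts, event_type).

    Returns a list of (device_id, ts, event_type, session_id).
    """
    # Step 1: aggregate the events of each device into one list.
    per_device = defaultdict(list)
    for device_id, ts, event_type in rows:
        per_device[device_id].append((ts, event_type))

    result = []
    for device_id in sorted(per_device):
        # Step 2: loop through each device's events once, in time order.
        seen = set()
        session_id = 0
        for ts, event_type in sorted(per_device[device_id]):
            if event_type in seen:
                session_id += 1
                seen = set()
            seen.add(event_type)
            # Step 3: emit the original row with its session attached.
            result.append((device_id, ts, event_type, session_id))
    return result


rows = [("d1", 0, 1), ("d2", 0, 1), ("d1", 1, 2), ("d2", 1, 1), ("d1", 2, 1)]
for row in sessionise_by_device(rows):
    print(row)
```

Because each device is processed independently, the per-device loop is the unit that could be run in parallel.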
UPD based on discussion (not checked/tested, rough idea):
WITH
trailing_events as (
select *, listagg(event_type::varchar,',') over (partition by device_id order by ts rows between 12 preceding and current row) as events
from tbl
)
,session_flags as (
select *, f_get_session_flag(events) as session_flag
from trailing_events
)
SELECT
*
,sum(session_flag::int) over (partition by device_id order by ts) as session_id
FROM session_flags
where f_get_session_flag is
create or replace function f_get_session_flag(arr varchar(max))
returns boolean
stable as $$
    # event types are assumed to be integers 1..12
    stream = [int(x) for x in arr.split(',')]
    flags = [0] * 12
    is_new_session = False
    for event_type in stream:
        if flags[event_type - 1] == 0:
            flags[event_type - 1] = 1
            is_new_session = False
        else:
            # collision: reset the flags and let this event open the new session
            flags = [0] * 12
            flags[event_type - 1] = 1
            is_new_session = True
    return is_new_session
$$ language plpythonu;
prev answer:
The flags could be replicated as the remainder of dividing each event type's running count by 2:
1 -> 1%2 = 1
2 -> 2%2 = 0
3 -> 3%2 = 1
4 -> 4%2 = 0
5 -> 5%2 = 1
6 -> 6%2 = 0
and concatenated into a bit mask (similar to flags array in the pseudocode). The only tricky point is when to exactly reset all flags to zeros and initiate the new session ID but I could get quite close. If your sample table is called t and it has ts and type columns the script could look like this:
with
-- running count of the events
t1 as (
select
*
,sum(case when type=1 then 1 else 0 end) over (order by ts) as type_1_cnt
,sum(case when type=2 then 1 else 0 end) over (order by ts) as type_2_cnt
,sum(case when type=3 then 1 else 0 end) over (order by ts) as type_3_cnt
,sum(case when type=4 then 1 else 0 end) over (order by ts) as type_4_cnt
from t
)
-- mask
,t2 as (
select
*
,case when type_1_cnt%2=0 then '0' else '1' end ||
case when type_2_cnt%2=0 then '0' else '1' end ||
case when type_3_cnt%2=0 then '0' else '1' end ||
case when type_4_cnt%2=0 then '0' else '1' end as flags
from t1
)
-- previous row's mask
,t3 as (
select
*
,lag(flags) over (order by ts) as flags_prev
from t2
)
-- reset the mask if there is a switch from 1 to 0 at any position
,t4 as (
select *
,case
when (substring(flags from 1 for 1)='0' and substring(flags_prev from 1 for 1)='1')
or (substring(flags from 2 for 1)='0' and substring(flags_prev from 2 for 1)='1')
or (substring(flags from 3 for 1)='0' and substring(flags_prev from 3 for 1)='1')
or (substring(flags from 4 for 1)='0' and substring(flags_prev from 4 for 1)='1')
then '0000'
else flags
end as flags_override
from t3
)
-- get the previous value of the reset mask and same event type flag for corner case
,t5 as (
select *
,lag(flags_override) over (order by ts) as flags_override_prev
,type=lag(type) over (order by ts) as same_event_type
from t4
)
-- again, session ID is a switch from 1 to 0 OR same event type (that can be a switch from 0 to 1)
select
ts
,type
,sum(case
when (substring(flags_override from 1 for 1)='0' and substring(flags_override_prev from 1 for 1)='1')
or (substring(flags_override from 2 for 1)='0' and substring(flags_override_prev from 2 for 1)='1')
or (substring(flags_override from 3 for 1)='0' and substring(flags_override_prev from 3 for 1)='1')
or (substring(flags_override from 4 for 1)='0' and substring(flags_override_prev from 4 for 1)='1')
or same_event_type
then 1
else 0 end
) over (order by ts) as session_id
from t5
order by ts
;
You can add the necessary partitions and extend this to 12 event types; the code is intended to work on the sample table that you provided. It's not perfect: if you run the subqueries you'll see that the flags are reset more often than needed. Overall it works, except for the corner case of session id 2, which contains a single event of type 4 immediately following the end of another session that ended with the same event type 4. For that case I added a simple lookup in same_event_type and used it as another condition for a new session id. I hope this will work on a bigger dataset.
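To make the parity mask easier to picture, here is a small Python sketch of just the t1 and t2 steps (running count per event type, then count modulo 2 concatenated into a string). It does not attempt the reset logic of t4 and later; the function name is made up for the illustration.

```python
def parity_masks(event_types, n_types):
    """Return, for each row, the bit mask of (running count % 2) per type."""
    counts = [0] * n_types      # running count of each event type
    masks = []
    for event_type in event_types:
        counts[event_type - 1] += 1
        masks.append("".join(str(c % 2) for c in counts))
    return masks


# The type column of the sample table, ordered by ts.
sample = [1, 4, 2, 3, 2, 1, 3, 4, 4, 4, 1, 1, 2]
masks = parity_masks(sample, 4)
for ts, mask in enumerate(masks):
    print(ts, mask)
```

A bit flipping from 1 back to 0 means that event type has just occurred a second time since the bit was last clear, which is what the t4 step looks for.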
The solution I decided to live with is effectively "don't do it in SQL" by deferring the actual sessionising to a scalar function written in python.
--
-- The input parameter should be a comma delimited list of identifiers
-- Each identifier should be a "power of 2" value, no lower than 1
-- (1, 2, 4, 8, 16, 32, 64, 128, etc, etc)
--
-- The input '1,2,4,2,1,1,4' will give the output '0001010'
--
CREATE OR REPLACE FUNCTION public.f_indentify_collision_indexes(arr varchar(max))
RETURNS VARCHAR(MAX)
STABLE AS
$$
    stream = map(int, arr.split(','))
    state = 0
    collisions = []
    item_id = 1
    for item in stream:
        if (state & item) == (item):
            collisions.append('1')
            state = item
        else:
            state |= item
            collisions.append('0')
        item_id += 1
    return ''.join(collisions)
$$
LANGUAGE plpythonu;
NOTE : I wouldn't use this if there are hundreds of event types ;)
Effectively I pass in a data structure of events in sequence, and the return is a data structure of where the new sessions start.
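For testing the logic outside Redshift, the body of the UDF can be lifted into a standalone Python 3 function. This is a sketch of the same bitmask idea; the function name is changed only so that it is clearly not the UDF itself.

```python
def identify_collision_indexes(arr):
    """arr: comma delimited powers of 2, e.g. '1,2,4,2'.

    Returns a string with '1' at each position that starts a new session.
    """
    state = 0               # bitmask of event types in the current session
    collisions = []
    for item in map(int, arr.split(",")):
        if (state & item) == item:
            # This type is already in the session: start a new one.
            collisions.append("1")
            state = item
        else:
            state |= item
            collisions.append("0")
    return "".join(collisions)


print(identify_collision_indexes("1,2,4,2,1,1,4"))
print(identify_collision_indexes("1,2,4,8,2,1,4,1,4,4,1,1"))
```

The first call matches the example in the comment block of the UDF.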
I chose the actual data structures to make the SQL side of things as simple as I could. (Might not be the best, very open to other ideas.)
INSERT INTO
sessionised_event_stream
SELECT
device_id,
REGEXP_COUNT(
LEFT(
public.f_indentify_collision_indexes(
LISTAGG(event_type_id, ',')
WITHIN GROUP (ORDER BY session_event_sequence_id)
OVER (PARTITION BY device_id)
),
session_event_sequence_id::INT
),
'1',
1
) + 1
AS session_login_attempt_id,
session_event_sequence_id,
event_timestamp,
event_type_id,
event_data
FROM
(
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY device_id
ORDER BY event_timestamp, event_type_id, event_data)
AS session_event_sequence_id
FROM
event_stream
)
Assert a deterministic order to the events (in case of events happening at the same time, etc)
ROW_NUMBER() OVER (stuff) AS session_event_sequence_id
Create a comma delimited list of event_type_id's
LISTAGG(event_type_id, ',') => '1,2,4,8,2,1,4,1,4,4,1,1'
Use python to work out the boundaries
public.f_magic('1,2,4,8,2,1,4,1,4,4,1,1') => '000010010101'
For the first event in the sequence, count the number of 1's up to and including the first character in the 'boundaries'. For the second event in the sequence, count the number of 1's up to and including the second character in the boundaries, etc, etc.
event 01 = 1 => boundaries = '0' => session_id = 0
event 02 = 2 => boundaries = '00' => session_id = 0
event 03 = 4 => boundaries = '000' => session_id = 0
event 04 = 8 => boundaries = '0000' => session_id = 0
event 05 = 2 => boundaries = '00001' => session_id = 1
event 06 = 1 => boundaries = '000010' => session_id = 1
event 07 = 4 => boundaries = '0000100' => session_id = 1
event 08 = 1 => boundaries = '00001001' => session_id = 2
event 09 = 4 => boundaries = '000010010' => session_id = 2
event 10 = 4 => boundaries = '0000100101' => session_id = 3
event 11 = 1 => boundaries = '00001001010' => session_id = 3
event 12 = 1 => boundaries = '000010010101' => session_id = 4
REGEXP_COUNT( LEFT('000010010101', session_event_sequence_id), '1', 1 )
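The prefix-counting step is equivalent to a running sum over the boundary string. A short Python sketch of that step, using the boundary string from the walkthrough above:

```python
from itertools import accumulate


def session_ids(boundaries):
    """Count the '1' characters up to and including each position."""
    return list(accumulate(int(ch) for ch in boundaries))


boundaries = "000010010101"
ids = session_ids(boundaries)
print(ids)

# Same thing written the way the SQL does it: LEFT() then count the 1s.
ids_by_prefix = [boundaries[:k].count("1") for k in range(1, len(boundaries) + 1)]
print(ids_by_prefix)
```

Both forms give the session_id column listed for events 01 to 12.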
The result is something that's not very speedy, but robust and still better than other options I've tried. What it "feels like" is that (perhaps, maybe, I'm not sure, caveat, caveat) if there are 100 items in a stream then LISTAGG() is called once and the python UDF is called 100 times. I might be wrong. I've seen Redshift do worse things ;)
Pseudo code for what turns out to be a worse option.
Write some SQL that can find "the next session" from any given stream.
Run that SQL once storing the results in a temp table.
=> Now have the first session from every stream
Run it again using the temp table as an input
=> We now also have the second session from every stream
Keep repeating this until the SQL inserts 0 rows in to the temp table
=> We now have all the sessions from every stream
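The loop can be simulated in Python to show why the number of passes tracks the longest stream. This is only an in-memory sketch: a dict of lists stands in for the streams and the temp table, and each iteration of the while loop stands in for one round trip to the database.

```python
def peel_sessions(streams):
    """streams: dict of device_id -> list of event types in time order.

    Returns (sessions per device, number of passes made).
    """
    positions = {device: 0 for device in streams}
    sessions = {device: [] for device in streams}
    passes = 0
    while True:
        inserted = 0
        for device, events in streams.items():
            start = positions[device]
            if start >= len(events):
                continue                    # this stream is exhausted
            # Find "the next session": extend until a type repeats.
            seen = set()
            end = start
            while end < len(events) and events[end] not in seen:
                seen.add(events[end])
                end += 1
            sessions[device].append(events[start:end])
            positions[device] = end
            inserted += 1
        passes += 1
        if inserted == 0:                   # the pass that inserts 0 rows
            break
    return sessions, passes


streams = {"a": [1, 2, 1], "b": [1, 1, 1, 1]}
sessions, passes = peel_sessions(streams)
print(sessions, passes)
```

Stream "a" is finished after two passes, but the loop keeps going until stream "b" with its four sessions is done, plus one final empty pass.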
The time taken to calculate each session was relatively low, and was actually dominated by the overhead of making repeated requests to Redshift. It also meant that the dominant factor was "how many sessions are in the longest stream". (In my case, 0.0000001% of the streams were 1000x longer than the average.)
The python version is actually slower in most individual cases, but is not dominated by those annoying outliers. This meant that overall the python version completed about 10x sooner than the "external loop" version described here. It also used a bucket load more CPU resources in total, but elapsed time is the more important factor right now :)
I have a query that produces the following:
Team | Member | Cancelled | Rate
-----------------------------------
1 John FALSE 150
1 Bill TRUE 10
2 Sarah FALSE 145
2 James FALSE 110
2 Ashley TRUE 0
What I need is to select the count of members for a team where cancelled is false and the sum of the rate regardless of cancelled status...something like this:
SELECT
Team,
COUNT(Member), --WHERE Cancelled = FALSE
SUM(Rate) --All Rows
FROM
[QUERY]
GROUP BY
Team
So the result would look like this:
Team | CountOfMember | SumOfRate
----------------------------------
1 1 160
2 2 255
This is just an example. The real query has multiple complex joins. I know I could do one query for the sum of the rate and then another for the count and then join the results of those two together, but is there a simpler way that would be less taxing and not cause me to copy and paste an already complex query?
You want a conditional sum, something like this:
sum(case when cancelled = 'false' then 1 else 0 end)
The reason for using sum() is that it processes every record and adds a value, either 0 or 1, for each one. The value depends on the value of cancelled. When it is false, the sum() increments by 1, so it counts the number of such records.
You can do something similar with count(), like this:
count(case when cancelled = 'false' then cancelled end)
The trick here is that count() counts the number of non-NULL values. The then clause can be anything that is not NULL -- cancelled, the constant 1, or some other field. Without an else, any other value is turned into NULL and not counted.
I have always preferred the sum() version over the count() version, because I think it is more explicit. In other dialects of SQL, you can sometimes shorten it to:
sum(cancelled = 'false')
which, once you get used to it, makes a lot of sense.
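To compare the three forms side by side, here is a small sketch using Python's sqlite3 module. The table name members is hypothetical, and cancelled is stored as the text 'false'/'true' to match the comparisons above. SQLite happens to accept the shortened sum(cancelled = 'false') form; other engines may not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the query result from the question is loaded into "members".
conn.executescript("""
    CREATE TABLE members (team INTEGER, member TEXT, cancelled TEXT, rate INTEGER);
    INSERT INTO members VALUES
        (1, 'John',   'false', 150),
        (1, 'Bill',   'true',  10),
        (2, 'Sarah',  'false', 145),
        (2, 'James',  'false', 110),
        (2, 'Ashley', 'true',  0);
""")

rows = conn.execute("""
    SELECT team,
           SUM(CASE WHEN cancelled = 'false' THEN 1 ELSE 0 END)   AS via_sum,
           COUNT(CASE WHEN cancelled = 'false' THEN cancelled END) AS via_count,
           SUM(cancelled = 'false')                                AS via_short,
           SUM(rate)                                               AS sum_of_rate
    FROM members
    GROUP BY team
    ORDER BY team
""").fetchall()

for row in rows:
    print(row)
```

All three counting expressions agree, and the rate is summed over every row regardless of the cancelled status.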