Complex grouping case - using "delimiting tinyint flag" between groups - sql

Below is an example of the page-views dataset in question.
Presented records are sorted in ASC order by a timestamp.
I need to calculate some per-session measures from the dataset.
The problem is that there is no clear identifier for a session. The only thing that is available is is_a_new_session flag - which serves as a kind of a delimiter between sessions. So, in the given example there are 5 separate sessions.
How could I generate some sort of a session identifier and add it to the dataset, so that I can later use it for grouping per session?
The desired new column would look similar to this:

Use a cumulative sum to define the groups and then aggregate:
    select min(timestamp), max(timestamp), . . . -- whatever columns you want
    from (select t.*,
                 sum(is_a_new_session) over (order by timestamp) as grp
          from t
         ) t
    group by grp;
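To see the cumulative sum at work, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The six rows are made up for illustration and are not the asker's data; only the column names come from the question.

```python
import sqlite3

# Hypothetical page-view rows: (timestamp, is_a_new_session).
page_views = [
    (1, 1), (2, 0), (3, 0),
    (4, 1), (5, 0),
    (6, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (timestamp integer, is_a_new_session integer)")
conn.executemany("insert into t values (?, ?)", page_views)

# The running total of the flag goes up by one at each session start,
# so every row of a session shares the same grp value.
sessions = conn.execute(
    """
    select grp, min(timestamp), max(timestamp), count(*)
    from (select t.*,
                 sum(is_a_new_session) over (order by timestamp) as grp
          from t
         ) t
    group by grp
    order by grp
    """
).fetchall()

for grp, first_ts, last_ts, views in sessions:
    print(grp, first_ts, last_ts, views)
```

Each output row is one session: its generated identifier, first and last timestamp, and number of page views.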

Related

SQL - How do I compare the data from one row to the next?

My data has a single ticket record with a Slot field to identify how long the ticket will take.
I don't know how to have the empty date fields populate with the data above based on how many 30 min slots there are in the ticket.
I want to replicate the SCHED_START AND SCHED_END dates from the ticket row to the other DATEDATE rows.
Here's my query that produces this data:
    SELECT DATEDATE
          ,TICKET_ID
          ,TECH
          ,TICK_TYPE
          ,SCHED_START
          ,SCHED_END
          ,PREV_DATE
          ,SLOTS
          ,SLOT_MINUTES
    FROM DATES_TICKETS
    ORDER BY DATEDATE, TECH
If I understand correctly, you want last_value(ignore nulls):
    select t.*,
           last_value(sched_start ignore nulls) over (order by datedate),
           . . .
    from t;
Not all databases support this SQL standard functionality.
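Where IGNORE NULLS is not available, one possible workaround (a different technique from the answer above) is to count the non-null values seen so far and take the single non-null value within each resulting group. The sketch below tries this in SQLite through Python's sqlite3 module; the table contents are hypothetical and only the datedate and sched_start column names come from the question.

```python
import sqlite3

# Hypothetical rows: (datedate, sched_start); None marks the empty fields.
rows_in = [(1, "A"), (2, None), (3, None), (4, "B"), (5, None)]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (datedate integer, sched_start text)")
conn.executemany("insert into t values (?, ?)", rows_in)

# count(sched_start) skips nulls, so the running count only increases on
# rows that carry a value; rows that follow inherit the same grp.
# Each grp then holds exactly one non-null value, which max() picks up.
filled = conn.execute(
    """
    select datedate,
           max(sched_start) over (partition by grp) as sched_start_filled
    from (select t.*,
                 count(sched_start) over (order by datedate) as grp
          from t
         ) t
    order by datedate
    """
).fetchall()

print(filled)
```

Rows that come before the first non-null value fall into group 0 and stay null, which matches what last_value with ignore nulls would return for them.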

How do I group a set of entities (people) into 10 equal portions but based on their usage for EX using Oracle's Toad 11g (SQL)

Hi, I have a list of 2+ million people and their usage, ordered from largest to smallest.
I tried ranking using row_number() over (partition by user column order by usage desc) as rnk
but that didn't work; the results were crazy.
Simply put, I just want 10 equal groups of 10, with the first group consisting of the highest usage, in the order in which I first listed them.
HELP!
You can use ntile():
select t.*, ntile(10) over (order by usage desc) as usage_decile
from t;
The only caveat: this divides the data into 10 groups whose sizes differ by at most one row. If usage values have duplicates, users with the same usage can end up in different deciles.
If you don't want that behavior, use a more manual calculation:
    select t.*,
           ceil(rank() over (order by usage desc) * 10 /
                count(*) over ()
               ) as usage_decile
    from t;
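The difference between the two approaches shows up as soon as there is a tie. The sketch below compares them in SQLite through Python's sqlite3 module, using ten made-up users of whom two share the same usage. Because SQLite divides integers with truncation and may lack ceil(), the rank-based formula is written here with an integer ceiling, (rank * 10 + n - 1) / n; that rewrite is an assumption of this example, not part of the Oracle answer.

```python
import sqlite3

# Hypothetical usage figures; users 2 and 3 tie at 90.
usage_rows = [(1, 100), (2, 90), (3, 90), (4, 80), (5, 70),
              (6, 60), (7, 50), (8, 40), (9, 30), (10, 20)]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (user_id integer, usage integer)")
conn.executemany("insert into t values (?, ?)", usage_rows)

rows = conn.execute(
    """
    select user_id,
           ntile(10) over (order by usage desc) as ntile_decile,
           -- integer ceiling of rank * 10 / n
           (rank() over (order by usage desc) * 10
            + count(*) over () - 1) / count(*) over () as rank_decile
    from t
    order by user_id
    """
).fetchall()

ntile_decile = {user_id: n for user_id, n, _ in rows}
rank_decile = {user_id: r for user_id, _, r in rows}

# ntile splits the tied users across two buckets; rank keeps them together.
print(sorted([ntile_decile[2], ntile_decile[3]]))
print(rank_decile[2], rank_decile[3])
```

Which of the two tied users lands in the earlier ntile bucket is not defined, so the example only looks at the pair of buckets they occupy.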

postgres select aggregate timespans

I have a table with the following structure:
timestamp-start, timestamp-stop
1,5
6,10
25,30
31,35
...
I am only interested in continuous timespans, i.e. where the break between a timestamp-stop and the following timestamp-start is less than 3.
How could I get the aggregated covered timespans as a result:
timestamp-start,timestamp-stop
1,10
25,35
The reason I am considering this is that a user may request a timespan that would need to return several thousand rows. However, most records are continuous, and the above method could potentially reduce many thousands of rows down to just a dozen. Or is the added computation not worth the savings in bandwidth and latency?
You can group the time stamps in three steps:
Add a flag to determine where a new period starts (that is, a gap greater than 3).
Cumulatively sum the flag to assign groupings.
Re-aggregate with the new groupings.
The code looks like:
    select min(ts_start) as ts_start, max(ts_end) as ts_end
    from (select t.*,
                 -- running total of the flag: one value per continuous period
                 sum(flag) over (order by ts_start) as grp
          from (select t.*,
                       -- 1 when the gap to the previous row exceeds 3, else 0
                       (coalesce(ts_start - lag(ts_end) over (order by ts_start), 0) > 3)::int as flag
                from t
               ) t
         ) t
    group by grp;
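The three steps can be checked against the four sample rows from the question. This is a minimal sketch in SQLite through Python's sqlite3 module; since SQLite has no ::int cast, the flag is written as a CASE expression, and the ts_start and ts_end column names follow the answer rather than the question's headers.

```python
import sqlite3

# Sample rows from the question: (ts_start, ts_end).
spans = [(1, 5), (6, 10), (25, 30), (31, 35)]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (ts_start integer, ts_end integer)")
conn.executemany("insert into t values (?, ?)", spans)

merged = conn.execute(
    """
    select min(ts_start), max(ts_end)
    from (select t.*,
                 -- step 2: running total of the flag labels each period
                 sum(flag) over (order by ts_start) as grp
          from (select t.*,
                       -- step 1: flag rows that start after a gap over 3;
                       -- the first row has no predecessor and gets 0
                       case when ts_start - lag(ts_end) over (order by ts_start) > 3
                            then 1 else 0 end as flag
                from t
               ) t
         ) t
    -- step 3: collapse each period to its overall start and end
    group by grp
    order by 1
    """
).fetchall()

print(merged)
```

The gap of 1 between 5 and 6 keeps the first two rows together, while the gap of 15 between 10 and 25 starts a new period.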

How do I group entry timestamps into visitor-specific "sessions" in PostgreSQL?

Here's some mock data:
visitor_id,channel,timestamp,order_id,session
100,A,1,,1
100,B,2,,1
100,A,3,,1
100,B,4,1,1
100,B,5,,2
100,B,6,,2
100,B,7,2,2
100,A,8,,3
100,A,9,,3
A visitor will come into the site via channels, and eventually order (creating an order_id). Many visitors never order, but I still want to group their session together (to determine what was their first channel, for example). The last column is one example solving the problem.
What's an efficient, declarative statement to create it in PostgreSQL? Are there better solutions than what I am proposing?
You want to combine the values up to an order. One method would be to assign each row a grouping id, such as the number of orders before a given time. This can be done with a correlated subquery:
    select md.*,
           (select count(md2.order_id)
            from mockdata md2
            where md2.visitor_id = md.visitor_id and
                  md2.timestamp < md.timestamp
           ) as session
    from mockdata md;
This can also be done using a cumulative count:
    select md.*,
           count(order_id) over (partition by visitor_id
                                 order by timestamp
                                 rows between unbounded preceding and 1 preceding
                                ) as session
    from mockdata md;
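Both queries number sessions from 0, because the count of earlier orders is 0 for a visitor's first rows. As a quick check, the sketch below runs the cumulative count over the question's mock data in SQLite (through Python's sqlite3 module) and adds 1 so the result can be compared with the session column given in the question. The added 1 is an adjustment made for this comparison only.

```python
import sqlite3

# Mock data from the question:
# (visitor_id, channel, timestamp, order_id, session).
mockdata = [
    (100, "A", 1, None, 1),
    (100, "B", 2, None, 1),
    (100, "A", 3, None, 1),
    (100, "B", 4, 1, 1),
    (100, "B", 5, None, 2),
    (100, "B", 6, None, 2),
    (100, "B", 7, 2, 2),
    (100, "A", 8, None, 3),
    (100, "A", 9, None, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mockdata (visitor_id integer, channel text, "
    "timestamp integer, order_id integer, session integer)"
)
conn.executemany("insert into mockdata values (?, ?, ?, ?, ?)", mockdata)

# Count the orders placed strictly before each row; the frame stops at
# "1 preceding" so the ordering row itself still belongs to its session.
computed = [
    row[0]
    for row in conn.execute(
        """
        select 1 + count(order_id) over (partition by visitor_id
                                         order by timestamp
                                         rows between unbounded preceding
                                                  and 1 preceding)
        from mockdata
        order by visitor_id, timestamp
        """
    )
]
expected = [row[4] for row in mockdata]

print(computed == expected)
```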

Teradata - Max value of a dataset with corresponding date

This is probably obvious, I just can't seem to get it to work right. Let's say I have a table of various servers and their CPU percentages for every day for the past year. I want to basically say:
"for every server name, show me the max CPU value that this server hit (from this dataset) and the corresponding date that it happened on"
So ideally I would get a result like:
server1 52.34% 3/16/2012
server2 48.76% 4/15/2012
server3 98.32% 6/16/2012
etc..
When I try to do this like so, I can't use a group by or else it just shows me every date:
select servername, date, max(cpu) from cpu_values group by 1,2 order by 1,2;
This of course just gives me every server and every date.. Sub-query? Partition by? Any assistance would be appreciated!
You can use the row_number() OLAP window function:
    select servername
         , cpu
         , date
    from cpu_values
    qualify row_number() over (partition by servername
                               order by cpu desc) = 1
Notice that you do not need a GROUP BY or ORDER BY clause. The PARTITION clause is similar to a GROUP BY and the ORDER BY clause sorts the rows within each partition (in this case by descending cpu). The "=1" part selects the single row that satisfies the condition.
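QUALIFY is not available in every database. A common equivalent is to compute row_number() in a subquery and filter on it in the outer WHERE clause. The sketch below shows that variant in SQLite through Python's sqlite3 module; the readings are hypothetical, and the rn and ranked names are introduced here only for illustration.

```python
import sqlite3

# Hypothetical readings: (servername, date, cpu).
readings = [
    ("server1", "2012-03-15", 40.10),
    ("server1", "2012-03-16", 52.34),
    ("server2", "2012-04-15", 48.76),
    ("server2", "2012-04-16", 12.00),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table cpu_values (servername text, date text, cpu real)")
conn.executemany("insert into cpu_values values (?, ?, ?)", readings)

# Number each server's rows from highest to lowest cpu, then keep row 1.
peaks = conn.execute(
    """
    select servername, cpu, date
    from (select c.*,
                 row_number() over (partition by servername
                                    order by cpu desc) as rn
          from cpu_values c
         ) ranked
    where rn = 1
    order by servername
    """
).fetchall()

for servername, cpu, date in peaks:
    print(servername, cpu, date)
```

If a server hits its maximum on more than one day, row_number() returns just one of those days; adding the date to the ORDER BY would make the choice deterministic.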
A subquery would be the simplest solution:
    SELECT
        S.Name, Peak.PeakUsage, MIN(S.Date) AS Date
    FROM
        ServerHistory AS S
    INNER JOIN
        (
            SELECT
                ID, MAX(CPUUsage) AS PeakUsage
            FROM
                ServerHistory
            WHERE
                Date BETWEEN X AND Y
            GROUP BY
                ID
        ) AS Peak
        ON S.ID = Peak.ID
        AND S.CPUUsage = Peak.PeakUsage  -- keep only the rows at the peak value
    GROUP BY
        S.Name, Peak.PeakUsage
P.S., next time around, you may want to tag with "SQL". There are relatively few Teradata people out there, but plenty who can help with basic SQL questions.