BigQuery Join Using Most Recent Row - sql

I have seen variations of this question but have been searching StackOverflow for almost a week now trying various solutions and still struggling with this. Really appreciate you taking the time to consider my question.
I am working on a research project in GCP using BigQuery. I have a table result of ~100 million rows of events where there is a session_id column that relates to the session that the event originated from. I would like to join this with another table status of about 40 million rows that has that same session_id and tracks the status of those sessions. Both tables have a time column. In the result table, this is the time of the event. In the status table this is the time of any status changes. I want to join the rows in the result table with the corresponding row in the status table for the most recent state of the session up to or before the time of the event using the session ID. The result would be that each row in the result table would have the corresponding information about the state of the session when the event occurred.
How can I achieve this? Any way to do it that won't be really inefficient? Thank you so much for your help!

You may be able to use a left join:
select r.*, s.status  -- choose whatever columns you want
from result r
left join (
    select s.*,
           lead(time) over (partition by session_id order by time) as next_time
    from status s
) s
on r.session_id = s.session_id
   and r.time >= s.time
   and (r.time < s.next_time or s.next_time is null)

Each status row is treated as valid from its own time up to (but not including) the next status change for that session, so an event joins to the one status row whose window contains the event time.
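To see the idea concretely, here is a minimal sketch using an in-memory SQLite database as a stand-in for BigQuery (SQLite also supports lead()). The table and column names follow the question, but the data is invented:

```python
import sqlite3

# Each status row is valid from its own time until the next status change;
# every event joins to the status row whose window contains the event time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE result (session_id INTEGER, time INTEGER, event TEXT);
CREATE TABLE status (session_id INTEGER, time INTEGER, state TEXT);
INSERT INTO status VALUES (1, 10, 'open'), (1, 20, 'active'), (1, 30, 'closed');
INSERT INTO result VALUES (1, 15, 'click'), (1, 25, 'scroll'), (1, 35, 'close');
""")
rows = conn.execute("""
SELECT r.session_id, r.time, r.event, s.state
FROM result r
LEFT JOIN (
    SELECT s.*,
           LEAD(time) OVER (PARTITION BY session_id ORDER BY time) AS next_time
    FROM status s
) s
ON  r.session_id = s.session_id
AND r.time >= s.time                                -- status began at or before the event
AND (r.time < s.next_time OR s.next_time IS NULL)   -- and was not yet superseded
ORDER BY r.time
""").fetchall()
print(rows)
# the event at t=15 falls in the 'open' window [10, 20),
# t=25 in 'active' [20, 30), t=35 in 'closed' [30, ...)
```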

Related

How to get last value from a table category wise?

I have a problem with retrieving the last value of every category from my table, which should not be sorted. For example, I want the last appearance of the nov-1 daily inventory value in the table, without sorting the daily inventory column, i.e. "471". Is there a way to achieve this?
Similarly, I need to get the last daily inventory value of the next week, and I should be able to do this for multiple items in the table too.
P.S.: nov-1 represents the nov-1st week.
Question from comments of initial post: will I be able to achieve what I need if I introduce a column id? If so, how can I do it?
Here's a way to do it (no guarantee that it's the most efficient way to do it)...
;WITH SetID AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Week ORDER BY Week) AS rowid, *
    FROM <TableName>
),
MaxRow AS
(
    SELECT LastRecord = MAX(rowid), Week
    FROM SetID
    GROUP BY Week
)
SELECT a.*
FROM SetID a
INNER JOIN MaxRow b
    ON a.rowid = b.LastRecord
    AND b.Week = a.Week
ORDER BY a.Week
I feel like there's more to the table though, and this is also untested on large amounts of data. I'd be afraid that a different rowid could be assigned on each run, since ROW_NUMBER() with ORDER BY Week inside a PARTITION BY Week gives no deterministic order within a week. (I haven't used ROW_NUMBER() enough to know if this would produce unexpected data.)
I suppose this example is to enforce the idea that, if you had a dedicated row ID on the table, it's possible. Also, I believe @Larnu's comment on your original post - introducing an ID column that retains current order, by reinserting all your data - is a concern too.
Here's a SQLFiddle example.
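To illustrate the point about a dedicated ID column: assuming an id that preserves insert order (a hypothetical schema, not the asker's actual table), last-row-per-week becomes a deterministic query. A small SQLite sketch:

```python
import sqlite3

# With an id column that reflects insert order, ROW_NUMBER() ordered by id DESC
# deterministically picks the last row of each week.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (id INTEGER PRIMARY KEY, week TEXT, daily_inventory INTEGER);
INSERT INTO inventory (week, daily_inventory) VALUES
  ('nov-1', 500), ('nov-1', 485), ('nov-1', 471),
  ('nov-2', 468), ('nov-2', 450);
""")
rows = conn.execute("""
SELECT week, daily_inventory
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY week ORDER BY id DESC) AS rn
    FROM inventory
)
WHERE rn = 1
ORDER BY week
""").fetchall()
print(rows)  # the last-inserted value per week
```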

Why do I get extra rows in LEFT JOIN when joining to an ID and TIMESTAMP column?

I have a table that contains multiple registration periods (date and time for the start of the registration, as well as date and time for when that instance of registration ends). For each row (registration period), there is a status column that contains the status at the end of the registration period. I was trying to get the status associated with the most recent end date of registration per a given ID. I've used a window function to get the most recent end date of interest per ID, and then I wanted to LEFT JOIN on ID and end date to get the status from the same table on which I used the window function. There should really be just one combination of end date and status per ID, but somehow I get more rows than what's in the left table.
Like I mentioned earlier, my approach was to use a window function to get MAX(end_date) per ID and some other column, let's call it enrollment_number. Then use LEFT JOIN on this table and its parent table to bring in status associated with that date only. Later, I'd like to use the result of this join to bring in the status associated with the end date into other tables where I need it.
WITH my_first_test AS
(
    SELECT my_id,
           enrollment_number,
           MAX(end_date_of_enrollment) OVER (PARTITION BY my_id, enrollment_number) AS end_date_enrolled
    FROM enrollments
)
SELECT mft.my_id, mft.end_date_enrolled, e.status
FROM my_first_test AS mft
LEFT JOIN enrollments AS e
    ON mft.my_id = e.my_id AND mft.end_date_enrolled = e.end_date_enrolled;
The CTE returns 42917 rows, the same number of rows as the enrollments table, which it should if I understand it correctly.
Then, I LEFT JOIN enrollments, to bring in information from the status column also contained in the enrollments table. The LEFT JOIN is done on my_id and end_date_enrolled.
I expect 42917 rows in the resulting table, because my_id and end_date_enrolled together should be unique. However, I get slightly more rows in my final table - 44408. I was wondering if the StackOverflow community would be able to help me solve this mystery. I am using SQL in AWS Redshift.
You have duplicates in enrollments. You can find them with aggregation:
SELECT my_id, end_date_enrolled, COUNT(*)
FROM enrollments AS e
GROUP BY my_id, end_date_enrolled
HAVING COUNT(*) > 1;
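To see why duplicates inflate a LEFT JOIN's row count, here is a small SQLite sketch with invented data (SQLite standing in for Redshift): a single duplicated (my_id, end_date_enrolled) pair turns 3 left rows into 5 output rows, because each duplicate on the right side matches the same left row again.

```python
import sqlite3

# Self-join on a key that is not unique: each right-side duplicate fans out
# the matching left row, so the result has more rows than the left table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollments (my_id INTEGER, end_date_enrolled TEXT, status TEXT);
INSERT INTO enrollments VALUES
  (1, '2020-01-31', 'completed'),
  (1, '2020-01-31', 'withdrawn'),   -- duplicate key, different status
  (2, '2020-02-29', 'completed');
""")
left_rows = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
joined = conn.execute("""
SELECT COUNT(*)
FROM enrollments a
LEFT JOIN enrollments b
  ON a.my_id = b.my_id AND a.end_date_enrolled = b.end_date_enrolled
""").fetchone()[0]
print(left_rows, joined)  # 3 left rows fan out to 5 joined rows
```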

How to get timespans between entities from timed log table?

I have a log table to store users' login/logout logs. My goal is to calculate how many times each user logged in and how long each login session lasted.
I'm working on PostgreSQL database. Log table has log_id(PK), user_id(FK), login_state, created_time. login_state column is enum type and its value is either 'login' or 'logout'.
For now, I used self join on Log table like below.
SELECT A.log_id, A.user_id, A.login_state, A.created_time,
       B.log_id, B.login_state, B.created_time,
       (B.created_time - A.created_time) AS elapsedtime
FROM logtable A
INNER JOIN logtable B
    ON A.login_state = 'login' AND B.login_state = 'logout'
WHERE A.user_id = B.user_id
  AND A.created_time <= B.created_time;
I got some correct records, but there are also wrong ones.
Maybe a join isn't the right solution here. For each login entry, only one logout entry should be matched, but I couldn't write the right query statement for that.
The ideal result would be a collection of login-logout pairs and the elapsed time of each one, per user.
Need some help. Thanks.
============== Added some sample data and expected results =========
Sample Log Table
Expected Results
DB Fiddle for testing:
https://www.db-fiddle.com/f/vz6EyKKTg6PWs1X4HbTspB/0
demo:db<>fiddle
You can use the lead() window function to get the next value into the current record.
SELECT *,
       logout_time - created_time AS elapsed_time
FROM (
    SELECT *,
           lead(created_time) OVER (PARTITION BY user_id ORDER BY created_time) AS logout_time
    FROM logtable
) s
WHERE login_state = 'login'
ORDER BY created_time
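Here is the same lead() pairing run against sample data modeled on the question (SQLite standing in for PostgreSQL; the timestamps are invented integers for brevity):

```python
import sqlite3

# lead() pulls the next row's timestamp for the same user into the current row;
# keeping only the 'login' rows pairs each login with its following logout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logtable (log_id INTEGER, user_id INTEGER, login_state TEXT, created_time INTEGER);
INSERT INTO logtable VALUES
  (1, 1, 'login', 100), (2, 1, 'logout', 160),
  (3, 1, 'login', 200), (4, 1, 'logout', 230),
  (5, 2, 'login', 110), (6, 2, 'logout', 150);
""")
rows = conn.execute("""
SELECT user_id, created_time, logout_time, logout_time - created_time AS elapsed
FROM (
    SELECT *,
           LEAD(created_time) OVER (PARTITION BY user_id ORDER BY created_time) AS logout_time
    FROM logtable
)
WHERE login_state = 'login'
ORDER BY user_id, created_time
""").fetchall()
print(rows)  # (user_id, login time, logout time, elapsed) per session
```

Note this assumes the log rows strictly alternate login/logout per user; an unmatched login would pair with the next login instead.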

How to get all records from one table between dates of column in another table

I'm using SQL Server. I have a queues table with a column called process_date, and a users table with a column called process_date; there is no relation between the two tables. I want to get all users that were processed with one queue, based on process_date (I have the queue_id and I want to return all users that were processed with that queue).
How can I write a query to do so?
Update: process_date is a datetime, so it won't match exactly between the queue and the users. That's why I need to get the process date of the previous queue and of the current queue, and then find the users that were processed in that date range.
Try this
SELECT *
FROM Users
WHERE EXISTS
(
    SELECT 1
    FROM [Queue]
    WHERE [Queue].process_date = Users.process_date
      AND [Queue].QueueId = 1
)
This will return all users processed for queue_id 1.
Note:
It is not recommended to join tables on dates to retrieve records. If possible, you should add a queue id to the users table and create a foreign key constraint to the queue table, then update the queue id for each user, because a variation of even a few milliseconds in the timestamps can lead to a mismatch in the query above.
With so little detail to work with, any answer will likewise be "thin on detail" too. e.g.
select * from users u
inner join queue q on u.process_date = q.process_date
or, if there is some form of delay, then perhaps a small time range may be needed: e.g.
select * from users u
inner join queue q on u.process_date between q.process_date and dateadd(minute,1,q.process_date)
The process_date field is a datetime, so it won't be an exact match; that's why I'm not joining directly. What I did is select the previous queue's process_date and the current queue's process_date, and then select the users with a process_date between those two dates.
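The previous-queue/current-queue window the asker describes can be sketched with lag(). This is a hypothetical schema in SQLite (integer timestamps standing in for datetimes), not the real tables:

```python
import sqlite3

# lag() fetches the previous queue's process_date; a user belongs to a queue
# when its process_date falls in the (previous queue, current queue] window.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE queues (queue_id INTEGER, process_date INTEGER);
CREATE TABLE users  (user_id  INTEGER, process_date INTEGER);
INSERT INTO queues VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO users  VALUES (10, 150), (11, 180), (12, 250), (13, 90);
""")
rows = conn.execute("""
WITH q AS (
    SELECT queue_id, process_date,
           LAG(process_date) OVER (ORDER BY process_date) AS prev_date
    FROM queues
)
SELECT u.user_id
FROM users u
JOIN q ON u.process_date >  COALESCE(q.prev_date, 0)
      AND u.process_date <= q.process_date
WHERE q.queue_id = 2          -- users processed between queue 1 and queue 2
ORDER BY u.user_id
""").fetchall()
print(rows)  # users 10 and 11 fall in the (100, 200] window
```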

Updating rows in a view from a complex select statement

I have been working on a query that identifies an issue with the data in my database:
SELECT t1.*
FROM [DailyTaskHours] t1
INNER JOIN (
SELECT ActivityDate
,taskId
,EnteredBy
FROM [DailyTaskHours]
WHERE hours != 0
GROUP BY EnteredBy
,taskId
,ActivityDate
HAVING COUNT(*) > 1
) t2 ON (
t1.ActivityDate = t2.ActivityDate
AND t1.taskId = t2.taskId
AND t1.EnteredBy = t2.EnteredBy
AND t1.Hours != 0
)
ORDER BY ActivityDate
What this does is find duplicate hours booked for the same person on the same task on the same day:
Now that I found the issues, I want to correct them with an UPDATE. I want the duplicate activity that was created earlier than the other to have its value moved from Hours to doubleBookedHours, and its Hours zeroed out. Secondly, I want the more recent row's DoubleBookedFlag column updated to 1.
How can I achieve this?
You can write a SQL Server Agent job to call T-SQL, or an SSIS package, to perform your logic.
I always like using pseudo-code when designing an algorithm. For instance:
Find duplicate entries and save them to a temporary table, either in a staging area or tempdb - some location that is accessible by multiple processes (spids).
Find the least recent record(s). Move their hours to the double-booked column and zero out the hours column.
Update the most recent record to set the double-booked flag column to 1.
You were not specific on moving the value from hours to double-booked hours - are these columns?
In short, a SQL Server Agent job and several correct T-SQL steps should solve your problem.
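The pseudo-code steps above can be sketched as two UPDATEs. This SQLite version uses an id column as a stand-in for creation order (an assumption - the real table would order by its creation timestamp), and the column names follow the question:

```python
import sqlite3

# For each duplicate (EnteredBy, taskId, ActivityDate) group, the EARLIER row
# moves its hours into doubleBookedHours and zeroes Hours; the LATER row gets
# DoubleBookedFlag = 1. The id column is a hypothetical proxy for creation order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DailyTaskHours (
    id INTEGER PRIMARY KEY, EnteredBy TEXT, taskId INTEGER,
    ActivityDate TEXT, Hours REAL,
    doubleBookedHours REAL DEFAULT 0, DoubleBookedFlag INTEGER DEFAULT 0
);
INSERT INTO DailyTaskHours (EnteredBy, taskId, ActivityDate, Hours) VALUES
  ('alice', 7, '2020-03-01', 4.0),   -- earlier duplicate
  ('alice', 7, '2020-03-01', 4.0),   -- later duplicate
  ('bob',   8, '2020-03-01', 2.0);   -- no duplicate
""")
conn.executescript("""
-- earlier row of each duplicate group: move Hours aside and zero it out
UPDATE DailyTaskHours
SET doubleBookedHours = Hours, Hours = 0
WHERE id IN (
    SELECT MIN(id) FROM DailyTaskHours
    WHERE Hours != 0
    GROUP BY EnteredBy, taskId, ActivityDate
    HAVING COUNT(*) > 1
);
-- later row of each duplicate group: flag it as double-booked
UPDATE DailyTaskHours
SET DoubleBookedFlag = 1
WHERE id IN (
    SELECT MAX(id) FROM DailyTaskHours
    GROUP BY EnteredBy, taskId, ActivityDate
    HAVING COUNT(*) > 1
);
""")
rows = conn.execute(
    "SELECT id, Hours, doubleBookedHours, DoubleBookedFlag FROM DailyTaskHours ORDER BY id"
).fetchall()
print(rows)
```

In T-SQL the same effect could be had with an updatable CTE over ROW_NUMBER(), but the two-step structure is the same.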