BigQuery - Query time becomes extremely long

Recently all of my queries have been taking far too long, even though almost none of them process any data.
For example, here is the job information for a really simple query:
Start Time: Jan 14, 2016, 12:35:13 PM
End Time: Jan 14, 2016, 12:35:15 PM
Bytes Processed: 0 B
Bytes Billed: 0 B
Billing Tier: 1
Destination Table: ****************.******************
Write Preference: Append to table
Allow Large Results: true
Flatten Results: true
This is the information I got from the BQ console. It tells me that the query doesn't process any data (which is true) and takes only two seconds.
But it actually takes 27 seconds when I run the query again in the console by clicking Run Query in the query history. After that, the Query History in the console again shows that the query took 2 seconds.
Basically all the queries in this dataset have this issue.
I have over 40000 tables in this dataset.
So my guess is that before BQ actually runs the query, it first locates the tables that are going to be used. Only then does it start executing the query, and that moment is the start time shown in the query history.
If that is the case, why does it take so long, and how should I solve it?
Here is the query I mentioned (I have made some changes):
select "some_id", '2015-12-01', if (count(user_id) == 0, NULL, sum(users_in_today_again) / count(user_id)) as retention
from
(
select
users_in_last_day.user_id as user_id,
if(users_in_today.user_id is null, 0, 1) as users_in_today_again
FROM
(
select user_id
from
table_date_range(ds.sessions_some_id_, date_add(timestamp('2015-12-01'), -1, "DAY"), date_add(timestamp('2015-12-01'), -1, "DAY"))
group by user_id
) as users_in_last_day
left join
(
select user_id
from table_date_range(ds.sessions_some_id_, timestamp('2015-12-01'), timestamp('2015-12-01'))
group by user_id
) as users_in_today
on users_in_last_day.user_id = users_in_today.user_id
)
Thanks in advance!

PART 1
You can check your theory about a delay before the start time by calling the Jobs: get API with the job ID taken from Query History in the BQ console.
As you can see in the Job resource, the statistics field contains creationTime in addition to startTime and endTime.
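As a rough illustration of that check, the sketch below takes the statistics part of a job resource as a plain dictionary and splits the elapsed time into "waiting before start" and "running". It assumes the three fields arrive as strings holding milliseconds since the epoch; the sample values are made up to resemble the question (a long wait, then a two-second run), and job_timings is a hypothetical helper name, not part of any BigQuery client.

```python
def job_timings(statistics):
    """Return (queue_seconds, run_seconds) from a job's statistics dict.

    Assumes creationTime, startTime and endTime are strings containing
    milliseconds since the epoch.
    """
    created_ms = int(statistics["creationTime"])
    started_ms = int(statistics["startTime"])
    ended_ms = int(statistics["endTime"])
    queue_seconds = (started_ms - created_ms) / 1000.0
    run_seconds = (ended_ms - started_ms) / 1000.0
    return queue_seconds, run_seconds


# Made-up values: the job waits 25 s before it starts, then runs for 2 s.
sample_statistics = {
    "creationTime": "1452774888000",
    "startTime": "1452774913000",
    "endTime": "1452774915000",
}

queue_seconds, run_seconds = job_timings(sample_statistics)
print(f"waited {queue_seconds:.0f}s before start, ran for {run_seconds:.0f}s")
```

If the first number dominates, the time is being spent before startTime, which is what the question suspects.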
PART 2
Shooting in the dark here, but try below
SELECT "some_id", '2015-12-01', IF (COUNT(user_id) == 0, NULL, SUM(users_in_today_again) / COUNT(user_id)) AS retention
FROM
(
SELECT
users_in_last_day.user_id AS user_id,
IF(users_in_today.user_id IS NULL, 0, 1) AS users_in_today_again
FROM
(
SELECT user_id FROM (
SELECT user_id, ROW_NUMBER() OVER(PARTITION BY user_id) AS pos
FROM TABLE_DATE_RANGE(ds.sessions_some_id_, DATE_ADD(TIMESTAMP('2015-12-01'), -1, "DAY"), DATE_ADD(TIMESTAMP('2015-12-01'), -1, "DAY"))
) WHERE pos = 1
) AS users_in_last_day
LEFT JOIN
(
SELECT user_id FROM (
SELECT user_id, ROW_NUMBER() OVER(PARTITION BY user_id) AS pos
FROM TABLE_DATE_RANGE(ds.sessions_some_id_, TIMESTAMP('2015-12-01'), TIMESTAMP('2015-12-01'))
) WHERE pos = 1
) AS users_in_today
ON users_in_last_day.user_id = users_in_today.user_id
)
I know it might look silly, but the explanation stats (based on some dummy data) for this version
are totally different from those for the version in the question.
My wild guess is that the heavy read/compute in Stage 1/2 of the original version could be responsible for the delay in question.
It is just a guess.

As hinted at on the comment thread on Mikhail's question, most of the time is probably spent evaluating the TABLE_DATE_RANGE functions in the query. This time is currently accounted for between creationTime and startTime in query statistics.
In general, tens or hundreds of thousands of tables in a dataset will cause slow performance when using TABLE_DATE_RANGE, TABLE_QUERY, or the <dataset>.__TABLES__ metatable. We are working to update our public documentation to mention this.
My suggestion is that if you want to use table wildcards on a dataset, make sure it doesn't have too many tables in it. If that solution is unworkable for you, let us know if BigQuery could support something that would make your use case easier on our issue tracker.

Related

Find two local averages within one SQL Server data set

In the plant at our company there is a physical process that has a two-stage start and a two-stage finish. As a widget starts to enter the process a new record is created containing the widget ID and a timestamp (DateTimeCreated) and once the widget fully enters the process another timestamp is logged in a different field for the same record (DateTimeUpdated). The interval is a matter of minutes.
Similarly, as a widget starts to exit the process another record is created containing the widget ID and the DateTimeCreated, with the DateTimeUpdated being populated when the widget has fully exited the process. In the current table design an "exiting" record is indistinguishable from an "entering" record (although a given widget ID occurs only either once or twice so a View could utilise this fact to make the distinction, but let's ignore that for now).
The overall time a widget is in the process is several days but that's not really of importance to the discussion. What is important is that the interval when exiting the process is always longer than when entering. So a very simplified, imaginary set of sorted interval values might look like this:
1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10
You can see there is a peak in the occurrences of intervals around the 3-minute-mark (the "enters") and another peak around the 7/8-minute-mark (the "exits"). I've also excluded intervals of 5 minutes to demonstrate that enter-intervals and exit-intervals can be considered mutually exclusive.
We want to monitor the performance of each stage in the process daily by using a query to determine the local averages of the entry and exit data point clusters. So conceptually the two data sets could be split either side of an overall average (in this case 5.375) and then an average calculated for the values below the split (2.75) and another average above the split (8). Using the data above (in a random distribution) the averages are depicted as the dotted lines in the chart below.
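The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of the described split (values strictly below and strictly above the overall mean), using the sample intervals from the question; split_averages is a hypothetical helper name, and it assumes both sides of the split are non-empty.

```python
intervals = [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4,
             6, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 10]


def split_averages(values):
    """Return (overall mean, mean of values below it, mean of values above it)."""
    midpoint = sum(values) / len(values)
    below = [v for v in values if v < midpoint]
    above = [v for v in values if v > midpoint]
    # Assumes at least one value falls on each side of the midpoint.
    return midpoint, sum(below) / len(below), sum(above) / len(above)


midpoint, entry_avg, exit_avg = split_averages(intervals)
print(midpoint, entry_avg, exit_avg)
```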
My current approach is to use two Common Table Expressions followed by a final three-table-join query. It seems okay, but I can't help feeling it could be better. Would anybody like to offer an alternative approach or other observations?
WITH cte_Raw AS
(
SELECT
DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
FROM
MyTable
WHERE
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
SELECT
AVG(Interval) AS Interval
FROM
cte_Raw
)
SELECT
AVG([Entry].Interval) AS AverageEntryInterval
, AVG([Exit].Interval) AS AverageExitInterval
FROM
cte_Raw AS [Entry]
INNER JOIN
cte_Midpoint
ON
[Entry].Interval < cte_Midpoint.Interval
INNER JOIN
cte_Raw AS [Exit]
ON
[Exit].Interval > cte_Midpoint.Interval

I don't think your query produces accurate results. Your two JOINs produce a proliferation of rows, which throws the averages off. They might look correct (because one is less than the other), but if you did counts, you would see that the counts in your query have little to do with the sample data.
If you are just looking for the average of the values below the overall average and the average of the values above it, then you can use window functions:
WITH t AS (
    SELECT t.*, v.[Interval],
           AVG(v.[Interval]) OVER () AS avg_interval
    FROM MyTable t CROSS APPLY
         -- CROSS APPLY (rather than CROSS JOIN) lets the VALUES row refer to columns of t
         (VALUES (DATEDIFF(minute, t.DateTimeCreated, t.DateTimeUpdated))
         ) v([Interval])
    WHERE t.DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime)
)
SELECT AVG(CASE WHEN t.[Interval] < t.avg_interval THEN t.[Interval] END) AS AverageEntryInterval,
       AVG(CASE WHEN t.[Interval] > t.avg_interval THEN t.[Interval] END) AS AverageExitInterval
FROM t;

I decided to post my own answer because, at the time of writing, neither of the two proposed answers will run. I have, however, removed the JOIN statements and used the CASE approach proposed by Gordon.
I've also multiplied the DATEDIFF result by 1.0 so that AVG does not work on integers and truncate the result.
WITH cte_Raw AS
(
SELECT
1.0 * DATEDIFF(minute, DateTimeCreated, DateTimeUpdated) AS [Interval]
FROM
MyTable
WHERE
DateTimeCreated > CAST(CAST(GETDATE() AS date) AS datetime) -- Today
)
, cte_Midpoint AS
(
SELECT
AVG(Interval) AS Interval
FROM
cte_Raw
)
SELECT AVG(CASE WHEN cte_Raw.Interval < cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageEntryInterval,
AVG(CASE WHEN cte_Raw.Interval > cte_Midpoint.Interval THEN cte_Raw.[Interval] END) AS AverageExitInterval
FROM cte_Raw CROSS JOIN cte_Midpoint
This solution does not cater for the theoretical pitfall indicated by Vladimir of uneven dispersions of Entry vs Exit intervals, as in practice we can be confident this does not occur.

SQL script to find previous value, not necessarily previous row

Is there a way in SQL to find a previous value, not necessarily in the previous row, within the same SELECT statement?
See picture below. I'd like to add another column, ELAPSED, that calculates the time difference between TIMERSTART, but only when DEVICEID is the same, and I_TYPE is viewDisplayed. e.g. subtract 1 from 2, store difference in 3, store 0 in 4 because i_type is not viewDisplayed, subtract 2 from 5, store difference in 6, and so on.
It has to be a statement, I can't use a stored procedure in this case.
SELECT DEVICEID, I_TYPE, TIMERSTART,
0 AS ELAPSED -- CASE WHEN <CONDITION> THEN TIMEDIFF() ELSE 0 END AS ELAPSED
FROM CLIENT_USAGE
ORDER BY TIMERSTART ASC
I'm using SAP HANA DB, but it works pretty much like the latest version of MS-SQL. So, if you know how to make it work in SQL, I can make it work in HANA.

You can use a correlated subquery to find the latest time recorded before the row in question.
select CU.deviceid, CU.i_type, CU.timerstart,
       -- subtraction syntax varies by database; swap in your platform's time-difference function if needed
       CU.timerstart - (select max(C.timerstart)
                        from CLIENT_USAGE C
                        where C.i_type = CU.i_type
                          and C.deviceid = CU.deviceid
                          and C.timerstart < CU.timerstart) as elapsed
from CLIENT_USAGE CU
order by CU.timerstart asc
This is a rough sketch of what the SQL should look like. I do not know what the primary key of this table is (i_type alone, or i_type and deviceid), but this should at least show how to calculate the field. I do not think it is necessary to store the value unless the table is very large or the hardware is very slow; it can be calculated rather easily each time the query runs.

SAP HANA supports window functions:
select DEVICEID,
TIMERSTART,
lag(TIMERSTART) over (partition by DEVICEID order by TIMERSTART) as previous_start
from CLIENT_USAGE
Then you can wrap this in parentheses (as a subquery) and manipulate the data to your heart's content.
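To make the "previous value within the same device" idea concrete, here is a small Python sketch that emulates a LAG-style lookup partitioned by device. It reflects one reading of the question: ELAPSED is the gap to the same device's previous viewDisplayed row, and 0 for every other row. The i_type "buttonTapped", the integer timestamps and the function name add_elapsed are assumptions made up for the illustration.

```python
def add_elapsed(rows):
    """rows: list of (device_id, i_type, timer_start) tuples in any order.

    Returns the rows sorted by timer_start, each extended with an elapsed value.
    """
    last_view = {}  # device_id -> timer_start of its previous viewDisplayed row
    result = []
    for device_id, i_type, timer_start in sorted(rows, key=lambda r: r[2]):
        elapsed = 0
        if i_type == "viewDisplayed":
            previous = last_view.get(device_id)
            if previous is not None:
                elapsed = timer_start - previous
            last_view[device_id] = timer_start
        result.append((device_id, i_type, timer_start, elapsed))
    return result


sample = [
    ("A", "viewDisplayed", 100),
    ("A", "viewDisplayed", 130),
    ("B", "viewDisplayed", 140),
    ("A", "buttonTapped", 150),
    ("A", "viewDisplayed", 190),
    ("B", "viewDisplayed", 200),
]

for row in add_elapsed(sample):
    print(row)
```

In SQL the same effect would come from adding I_TYPE to the partition and wrapping the LAG result in a CASE expression, as the question's own comment suggests.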

How to choose the latest partition in BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
  SELECT pt FROM (
    SELECT pt, RANK() OVER(ORDER by pt DESC) as rnk FROM (
      SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
  )
  WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
where 'YYYY-MM-DD' is a specific date does work.
However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?

October 2019 Update
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now
See example below
DECLARE max_date TIMESTAMP;
SET max_date = (
SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update for those who like downvoting without checking context, etc.
I think this answer was accepted because it addressed the OP's main question, "Is there a way I can pull data only from the latest partition in BigQuery?", and the comments already noted that the BQ engine obviously still scans ALL rows and only returns results based on the most recent partition. As was also mentioned in a comment on the question, this is easily addressed by scripting that logic: first get the result of the subquery, then use it in the final query.
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)

Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to compare against a more-or-less constant expression instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by selecting yesterday's partition like so:
SELECT * FROM `dataset.partitioned_table`
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case it happens to be close enough. Use INTERVAL 0 DAY if you want today's data and don't mind that the query returns 0 results for the part of the day before the partition has been created.
I'm happy to learn if there is a better workaround to get the latest partition!

List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables

I found a workaround for this issue. You can use a WITH clause that selects the last few partitions and then filter the result. I think this is a better approach because:
You are not limited to a fixed partition date (like today - 1 day). It will always take the latest partition from the given range.
It scans only the last few partitions and not the whole table.
Example that scans the last 3 partitions:
WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME
FROM dataset.partitioned_table
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)

This is a compromise that queries only a few partitions without resorting to scripting and without failing when the partition for a fixed date is missing:
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)

You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope that the latest partition is no more than ~3 days old. I did the split and ordinal stuff to guard against my table prefix appearing more than once in the table name for some reason.
This should work for either _PARTITIONTIME or _TABLE_SUFFIX.
select * from `project.dataset.tablePrefix*`
where _PARTITIONTIME = (
SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__`
where table_id like 'tablePrefix%'
order by table_id desc limit 1)

I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
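A minimal sketch of what "scripting this outside BigQuery" could look like in Python follows. It assumes a hypothetical run_query callable that takes a SQL string and returns a list of row tuples; that callable stands in for whichever BigQuery client you use and is not a real library function. The first call resolves the latest date, and the second inlines it as a DATE literal so that the filter is a constant rather than a subquery. The fake runner at the bottom only records the SQL, so nothing here contacts BigQuery.

```python
import datetime

TABLE = "fh-bigquery.wikipedia_v3.pageviews_2019"  # table name taken from the answer above


def max_views_on_latest_day(run_query):
    """run_query is a hypothetical callable: SQL string -> list of row tuples."""
    # Step 1: resolve the latest date on its own.
    rows = run_query(
        f"SELECT DATE(MAX(datehour)) FROM `{TABLE}` WHERE wiki='es'"
    )
    max_date = rows[0][0]
    if type(max_date) is not datetime.date:
        raise TypeError("expected a plain date from step 1")

    # Step 2: inline the date as a literal so the filter is a constant.
    rows = run_query(
        f"SELECT MAX(views) FROM `{TABLE}` "
        f"WHERE DATE(datehour) = DATE '{max_date.isoformat()}' AND wiki='es'"
    )
    return rows[0][0]


# Stand-in runner for a dry run: it records the SQL instead of calling BigQuery.
issued = []


def fake_run_query(sql):
    issued.append(sql)
    if len(issued) == 1:
        return [(datetime.date(2019, 12, 31),)]
    return [(12345,)]


print(max_views_on_latest_day(fake_run_query))
print(issued[1])
```

Whether the second query is billed for a single partition is something to confirm from the bytes-processed figure of the real job, as the answer does with its own measurements.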

Building on the answer from Chase. If you have a table that requires you filter over a column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.

SQL Query - Pull User Achievement

First, I am sorry that I could not come up with a better title for this question.
I have a badge/achievement system on my website: community users are awarded specific badges according to their activity on the site. Below is the SQL I use to find the users who have made at least 100 forum posts (I am using Informix version 10):
SELECT tjm.userid::INTEGER AS user_id,
EXTEND(DBINFO("UTC_TO_DATETIME",tjm.creationdate/1000), year to fraction)
AS earned_date
FROM TABLE(
MULTISET(
SELECT jm.userid, jm.creationdate, (
SELECT COUNT(*) from TABLE(
MULTISET(
SELECT userid, creationdate
FROM jive:jivemessage
)
) AS i
WHERE i.userid = jm.userid AND i.creationdate < jm.creationdate
) + 1 AS row_num
FROM jive:jivemessage jm
)
) AS tjm
WHERE tjm.row_num=100
This SQL takes more than 30 minutes to execute; we have a very large community and there are millions of forum posts.
Is there a way to improve the query's performance? I am trying to reduce the execution time because I have 40 SQL queries similar to this one, for different tables and activities.

I don't know Informix, but the query below should do what you ask and it's ANSI SQL (except for the EXTEND part, which I copied from your original query).
SELECT
jm.userid
,EXTEND(DBINFO("UTC_TO_DATETIME",tjm.creationdate/1000), year to fraction) AS earned_date
FROM
(
-- This sub-query will return all Users who have 100 messages or more
SELECT
jm.userid
,count(jm.userid) as totalmessages
FROM
jive:jivemessage jm
GROUP BY
jm.userid
HAVING
count(jm.userid) >= 100) AS MessageCount
The above could probably be done without having to use a sub-query. The only reason why I used it is to have the DateEarned, as per original query, in the result set. Adding it to the sub-query would have required adding it to the GROUP BY, with unpredictable results if the query runs across two days (e.g. at 23:59:59).
Update 2012/08/14 - Rewritten query following new requirements
As I stated before, I don't know Informix at all, therefore the following query may or may not run.
SELECT
UsersWithBadge.userid
,MAX(UsersWithBadge.creationdate) as dateearned
FROM
(
SELECT FIRST 100
jm.userid
,jm.creationdate
FROM
jive:jivemessage jm
JOIN
(-- This sub-query will return all Users who have 100 messages or more
SELECT
jm.userid
,count(jm.userid) as totalmessages
FROM
jive:jivemessage jm
GROUP BY
jm.userid
HAVING
count(jm.userid) >= 100)
AS MessageCount ON
(MessageCount.userid = jm.userid)
) AS UsersWithBadge
GROUP BY
UsersWithBadge.userid
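For reference, the requirement in the question (the creation time of each user's 100th message) can be stated as a single pass over the messages in time order. The Python sketch below is only an illustration of that logic, not Informix code and not the method of either query above; badge_dates, the tuple layout and the tiny sample are assumptions, and the sample uses a threshold of 3 so the result is easy to check by hand.

```python
from collections import defaultdict


def badge_dates(messages, threshold=100):
    """messages: iterable of (user_id, creation_ms) pairs.

    Returns {user_id: creation_ms of that user's threshold-th message}.
    """
    counts = defaultdict(int)
    earned = {}
    # One ordered pass replaces the per-row "count earlier messages" subquery.
    for user_id, creation_ms in sorted(messages, key=lambda m: m[1]):
        counts[user_id] += 1
        if counts[user_id] == threshold:
            earned[user_id] = creation_ms
    return earned


# Tiny made-up data set, using a threshold of 3 instead of 100.
sample = [(1, 10), (2, 15), (1, 20), (1, 30), (2, 35), (1, 40)]
print(badge_dates(sample, threshold=3))
```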

How do I analyse time periods between records in SQL data without cursors?

The root problem: I have an application which has been running for several months now. Users have been reporting that it's been slowing down over time (so in May it was quicker than it is now). I need to get some evidence to support or refute this claim. I'm not interested in precise numbers (so I don't need to know that a login took 10 seconds), I'm interested in trends - that something which used to take x seconds now takes of the order of y seconds.
The data I have is an audit table which stores a single row each time the user carries out any activity - it includes a primary key, the user id, a date time stamp and an activity code:
create table AuditData (
AuditRecordID int identity(1,1) not null,
DateTimeStamp datetime not null,
DateOnly datetime null,
UserID nvarchar(10) not null,
ActivityCode int not null)
(Notes: DateOnly (datetime) is the DateTimeStamp with the time stripped off to make grouping by day easier; it's effectively duplicate data to make querying faster.
Also, for the sake of ease, you can assume that the ID is assigned in date time order, that is, 1 will always be before 2, which will always be before 3; if this isn't true I can make it so.)
ActivityCode is an integer identifying the activity which took place, for instance 1 might be user logged in, 2 might be user data returned, 3 might be search results returned and so on.
Sample data for those who like that sort of thing...:
1, 01/01/2009 12:39, 01/01/2009, P123, 1
2, 01/01/2009 12:40, 01/01/2009, P123, 2
3, 01/01/2009 12:47, 01/01/2009, P123, 3
4, 01/01/2009 13:01, 01/01/2009, P123, 3
User data is returned (Activity Code 2) immediately after login (Activity Code 1), so this can be used as a rough benchmark of how long the login takes (as I said, I'm interested in trends, so as long as I'm measuring the same thing for May as for July it doesn't matter so much if this isn't the whole login process; it takes in enough of it to give a rough idea).
(Note: User data can also be returned under other circumstances so it's not a one to one mapping).
So what I'm looking to do is select the average time between a login (say ActivityCode 1) and the first instance after that, for that user on that day, of user data being returned (say ActivityCode 2).
I can do this by going through the table with a cursor, getting each login instance and then, for each one, selecting the earliest user data return following it for that user on that day, but that's obviously not optimal and is slow as hell.
My question is (finally): is there a "proper" SQL way of doing this using self joins or similar, without using cursors or some similar procedural approach? I can create views and whatever to my heart's content; it doesn't have to be a single select.
I can hack something together but I'd like to make the analysis I'm doing a standard product function so would like it to be right.

SELECT TheDay, AVG(TimeTaken) AvgTimeTaken
FROM (
SELECT
CONVERT(DATE, logins.DateTimeStamp) TheDay
, DATEDIFF(SS, logins.DateTimeStamp,
(SELECT TOP 1 DateTimeStamp
FROM AuditData userinfo
WHERE userinfo.UserID=logins.UserID
and userinfo.ActivityCode=2
and userinfo.DateTimeStamp > logins.DateTimeStamp
ORDER BY userinfo.DateTimeStamp ) -- earliest later row, not an arbitrary one
)TimeTaken
FROM AuditData logins
WHERE
logins.ActivityCode = 1
) LogInTimes
GROUP BY TheDay
This might be dead slow in the real world, though.
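To sanity-check whatever SQL you end up with, the same measurement can be sketched in plain Python over the sample rows from the question. This is a simplified variant, stated as an assumption: each user-data row (code 2) is paired with the most recent unmatched login (code 1) for the same user, and the pair is counted only if both fall on the same day. The DateOnly column is omitted, the sample dates are read as day/month/year, and avg_login_seconds is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime


def avg_login_seconds(rows):
    """rows: (record_id, timestamp, user_id, activity_code) tuples.

    Returns {date: average seconds between a login and the next user-data row}.
    """
    by_user = defaultdict(list)
    for _, ts, user, code in rows:
        by_user[user].append((ts, code))

    deltas = defaultdict(list)
    for events in by_user.values():
        events.sort()
        pending_login = None
        for ts, code in events:
            if code == 1:
                pending_login = ts
            elif code == 2 and pending_login is not None:
                if ts.date() == pending_login.date():
                    deltas[ts.date()].append((ts - pending_login).total_seconds())
                pending_login = None  # each login is matched at most once
    return {day: sum(v) / len(v) for day, v in deltas.items()}


sample_rows = [
    (1, datetime(2009, 1, 1, 12, 39), "P123", 1),
    (2, datetime(2009, 1, 1, 12, 40), "P123", 2),
    (3, datetime(2009, 1, 1, 12, 47), "P123", 3),
    (4, datetime(2009, 1, 1, 13, 1), "P123", 3),
]
print(avg_login_seconds(sample_rows))
```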

In Oracle this would be a cinch, because of analytic functions. In this case, LAG() makes it easy to find the matching pairs of activity codes 1 and 2 and also to calculate the trend. As you can see, things got worse on 2nd JAN and improved quite a bit on the 3rd (I'm working in seconds rather than minutes).
select DateOnly
     , elapsed_time
     , elapsed_time - lag (elapsed_time) over (order by DateOnly) as trend
from
(
  select DateOnly
       , avg(databack_time - prior_login_time) as elapsed_time
  from
  ( select DateOnly
         , databack_time
         , ActivityCode
         , lag(login_time) over (order by DateOnly,UserID, AuditRecordID, ActivityCode) as prior_login_time
    from
    (
      select a1.AuditRecordID
           , a1.DateOnly
           , a1.UserID
           , a1.ActivityCode
           , to_number(to_char(a1.DateTimeStamp, 'SSSSS')) as login_time
           , 0 as databack_time
      from AuditData a1
      where a1.ActivityCode = 1
      union all
      select a2.AuditRecordID
           , a2.DateOnly
           , a2.UserID
           , a2.ActivityCode
           , 0 as login_time
           , to_number(to_char(a2.DateTimeStamp, 'SSSSS')) as databack_time
      from AuditData a2
      where a2.ActivityCode = 2
    )
  )
  where ActivityCode = 2
  group by DateOnly
)
/
DATEONLY  ELAPSED_TIME      TREND
--------- ------------ ----------
01-JAN-09          120
02-JAN-09          600        480
03-JAN-09          150       -450
Like I said in my comment I guess you're working in MSSQL. I don't know whether that product has any equivalent of LAG().

If the assumptions are that:
Users will perform various tasks in no mandated order, and
That the difference between any two activities reflects the time it takes for the first of those two activities to execute,
Then why not create a table with two timestamps, the first column containing the activity start time and the second containing the next activity's start time? The difference between the two will then always be the total time of the first activity. For the logout activity, you would just have NULL in the second column.
So it would be kind of weird and interesting: for each activity (other than logging in and logging out), the time stamp would be recorded in two different rows, once for the last activity (as the time "completed") and again in a new row (as the time started). You would end up with a Jacob's ladder of sorts, but finding the data you are after would be much simpler.
In fact, to get really wacky, you could have each row have the time that the user started activity A and the activity code, and the time started activity B and the time stamp (which, as mentioned above, gets put down again for the following row). This way each row will tell you the exact difference in time for any two activities.
Otherwise, you're stuck with a query that says something like
SELECT TIME_IN_SEC(row2-timestamp) - TIME_IN_SEC(row1-timestamp)
which would be pretty slow, as you have already suggested. By swallowing the redundancy, you end up just querying the difference between the two columns. You probably would have less need of knowing the user info as well, since you'd know that any row shows both activity codes, thus you can just query the average for all users on any given day and compare it to the next day (unless you are trying to find out which users are having the problem as well).

This is a faster query. Each row ends up holding both its own datetime value and that of the row before it; after that you can use DATEDIFF ( datepart , startdate , enddate ). I use @DammyVariable and DammyField because, as I remember, there is some problem if the first assignment in the UPDATE statement is not of the form @variable = Field.
SELECT *, Cast(NULL AS DateTime) LastRowDateTime, Cast(NULL As INT) DammyField INTO #T FROM AuditData
GO
CREATE CLUSTERED INDEX IX_T ON #T (AuditRecordID)
GO
DECLARE @LastRowDateTime DateTime
DECLARE @DammyVariable INT
SET @LastRowDateTime = NULL
SET @DammyVariable = 1
UPDATE #T SET
@DammyVariable = DammyField = @DammyVariable
, LastRowDateTime = @LastRowDateTime
, @LastRowDateTime = DateTimeStamp
option (maxdop 1)
