Access 10th through 70th element in STRUCT - google-bigquery

I have 3 fields: username, tracking_id, timestamp. One user will have multiple rows (some have more, some have less) with different tracking ids and timestamps for each action he has taken on my website. I want to group by the username and get the tracking ids of that user's 10th through 70th action. I use standard SQL on BigQuery.
The first problem is that I can't find syntax to access a range in the STRUCT (only a single row, or a limit to get the first/last 70 rows, for example). Then, I can imagine that after managing to access a range, there could be an issue with the index being out of bounds, because some users might not have 70 or more actions.
SELECT
username,
ARRAY_AGG(STRUCT(tracking_id,
timestamp)
ORDER BY
timestamp
)[OFFSET (9 to 69)] #??????
FROM
table
The result should be a table with the same 3 fields: username, tracking_id, timestamp, but instead of containing ALL the user's rows, it should only contain each user's 10th to 70th row.

Below is for BigQuery Standard SQL
#standardSQL
SELECT username,
  ARRAY_AGG(STRUCT(tracking_id, `timestamp`) ORDER BY `timestamp`) AS selected_actions
FROM (
  SELECT * EXCEPT(pos) FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY username ORDER BY `timestamp`) pos
    FROM `project.dataset.table`
  )
  WHERE pos BETWEEN 10 AND 70
)
GROUP BY username
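To sanity-check the row-numbering idea without a BigQuery project, here is a minimal sketch that runs the same ROW_NUMBER() filter in SQLite from Python. The table name `actions`, the column `ts`, the helper `rows_in_range` and the sample rows are all made up for illustration, and it assumes an SQLite build with window functions (3.25 or newer). It uses a small range (2nd through 4th) so the data stays readable.

```python
import sqlite3


def rows_in_range(conn, first, last):
    """Return every user's first-th through last-th action (1-based, ordered by ts)."""
    query = """
        SELECT username, tracking_id, ts
        FROM (
            SELECT username, tracking_id, ts,
                   ROW_NUMBER() OVER (PARTITION BY username ORDER BY ts) AS pos
            FROM actions
        )
        WHERE pos BETWEEN ? AND ?
        ORDER BY username, ts
    """
    return conn.execute(query, (first, last)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (username TEXT, tracking_id TEXT, ts INTEGER)")

# Hypothetical data: alice has 5 actions, bob has only 2.
sample = [("alice", f"a{i}", i) for i in range(1, 6)]
sample += [("bob", f"b{i}", i) for i in range(1, 3)]
conn.executemany("INSERT INTO actions VALUES (?, ?, ?)", sample)

# Ask for the 2nd through 4th action of every user.
result = rows_in_range(conn, 2, 4)
for row in result:
    print(row)
```

Because the range is applied as a filter on the row number rather than as an array index, a user with fewer rows than the upper bound simply contributes fewer rows, so there is no out-of-bounds case to handle.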

Related

Group by question in SQL Server, migration from MySQL

I failed to find a solution to my problem and would love your help.
~~ Post has been edited to have only one question ~~
Group by one column while selecting multiple columns.
In MySQL you can group by whatever you want and it will still select all columns. For example, say I want to select the newest 100 transactions, grouped by Email (only the last transaction of each email).
In MySQL I would do this:
SELECT * FROM db.transactionlog
group by Email
order by TransactionLogId desc
LIMIT 100;
In SQL Server it's not possible. Googling a bit suggested wrapping each column I want in an aggregate as a hack. Couldn't that cause a mix of values (mixing columns between the grouped rows)?
For example:
SELECT TOP(100)
Email,
MAX(ResultCode) as 'ResultCode',
MAX(Amount) as 'Amount',
MAX(TransactionLogId) as 'TransactionLogId'
FROM [db].[dbo].[transactionlog]
group by Email
order by TransactionLogId desc
TransactionLogId is the primary key, which is an identity column; I order by it to get the last inserted row.
I just want to know that the ResultCode and Amount I get from such a query will be those of the last inserted row, and not the highest of the grouped rows or whatever.
~Edit~
Sample data -
row1:
Email : test#email.com
ResultCode : 100
Amount : 27
TransactionLogId : 1
row2:
Email: test#email.com
ResultCode:50
Amount: 10
TransactionLogId: 2
Using the sample data above, my goal is to get the row details of
TransactionLogId = 2.
But what actually happens is that I get mixed values of the two: I do get TransactionLogId = 2, but the ResultCode and Amount of the first row.
How do I avoid that?
Thanks.
You should first find out which is the latest transaction log by each email, then join back against the same table to retrieve the full record:
;WITH MaxTransactionByEmail AS
(
    SELECT
        Email,
        MAX(TransactionLogId) as LatestTransactionLogId
    FROM
        [db].[dbo].[transactionlog]
    group by
        Email
)
SELECT
    T.*
FROM
    [db].[dbo].[transactionlog] AS T
    INNER JOIN MaxTransactionByEmail AS M ON T.TransactionLogId = M.LatestTransactionLogId
You are currently getting mixed results because aggregate functions like MAX() consider all rows that correspond to a particular value of Email, and each aggregate is evaluated independently for its own column. So the MAX() value for the Amount column between values 10 and 27 is 27, even though that row has the lower transaction log id.
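The mixing is easy to reproduce with the two sample rows from the question. The sketch below is only an illustration, run in SQLite from Python rather than SQL Server; the variable names `mixed` and `latest` are mine. It compares the per-column MAX() query with the join-back approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactionlog ("
    "TransactionLogId INTEGER PRIMARY KEY, Email TEXT, ResultCode INTEGER, Amount INTEGER)"
)
# The two sample rows from the question.
conn.executemany(
    "INSERT INTO transactionlog VALUES (?, ?, ?, ?)",
    [(1, "test#email.com", 100, 27), (2, "test#email.com", 50, 10)],
)

# Each MAX() is computed independently, so values from different rows get combined.
mixed = conn.execute(
    """
    SELECT Email, MAX(ResultCode), MAX(Amount), MAX(TransactionLogId)
    FROM transactionlog
    GROUP BY Email
    """
).fetchall()

# Find the latest id per email first, then join back to fetch that whole row.
latest = conn.execute(
    """
    SELECT T.TransactionLogId, T.Email, T.ResultCode, T.Amount
    FROM transactionlog AS T
    INNER JOIN (
        SELECT Email, MAX(TransactionLogId) AS LatestTransactionLogId
        FROM transactionlog
        GROUP BY Email
    ) AS M ON T.TransactionLogId = M.LatestTransactionLogId
    """
).fetchall()

print("mixed: ", mixed)
print("latest:", latest)
```

The first query pairs TransactionLogId 2 with the ResultCode and Amount of row 1, while the second returns row 2 intact.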
Another solution is using a ROW_NUMBER() window function to get a row-ranking by each Email, then just picking the first row:
;WITH TransactionsRanking AS
(
    SELECT
        T.*,
        MostRecentTransactionLogRanking = ROW_NUMBER() OVER (
            PARTITION BY
                T.Email -- Start a different ranking for each different value of Email
            ORDER BY
                T.TransactionLogId DESC) -- Order the rows by the TransactionLogID descending
    FROM
        [db].[dbo].[transactionlog] AS T
)
SELECT
    T.*
FROM
    TransactionsRanking AS T
WHERE
    T.MostRecentTransactionLogRanking = 1
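As a rough sketch of how the ranking approach could also cover the original "newest 100, one per Email" goal, the example below keeps only rank 1 per email and then limits the result. It runs in SQLite from Python, so it uses LIMIT where SQL Server would use TOP; the helper `latest_per_email`, the alias `rn` and the sample rows are hypothetical, and an SQLite build with window functions (3.25 or newer) is assumed.

```python
import sqlite3


def latest_per_email(conn, limit):
    """Return the newest transaction of each email, newest first, capped at `limit` rows."""
    query = """
        WITH TransactionsRanking AS (
            SELECT T.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY T.Email
                       ORDER BY T.TransactionLogId DESC
                   ) AS rn
            FROM transactionlog AS T
        )
        SELECT TransactionLogId, Email, ResultCode, Amount
        FROM TransactionsRanking
        WHERE rn = 1
        ORDER BY TransactionLogId DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactionlog ("
    "TransactionLogId INTEGER PRIMARY KEY, Email TEXT, ResultCode INTEGER, Amount INTEGER)"
)
# Hypothetical rows: three emails, two of them with more than one transaction.
conn.executemany(
    "INSERT INTO transactionlog VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", 100, 27),
        (2, "a@example.com", 50, 10),
        (3, "b@example.com", 0, 5),
        (4, "c@example.com", 7, 1),
        (5, "b@example.com", 9, 99),
    ],
)

newest_two = latest_per_email(conn, 2)
print(newest_two)
```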

How do I get the average date interval of a column in SQL?

I have a table of user interactions on a web site and I need to calculate the average time between interactions of each user. To make it simpler to understand, here are some records of the table:
Where the first column is the user id and the second is the interaction time. The result that I need is the average time between interactions of each user. Example:
The average interaction interval of user 12345 is 1 day
I've already tried to use window functions, but I couldn't get the average because PostgreSQL doesn't let me use GROUP BY or AVG on window functions. I could get the intervals using the following command, but couldn't group them by user id.
SELECT INTERACTION_DATE - LAG(INTERACTION_DATE ) OVER (ORDER BY INTERACTION_DATE )
So, I decided to create my own custom function and after that, create a custom aggregate function to do this, and use this function on a group by clause:
CREATE OR REPLACE FUNCTION DATE_INTERVAL(TIMESTAMP)
RETURNS TABLE (USER_INTERVALS INTERVAL)
AS $$
SELECT $1 - LAG($1) OVER (ORDER BY $1)
$$
LANGUAGE SQL
IMMUTABLE;
But this function only returns several rows with a single column containing a null value.
Is there a better way to do this?
You need to first calculate the difference between the interactions for each row (and user), then you can calculate the average on that:
select user_id, avg(interaction_time)
from (
  select user_id,
         interaction_date - lag(interaction_date) over (partition by user_id order by interaction_date) as interaction_time
  from the_table
) t
group by user_id;
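Here is a small sketch of the same two-step idea (per-row gap with LAG, then AVG per user) that can be run locally. It uses SQLite from Python instead of PostgreSQL, so the gap is computed in days with julianday() rather than as an INTERVAL; that substitution, the alias `gap_days` and the sample rows are my assumptions, and SQLite 3.25 or newer is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (user_id INTEGER, interaction_date TEXT)")
# Hypothetical rows: user 12345 interacts once a day, user 11111 only once.
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?)",
    [
        (12345, "2019-01-01"),
        (12345, "2019-01-02"),
        (12345, "2019-01-03"),
        (67890, "2019-01-01"),
        (67890, "2019-01-05"),
        (11111, "2019-01-01"),
    ],
)

query = """
    SELECT user_id, AVG(gap_days)
    FROM (
        SELECT user_id,
               julianday(interaction_date)
                 - LAG(julianday(interaction_date)) OVER (
                       PARTITION BY user_id ORDER BY interaction_date
                   ) AS gap_days
        FROM the_table
    ) AS t
    GROUP BY user_id
    ORDER BY user_id
"""
# Map each user to the average gap in days (None if the user has a single row).
averages = dict(conn.execute(query).fetchall())
print(averages)
```

The first row of each user has no predecessor, so its gap is NULL; AVG skips NULLs, which is why a user with a single interaction ends up with no average at all.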
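Encapsulate your first query in a subquery, then compute the average:
SELECT AVG(InteractionTime) FROM (
  SELECT INTERACTION_DATE - LAG(INTERACTION_DATE) OVER (ORDER BY INTERACTION_DATE) AS InteractionTime
  FROM the_table -- table name assumed; the question's query omitted its FROM clause
) AS t -- older PostgreSQL versions require an alias for a subquery in FROM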

How to distinguish rows in a database table on the basis of two or more columns while returning all columns in sql server

I want to pick distinct rows based on the values of two or more columns of a table, while still returning all columns of the table.
Ex: I have this table
DB Table
I want my result to be filtered on the basis of type and Number only. In the table above, type and Number are the same for the first and second rows, so one of them should be suppressed in the result:
txn  item  Discrip  Category  type  Number            Mode
60   2     Loyalty  L         6174  XXXXXXX1390       0
60   4     Visa     C         1600  XXXXXXXXXXXX4108  1
I have tried with a subquery but have been unsuccessful so far. Please suggest what to try.
Thanks
You can do what you want with row_number():
select t.*
from (select t.*,
             row_number() over (partition by type, number order by item) as seqnum
      from t
     ) t
where seqnum = 1;
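A minimal sketch of this in action, run in SQLite from Python rather than SQL Server. The original table is not shown in the post, so the three input rows below are hypothetical: I assumed items 2 and 3 share the same type and Number, which reproduces the desired two-row result. SQLite 3.25 or newer is assumed for row_number().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (txn INTEGER, item INTEGER, discrip TEXT, category TEXT, "
    "type INTEGER, number TEXT, mode INTEGER)"
)
# Hypothetical input: items 2 and 3 have the same (type, number) pair.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (60, 2, "Loyalty", "L", 6174, "XXXXXXX1390", 0),
        (60, 3, "Loyalty", "L", 6174, "XXXXXXX1390", 0),
        (60, 4, "Visa", "C", 1600, "XXXXXXXXXXXX4108", 1),
    ],
)

# Keep only the first row (lowest item) of every (type, number) pair.
distinct_rows = conn.execute(
    """
    SELECT txn, item, discrip, category, type, number, mode
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY type, number ORDER BY item) AS seqnum
        FROM t
    )
    WHERE seqnum = 1
    ORDER BY item
    """
).fetchall()

for row in distinct_rows:
    print(row)
```

The ORDER BY inside the window decides which duplicate survives; ordering by item keeps the lowest item of each pair.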

How to implement "group-by" sampling in Hive?

Given a Hive table:
create table mock
(user string,
url string
);
How to sample a certain percentage of url (say 50%) or certain number of url for each user?
There is a built-in query to extract samples from a table.
SELECT * FROM mock TABLESAMPLE(50 PERCENT)
Here is an alternative solution using row_number(). First, number the rows for each user:
with numbered as (
SELECT user, url, row_number() OVER (PARTITION BY user ORDER BY user) as rn FROM mock
)
Then select either the odd or the even rows using pmod to get a 50% sample:
SELECT user, url FROM numbered where pmod(rn,2) = 0
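To see what the every-other-row trick returns per user, here is a small sketch run in SQLite from Python instead of Hive. SQLite has no pmod, so it uses the `%` operator, which is equivalent for positive row numbers. I also order by url inside the window so the output is reproducible; that ordering and the sample rows are my assumptions, and it makes the pick systematic rather than random. SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mock (user TEXT, url TEXT)")
# Hypothetical rows: u1 has four urls, u2 has two.
conn.executemany(
    "INSERT INTO mock VALUES (?, ?)",
    [
        ("u1", "/a"), ("u1", "/b"), ("u1", "/c"), ("u1", "/d"),
        ("u2", "/x"), ("u2", "/y"),
    ],
)

# Number the rows within each user, then keep the even-numbered ones.
sampled = conn.execute(
    """
    WITH numbered AS (
        SELECT user, url,
               ROW_NUMBER() OVER (PARTITION BY user ORDER BY url) AS rn
        FROM mock
    )
    SELECT user, url FROM numbered WHERE rn % 2 = 0
    ORDER BY user, url
    """
).fetchall()

print(sampled)
```

Each user keeps half of their rows (rounded down), which is the per-user behaviour the question asks for.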

Group by multiple columns, get group total count and specific column from last two rows in each group

I have an SQL Server table with the following columns:
Notification
===================
Id (int)
UserId (int)
Area (int)
Action (int)
ObjectId (int)
RelatedUserLink (nvarchar(100))
Created (datetime)
The goal is to create a query that groups notifications of the same Area, Action and ObjectId for a specific user (UserId) and
returns a single row including total count for the group and also the value of a specific column for the last two rows.
The query will only be executed for one user (UserId) each time.
The problem is that I need the column RelatedUserLink for the last two records (based on Created) of each group. The RelatedUserLink should be distinct for each group (if there are more than one, only the latest should be included and counted).
The result for each group should be represented in one result-row. It doesn't matter if the two RelatedUserLink-values are concatenated in the same column or separated in two columns as "RelatedUserLink1" and "RelatedUserLink2". If the group only consists of one result the second RelatedUserLink should simply be null.
Desired result:
UserId | Area | Action | ObjectId | RelatedUserLink1 | RelatedUserLink2 | Created (latest in group) | Count
10     | 1    | 2      | 100      | "userlink1"      | "userlink2"      | 2016-04-08                | 20
10     | 1    | 3      | 200      | "userlink1"      | "userlink2"      | 2016-04-09                | 4
The table will be quite large, 100.000-200.000 rows.
(The related User-table has approx 10.000 rows)
I also have the option to get all notifications for a user and then do the grouping in code but I hope there is a faster way by letting SQL server handle it!?
Any help is much appreciated!
Thanks!
I would attempt this by using the following WITH clause:
WITH RUL AS (
    select
        UserId,
        Area,
        Action,
        ObjectId,
        RelatedUserLink as RelatedUserLink1,
        LAG(RelatedUserLink) OVER (PARTITION BY UserId, Area, Action, ObjectId ORDER BY Created) as RelatedUserLink2,
        ROW_NUMBER() OVER (PARTITION BY UserId, Area, Action, ObjectId ORDER BY Created DESC) latest_to_earliest,
        MAX(Created) OVER (PARTITION BY UserId, Area, Action, ObjectId) as Created,
        COUNT(*) OVER (PARTITION BY UserId, Area, Action, ObjectId) as Count
    from
        Notification
    where UserId = 10
)
select
    UserId,
    Area,
    Action,
    ObjectId,
    RelatedUserLink1,
    RelatedUserLink2,
    Created,
    Count
from
    RUL
where
    latest_to_earliest = 1;
The LAG function will always hold the previous RelatedUserLink value (unless there is only one value in the group, which means it will be NULL). The ROW_NUMBER counts down through the group in Created order until it reaches 1 at the last row. The MAX and COUNT functions keep the maximum and count values for the entire group on each row, effectively the same as a GROUP BY, eliminating the need to perform a separate query and join back.
The SELECT outside the WITH clause just picks up the final row for each group, which should hold the last RelatedUserLink value in RelatedUserLink1 and the penultimate (or NULL) RelatedUserLink value in RelatedUserLink2.
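Here is a minimal sketch of that query on a handful of rows, run in SQLite from Python rather than SQL Server. The helper `grouped_notifications`, the aliases `LatestCreated` and `GroupCount`, and the sample rows are hypothetical, and SQLite 3.25 or newer is assumed. Note that it does not attempt the "distinct RelatedUserLink" requirement from the question; it only illustrates the LAG / ROW_NUMBER / COUNT combination.

```python
import sqlite3


def grouped_notifications(conn, user_id):
    """One row per (Area, Action, ObjectId) group with its last two links and row count."""
    query = """
        WITH RUL AS (
            SELECT UserId, Area, "Action", ObjectId,
                   RelatedUserLink AS RelatedUserLink1,
                   LAG(RelatedUserLink) OVER (
                       PARTITION BY UserId, Area, "Action", ObjectId ORDER BY Created
                   ) AS RelatedUserLink2,
                   ROW_NUMBER() OVER (
                       PARTITION BY UserId, Area, "Action", ObjectId ORDER BY Created DESC
                   ) AS latest_to_earliest,
                   MAX(Created) OVER (
                       PARTITION BY UserId, Area, "Action", ObjectId
                   ) AS LatestCreated,
                   COUNT(*) OVER (
                       PARTITION BY UserId, Area, "Action", ObjectId
                   ) AS GroupCount
            FROM Notification
            WHERE UserId = ?
        )
        SELECT UserId, Area, "Action", ObjectId,
               RelatedUserLink1, RelatedUserLink2, LatestCreated, GroupCount
        FROM RUL
        WHERE latest_to_earliest = 1
        ORDER BY Area, "Action", ObjectId
    """
    return conn.execute(query, (user_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Notification (Id INTEGER PRIMARY KEY, UserId INTEGER, Area INTEGER, '
    '"Action" INTEGER, ObjectId INTEGER, RelatedUserLink TEXT, Created TEXT)'
)
# Hypothetical rows: one group of three, one group of one, and another user's row.
conn.executemany(
    'INSERT INTO Notification (UserId, Area, "Action", ObjectId, RelatedUserLink, Created) '
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (10, 1, 2, 100, "linkA", "2016-04-06"),
        (10, 1, 2, 100, "linkB", "2016-04-07"),
        (10, 1, 2, 100, "linkC", "2016-04-08"),
        (10, 1, 3, 200, "linkD", "2016-04-09"),
        (11, 1, 2, 100, "linkE", "2016-04-10"),
    ],
)

groups = grouped_notifications(conn, 10)
for row in groups:
    print(row)
```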