Parallelizable OVER EACH BY - google-bigquery

I keep hitting this obstacle again and again...
JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions.
Is there a best practice or recommendation for using window functions (OVER()) with very large data sets that cannot be processed on a single node?
Fragmenting my data and running the same query with different filters can work, but it's very limiting, takes a lot of time (and manual labor), and is costly (running the same query on the same data set 30 times instead of once).
Referring to Jeremy's answer below...
It's better, but still doesn't work properly.
If I take my original query sample:
SELECT title,
       COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
       COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
       COUNT(*) AS total
FROM (
  SELECT title, contributor_id,
         LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A,B]') = true
)
GROUP BY title
Now works...
But
SELECT title,
       COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
       COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
       COUNT(*) AS total
FROM (
  SELECT title, contributor_id,
         LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A-Z]') = true
)
GROUP EACH BY title
again gives the Resources Exceeded error...

Window functions can now be executed in a distributed fashion according to the PARTITION BY clause given inside OVER. If you supply a PARTITION BY with your window functions, your data will be processed in parallel, much as JOIN EACH and GROUP EACH BY are processed.
In addition, you can use PARTITION BY on the output of JOIN EACH or GROUP EACH BY without serializing execution. Using the same keys for PARTITION BY as for JOIN EACH or GROUP EACH BY is particularly efficient, because the data will not need to be reshuffled between join/aggregation and window function execution.
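As an illustrative sketch (legacy BigQuery SQL against the public Wikipedia sample; the aliases edits and contributor_rank are mine, not from the original post), a window function partitioned by title, one of the aggregation keys, can run in parallel across titles on the output of GROUP EACH BY:
SELECT title, contributor_id, edits,
       RANK() OVER (PARTITION BY title ORDER BY edits DESC) AS contributor_rank
FROM (
  SELECT title, contributor_id, COUNT(*) AS edits
  FROM [publicdata:samples.wikipedia]
  GROUP EACH BY title, contributor_id
)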

Update: note Jeremy's comment with good news.
OVER() functions always need to run on the whole dataset as the last step of execution (they even run after the LIMIT clauses). Everything needs to fit in the last VM, unless it's parallelizable with a PARTITION clause.
When I run into this type of error, I try to filter out as much data as I can in earlier steps.
For example, this query doesn't run:
SELECT Year, Actor1Name, Actor2Name, c FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) c,
         RANK() OVER(PARTITION BY Year ORDER BY c DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name),
  WHERE Actor1Name IS NOT NULL
    AND Actor2Name IS NOT NULL
  GROUP EACH BY 1, 2, 3
)
WHERE rank = 1
ORDER BY Year
But I can fix it easily with an earlier filter, in this case adding a "HAVING c > 100":
SELECT Year, Actor1Name, Actor2Name, c FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) c,
         RANK() OVER(PARTITION BY Year ORDER BY c DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name),
  WHERE Actor1Name IS NOT NULL
    AND Actor2Name IS NOT NULL
  GROUP EACH BY 1, 2, 3
  HAVING c > 100
)
WHERE rank = 1
ORDER BY Year
So what is happening here: before applying RANK() OVER(), I get rid of many of the combinations that won't matter when I'm looking for the top ones (everything with a count of less than 100 is filtered out).
To give a more specific answer, it's always better if you can supply a query and sample data to review.

Related

How to get first and last record from same group in SQL Server?

I'm a new SQL user and need help.
Let's say I have vehicle number 123 and I've traveled from Region 3 to final destination Region 4. In between, I've visited Regions 1 and 5 as well, but that's not my concern.
A simple example would be as follows (the original table and desired output were posted as screenshots):
Original Table
Desired Output
How can this be done in a SQL query?
You have a sequence number, so you can use some form of aggregation. One method is:
select records,
       max(case when sequence = 1 then fromregion end) as fromregion,
       max(case when sequence = max_sequence then toregion end) as toregion
from (select t.*, max(sequence) over (partition by records) as max_sequence
      from t
     ) t
group by records;
Unfortunately, SQL Server doesn't offer "first()" or "last()" as aggregation functions. But it does support first_value() as a window function. This allows you to do the logic without a subquery:
select distinct records,
       first_value(fromRegion) over (partition by records order by sequence) as fromregion,
       first_value(toRegion) over (partition by records order by sequence desc) as toregion
from t;

Define groups of row by logic

I have a unique scenario for which I can't find a solution, so I thought to ask the experts :)
I have a query that returns a course syllabus, in which each row represents a day of training. You can see in the picture below that there are rest days in the middle of the training.
I can't find a way to group each run of consecutive training days.
Please see the screenshot below detailing the rows and what I want to achieve.
I am using MS-SQL 2014.
Here is a Fiddle with the data I have and the expected results:
SQL Fiddle
The simplest method is a difference of row_number(). The following identifies each consecutive group with a number:
select td.*,
       dense_rank() over (order by dateadd(day, -seqnum, DayOfTraining)) as grpnum
from (select td.*,
             row_number() over (order by DayOfTraining) as seqnum
      from TrainingDays td
     ) td;
The key idea is that subtracting a sequence from consecutive days produces a constant for those days.
Here is the SQL Fiddle.
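To illustrate the idea with hypothetical dates (not the fiddle's data): subtracting the row number from each date maps every consecutive run onto the same constant, and a gap starts a new constant, so dense_rank() numbers the runs:
DayOfTraining   seqnum   dateadd(day, -seqnum, DayOfTraining)
2018-03-01      1        2018-02-28
2018-03-02      2        2018-02-28
2018-03-03      3        2018-02-28
2018-03-06      4        2018-03-02   <- gap, new constant, new group
2018-03-07      5        2018-03-02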
After many hits and trials, this is the closest I could come up with:
http://rextester.com/ECBQ88563
The problem here is that if the last row belongs to another group, it will still be lumped in with the previous group. So in your sample, if you change the last date from 19 to 20, the output will still be the same. Maybe with another condition we can eliminate it. Other than that, this should work.
SELECT DayOfTraining1,
       dense_rank() over (ORDER BY grp_dt) AS grp
FROM
  (SELECT DayOfTraining1,
          min(DayOfTraining) AS grp_dt   -- earliest switch date at or after this day
   FROM
     (SELECT trng.DayOfTraining AS DayOfTraining1,
             dd.DayOfTraining
      FROM trng
      CROSS JOIN
        (SELECT d.*
         FROM
           (SELECT trng.*,
                   lag(DayOfTraining, 1) OVER (
                     ORDER BY DayOfTraining) AS prev_DayOfTraining,
                   lead(DayOfTraining, 1) OVER (
                     ORDER BY DayOfTraining) AS nxt_DayOfTraining,
                   datediff(DAY, lag(DayOfTraining, 1) OVER (
                     ORDER BY DayOfTraining), DayOfTraining) AS ddf
            FROM trng
           ) d
         WHERE d.ddf <> 1                 -- days where the gap from the previous day is not 1
            OR nxt_DayOfTraining IS NULL  -- plus the last row
        ) dd
      WHERE trng.DayOfTraining <= dd.DayOfTraining
     ) t
   GROUP BY DayOfTraining1
  ) t1;
Explanation: the inner query d uses the lag and lead functions to capture the previous and next rows' values. We then take the day difference and capture the dates where the difference is not 1, plus the last row; these are the dates where the group should switch. The derived table dd holds them.
Now cross join this with the main table and use an aggregate function to determine the continuous groups (it took me many hits and trials to achieve this).
Then use the dense_rank function on it to get the group number.

How to count rows in SQL Server 2012?

I am trying to find out whether a person (id = A3) was continuously active in a program for at least five months in a given year (2013). Any suggestion would be appreciated. My data looks as follows:
You simply use group by and a conditional expression:
select id,
       (case when count(ActiveMonthYear) >= 5 then 'YES!' else 'NAW' end)
from table t
where ListOfTheMonths between '201301' and '201312'
group by id;
EDIT:
I suppose "continuously" doesn't mean just any five months. For that, there are various ways; I like the difference-of-row-numbers approach:
select distinct id
from (select t.*,
             (row_number() over (partition by id order by ListOfTheMonths) -
              count(ActiveMonthYear) over (partition by id order by ListOfTheMonths)
             ) as grp
      from table t
      where ListOfTheMonths between '201301' and '201312'
     ) t
where ActiveMonthYear is not null
group by id, grp
having count(*) >= 5;
The difference in the subquery is constant for groups of consecutive active months. This is then used as a grouping. The result is a list of all ids that meet the criteria. You can add a where clause for a particular id (do it in the subquery).
By the way, this is written using select distinct and group by, one of the rare cases where the two are appropriately used together: a single id could have two periods of five months in the same year, and there is no reason to include that person twice in the result set.
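As a hypothetical illustration (made-up months, not the asker's data): row_number() counts every month, while the cumulative count only advances on active (non-null) months, so their difference changes exactly when a month is missed:
ListOfTheMonths  ActiveMonthYear  row_number  cumulative_count  grp
201301           201301           1           1                 0
201302           201302           2           2                 0
201303           NULL             3           2                 1
201304           201304           4           3                 1
201305           201305           5           4                 1
Months 201301-201302 form one island (grp = 0) and 201304-201305 another (grp = 1); the NULL row also lands in grp 1, but it is removed by the where ActiveMonthYear is not null filter.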

Is it possible to calculate the sum of each group in a table without using group by clause

I am trying to find out if there is any way to aggregate sales for each product. I realize I can achieve it either by using a group by clause or by writing a procedure.
example:
Table name: Details
Sales   Product
10      a
20      a
4       b
12      b
3       b
5       c
Is it possible to perform the following query without using a group by clause?
select product,
       sum(sales)
from Details
group by product
having sum(sales) > 20
I realize it is possible using a procedure; could it be done in any other way?
You could do
SELECT product,
(SELECT SUM(sales) FROM details x where x.product = a.product) sales
from Details a;
(and wrap it into another select to simulate the HAVING).
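For instance, a sketch of that wrapping (the threshold 20 comes from the original HAVING; SELECT DISTINCT collapses the repeated per-product rows):
SELECT DISTINCT product, sales
FROM (
  SELECT product,
         (SELECT SUM(sales) FROM Details x WHERE x.product = a.product) sales
  FROM Details a
) s
WHERE sales > 20;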
It's possible to use analytic functions to do the sum calculation, and then wrap that with another query to do your filtering.
See and play with the example here.
select running_sum,
       OwnerUserId
from (
  select id,
         score,
         OwnerUserId,
         sum(score) over (partition by OwnerUserId order by Id) running_sum,
         last_value(id) over (partition by OwnerUserId order by OwnerUserId) last_id
  from Posts
  where OwnerUserId in (2934433, 10583)
) inner_q
where inner_q.id = inner_q.last_id
--and running_sum > 20;
We keep a running sum going over the partition of the owner (the "product" here), and we tally the last id for the same window, which is the id we'll use to pick up the total sum. Wrap it all up with another query to make sure you get the "last id", take the sum, and then do any filtering you want on the result.
This is an extremely round-about way to avoid using GROUP BY, though.
If you don't want nested select statements (which run slower), use CASE:
select sum(case
             when c.qty > 20 then c.qty
             else 0
           end) as mySum
from Sales.CustOrders c

Selecting 5 Most Recent Records Of Each Group

The statement below retrieves the top 2 records within each group in SQL Server. It works correctly; however, as you can see, it doesn't scale: if I wanted to retrieve the top 5 or 10 records instead of just 2, the query would grow very quickly.
How can I convert this query into something that returns the same records, but that I can quickly change to return the top 5 or 10 records within each group instead of just 2? (I.e., I want to just tell it to return the top 5 within each group, rather than having 5 unions as the format below would require.)
Thanks!
WITH tSub
     as (SELECT CustomerID,
                TransactionTypeID,
                Max(EventDate) as EventDate,
                Max(TransactionID) as TransactionID
         FROM Transactions
         WHERE ParentTransactionID is NULL
         Group By CustomerID,
                  TransactionTypeID)
SELECT *
from tSub
UNION
SELECT t.CustomerID,
       t.TransactionTypeID,
       Max(t.EventDate) as EventDate,
       Max(t.TransactionID) as TransactionID
FROM Transactions t
WHERE t.TransactionID NOT IN (SELECT tSub.TransactionID
                              FROM tSub)
  and ParentTransactionID is NULL
Group By CustomerID,
         TransactionTypeID
Use PARTITION BY to solve this type of problem:
select <Columns>
from (select <Columns>,
             ROW_NUMBER() over (PARTITION BY <GroupColumn> order by <OrderColumn>) as rownum
      from YourTable
     ) ut
where ut.rownum <= 5
This partitions the result on the column you want and orders each partition by the EventDate column, then selects the entries having rownum <= 5. You can change the value 5 to get the top n most recent entries of each group.
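Applied to the question's Transactions table, a sketch under the assumption that "most recent" means ordering by EventDate descending (column names taken from the original query):
SELECT CustomerID, TransactionTypeID, EventDate, TransactionID
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY CustomerID, TransactionTypeID
                                ORDER BY EventDate DESC) AS rownum
      FROM Transactions t
      WHERE ParentTransactionID IS NULL
     ) ut
WHERE ut.rownum <= 5;  -- change 5 to any n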