JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions - google-bigquery

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and than calculate some aggregations (count, sum, etc,) of their attributes.
When i run my queries on a small sample of data, everything is fine and the group by runs OK. but when running on larger data set, i get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY do the work...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
here is a sample query based of wikipedia sample data. it shows the frequency of title editing by different contributors. the where condition is just to limit response size, if you remove the "B" we get results, if we add it we got the "use EACH" recomendation.
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true)
group by title
Thanks

I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlight an OVER() limitation: As GROUP BY and ORDER BY, it only works when data fits in one machine, as they are not parallelizable. As the answer to r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)

Related

SQLite alias (AS) not working in the same query

I'm stuck in an (apparently) extremely trivial task that I can't make work , and I really feel no chance than to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records while wanting to make some math operations on the fetched columns. I recall (and re-read some documentation and examples) that the keyword "AS" is going to conveniently rename (alias) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order" ... Query failed. Well, no worries, I had just put the original tab.field name and it worked.
Going further, I have a complex calculation and I want to use its result on the next field, but the same issue popped out: if I alias "ROUND(...)" AS Nearest5, and then try to use this value by multiplying it in the next field, the query will fail.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) thing for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases on the same query or I am using a wrong syntax and I am just too stuck/confused to see the mistake (which I'm sure it has to be really stupid).
Column aliases defined in a SELECT cannot be used:
For other expressions in the same SELECT.
For filtering in the WHERE.
For conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
This is how SQL works. The issue is two things:
The logic order of processing clauses in the query (i.e. how they are compiled). This affects the scoping of parameters.
The order of processing expressions in the SELECT. This is indeterminate. There is no requirement for the ordering of parameters.
For a simple example, what should x refer to in this example?
select x as a, y as x
from t
where x = 2;
By not allowing duplicates, SQL engines do not have to make a choice. The value is always t.x.
You can try with nested queries.
A SELECT query can be nested in another SELECT query within the FROM clause;
multiple queries can be nested, for example by following the following pattern:
SELECT *,[your last Expression] AS LastExp From (SELECT *,[your Middle Expression] AS MidExp FROM (SELECT *,[your first Expression] AS FirstExp FROM yourTables));
Obviously, respecting the order that the expressions of the innermost select query can be used by subsequent select queries:
the first expressions can be used by all other queries, but the other intermediate expressions can only be used by queries that are further upstream.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice FROM (SELECT *, ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5 FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category, Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer, Inventory.price AS PRC, UsedItems.qty_used*X AS To_Order FROM Inventory LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid WHERE UsedItems.pid='1' ORDER BY RefDes, value ASC))

Using analytical Count(distinct) on Vertica is not supported

Having a thorough Google research, it seems that Vertica DB simply does not support count(distinct <col>) over(<partition by>), as it causes:
"ERROR 4249: Only MIN/MAX are allowed to use DISTINCT ... MIN/MAX are allowed to use DISTINCT"
I'm looking for an easy walk-around for this one.
Meanwhile, I'm using joins or nested queries.
For example:
select campaign_id, segment_id, COUNT(DECODE(rank, 1, 1, NULL)) over()
from (select campaign_id, segment_id, row_number() over(partition by segment_id) rank
from cs)
But my query is very long and I need to invent tricks all over the way. Any idea for a better approach?
Thanks!
(Working at HPE? Please implement this, as you did for all common analytical funcitions!)
I had to do something similar nested counting structure for counting distinct values cumulatively, over a date range. It boiled down to a similar gathering up of ROW_NUMBER() = 1 rows, though I used case:
COUNT(CASE WHEN rank = 1 THEN userID END) OVER (...)
It wasn't pretty to look at, but it was mercifully not slow.
I need to invent tricks all over the way
Yeah, I think that just happens when you bump into missing features.

Optimize query to avoid "Resources exceeded during query execution"

I've saved this query to as a View:
SELECT nth(1,CodAlm) as FirstCode,
nth(1,DesAlm) as FirstDescription,
last(CodAlm) as LastCode,
Last(DesAlm) as LastDescription,
max(DATE(DataTic)) as LastVisit,
min(DATE(DataTic)) as FirstVisit,
DATEDIFF(CURRENT_TIMESTAMP(),TIMESTAMP(max(DATE(DataTic)))) as Diffdays,
count(distinct DATE(DataTic)) as countVisits,
count(distinct CodAlm) as NumberCodes,
sum(subtot) as Totalimport,
TarCli,
Last(nomcli) as Name,
Last(cogcli) as LastName,
Last(emailcli) as email,
Last(sexcli) as gender
FROM (SELECT CodAlm, DesAlm, DataTic,SubTot, TarCli, NomCli,CogCli,EmailCli,SexCli FROM [bime.Sales] where Year(DataTic)>2012 AND IsFirstLine="1" ORDER by TarCli, DataTic)
group each by tarcli
But, When I run any query over this view, bigquery returns me Resources exceeded during query execution. I think that ORDER BY is the cause of my problem, but I need this to show my results correctly. How I could rewrite this query correctly? bime.Sales table have 18 Milion rows.
See this question for more info:
What causes "resources exceeded" in BigQuery?
The error is likely caused by the GROUP EACH BY clause, and the most likely reason is that you have a skewed distribution of keys (i.e., one key with a disproportionate number of records). Can you look at your data distribution, and perhaps filter out any skewed keys?
Also note that the ORDER BY is not guaranteed to be preserved by GROUP EACH BY, so you need to apply the ordering after the GROUP EACH BY. You may find it useful to use analytic functions like FIRST_VALUE, NTH_VALUE, and LAST_VALUE with OVER(PARTITION BY tarcli ORDER BY DataTic) instead of GROUP EACH BY if you want to get reliable ordering.
Some things you should consider and try (if you haven't done that so far):
1) Do you really need the "group each by"? Have you tried with just "group by"?
2) Have you tried to use a table instead of a view? You could try to "materialize" the view to check if the resource consumption decreases.
3) Can you shard the data? Perhaps putting each year or month in a different table (using DataTic). That would decrease the size of each table and, therefore, the resource usage.
Cheers!

SQL grouping with "Invalid use of group" error

I'll be upfront, this is a homework question, but I've been stuck on this one for hours and I just was looking for a push in the right direction. First I'll give you the relations and the hw question for background, then I'll explain my question:
Branch (BookCode, BranchNum, OnHand)
HW problem: List the BranchNum for all branches that have at least one book that has at least 10 copies on hand.
My question: I understand that I must take the SUM(OnHand) grouped by BookCode, but how do I then take that and group it by BranchNum? This is logically what I come up with and various versions:
select distinct BranchNum
from Inventory
where sum(OnHand) >= 10
group by BookCode;
but I keep getting an error that says "Invalid use of group function."
Could someone please explain what is wrong here?
UPDATE:
I understand now, I had to use the HAVING statement, the basic form is this:
select distinct (what you want to display)
from (table)
group by
having
Try this one.
SELECT BranchNum
FROM Inventory
GROUP BY BranchNum
HAVING SUM(OnHand) >= 10
You can also find Group By Clause with example here.
Although all comments in the question seem to be valid and add information they all seem to be missing why your query is not working. The reason is simple and is strictly related by the state/phase at which the sum is calculated.
The where clause is the first thing that will get executed. This means it will filter all rows at the beginning. Then the group by will come in effect and will merge all rows that are not specified in the clause and apply the aggregated functions (if any).
So if you try to add an aggregated function to the where clause you're trying to aggregate before data is being grouped by and even filtered. The having clause gets executed after the group by and allows you to filter the aggregated functions, as they have already been calculated.
That's why you can write HAVING SUM(OnHand) >= 10 and you can't write WHERE SUM(OnHand) >= 10.
Hope this helps!

Vague count in sql select statements

I guess this has been asked in the site before but I can't find it.
I've seen in some sites that there is a vague count over the results of a search. For example, here in stackoverflow, when you search a question, it says +5000 results (sometimes), in gmail, when you search by keywords, it says "hundreds" and in google it says aprox X results. Is this just a way to show the user an easy-to-understand-a-huge-number? or this is actually a fast way to count results that can be used in a database [I'm learning Oracle at the moment 10g version]? something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either gmail, or stackoverflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*,ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always have an absolute upper limit, such as 5000 to your query using a ROWNUM <= 5000 filter and then just indicate that there are more than 5000+ results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
Vague count is a buffer which will be displayed promptly. If user wants to see more results then he can request more.
It's a performance facility, after displaying the results the sites like google keep searching for more results.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If count of rows in result will equals to 1001 then total count of records will > 1000.
this question gives some pretty good information
When you do an SQL query you can set a
LIMIT 0, 100
for example and you will only get the first hundred answers. so you can then print to your viewer that there are 100+ answers to their request.
For google I couldn't say if they really know there is more than 27'000'000'000 answer to a request but I believe they really do know. There are some standard request that have results stored and where the update is done in the background.