Window functions and allow large results - google-bigquery

The window function documentation states that window functions cannot be used to generate large query results:
https://developers.google.com/bigquery/query-reference#windowfunctions
This statement is repeated in the documentation for large query results:
https://developers.google.com/bigquery/querying-data#largequeryresults
I've created a query that uses window functions and produces a lot of results. The query is included below for reference; it runs over the standard Google Analytics data export to BigQuery.
When I run this query it returns a "Response too large to return" message. Specifying "Allow Large Results" seems to correct the problem. So I'm using both window functions and large results for this query.
This seems to be at odds with the statement that window functions can't be used to generate large query results. Can someone help me understand what this statement means?
SELECT
CONCAT(fullVisitorId, STRING(visitId)) AS fullVisitID,
hits.hitNumber as Sequence,
hits.page.pagePath as PagePath,
LAG(Pagepath, 1) OVER
(PARTITION BY fullVisitID ORDER BY Sequence Asc) AS PrePage,
LEAD(Pagepath, 1) OVER
(PARTITION BY fullVisitID ORDER BY Sequence Asc) AS PostPage
FROM [<<TABLE NAME>>]
WHERE hits.type= 'PAGE'

This is the product improving at a faster pace than documentation.
Initially, window functions were not parallelizable and hence not compatible with "Allow Large Results" (which works by parallelizing the output). However, BigQuery is now capable of parallelizing window function queries when they use the PARTITION BY clause, hence that query now works.
Note that each partition can't be too big for this to work.
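To illustrate why PARTITION BY makes such a query parallelizable, here is a minimal Python sketch. It is not BigQuery's actual implementation; the function name lag_lead_by_partition and the sample rows are hypothetical. It computes LAG and LEAD separately for each partition, so no partition needs data from any other one, and each partition only has to fit in memory on its own.

```python
from collections import defaultdict

def lag_lead_by_partition(rows):
    """rows: iterable of (visit_id, sequence, page_path) tuples.
    Returns a list of (visit_id, sequence, page_path, pre_page, post_page)."""
    # Group rows by the partition key (PARTITION BY fullVisitID).
    partitions = defaultdict(list)
    for visit_id, sequence, page_path in rows:
        partitions[visit_id].append((sequence, page_path))

    result = []
    # Each partition is handled independently of all the others,
    # so this loop could in principle be spread across workers.
    for visit_id in sorted(partitions):
        hits = sorted(partitions[visit_id])  # ORDER BY Sequence ASC
        for i, (sequence, page_path) in enumerate(hits):
            pre_page = hits[i - 1][1] if i > 0 else None               # LAG(PagePath, 1)
            post_page = hits[i + 1][1] if i + 1 < len(hits) else None  # LEAD(PagePath, 1)
            result.append((visit_id, sequence, page_path, pre_page, post_page))
    return result

# Hypothetical sample data: two visits, hits given out of order.
sample_rows = [
    ("A", 2, "/cart"),
    ("A", 1, "/home"),
    ("B", 1, "/home"),
    ("A", 3, "/checkout"),
]
lag_lead = lag_lead_by_partition(sample_rows)
for row in lag_lead:
    print(row)
```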

Related

Bigquery CASE SENSITIVE query with LIMIT clause is not working?

When running a BigQuery query such as:
Select Campaign FROM TABLE WHERE Campaign CONTAINS 'buy' GROUP BY Campaign IGNORE CASE LIMIT 100
The IGNORE CASE clause is not working when used with a LIMIT clause.
Some time ago it did work.
Is this a BigQuery fault, or has something changed?
Thanks a lot
Ramiro
A couple of things here:
Legacy SQL expects IGNORE CASE to appear at the end of the query, so you need to use LIMIT 100 IGNORE CASE instead of IGNORE CASE LIMIT 100.
The BigQuery team advises using standard SQL instead of legacy SQL if you're working on new queries, since it tends to have better error messages, better performance, etc. and it's where we're focusing our efforts going forward. You may also be interested in the migration guide.
If you want to use standard SQL for your query, you could do:
Select LOWER(Campaign) AS Campaign
FROM TABLE
WHERE LOWER(Campaign) LIKE '%buy%'
GROUP BY LOWER(Campaign)
LIMIT 100

When does Google BigQuery's TOP function return approximate results?

I have a table and I want to return the most frequent value of a certain column. Usually, one would do that using the classic GROUP BY ... ORDER BY ... LIMIT. I stumbled upon BigQuery's TOP function and got interested in it, since the documentation states that it is generally faster. However, the documentation also says that it "may only return approximate results". When does this happen, and is the TOP function generally worth using when one needs accurate results?
Full description from the documentation:
TOP is a function that is an alternative to the GROUP BY clause. It is used as simplified syntax for GROUP BY ... ORDER BY ... LIMIT .... Generally, the TOP function performs faster than the full ... GROUP BY ... ORDER BY ... LIMIT ... query, but may only return approximate results.
The following might fit better as a comment, but it is too long, so I am posting it as an answer.
So far, in my experience, TOP is just a simplified alternative to GROUP BY, and it is applicable only in simple scenarios: a query that uses the TOP() function can return only two fields, the TOP field and the COUNT(*) value.
That said, I have not seen any discrepancy in the counts, while I do see that it runs faster.
In a comparison that I ran against a table with 2.5B rows, the counts were exactly the same and the run time was 15% faster.
At the same time, if you run similar queries and check the Query Plan Explanation, you will see a totally different execution pattern, which might lead to different results, but I was not able to find such a case.

Multiple subtotals - Rollup order of fields

I am trying to run a query that aggregates data, groups the results by several different fields, and extracts all relevant "SubTotal" permutations (similar to CUBE() in MSSQL).
When using GROUP BY ROLLUP(), I get only the permutations that follow the order of the GROUP BY fields in the ROLLUP function.
For example, the query below (which runs on a public dataset) returns subtotals by year, by year and month, and by year, month and medallion, but it doesn't subtotal by medallion alone.
SELECT
trip_year,
trip_month,
medallion,
SUM(trip_count) AS Sum_trip_count
FROM
[nyc-tlc:yellow.Trips_ByMonth_ByMedallion]
WHERE
medallion IN ("2R76", "8J82", "3B85", "4L79", "5D59", "6H75", "7P60", "8V48", "1H12", "2C69", "2F38", "5Y86", "5j90", "8A75", "8V41", "9J24", "9J55", "1E13", "1J82")
GROUP BY
ROLLUP(trip_year,
trip_month,
medallion)
My question is:
What should I do in order to get all the different permutations of "Sub Totals" in a single query result?
Already tried: a union with a similar query that uses a different field order. It works, but it is not elegant (it would require too many unions).
Thanks
You are correct on both counts. In BigQuery, ROLLUP respects the hierarchy, treating the listed fields as a strictly ordered list. Their order is not changed during aggregation.
The CUBE aggregate commonly found in other SQL environments is unordered: it aggregates every possible subset of its listed fields. At this time, CUBE has not been implemented in BigQuery. The workaround you suggest is also what I would suggest: UNION the result sets from ROLLUP over each permutation of the fields. Although not ideal, this should give you the same results.
In short, a UNION of several queries with different permutations of the ROLLUP fields is the only way to achieve this at the moment. The downsides, as you state, are that this may be difficult to maintain and can make the queries more expensive.
If you would like to see CUBE implemented in BigQuery, I strongly encourage you to file a feature request on the BigQuery public issue tracker. Be sure to include a thorough use case in this request.
UPDATE: To support the feature request filed by the OP, please star it and you'll receive notifications with updates.
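To keep such a union maintainable, the query text can be generated rather than written by hand. The following is a rough Python sketch, not something from the BigQuery documentation. It is a variation on the workaround above: instead of unioning ROLLUPs over permutations, it emits one plain GROUP BY query per subset of the fields, which avoids the duplicate subtotal rows that overlapping ROLLUPs would produce. The helper name build_cube_query is hypothetical, the output uses standard SQL UNION ALL syntax (legacy SQL unions tables differently), the WHERE clause from the question is omitted for brevity, and the generated SQL has not been run against BigQuery.

```python
from itertools import combinations

def build_cube_query(table, group_fields, aggregate_expr):
    """Return SQL text that unions one GROUP BY query per subset of group_fields."""
    selects = []
    # Walk subsets from the most detailed (all fields) down to the grand total (no fields).
    for size in range(len(group_fields), -1, -1):
        for subset in combinations(group_fields, size):
            # Fields left out of this subset are emitted as NULL so that
            # every branch of the union has the same column list.
            columns = [
                field if field in subset else "NULL AS " + field
                for field in group_fields
            ]
            sql = "SELECT " + ", ".join(columns) + ", " + aggregate_expr + " FROM " + table
            if subset:
                sql += " GROUP BY " + ", ".join(subset)
            selects.append(sql)
    return "\nUNION ALL\n".join(selects)

cube_sql = build_cube_query(
    "<<TABLE NAME>>",  # placeholder, as in the question
    ["trip_year", "trip_month", "medallion"],
    "SUM(trip_count) AS Sum_trip_count",
)
print(cube_sql)
```

With three fields this produces 2^3 = 8 queries joined by UNION ALL, including the medallion-only subtotal that ROLLUP(trip_year, trip_month, medallion) leaves out.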

Feature not implemented: WINDOW/ORDER BY

I am using an embedded Apache Derby database and execute the following query:
SELECT
someUniqueValue,
row_number() over(ORDER BY someUniqueValue) as ROWID
FROM
myTable;
someUniqueValue is a varchar.
I am getting the Exception:
java.sql.SQLFeatureNotSupportedException: Feature not implemented: WINDOW/ORDER BY
If I change the row_number() line in my query to:
row_number() over() as ROWID
The query runs fine (although the result is useless for me).
The Derby documentation states this is supported. What am I doing wrong?
The link you posted is just a draft to specify how the feature should be implemented.
If you scroll down a bit you find:
An implementation of the ROW_NUMBER() window function is included in Derby starting with the 10.4.1.3 release. Limitations and usage description may be found in the Derby Reference Manual
When you then look at the Derby manual (your link is not the manual) at http://db.apache.org/derby/docs/10.10/ref/rreffuncrownumber.html, you'll find a list of limitations:
Derby does not currently allow the named or unnamed window specification to be specified in the OVER() clause, but requires an empty parenthesis. This means the function is evaluated over the entire result set.
The ROW_NUMBER function cannot currently be used in a WHERE clause.
Derby does not currently support ORDER BY in subqueries, so there is currently no way to guarantee the order of rows in the SELECT subquery. An optimizer override can be used to force the optimizer to use an index ordered on the desired column(s) if ordering is a firm requirement.

ORDER BY in a Sql Server 2008 view

We have a view in our database which has an ORDER BY in it.
Now, I realize views generally don't order, because different people may use it for different things, and want it differently ordered. This view however is used for a VERY SPECIFIC use-case which demands a certain order. (It is team standings for a soccer league.)
The database is SQL Server 2008 Express, v.10.0.1763.0 on a Windows Server 2003 R2 box.
The view is defined as such:
CREATE VIEW season.CurrentStandingsOrdered
AS
SELECT TOP 100 PERCENT *, season.GetRanking(TEAMID) RANKING
FROM season.CurrentStandings
ORDER BY
GENDER, TEAMYEAR, CODE, POINTS DESC,
FORFEITS, GOALS_AGAINST, GOALS_FOR DESC,
DIFFERENTIAL, RANKING
It returns:
GENDER, TEAMYEAR, CODE, TEAMID, CLUB, NAME,
WINS, LOSSES, TIES, GOALS_FOR, GOALS_AGAINST,
DIFFERENTIAL, POINTS, FORFEITS, RANKING
Now, when I run a SELECT against the view, it orders the results by GENDER, TEAMYEAR, CODE, TEAMID. Notice that it is ordering by TEAMID instead of POINTS as the order by clause specifies.
However, if I copy the SQL statement and run it exactly as is in a new query window, it orders correctly as specified by the ORDER BY clause.
The order of rows returned by a view with an ORDER BY clause is never guaranteed. If you need a specific row order, you must specify it in the query that selects from the view.
See the note at the top of this Books Online entry.
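As a rough illustration of that pattern, the sketch below defines a view without any ORDER BY and puts the ordering in the query that reads from it. It uses SQLite through Python's sqlite3 module rather than SQL Server, purely so that it is self-contained; the table, the view, and the sample teams are hypothetical and much simpler than the view in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE standings (team TEXT, points INTEGER, forfeits INTEGER);
    INSERT INTO standings VALUES
        ('Lions', 12, 0),
        ('Tigers', 18, 1),
        ('Bears', 15, 0);
    -- The view only selects columns; it makes no promise about row order.
    CREATE VIEW current_standings AS
        SELECT team, points, forfeits FROM standings;
""")

# The ordering lives in the outer query, which is the only place it is guaranteed.
ordered = conn.execute(
    "SELECT team, points FROM current_standings ORDER BY points DESC, forfeits"
).fetchall()
print(ordered)
conn.close()
```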
SQL Server 2005 ignores TOP 100 PERCENT by design.
Try TOP 2000000000 instead.
Now, I'll try to find a reference... I was at a seminar presented by Itzik Ben-Gan, who mentioned it.
Found some...
Kimberly L. Tripp
"TOP 100 Percent ORDER BY Considered Harmful"
In this particular case, the optimizer recognizes that TOP 100 PERCENT qualifies all rows and does not need to be computed at all.
Just use:
"TOP (99) PERCENT"
or
"TOP (a number thousands of times larger than your row count, like 24682468123)"
It works! Just try it.
In SQL Server 2008, ORDER BY is ignored in views that use TOP 100 PERCENT. In prior versions of SQL Server, ORDER BY was only allowed if TOP 100 PERCENT was used, but a perfect order was never guaranteed. However, many assumed a perfect order was guaranteed. I infer that Microsoft does not want to mislead programmers and DBAs into believing there is a guaranteed order using this technique.
An excellent comparative demonstration of this inaccuracy can be found here:
http://blog.sqlauthority.com/2009/11/24/sql-server-interesting-observation-top-100-percent-and-order-by
Oops, I just noticed that this was already answered. But checking out the comparative demonstration is worth a look anyway.
Microsoft has fixed this. You have to patch your SQL Server:
http://support.microsoft.com/kb/926292
I found an alternative solution.
My initial plan was to create a 'sort_order' column that would prevent users from having to perform a complex sort.
I used a windowed function ROW_NUMBER. In the ORDER BY clause, I specified the default sort order that I needed (just as it would have been in the ORDER BY of a SELECT statement).
I get several positive outcomes:
By default, the data is getting returned in the default sort order I originally intended (this is probably due to the windowed function having to sort the data prior to assigning the sort_order value)
Other users can sort the data in alternative ways if they choose to
The sort_order column is there for a very specific sort need, making it easier for users to re-sort the data should whatever tool they use rearrange the rowset.
Note: In my specific application, users are accessing the view via Excel 2010, and by default the data is presented to the user as I had hoped without further sorting needed.
Hope this helps those with a similar problem.
Cheers,
Ryan
Run a Profiler trace on your database and see the query that's actually being run when you query your view.
You also might want to consider using a stored procedure to return the data from your view, ordered correctly for your specific use case.