Is there a way to figure out in advance (not by trial and error) whether a specific query should use GROUP BY or GROUP EACH BY?
We have seen that once cardinality reaches roughly 60-70% we are asked to use GROUP EACH BY. This is hard to predict because we generate the SQL programmatically.
Whether to use EACH doesn't depend on the query, but on the data. Is there a small number of unique values for the group expression? Use GROUP BY. Are there many? Use GROUP EACH BY.
The best strategy is to use GROUP BY until you get an "over limits error".
To go deeper into the "why?", you can look at the Dremel paper that started it all. Basically GROUP BY runs in the mixers, while GROUP EACH BY gets pushed to the shards.
For other insights, check jcondit's answers at Resources Exceeded during query execution.
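As a rough illustration (the table and column names below are made up, not from the question), the two forms look like this in legacy BigQuery SQL:

-- Few distinct values (e.g. country codes): plain GROUP BY is enough
SELECT country, COUNT(*) AS visits
FROM [mydataset.page_views]
GROUP BY country

-- Many distinct values (e.g. one per user): push the aggregation to the shards
SELECT user_id, COUNT(*) AS visits
FROM [mydataset.page_views]
GROUP EACH BY user_id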
In the SELECT clause I have SELECT isnull(client,'')+'-'+isnull(supplier,''). Is it OK to write GROUP BY client, supplier, or must I write GROUP BY isnull(client,'')+'-'+isnull(supplier,'')?
It's better to GROUP BY client, supplier. That way, if there is a suitable index available, it can be used. The other form also works, but it would require a full table scan in every case.
Just list the column names. You can verify this by executing both versions and looking at the execution plans to confirm index usage.
GROUP BY client, supplier
You can directly say group by client, supplier
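To illustrate with a hypothetical orders table and amount column (assumptions, not from the question), the two forms look like this; the first groups on plain columns and lets an index on (client, supplier) help:

-- Preferred: group on the raw columns
SELECT isnull(client, '') + '-' + isnull(supplier, '') AS client_supplier,
       SUM(amount) AS total
FROM orders
GROUP BY client, supplier

-- Also works: group on the concatenated expression, but it forces a full scan
SELECT isnull(client, '') + '-' + isnull(supplier, '') AS client_supplier,
       SUM(amount) AS total
FROM orders
GROUP BY isnull(client, '') + '-' + isnull(supplier, '')

Note that the two forms can differ in edge cases: grouping on the concatenation collapses a NULL client and an empty-string client into the same group, while grouping on the raw columns keeps them separate.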
When making a BigQuery query, for example:
Select Campaign FROM TABLE WHERE Campaign CONTAINS 'buy' GROUP BY Campaign IGNORE CASE LIMIT 100
The IGNORE CASE clause is not working when used with the LIMIT clause.
Some time ago it did work.
Is this a BigQuery fault, or has something changed?
Thanks a lot
Ramiro
A couple of things here:
Legacy SQL expects IGNORE CASE to appear at the end of the query, so you need to use LIMIT 100 IGNORE CASE instead of IGNORE CASE LIMIT 100.
The BigQuery team advises using standard SQL instead of legacy SQL if you're working on new queries, since it tends to have better error messages, better performance, etc. and it's where we're focusing our efforts going forward. You may also be interested in the migration guide.
If you want to use standard SQL for your query, you could do:
SELECT LOWER(Campaign) AS Campaign
FROM TABLE
WHERE LOWER(Campaign) LIKE '%buy%'
GROUP BY LOWER(Campaign)
LIMIT 100
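For completeness, if you stay on legacy SQL, the original query with the two clauses reordered as described above (an untested sketch) would be:

Select Campaign FROM TABLE WHERE Campaign CONTAINS 'buy' GROUP BY Campaign LIMIT 100 IGNORE CASE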
I am trying to run a query that aggregates data, groups the results by several different fields, and extracts all relevant "subtotal" permutations (similar to CUBE() in MSSQL).
When using GROUP BY ROLLUP(), I only get the permutations that follow the order of the fields listed in the ROLLUP function.
For example, the query below (which runs on a public dataset) returns subtotals by year, by year and month, and by year, month and medallion, but it does not return subtotals by medallion alone.
SELECT
trip_year,
trip_month,
medallion,
SUM(trip_count) AS Sum_trip_count
FROM
[nyc-tlc:yellow.Trips_ByMonth_ByMedallion]
WHERE
medallion IN ("2R76", "8J82", "3B85", "4L79", "5D59", "6H75", "7P60", "8V48", "1H12", "2C69", "2F38", "5Y86", "5j90", "8A75", "8V41", "9J24", "9J55", "1E13", "1J82")
GROUP BY
ROLLUP(trip_year,
trip_month,
medallion)
My question is:
What should I do in order to get all the different "subtotal" permutations in a single query result?
Already tried: a UNION of similar queries with different field orders. It works, but it is not elegant (it would require too many unions).
Thanks
You are correct on both counts. In BigQuery, ROLLUP respects the hierarchy, treating the listed fields as a strictly ordered list. Their order is not changed during aggregation.
The CUBE aggregate commonly found in other SQL environments is unordered and aggregates every possible subset of its listed fields. At this time, CUBE has not been implemented in BigQuery. The workaround you suggest is also what I would suggest: UNION all the result sets from ROLLUP over each permutation of its fields. Although not ideal, this should give you the same results.
In short, a UNION of several queries with different permutations of the ROLLUP fields is the only way to achieve this at the moment. The downsides are, as you state, that this can be difficult to maintain and more expensive to run.
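A rough, untested sketch of that workaround in legacy SQL (a comma between subqueries in the FROM clause acts as UNION ALL; the WHERE filter from the original query is omitted for brevity, and duplicate subtotal rows such as the grand total may need to be filtered out afterwards):

SELECT trip_year, trip_month, medallion, Sum_trip_count
FROM
  (SELECT trip_year, trip_month, medallion, SUM(trip_count) AS Sum_trip_count
   FROM [nyc-tlc:yellow.Trips_ByMonth_ByMedallion]
   GROUP BY ROLLUP(trip_year, trip_month, medallion)),
  (SELECT trip_year, trip_month, medallion, SUM(trip_count) AS Sum_trip_count
   FROM [nyc-tlc:yellow.Trips_ByMonth_ByMedallion]
   GROUP BY ROLLUP(medallion, trip_year, trip_month))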
If you would like to see CUBE implemented in BigQuery, I strongly encourage you to file a feature request on the BigQuery public issue tracker. Be sure to include a thorough use case in the request.
UPDATE: To support the feature request filed by the OP, please star it and you'll receive notifications with updates.
I'm trying to group calls by month, but I need to do it in the database and not with Ruby. Here is the current code:
Call.limit(1000).group_by { |t| t.created_at.month }
Which returns:
SELECT `calls`.* FROM `calls` ORDER BY created_at desc LIMIT 1000
Then Ruby does the grouping. What should I do to make the database do the work?
Thank you.
The short answer is that you cannot achieve the same result at the SQL level.
Here's the full explanation.
First of all, what should the result of that call be? You can use the PostgreSQL GROUP BY statement, but the result is likely not what you expect.
GROUP BY is designed to group rows that share the same values and compute an aggregate function over each group. In your case, even assuming you write a query that uses date_trunc to group by part of the timestamp, an aggregate function cannot return a dataset structured the way the Ruby group_by method structures it.
Why do you want to compute such a grouping at the database level?
If you have specific requirements or computation limits, then work on a custom method.
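For example (a sketch, assuming a PostgreSQL calls table and that a count per month would be acceptable), a SQL-level grouping returns one aggregated row per month, not the hash of records that group_by builds:

-- One row per month with an aggregate, instead of an array of Call objects
SELECT date_trunc('month', created_at) AS month,
       COUNT(*) AS call_count
FROM calls
GROUP BY date_trunc('month', created_at)
ORDER BY month;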
Use Call.limit(1000).group("month(created_at)")
Please check out the MySQL date and time functions appropriate for your case. The .group() call will make MySQL do the grouping.
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_month
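For reference, a sketch of the kind of SQL this produces once you add an aggregate such as .count (assuming MySQL; the exact statement Rails emits may differ slightly):

SELECT COUNT(*) AS count_all, MONTH(created_at) AS month_created_at
FROM calls
GROUP BY MONTH(created_at)
LIMIT 1000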
We have a view in our database which has an ORDER BY in it.
Now, I realize views generally don't order, because different people may use it for different things, and want it differently ordered. This view however is used for a VERY SPECIFIC use-case which demands a certain order. (It is team standings for a soccer league.)
The database is SQL Server 2008 Express, v10.0.1763.0, on a Windows Server 2003 R2 box.
The view is defined as such:
CREATE VIEW season.CurrentStandingsOrdered
AS
SELECT TOP 100 PERCENT *, season.GetRanking(TEAMID) RANKING
FROM season.CurrentStandings
ORDER BY
GENDER, TEAMYEAR, CODE, POINTS DESC,
FORFEITS, GOALS_AGAINST, GOALS_FOR DESC,
DIFFERENTIAL, RANKING
It returns:
GENDER, TEAMYEAR, CODE, TEAMID, CLUB, NAME,
WINS, LOSSES, TIES, GOALS_FOR, GOALS_AGAINST,
DIFFERENTIAL, POINTS, FORFEITS, RANKING
Now, when I run a SELECT against the view, it orders the results by GENDER, TEAMYEAR, CODE, TEAMID. Notice that it is ordering by TEAMID instead of POINTS as the ORDER BY clause specifies.
However, if I copy the SQL statement and run it exactly as is in a new query window, it orders correctly as specified by the ORDER BY clause.
The order of rows returned by a view is never guaranteed, even if the view has an ORDER BY clause. If you need a specific row order, you must specify the ORDER BY in the query that selects from the view.
See the note at the top of this Books Online entry.
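In other words, put the ORDER BY on the outer query, for example (reusing the ordering from the view definition):

SELECT *
FROM season.CurrentStandingsOrdered
ORDER BY
    GENDER, TEAMYEAR, CODE, POINTS DESC,
    FORFEITS, GOALS_AGAINST, GOALS_FOR DESC,
    DIFFERENTIAL, RANKING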
SQL Server 2005 ignores TOP 100 PERCENT by design.
Try TOP 2000000000 instead.
Now, I'll try to find a reference... I was at a seminar presented by Itzik Ben-Gan who mentioned it.
Found some...
Kimberly L. Tripp
"TOP 100 Percent ORDER BY Considered Harmful"
In this particular case, the optimizer recognizes that TOP 100 PERCENT qualifies all rows and does not need to be computed at all.
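Applied to the view from the question, the workaround looks roughly like this (a sketch; keep in mind the resulting order is still an optimizer side effect and is not guaranteed):

ALTER VIEW season.CurrentStandingsOrdered
AS
SELECT TOP 2000000000 *, season.GetRanking(TEAMID) RANKING
FROM season.CurrentStandings
ORDER BY
    GENDER, TEAMYEAR, CODE, POINTS DESC,
    FORFEITS, GOALS_AGAINST, GOALS_FOR DESC,
    DIFFERENTIAL, RANKING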
Just use:
TOP (99) PERCENT
or
TOP (a number thousands of times larger than your row count, like 24682468123)
It works! Just try it.
In SQL Server 2008, ORDER BY is ignored in views that use TOP 100 PERCENT. In prior versions of SQL Server, ORDER BY was only allowed in a view if TOP was used, but a perfect order was never guaranteed. However, many assumed a perfect order was guaranteed. I infer that Microsoft does not want to mislead programmers and DBAs into believing there is a guaranteed order using this technique.
An excellent comparative demonstration of this behavior can be found here...
http://blog.sqlauthority.com/2009/11/24/sql-server-interesting-observation-top-100-percent-and-order-by
Oops, I just noticed that this was already answered. But checking out the comparative demonstration is worth a look anyway.
Microsoft has fixed this. You have to patch your SQL Server:
http://support.microsoft.com/kb/926292
I found an alternative solution.
My initial plan was to create a sort_order column that would save users from having to perform a complex sort.
I used the windowed function ROW_NUMBER(). In its ORDER BY clause, I specified the default sort order that I needed (just as it would appear in the ORDER BY of a SELECT statement).
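A minimal sketch of that approach applied to the view from the question (the view name here is mine, and I have not tested it against the original schema):

CREATE VIEW season.CurrentStandingsWithSortOrder
AS
SELECT
    s.*,
    ROW_NUMBER() OVER (
        ORDER BY s.GENDER, s.TEAMYEAR, s.CODE, s.POINTS DESC,
                 s.FORFEITS, s.GOALS_AGAINST, s.GOALS_FOR DESC,
                 s.DIFFERENTIAL, s.RANKING
    ) AS sort_order
FROM (
    SELECT *, season.GetRanking(TEAMID) AS RANKING
    FROM season.CurrentStandings
) AS s

Consumers can then simply ORDER BY sort_order, or pick a different order if they need one.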
I get several positive outcomes:
By default, the data is returned in the sort order I originally intended (this is probably due to the windowed function having to sort the data before assigning the sort_order value)
Other users can sort the data in alternative ways if they choose to
The sort_order column is there for a very specific sort need, making it easier for users to re-sort the data should whatever tool they use rearrange the rowset.
Note: In my specific application, users are accessing the view via Excel 2010, and by default the data is presented to the user as I had hoped without further sorting needed.
Hope this helps those with a similar problem.
Cheers,
Ryan
Run a profiler trace on your database and see the query that's actually being run when you query your view.
You also might want to consider using a stored procedure to return the data from your view, ordered correctly for your specific use case.
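If you go the stored procedure route, a minimal sketch (the procedure name is mine) could be:

CREATE PROCEDURE season.GetCurrentStandingsOrdered
AS
BEGIN
    SET NOCOUNT ON;

    -- The ORDER BY here is honored because it is on the outermost query
    SELECT *
    FROM season.CurrentStandingsOrdered
    ORDER BY
        GENDER, TEAMYEAR, CODE, POINTS DESC,
        FORFEITS, GOALS_AGAINST, GOALS_FOR DESC,
        DIFFERENTIAL, RANKING;
END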