Google BigQuery; use subselect result in outer select; cross join - google-bigquery

I have a query that results in a one-row table, and I need to use this result in a subsequent computation. Here is a non-working simplified example (just to depict what I'm trying to achieve):
SELECT amount / (SELECT SUM(amount) FROM [...]) FROM [...]
I tried some nested sub-selects and joins (a cross join of the one-row table with the other table) but didn't find a working solution. Is there a way to get this working in BigQuery?
Thanks, Radek
EDIT:
OK, I found a solution:
SELECT
  t1.x / t2.y AS z
FROM
  (SELECT 1 AS k, amount AS x FROM [...] LIMIT 10) AS t1
JOIN
  (SELECT 1 AS k, SUM(amount) AS y FROM [...]) AS t2
ON
  t1.k = t2.k;
but I'm not sure if this is the best way to do it...

With the recently announced ratio_to_report() window function:
SELECT RATIO_TO_REPORT(amount) OVER() AS z
FROM [...]
ratio_to_report takes each amount and divides it by the sum of the amounts across all result rows.
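ratio_to_report is BigQuery-specific, but the same result comes from the standard window expression amount / SUM(amount) OVER (). A minimal sketch using SQLite, with made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (amount REAL);
INSERT INTO sales VALUES (10), (30), (60);
""")

# Equivalent of RATIO_TO_REPORT(amount) OVER(): each amount divided
# by the sum of amounts over the whole result set.
rows = conn.execute("""
    SELECT amount, amount / SUM(amount) OVER () AS z
    FROM sales
    ORDER BY amount
""").fetchall()
print(rows)  # → [(10.0, 0.1), (30.0, 0.3), (60.0, 0.6)]
```

The ratios always sum to 1 over the result set, which makes this handy for percentage-of-total reports.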

The way you've found (essentially a cross join using a dummy key) is the best way I know of to do this query. We've thought about adding an explicit cross join operator to make it easier to see how to do this, but a cross join can get expensive if not done carefully (e.g. on two large tables it can produce n^2 results).
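As a sanity check, the dummy-key join from the question runs as-is on SQLite; the table name and amounts below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (amount REAL);
INSERT INTO t VALUES (10), (30), (60);
""")

# Cross join via a constant key: every row of t1 matches the single
# aggregate row of t2, so each amount is divided by the grand total.
rows = conn.execute("""
    SELECT t1.x / t2.y AS z
    FROM (SELECT 1 AS k, amount AS x FROM t) AS t1
    JOIN (SELECT 1 AS k, SUM(amount) AS y FROM t) AS t2
      ON t1.k = t2.k
    ORDER BY z
""").fetchall()
print([z for (z,) in rows])  # → [0.1, 0.3, 0.6]
```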

Related

MS SQL - Multiple Running Totals, each based on a different GROUP BY

I need to generate two running-total columns, each based on a different GROUP BY. I would PREFER that the solution use the OUTER APPLY method like the one below, except modified to run multiple running totals/sums on different group-bys/columns. See the image for an example of the desired result.
SELECT t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount,
RunningTotal = SUM(t2.TicketAmount)
FROM dbo.SpeedingTickets AS t1
OUTER APPLY
(
SELECT TicketAmount
FROM dbo.SpeedingTickets
WHERE LicenseNumber = t1.LicenseNumber
AND IncidentDate <= t1.IncidentDate
) AS t2
GROUP BY t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount
ORDER BY t1.LicenseNumber, t1.IncidentDate;
Example + desired result:
i.stack.imgur.com/PvJQe.png
Use outer apply twice:
Here is how you get one running total:
SELECT st.*, r1.RunningTotal
FROM dbo.SpeedingTickets st OUTER APPLY
(SELECT SUM(st2.TicketAmount) as RunningTotal
FROM dbo.SpeedingTickets st2
WHERE st2.LicenseNumber = st.LicenseNumber AND
st2.IncidentDate <= st.IncidentDate
) r1
ORDER BY st.LicenseNumber, st.IncidentDate;
For two, you just add another OUTER APPLY. Your question doesn't specify what the second aggregation is, and the linked picture has no relevance to the description in the question.
Notes:
The aggregation goes in the subquery, not in the outer query.
Use table abbreviations for table aliases. Such consistency makes it easier to follow the query.
When using correlated subqueries, always use qualified column names for all columns.
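On SQL Server 2012+ (and most modern engines) the same running total can also be written with a window function instead of OUTER APPLY; a second running total is then just another SUM() OVER () with a different PARTITION BY. A minimal SQLite sketch with invented ticket data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SpeedingTickets (LicenseNumber TEXT, IncidentDate TEXT, TicketAmount REAL);
INSERT INTO SpeedingTickets VALUES
  ('A', '2020-01-01', 50), ('A', '2020-02-01', 75), ('B', '2020-01-15', 100);
""")

# Running total per license: the window frame defaults to
# "everything up to the current row" within each partition.
rows = conn.execute("""
    SELECT LicenseNumber, IncidentDate, TicketAmount,
           SUM(TicketAmount) OVER (PARTITION BY LicenseNumber
                                   ORDER BY IncidentDate) AS RunningTotal
    FROM SpeedingTickets
    ORDER BY LicenseNumber, IncidentDate
""").fetchall()
for r in rows:
    print(r)
# ('A', '2020-01-01', 50.0, 50.0)
# ('A', '2020-02-01', 75.0, 125.0)
# ('B', '2020-01-15', 100.0, 100.0)
```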

How can I join two tables using intervals in Google Big Query?

A solution for finding points within a bounding box/circle using a cross join has been identified, as below:
SELECT A.ID, C.Car
FROM Cars C
CROSS JOIN Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax
at:
How to cross join in Big Query using intervals?
however, using a cross join for large data sets is blocked by the GBQ ops team due to constraints on the infrastructure.
Hence my question: how could I find the set of lat/longs in a large data table (table A) that fall within a set of bounding boxes in another, small table (table B)?
My query as below has been blocked:
SELECT a.a1, a.a2, a.mdl, b.name, COUNT(1) AS count
FROM TableMaster a
CROSS JOIN places_locations b
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP BY ....
TableMaster is 538 GB with 6,658,716,712 rows (cleaned/absolute minimum)
places_locations varies per query around 5 to 100kb.
I have tried to adapt a fake join based on a template:
How to improve performance of GeoIP query in BigQuery?
however, the query runs for an hour and produces neither results nor errors.
Could you identify a possible path to solve this puzzle at all?
The problem you're seeing is that the cross join generates too many intermediate values (6 billion x 1k = 6 trillion).
The way to work around this is to generate fewer outputs. If you have additional filters you can apply, you should try applying them before you do the join. If you could do the group by (or part of it) before the join, that would also help.
Moreover, for doing the lookup, you could do a more coarse-grained lookup first. That is, if you do an initial cross join with a smaller table that has coarse-grained regions, you can then join against the larger table on region id rather than doing a cross join.
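The coarse-grained idea can be sketched outside SQL as well: assign points and boxes to coarse grid cells, then compare only pairs that share a cell (an equi-join on cell id instead of a full cross join). Cell size, table contents, and names below are all invented for illustration:

```python
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (a tuning knob)

def cell(lat, lon):
    """Grid cell containing a single point."""
    return (int(lat // CELL), int(lon // CELL))

def cells_for_box(bottom, top, left, right):
    """All grid cells a bounding box overlaps."""
    for la in range(int(bottom // CELL), int(top // CELL) + 1):
        for lo in range(int(left // CELL), int(right // CELL) + 1):
            yield (la, lo)

points = [(51.5, -0.1), (48.9, 2.3), (40.7, -74.0)]   # (lat, lon)
boxes = {"london": (51.0, 52.0, -1.0, 0.0)}           # bottom, top, left, right

# Index boxes by the coarse cells they cover (the small side).
index = defaultdict(list)
for name, (b, t, l, r) in boxes.items():
    for c in cells_for_box(b, t, l, r):
        index[c].append((name, b, t, l, r))

# Probe each point's single cell instead of cross joining everything,
# then apply the exact BETWEEN test only to candidates in that cell.
matches = []
for lat, lon in points:
    for name, b, t, l, r in index.get(cell(lat, lon), []):
        if b <= lat <= t and l <= lon <= r:
            matches.append((lat, lon, name))
print(matches)  # → [(51.5, -0.1, 'london')]
```

In BigQuery terms, the cell id plays the role of the region id: both tables get a cell column, the join becomes an equi-join on it, and the BETWEEN predicates move to the WHERE clause.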
OK, so the fake join does work in the end. Solution:
SELECT a.B, a.C, COUNT(1) AS count
FROM (
  SELECT B, C, A, lat, long
  FROM [GB_Data.PlacesMasterA]
  WHERE B IS NOT NULL
) a
JOIN (
  SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A
  FROM [Places.placeABOXA]
) b ON a.A = b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY B, C

Why would CROSS APPLY not be equivalent to INNER JOIN

This runs in 2 minutes:
SELECT
G.GKey,
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
INNER JOIN #g G ON
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
GROUP BY G.GKey;
This runs in 8 mins:
SELECT
G.GKey,
C.Amount
FROM
#g G
CROSS APPLY
(
SELECT
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
WHERE
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
) C;
These are both quite simple scripts and they look logically the same to me.
Table #G has 50 rows with a clustered index ON #G(Livedate,GKey)
Table WH.dbo.vw_Fact has a billion rows.
I actually felt initially that applying the bigger table to the small table was going to be more efficient.
My experience using CROSS APPLY is limited - is there an obvious reason (without exploring execution plans) for the slow time?
Is there a 'third way' that is likely to be quicker?
Here's the logical difference between the two joins:
CROSS APPLY: yields the result of an aggregation for a given value of LiveDate and GKey; this subquery gets re-executed for every row.
INNER JOIN: yields a match on vw_Fact for every value of LiveDate and GKey, then sums across common values of GKey; this creates the joined set first, then applies the aggregate.
As some of the other answers mentioned, cross apply is convenient when you join to a table valued function that is parameterized by some row level data from another table.
Is there a third way that is faster? I would generally suggest avoiding open-ended operators in joins (such as >=). Maybe try to pre-aggregate the large table on GKey and some date bucket. Also, set up a non-clustered index on LiveDate that includes AmountEUR.
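The pre-aggregation suggestion can be sketched like this: collapse the large table to one row per (GKey, DateKey) first, so the open-ended DateKey >= Livedate predicate runs against far fewer rows. A SQLite sketch with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fct (GKey INT, DateKey INT, AmountEUR REAL);
INSERT INTO fct VALUES (1, 100, 10), (1, 101, 20), (1, 102, 30), (2, 100, 5);
CREATE TABLE g (GKey INT, Livedate INT);
INSERT INTO g VALUES (1, 101), (2, 100);
""")

# Pre-aggregate the big side per (GKey, DateKey), then apply the
# open-ended date filter to the much smaller aggregate.
rows = conn.execute("""
    SELECT G.GKey, SUM(pre.Amount) AS Amount
    FROM g G
    JOIN (SELECT GKey, DateKey, SUM(AmountEUR) AS Amount
          FROM fct
          GROUP BY GKey, DateKey) pre
      ON pre.GKey = G.GKey AND pre.DateKey >= G.Livedate
    GROUP BY G.GKey
    ORDER BY G.GKey
""").fetchall()
print(rows)  # → [(1, 50.0), (2, 5.0)]
```

The final sums match the unaggregated join; the win on a billion-row fact table is that the range predicate only touches one row per (GKey, DateKey) bucket.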
I think you're trying to get a rolling sum. Use the OVER() clause. Try this:
SELECT G.GKey,
       Amount = SUM(fct.AmountEUR) OVER (
                    PARTITION BY G.GKey
                    ORDER BY id ROWS UNBOUNDED PRECEDING)
FROM WH.dbo.vw_Fact fct
INNER JOIN #g G
        ON fct.GKey = G.GKey
APPLY works on a row-by-row basis and is useful for more complex joins such as joining on the first X rows of a table based upon a value in the first table or for joining a function with parameters.
See here for examples.
The obvious reason for the CROSS APPLY being slower is that it works on a row-by-row basis: for each row of #g you are running the aggregate query in the CROSS APPLY.

query behave not as expected

I have a query:
select count(*) as total
from sheet_record right join
(select * from sheet_record limit 10) as sr
on 1=1;
If I understood correctly (which I think I did not), a right join is supposed to return all rows from the right table in conjunction with the left table; there should be at least 10 rows. But the query returns only 1 row with 1 column, 'total'. And whether it's a left, full, or inner join doesn't matter; the result is always the same.
If I reverse the tables and use a left join with a small modification of the query, then it works correctly (the modifications don't matter, because in that case I get exactly what I expected to get). But I am interested to find out what I actually didn't understand about joins, and why this query doesn't work as expected.
You are returning one column because the select contains an aggregation function, turning this into an aggregation query. The query should be returning 10 times the number of rows in the sheet_record table.
Your query is effectively a cross join. So, if you did:
select *
from sheet_record right join
(select * from sheet_record limit 10) as sr
on 1=1;
You would get 10 rows for each record in sheet_record. Each of those records would have additional columns from one of ten records from the same table.
You are using a COUNT(*) function without any groupings. This will pretty much result in retrieving a single row back. Try running your query without the COUNT() to see if you get something closer to what you expect.
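The collapsing behaviour is easy to verify: without the aggregate, the join on 1=1 produces 10 x N rows; with COUNT(*) it collapses to one row. A SQLite sketch (CROSS JOIN is used because ON 1=1 matches every pair anyway and older SQLite lacks RIGHT JOIN; the row counts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet_record (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO sheet_record VALUES (?)", [(i,) for i in range(25)])

# ON 1=1 matches everything, so the join is effectively a cross join:
# 25 rows x 10 limited rows = 250 joined rows, collapsed by COUNT(*).
total = conn.execute("""
    SELECT COUNT(*) FROM sheet_record
    CROSS JOIN (SELECT * FROM sheet_record LIMIT 10) AS sr
""").fetchone()[0]
print(total)  # → 250
```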
Eventually, with the help of commentators, I understood what was wrong. Not wrong actually, but what exactly I was not catching.
// the code below works fine. The query will return page 15 with 10 records in it.
select *
from sheet_record
inner join (select count(*) as total from sheet_record) as sr on 1=1
limit 10 offset 140;
I was thinking that a join takes the table on the left and joins it with the table on the right. But in the script above I had a view (a table built by a subquery) on the right side, and I was thinking the left side was also a view, made by (select * from sheet_record), which was a mistake.
The idea is to get a set of records from table X with an additional column holding the total number of records in the table.
(This is a common problem when there is a demand to show a table in a UI using paging. To know how many pages are still available, I need to know how many records there are in total, so I can calculate how many pages remain.)
I think it should be something like:
select * from (
(here is some subquery which will give a view using the count(*) function on some table X; it will be used as the left table)
right join
(here is some subquery which will get some set of records from table X with limit and offset)
on 1=1 -- because I need all rows from the right table (view); in all cases it should be true
)
The query with a right join will be a bit complicated.
I am using postgres.
So eventually I managed to get the result with a right join:
select *
from (select count(*) as total from sheet_record) as srt
right join (select * from sheet_record limit 10 offset 140) as sr on 1=1;

What are the advantages of a query using a derived table(s) over a query not using them?

I know how derived tables are used, but I still can’t really see any real advantages of using them.
For example, in the following article http://techahead.wordpress.com/2007/10/01/sql-derived-tables/ the author tried to show the benefits of a query using a derived table over a query without one. In his example, we want to generate a report showing the total number of orders each customer placed in 1996, and we want this result set to include all customers, including those that didn't place any orders that year and those that have never placed any orders at all (he's using the Northwind database).
But when I compare the two queries, I fail to see any advantages to the query using a derived table (if nothing else, the use of a derived table doesn't appear to simplify the code, at least not in this example):
Regular query:
SELECT C.CustomerID, C.CompanyName, COUNT(O.OrderID) AS TotalOrders
FROM Customers C LEFT OUTER JOIN Orders O ON
C.CustomerID = O.CustomerID AND YEAR(O.OrderDate) = 1996
GROUP BY C.CustomerID, C.CompanyName
Query using a derived table:
SELECT C.CustomerID, C.CompanyName, COUNT(dOrders.OrderID) AS TotalOrders
FROM Customers C LEFT OUTER JOIN
(SELECT * FROM Orders WHERE YEAR(Orders.OrderDate) = 1996) AS dOrders
ON
C.CustomerID = dOrders.CustomerID
GROUP BY C.CustomerID, C.CompanyName
Perhaps this just wasn't a good example, so could you show me an example where the benefits of a derived table are more obvious?
Thanks
REPLY TO GBN:
In this case, you couldn't capture both products and order aggregates if there is no relation between Customers and Products.
Could you elaborate what exactly you mean? Wouldn’t the following query produce the same result set as your query:
SELECT
C.CustomerID, C.CompanyName,
COUNT(O.OrderID) AS TotalOrders,
COUNT(DISTINCT P.ProductID) AS DifferentProducts
FROM Customers C LEFT OUTER JOIN Orders O ON
C.CustomerID = O.CustomerID AND YEAR(O.OrderDate) = 1996
LEFT OUTER JOIN Products P ON
O.somethingID = P.somethingID
GROUP BY C.CustomerID, C.CompanyName
REPLY TO CADE ROUX:
In addition, if expressions are used to derive columns from derived columns with a lot of shared intermediate calculations, a set of nested derived tables or stacked CTEs is the only way to do it:
SELECT x, y, z1, z2
FROM (
SELECT *
,x + y AS z1
,x - y AS z2
FROM (
SELECT x, x * 2 AS y
FROM A
) AS A
) AS A
Wouldn't the following query produce the same result as your above query:
SELECT x, x * 2 AS y, x + x*2 AS z1, x - x*2 AS z2
FROM A
In your examples, the derived table is not strictly necessary. There are numerous cases where you might need to join to an aggregate or similar, and a derived table is really the only way to handle that:
SELECT *
FROM A
LEFT JOIN (
SELECT x, SUM(y)
FROM B
GROUP BY x
) AS B
ON B.x = A.x
In addition, if expressions are used to derive columns from derived columns with a lot of shared intermediate calculations, a set of nested derived tables or stacked CTEs is the only way to do it:
SELECT x, y, z1, z2
FROM (
SELECT *
,x + y AS z1
,x - y AS z2
FROM (
SELECT x, x * 2 AS y
FROM A
) AS A
) AS A
As far as maintainability, using stacked CTEs or derived tables (they are basically equivalent) can make for more readable and maintainable code, as well as facilitating cut-and-paste re-use and refactoring. The optimizer can typically flatten them very easily.
I typically use stacked CTEs instead of nesting for a little better readability (same two examples):
WITH B AS (
SELECT x, SUM(y)
FROM B
GROUP BY x
)
SELECT *
FROM A
LEFT JOIN B
ON B.x = A.x
WITH A1 AS (
SELECT x, x * 2 AS y
FROM A
)
,A2 AS (
SELECT *
,x + y AS z1
,x - y AS z2
FROM A1
)
SELECT x, y, z1, z2
FROM A2
Regarding your question about:
SELECT x, x * 2 AS y, x + x*2 AS z1, x - x*2 AS z2
FROM A
This has the x * 2 code repeated 3 times. If this business rule needs to change, it will have to change in 3 places - a recipe for injection of defects. This gets compounded any time you have intermediate calculations which need to be consistent and defined in only one place.
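The single-definition point can be checked by running the nested form (with x projected through the inner derived table); a SQLite sketch with invented values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (x REAL);
INSERT INTO A VALUES (1), (2);
""")

# The intermediate y = x * 2 is defined exactly once in the inner
# derived table, then reused for z1 and z2 in the outer one.
rows = conn.execute("""
    SELECT x, y, z1, z2
    FROM (SELECT *, x + y AS z1, x - y AS z2
          FROM (SELECT x, x * 2 AS y FROM A) AS A1) AS A2
    ORDER BY x
""").fetchall()
print(rows)  # → [(1.0, 2.0, 3.0, -1.0), (2.0, 4.0, 6.0, -2.0)]
```

If the y business rule changes, only the one inner expression changes; z1 and z2 pick it up automatically.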
This would not be as much of a problem if SQL Server's scalar user-defined functions could be inlined (or if they performed acceptably); you could simply build your UDFs to stack your results and the optimizer would eliminate redundant calls. Unfortunately, SQL Server's scalar UDF implementation cannot handle that well for large sets of rows.
I typically use a derived table (or a CTE, which is a sometimes-superior alternative to derived queries in SQL 2005/2008) to simplify reading and building queries, or in cases where SQL doesn't allow me to do a particular operation.
For example, one of the things you can't do without a derived table or CTE is use a window function in a WHERE clause. This won't work:
SELECT name, city, joindate
FROM members
INNER JOIN cities ON cities.cityid = members.cityid
WHERE ROW_NUMBER() OVER (PARTITION BY members.cityid ORDER BY joindate) = 1
But this will work:
SELECT name, city, joindate
FROM
(
SELECT name,
cityid,
joindate,
ROW_NUMBER() OVER (PARTITION BY cityid ORDER BY joindate) AS rownum
FROM members
) derived INNER JOIN cities ON cities.cityid = derived.cityid
WHERE rownum = 1
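The working pattern runs unchanged on any engine with window functions; here is a SQLite sketch with invented members/cities data, picking each city's earliest joiner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (cityid INT, city TEXT);
INSERT INTO cities VALUES (1, 'Oslo'), (2, 'Lima');
CREATE TABLE members (name TEXT, cityid INT, joindate TEXT);
INSERT INTO members VALUES
  ('ann', 1, '2019-01-01'), ('bob', 1, '2020-01-01'), ('cyd', 2, '2018-06-01');
""")

# ROW_NUMBER() cannot go in WHERE directly, so compute it inside a
# derived table and filter on the alias in the outer query.
rows = conn.execute("""
    SELECT name, city, joindate
    FROM (SELECT name, cityid, joindate,
                 ROW_NUMBER() OVER (PARTITION BY cityid
                                    ORDER BY joindate) AS rownum
          FROM members) derived
    JOIN cities ON cities.cityid = derived.cityid
    WHERE rownum = 1
    ORDER BY city
""").fetchall()
print(rows)  # → [('cyd', 'Lima', '2018-06-01'), ('ann', 'Oslo', '2019-01-01')]
```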
Advanced caveats, especially for large-scale analytics
If you're working on relatively small data sets (not gigabytes), you can probably stop reading here. If you're working with gigabytes or terabytes of data and using derived tables, read on...
For very large-scale data operations, it's sometimes preferable to create a temporary table instead of using a derived query. This may happen if SQL Server's statistics suggest that your derived query will return many more rows than it actually will, which happens more often than you'd think. Queries where your main query self-joins with a non-recursive CTE are also problematic.
It's also possible that derived tables will generate unexpected query plans. For example, even if you put a strict WHERE clause in your derived table to make that query very selective, SQL Server may re-order your query plan so that your WHERE clause is evaluated later in the plan. See this Microsoft Connect feedback for a discussion of this issue and a workaround.
So, for very performance-intensive queries (especially data-warehousing queries on 100GB+ tables), I always like to prototype a temporary-table solution to see if it gets better performance than a derived table or CTE. This seems counter-intuitive, since you're doing more I/O than an ideal single-query solution, but with temp tables you get total control over the query plan used and the order in which each subquery is evaluated. Sometimes this can increase performance 10x or more.
I also tend to prefer temp tables in cases where I have to use query hints to force SQL to do what I want: if the SQL optimizer is already "misbehaving", temp tables are often a clearer way to force it to act the way you want.
I'm not suggesting this is a common case: most of the time the temporary-table solution will be at least a little worse, and sometimes query hints are one's only recourse. But don't assume that a CTE or derived-query solution will be your fastest option either. Test, test, test!
Derived tables often replace correlated subqueries and are generally considerably faster.
They can also be used to greatly limit the number of records searched through in a large table, and thus may also improve the speed of the query.
As with all potentially performance improving techniques, you need to test to see if they did improve performance. A derived table will almost always strongly outperform a correlated subquery but there is the possibility it may not.
Further, there are times when you need to join to data containing an aggregate calculation, which is almost impossible to do without a derived table or CTE (which is essentially another way of writing a derived table, in many cases).
Derived tables are also one of my most useful ways of figuring out complex data for reporting. You can do this in pieces using table variables or temp tables too, but if you don't want to see the code in procedural steps, people often change them to derived tables once they have worked out what they want using temp tables.
Aggregating data from a union is another place where you need derived tables.
Using your terminology and example, the derived table is only more complex, with no advantages. However, some things require a derived table. The most complex cases can be handled with CTEs (as demonstrated above). But simple joins can also demonstrate the necessity of derived tables; all you must do is craft a query that requires the use of an aggregate. Here we use a variant of the quota query to demonstrate this.
Select each customer's most expensive transaction(s):
SELECT transactions.*
FROM transactions
JOIN (
    SELECT user_id, MAX(spent) AS spent
    FROM transactions
    GROUP BY user_id
) AS derived_table
  ON derived_table.user_id = transactions.user_id
 AND derived_table.spent = transactions.spent
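With the join condition written as an ON clause, the quota query runs on SQLite too; a sketch with invented transaction data (note how a tie keeps both rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (user_id INT, spent REAL);
INSERT INTO transactions VALUES (1, 10), (1, 40), (2, 25), (2, 25);
""")

# Join each row back to its user's maximum spend; only the most
# expensive transaction(s) per user survive the join.
rows = conn.execute("""
    SELECT t.user_id, t.spent
    FROM transactions t
    JOIN (SELECT user_id, MAX(spent) AS spent
          FROM transactions
          GROUP BY user_id) d
      ON d.user_id = t.user_id AND d.spent = t.spent
    ORDER BY t.user_id
""").fetchall()
print(rows)  # → [(1, 40.0), (2, 25.0), (2, 25.0)]
```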
In this case, the derived table allows YEAR(O.OrderDate) = 1996 in a WHERE clause.
In the outer where clause, it's useless because it would change the JOIN to INNER.
Personally, I prefer the derived table (or CTE) construct because it puts the filter into the correct place.
Another example:
SELECT
C.CustomerID, C.CompanyName,
COUNT(D.OrderID) AS TotalOrders,
COUNT(DISTINCT D.ProductID) AS DifferentProducts
FROM
Customers C
LEFT OUTER JOIN
(
SELECT
OrderID, P.ProductID
FROM
Orders O
JOIN
Products P ON O.somethingID = P.somethingID
WHERE YEAR(O.OrderDate) = 1996
) D
ON C.CustomerID = D.CustomerID
GROUP BY
C.CustomerID, C.CompanyName
In this case, you couldn't capture both products and order aggregates if there is no relation between Customers and Products. Of course, this is contrived, but I hope I've captured the concept.
Edit:
I need to explicitly JOIN T1 and T2 before the JOIN onto MyTable. It does happen. The derived T1/T2 join can give a different result from 2 LEFT JOINs with no derived table. It happens quite often:
SELECT
--stuff--
FROM
myTable M1
LEFT OUTER JOIN
(
SELECT
T1.ColA, T2.ColB
FROM
T1
JOIN
T2 ON T1.somethingID = T2.somethingID
WHERE
--filter--
) D
ON M1.ColA = D.ColA AND M1.ColB = D.ColB