Adding a "calculated column" to BigQuery query without repeating the calculations

I want to reuse the values of calculated columns in a new third column.
For example, this query works:
select
countif(cond1) as A,
countif(cond2) as B,
countif(cond1)/countif(cond2) as prct_pass
From
Where
Group By
But when I try to use A,B instead of repeating the countif, it doesn't work because A and B are invalid:
select
countif(cond1) as A,
countif(cond2) as B,
A/B as prct_pass
From
Where
Group By
Can I somehow make the more readable second version work?
Is the first one inefficient?

You should construct a subquery (i.e. a double select) like
SELECT A, B, A/B as prct_pass
FROM
(
SELECT countif(cond1) as A,
countif(cond2) as B
FROM <yourtable>
)
The same amount of data will be processed in both queries.
In the subquery version you write only 2 countif() calls; if that step takes a long time, then doing 2 instead of 4 should in principle be more efficient.
Looking at an example using bigquery public datasets:
SELECT
countif(homeFinalRuns>3) as A,
countif(awayFinalRuns>3) as B,
countif(homeFinalRuns>3)/countif(awayFinalRuns>3) as division
FROM `bigquery-public-data.baseball.games_post_wide`
or
SELECT A, B, A/B as division FROM
(
SELECT countif(homeFinalRuns>3) as A,
countif(awayFinalRuns>3) as B
FROM `bigquery-public-data.baseball.games_post_wide`
)
Comparing the two, doing everything in one query (without a subquery) was actually slightly faster in my runs: I ran the queries 6 times with different values in the inequality, and the single query was faster 5 times and slower once.
In any case, the efficiency will depend on how taxing it is to compute the condition on your particular dataset.
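A minimal sketch to check that the two forms return the same row, run locally against SQLite through Python's sqlite3 module rather than BigQuery. The games table and its five rows are made up for illustration, and since SQLite has no COUNTIF, SUM(condition) is used as a stand-in. It says nothing about BigQuery timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the public baseball table.
conn.execute("CREATE TABLE games (homeFinalRuns INTEGER, awayFinalRuns INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?)",
    [(5, 2), (4, 6), (1, 7), (8, 1), (2, 1)],
)

# SQLite has no COUNTIF; a comparison evaluates to 1 or 0,
# so SUM(condition) counts the matching rows instead.
repeated = conn.execute(
    """
    SELECT SUM(homeFinalRuns > 3) AS A,
           SUM(awayFinalRuns > 3) AS B,
           1.0 * SUM(homeFinalRuns > 3) / SUM(awayFinalRuns > 3) AS division
    FROM games
    """
).fetchone()

# Same calculation, with the aliases reused from a subquery.
subquery = conn.execute(
    """
    SELECT A, B, 1.0 * A / B AS division
    FROM (
        SELECT SUM(homeFinalRuns > 3) AS A,
               SUM(awayFinalRuns > 3) AS B
        FROM games
    )
    """
).fetchone()

print(repeated, subquery)
```

The 1.0 * factor is only there because SQLite would otherwise do integer division; BigQuery's / operator already returns a floating-point result.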

Related

Can I multiply the output of a SQL query from two separate tables within the same query?

I am taking two values (A, B) from similar but different tables. E.g. A is the count(*) of Table R, but B is a complex calculation based off a slightly adapted table (we can call it S).
So I did this:
SELECT
(SELECT count(*)*60 FROM R) AS A,
[calculation for B] AS B
FROM R
WHERE
[modification to R to get S]
Not sure if this was the smartest way to do it (probably was not, I'm a new user).
Now I want to do some multiplications:
A*B
B-(A*0.75)
B-(A*0.8)
B-(A*0.85)
etc.
Is there a way to do this within the same query?
Thanks.

The simplest way is to wrap your original query in a derived table and do the arithmetic in the outer select:
SELECT A*B p1, B-(A*0.75) p2, B-(A*0.8) p3, B-(A*0.85) p4, ...
FROM (
-- your original query producing columns A, B ...
) t
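As a rough sketch of that shape, here is a runnable version using SQLite via Python's sqlite3. The column x, the SUM(x) calculation and the x > 1 filter are placeholders I made up for the parts the question left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (x INTEGER)")  # hypothetical column
conn.executemany("INSERT INTO R VALUES (?)", [(1,), (2,), (3,), (4,)])

row = conn.execute(
    """
    SELECT A * B AS p1, B - (A * 0.75) AS p2, B - (A * 0.8) AS p3
    FROM (
        SELECT (SELECT COUNT(*) * 60 FROM R) AS A,
               SUM(x) AS B   -- placeholder for the real calculation of B
        FROM R
        WHERE x > 1          -- placeholder for the filter that turns R into S
    ) t
    """
).fetchone()

print(row)
```

The inner query runs once and yields a single row with A and B, so every expression in the outer select can refer to them by name.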

Does a 4-column composite index benefit a 3-column query?

I have a table with columns A, B, C, D, E, ..., N, with PK(A). I also have a composite, unique, non-clustered index defined on columns D, C, B, A, in that order.
If I use a query like:
where D = 'a' and C = 'b' and B = 'c'
without a clause for A, do I still get the benefits of the index?

Yes, SQL Server can perform a seek operation on the index (D, C, B, A) for these queries:
WHERE D = 'd'
WHERE D = 'd' AND C = 'c'
WHERE D = 'd' AND C = 'c' AND B = 'b'
WHERE D = 'd' AND C = 'c' AND B = 'b' AND A = 'a'
And it could perform a scan operation on the said index for these:
WHERE C = 'c' AND B = 'b'
WHERE A = 'a'
-- etc
But there is one more thing to consider: the order of columns inside indexes matters. It is possible to have two indexes (D, C, B, A) and (B, C, D, A) perform differently. Consider the following example:
WHERE Active = 1 AND Type = 2 AND Date = '2019-09-10'
Assuming that the data contains two distinct values for Active, 10 for Type and 1000 for Date, an index on (Date, Type, Active, Other) will be more efficient than (Active, Type, Date, Other).
You could also consider creating different variations of the said index for different queries.
PS: if column A is not used inside the WHERE clause then you can simply INCLUDE it.
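The seek-versus-scan distinction can be tried out locally. This is only a sketch using SQLite through Python's sqlite3, not SQL Server, so the planner and the plan wording differ; the table t and index name idx_dcba are invented for the example. The leading-prefix behaviour it shows is the same idea, though.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (a INTEGER PRIMARY KEY, b TEXT, c TEXT, d TEXT, e TEXT)"
)
conn.execute("CREATE UNIQUE INDEX idx_dcba ON t (d, c, b, a)")

def plan(where):
    """Return SQLite's query plan text for SELECT * FROM t WHERE <where>."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE {where}"
    ).fetchall()
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(r[3] for r in rows)

# Filters on the leading columns D, C, B: the index can be searched.
prefix_plan = plan("d = 'd' AND c = 'c' AND b = 'b'")

# No filter on the leading column D: the index cannot be searched.
no_lead_plan = plan("c = 'c' AND b = 'b'")

print(prefix_plan)
print(no_lead_plan)
```

The first plan should report a SEARCH using idx_dcba, and the second a SCAN. The exact wording depends on the SQLite version.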

Yeah, it will still use the index properly and make your query more efficient.
Think of it like a nested table of contents in a text-book:
Skim till you see the chapter name (e.g. "Data Structures")
Skim until you see the relevant section title ("Binary Trees")
Skim until you see the relevant topic (e.g. "Heaps").
If you only want to get to the binary trees section, the table of contents going down to the topic level doesn't hurt you at all :).
Now... if you wanted to find binary trees and you didn't know they were a data structure, then this table of contents wouldn't be very useful for you (e.g. if you were missing "D").

Yes, now SQL can do a seek to just those records, instead of having to scan the whole table.
In addition:
1. SQL will have better statistics (SQL will auto-create statistics on the set of columns composing the index) as to how many rows are likely to satisfy your query.
2. The fact that it is UNIQUE will also tell SQL about the data, which may result in a better plan (for example if you do DISTINCT or UNION, it will know that those columns are already distinct).
3. SQL will have to read less data, because instead of having to read all "N" columns (even though you only need 3), it can read the index, which will have 4, so only one will be "superfluous".
Note that because the particular query in question is WHERE on D, C, and B, in this case the order of the index won't matter. If the WHERE clause is only on C and B, you would get much less benefit because SQL would no longer be able to seek on D, and (1) and (2) above wouldn't apply. (3) still would, though.

PostgreSQL: How to access column on anonymous record

I have a problem that I'm working on. Below is a simplified query to show the problem:
WITH the_table AS (
SELECT a, b
FROM (VALUES('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data AS (
SELECT 'data7' AS c, array_agg(ROW(a, b)) AS d
FROM the_table
)
SELECT c, d[array_upper(d, 1)]
FROM my_data
In the my_data section, you'll notice that I'm creating an array from multiple rows, and the array is returned in one row with other data. This array needs to contain the information for both a and b, and keep the two values linked together. What would seem to make sense would be to use an anonymous row or record (I want to avoid actually creating a composite type).
This all works well until I need to start pulling data back out. In the above instance, I need to access the last entry in the array, which is done easily by using array_upper, but then I need to access the value in what used to be the b column, which I cannot figure out how to do.
Essentially, right now the above query is returning:
"data7";"(data5,6)"
And I need to return
"data7";6
How can I do this?
NOTE: While in the above example I'm using text and integers as the types for my data, they are not the actual final types, but are rather used to simplify the example.
NOTE: This is using PostgreSQL 9.2
EDIT: For clarification, something like SELECT 'data7', 6 is not what I'm after. Imagine that the_table is actually pulling from database tables and not the WITH statement that I put in for convenience, and I don't readily know what data is in the table.
In other words, I want to be able to do something like this:
SELECT c, (d[array_upper(d, 1)]).b
FROM my_data
And get this back:
"data7";6
Essentially, once I've put something into an anonymous record by using the row() function, how do I get it back out? How do I split up the 'data5' part and the 6 part so that they don't both return in one column?
For another example:
SELECT ROW('data5', 6)
makes 'data5' and 6 return in one column. How do I take that one column and break it back into the original two?
I hope that clarifies things.

If you can install the hstore extension:
with the_table as (
select a, b
from (values('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data as (
select 'data7' as c, array_agg(row(a, b)) as d
from the_table
)
select c, (avals(hstore(d[array_upper(d, 1)])))[2]
from my_data
;
c | avals
-------+-------
data7 | 6

This is just a very quick throw together around a similarish problem - not an answer to your question. This appears to be one direction towards identifying columns.
with x as (select 1 a, 2 b union all values (1,2),(1,2),(1,2))
select a from x;

Sum two counts in a new column without repeating the code

I have one maybe stupid question.
Look at this query:
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code:
Something like:
select count(a) as A, count(b) as b, A+B as C
From X

For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V

This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T

You can get the two individual counts from the same table and then add them together, like below:
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C

Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural language your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language your main concern is: is this the best description of the desired result?
In SQL, most of your effort will go into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formatting (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plan of the two queries as you show them, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to explain what it originally took only one step in one line to express, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.
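For what it's worth, here is a small sketch showing that the original query and the CTE form describe the same result, and that COUNT(x) counts only non-null values. It runs on SQLite via Python's sqlite3, and the four rows in X are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?)",
    [(1, None), (2, 5), (None, 6), (3, None)],  # invented sample rows
)

# Original form: the aggregates are written out again for C.
original = conn.execute(
    "SELECT COUNT(a) AS A, COUNT(b) AS B, COUNT(a) + COUNT(b) AS C FROM X"
).fetchone()

# CTE form: C is built from the aliases instead.
with_cte = conn.execute(
    """
    WITH V AS (SELECT COUNT(a) AS A, COUNT(b) AS B FROM X)
    SELECT A, B, A + B AS C FROM V
    """
).fetchone()

print(original, with_cte)
```

Both return the same row: three non-null values in a, two in b, five in total.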

Calling table valued function twice. Could be once?

I need to use a function in SQL which returns a table.
At this moment I have:
SELECT
K.ID,
(SELECT A from dbo.TableFunction1 (K.ID,0,77)) AS A,
(SELECT B from dbo.TableFunction1 (K.ID,0,77)) AS B
FROM K
I'm worried because I execute the same function with the same parameters twice, once to get one column and next time to get another column.
It turns out I can't do:
SELECT
K.ID,
(SELECT A,B from dbo.TableFunction1 (K.ID,0,77))
FROM K
as I get: Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
Could this query be improved so that I call the function only once?

Try using cross apply:
select K.ID, tf1.A, tf1.B
from K cross apply
dbo.TableFunction1(K.ID, 0, 77) tf1

You can do 3 things here:
1. Use cross apply
select k.id, f.a, f.b
from k
cross apply dbo.TableFunction1(k.id, 0, 77) f
This will work fine if your query stays that simple, but if you start to add other joins that limit the number of rows returned from "K", you can still end up running "TableFunction1" on every row in "K". I've seen that turn into a performance nightmare.
2. Convert the function into two scalar functions
select
K.ID,
dbo.ScalarFunctionA(K.ID, 0, 77) AS A,
dbo.ScalarFunctionB(K.ID, 0, 77) AS B
FROM K
This also has drawbacks: if you have a big query within that function, you're now running that query twice for every row you return. If you're only returning 1 row, no problem; if you're returning thousands, performance takes another hit.
3. Unwrap the function completely and include it within the query. Most likely the fastest, but comes with the drawback of not reusing the code.
