Sum two counts in a new column without repeating the code

Sum two counts in a new column without repeating the code - sql

I have one maybe stupid question.
Look at the query :
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code:
Something like:
select count(a) as A, count(b) as b, A+B as C
From X

For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V

This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T

You may get the two individual counts of a same table and then get the summation of those counts, like bellow
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C

Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural languages your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language your main concern is: is this the best description of the desired result?
In SQL, most of your effort will be going into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formating (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plan of the two queries as you show them, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to explain what it originally took only one step in one line to express, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.

Related

Is it possible to concisely tell SQL SELECT to omit some temporary/dummy columns?

Suppose I have this query:
select
x + y as _total,
abs(x - y) / _total as _err,
round(100 * _err) as pct_err,
x,
y
from foo;
This assumes I have a table with x and y, and calculates an error between them. Note that columns I prefixed with _ are dummy columns - they're only there to show the steps of the calculation more clearly. Is there a way to omit them from the result?
I don't want to simply collapse the three columns into a single expression. It would be messier, and consider also a calculation with 10 steps and much longer field names.
I don't want to make this a CTE and then re-select only the columns I want. That seems too much hassle for such a simple thing.
It would be okay if I could just put the dummy columns at the end where they would be out of the way, but SQL doesn't seem to allow referencing a column that comes after.
Note that "no" is an acceptable answer, if you have reasonably comprehensive knowledge of SQL syntax :)

Nested SQL evaluation question with unnest

this may be a basic question but I just couldn't figure it out. Sample data and query could be found here. (under the "First-touch" tab)
I'll skip the marketing terminology here but basically what the query does is attributing credits/points to placements (ads) based on certain rule. Here, the rule is "first-touch", which means the credit goes to the first ad user interacted with - could be view or click. The "FLOODLIGHT" here means the user takes action to actually buy the product (conversion).
As you can see in the sample data, user 1 has one conversion and the first ad is placement 22 (first-touch), so 22 gets 1 point. User 2 has two conversions and the first ad of each is 11, so 11 gets 2 points.
The logic is quite simple here but I had a difficult time understanding the query itself. What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time? Aren't they essentially the same? I mean both of them came from UNNEST(t.*_path.events). And attributed_event.event_time also came from the same place.
What does prev_conversion_event.event_time, conversion_event.event_time, and attributed_event.event_time evaluate to in this scenario anyway? I'm just confused as hell here. Much appreciate the help!
For convenience I'm pasting the sample data, the query and output below:
Sample data
Output
/* Substitute *_paths for the specific paths table that you want to query. */
SELECT
(
SELECT
attributed_event_metadata.placement_id
FROM (
SELECT
AS STRUCT attributed_event.placement_id,
ROW_NUMBER() OVER(ORDER BY attributed_event.event_time ASC) AS rank
FROM
UNNEST(t.*_paths.events) AS attributed_event
WHERE
attributed_event.event_type != "FLOODLIGHT"
AND attributed_event.event_time < conversion_event.event_time
AND attributed_event.event_time > (
SELECT
IFNULL( (
SELECT
MAX(prev_conversion_event.event_time) AS event_time
FROM
UNNEST(t.*_paths.events) AS prev_conversion_event
WHERE
prev_conversion_event.event_type = "FLOODLIGHT"
AND prev_conversion_event.event_time < conversion_event.event_time),
0)) ) AS attributed_event_metadata
WHERE
attributed_event_metadata.rank = 1) AS placement_id,
COUNT(*) AS credit
FROM
adh.*_paths AS t,
UNNEST(*_paths.events) AS conversion_event
WHERE
conversion_event.event_type = "FLOODLIGHT"
GROUP BY
placement_id
HAVING
placement_id IS NOT NULL
ORDER BY
credit DESC

It is a quite convoluted query to be fair, I think I know what are you asking, please correct me if not the case.
What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time?
You are doing something like "I want all the events from this (unnest), and for every event, I want to know which events are the predecessor of each other".
Say you have [A, B, C, D] and they are ordered in succession (A happened before B, A and B happened before C, and so on), the result of that unnesting and joining over that condition will get you something like [A:(NULL), B:(A), C:(A, B), D:(A, B, C)] (excuse the notation, hope it is not confusing), being each key:value pair, the Event:(Predecessors). Note that A has no events before it, but B has A, etc.
Now you have a nice table with all the conversion events joined with the events that happened before that one.

Repeating operations vs multilevel queries

I was always bothered by how should I approach those, which solution is better. I guess the sample code should explain it better.
Lets imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, but with each of the calculation being based on a previous one, In other words something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
The other one involves constructing a multilevel query like this:
SELECT
t2.*,
(t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
SELECT
t1.*,
t1.Value / t1.NewValue1 AS SomeOtherValue
FROM
(
SELECT
*,
Value + 10 AS NewValue1
FROM
MyTable
WHERE
Name LIKE "A%"
) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.

I have tired a number of different combination of calculations in both variants. They always produced the same execution plan, so it could be assumed that there is no difference in the performance aspect. From the code usability perspective the first approach i obviously better as the code is more readable and compact.

There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the variable is defined before evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.

Select pair of rows that obey a rule

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distances between a pair using anoter node.
An example:
If want to find the distance between A and B,
If I find a node x and have:
x -> A
x -> B
I can add these distances and have the distance beetween A and B.
My question:
How can I find all the nodes (such as x) and get their distances to (A and B)?
My purpose is to select the min value of distance.
P.s: A and B are just one connection (I need to do it for 100K connections).
Thanks !

As Andomar said, you'll need the Dijkstra's algorithm, here's a link to that algorithm in T-SQL: T-SQL Dijkstra's Algorithm

Assuming you want to get the path from A-B with many intermediate steps it is impossible to do it in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power, see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a process and us Djikstra's algorithm.

This sounds like the traveling salesman problem.
From a SQL syntax standpoint: connect by prior would build the tree your after using the start with and limit the number of layers it can traverse; however, doing will not guarantee the minimum.

I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1. Wrapping something like this in a while loop, and replacing "A" with an id variable, would do the trick, no?
For example (A is source, B is final destination):
DECLARE var_id as INT
WHILE var_id != 'B'
BEGIN
SELECT id INTO var_id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1
SELECT var_id
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.

Join the table to itself with destination joined to source. Add the distance from the two links. Insert that as a new link with left side source, right side destination and total distance if that isn't already in the table. If that is in the table but with a shorter total distance then update the existing row with the shorter distance.
Repeat this until you get no new links added to the table and no updates with a shorter distance. Your table now contains a link for every possible combination of source and destination with the minimum distance between them. It would be interesting to see how many repetitions this would take.
This will not track the intermediate path between source and destination but only provides the shortest distance.

IIUC this should do, but I'm not sure if this is really viable (performance-wise) due to the big amount of rows involved and to the CROSS JOIN
SELECT
t1.src AS A,
t1.dest AS x,
t2.dest AS B,
t1.distance + t2.distance AS total_distance
FROM
big_table AS t1
CROSS JOIN
big_table AS t2 ON t1.dst = t2.src
WHERE
A = 'insert source (A) here' AND
B = 'insert destination (B) here'
ORDER BY
total_distance ASC
LIMIT
1
The above snippet will work for the case in which you have two rows in the form A->x and x->B but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combiantions should be trivial (e.g. create a view that duplicates each row and swaps src and dest).

Retrieving object data from multiple tables help

Sorry if I'm not wording this too well, but let me try to explain what I'm doing. I have a main object of class A, that has multiple objects of classes B, C, D and E.
such that:
Class ObjectA
{
ObjectB[] myObjectBs;
ObjectC[] myObjectCs;
ObjectD[] myObjectDs;
ObjectE[] myObjectEs;
}
where A---B mapping is 1 to many, for B, C, D and E. That is, all B,C,D,E objects are associated with only one object A.
I'm storing the data for all these objects in a database, with Table A holding all the data for the instances of Class A, etc.
Now, when getting the data for this at run time on the fly, I'm running 5 different queries for each object.
(very simplified psuedocode)
objectA=sql("select * from tableA where id=#id#");
objectA.setObjectBs(sql("select * from tableB where a_id=#id#");
objectA.setObjectCs(sql("select * from tableC where a_id=#id#");
objectA.setObjectDs(sql("select * from tableD where a_id=#id#");
objectA.setObjectEs(sql("select * from tableE where a_id=#id#");
if that makes sense.
Now, I'm wondering, is this the most efficient way of doing it? I feel like there should be a way to get all this info in 1 query, but doing something like "select * from a,b,c,d,e where a.id = #id# and b.a_id = #id# and c.a_id = #id# and d.a_id = #id# and e.a_id = #id#" will give a result set with all the columns of A,B,C,D,E for each row, and there will be many many more rows that I'd be needing.
If there was only one array of objects (like just ObjectBs) it could be done with a simple join and then handled by my database framework. If the relationships were A(one)....B(many) and B(one)....C(many) it could be done with two joins and work. But for A(one)....B(many) and A(one)....C(many) etc I can't think of a good way to do joins or return this data without having too many rows, as with joins if A has 10 Bs and 10Cs, it'll return 100 rows rather than 20.
So, is the way I'm currently doing it, with 5 different selects, the most efficient (which it seems like its not), or is there a better way of doing it?
Also, If I were to grab a large set of these at once (say, 5000 ObjectAs and all the associated Bs, Cs, Ds, and Es), would there be a way to do it without running a ton of consecutive queries one after the other?

You can try iBatis using N+1 Select Lists
http://ibatis.apache.org/docs/dotnet/datamapper/ch03s05.html
Hth.

There is a huge performance issue with N+1 selects (check https://github.com/balak1986/prime-cache-support-in-ibatis2/blob/master/README). So please don't use it unless there is no other way of achieving this.
Luckily iBatis has groupBy property, that is created exactly to map data for these kind of complex object.
Check the example from http://www.cforcoding.com/2009/06/ibatis-tutorial-aggregation-with.html

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas