SQL, how to group rows based on field values - sql

I have a question about grouping query results. The image shows an example: a cable list. Each cable has two attributes, a 'From' location and a 'To' location. Grouping the cable list by location turns out to be tricky.
When we group the data by the two locations, the results land in two groups:
'A->B',
'B->A'
In reality it makes more sense to combine these two groups into one, 'From<->To', that is, the list of cables between two locations.
I thought of adding one more field to mark the cables between two locations, but couldn't come up with a way to do it.
Thank you for sharing your thoughts.
Regards,
Roland

You can group by the least and the greatest of the two columns.
Use whatever functions your database provides for this.
The standard SQL way is a pair of CASE expressions:
GROUP BY CASE WHEN A < B THEN A ELSE B END,
CASE WHEN A < B THEN B ELSE A END
Or maybe you can use a function like IF() or IIF():
GROUP BY IF(A < B, A, B),
IF(A < B, B, A)
or functions like LEAST() and GREATEST():
GROUP BY LEAST(A, B),
GREATEST(A, B)
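To see the effect on actual rows, here is a minimal sketch using Python's built-in sqlite3 module. The table name cables, the columns loc_from and loc_to, and the sample rows are made up for illustration. It uses the CASE form, since SQLite does not ship LEAST()/GREATEST().

```python
import sqlite3

# Hypothetical cable list; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cables (cable_id TEXT, loc_from TEXT, loc_to TEXT)")
conn.executemany(
    "INSERT INTO cables VALUES (?, ?, ?)",
    [("c1", "A", "B"), ("c2", "B", "A"), ("c3", "A", "B"),
     ("c4", "A", "C"), ("c5", "C", "A")],
)

# Normalise each pair so the smaller location always comes first,
# then group on the normalised pair: A->B and B->A fall together.
rows = conn.execute("""
    SELECT CASE WHEN loc_from < loc_to THEN loc_from ELSE loc_to END AS end_1,
           CASE WHEN loc_from < loc_to THEN loc_to ELSE loc_from END AS end_2,
           COUNT(*) AS cable_count
    FROM cables
    GROUP BY end_1, end_2
    ORDER BY end_1, end_2
""").fetchall()

for end_1, end_2, cable_count in rows:
    print(f"{end_1}<->{end_2}: {cable_count} cables")
```

With this sample data the A->B and B->A cables end up in a single A<->B group, and likewise for A and C.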

Related

Adding a "calculated column" to BigQuery query without repeating the calculations

I want to reuse the values of calculated columns in a new third column.
For example, this query works:
select
countif(cond1) as A,
countif(cond2) as B,
countif(cond1)/countif(cond2) as prct_pass
From
Where
Group By
But when I try to use A,B instead of repeating the countif, it doesn't work because A and B are invalid:
select
countif(cond1) as A,
countif(cond2) as B,
A/B as prct_pass
From
Where
Group By
Can I somehow make the more readable second version work?
Is the first one inefficient?
You should construct a subquery (i.e. a double select) like
SELECT A, B, A/B as prct_pass
FROM
(
SELECT countif(cond1) as A,
countif(cond2) as B
FROM <yourtable>
)
The same amount of data will be processed in both queries.
In the subquery version you only compute 2 countif() calls; if that step takes a long time, doing 2 instead of 4 should indeed be more efficient.
Looking at an example using BigQuery public datasets:
SELECT
countif(homeFinalRuns>3) as A,
countif(awayFinalRuns>3) as B,
countif(homeFinalRuns>3)/countif(awayFinalRuns>3) as division
FROM `bigquery-public-data.baseball.games_post_wide`
or
SELECT A, B, A/B as division FROM
(
SELECT countif(homeFinalRuns>3) as A,
countif(awayFinalRuns>3) as B
FROM `bigquery-public-data.baseball.games_post_wide`
)
We can see that doing it all in one query (without a subquery) is actually slightly faster. (I ran the queries 6 times with different values in the inequality; the single query was faster 5 times and slower once.)
In any case, the efficiency will depend on how expensive the condition is to compute on your particular dataset.
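If you want to check that the two shapes give the same answer before worrying about speed, here is a small sketch with Python's sqlite3 module. The games table and its values are invented for the example, and SUM(condition) stands in for BigQuery's countif, which SQLite does not have. It says nothing about BigQuery timings.

```python
import sqlite3

# Hypothetical stand-in for the baseball table; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (home_runs INTEGER, away_runs INTEGER)")
conn.executemany("INSERT INTO games VALUES (?, ?)",
                 [(5, 2), (4, 6), (1, 7), (8, 3)])

# Version 1: the conditional counts are repeated in the ratio.
repeated = conn.execute("""
    SELECT SUM(home_runs > 3) AS a,
           SUM(away_runs > 3) AS b,
           1.0 * SUM(home_runs > 3) / SUM(away_runs > 3) AS division
    FROM games
""").fetchone()

# Version 2: the counts are computed once in a subquery and
# the outer query reuses them through their aliases.
nested = conn.execute("""
    SELECT a, b, 1.0 * a / b AS division
    FROM (
        SELECT SUM(home_runs > 3) AS a,
               SUM(away_runs > 3) AS b
        FROM games
    )
""").fetchone()

print(repeated, nested)
```

The 1.0 * factor is only there because SQLite would otherwise do integer division; BigQuery's / operator already returns a floating-point result.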

Nested SQL evaluation question with unnest

This may be a basic question, but I just couldn't figure it out. The sample data and query can be found here (under the "First-touch" tab).
I'll skip the marketing terminology, but basically the query attributes credits/points to placements (ads) based on a certain rule. Here the rule is "first-touch", which means the credit goes to the first ad the user interacted with, whether a view or a click. "FLOODLIGHT" here means the user took action to actually buy the product (a conversion).
As you can see in the sample data, user 1 has one conversion and the first ad is placement 22 (first-touch), so 22 gets 1 point. User 2 has two conversions and the first ad of each is 11, so 11 gets 2 points.
The logic is quite simple, but I had a difficult time understanding the query itself. What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time? Aren't they essentially the same? Both of them come from UNNEST(t.*_paths.events), and attributed_event.event_time comes from the same place too.
What do prev_conversion_event.event_time, conversion_event.event_time, and attributed_event.event_time evaluate to in this scenario anyway? I'm just confused as hell here. I'd much appreciate the help!
For convenience I'm pasting the sample data, the query and the output below:
Sample data
Output
/* Substitute *_paths for the specific paths table that you want to query. */
SELECT
(
SELECT
attributed_event_metadata.placement_id
FROM (
SELECT
AS STRUCT attributed_event.placement_id,
ROW_NUMBER() OVER(ORDER BY attributed_event.event_time ASC) AS rank
FROM
UNNEST(t.*_paths.events) AS attributed_event
WHERE
attributed_event.event_type != "FLOODLIGHT"
AND attributed_event.event_time < conversion_event.event_time
AND attributed_event.event_time > (
SELECT
IFNULL( (
SELECT
MAX(prev_conversion_event.event_time) AS event_time
FROM
UNNEST(t.*_paths.events) AS prev_conversion_event
WHERE
prev_conversion_event.event_type = "FLOODLIGHT"
AND prev_conversion_event.event_time < conversion_event.event_time),
0)) ) AS attributed_event_metadata
WHERE
attributed_event_metadata.rank = 1) AS placement_id,
COUNT(*) AS credit
FROM
adh.*_paths AS t,
UNNEST(*_paths.events) AS conversion_event
WHERE
conversion_event.event_type = "FLOODLIGHT"
GROUP BY
placement_id
HAVING
placement_id IS NOT NULL
ORDER BY
credit DESC
It is quite a convoluted query, to be fair. I think I know what you are asking; please correct me if that's not the case.
What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time?
You are doing something like "I want all the events from this (unnest), and for every event, I want to know which events are its predecessors".
Say you have [A, B, C, D] and they are ordered in succession (A happened before B, A and B happened before C, and so on). Unnesting and joining on that condition gets you something like [A:(NULL), B:(A), C:(A, B), D:(A, B, C)] (excuse the notation, I hope it is not confusing), where each key:value pair is Event:(Predecessors). Note that A has no events before it, but B has A, and so on.
Now you have a nice table with all the conversion events joined with the events that happened before each one.
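To make the role of the three aliases more concrete, here is a rough plain-Python sketch of what the query appears to do, going only by the SQL text in the question. The event dictionaries, the VIEW/CLICK type names and the timestamps are invented; only the rule itself (earlier than this conversion, later than the previous conversion, earliest one wins) is taken from the query.

```python
from collections import Counter

def first_touch_credits(paths):
    """Credit the earliest non-conversion event before each conversion."""
    credits = Counter()
    for events in paths:                      # one path (user) per row of t
        for conv in events:                   # plays the role of conversion_event
            if conv["event_type"] != "FLOODLIGHT":
                continue
            # prev_conversion_event: latest conversion strictly before this one,
            # falling back to 0 like IFNULL(..., 0) in the query.
            prev_times = [e["event_time"] for e in events
                          if e["event_type"] == "FLOODLIGHT"
                          and e["event_time"] < conv["event_time"]]
            window_start = max(prev_times, default=0)
            # attributed_event: ads between the previous and current conversion.
            candidates = [e for e in events
                          if e["event_type"] != "FLOODLIGHT"
                          and window_start < e["event_time"] < conv["event_time"]]
            if candidates:                    # rank = 1 means earliest event_time
                first = min(candidates, key=lambda e: e["event_time"])
                credits[first["placement_id"]] += 1
    return credits

# Hypothetical paths shaped like the description in the question.
user_1 = [
    {"event_type": "VIEW", "placement_id": 22, "event_time": 1},
    {"event_type": "CLICK", "placement_id": 33, "event_time": 2},
    {"event_type": "FLOODLIGHT", "placement_id": None, "event_time": 3},
]
user_2 = [
    {"event_type": "VIEW", "placement_id": 11, "event_time": 1},
    {"event_type": "FLOODLIGHT", "placement_id": None, "event_time": 2},
    {"event_type": "CLICK", "placement_id": 11, "event_time": 3},
    {"event_type": "VIEW", "placement_id": 22, "event_time": 4},
    {"event_type": "FLOODLIGHT", "placement_id": None, "event_time": 5},
]

credits = first_touch_credits([user_1, user_2])
print(credits.most_common())
```

So the three names all iterate over the same events array, but each one is a separate loop over it: one picks the conversion, one finds the conversion before it, and one looks for the ads in between.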

SQL Combine two columns into one only using 1 set of data

I have duplicate data on the same row.
I have columns aCust, bCust, aPart, bPart, aSM, bSM, aSales, bSales.
I want to combine the Cust together, parts together, and SM together while keeping Sales separate. Some rows have data in both a and b, some a's are null and some b's are null. How do I combine this? If there is data in both a and b, it is always identical (except for sales).
Try this query. It looks weird but will do the job. You didn't specify what RDBMS you're using (Oracle, MySQL, SQL Server, etc.). That's why I didn't use anything like ISNULL.
select
case when aCust is null then bCust else aCust end as Cust,
case when aPart is null then bPart else aPart end as Part,
case when aSM is null then bSM else aSM end as SM,
aSales, bSales
from
tbl
You can do this with coalesce():
select coalesce(aCust, bCust) as Cust,
coalesce(aPart, bPart) as Part,
coalesce(aSM, bSM) as SM,
aSales, bSales
from tbl;
This will choose the first non-NULL value to return for each field.
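A quick way to try this out is Python's sqlite3 module, which also supports COALESCE. The sample rows below are invented: one row with both sides filled, one with only the b side, and one with only the a side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl (aCust TEXT, bCust TEXT, aPart TEXT, bPart TEXT,
                      aSM TEXT, bSM TEXT, aSales REAL, bSales REAL)
""")
# Hypothetical rows covering the three cases described in the question.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [("Acme", "Acme", "P1", "P1", "Ann", "Ann", 100.0, 50.0),
     (None, "Bolt", None, "P2", None, "Bob", None, 75.0),
     ("Cord", None, "P3", None, "Cy", None, 20.0, None)],
)

# COALESCE returns its first non-NULL argument, so each merged column
# takes the a value when present and the b value otherwise.
rows = conn.execute("""
    SELECT COALESCE(aCust, bCust) AS Cust,
           COALESCE(aPart, bPart) AS Part,
           COALESCE(aSM, bSM)     AS SM,
           aSales, bSales
    FROM tbl
    ORDER BY Cust
""").fetchall()

for row in rows:
    print(row)
```

The sales columns are left untouched, so a NULL on one side stays NULL in the output.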

Sum two counts in a new column without repeating the code

I have one maybe stupid question.
Look at the query :
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code:
Something like:
select count(a) as A, count(b) as b, A+B as C
From X
For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V
This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T
You may get the two individual counts from the same table and then add those counts together, like below
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C
Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural language, your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language your main concern is: is this the best description of the desired result?
In SQL, most of your effort will go into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formatting (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful about making cosmetic changes, meaning changes that do not alter the description of the data. If you look at the execution plans of your original query, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to explain what it originally took only one step in one line to express, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.
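For anyone who wants to convince themselves that the three forms describe the same data, here is a small sketch using Python's sqlite3 module. Table X and its rows are made up, and the aliases are renamed to cnt_a, cnt_b and total so they cannot be confused with the columns a and b. It only compares results; execution plans are engine-specific and not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (a INTEGER, b INTEGER)")
# Hypothetical rows; COUNT(col) skips the NULLs.
conn.executemany("INSERT INTO X VALUES (?, ?)",
                 [(1, 10), (2, None), (None, 30), (4, None)])

queries = {
    # Original form: the counts are written out twice.
    "repeated": """
        SELECT COUNT(a) AS cnt_a, COUNT(b) AS cnt_b,
               COUNT(a) + COUNT(b) AS total
        FROM X
    """,
    # CTE form: aliases defined once, reused in the outer select.
    "cte": """
        WITH V AS (SELECT COUNT(a) AS cnt_a, COUNT(b) AS cnt_b FROM X)
        SELECT cnt_a, cnt_b, cnt_a + cnt_b AS total
        FROM V
    """,
    # Derived-table form: same idea with a nested select.
    "subquery": """
        SELECT cnt_a, cnt_b, cnt_a + cnt_b AS total
        FROM (SELECT COUNT(a) AS cnt_a, COUNT(b) AS cnt_b FROM X) T
    """,
}

results = {name: conn.execute(sql).fetchone() for name, sql in queries.items()}
for name, row in results.items():
    print(name, row)
```

All three return the same single row here, which is the "same car, different color" point in practice.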

Using NVL for multiple columns - Oracle SQL

Good morning my beloved sql wizards and sorcerers,
I want to substitute values across 3 columns of data in 3 tables. Currently I am using the NVL function, but that is restricted to two columns.
See below for an example:
SELECT ccc.case_id,
NVL (ccvl.descr, ccc.char) char_val
FROM case_char ccc, char_value ccvl, lookup_value lval1
WHERE
ccvl.descr(+) = ccc.value
AND ccc.value = lval1.descr (+)
AND ccc.case_id IN ('123')
case_char table
case_id | char  | value
123     | email | work_email
124     | issue | tim_
char_value table
char       | descr
work_email | complaint mail
tim_       | timeliness
lookup_value table
descr      | descrlong
work_email | xxx#blah.com
Essentially what I am trying to do is if there exists a match for case_char.value with lookup_value.descr then display it, if not, then if there exists a match with case_char.value and char_value.char then display it.
I am just trying to return the description for 'issue' from the char_value table, but for 'email' I want to return the descrlong from the lookup_value table (all under the same alias 'char_val').
So my question is, how do I achieve this keeping in mind that I want them to appear under the same alias.
Let me know if you require any further information.
Thanks guys
You could nest NVL:
NVL(a, NVL(b, NVL(c, d)))
But even better, use the SQL-standard COALESCE, which does take multiple arguments and also works on non-Oracle systems:
COALESCE(a, b, c, d)
How about using COALESCE:
COALESCE(ccvl.descr, ccc.char)
It is better to use COALESCE(a, b, c, d), for the following reasons:
The nested NVL logic can be expressed in a single COALESCE(a, b, c, d).
COALESCE is standard SQL.
COALESCE can perform better: NVL always evaluates both of its arguments and then returns the second one if the first is null, whereas COALESCE checks its arguments one by one and returns as soon as it finds a non-null value, without evaluating the rest.
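Applied to the tables in the question, a three-argument COALESCE could look roughly like the sketch below. It runs on SQLite through Python's sqlite3 module as a stand-in for Oracle, uses ANSI LEFT JOIN instead of the (+) syntax, and assumes that case_char.value is meant to match char_value.char (the original query joins on ccvl.descr, so treat that join condition as a guess).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows copied from the question; "char" and "value" are quoted
# because they can clash with keywords or built-in names in some engines.
conn.executescript("""
    CREATE TABLE case_char (case_id TEXT, "char" TEXT, "value" TEXT);
    CREATE TABLE char_value ("char" TEXT, descr TEXT);
    CREATE TABLE lookup_value (descr TEXT, descrlong TEXT);
    INSERT INTO case_char VALUES ('123', 'email', 'work_email'),
                                 ('124', 'issue', 'tim_');
    INSERT INTO char_value VALUES ('work_email', 'complaint mail'),
                                  ('tim_', 'timeliness');
    INSERT INTO lookup_value VALUES ('work_email', 'xxx#blah.com');
""")

# Prefer the lookup_value match, then the char_value match,
# then fall back to the raw char from case_char.
rows = conn.execute("""
    SELECT ccc.case_id,
           COALESCE(lval1.descrlong, ccvl.descr, ccc."char") AS char_val
    FROM case_char ccc
    LEFT JOIN lookup_value lval1 ON lval1.descr = ccc."value"
    LEFT JOIN char_value ccvl ON ccvl."char" = ccc."value"
    ORDER BY ccc.case_id
""").fetchall()

for case_id, char_val in rows:
    print(case_id, char_val)
```

With the sample rows, case 123 picks up the descrlong from lookup_value and case 124 falls through to the descr from char_value, both under the single alias char_val.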