SQL group by under some conditions

I have a big table with tons of duplicate rows (with respect to the columns I care about). Let me start with the following example:
| field1 | field2 | field3 | field4 | field5 |
| aa     | 1      | NULL   | 1      | 0      |
| aaa    | 1      | NULL   | 1      | 1      |
| aaa    | 1      | NULL   | 1      | 2      |
| a      | 2      | 0      | 1      | 3      |
| a      | 2      | 0      | NULL   | 4      |
| a      | 2      | NULL   | 2      | 5      |
| b      | 3      | NULL   | 2      | 6      |
| b2     | 3      | NULL   | NULL   | 7      |
| c      | 4      | NULL   | NULL   | 8      |
I am interested in an efficient query to get the following table:
| field1 | field2 | field3 | field4 |
| aaa    | 1      | NULL   | 1      |
| a      | 2      | 0      | 1      |
| b      | 3      | NULL   | 2      |
| c      | 4      | NULL   | NULL   |
Basically, it follows these rules:
for each value of field2, there should be one and exactly one row present
among all the rows with the same value of field2, select the row that satisfies the following, in order:
select a row whose field4 is not NULL (if possible)
among those that have a non-NULL value for field4, select a row that has a non-NULL value for field3
among those that have non-NULL values for field4 and field3, select the row that has the longest string value for field1
among those that satisfy all of the above, select only one row (it does not matter what the value of field5 is).
I could do it with a bunch of joins, but that becomes very slow. Any better suggestions?
EDIT
The field2 values may not be in a specific order. I just used 1, 2, 3, 4 in the example, but this is not generally true in my case. I did not change it directly in the table since one of the suggested solutions actually relies on sequential values for field2, so I kept it for future readers who may be interested in that.

This type of prioritization is challenging. I think the simplest method in MySQL uses variables:
select t.*
from (select t.*,
             (@rn := if(@f2 = field2, @rn + 1,
                        if(@f2 := field2, 1, 1)
                       )
             ) as seqnum
      from t cross join
           (select @rn := 0, @f2 := '') params
      order by field2,
               (field4 is not null) desc,
               (field3 is not null) desc,
               length(field1) desc
     ) t
where seqnum = 1;
I'm not 100% sure I have the conditions right (the third seems to conflict with the first two). But whatever the prioritization, the idea is the same: use order by to get the rows in the right order and use variables to get the first one.
EDIT:
In SQL Server -- or any other reasonable database -- you do this with row_number():
select t.*
from (select t.*,
             row_number() over (partition by field2
                                order by (case when field4 is not null then 0 else 1 end),
                                         (case when field3 is not null then 0 else 1 end),
                                         len(field1) desc
                               ) as seqnum
      from t
     ) t
where seqnum = 1;
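For what it's worth, MySQL 8.0+ also supports window functions, so a row_number() version should work there too; a minimal sketch, assuming the same table t as above:

select t.*
from (select t.*,
             -- boolean expressions sort as 0/1 in MySQL, so desc puts non-NULL first
             row_number() over (partition by field2
                                order by (field4 is not null) desc,
                                         (field3 is not null) desc,
                                         length(field1) desc
                               ) as seqnum
      from t
     ) t
where seqnum = 1;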

Related

Unexpected effect of filtering on result from crosstab() query with multiple values

I have a crosstab() query similar to the one in my previous question:
Unexpected effect of filtering on result from crosstab() query
The common case is to filter the extra1 field with multiple values: extra1 IN (value1, value2, ...). For each value included in the extra1 filter, I have added an ordering expression like (extra1 <> valueN), as appears in the above-mentioned post. The resulting query is as follows:
SELECT *
FROM crosstab(
    'SELECT row_name, extra1, extra2..., another_table.category, value
     FROM table t
     JOIN another_table ON t.field_id = another_table.field_id
     WHERE t.field = certain_value AND t.extra1 IN (val1, val2, ...) --> more values
     ORDER BY row_name ASC, (extra1 <> val1), (extra1 <> val2), ...', --> more ordering expressions
    'SELECT category_name FROM category_name WHERE field = certain_value'
) AS ct(extra1, extra2...)
WHERE extra1 = val1; --> condition on the result
The first value of extra1 included in the ordering expressions, val1, gets the correct resulting rows. However, the following ones (val2, val3, ...) return the wrong number of results, with fewer rows for each. Why is that?
UPDATE:
Giving this as our source table (table t):
+----------+--------+--------+------------------------+-------+
| row_name | Extra1 | Extra2 | another_table.category | value |
+----------+--------+--------+------------------------+-------+
| Name1    | 10     | A      | 1                      | 100   |
| Name2    | 11     | B      | 2                      | 200   |
| Name3    | 12     | C      | 3                      | 150   |
| Name2    | 11     | B      | 3                      | 150   |
| Name3    | 12     | C      | 2                      | 150   |
| Name1    | 10     | A      | 2                      | 100   |
| Name3    | 12     | C      | 1                      | 120   |
+----------+--------+--------+------------------------+-------+
And this as our category table:
+-------------+--------+
| category_id | name   |
+-------------+--------+
| 1           | Cat1   |
| 2           | Cat2   |
| 3           | Cat3   |
+-------------+--------+
Using the CROSSTAB, the idea is to get a table like this:
+----------+--------+--------+------+------+------+
| row_name | Extra1 | Extra2 | cat1 | cat2 | cat3 |
+----------+--------+--------+------+------+------+
| Name1    | 10     | A      | 100  | 100  |      |
| Name2    | 11     | B      |      | 200  | 150  |
| Name3    | 12     | C      | 120  | 150  | 150  |
+----------+--------+--------+------+------+------+
The idea is to be able to filter the resulting table so I get the rows whose Extra1 column has value 10 or 11, as follows:
+----------+--------+--------+------+------+------+
| row_name | Extra1 | Extra2 | cat1 | cat2 | cat3 |
+----------+--------+--------+------+------+------+
| Name1    | 10     | A      | 100  | 100  |      |
| Name2    | 11     | B      |      | 200  | 150  |
+----------+--------+--------+------+------+------+
The problem is that in my query I get a different result size for Extra1 = 10 than for Extra1 = 11. With (Extra1 <> 10) I can get the correct result size for that value, but not in the case of 11.
Here is a fiddle demonstrating the problem in more detail:
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=5c401f7512d52405923374c75cb7ff04
All "extra" columns are copied from the first row of the group (as pointed out in my previous answer)
While you filter with:
.... WHERE extra1 = 'val1';
...it makes no sense to add more ORDER BY expressions on the same column. Only rows that have at least one extra1 = 'val1' in their source group survive.
From your various comments, I guess you might want to see all distinct existing values of extra - within the set filtered in the WHERE clause - for the same unixdatetime. If so, aggregate before pivoting. Like:
SELECT *
FROM   crosstab(
   $$
   SELECT unixdatetime, x.extras, c.name, s.value
   FROM  (
      SELECT unixdatetime, array_agg(extra) AS extras
      FROM  (
         SELECT DISTINCT unixdatetime, extra
         FROM   source_table s
         WHERE  extra IN (1, 2)  -- condition moves here
         ORDER  BY unixdatetime, extra
         ) sub
      GROUP  BY 1
      ) x
   JOIN   source_table s USING (unixdatetime)
   JOIN   category_table c ON c.id = s.gausesummaryid
   ORDER  BY 1
   $$
 , $$SELECT unnest('{trace1,trace2,trace3,trace4}'::text[])$$
   ) AS final_result (unixdatetime int
                    , extras int[]
                    , trace1 numeric
                    , trace2 numeric
                    , trace3 numeric
                    , trace4 numeric);
Aside: advice given in the following related answer about the 2nd function parameter applies to your case as well:
PostgreSQL crosstab doesn't work as desired
I demonstrate a query with a static 2nd parameter above. While we're at it, you don't need to join to category_table at all. The same result, a bit shorter and faster:
SELECT *
FROM   crosstab(
   $$
   SELECT unixdatetime, x.extras, s.gausesummaryid, s.value
   FROM  (
      SELECT unixdatetime, array_agg(extra) AS extras
      FROM  (
         SELECT DISTINCT unixdatetime, extra
         FROM   source_table
         WHERE  extra IN (1, 2)  -- condition moves here
         ORDER  BY unixdatetime, extra
         ) sub
      GROUP  BY 1
      ) x
   JOIN   source_table s USING (unixdatetime)
   ORDER  BY 1
   $$
 , $$SELECT unnest('{923,924,926,927}'::int[])$$
   ) AS final_result (unixdatetime int
                    , extras int[]
                    , trace1 numeric
                    , trace2 numeric
                    , trace3 numeric
                    , trace4 numeric);
db<>fiddle here - added my queries at the bottom of your fiddle.

How do I do an Oracle SQL update from select in a specific order?

I have a table with old values (some null) and new values for various attributes, all inserted at different add times throughout the months. I'm trying to update a second table containing records with business month end dates. Right now, these records only contain the most recent new values for all month end dates. The goal is to create historical data by updating the previous month end values with the old values from the first table.

I am a beginner and was able to come up with a query to update one object where there was one entry in the first table. Now I am trying to expand the query to include multiple objects, with possibly multiple old values within the same month. I tried to use "order by" (since I need to make updates for a month in ascending order so it gets the latest value) but read that it doesn't work in update statements without a subquery. So I tried my hand at a more complicated query, without success. I am getting the following error: single-row subquery returns more than one row. Thanks!
TableA:
| ID | TYPE | OLD_VALUE | NEW_VALUE | ADD_TIME             |
------------------------------------------------------------
| 1  | A    | 2         | 3         | 1/11/2019 8:00:00am  |
| 1  | B    | 3         | 4         | 12/10/2018 8:00:00am |
| 1  | B    | 4         | 5         | 12/11/2018 8:00:00am |
| 2  | A    | 5         | 1         | 12/5/2018 08:00:00am |
| 2  | A    | 1         | 2         | 12/5/2019 09:00:00am |
| 2  | A    | 2         | 3         | 12/5/2019 10:00:00am |
| 2  | B    | 1         | 2         | 12/5/2019 10:00:00am |
TableB:
| ID | MONTH_END | TYPE_A | TYPE_B |
-----------------------------------
| 1  | 1/31/19   | 3      | 5      |
| 1  | 12/31/18  | 3      | 5      |
| 1  | 11/30/18  | 3      | 5      |
| 2  | 12/31/18  | 3      | 2      |
| 2  | 11/30/18  | 3      | 2      |
Desired Output for TableB
| ID | MONTH_END | TYPE_A | TYPE_B |
-----------------------------------
| 1  | 1/31/19   | 3      | 5      |
| 1  | 12/31/18  | 2      | 5      |
| 1  | 11/30/18  | 2      | 3      |
| 2  | 12/31/18  | 3      | 2      |
| 2  | 11/30/18  | 5      | 2      |
My query for Type A (which I plan to adapt for Type B and execute as well for the desired output):
update TableB b
set b.type_a =
(
    with aa as
    (
        select id, nvl(old_value, new_value) typea, add_time
        from TableA
        where type = 'A'
        order by id, add_time asc
    )
    select typea
    from aa
    where aa.id = b.id
    and b.month_end <= aa.add_time
)
where exists
(
    with aa as
    (
        select id, nvl(old_value, new_value) typea, add_time
        from TableA
        where type = 'A'
        order by id, add_time asc
    )
    select typea
    from aa
    where aa.id = b.id
    and b.month_end <= aa.add_time
)
Kudos for giving example input data and desired output. I found your question a bit confusing, so let me rephrase it as: "Provide the last Type A value from TableA that is in the same month as the month end."
By matching on type and date of entry, we can get your answer. The "ROWNUM = 1" is there to limit the result set to a single entry in case more than one row has the same add_time. This SQL is still a mess; maybe someone else can come up with a better one.
UPDATE tableb b
SET b.type_a =
       (SELECT old_value
        FROM tablea a
        WHERE LAST_DAY( TRUNC( a.add_time ) ) = b.month_end
          AND a.type = 'A'
          AND a.add_time =
              (SELECT MAX( a2.add_time )
               FROM tablea a2
               WHERE a2.type = 'A'
                 AND LAST_DAY( TRUNC( a2.add_time ) ) = b.month_end)
          AND ROWNUM = 1)
WHERE EXISTS
       (SELECT old_value
        FROM tablea a
        WHERE LAST_DAY( TRUNC( a.add_time ) ) = b.month_end
          AND a.type = 'A');
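If ties on add_time are not a concern, a tidier variant (a sketch, assuming Oracle 9i+ aggregate syntax) uses MAX(...) KEEP (DENSE_RANK LAST ...) to pick the value from the latest-added row per month in a single pass, without ROWNUM:

-- Sketch: KEEP (DENSE_RANK LAST ORDER BY add_time) returns old_value
-- from the row(s) with the latest add_time in each month group.
UPDATE tableb b
SET b.type_a =
       (SELECT MAX( a.old_value ) KEEP ( DENSE_RANK LAST ORDER BY a.add_time )
        FROM tablea a
        WHERE LAST_DAY( TRUNC( a.add_time ) ) = b.month_end
          AND a.type = 'A')
WHERE EXISTS
       (SELECT 1
        FROM tablea a
        WHERE LAST_DAY( TRUNC( a.add_time ) ) = b.month_end
          AND a.type = 'A');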

SQL select distinct when one column in and another column greater than

Consider the following dataset:
+----+------+-------+
| ID | NAME | VALUE |
+----+------+-------+
| 1  | a    | 0.2   |
| 1  | b    | 8     |
| 1  | c    | 3.5   |
| 1  | d    | 2.2   |
| 2  | b    | 4     |
| 2  | c    | 0.5   |
| 2  | d    | 6     |
| 3  | a    | 2     |
| 3  | b    | 4     |
| 3  | c    | 3.6   |
| 3  | d    | 0.2   |
+----+------+-------+
I'm trying to develop a SQL select statement that returns the top or distinct ID where NAME values 'a' and 'b' both exist and both of the corresponding VALUEs are >= 1. Thus, the desired output would be:
+----+------+-------+
| ID | NAME | VALUE |
+----+------+-------+
| 3  | a    | 2     |
+----+------+-------+
Appreciate any assistance anyone can provide.
You can try using the MIN window function and some conditions to make it work.
SELECT * FROM (
    SELECT *,
           MIN(CASE WHEN NAME = 'a' THEN [value] END) OVER (PARTITION BY ID) aVal,
           MIN(CASE WHEN NAME = 'b' THEN [value] END) OVER (PARTITION BY ID) bVal
    FROM T
) t1
WHERE aVal >= 1 AND bVal >= 1 AND aVal = [Value]
sqlfiddle
This seems like a group by and having query:
select id
from t
where name in ('a', 'b')
group by id
having count(*) = 2 and
       min(value) >= 1;
No subqueries or joins are necessary.
The where clause filters the data to only look at the "a" and "b" records. The count(*) = 2 checks that both exist. If you can have duplicates, then use count(distinct name) = 2.
Then, you want the minimum value to be at least 1, so that is the final condition.
I am not sure why your desired results have the "a" row, but if you really want it, you can change the select to:
select id, 'a' as name,
max(case when name = 'a' then value end) as value
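Assembled into one statement (a sketch against the same table t as above), that looks like:

select id, 'a' as name,
       max(case when name = 'a' then value end) as value
from t
where name in ('a', 'b')
group by id
having count(distinct name) = 2 and min(value) >= 1;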
You can use IN and a sub-query:
select top 1 *
from t
where t.id in
(
    select id
    from t
    where name in ('a', 'b')
    group by id
    having sum(case when value >= 1 then 1 else 0 end) >= 2
)
order by id

Efficient ROW_NUMBER increment when column matches value

I'm trying to find an efficient way to derive the column Expected below from only Id and State. What I want is for the number Expected to increase each time State is 0 (ordered by Id).
+----+-------+----------+
| Id | State | Expected |
+----+-------+----------+
| 1 | 0 | 1 |
| 2 | 1 | 1 |
| 3 | 0 | 2 |
| 4 | 1 | 2 |
| 5 | 4 | 2 |
| 6 | 2 | 2 |
| 7 | 3 | 2 |
| 8 | 0 | 3 |
| 9 | 5 | 3 |
| 10 | 3 | 3 |
| 11 | 1 | 3 |
+----+-------+----------+
I have managed to accomplish this with the following SQL, but the execution time is very poor when the data set is large:
WITH Groups AS
(
    SELECT Id, ROW_NUMBER() OVER (ORDER BY Id) AS GroupId
    FROM tblState
    WHERE State = 0
)
SELECT S.Id, S.[State], S.Expected, G.GroupId
FROM tblState S
OUTER APPLY (SELECT TOP 1 GroupId
             FROM Groups
             WHERE Groups.Id <= S.Id
             ORDER BY Id DESC) G
Is there a simpler and more efficient way to produce this result? (In SQL Server 2012 or later)
Just use a cumulative sum:
select s.*,
sum(case when state = 0 then 1 else 0 end) over (order by id) as expected
from tblState s;
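One performance note, offered as an assumption to verify on your data: with ORDER BY and no explicit frame, SQL Server gives a windowed SUM a RANGE frame by default, and spelling out a ROWS frame is typically faster:

select s.*,
       -- explicit ROWS frame avoids the slower default RANGE frame
       sum(case when state = 0 then 1 else 0 end)
           over (order by id
                 rows between unbounded preceding and current row) as expected
from tblState s;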
Another method uses a correlated subquery:
select t.*,
       (select count(*)
        from tblState t1
        where t1.id <= t.id and t1.state = 0
       ) as expected
from tblState t;

Set-based way to calculate family ranges in SQL?

I have a table that contains parents and 0 or more children for each parent, with a flag indicating which records are parents. All of the members of a given family have the same parent id, and the parent always has the lowest id in a given family. Also, each child has a value associated with it. (Specifically, this is a database of emails and attachments, where each parent is an email and the children are the attachments.)
I have two fields I need to calculate:
Range = {lowest id in family} - {highest id in family} [populated for all members]
Value-list = {delimited list of the values of each child, in id order} [only for parent]
So, given this:
Id | Parent | HasChildren | Value | Range | Value-list
---|--------|-------------|-------|-------|-----------
 1 |      1 |           1 |       |       |
 2 |      1 |           0 | a     |       |
 3 |      1 |           0 | b     |       |
 4 |      4 |           1 |       |       |
 5 |      4 |           0 | c     |       |
 6 |      6 |           0 |       |       |
I would like to end up with this:
Id | Parent | HasChildren | Value | Range | Value-list
---|--------|-------------|-------|-------|-----------
 1 |      1 |           1 |       | 1-3   | a;b
 2 |      1 |           0 | a     | 1-3   |
 3 |      1 |           0 | b     | 1-3   |
 4 |      4 |           1 |       | 4-5   | c
 5 |      4 |           0 | c     | 4-5   |
 6 |      6 |           0 |       | 6-6   |
How can I do this efficiently? Ideally, I'd like to do this with just set-based logic, without cursors, or even stored procedures. Temporary tables are fine.
I'm working in T-SQL, if that makes a difference, though I'd be curious to see platform agnostic answers.
The following SQLFiddle solution should do the job for you; however, as @Allan mentioned, you might want to revise your database structure.
Using CTEs:
Note: my query uses table1 as the name of your table.
with cte as (
    select parent,
           ValueList = stuff((select ';' + t2.Value
                              from table1 t2
                              where t1.parent = t2.parent
                                and t2.Value is not null
                              order by t2.id
                              for xml path(''), type
                             ).value('.', 'NVARCHAR(MAX)'), 1, 1, '')
    from table1 t1
    group by parent
),
cte2 as (
    select parent,
           min(id) as FirstID,
           max(id) as LastID
    from table1
    group by parent
)
select t1.*,
       (select cast(t2.FirstID as varchar(10)) + '-' + cast(t2.LastID as varchar(10))
        from cte2 t2
        where t2.parent = t1.parent) as [Range],
       (select ValueList
        from cte t2
        where t1.parent = t2.parent
          and t1.HasChildren = 1) as [Value-List]
from table1 t1
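On SQL Server 2017 or later, a shorter sketch is possible with STRING_AGG, which skips NULL values and replaces the FOR XML PATH trick (table and column names assumed as in the question):

select t.*,
       cast(x.FirstID as varchar(10)) + '-' + cast(x.LastID as varchar(10)) as [Range],
       case when t.HasChildren = 1 then x.ValueList end as [Value-List]
from table1 t
cross apply (
    -- one aggregate pass per family (rows sharing t.parent)
    select min(id) as FirstID,
           max(id) as LastID,
           string_agg(Value, ';') within group (order by id) as ValueList
    from table1 t2
    where t2.parent = t.parent
) x;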