Cannot use group by and over(partition by) in the same query? - sql

I have a table myTable with 3 columns. col_1 is an INTEGER and the other 2 columns are DOUBLE. For example, col_1 = {1, 2} and col_2 = {0.1, 0.2, 0.3}. Each value of col_1 appears together with all the values of col_2, and col_2 can repeat within each col_1 value. The 3rd column can have any value, as shown below:
col_1 | col_2 | Value
----------------------
1 | 0.1 | 1.0
1 | 0.2 | 2.0
1 | 0.2 | 3.0
1 | 0.3 | 4.0
1 | 0.3 | 5.0
2 | 0.1 | 6.0
2 | 0.1 | 7.0
2 | 0.1 | 8.0
2 | 0.2 | 9.0
2 | 0.3 | 10.0
What I want is to apply the aggregate function SUM() to the Value column, partitioned by col_1 and grouped by col_2. The above table should then look like this:
col_1 | col_2 | sum_value
----------------------
1 | 0.1 | 1.0
1 | 0.2 | 5.0
1 | 0.3 | 9.0
2 | 0.1 | 21.0
2 | 0.2 | 9.0
2 | 0.3 | 10.0
I tried the following SQL query:
SELECT col_1, col_2, sum(Value) over(partition by col_1) as sum_value
from myTable
GROUP BY col_1, col_2
But on DB2 v10.5 it gave the following error:
SQL0119N An expression starting with "Value" specified in a SELECT
clause, HAVING clause, or ORDER BY clause is not specified in the
GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER
BY clause with a column function and no GROUP BY clause is specified.
Can you kindly point out what is wrong? I do not have much experience with SQL.
Thank you.

Yes, you can, but you need to be consistent about the grouping levels.
That is, if your query is a GROUP BY query, then inside an analytic function
you can only use "detail" columns from the "non-analytic" part of your selected
columns.
Thus, you can use either the GROUP BY columns or the non-analytic aggregates,
as in this example:
select product_id, company,
       sum(members) as No_of_Members,
       sum(sum(members)) over(partition by company) as TotalMembership
from Product_Membership
group by Product_ID, Company
Hope that helps
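For reference, applied to the table in the question the same nested-aggregate pattern would look something like the sketch below (the aliases sum_value and col_1_total are just illustrative): the inner sum(Value) is the per-(col_1, col_2) group sum, and the outer sum(...) over(partition by col_1) adds those group sums up per col_1.
SELECT col_1, col_2,
       sum(Value) as sum_value,                                  -- per (col_1, col_2) group
       sum(sum(Value)) over(partition by col_1) as col_1_total   -- total per col_1
FROM myTable
GROUP BY col_1, col_2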
SELECT col_1, col_2, sum(sum(Value)) over(partition by col_1) as sum_value
-- also try changing "col_1" to "col_2" in OVER
from myTable
GROUP BY col_2, col_1

I found the solution.
I do not need to use OVER(PARTITION BY col_1) because it is already in the GROUP BY clause. Thus, the following query gives me the right answer:
SELECT col_1, col_2, sum(Value) as sum_value
from myTable GROUP BY col_1, col_2
since I am already grouping w.r.t col_1 and col_2.
Dave, thanks, I got the idea from your post.

Related

How do I find the closest number across columns?

I have this table:
col_1 | col_2 | col_3 | compare
------+-------+-------+--------
1.1 | 2.1 | 3.1 | 2
------+-------+-------+--------
10 | 9 | 1 | 15
I want to derive a new column choice indicating the column closest to the compare value:
col_1 | col_2 | col_3 | compare | choice
------+-------+-------+---------+-------
1.1 | 2.1 | 3.1 | 2 | col_2
------+-------+-------+---------+-------
10 | 9 | 1 | 15 | col_1
Choice refers to the column whose cell value is closest to the compare value.
I think the simplest method is apply:
select t.*, v.which as choice
from t cross apply
     (select top (1) v.*
      from (values ('col_1', col_1), ('col_2', col_2), ('col_3', col_3)
           ) v(which, val)
      order by abs(v.val - t.compare)
     ) v;
In the event of ties, this returns an arbitrary closest column.
You can also use case expressions, but that gets complicated. With no NULL values:
select t.*,
       (case when abs(compare - col_1) <= abs(compare - col_2) and
                  abs(compare - col_1) <= abs(compare - col_3)
             then 'col_1'
             when abs(compare - col_2) <= abs(compare - col_3)
             then 'col_2'
             else 'col_3'
        end) as choice
from t;
In the event of ties, this returns the first column.
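If the database supports LEAST() (Postgres, MySQL, and Oracle do, for instance), here is a sketch of a slightly more compact case expression; like the one above it assumes no NULL values and returns the first column on ties:
select t.*,
       (case when abs(compare - col_1) = least(abs(compare - col_1),
                                               abs(compare - col_2),
                                               abs(compare - col_3))
             then 'col_1'
             when abs(compare - col_2) = least(abs(compare - col_2),
                                               abs(compare - col_3))
             then 'col_2'
             else 'col_3'
        end) as choice
from t;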

Find and update non-duplicated records based on one of the columns

I want to find all the non-duplicated records and update one of the columns.
Ex.
Col_1 | Col_2 | Col_3 | Col_4 | Col_5
A | AA | BB | 1 |
A | AB | BC | 2 |
A | AC | BD | 3 |
B | BB | CC | 1 |
B | BB | CC | 2 |
C | CC | DD | 1 |
My query has to group by Col_1, and I want to find the non-unique records based on Col_2 and Col_3 and then update Col_5.
Basically, the output should be as below:
Col_1 | Col_2 | Col_3 | Col_4 | Col_5
A | AA | BB | 1 | 1
A | AB | BC | 2 | 1
A | AC | BD | 3 | 1
B | BB | CC | 1 | 0
B | BB | CC | 2 | 0
C | CC | DD | 1 | 0
Does anyone have an idea how I can achieve this? This is a large database, so performance is also a key factor.
Thanks heaps,
There are plenty of ways to do it. This solution was written for Postgres, which is what I have access to, but I expect it will also work in T-SQL, since the syntax involved is fairly standard.
;WITH
cte_1 AS (
SELECT col_1 FROM some_table GROUP BY col_1 HAVING count(*) > 1
),
cte_2 AS (
SELECT col_1 FROM some_table GROUP BY col_1, col_2, col_3 HAVING count(*) > 1
),
cte_3 AS (
SELECT cte_1.col_1 FROM cte_1
LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
WHERE cte_2.col_1 IS NULL
)
UPDATE some_table SET col_5 = 1
FROM cte_3 WHERE cte_3.col_1 = some_table.col_1;
So, what happens above?
First we build three CTE semi-tables, which allow us to split the logic into smaller parts:
cte_1 extracts the col_1 values that have more than one row (and so can have multiple col_2 and col_3 combinations)
cte_2 selects the col_1 values that have non-unique (col_2, col_3) combinations
cte_3 returns the col_1 values whose (col_2, col_3) combinations are all unique, simply by LEFT JOINing cte_1 to cte_2 and keeping the rows with no match
Using the final cte_3 result we can update some_table correctly.
I assume that your table is called some_table here. If you're worried about performance you should have a primary key, and it would also be good to have indexes on col_2 and col_3 (standalone ones, though composite indexes such as (col_1, col_2) and so on may help too).
You may also want to move from CTEs to temporary tables (which could likewise be indexed to gain efficiency).
Please also note that this query works fine with your example, but without real data it is partly guesswork: for instance, what should happen when col_1 = A has both unique and non-unique (col_2, col_3) combinations at the same time?
Still, I believe it's a good starting point. The following version also tries to handle that mixed case:
;WITH
cte_1 AS (
SELECT col_1, count(*) as items FROM some_table GROUP BY col_1 HAVING count(*) > 1
),
cte_2 AS (
SELECT col_1, count(*) as items FROM some_table GROUP BY col_1, col_2, col_3 HAVING count(*) > 1
),
cte_3 AS (
SELECT cte_1.col_1 FROM cte_1
LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
WHERE cte_2.col_1 IS NULL OR cte_1.items > cte_2.items
GROUP BY cte_1.col_1
)
UPDATE some_table SET col_5 = 1
FROM cte_3 WHERE cte_3.col_1 = some_table.col_1;
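If performance matters, here is a sketch of a window-function alternative (written for Postgres, reusing the some_table name from above): it scans the table once to count duplicates of each (col_2, col_3) combination, then flags the col_1 groups that have more than one row and no repeated combination.
UPDATE some_table AS t
SET col_5 = 1
FROM (
    SELECT col_1,
           count(*) AS grp_rows,               -- rows per col_1 group
           max(combo_rows) AS max_combo_rows   -- worst duplication of any (col_2, col_3) combo
    FROM (
        SELECT col_1,
               count(*) OVER (PARTITION BY col_1, col_2, col_3) AS combo_rows
        FROM some_table
    ) AS x
    GROUP BY col_1
) AS w
WHERE w.col_1 = t.col_1
  AND w.grp_rows > 1          -- the group has more than one row
  AND w.max_combo_rows = 1;   -- and no (col_2, col_3) combination repeats
Whether this actually beats the CTE version depends on the data and indexes, so it is worth comparing plans on the real table.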

SQL query showing just a few DISTINCT records

I have the following records:
Col_1 | Col_2 | Col_3
------+-------+------
A | A | XYZ01
A | A | XYZ02
A | A | XYZ03
A | B | XYZ04
B | B | XYZ05
B | B | XYZ06
B | B | XYZ07
B | B | XYZ08
I need a query which will return a maximum of 2 records for each distinct Col_1, Col_2 combination (regardless of Col_3), i.e. a 2-record sample of each combination.
So this query should return:
Col_1 | Col_2 | Col_3
------+-------+------
A | A | XYZ01
A | A | XYZ02
A | B | XYZ04
B | B | XYZ05
B | B | XYZ06
SELECT *
FROM (
    SELECT col_1
          ,col_2
          ,col_3
          ,row_number() OVER (
               PARTITION BY col_1, col_2
               ORDER BY col_1
           ) AS foo
    FROM TABLENAME
) bar
WHERE foo < 3
The TOP command will not work because you want to 'group by' multiple columns. What will help is partitioning the data and assigning a row number within each partition. By partitioning on col_1 and col_2 we create 3 different groupings:
1. All rows with 'A', 'A' in col_1, col_2
2. All rows with 'A', 'B' in col_1, col_2
3. All rows with 'B', 'B' in col_1, col_2
We order by col_1 (I picked this because your result set was ordered that way). Then for each row in a grouping we count the rows and display the row number.
We use this information as a derived table, and select * from this derived table where the row number is less than 3. This gets us the first two rows in each grouping.
As far as Oracle goes, the use of RANK would work, provided it ranks on something that varies within the partition (here Col_3) rather than on the partition key itself:
Select * From (SELECT Col_1,
                      Col_2,
                      Col_3,
                      RANK() OVER (PARTITION BY Col_1, Col_2 ORDER BY Col_3) part
               FROM someTable) st
where st.part <= 2;
Since I was reminded that you can't use the alias in the original where clause, I made a change that will still work, though may not be the most elegant.

Creating a Postgres crosstab query and calculating difference between columns

Tni analysis
I have data in the format
Patid | TNT | date
A123  | 1.2 | 23/1/2012
A123  | 1.3 | 23/1/2012
B123  | 2.6 | 24/7/2011
B123  | 2.7 | 24/7/2011
And I would like to be able to calculate the difference between two rows like so
rowid | TNT-1 | TNT-2 | difference
A123  | 1.2   | 1.3   | 0.1
B123  | 2.6   | 2.7   | 0.1
etc.
I presume this is a use case for the crosstab function in Postgres, but I am struggling to get results. Any help greatly appreciated.
You can pivot by hand and then take the difference (assuming you always have 2 records for each Patid and you don't have to take date into account):
with cte1 as (
select
Patid, TNT, date, row_number() over(partition by Patid order by TNT) as rn
from Table1
), cte2 as (
select
Patid,
max(case when rn = 1 then TNT end) as "TNT-1",
max(case when rn = 2 then TNT end) as "TNT-2"
from cte1
group by Patid
)
select
Patid as rowid, "TNT-1", "TNT-2", "TNT-2" - "TNT-1" as difference
from cte2
-------------------------------------
ROWID TNT-1 TNT-2 DIFFERENCE
A123 1.2 1.3 0.1
B123 2.6 2.7 0.1
sql fiddle demo
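Since the question mentions crosstab: the same pivot can be done with the tablefunc extension's crosstab() function. This is just a sketch, assuming the extension is available, the table and column names used above (Table1, Patid, TNT), that Patid is text and TNT numeric, and that each Patid has exactly two readings:
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT patid AS rowid,
       tnt_1 AS "TNT-1",
       tnt_2 AS "TNT-2",
       tnt_2 - tnt_1 AS difference
FROM crosstab(
       $$ SELECT Patid,
                 row_number() OVER (PARTITION BY Patid ORDER BY TNT),  -- category: 1st or 2nd reading
                 TNT
          FROM Table1
          ORDER BY 1, 2 $$
     ) AS ct(patid text, tnt_1 numeric, tnt_2 numeric);
The manual pivot above is arguably simpler, though, since it needs no extension.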

SQL: select top fewest rows with sum more than 0.7

The raw data table is
+--------+--------+
| id | value |
+--------+--------+
| 1 | 0.1 |
| 1 | 0.2 |
| 1 | 0.3 |
| 1 | 0.2 |
| 1 | 0.2 |
| 2 | 0.4 |
| 2 | 0.5 |
| 2 | 0.1 |
| 3 | 0.5 |
| 3 | 0.5 |
+--------+--------+
For each id, the values sum to 1. I want to select, for each id, the fewest rows whose value sum is greater than or equal to 0.7, like:
+--------+--------+
| id | value |
+--------+--------+
| 1 | 0.3 |
| 1 | 0.2 |
| 1 | 0.2 |
| 2 | 0.5 |
| 2 | 0.4 |
| 3 | 0.5 |
| 3 | 0.5 |
+--------+--------+
How to solve this problem?
It's neither pretty nor efficient but it's the best I can come up with.
Disclaimer: I'm sure this will perform horribly on any real-world dataset.
with recursive calc (id, row_list, value_list, total_value) as (
select id, array[ctid], array[value]::numeric(6,2)[], value::numeric(6,2) as total_value
from data
union all
select c.id, p.row_list||c.ctid, (p.value_list||c.value)::numeric(6,2)[], (p.total_value + c.value)::numeric(6,2)
from data as c
join calc as p on p.id = c.id and c.ctid <> all(p.row_list)
)
select id, unnest(min(value_list)) as value
from (
select id,
value_list,
array_length(row_list,1) num_values,
min(array_length(row_list,1)) over (partition by id) as min_num_values
from calc
where total_value >= 0.7
) as result
where num_values = min_num_values
group by id
SQLFiddle example: http://sqlfiddle.com/#!15/8966b/1
How does this work?
The recursive CTE (the WITH RECURSIVE part) creates all possible combinations of values from the table. To make sure that the same value is not counted twice, I'm collecting the CTIDs (a Postgres-internal unique identifier for each row) of the rows already processed into an array. The recursive join condition (p.id = c.id and c.ctid <> all(p.row_list)) then makes sure that only values for the same id are added, and only rows that have not yet been processed.
The result of the CTE is then reduced to all rows where the total sum (the column total_value) is >= 0.7.
The final outer select (the alias result) is then filtered down to those rows where the number of values making up the total sum is the smallest. The min(value_list) with GROUP BY id, followed by unnest, then transforms the arrays back into a proper "table". Collapsing to a single array per id is necessary because the CTE collects all combinations, so that e.g. for id=2 the value_list arrays will contain both {0.40,0.50} and {0.50,0.40}; without that step the unnest would return both combinations, making it a total of four rows for that id.
This also isn't that pretty, but I think it'd be more efficient (and more portable between RDBMSs):
with unique_data as (
select id
, value
, row_number() over ( partition by id order by value desc ) as rn
from my_table
)
, cumulative_sum as (
select id
, value
, sum(value) over ( partition by id order by rn ) as csum
from unique_data
)
, first_over_the_mark as (
select id
, value
, csum
, lag(csum) over ( partition by id order by csum ) as prev_value
from cumulative_sum
)
select *
from first_over_the_mark
where coalesce(prev_value, 0) < 0.7
SQL Fiddle
I've done it with CTEs to make it easier to see what's happening, but there's no need to use them.
It uses a cumulative sum. The first CTE makes each row unique: without it, rows with the same value (e.g. the two 0.2 rows for id 1) would be treated as peers and summed together. The second CTE works out the running sum. The third works out the previous cumulative value for each row. If that previous value is strictly less than 0.7, we pick the row up: the idea is that if the cumulative sum before the current row is below 0.7, the current row is still needed to reach (or pass) that number.
It's worth noting that this will break down if you have any rows in your table where the value is 0.
This is a variant on Ben's method, but it is simpler to implement. You just need a cumulative sum, ordered by value in descending order, and then to take everything where the cumulative sum is less than 0.7, plus the first row that reaches or exceeds that value.
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc) as csum
from t
) t
where csum - value < 0.7;
The expression csum - value is the cumulative sum minus the current value (you can also get this using something like rows between unbounded preceding and 1 preceding). Your condition is that this value is less than some threshold.
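For reference, here is a sketch of that frame-clause variant (prev_csum is just an illustrative alias); the coalesce is needed because the frame is empty, and the sum therefore NULL, for the first row of each id, and with duplicate values it can behave slightly differently from csum - value:
select t.*
from (select t.*,
             coalesce(sum(value) over (partition by id
                                       order by value desc
                                       rows between unbounded preceding and 1 preceding),
                      0) as prev_csum
      from t
     ) t
where prev_csum < 0.7;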
EDIT:
Ben's comment is right about duplicate values. His solution is fine. Here is another solution:
select t.*
from (select t.*,
sum(value) over (partition by id order by value desc, random()) as csum
from t
) t
where csum - value < 0.7;