Find number of rows identical on some columns, but different on another column - sql

Say I have the following table:
CREATE TABLE data (
PROJECT_ID VARCHAR,
TASK_ID VARCHAR,
REF_ID VARCHAR,
REF_VALUE VARCHAR
);
I want to identify rows where
PROJECT_ID, REF_ID, REF_VALUE are the same
but TASK_ID are different.
The desired output is a list of TASK_ID_1, TASK_ID_2 and COUNT(*) of such conflicts. So, for example,
DATA
+------------+---------+--------+-----------+
| PROJECT_ID | TASK_ID | REF_ID | REF_VALUE |
+------------+---------+--------+-----------+
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 1 | 2 | 1 | 1 |
| 1 | 2 | 1 | 2 |
+------------+---------+--------+-----------+
OUTPUT
+-----------+-----------+----------+
| TASK_ID_1 | TASK_ID_2 | COUNT(*) |
+-----------+-----------+----------+
| 1 | 2 | 2 |
| 2 | 1 | 2 |
+-----------+-----------+----------+
would mean that there are two entries with TASK_ID == 1 and two entries with TASK_ID == 2 that share the same values for the other three columns. The inherent symmetry in the output is fine.
How would I go about finding this information? I've tried joining the table onto itself and grouping, but this turned up more results for a single task than the table had rows altogether, so it's clearly wrong.
The database used is PostgreSQL, though a solution that applies to most common SQL systems would be preferable.

You want a self join and aggregation:
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
data d2
on d1.project_id = d2.project_id and
d1.ref_id = d2.ref_id and
d1.ref_value = d2.ref_value and
d1.task_id <> d2.task_id
group by d1.task_id, d2.task_id;
Notes:
Add the condition d1.task_id < d2.task_id if you want each pair to occur only once in the result set.
This does not handle NULL values, although that is easy enough to handle. Use is not distinct from instead of =.
You can also simplify this a bit with the using clause:
select d1.task_id as task_id_1, d2.task_id as task_id_2, count(*)
from data d1 join
data d2
using (project_id, ref_id, ref_value)
where d1.task_id <> d2.task_id
group by d1.task_id, d2.task_id;
You can get an idea of how many rows might be returned by using:
select d.project_id, d.ref_id, d.ref_value, count(distinct d.task_id), count(*)
from data d
group by d.project_id, d.ref_id, d.ref_value;
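As a quick check, the self join can be run against the four sample rows from the question. This is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL (an assumption made only so the example is self-contained); the query itself is the one shown above, with the comparison operator swapped to show both variants.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (project_id TEXT, task_id TEXT, ref_id TEXT, ref_value TEXT)"
)
# The four sample rows from the question.
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [("1", "1", "1", "1"), ("1", "1", "1", "2"),
     ("1", "2", "1", "1"), ("1", "2", "1", "2")],
)

PAIRS_SQL = """
    SELECT d1.task_id AS task_id_1, d2.task_id AS task_id_2, COUNT(*)
    FROM data d1
    JOIN data d2
      ON d1.project_id = d2.project_id
     AND d1.ref_id = d2.ref_id
     AND d1.ref_value = d2.ref_value
     AND d1.task_id {op} d2.task_id
    GROUP BY d1.task_id, d2.task_id
    ORDER BY task_id_1, task_id_2
"""

# '<>' reports each conflicting pair in both directions.
symmetric = conn.execute(PAIRS_SQL.format(op="<>")).fetchall()
# '<' keeps each pair only once.
once = conn.execute(PAIRS_SQL.format(op="<")).fetchall()

print(symmetric)
print(once)
```

With `<>` the result matches the symmetric output asked for in the question; with `<` only the (1, 2) pair remains.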

This is how I understand your question. This assumes there are only two tasks for the same combination.
SQL DEMO
SELECT "PROJECT_ID", "REF_ID", "REF_VALUE",
MIN("TASK_ID") as TASK_ID_1,
MAX("TASK_ID") as TASK_ID_2,
COUNT(*) as cnt
FROM Table1
GROUP BY "PROJECT_ID", "REF_ID", "REF_VALUE"
HAVING MIN("TASK_ID") != MAX("TASK_ID")
-- COUNT(*) > 1 also should work
OUTPUT
I added more columns to make clear which elements are the same:
| PROJECT_ID | REF_ID | REF_VALUE | task_id_1 | task_id_2 | cnt |
|------------|--------|-----------|-----------|-----------|-----|
| 1 | 1 | 2 | 1 | 2 | 2 |
| 1 | 1 | 1 | 1 | 2 | 2 |
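The two-task assumption matters. The sketch below uses hypothetical data (three tasks sharing one combination, which is not in the question) and Python's sqlite3 module as a stand-in database to show what MIN/MAX reports in that case. The table is named `data` here to match the question rather than the demo's Table1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (project_id TEXT, task_id TEXT, ref_id TEXT, ref_value TEXT)"
)
# Hypothetical data: three tasks share the same (project_id, ref_id, ref_value).
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [("1", "1", "1", "1"), ("1", "2", "1", "1"), ("1", "3", "1", "1")],
)

rows = conn.execute("""
    SELECT project_id, ref_id, ref_value,
           MIN(task_id) AS task_id_1,
           MAX(task_id) AS task_id_2,
           COUNT(*) AS cnt
    FROM data
    GROUP BY project_id, ref_id, ref_value
    HAVING MIN(task_id) <> MAX(task_id)
""").fetchall()

print(rows)
```

Only tasks 1 and 3 show up as a pair; task 2 is counted in cnt but never named, so a self join is the safer choice when more than two tasks can collide.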

Related

SQL generate data based on the ids of three tables

I have three tables, store, gender, and age_group, and each of them has ids. I need to generate table data containing all possible combinations of the three.
e.g. store_id = (1,2,3), gender_id = (1,2,3), age_group_id = (1,2,3)
so that I have a table that looks like this:
|store_id|gender_id|age_group_id|
|:------:|:-------:|:----------:|
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 2 | 1 | 3 |
| 2 | 2 | 3 |
| 3 | 1 | 3 |
| 3 | 2 | 3 |
etc., continuing until every combination is populated. Any suggestions on the best approach to do this in SQL?
Cross join the three tables:
select
s.Id as store_id,
g.Id as gender_id,
a.Id as age_group_id
from store s
cross join gender g
cross join age_group a
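A small runnable check of the cross join, assuming each of the three tables holds ids 1 to 3 as in the question. Python's sqlite3 module is used only as a convenient stand-in database; the single-column table definitions are an assumption, since the question does not show the schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schemas: one id column per table, ids 1..3.
for table in ("store", "gender", "age_group"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
    conn.executemany(f"INSERT INTO {table} (id) VALUES (?)", [(1,), (2,), (3,)])

combos = conn.execute("""
    SELECT s.id AS store_id, g.id AS gender_id, a.id AS age_group_id
    FROM store s
    CROSS JOIN gender g
    CROSS JOIN age_group a
    ORDER BY store_id, gender_id, age_group_id
""").fetchall()

print(len(combos))   # 3 * 3 * 3 combinations
print(combos[:4])
```

To persist the combinations, the same SELECT can feed an INSERT INTO ... SELECT into the target table.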

Oracle SQL: Counting how often an attribute occurs for a given entry and choosing the attribute with the maximum number of occurrences

I have a table that has a number column and an attribute column like this:
1.
+-----+-----+
| num | att |
+-----+-----+
|   1 | a   |
|   1 | b   |
|   1 | a   |
|   2 | a   |
|   2 | b   |
|   2 | b   |
+-----+-----+
I want to make the number unique, and the attribute to be whichever attribute occurred most often for that number, like this (this is the end product I'm interested in):
2.
+-----+-----+
| num | att |
+-----+-----+
|   1 | a   |
|   2 | b   |
+-----+-----+
I have been working on this for a while and managed to write myself a query that looks up how many times an attribute occurs for a given number like this:
3.
+-----+-----+-------+
| num | att | count |
+-----+-----+-------+
|   1 | a   |     1 |
|   1 | b   |     2 |
|   2 | a   |     1 |
|   2 | b   |     2 |
+-----+-----+-------+
But I can't think of a way to only select those rows from the above table where the count is the highest (for each number of course).
So basically what I am asking is: given table 3, how do I select only the rows with the highest count for each number? (Of course, an answer describing a way to get from table 1 to table 2 directly also works :) )
You can use aggregation and window functions:
select num, att
from (
select num, att, row_number() over(partition by num order by count(*) desc, att) rn
from mytable
group by num, att
) t
where rn = 1
For each num, this brings the most frequent att; if there are ties, the smaller att is retained.
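The ranking idea can be tried on the rows of table 1 from the question. This sketch runs on SQLite through Python's sqlite3 module rather than Oracle (an assumption, and it needs SQLite 3.25 or newer for window functions). The counts are computed in a derived table first, which is a slight restructuring of the query above and not the answer's exact text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (num INTEGER, att TEXT)")
# Rows of table 1 from the question.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "a"), (2, "a"), (2, "b"), (2, "b")],
)

most_frequent = conn.execute("""
    SELECT num, att
    FROM (
        SELECT num, att,
               -- rank attributes per num: highest count first, ties by att
               ROW_NUMBER() OVER (PARTITION BY num ORDER BY cnt DESC, att) AS rn
        FROM (
            SELECT num, att, COUNT(*) AS cnt
            FROM mytable
            GROUP BY num, att
        ) AS counts
    ) AS ranked
    WHERE rn = 1
    ORDER BY num
""").fetchall()

print(most_frequent)
```

The result has one row per num, which is the shape of table 2 in the question.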
Oracle has an aggregation function that does this, stats_mode():
select num, stats_mode(att)
from t
group by num;
In statistics, the most common value is called the mode -- hence the name of the function.
Here is a db<>fiddle.
You can use group by and count as below
select id, col, count(col) as count
from
df_b_sql
group by id, col

SQL : Getting duplicate rows along with other variables

I am working with Teradata SQL. I would like to get the duplicate fields with their count and other variables as well. I can only find ways to get the count, but not exactly the variables as well.
Available input
+---------+----------+----------------------+
| id | name | Date |
+---------+----------+----------------------+
| 1 | abc | 21.03.2015 |
| 1 | def | 22.04.2015 |
| 2 | ajk | 22.03.2015 |
| 3 | ghi | 23.03.2015 |
| 3 | ghi | 23.03.2015 |
Expected output :
+---------+----------+----------------------+
| id | name | count | // Other fields
+---------+----------+----------------------+
| 1 | abc | 2 |
| 1 | def | 2 |
| 2 | ajk | 1 |
| 3 | ghi | 2 |
| 3 | ghi | 2 |
What am I looking for :
I am looking for all duplicate rows, where duplication is decided by ID and to retrieve the duplicate rows as well.
All I have till now is :
SELECT
id, name, other-variables, COUNT(*)
FROM
Table_NAME
GROUP BY
id, name
HAVING
COUNT(*) > 1
This is not showing correct data. Thank you.
You could use a window aggregate function, like this:
SELECT *
FROM (
SELECT id, name, other-variables,
COUNT(*) OVER (PARTITION BY id) AS duplicates
FROM users
) AS sub
WHERE duplicates > 1
Using a Teradata extension to ISO SQL syntax, you can simplify the above to:
SELECT id, name, other-variables,
COUNT(*) OVER (PARTITION BY id) AS duplicates
FROM users
QUALIFY duplicates > 1
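The window-aggregate form can be checked against the sample rows from the question. This is a sketch on SQLite through Python's sqlite3 module, not Teradata (an assumption; SQLite 3.25 or newer is needed for window functions, and it has no QUALIFY, so only the derived-table version is shown). The column name `date` stands in for the question's Date column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, date TEXT)")
# Sample input from the question.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "abc", "21.03.2015"), (1, "def", "22.04.2015"),
     (2, "ajk", "22.03.2015"),
     (3, "ghi", "23.03.2015"), (3, "ghi", "23.03.2015")],
)

duplicate_rows = conn.execute("""
    SELECT id, name, date, duplicates
    FROM (
        SELECT id, name, date,
               -- number of rows sharing this id, repeated on every row
               COUNT(*) OVER (PARTITION BY id) AS duplicates
        FROM users
    ) AS sub
    WHERE duplicates > 1
    ORDER BY id, name
""").fetchall()

for row in duplicate_rows:
    print(row)
```

Every row whose id occurs more than once is returned with all of its columns; dropping the WHERE clause would also keep id 2 with a count of 1, as in the expected output.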
As an alternative to the accepted and perfectly correct answer, you can use:
SELECT {all your required 'variables' (they are not variables, but attributes)}
, cnt.Count_Dups
FROM Table_NAME TN
INNER JOIN (
SELECT id
, COUNT(1) Count_Dups
FROM Table_NAME
GROUP BY id
HAVING COUNT(1) > 1 -- If you want only duplicates
) cnt
ON cnt.id = TN.id
edit: According to your edit, duplicates are on id only. Edited my query accordingly.
try this,
SELECT
id, COUNT(id)
FROM
Table_NAME
GROUP BY
id
HAVING
COUNT(id) > 1

Why isn't this returning unique combinations of these attributes?

When using the following query:
with neededSkills(SkillCode) as (
select distinct SkillCode
from job natural join hasprofile natural join requires_skill
where job_code = '1'
minus
select skillcode
from person natural join hasskill
where id = '1'
)
select distinct
taughtin.c_code as c,
count(taughtin.skillcode) as s,
ti.c_code as cc,
count(ti.skillcode) as ss
from taughtin, taughtin ti
where taughtin.c_code <> ti.c_code
and taughtin.skillcode <> ti.skillcode
and taughtin.skillcode in (select skillcode from neededskills)
and ti.skillcode in (select skillcode from neededskills)
group by (taughtin.c_code, ti.c_code)
order by (taughtin.c_code);
It returns:
C | S | CC | SS
----|----|----|----
1 | 1 | 2 | 1
1 | 1 | 3 | 1
1 | 1 | 5 | 1
2 | 1 | 1 | 1
3 | 1 | 1 | 1
5 | 1 | 1 | 1
I would expect it to return only lines where the combination of C and CC was not already used. Do I misunderstand how group by works? How would I achieve this result?
I am trying to have it return:
C | S | CC | SS
----|----|----|----
1 | 1 | 2 | 1
1 | 1 | 3 | 1
1 | 1 | 5 | 1
I use Oracle SQLPlus.
You're grouping on the combination of taughtin.c_code and ti.c_code, which are separate columns in the context of the query (even though they are the same column in the schema). A pair of 1, 2 is not the same as a pair of 2, 1; the values may be the same but the sources are not.
If you want to get the combinations one way but not the other, then the simplest thing is to always make one value larger than the other; instead of:
where taughtin.c_code <> ti.c_code
use:
where ti.c_code > taughtin.c_code
Though it would be better to use ANSI joins for the main query too, and I'm not a fan of natural joins. You also don't need either distinct; the first may eliminate duplicates, but they don't logically matter if you're only using the temporary result set for in().
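The difference between `<>` and `>` in a self join can be seen on a tiny example. The table contents below are hypothetical (four courses, each teaching one distinct skill) and the database is SQLite through Python's sqlite3 module, not Oracle; only the join condition mirrors the query in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taughtin (c_code INTEGER, skillcode INTEGER)")
# Hypothetical rows: courses 1, 2, 3 and 5, each with its own skill.
conn.executemany(
    "INSERT INTO taughtin VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (5, 50)],
)

PAIR_SQL = """
    SELECT t.c_code AS c, ti.c_code AS cc
    FROM taughtin t
    JOIN taughtin ti
      ON ti.c_code {op} t.c_code
     AND ti.skillcode <> t.skillcode
    ORDER BY c, cc
"""

# '<>' yields every pair twice, once in each order.
both_ways = conn.execute(PAIR_SQL.format(op="<>")).fetchall()
# '>' keeps only the ordering where the second course code is larger.
one_way = conn.execute(PAIR_SQL.format(op=">")).fetchall()

print(len(both_ways), both_ways)
print(len(one_way), one_way)
```

Each pair in `one_way` appears in `both_ways` together with its mirror image, which is why the original query returned both 1, 2 and 2, 1.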

An SQL query that combines aggregate and non-aggregate values in one row

The following query gives me the information that I need but I want it to take it just a step further. In the table at the bottom (only showing a subset of the fields), I want to group by cust_line in an unusual way (at least to me it's unusual).
Let's look at the items with a cust_line of 2 as an example. I would like these to be represented by one line, not 5. For this line, I would like to select all the fields except for the price field where the cust_part = "GROUPINV". For the total field I would like it to be 'sum(total) as new_total' and for the price, I would like it to be new_total / qty_invoiced, where qty_invoiced is the value on the line where cust_part = "GROUPINV".
Is what I am asking for completely ridiculous? Is it even possible? I'm not advanced at SQL so it may also be easy and I just don't know how to approach it. I thought of using 'partition by' but I couldn't imagine how I would get it to work as I figured it would still return 5 rows where I only want 1.
I've also looked at these questions with similar titles but not really what I am looking for:
SQL query that returns aggregate AND non aggregate results
Combined aggregated and non-aggregate query in SQL
SELECT L.CUST_LINE, I.LINE_NO, I.ORDER_NO, I.STAGE, I.ORDER_LINE_POS, I.CUST_PART,
I.LINE_ITEM_NO, I.QTY_INVOICED, I.CUST_DESC, I.DESCRIPTION, I.SALE_UNIT_PRICE, I.PRICE_TOTAL,
I.INVOICE_NO, I.CUSTOMER_PO_NO, I.ORDER_NO, I.CUSTOMER_NO, I.CATALOG_DESC, I.ORDER_LINE_NOTES
FROM
(SELECT CUST_LINE, ORDER_NO, LINE_NO
FROM CUSTOMER_ORDER_LINE
GROUP BY CUST_LINE, ORDER_NO, LINE_NO
) L
INNER JOIN CUSTOMER_ORDER_IVC_REP I
ON I.ORDER_NO = L.ORDER_NO
WHERE RESULT_KEY = 999999
AND I.LINE_NO = L.LINE_NO
ORDER BY L.CUST_LINE;
| cust_line | line_no | cust_part | qty_invoiced | cust_desc | price | total |
|-----------|---------|-----------|--------------|-----------|-------|-------|
| 1 | 4 | ... | 1 | ... | 55 | 55 |
| 2 | 1 | GROUPINV | 1 | some part | 0 | 0 |
| 2 | 6 | ... | 3 | ... | 0 | 0 |
| 2 | 2 | ... | 1 | ... | 0 | 0 |
| 2 | 3 | ... | 1 | ... | 0 | 0 |
| 2 | 7 | ... | 2 | ... | 10 | 20 |
| 3 | 7 | ... | 1 | ... | 67 | 67 |
You can use an analytic function to calculate a total over multiple rows of a result set, then filter out the rows you don't want.
Leaving out all the extra columns for sake of brevity:
SELECT cust_line, qty_invoiced, order_total/qty_invoiced AS price
FROM (
SELECT l.cust_line, cust_part, qty_invoiced,
SUM(total) OVER (PARTITION BY l.cust_line) AS order_total,
COUNT(cust_line) OVER (PARTITION BY l.cust_line) AS group_count
FROM
(SELECT CUST_LINE, ORDER_NO, LINE_NO
FROM CUSTOMER_ORDER_LINE
GROUP BY CUST_LINE, ORDER_NO, LINE_NO
) L
INNER JOIN CUSTOMER_ORDER_IVC_REP I
ON I.ORDER_NO = L.ORDER_NO
WHERE RESULT_KEY = 999999
AND I.LINE_NO = L.LINE_NO
)
WHERE ( cust_part = 'GROUPINV' OR group_count = 1 )
ORDER BY cust_line
I am guessing on what you want in the PARTITION BY clause; this is essentially a GROUP BY that applies only to the SUM function. Not sure if you might also want order_no in the partition.
The trick is to select all the rows in the inner query, applying SUM across them all; then filter out the rows you are not interested in in the outermost query.
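The select-everything-then-filter trick can be sketched on the sample rows from the question. Several things here are assumptions made only to keep the example self-contained: a single flat table named invoice_lines replaces the join of the two real tables, 'OTHER' is a placeholder for the cust_part values elided as "..." in the question, and SQLite (3.25 or newer, through Python's sqlite3 module) replaces Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice_lines (
        cust_line INTEGER, line_no INTEGER, cust_part TEXT,
        qty_invoiced INTEGER, price REAL, total REAL
    )
""")
# Sample rows from the question; 'OTHER' is a hypothetical placeholder
# for the part codes that the question elides.
conn.executemany(
    "INSERT INTO invoice_lines VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 4, "OTHER", 1, 55, 55),
     (2, 1, "GROUPINV", 1, 0, 0),
     (2, 6, "OTHER", 3, 0, 0),
     (2, 2, "OTHER", 1, 0, 0),
     (2, 3, "OTHER", 1, 0, 0),
     (2, 7, "OTHER", 2, 10, 20),
     (3, 7, "OTHER", 1, 67, 67)],
)

summary = conn.execute("""
    SELECT cust_line, qty_invoiced, order_total / qty_invoiced AS price
    FROM (
        SELECT cust_line, cust_part, qty_invoiced,
               -- total over all rows of the same cust_line, on every row
               SUM(total) OVER (PARTITION BY cust_line) AS order_total,
               COUNT(*) OVER (PARTITION BY cust_line) AS group_count
        FROM invoice_lines
    ) AS t
    -- keep the GROUPINV row of a group, or the only row of a single-line group
    WHERE cust_part = 'GROUPINV' OR group_count = 1
    ORDER BY cust_line
""").fetchall()

for row in summary:
    print(row)
```

The five rows with cust_line 2 collapse to one row whose price is the group total (20) divided by the GROUPINV quantity (1), while the single-line groups pass through unchanged.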