Over Partition to find duplicates and remove them based on criteria SQL - sql

I hope everyone is doing well. I have a dilemma that i can not quite figure out.
I am trying to find a unique value for a field that is not a duplicate.
For example:
Table 1
|Col1 | Col2| Col3 |
| 123 | A | 1 |
| 123 | A | 2 |
| 12 | B | 1 |
| 12 | B | 2 |
| 12 | C | 3 |
| 12 | D | 4 |
| 1 | A | 1 |
| 2 | D | 1 |
| 3 | D | 1 |
Col 1 is the field that would have the duplicate values. Col2 would be the owner of the value in Col 1. Col 3 uses the row number() Over Partition syntax to get the numbers in ascending order.
The goal i am trying to accomplish is to remove the value in col 1 if it is not truly unique when looking at col2.
Example:
Col1 has the value 123, Col2 has the value A. Although there are two instances of 123 being owned by A, i can determine that it is indeed unique.
Now look at Col1 that has the value 12 with values in Col2 of B,C,D.
Value 12 is associated with three different owners thus eliminating 12 from our result list.
So in the end i would like to see a result table such as this :
|Col1 | Col2|
| 123 | A |
| 1 | A |
| 2 | D |
| 3 | D |
To summarize, i would like to first use the partition numbers to identify if the value in col1 is repeated. From there i want to verify that the values in col 2 are the same. If so the value in col 1 and col 2 remains as one single entry. However if the values in col 2 do not match, all records for the col1 value are removed.
I will provide the syntax code for my query if needed.
Update**
I failed to mention that table 1 is the result of inner joining two tables.
So Col1 comes from table a and Col2 comes from table b.
The values in table a for col2 are hard to interpret so i had to make sense of them and assigned it proper name values.
The join query i used to combine the two are:
Select a.Col1, B.Col2 FROM Table a INNER JOIN Table b on a.Colx = b.Colx
Update**
Table a:
|Col1 | Colx| Col3 |
| 123 | SMS | 1 |
| 123 | S9W | 2 |
| 12 | NAV | 1 |
| 12 | NFR | 2 |
| 12 | ABC | 3 |
| 12 | DEF | 4 |
| 1 | SMS | 1 |
| 2 | DEF | 1 |
| 3 | DES | 1 |
Table b:
|Colx | Col2|
| SMS | A |
| S9W | A |
| DEF | D |
| DES | D |
| NAV | B |
| NFR | B |
| ABC | C |
Above are sample data for both tables that get joined in order to create the first table displayed in this body.
Thank you all so much!

NOT EXISTS operator can be used to do this task:
SELECT distinct Col1 , Col2
FROM table t
WHERE NOT EXISTS(
SELECT 1 FROM table t1
WHERE t.col1=t1.col1 AND t.col2 <> t1.col2
)

If I understand correctly, you want:
select col1, min(col2)
from t
group by col1
where min(col2) <> max(col2);
I think the third column is confusing you. It doesn't seem to play any role in the logic you want.

Related

How can I define IIf parameters across different records in a table?

I've defined a query that filters out records that are null in a specific field. I'd like to also calculate a query field that returns the type of record that follows the record that was filtered out, if it matches the parameters. The way I thought to do this was with an IIf statement with multiple parameters:
Preparing: IIf([tblCustomers!OrderId]=([tblCustomers!OrderId]+1)
AND [tblCustomers!OrderStatus]="Preparing","Preparing","")
This didn't work as I hoped, but I wasn't too surprised, as it would have to return data from the field initially tested. So, the argument that adds 1 is actually doing nothing.
Is there a way to target the next record in the table, test if it matches one of two or three strings, then return which one it is?
Edit: Following #mazoula's solution, it seems a correlated subquery is indeed the answer here. Following the guide on allenbrowne.com (linked by June7), I seemed to be on the right track. Here is my code for retrieving the status of a previous record:
SELECT tblCustomers.AccountId,
tblCustomers.OrderId,
tblCustomers.OrderStatus,
tblCustomers.OrderShipped,
tblCustomers.OrderNotes,
(SELECT TOP 1 Dupe.OrderStatus
FROM tblCustomers AS Dupe
WHERE Dupe.AccountId = tblCustomers.AccountId
AND Dupe.OrderId > tblCustomers.OrderId
ORDER BY Dupe.AccountId DESC, Dupe.OrderId) AS NextStatus
FROM tblCustomers
WHERE (((tblCustomers.OrderShipped)="N") AND
((tblCustomers.OrderNotes) Is Null))
ORDER BY tblCustomers.AccountId DESC;
Unfortunately, I am met with the following error:
At most one record can be returned by this subquery
Doing a little more research, I found that incorporating an INNER JOIN expression should solve this.
...
FROM tblCustomers
INNER JOIN OrderStatus Dupe ON Dupe.AccountId = tblCustomers.AccountId
WHERE ...
This is where I've hit another roadblock and, when the syntax is at least correct, I receive the error:
Join expression not supported.
Is this a simple syntax issue, or have misunderstood the role of a Join expression?
in Access 2016 I do this in two parts because access throws the error: must use an updateable query when I try to update based on a subquery. For instance, if I want to replace the Null Values in TableA.Field3 with 'a' if the next record's Field3 is 'a'
tableA:
-------------------------------------------------------------------------------------
| ID | Field1 | Field2 | Field3 |
-------------------------------------------------------------------------------------
| 1 | a | 1 | |
-------------------------------------------------------------------------------------
| 2 | b | 2 | |
-------------------------------------------------------------------------------------
| 3 | c | 3 | a |
-------------------------------------------------------------------------------------
| 4 | d | 4 | b |
-------------------------------------------------------------------------------------
| 5 | e | 5 | |
-------------------------------------------------------------------------------------
| 6 | f | 6 | b |
-------------------------------------------------------------------------------------
I make a table on which to base the update query:
Replacement: (SELECT TOP 1 Dupe.Field3 FROM [TableA] as Dupe WHERE Dupe.ID > [TableA].[ID])
'SQL PANE'
SELECT TableA.ID, TableA.Field1, TableA.Field2, TableA.Field3, (SELECT TOP 1 Dupe.Field3 FROM [TableA] as Dupe WHERE Dupe.ID > [TableA].[ID]) AS Replacement INTO TempTable
FROM TableA;
TempTable:
----------------------------------------------------------------------------------------------------------
| ID | Field1 | Field2 | Field3 | Replacement |
----------------------------------------------------------------------------------------------------------
| 1 | a | 1 | | |
----------------------------------------------------------------------------------------------------------
| 2 | b | 2 | | a |
----------------------------------------------------------------------------------------------------------
| 3 | c | 3 | a | b |
----------------------------------------------------------------------------------------------------------
| 4 | d | 4 | b | |
----------------------------------------------------------------------------------------------------------
| 5 | e | 5 | | b |
----------------------------------------------------------------------------------------------------------
| 6 | f | 6 | b | |
----------------------------------------------------------------------------------------------------------
Finally do the Update
UPDATE TempTable INNER JOIN TableA ON TempTable.ID = TableA.ID SET TableA.Field3 = [TempTable].[Replacement]
WHERE (((TempTable.Replacement)='a'));
TableA after update
-------------------------------------------------------------------------------------
| ID | Field1 | Field2 | Field3 |
-------------------------------------------------------------------------------------
| 1 | a | 1 | |
-------------------------------------------------------------------------------------
| 2 | b | 2 | a |
-------------------------------------------------------------------------------------
| 3 | c | 3 | a |
-------------------------------------------------------------------------------------
| 4 | d | 4 | b |
-------------------------------------------------------------------------------------
| 5 | e | 5 | |
-------------------------------------------------------------------------------------
| 6 | f | 6 | b |
notes: In the Make Table query remember to sort TableA and Dupe in the same way. Here we use the default sort of increasing ID for TableA then grab the first record with a higher ID using the default sort again. the only reason I did the filtering to 'a' in the update query is it made the Make Table query simpler.

Need sql help: For each record in table A (has more columns than table B), insert into Table B

I know my subject is a little sparse, but for the life of me I cannot figure out how to do this. I could accomplish this in C# but I am getting confused by the SQL syntax. I searched and searched and I can't seem to find what I am looking for probably because I don't understand some of the SQL that I am looking at.
TABLE 1
-----------
| CustNo | Catalog1 | Catalog2 | Catalog3 | Catalog4 |
| 1 | A | B | C | NULL |
| 2 | B | C | NULL | D |
| 3 | A | C | E | F |
TABLE 2 (empty)
COLUMNS: CustNo|Catalog
So Basically for each record in Table 1, I want to insert the catalogs into table 2.
So the desired output would look like the following.
TABLE 2
CustNo|Catalog
| 1 | A
| 1 | B
| 1 | C
| 2 | B
| 2 | C
| 2 | D
| 3 | A
| 3 | C
| 3 | E
| 3 | F
Thank you all for any help!
Just unpivot. I like to do this using apply;
insert into table2 (CustNo, Catalog)
select t1.CustNo, v.Catalog
from table1 t1 cross apply
(values (t1.Catalog1), (t1.Catalog2), (t1.Catalog3), (t1.Catalog4)
) v(catalog)
where v.Catalog is not null;

Fetch the column which has the Max value for a row in Hive

I have a scenario where i need to pick the greatest value in the row from three columns, there is a function called Greatest but it doesn't work in my version of Hive 0.13.
Please suggest better way to accomplish it.
Example table:
+---------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+---------+------+------+------+
| Group A | 1 | 2 | 3 |
+---------+------+------+------+
| Group B | 4 | 5 | 1 |
+---------+------+------+------+
| Group C | 4 | 2 | 1 |
+---------+------+------+------+
expected Result:
+---------+------------+------------+
| Col1 | output_max | max_column |
+---------+------------+------------+
| Group A | 3 | Col4 |
+---------+------------+------------+
| Group B | 5 | col3 |
+---------+------------+------------+
| Group C | 4 | col2 |
+---------+------------+------------+
select col1
,tuple.col1 as output_max
,concat('Col',tuple.col2) as max_column
from (select Col1
,sort_array(array(struct(Col2,2),struct(Col3,3),struct(Col4,4)))[2] as tuple
from t
) t
;
sort_array(Array)
Sorts the input array in ascending order according to the natural ordering of the array elements and returns it
(as of version 0.9.0).
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
hive> select col1
> ,tuple.col1 as output_max
> ,concat('Col',tuple.col2) as max_column
>
> from (select Col1
> ,sort_array(array(struct(Col2,2),struct(Col3,3),struct(Col4,4)))[2] as tuple
> from t
> ) t
> ;
OK
Group A 3 Col4
Group B 5 Col3
Group C 4 Col2

SQL Count across columns

I know that this table structure is horrible and that I should look into database normalization, but this is what I have to work with at the moment.
I need to find the most common number across the columns where one of them has a specific id (in my example 3). Both columns will never have the same value.
Query
SELECT Col1, Col2 FROM scores WHERE Col1 = 3 OR Col2 = 3
Result
+------+------+
| Col1 | Col2 |
+------+------+
| 1 | 3 |
| 3 | 1 |
| 2 | 3 |
| 6 | 3 |
| 3 | 7 |
| 3 | 9 |
| 2 | 3 |
| 5 | 3 |
+------+------+
I'm hoping to get a result like this (I don't need count for 3 since it's the ID, but it can be included)
+-------+-------+
| Value | Count |
+-------+-------+
| 1 | 2 |
| 2 | 2 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 9 | 1 |
+-------+-------+
I've tried a few things such as UNION and nested SELECT but that doesn't seem to solve this thing.
Any suggestions?
If you want a count of the values where the OTHER column is 3, then a UNION would work like this:
SELECT value, theCount = COUNT(*)
FROM (
SELECT value = col1
FROM scores
WHERE col2 = 3
UNION ALL
SELECT col2
FROM scores
WHERE col1 = 3) T
GROUP BY value
ORDER BY value;
One way is using case:
SELECT
case Col1 when 3 then Col2 else Col1 end,
count(*)
FROM scores
WHERE Col1 = 3 OR Col2 = 3
Group by
case Col1 when 3 then Col2 else Col1 end;

Self-referential CASE WHEN clause in SQL

I'm trying to migrate some poorly formed data into a database. The data comes from a CSV, and is first loaded into a staging table of all varchar columns (as I cannot enforce type safety at this stage).
The data might look like
COL1 | COL2 | COL3
Name 1 | |
2/11/16 | $350 | $230
2/12/16 | $420 | $387
2/13/16 | $435 | $727
Name 2 | |
2/11/16 | $121 | $144
2/12/16 | $243 | $658
2/13/16 | $453 | $214
The first colum is a mixture of company names as pseudo-headers, and dates for which colum 2 and 3 data is relevant. I'd like to start transforming the data by creating a 'Brand' column - where 'StoreBrand' is the value of Col1 if Col2 is NULL, or the previous row's StoreBrand otherwise. Comething like:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | Name 1
2/13/16 | $435 | $727 | Name 1
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | Name 2
2/13/16 | $453 | $214 | Name 2
I wrote this:
SELECT
t.*,
CASE
WHEN t.COL2 IS NULL THEN COL1
ELSE LAG(StoreBrand) OVER ()
END AS StoreBrand
FROM
(
SELECT
ROW_NUMBER() OVER () AS i,
*
FROM
Staging_Data
) t;
But the database (postgres in this case, but we're considering alternatives so the most diverse answer is preferred) chokes on LAG(StoreBrand) because that's the derived column I'm creating. Invoking LAG(Col1) only populates the first row's real data:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | 2/11/16
2/13/16 | $435 | $727 | 2/12/16
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | 2/11/16
2/13/16 | $453 | $214 | 2/12/16
My goal would be a StoreBrand column which is the first value of COL1 for all date values before the next brand name:
COL1 | COL2 | COL3 | StoreBrand
Name 1 | | | Name 1
2/11/16 | $350 | $230 | Name 1
2/12/16 | $420 | $387 | Name 1
2/13/16 | $435 | $727 | Name 1
Name 2 | | | Name 2
2/11/16 | $121 | $144 | Name 2
2/12/16 | $243 | $658 | Name 2
2/13/16 | $453 | $214 | Name 2
The value of StoreBrand when Col2 and Col3 are null is inconsequential - that row will be dropped as part of the conversion process. The important thing is associating the data rows (i.e. those with dates) with their brand.
Is there a way to reference the previous value for the column that I'm missing?
Edit for people who find this question through a search engine:
The trick was to use WITH which allows to use a temporary result at several places (link).
I think this does what you want and discards the null rows at the same time (if you want to). We basically select all the brands before the row we are currently looking at and if no "brand row" exists between it and the current row then we take it.
WITH t AS
(SELECT
ROW_NUMBER() OVER () AS i,
*
FROM
Staging_Data
)
SELECT
a.COL1,
a.COL2,
a.COL3,
(SELECT b.COL1 FROM t b WHERE b.COL2 IS NULL AND b.i <= a.i AND NOT EXISTS(
SELECT * FROM t c WHERE c.COL2 IS NULL AND c.i <= a.i AND c.i > b.i)
) StoreBrand
FROM
t a
WHERE -- I don't think you need those rows? Otherwise remove it.
a.COL2 IS NOT NULL
It can be a bit confusing. t is a temporary table we defined with your query. And a, b and c are aliases for t. We could also write FROM t AS a to make it more obvious.
I think I understand what you want. Technically, you want the ignore nulls option on lag(), so it would look like this:
select lag(case when col1 not like '%/%/%' then col1 end ignore nulls) over (order by linenumber) as brandname
The only problem? Postgres doesn't support ignore nulls.
But, you can do pretty much the same thing with a subquery. The idea is to assign a grouping identifier to each group. And this is a cumulative count of the valid brand names. Then a simple max() aggregation works:
select t.*,
max(case when col1 not like '%/%/%' then col1 end) over (partition by grp) as brand
from (select t.*,
sum(case when col1 not like '%/%/%' then 1 end) over
(order by linenumber) as grp
from t
);