Removing SQL Rows from Query if two rows have an identical ID but differences in the columns

I'm currently stuck on a SQL issue (mainly because I can't find a way to google it, and my SQL skills do not suffice to solve it myself).
I'm working on a system where documents are edited. When the editing process is finished, users mark the document as solved. In the MSSQL database, the corresponding row is not updated; instead, a new row is inserted. Thus, every document that has been processed has [i.e., should have] multiple rows in the DB.
See the following situation:
ID  ID2  AnotherCondition  Steps  Process  Solved
1   1    yes               Three  ATAT     AF
2   2    yes               One    ATAT     FR
2   3    yes               One    ATAT     EG
2   4    yes               One    ATAT     AF
3   5    no                One    ABAT     AF
4   6    yes               One    ATAT     FR
5   7    no                One    AVAT     EG
6   8    yes               Two    SATT     FR
6   9    yes               Two    SATT     EG
6   10   yes               Two    SATT     AF
I need to select the rows which have not been processed yet. A "processed" document has "FR" in the "Solved" column. Sadly, other versions of the document exist in the DB, with other codes in the "Solved" column.
Now: if there is a row which has "FR" in the "Solved" column, I need to remove every row with the same ID from my SELECT statement as well. Is this doable?
To achieve this, I have to remove the rows with the IDs 2, 4 (included because the system sadly isn't too reliable, I guess) and 6 from my SELECT statement. Is this possible in general?
What I could do is filter out the duplicates afterwards, in Python/JS/whatever. But I am curious whether I can "remove" these rows directly in the SQL statement as well.
To rephrase it once more: how can I write a SELECT statement which returns only (in this example) the rows with the IDs 1, 3 and 5?

If you need to delete every row whose id has no row with "Solved = 'no'", you can use a DELETE statement that spares all "id" values that have at least one "Solved = 'no'" among their rows.
DELETE FROM tab
WHERE id NOT IN (SELECT id FROM tab WHERE Solved1 = 'no');
Edit. If you need to use a SELECT statement, you can simply reverse the condition in the subquery:
SELECT *
FROM tab
WHERE id NOT IN (SELECT id FROM tab WHERE Solved1 = 'yes');

I'm not sure I understand your question correctly:
...every document that has been processed has [...] multiple rows in the DB
I need to find out which documents have not been processed yet
So it seems you need to find unique documents with no versions, this could be done using a GROUP BY with a HAVING clause:
SELECT
Id
FROM dbo.TableName
GROUP BY Id
HAVING COUNT(*) = 1
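For comparison, here is a minimal sketch of the NOT IN idea from the first answer, applied to the question's own data and to the 'FR' marker the question describes. It runs against an in-memory SQLite database from Python rather than MSSQL; the table name docs and the function name are made up for the illustration, and it assumes ID is never NULL (NOT IN returns no rows at all if the subquery yields a NULL).

```python
import sqlite3

# Rows copied from the question's table; the table name "docs" is hypothetical.
ROWS = [
    (1, 1, "yes", "Three", "ATAT", "AF"),
    (2, 2, "yes", "One", "ATAT", "FR"),
    (2, 3, "yes", "One", "ATAT", "EG"),
    (2, 4, "yes", "One", "ATAT", "AF"),
    (3, 5, "no", "One", "ABAT", "AF"),
    (4, 6, "yes", "One", "ATAT", "FR"),
    (5, 7, "no", "One", "AVAT", "EG"),
    (6, 8, "yes", "Two", "SATT", "FR"),
    (6, 9, "yes", "Two", "SATT", "EG"),
    (6, 10, "yes", "Two", "SATT", "AF"),
]


def unprocessed_rows():
    """Return (ID, ID2, Solved) for every row whose ID has no 'FR' version."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (ID INTEGER, ID2 INTEGER, AnotherCondition TEXT, "
        "Steps TEXT, Process TEXT, Solved TEXT)"
    )
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(
        """
        SELECT ID, ID2, Solved
        FROM docs
        -- drop every ID that has at least one row marked 'FR'
        WHERE ID NOT IN (SELECT ID FROM docs WHERE Solved = 'FR')
        ORDER BY ID2
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in unprocessed_rows():
        print(row)
```

On the sample data this keeps only the rows with the IDs 1, 3 and 5. The SELECT itself uses nothing SQLite-specific, so it should carry over to SQL Server unchanged, though that is not tested here.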

Related

SQL to return records that do not have a complete set according to a second table

I have two tables. I want to find the erroneous records in the first table, i.e. those that aren't part of a complete set as determined by the second table. E.g.:
custID service transID
1 20 1
1 20 2
1 50 2
2 49 1
2 138 1
3 80 1
3 140 1
comboID combinations
1 Y00020Y00050
2 Y00049Y00138
3 Y00020Y00049
4 Y00020Y00080Y00140
So in this example I would want a query to return the first row of the first table because it does not have a matching 49 or 50 or (80 and 140), and the last two rows as well (because there is no 20). The second transaction is fine, and the second customer is fine.
I couldn't figure this out with a query, so I wound up writing a program that loads the services per customer and transid into an array, iterates over them, and ensures that there is at least one matching combination record where all the services in the combination are present in the initially loaded array. Even that came off as hamfisted, but it was less of a nightmare than the awkward outer joining of multiple joins I was trying to accomplish with SQL.
Taking a step back, I think I need to restructure the combinations table into something more accommodating, but I still can't think of what the approach would be.
I do not have DB2, so I tested this on Oracle; the listagg function should be available in DB2 as well. The table service is the first table and comb the second one. I assume the service numbers are sorted the same way as in the combinations column.
select service.*
from service
join
(
  select S.custid, S.transid
  from
  (
    -- build e.g. 'Y00049Y00138': 'Y' plus the service number zero-padded to 5 digits
    -- (a fixed 'Y000' prefix would turn 138 into 'Y000138' and never match)
    select custid, transid,
           listagg('Y' || lpad(service, 5, '0')) within group (order by service) as agg
    from service
    group by custid, transid
  ) S
  where not exists
  (
    select *
    from comb
    where S.agg = comb.combinations
  )
) NOT_F on NOT_F.custid = service.custid and NOT_F.transid = service.transid
I dare to say that your database design does not conform to the first normal form since the combinations column is not atomic. Think about it.
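To make the normalization remark concrete, here is a rough sketch of what an atomic combinations table could look like and how the check then becomes a plain join. It uses SQLite from Python, not DB2 or Oracle. The table comb_item (one row per combination and service) and the function name are assumptions made for this illustration, and the rule implemented is the one from the question: a transaction is fine if at least one combination has all of its services present in that transaction.

```python
import sqlite3


def incomplete_rows():
    """Return the service rows of transactions that satisfy no combination."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE service (custID INTEGER, service INTEGER, transID INTEGER);
        INSERT INTO service VALUES
            (1, 20, 1), (1, 20, 2), (1, 50, 2),
            (2, 49, 1), (2, 138, 1),
            (3, 80, 1), (3, 140, 1);

        -- hypothetical normalized form of the combinations column
        CREATE TABLE comb_item (comboID INTEGER, service INTEGER);
        INSERT INTO comb_item VALUES
            (1, 20), (1, 50),
            (2, 49), (2, 138),
            (3, 20), (3, 49),
            (4, 20), (4, 80), (4, 140);
        """
    )
    rows = conn.execute(
        """
        WITH trans AS (
            SELECT DISTINCT custID, transID FROM service
        ),
        complete AS (
            -- pair every transaction with every combination item; a combination
            -- is satisfied when each of its items found a matching service row
            SELECT t.custID, t.transID
            FROM trans AS t
            CROSS JOIN comb_item AS c
            LEFT JOIN service AS s
                   ON s.custID = t.custID
                  AND s.transID = t.transID
                  AND s.service = c.service
            GROUP BY t.custID, t.transID, c.comboID
            HAVING COUNT(*) = COUNT(s.service)
        )
        SELECT s.custID, s.service, s.transID
        FROM service AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM complete AS k
            WHERE k.custID = s.custID AND k.transID = s.transID
        )
        ORDER BY s.custID, s.transID, s.service
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(incomplete_rows())
```

With the sample data this returns the first row of the first table and the two rows of customer 3, which is what the question asks for. The sketch assumes a service appears at most once per transaction; duplicates would need a DISTINCT before counting.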

subtract every next column value from previous?

I have a dataset in which each row's values have somehow been added on top of the previous row's values, and this holds for every column. That means
the row with ID 1 holds the original, pure data, but the row with e.g. ID 10 has the data of the previous 9 rows added onto it.
What I want now is the original, pure data for every distinct item, i.e. for every ID. How can I subtract the accumulated data from, let's say, ID 10? I would have to subtract the values of the previous row, ID 9, and so on.
I want to do this in either SQL Server or RapidMiner, the tools I am working with. Any ideas?
here is a sample:
ID col1 col2 col3
1 12 2 3
2 15 5 5
3 20 8 8
So the real, correct data for the item with ID 3 is not 20, 8, 8 but (20-15), (8-5), (8-5), i.e. 5, 3, 3.
That is, subtract the previous row from each row, for every item except the first, which stays as it is:
1 12 2 3
Try it out with the Lag Series operator, it should work! To get this operator, install the Series extension from the RapidMiner Marketplace.
What this operator does: it copies the selected attributes and shifts every row of the example set by one position, so the values of the row with ID 1 end up as a copy on the row with ID 2, etc. (you can also specify the lag value). Afterwards you can subtract one value from the other with Generate Attributes.
I think lag() is the answer to your question:
select (case when id = 1 then col
else col - lag(col) over (order by id)
end)
However, sample data would clarify the question.
Within RapidMiner there is the Differentiate operator contained in the Series extension (which is not installed by default and needs to be downloaded from the RapidMiner Marketplace). This can be used to calculate differences between attributes in adjacent examples.
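As a small illustration of the lag() suggestion, the sketch below applies it to the three sample rows from the question. It runs on SQLite from Python instead of SQL Server (window functions need SQLite 3.25 or newer); the table name cumulative and the function name are made up. It uses the three-argument form LAG(col, 1, 0), which supplies 0 for the first row, in place of the CASE expression shown above.

```python
import sqlite3


def original_values():
    """Undo the running totals by subtracting the previous row from each row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cumulative (id INTEGER, col1 INTEGER, col2 INTEGER, col3 INTEGER);
        INSERT INTO cumulative VALUES (1, 12, 2, 3), (2, 15, 5, 5), (3, 20, 8, 8);
        """
    )
    rows = conn.execute(
        """
        SELECT id,
               -- LAG(x, 1, 0): value of x one row earlier, or 0 on the first row
               col1 - LAG(col1, 1, 0) OVER (ORDER BY id) AS col1,
               col2 - LAG(col2, 1, 0) OVER (ORDER BY id) AS col2,
               col3 - LAG(col3, 1, 0) OVER (ORDER BY id) AS col3
        FROM cumulative
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in original_values():
        print(row)
```

The row with ID 1 comes back unchanged and the row with ID 3 becomes 5, 3, 3, matching the example in the question.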

Multicriteria Insert/Update

I'm trying to create a query that will insert new records to a table or update already existing records, but I'm getting stuck on the filtering and grouping for the criteria I want.
I have two tables: tbl_PartInfo, and dbo_CUST_BOOK_LINE.
I want to select from dbo_CUST_BOOK_LINE based upon the combination of CUST_ORDER_ID, CUST_ORDER_LINE_NO, and REVISION_ID. Each customer order can have multiple lines, and each line can have multiple revisions. I'm trying to select the unique combinations of each order and its connected lines, but take the connected information from the row with the highest value in the revision column.
I want to insert/update from dbo_CUST_BOOK_LINE the following columns:
CUST_ORDER_ID
PART_ID
USER_ORDER_QTY
UNIT_PRICE
I want to insert/update them into tbl_PartInfo as the following columns respectively:
JobID
DrawingNumber
Quantity
UnitPrice
So if I have the following rows in dbo_CUST_BOOK_LINE (PART_ID omitted for example)
CUST_ORDER_ID CUST_ORDER_LINE_NO REVISION_ID USER_ORDER_QTY UNIT_PRICE
SCabc 1 1 0 100
SCabc 1 2 4 150
SCabc 1 3 4 125
SCabc 2 3 2 200
SCxyz 1 1 0 0
SCxyz 1 2 3 50
It would return
CUST_ORDER_ID CUST_ORDER_LINE_NO (REVISION_ID) USER_ORDER_QTY UNIT_PRICE
SCabc 1 3 4 125
SCabc 2 3 2 200
SCxyz 1 2 3 50
but with PART_ID included and without REVISION_ID
So far, my code covers only the insert portion, as I was trying to get the correct records selected, but I keep getting duplicates of CUST_ORDER_ID and CUST_ORDER_LINE_NO.
INSERT INTO tbl_PartInfo ( JobID, DrawingNumber, Quantity, UnitPrice, ProductFamily, ProductCategory )
SELECT dbo_CUST_BOOK_LINE.CUST_ORDER_ID, dbo_CUST_BOOK_LINE.PART_ID, dbo_CUST_BOOK_LINE.USER_ORDER_QTY, dbo_CUST_BOOK_LINE.UNIT_PRICE, dbo_CUST_BOOK_LINE.CUST_ORDER_LINE_NO, Max(dbo_CUST_BOOK_LINE.REVISION_ID) AS MaxOfREVISION_ID
FROM dbo_CUST_BOOK_LINE, tbl_PartInfo
GROUP BY dbo_CUST_BOOK_LINE.CUST_ORDER_ID, dbo_CUST_BOOK_LINE.PART_ID, dbo_CUST_BOOK_LINE.USER_ORDER_QTY, dbo_CUST_BOOK_LINE.UNIT_PRICE, dbo_CUST_BOOK_LINE.CUST_ORDER_LINE_NO;
This has been far more complicated than anything I've done so far, so any help would be greatly appreciated. Sorry about the long column names, I didn't get to choose them.
I did some research and think I found a way to make it work, but I'm still testing it. Right now I'm using three queries, but it should be easily simplified into two when complete.
The first is an append query that takes the two columns I want distinct combos of, selects them using "group by," and also selects the max of the revision column. It appends the result to another table that I'm calling tbl_TempDrop. This table is only being used right now to reduce the number of results before the next part.
The second is an update query that updates tbl_TempDrop to include all the other columns I wanted by setting the criteria equal to the three selected columns from the first query. This took an EXTREMELY long time to complete when I had 700,000 records to work with, hence the use of the tbl_TempDrop.
The third query is a basic append query that appends the rows of tbl_TempDrop to the end destination, tbl_PartInfo.
All that's left is to run all three in a row.
I didn't want to include the full details of any tables or queries yet until I ensure that it works as desired, and because some of the names are vague since I will be using this method for multiple query searches.
This website helped me a little to make sure I had the basic idea down. http://www.techonthenet.com/access/queries/max_query2_2007.php
Let me know if you see any flaws in the approach!
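The selection step described above (one row per order and line, taken from the highest revision) can also be sketched as a single query that joins the table to its own per-group maximum. The example below runs on SQLite from Python with the sample rows from the question, so PART_ID is left out just as it is there. The function name is made up, and whether the same join performs acceptably in Access on 700,000 records is not something this sketch can show.

```python
import sqlite3

# Sample rows from the question: order, line, revision, quantity, unit price.
BOOK_LINES = [
    ("SCabc", 1, 1, 0, 100),
    ("SCabc", 1, 2, 4, 150),
    ("SCabc", 1, 3, 4, 125),
    ("SCabc", 2, 3, 2, 200),
    ("SCxyz", 1, 1, 0, 0),
    ("SCxyz", 1, 2, 3, 50),
]


def latest_revisions():
    """Return one row per (order, line), taken from its highest revision."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbo_CUST_BOOK_LINE (CUST_ORDER_ID TEXT, "
        "CUST_ORDER_LINE_NO INTEGER, REVISION_ID INTEGER, "
        "USER_ORDER_QTY INTEGER, UNIT_PRICE INTEGER)"
    )
    conn.executemany("INSERT INTO dbo_CUST_BOOK_LINE VALUES (?, ?, ?, ?, ?)", BOOK_LINES)
    rows = conn.execute(
        """
        SELECT b.CUST_ORDER_ID, b.CUST_ORDER_LINE_NO, b.USER_ORDER_QTY, b.UNIT_PRICE
        FROM dbo_CUST_BOOK_LINE AS b
        JOIN (
            -- highest revision for each order/line combination
            SELECT CUST_ORDER_ID, CUST_ORDER_LINE_NO, MAX(REVISION_ID) AS MaxRev
            FROM dbo_CUST_BOOK_LINE
            GROUP BY CUST_ORDER_ID, CUST_ORDER_LINE_NO
        ) AS m
          ON m.CUST_ORDER_ID = b.CUST_ORDER_ID
         AND m.CUST_ORDER_LINE_NO = b.CUST_ORDER_LINE_NO
         AND m.MaxRev = b.REVISION_ID
        ORDER BY b.CUST_ORDER_ID, b.CUST_ORDER_LINE_NO
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_revisions():
        print(row)
```

Grouping only by the order and line columns, and fetching the remaining columns through the join, is what avoids the duplicates: grouping by quantity and price as well gives one group per distinct revision.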

Delete duplicates when the duplicates are not in the same column

Here is a sample of my data (n>3000) that ties two numbers together:
id a b
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
5 7030344 7030342
6 7030364 7008059
7 7030659 7066051
8 7030345 7030343
9 7031815 7045692
10 7032644 7102337
Now, the problem is that id=2 is a duplicate of id=5 and id=4 is a duplicate of id=8. So, when I tried to write if-then statements to map column a to column b, basically the numbers just get swapped. There are many cases like this in my full data.
So, my question is to identify the duplicate(s) and somehow delete one of the duplicates (either id=2 or id=5). And I preferably want to do this in Excel but I could work with SQL Server or SAS, too.
Thank you in advance. Please comment if my question is not clear.
What I want:
id a b
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
6 7030364 7008059
7 7030659 7066051
9 7031815 7045692
10 7032644 7102337
All sorts of ways to do this.
In SAS or SQL, this is simple (for SQL Server, the SQL portion should be identical or nearly so):
data have;
input id a b;
datalines;
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
5 7030344 7030342
6 7030364 7008059
7 7030659 7066051
8 7030345 7030343
9 7031815 7045692
10 7032644 7102337
;;;;
run;
proc sql undopolicy=none;
delete from have H where exists (
select 1 from have V where V.id < H.id
and ((V.a=H.a and V.b=H.b) or (V.a=H.b and V.b=H.a))
);
quit;
The Excel solution would, I believe, require creating an additional column with the concatenation of the two strings, in a fixed order (any order will do), and then a lookup to see whether that is the first row with that value or not. I don't think you can do it without creating an additional column (or using VBA, which, if you can use it, will have a fairly simple solution as well).
Edit:
Actually, the Excel solution IS possible without creating a new column (well, you need to put this formula somewhere, but without ANOTHER additional column).
=IF(OR(AND(COUNTIF(B$1:B1,B2),COUNTIF(C$1:C1,C2)),AND(COUNTIF(B$1:B1,C2),COUNTIF(C$1:C1,B2))),"DUPLICATE","")
Assuming ID is in A, B and C contain the values (and there is no header row). That formula goes in the second row (ie, B2/C2 values) and then is extended to further rows (so row 36 will have the arrays be B1:B35 and C1:C35 etc.). That puts DUPLICATE in the rows which are duplicates of something above and blank in rows that are unique.
I haven't tested this out, but here is some food for thought: you could join the table against itself to get the IDs that have duplicates.
SELECT
t1.id, t1.a, t1.b
FROM
[myTable] t1
INNER JOIN ( SELECT id, a, b FROM [myTable] ) tbl2
ON t1.a = tbl2.b
AND t1.b = tbl2.a
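The delete-the-later-duplicate idea from the answers above can be tried out on the question's ten sample rows with the sketch below. It uses SQLite from Python rather than SAS or SQL Server, and the table name pairs and the function name are made up. The parentheses around the two pair comparisons matter: without them, the id comparison would apply only to the first alternative.

```python
import sqlite3

# The ten sample rows from the question: (id, a, b).
PAIRS = [
    (1, 7028344, 7181310),
    (2, 7030342, 7030344),
    (3, 7030354, 7030353),
    (4, 7030343, 7030345),
    (5, 7030344, 7030342),
    (6, 7030364, 7008059),
    (7, 7030659, 7066051),
    (8, 7030345, 7030343),
    (9, 7031815, 7045692),
    (10, 7032644, 7102337),
]


def dedupe_mirrored_pairs():
    """Delete rows that repeat an earlier (a, b) pair in either order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pairs (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO pairs VALUES (?, ?, ?)", PAIRS)
    conn.execute(
        """
        DELETE FROM pairs
        WHERE EXISTS (
            SELECT 1
            FROM pairs AS v
            WHERE v.id < pairs.id  -- only an earlier row can make this one a duplicate
              AND ((v.a = pairs.a AND v.b = pairs.b)
                OR (v.a = pairs.b AND v.b = pairs.a))
        )
        """
    )
    remaining = [row[0] for row in conn.execute("SELECT id FROM pairs ORDER BY id")]
    conn.close()
    return remaining


if __name__ == "__main__":
    print(dedupe_mirrored_pairs())
```

On the sample this removes the rows with id 5 and id 8 and keeps the eight rows listed under "What I want".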

SQL statement to switch values

I have several tables where a field holds a priority (1 to 5). The problem is that some projects have been using 5 as the highest and some 1 as the highest, and I am going to harmonize this.
My easy option is to create a temp table, copy the data over, and switch the values according to this table:
1 -> 5
2 -> 4
3 -> 3
4 -> 2
5 -> 1
I'm not that good with SQL, but it feels like there should be an easy way to switch those values directly with a statement. I do have concerns, though, about huge amounts of data: if something goes wrong halfway, the data will be in a mess.
Should I just go with my temp table solution, or do you have a nice way of doing this straight in SQL? (Oracle 10g is being used)
Many thanks!
simply update the second table like this, a temp table is not needed because you are just reversing the priority:
update table_2
set priority = 6-priority;
You can use a CASE statement
case PRIORITY
when 5 then 1
when 4 then 2
when 3 then 3
when 2 then 4
when 1 then 5
else PRIORITY
end
Edit: tekBlues' solution is much better, but I'll leave this here for cases where the maths isn't as neat.
To be sure that no 'mess' results if the update goes awry, use a transaction. Building on tekBlues' solution (+1 for this):
START TRANSACTION;
update table_2
set priority = 6-priority;
...
COMMIT;
This is especially valid if you want to update multiple tables in one go. Single statements are handled implicitly, as hainstech correctly pointed out in his comment.
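To see the 6 - priority update and the transaction advice together, here is a small sketch using SQLite from Python instead of Oracle 10g. The table name table_2 comes from the answers above; the id column, the sample values and the function names are assumptions for the illustration. In Python's sqlite3 module, using the connection as a context manager commits when the block succeeds and rolls back when it raises, which plays the role of the explicit COMMIT above.

```python
import sqlite3


def flip_priorities(conn):
    """Reverse a 1..5 priority scale in place, inside one transaction."""
    # "with conn" commits on success and rolls back if an exception is raised,
    # so a failure cannot leave the table half-converted
    with conn:
        conn.execute("UPDATE table_2 SET priority = 6 - priority")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_2 (id INTEGER PRIMARY KEY, priority INTEGER)")
    conn.executemany(
        "INSERT INTO table_2 (priority) VALUES (?)", [(p,) for p in (1, 2, 3, 4, 5)]
    )
    conn.commit()
    flip_priorities(conn)
    flipped = [p for (p,) in conn.execute("SELECT priority FROM table_2 ORDER BY id")]
    conn.close()
    return flipped


if __name__ == "__main__":
    print(demo())
```

Because every row is computed from its own old value within a single statement, no temporary table is needed; running the update twice restores the original values.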