How to rollup specific strings in a query - sql

I would like to combine rows that share values in specific columns, so that certain items are listed together and others are excluded.
I have tried string_agg, GROUP BY and self joins. I suspect I simply need a better self join, but I am not sure.
one two three four
1 1 a NULL
2 4 b e
3 7 c x
3 7 c z
I would like it to look something like this (with the elements that were the same remaining unsegregated)
one two three four
1 1 a NULL
2 4 b e
3 7 c x,z

If you are using MySQL:
SELECT one, two, three, GROUP_CONCAT(four)
FROM table
GROUP BY one, two, three
Otherwise, this is a bad thing to do in an RDBMS, because it is not a relational operation.
You should do this on the client side of your project.
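As a minimal sketch of the GROUP_CONCAT approach above, the following Python script uses the standard-library sqlite3 module, whose group_concat aggregate plays the same role as MySQL's GROUP_CONCAT. The table name t and the in-memory database are assumptions made for illustration. SQLite does not guarantee the order of the concatenated values, so do not rely on "x,z" versus "z,x".

```python
import sqlite3

# In-memory database holding the sample rows from the question.
# The table name "t" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (one INTEGER, two INTEGER, three TEXT, four TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 1, "a", None), (2, 4, "b", "e"), (3, 7, "c", "x"), (3, 7, "c", "z")],
)

# Roll up column "four" for every distinct (one, two, three) combination.
rolled_up = conn.execute(
    """
    SELECT one, two, three, group_concat(four, ',') AS four
    FROM t
    GROUP BY one, two, three
    ORDER BY one
    """
).fetchall()

for row in rolled_up:
    print(row)
```

A group whose only value in four is NULL keeps NULL, because the aggregate skips NULL inputs.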

Related

How to pivot wider in SQL or use a more "dynamic" form of the LEAD function?

I have a table that looks as follows:
Policy Number Benefit Code Transaction Code
1 A 2
1 B 1
2 A 3
3 A 2
1 C 2
For analysis purposes, it would be much more convenient to have the table in the following form:
PN BC 1 TC 1 BC 2 TC 2 BC 3 TC 3
1 A 2 B 1 C 2
2 A 3 NULL NULL NULL NULL
3 A 2 NULL NULL NULL NULL
I believe this can be done, for example, in R using the tidyverse package, where the concept is basically pivoting the table from long-form to wide-form. Now, I know that I could possibly use the LEAD function in SQL, but the problem/issue is that I do not know how many benefit codes and transaction codes each policy has (i.e. they are not fixed).
Thus, my query is:
How can I "pivot wider" my table to achieve something like the above?
Other than "pivoting wider", is there a more dynamic form of the LEAD function in SQL, where it takes all subsequent rows of a group (in my case, each policy number) and puts them in new columns?
Any intuitive explanations or suggestions will be greatly appreciated :)
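One possible sketch, not a definitive answer: number the rows within each policy with ROW_NUMBER, then use conditional aggregation to spread them into columns. Because the number of column pairs is not fixed, the script below first asks the database for the largest group and builds the column list from that. It uses Python's sqlite3 module and assumes SQLite 3.25 or later (for window functions). The table name policy, the short column names pn/bc/tc, and the choice to order rows within a policy by benefit code are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (pn INTEGER, bc TEXT, tc INTEGER)")
conn.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [(1, "A", 2), (1, "B", 1), (2, "A", 3), (3, "A", 2), (1, "C", 2)],
)

# Largest number of rows any one policy has; this decides how many
# (benefit code, transaction code) column pairs the wide table needs.
(max_n,) = conn.execute(
    "SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM policy GROUP BY pn)"
).fetchone()

# One pair of columns per position within a policy.
pairs = ", ".join(
    f"MAX(CASE WHEN rn = {i} THEN bc END) AS bc_{i}, "
    f"MAX(CASE WHEN rn = {i} THEN tc END) AS tc_{i}"
    for i in range(1, max_n + 1)
)

# Number rows per policy (assumed order: by benefit code), then pivot.
wide = conn.execute(
    f"""
    SELECT pn, {pairs}
    FROM (
        SELECT pn, bc, tc,
               ROW_NUMBER() OVER (PARTITION BY pn ORDER BY bc) AS rn
        FROM policy
    )
    GROUP BY pn
    ORDER BY pn
    """
).fetchall()

for row in wide:
    print(row)
```

Policies with fewer rows than the largest group get NULL in the trailing columns, which matches the layout asked for.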

How to merge two tables with a different amount and order of columns in SSMS?

I have one large table with ~10,000 rows of data and 100 columns that I want to continuously update. The problem is that the files I will use to update (.csv) often are in different orders or contain extra/missing columns. If there are extra columns in the update I am fine discarding them, but I want the remaining columns to match up exactly, even if some are missing or out of order.
I know that there is a solution in creating a select and simply listing all columns, but I am looking for something more elegant/foolproof. Many of the examples I have seen work well enough using MERGE, UNION, or JOIN but I can't get them to work for this much larger dataset, which is why it has been giving me so much trouble. I am not very experienced with SQL so I would appreciate some additional padding to the explanation.
Where a, b, c, d are columns and 1 is data, here is the master table:
a b c d
1 1 1 1
Here is the update table:
b c d e
1 _ 1 1
Only imagine that there are 100 columns and 100 rows to append to the 10,000 stored.
Desired:
a b c d e
1 1 1 1
_ 1 _ 1 1
Or even
a b c d
1 1 1 1
_ 1 _ 1
e:
This answer is exactly what I want, but it doesn't seem possible in TSQL
https://stackoverflow.com/a/52524364/11777090
Use UNION ALL, padding the columns each table lacks with a placeholder value:
select a,b,c,d,0 from table
union all
select 0,b,c,d,e from table
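To avoid typing out 100 columns by hand, the select list for the UNION ALL can be generated from the table metadata. The sketch below does this with Python's sqlite3 module; the table names master and upd are hypothetical, and PRAGMA table_info is SQLite-specific. In SQL Server the column names would have to come from a catalog view such as INFORMATION_SCHEMA.COLUMNS and the statement would be assembled as dynamic SQL, which is an assumption not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (a, b, c, d)")
conn.execute("CREATE TABLE upd (b, c, d, e)")  # hypothetical update table
conn.execute("INSERT INTO master VALUES (1, 1, 1, 1)")
conn.execute("INSERT INTO upd VALUES (1, NULL, 1, 1)")


def column_names(table):
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


master_cols = column_names("master")
update_cols = set(column_names("upd"))

# For each master column, take it from the update table when it exists
# there, otherwise pad with NULL. Extra update columns (here "e") are dropped.
select_list = ", ".join(
    col if col in update_cols else f"NULL AS {col}" for col in master_cols
)

combined = conn.execute(
    f"SELECT {', '.join(master_cols)} FROM master "
    f"UNION ALL SELECT {select_list} FROM upd"
).fetchall()

for row in combined:
    print(row)
```

This reproduces the second desired layout from the question (columns a to d only), with NULL standing in for the missing values.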

Fact table joins

I have two fact tables, A and B.
A has a positionkey column, and B has four columns called position_1, position_2, position_3 and position_4, each holding a count for that position. For example, if B has 2 under position_1, two people with position 1 were assigned; if B has 1 under position_2, one person with position 2 was assigned.
I want to join these two tables by position and other keys.
Is this possible?
You could use a CASE in your JOIN conditions.
ON a.PositionValue = CASE
WHEN a.PositionKey=1 THEN b.Position_1
WHEN a.PositionKey=2 THEN b.Position_2
etc...
END
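A runnable sketch of the CASE-in-JOIN idea, using Python's sqlite3 module. The table names fact_a and fact_b, the other_key column and the position_value column are all hypothetical, since the question does not show the real schema; the point is only that the CASE expression picks which position column of B is compared for each row of A.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_a (other_key, position_key, position_value)")
conn.execute(
    "CREATE TABLE fact_b (other_key, position_1, position_2, position_3, position_4)"
)
conn.executemany(
    "INSERT INTO fact_a VALUES (?, ?, ?)",
    [(10, 1, 2), (10, 2, 1), (10, 3, 5)],
)
conn.execute("INSERT INTO fact_b VALUES (10, 2, 1, 0, 0)")

# The CASE expression selects the B column that corresponds to A's position key.
matched = conn.execute(
    """
    SELECT a.other_key, a.position_key
    FROM fact_a AS a
    JOIN fact_b AS b
      ON a.other_key = b.other_key
     AND a.position_value = CASE a.position_key
                                WHEN 1 THEN b.position_1
                                WHEN 2 THEN b.position_2
                                WHEN 3 THEN b.position_3
                                WHEN 4 THEN b.position_4
                            END
    ORDER BY a.position_key
    """
).fetchall()

print(matched)
```

Here the rows with position keys 1 and 2 join, and the row with position key 3 does not, because its value 5 differs from position_3 in B.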

Find 'Most Similar' Items in Table by Foreign Key

I have a child table with a number of charact/value pairs for a given 'material' (MaterialID). Any material can have a number of charact values and may have several of the same name (see id's 2,3).
The table has a large number of records (8+ million). What I'm trying to do is find the materials that are the most similar to a supplied material. That is, when I supply a MaterialID, I would like an ordered list of the most similar other materials (those with the most matching charact/value pairs).
I've done some research, but I may be missing some key terms or just not conceptualizing the problem correctly.
Any hints as to how to go about this would be very much appreciated.
ID MaterialID Charact Value
1 1 ROT_DIR CCW
2 1 SPECIAL_FEATURE CATALOG_CP
3 1 SPECIAL_FEATURE CHROME
4 1 SCHEDULE 80
5 2 BEARING_TYPE SB
6 2 SCHEDULE 80
7 3 ROT_DIR CCW
8 3 SPECIAL_FEATURE CATALOG_HSB
9 3 BEARING_TYPE SP
10 4 NDE_STYLE W_FAN
11 4 BEARING_TYPE SB
12 4 ROT_DIR CW*
You can do this with a self join:
select t.materialid, count(*) as nummatches
from t join
t tmat
on t.Charact = tmat.Charact and t.value = tmat.value
where tmat.materialid = @MaterialId
group by t.materialid
order by nummatches desc;
Notes:
You might want to exclude the specified material itself by adding t.MaterialId <> tmat.MaterialId to the where clause.
If you want all materials, including those with no matches, make the join a left join, move the where condition to the on clause, and count tmat.materialid instead of * so that unmatched materials score 0.
If you want only one material with the most matches, use select top 1.
If you want all materials with the most matches when there are ties, use select top (1) with ties.
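The self join can be tried against the sample rows from the question. This sketch uses Python's sqlite3 module, so the parameter is bound with ? rather than a T-SQL variable, and there is no top clause. It also applies the first note (excluding the supplied material) and adds materialid as a tie-breaker in the ordering, which is an assumption made only to keep the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, materialid INTEGER, charact TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 1, "ROT_DIR", "CCW"),
        (2, 1, "SPECIAL_FEATURE", "CATALOG_CP"),
        (3, 1, "SPECIAL_FEATURE", "CHROME"),
        (4, 1, "SCHEDULE", "80"),
        (5, 2, "BEARING_TYPE", "SB"),
        (6, 2, "SCHEDULE", "80"),
        (7, 3, "ROT_DIR", "CCW"),
        (8, 3, "SPECIAL_FEATURE", "CATALOG_HSB"),
        (9, 3, "BEARING_TYPE", "SP"),
        (10, 4, "NDE_STYLE", "W_FAN"),
        (11, 4, "BEARING_TYPE", "SB"),
        (12, 4, "ROT_DIR", "CW*"),
    ],
)

target = 1  # the supplied MaterialID

# Count matching charact/value pairs between the target and every other material.
ranked = conn.execute(
    """
    SELECT t.materialid, COUNT(*) AS nummatches
    FROM t
    JOIN t AS tmat
      ON t.charact = tmat.charact AND t.value = tmat.value
    WHERE tmat.materialid = ?
      AND t.materialid <> tmat.materialid
    GROUP BY t.materialid
    ORDER BY nummatches DESC, t.materialid
    """,
    (target,),
).fetchall()

print(ranked)
```

With the sample data, materials 2 and 3 each share one pair with material 1, and material 4 shares none, so it does not appear with an inner join.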

Delete duplicates when the duplicates are not in the same column

Here is a sample of my data (n>3000) that ties two numbers together:
id a b
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
5 7030344 7030342
6 7030364 7008059
7 7030659 7066051
8 7030345 7030343
9 7031815 7045692
10 7032644 7102337
Now, the problem is that id=2 is a duplicate of id=5 and id=4 is a duplicate of id=8. So, when I tried to write if-then statements to map column a to column b, basically the numbers just get swapped. There are many cases like this in my full data.
So, my question is how to identify the duplicates and delete one row of each pair (either id=2 or id=5). Preferably I want to do this in Excel, but I could work with SQL Server or SAS, too.
Thank you in advance. Please comment if my question is not clear.
What I want:
id a b
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
6 7030364 7008059
7 7030659 7066051
9 7031815 7045692
10 7032644 7102337
All sorts of ways to do this.
In SAS or SQL, this is simple (for SQL Server, the SQL portion should be identical or nearly so):
data have;
input id a b;
datalines;
1 7028344 7181310
2 7030342 7030344
3 7030354 7030353
4 7030343 7030345
5 7030344 7030342
6 7030364 7008059
7 7030659 7066051
8 7030345 7030343
9 7031815 7045692
10 7032644 7102337
;;;;
run;
proc sql undopolicy=none;
delete from have H where exists (
select 1 from have V where V.id < H.id
and ((V.a=H.a and V.b=H.b) or (V.a=H.b and V.b=H.a))
);
quit;
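For anyone without SAS at hand, here is a sketch of the same delete run through Python's sqlite3 module. SQLite is a stand-in here, not SQL Server or SAS, so the exact DELETE syntax is an assumption that may need adjusting for another engine. Note the parentheses around the two pair comparisons: without them, the id condition would only apply to the first comparison, and both rows of a swapped pair would be deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE have (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO have VALUES (?, ?, ?)",
    [
        (1, 7028344, 7181310),
        (2, 7030342, 7030344),
        (3, 7030354, 7030353),
        (4, 7030343, 7030345),
        (5, 7030344, 7030342),
        (6, 7030364, 7008059),
        (7, 7030659, 7066051),
        (8, 7030345, 7030343),
        (9, 7031815, 7045692),
        (10, 7032644, 7102337),
    ],
)

# Delete a row when an earlier row holds the same pair in either order.
conn.execute(
    """
    DELETE FROM have
    WHERE EXISTS (
        SELECT 1 FROM have AS v
        WHERE v.id < have.id
          AND ((v.a = have.a AND v.b = have.b)
            OR (v.a = have.b AND v.b = have.a))
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM have ORDER BY id")]
print(remaining_ids)
```

The rows with id 5 and id 8 are removed, leaving the table shown under "What I want".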
The Excel solution would, I believe, require creating an additional column holding the concatenation of the two values in a consistent order (any order will do, as long as it is the same for every row), and then a lookup to see whether that is the first row with that value. I don't think you can do it without creating an additional column (or using VBA, which, if you can use it, also allows a fairly simple solution).
Edit:
Actually, the Excel solution IS possible without creating a new column (well, you need to put this formula somewhere, but without ANOTHER additional column).
=IF(OR(COUNTIFS(B$1:B1,B2,C$1:C1,C2),COUNTIFS(B$1:B1,C2,C$1:C1,B2)),"DUPLICATE","")
This assumes ID is in column A, columns B and C contain the values, and there is no header row. The formula goes in the second row (i.e., alongside the B2/C2 values) and is then extended to further rows (so row 36 will have the ranges B$1:B35 and C$1:C35, etc.). It puts DUPLICATE in the rows that duplicate something above and leaves unique rows blank.
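The "is this the first row with that pair?" logic can also be sketched outside Excel. This short Python script is an illustration only: it builds an order-independent key for each pair (the same idea as concatenating the two values in a consistent order) and flags a row when its key has already been seen further up.

```python
# Sample rows from the question as (id, a, b).
rows = [
    (1, 7028344, 7181310),
    (2, 7030342, 7030344),
    (3, 7030354, 7030353),
    (4, 7030343, 7030345),
    (5, 7030344, 7030342),
    (6, 7030364, 7008059),
    (7, 7030659, 7066051),
    (8, 7030345, 7030343),
    (9, 7031815, 7045692),
    (10, 7032644, 7102337),
]

seen = set()   # pair keys encountered so far
flags = {}     # id -> "DUPLICATE" or ""

for row_id, a, b in rows:
    key = (min(a, b), max(a, b))  # same key whichever column holds which value
    flags[row_id] = "DUPLICATE" if key in seen else ""
    seen.add(key)

duplicates = [row_id for row_id, flag in flags.items() if flag]
print(duplicates)
```

As with the formula, only the later row of each pair is flagged, so the first occurrence is kept.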
I haven't tested this, but here is some food for thought: you could join the table against itself to get the IDs of the rows that have a swapped duplicate.
SELECT
[myTable].id, [myTable].a, [myTable].b
FROM
[myTable]
INNER JOIN ( SELECT id, a, b FROM [myTable] ) tbl2
ON [myTable].a = tbl2.b
AND [myTable].b = tbl2.a