Counting empty fields in a row - sql

I'm new to SQL and I'm not sure whether what I want to do is possible.
Basically, I want to count the number of empty fields present in each row and add that count as a column at the end, like so:
Original data:
ID    Code  Name   Age
1111  aaa   name1  23
1111  bbb   name1  23
2222  cccc
3333  fdfd         34
3333  rrrr
Result:
ID    Code  Name   Age  Empty Fields
1111  aaa   name1  23   0
1111  bbb   name1  23   0
2222  cccc              2
3333  fdfd         34   1
3333  rrrr              2
After that, I want to concatenate the Code field for each duplicate ID and delete the row with the higher Empty Fields count, like so:
First:
ID    Code       Name   Age  Empty Fields
1111  aaa,bbb    name1  23   0
1111  aaa,bbb    name1  23   0
2222  cccc                   2
3333  fdfd,rrrr         34   1
3333  rrrr,fdfd              2
End Result:
ID    Code       Name   Age  Empty Fields
1111  aaa,bbb    name1  23   0
2222  cccc                   2
3333  fdfd,rrrr         34   1

This is an inefficient method, but since you've laid out your steps I assume there's some reason for this logic.
Count the number of missing values using CMISS, which counts across character and numeric variables. Set the new variable to 0 first so it's not included in the missing count. Use _all_ to refer to all variables in the data set.
*calculate number missing;
data step1;
   set have;
   *set N_Missing to 0 so it is not counted as missing;
   N_Missing = 0;
   *count missing;
   N_Missing = cmiss(of _all_);
run;
Combine CODE into one field, using a data step. (Note: the data step must read the sorted data set, step1, since BY-group processing requires sorted input.)
*combine CODE values into one field;
proc sort data=step1;
   by id code;
run;

data step2;
   set step1;
   by ID code;
   length Code_Agg $80;
   retain Code_Agg;
   if first.ID then Code_Agg = Code;
   else Code_Agg = catx(",", Code_Agg, Code);
   if last.ID then output;
   keep ID Code_Agg;
run;
Merge the output of the two steps above to get the middle table. You may need to sort the data ahead of time.
*merge results with prior table;
data step3;
   merge step1 step2;
   by ID;
run;
Keep only the record of interest using NODUPKEY in PROC SORT, which keeps only the first record in each BY group. Note the double sort: the first sort orders each ID's records so that the one with the fewest missing values comes first.
proc sort data=step3;
   by ID N_Missing;
run;

proc sort data=step3 out=final nodupkey;
   by ID;
run;
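If you do want this in plain SQL, as the question's title suggests, the same three steps can be expressed in a database that has window functions and string aggregation. This is only a sketch, in PostgreSQL syntax; the table name have, and the assumption that Code and Name are text while Age is numeric, are carried over from the example:

with counted as (
    -- step 1: count empty fields per row
    select ID, Code, Name, Age,
           (case when Code is null or Code = '' then 1 else 0 end
          + case when Name is null or Name = '' then 1 else 0 end
          + case when Age  is null              then 1 else 0 end) as Empty_Fields
    from have
),
codes as (
    -- step 2: concatenate CODE values per ID
    select ID, string_agg(Code, ',' order by Code) as Code_Agg
    from have
    group by ID
),
best as (
    -- step 3: rank each ID's rows, fewest empty fields first
    select c.*,
           row_number() over (partition by ID order by Empty_Fields) as rn
    from counted c
)
select b.ID, k.Code_Agg as Code, b.Name, b.Age, b.Empty_Fields
from best b
join codes k on k.ID = b.ID
where b.rn = 1
order by b.ID;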

Related

Remove group rows

How do I remove group rows based on other columns for a particular ID such that:
ID   Att  Comp. Att  Inc. Att
aaa  2    0          2
aaa  2    0          2
bbb  3    1          2
bbb  3    1          2
bbb  3    0          2
becomes:
ID   Att  Comp. Att  Inc. Att
aaa  2    0          2
bbb  3    1          2
I need to discard rows which are not just exact duplicates, but which also convey the same data based on those columns.
Use drop_duplicates -- check out the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
I can't tell for sure from your description what you want to treat as a duplicate, but you can tell drop_duplicates which column(s) to look at.
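For the SQL readers of this thread: the closest analogue is grouping to one row per ID. A sketch only; the table and column names (attempts, Comp_Att, Inc_Att) are assumed from the sample, and MAX is an arbitrary tie-breaking rule that happens to reproduce the sample output:

-- one row per ID; names are assumed from the sample data
select ID, max(Att) as Att, max(Comp_Att) as Comp_Att, max(Inc_Att) as Inc_Att
from attempts
group by ID;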

How to do last observation carried forward using SAS PROC SQL

I have the data below. I want to write a SAS PROC SQL query to get the last non-missing value of weight for each patient (ptno).
data sda;
input ptno visit weight;
format ptno z3. ;
cards;
1 1 122
1 2 123
1 3 .
1 4 .
2 1 156
2 2 .
2 3 70
2 4 .
3 1 60
3 2 .
3 3 112
3 4 .
;
run;
proc sql noprint;
create table new as
select ptno,visit,weight,
case
when weight = . then weight
else .
end as _weight_1
from sda
group by ptno,visit
order by ptno,visit;
quit;
The SQL code above does not work well.
The desired output looks like this:
ptno visit weight
1 1 122
1 2 123
1 3 123
1 4 123
2 1 156
2 2 .
2 3 70
2 4 70
3 1 60
3 2 .
3 3 112
3 4 112
Since you effectively do have a row number (visit), you can do this - though it's much slower than the data step.
Here it is, broken out into a separate column for demonstration purposes - of course, in your case you will want to coalesce this into one column.
Basically, you need a subquery that determines the maximum visit number less than the current one that has a legitimate weight value, and then join that back to the table to get the weight.
proc sql;
   select ptno, visit, weight,
          ( select weight
            from sda A,
                 ( select ptno, max(visit) as visit
                   from sda D
                   where D.ptno = S.ptno
                     and D.visit < S.visit
                     and D.weight is not null
                   group by ptno
                 ) V
            where A.visit = V.visit and A.ptno = V.ptno
          )
   from sda S
   ;
quit;
Although you don't describe it that way, you do not carry the VISIT 1 value forward, right?
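For what it's worth, here is a sketch of the coalesced single-column version alluded to above: the same correlated subquery sits in an in-line view, and COALESCE (which treats SAS missing values as null) folds it in. The extra D.visit > 1 condition is my assumption, added to handle exactly that VISIT 1 point so the result matches the desired output:

proc sql;
   create table locf_sql as
   select ptno, visit,
          coalesce(weight, carried) as weight
   from ( select S.ptno, S.visit, S.weight,
                 ( select A.weight
                   from sda A,
                        ( select D.ptno, max(D.visit) as visit
                          from sda D
                          where D.ptno = S.ptno
                            and D.visit < S.visit
                            and D.visit > 1   /* assumption: never carry VISIT 1 forward */
                            and D.weight is not null
                          group by D.ptno
                        ) V
                   where A.visit = V.visit and A.ptno = V.ptno
                 ) as carried
          from sda S
        )
   order by ptno, visit;
quit;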
I don't know why you would want to do this using SQL; in SAS, a data step is much better suited to the task. I like using the "update trick". If you're interested in how this works, I will leave it to you to study the UPDATE statement.
data locf;
   *updating a zero-observation master against sda retains the last non-missing value within each ptno;
   update sda(obs=0 keep=ptno) sda;
   by ptno;
   output;
   *clear the retained weight after visit 1 so it is not carried forward;
   if visit eq 1 then call missing(weight);
run;

How can I identify the text values that have the lowest row IDs across 4 columns?

I found a few articles that are close to, but not the same as, what I am trying to do. I have an Excel file that has 4 columns of duplicated data; each column is sorted based on a numeric value that came from a different worksheet.
I need to identify the 25 (or so?) rows where the values of the four columns match and the row IDs are the lowest. There will be roughly 250 rows of data to sift through, so I only really need the top 10%.
I don't HAVE to approach it this way. I can dump this data into Access if this cannot be done in Excel. Or I can assign columns next to each text column (a way of assigning IDs to each field in column 1, 2, 3, and 4) and use those values. The approach is negotiable, as long as the outcome works.
Here's what my data looks like in Excel:
A B C D
abc bcd abc def
cde fgh def bcd
def def bcd abc
bcd hji xyz lmn
So in this case I would want to highlight (or somehow identify) the value "def", because it appears closest to the top of all 4 columns and hence has the lowest row IDs. The value "bcd" would be second on the list, since it is also found in all 4 and has low row IDs.
Any suggestions would be appreciated. I know SQL fairly well, so if you think dumping it in a DB would be best and you can suggest a query that would be awesome. But ideally... keeping it in Excel would be the least amount of work for me. I'm open to formulas, conditional formatting, etc.
Thanks!!
I THINK I came up with a fairly cool solution...
So, supposing you have this data in columns A-D, beginning in cell A2, say.
Now, you know that you ONLY want values if they already exist in column A - otherwise they're not in all 4 columns.
So:
In E2, type in the formula =Row() - this basically says where A's value is located.
In F2, type in =Match($A2,B:B,0) - this will find the first match for A2's value in column B.
Drag that formula across to G2 & H2 (to find the first match for A2's value in C & D respectively).
In I2, type in the formula =Sum(E2:H2).
Now, drag E2:I2 down for your entire dataset.
So, if column I shows #N/A, that means the value wasn't in all 4 columns.
And the lower the value in I, the lower (better) the rank of the match - column A's text being the value you're matching for.
Now you could sort according to column I, etc., to suit your needs.
Hope this does the trick (and makes sense)!
Cool Q, BTW!!!
Do you have, or can you create, a master list of all of the possible cell values? If so, then some simple VLOOKUPs on each of the 4 data columns could give, for each unique cell value, the row number in each column. Add up the 4 results and sort on the total.
If you don't have the master list of unique values, I'd tend to go to Access because it's a pretty easy set of queries to get what you want.
Clarification Needed
When I first came up with this answer I used the same approach that John used in his clever Excel answer, namely to use the sum of the minimum rows per column to produce the rank. That produces the sample result in the question, but consider the following modified test data:
F1 F2 F3 F4 RowNum
--- --- --- --- ------
XXX bar baz bat 1
foo XXX baz bat 2
YYY bar XXX bat 3
foo YYY baz bat 4
foo bar YYY bat 5
foo bar baz YYY 6
foo bar baz bat 7
foo bar baz bat 8
foo bar baz bat 9
foo bar baz XXX 10
XXX appears in rows 1, 2, 3, and 10, so the sum would be 16. YYY appears in rows 3, 4, 5, and 6 so the sum would be 18. Ranking by sum would declare XXX the winner, even though if you started scanning for XXX from row 1 you would have to go all the way to row 10 to reach the last XXX, whereas if you started scanning for YYY from row 1 you would only have to go down to row 6 to reach the last YYY.
In this case should YYY actually be the winner?
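If the answer is yes, the fix is to rank by the maximum of the per-column first appearances instead of their sum. A sketch against the ExcelItems query defined in the original answer below:

SELECT Item, MAX(MinOfRowNum) AS LastRow
FROM (SELECT Item, ColNum, MIN(RowNum) AS MinOfRowNum
      FROM ExcelItems
      GROUP BY Item, ColNum) AS t
GROUP BY Item
HAVING COUNT(*) = 4
ORDER BY 2;

With the modified test data this ranks YYY (LastRow 6) ahead of XXX (LastRow 10).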
(original answer)
The following code will import the Excel data into Access and add a [RowNum] column:
Sub ImportExcelData()
On Error Resume Next '' in case it doesn't already exist
DoCmd.DeleteObject acTable, "ExcelData"
On Error GoTo 0
DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, "ExcelData", "C:\Users\Gord\Documents\ExcelData.xlsx", False
CurrentDb.Execute "ALTER TABLE ExcelData ADD COLUMN RowNum AUTOINCREMENT(1,1)", dbFailOnError
End Sub
So now we have an [ExcelData] table in Access like this
F1 F2 F3 F4 RowNum
--- --- --- --- ------
abc bcd abc def 1
cde fgh def bcd 2
def def bcd abc 3
bcd hji xyz lmn 4
Let's create a saved query named ExcelItems in Access to string the entries out in a long "list"...
SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
UNION ALL
SELECT F2 AS Item, RowNum, 2 AS ColNum FROM ExcelData
UNION ALL
SELECT F3 AS Item, RowNum, 3 AS ColNum FROM ExcelData
UNION ALL
SELECT F4 AS Item, RowNum, 4 AS ColNum FROM ExcelData
...returning...
Item RowNum ColNum
---- ------ ------
abc 1 1
cde 2 1
def 3 1
bcd 4 1
bcd 1 2
fgh 2 2
def 3 2
hji 4 2
abc 1 3
def 2 3
bcd 3 3
xyz 4 3
def 1 4
bcd 2 4
abc 3 4
lmn 4 4
Now we can find the lowest RowNum where Item is found for each ColNum...
TRANSFORM Min(ExcelItems.[RowNum]) AS MinOfRowNum
SELECT ExcelItems.[Item]
FROM ExcelItems
GROUP BY ExcelItems.[Item]
PIVOT ExcelItems.[ColNum] In (1,2,3,4);
...returning...
Item  1  2  3  4
----  -  -  -  -
abc   1     1  3
bcd   4  1  3  2
cde   2
def   3  3  2  1
fgh      2
hji      4
lmn            4
xyz         4
If we save that query as ExcelItems_Crosstab then we can use it to rank the items that appear in all four columns:
SELECT Item, [1]+[2]+[3]+[4] AS Rank
FROM ExcelItems_Crosstab
WHERE ([1]+[2]+[3]+[4]) IS NOT NULL
ORDER BY 2
...returning...
Item Rank
---- ----
def 9
bcd 10

View to replace values with max value corresponding to a match

I am sure my question is very simple for some, but I cannot figure it out, and it is one of those things that is difficult to search for. I hope you can help.
In a table in SQL I have the following (simplified data):
UserID UserIDX Number Date
aaa bbb 1 21.01.2000
aaa bbb 5 21.01.2010
ppp ggg 9 21.01.2009
ppp ggg 3 15.02.2020
xxx bbb 99 15.02.2020
And I need a view which will give me the same number of records, but for every combination of UserID and UserIDX there should be only one value in the Number field: the highest value found in that combination's records. The Date field needs to remain unchanged. So the above would be transformed to:
UserID UserIDX Number Date
aaa bbb 5 21.01.2000
aaa bbb 5 21.01.2010
ppp ggg 9 21.01.2009
ppp ggg 9 15.02.2020
xxx bbb 99 15.02.2020
So, for all instances of aaa+bbb combination the unique value in Number should be 5 and for ppp+ggg the unique number is 9.
Thank you very much.
Leo
select a.userid, a.useridx, b.maxnum as number, a.date
from table a
inner join (
   select userid, useridx, max(number) as maxnum
   from table
   group by userid, useridx
) b
on a.userid = b.userid and a.useridx = b.useridx
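In a database that supports window functions, the same view needs no self-join. A sketch only; the table name is assumed, and note that number and date may be reserved words in some dialects:

select userid, useridx,
       max(number) over (partition by userid, useridx) as number,
       date
from usertable;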

SQL - querying without duplicates based on another column / improving a condition?

I have written a query which involves joins and finally returns the result below:
Name ID
AAA 1
BBB 1
BBB 6
CCC 1
CCC 6
DDD 6
EEE 1
But I want the result to be filtered further so that duplicate values in the first column are ignored, keeping the row with the greater value; i.e., the CCC and BBB rows with value 1 should be removed. The result should be:
AAA 1
BBB 6
CCC 6
DDD 6
EEE 1
Note: I have a condition Where (ID = '6' or ID = '1'); is there any way to improve this condition so that it means "Where ID = 6, or ID = 1 if no 6 is available in that table"?
You will likely want to add:
GROUP BY name
to the bottom of your query and change ID to MAX(ID) in your SELECT statement
It is hard to give a more specific answer without seeing the query you've already written.
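That said, if the existing query is wrapped as a derived table, the overall shape would be roughly this sketch (the inner query and the name joined_result stand in for whatever you already wrote):

select Name, max(ID) as ID
from ( /* ... your existing query with its joins ... */
       select Name, ID from joined_result
     ) t
where ID in (1, 6)
group by Name;

This returns 6 for a Name when a 6 exists and 1 otherwise, which also covers the condition described in the note.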