count the instances of value in sub-query, update table - sql

I am trying to count the number of times a value (mytype) appears within a distinct id value, and update my table with this count (idsubtotal) for each row. The table I have:
 id  | mytype | idsubtotal
-----+--------+-----------
  44 | red    |
 101 | red    |
 101 | red    |
 101 | blue   |
 101 | yellow |
 494 | red    |
 494 | blue   |
 494 | blue   |
 494 | yellow |
 494 | yellow |
I need to calculate/update the idsubtotal column, so it is like:
 id  | mytype | idsubtotal
-----+--------+-----------
  44 | red    | 1
 101 | red    | 2
 101 | red    | 2
 101 | blue   | 1
 101 | yellow | 1
 494 | red    | 1
 494 | blue   | 2
 494 | blue   | 2
 494 | yellow | 2
 494 | yellow | 2
When I try this below, it is counting how many times the mytype value appears in the entire table, but I need to know how many times it appears within that sub-group of id values (e.g. How many times does "red" appear within id 101 rows, answer = 2).
SELECT id, mytype,
COUNT(*) OVER (PARTITION BY mytype) idsubtotal
FROM table_name
I know storing this subtotal in the table itself (versus calculating it live when needed) constitutes a bad data model for the table, but I need to do it this way in my case.
Also, my question is similar to this question but slightly different, and nothing I've tried to adapt from the previous responses or other posts, with my very primitive understanding of SQL, has worked. TIA for any ideas.

UPDATE table_name a
SET idsubtotal=( SELECT COUNT(1)
FROM table_name b
WHERE a.id=b.id
AND a.mytype=b.mytype
)

SELECT id, mytype, COUNT(*)
FROM table_name
GROUP BY id, mytype
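
To check the correlated-subquery UPDATE from the first answer against the question's data, here is a minimal sketch using Python's sqlite3 module. The in-memory SQLite database is an assumption made only for illustration; the asker's actual database engine is not stated, and UPDATE alias syntax varies between engines, so the outer table is referenced by name here.

```python
import sqlite3

# Hypothetical in-memory copy of the question's table (engine is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (id INTEGER, mytype TEXT, idsubtotal INTEGER)"
)
rows = [
    (44, "red"), (101, "red"), (101, "red"), (101, "blue"), (101, "yellow"),
    (494, "red"), (494, "blue"), (494, "blue"), (494, "yellow"), (494, "yellow"),
]
conn.executemany("INSERT INTO table_name (id, mytype) VALUES (?, ?)", rows)

# Correlated subquery: for each row being updated, count the rows that
# share BOTH its id and its mytype.
conn.execute(
    """
    UPDATE table_name
    SET idsubtotal = (SELECT COUNT(*)
                      FROM table_name AS b
                      WHERE b.id = table_name.id
                        AND b.mytype = table_name.mytype)
    """
)

# rowid preserves insertion order, so the output lines up with the question.
result = conn.execute(
    "SELECT id, mytype, idsubtotal FROM table_name ORDER BY rowid"
).fetchall()
for row in result:
    print(row)
```

The window-function attempt in the question counts per mytype only; partitioning by both columns (PARTITION BY id, mytype) restricts the count to each id sub-group.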

Related

SQL condition based on number on field name

I have a table containing numbers of people in each area, by age.
There is a column for each age, as shown in this table (junk data):
Area | 0   | 1   | 2   | 3   | ... | 90+
A    | 123 | 65  | 45  | 20  | --  | 66
B    | 442 | 456 | 124 | 422 | --  | 999
C    | 442 | 99  | 88  | 747 | --  | 234
I need to group these figures into age bands (0-19, 20-39, 40-59...)
eg:
Area | 0-19 | 20-39 | 40-59 | 60+
A    | 789  | 689   | 544   | 1024
B    | 1564 | 884   | 1668  | 1589
C    | 800  | 456   | 456   | 951
What is the best way to do this?
I could do a simple SUM as below, but that feels like a massive amount of script for something that feels like it should be straightforward.
SELECT
[0]+[1]+[2]+...[19] AS [0-19],
[20]+[21]+[22]+...[39] AS [20-39]
...
Is there a simpler way? I'm wondering if PIVOT can help but am struggling to visualise how to use it to get my desired result.
Hoping I'm missing something obvious!
EDIT This is how the data has been supplied to me, I know it's not a great table design but unfortunately that's out of my hands.
I would suggest creating a view on top of your table like so:
CREATE VIEW v_t_normal AS
SELECT Area, Age, Value
FROM t
CROSS APPLY (VALUES
(0, [0]),
(1, [1]),
...
(90, [90+])
) AS ca(Age, Value)
That view will present your data in a somewhat normalized form. You will not be able to edit the data through the view, but you should be able to perform basic math and aggregation on it. The 90+ value will still cause headaches, as it encapsulates more than one age.
I'm going to answer by suggesting an alternative table design which will make life easier:
Area | Age | Count
A | 0 | 123
A | 1 | 65
...
B | 0 | 442
Here we are storing each area's age in a separate record, rather than a separate column. With this design in place, your requirement is easy to meet using conditional aggregation:
SELECT
Area,
SUM(CASE WHEN Age BETWEEN 0 AND 19 THEN Count ELSE 0 END) AS [0-19],
SUM(CASE WHEN Age BETWEEN 20 AND 39 THEN Count ELSE 0 END) AS [20-39],
...
FROM yourNewTable
GROUP BY Area;
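
As a rough, runnable illustration of the conditional aggregation above, the sketch below uses Python's sqlite3 module with a handful of made-up rows. The figures are invented, the SQLite engine is an assumption (the answers use SQL Server syntax), and SQLite wants double-quoted identifiers where SQL Server uses square brackets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE yourNewTable (Area TEXT, Age INTEGER, "Count" INTEGER)')

# Invented sample rows: (Area, Age, Count). Not the asker's data.
sample = [
    ("A", 0, 10), ("A", 19, 5), ("A", 20, 7), ("A", 45, 3),
    ("B", 5, 2), ("B", 39, 4), ("B", 60, 9),
]
conn.executemany("INSERT INTO yourNewTable VALUES (?, ?, ?)", sample)

# One SUM(CASE ...) per age band; rows outside a band contribute 0 to it.
bands = conn.execute(
    """
    SELECT Area,
           SUM(CASE WHEN Age BETWEEN 0  AND 19 THEN "Count" ELSE 0 END) AS "0-19",
           SUM(CASE WHEN Age BETWEEN 20 AND 39 THEN "Count" ELSE 0 END) AS "20-39",
           SUM(CASE WHEN Age BETWEEN 40 AND 59 THEN "Count" ELSE 0 END) AS "40-59",
           SUM(CASE WHEN Age >= 60             THEN "Count" ELSE 0 END) AS "60+"
    FROM yourNewTable
    GROUP BY Area
    ORDER BY Area
    """
).fetchall()
for row in bands:
    print(row)
```

Because the last band is open-ended (Age >= 60), the original 90+ column can be stored under any age of 60 or above without affecting the banded totals.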

How to do conditional count based on row value in SAS/SQL?

Re-uploading since there were some problems with my last post, and I did not know that we were supposed to post sample data. I'm fairly new to SAS, and I have a problem that I know how to solve in Excel but not SAS. However, the dataset is too large to reasonably use in Excel.
I have four variables: id, year_start, groupname, test_score.
Sample data:
id year_start group_name test_score
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950931 Blue 90
3 19960931 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
My goal is to achieve a ranked list (fractional) by test_score for each year. I hoped that I would be able to achieve this using PROC RANK FRACTION. This function would calculate order by a test_score (highest is 1, 2nd highest is 2 and so on) and then divide by the total number of observations to provide a fractional rank. Unfortunately, year_start differs widely from row to row. For each id/year combo, I want to perform a one-year look-back from year-start, and rank that observation compared to all other id's that have a year_start in that one year range. I'm not interested in comparing by calendar year, and the rank of each id should be relative to its own year_start. Adding another level of complication, I would like this rank to be performed by groupname.
PROC SQL is totally fine if someone has a SQL solution.
Using the above data, the ranks would be like this:
id year_start group_name test_score rank
1 19931231 Red 90 0.75
1 19941230 Red 89 0.8
1 19951231 Red 91 1
1 19961231 Red 92 1
2 19930630 Red 85 1
2 19940629 Red 87 0.8
2 19950630 Red 95 0.75
3 19950931 Blue 90 1
3 19960931 Blue 90 1
4 19930331 Red 95 1
4 19940331 Red 97 0.2
4 19950330 Red 98 0.2
4 19960331 Red 95 0.333
5 19931231 Red 96 0.25
5 19941231 Red 97 0.667
In order to calculate the rank for row 1,
we first exclude blue observations.
Then we count the number of observations that fall within a year before that year_start, 19931231 (so we have 4 observations).
We count how many of these observations have a higher test_score, and then add 1 to find the order of the current observation (So it is the 3rd highest).
Then, we divide the order by the total number to get the rank (3/4= 0.75).
In Excel, the formula for this variable would look something like this. Assume formula is for row 1 and there are 100 rows. id=A, year_start=B, groupname=C, and test_score=D:
=(1+countifs(D1:D100,">"&D1,
B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1))/
countifs(B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1)
Thanks so much for the help!
ahammond428
Your example isn't correct if I'm reading it correctly, so it's hard to know exactly what you're trying to do. But try the following and see if it works. You may need to tweak inequalities to be open or closed depending on whether you want to include one year to the date. Note that your year_start column needs to be imported in a SAS date format for this to work. Otherwise you can change it over with input(year_start, yymmdd8.).
proc sql;
select distinct
a.id,
a.year_start,
a.group_name,
a.test_score,
1+sum(case when b.test_score > a.test_score then 1 else 0 end) as rank_num,
count(b.id) as rank_denom,
calculated rank_num / calculated rank_denom as rank
from testdata a left join testdata b
on a.group_name = b.group_name
and intnx('year',a.year_start,-1,'s') le b.year_start le a.year_start
group by a.id, a.year_start, a.group_name, a.test_score
order by id, year_start;
quit;
Note that I changed dates of 9/31 to 9/30 (since there is no 9/31), but left 3/30, 6/29, and 12/30 alone since perhaps that was intended, though the other dates seem to be quarter-end.
Consider correlated count subqueries in SQL:
DATA
data ranktable;
infile datalines missover;
input id year_start group_name $ test_score;
datalines;
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950930 Blue 90
3 19960930 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
;
run;
data ranktable;
set ranktable;
format year_start date9.;
year_start = input(put(year_start,z8.),yymmdd8.);
run;
PROC SQL
Additional fields included for your review
proc sql;
select r.id, r.year_start, r.group_name, r.test_score,
put(intnx('year', r.year_start, -1, 's'), yymmdd10.) as year_ago,
(select count(*) from ranktable sub
where sub.test_score >= r.test_score
and sub.group_name = r.group_name
and sub.year_start <= r.year_start
and sub.year_start >= intnx('year', r.year_start, -1, 's')) as num_rank,
(select count(*) from ranktable sub
where sub.group_name = r.group_name
and sub.year_start <= r.year_start
and sub.year_start >= intnx('year', r.year_start, -1, 's')) as denom_rank,
calculated num_rank / calculated denom_rank as rank
from ranktable r;
quit;
OUTPUT
You will notice slight differences from your expected results. These may be due to the 365.25-day year you apply to every row, whereas SAS's intnx steps back one full calendar year, whose length in days varies from year to year.
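
For readers who want to check the arithmetic outside SAS, here is a small Python sketch of the look-back rank described in the question. It is not a translation of either PROC SQL answer. It assumes a window that excludes the date exactly one year earlier and includes year_start itself (mirroring the strict ">" and "<=" in the Excel formula), counts strictly higher scores, and uses the sample data with 9/31 changed to 9/30. The function names are made up for this example.

```python
from datetime import date

# Sample rows from the question: (id, year_start, group_name, test_score).
rows = [
    (1, date(1993, 12, 31), "Red", 90), (1, date(1994, 12, 30), "Red", 89),
    (1, date(1995, 12, 31), "Red", 91), (1, date(1996, 12, 31), "Red", 92),
    (2, date(1993, 6, 30), "Red", 85), (2, date(1994, 6, 29), "Red", 87),
    (2, date(1995, 6, 30), "Red", 95),
    (3, date(1995, 9, 30), "Blue", 90), (3, date(1996, 9, 30), "Blue", 90),
    (4, date(1993, 3, 31), "Red", 95), (4, date(1994, 3, 31), "Red", 97),
    (4, date(1995, 3, 30), "Red", 98), (4, date(1996, 3, 31), "Red", 95),
    (5, date(1993, 12, 31), "Red", 96), (5, date(1994, 12, 31), "Red", 97),
]


def one_year_before(d):
    """Same month and day one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def fractional_rank(row, all_rows):
    """(1 + number of higher scores) / (observations in the look-back window)."""
    _, start, group, score = row
    lower = one_year_before(start)
    # Same group, year_start in the half-open interval (lower, start].
    window = [r for r in all_rows if r[2] == group and lower < r[1] <= start]
    higher = sum(1 for r in window if r[3] > score)
    return (1 + higher) / len(window)  # window always contains the row itself


ranks = {(r[0], r[1]): fractional_rank(r, rows) for r in rows}
print(ranks[(1, date(1993, 12, 31))])  # the worked example for row 1: 3/4
```

Changing the lower bound to be inclusive, or counting ties as higher, changes some of the ranks, which is the kind of tweak to the inequalities the first answer mentions.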

How can I identify the text values that have the lowest row IDs across 4 columns?

I found a few articles that are close, but not the same as what I am trying to do. I have an Excel file that has 4 columns of duplicated data, each column is sorted based on a numeric value that came from a different worksheet.
I need to identify the 25(or so?) rows where the value of the four columns match, and the row ID is the lowest. There will be roughly 250 rows of data to sift through, so I only really need the top 10%.
I don't HAVE to approach it this way. I can dump this data into Access if this cannot be done in Excel. Or I can assign columns next to each text column (a way of assigning IDs to each field in column 1, 2, 3, and 4) and use those values. The approach is negotiable, as long as the outcome works.
Here's what my data looks like in Excel:
A B C D
abc bcd abc def
cde fgh def bcd
def def bcd abc
bcd hji xyz lmn
So in this case I would want to highlight (or somehow identify) the value "def" because it appears closest to the top of all 4 columns, hence it has the lowest row ID. The value "bcd" would be second on the list since it also is identified in all 4 and has a low row id.
Any suggestions would be appreciated. I know SQL fairly well, so if you think dumping it in a DB would be best and you can suggest a query that would be awesome. But ideally... keeping it in Excel would be the least amount of work for me. I'm open to formulas, conditional formatting, etc.
Thanks!!
I THINK I came up with a fairly cool solution...
So, supposing you have this data in columns A-D, beginning in cell A2, say.
Now, you know that you ONLY want values if they already exist in column A - Otherwise they're not in all 4 columns.
So:
In E2, type in the formula =Row() - This basically says where A's value is located
In F2, type in =Match($A2,B:B,0) - This will find the first match for A2's value in column B
Drag that formula across to G2 & H2 (to find the first match for A2's value in C & D respectively).
In I2, type in the formula =Sum(E2:H2)
Now, drag E:I down for your entire dataset.
So, if I = #N/A, that means the value wasn't in all 4 columns
And the lower the value for I, the higher the match ranks - (Column A's text being the value you're matching for).
Now you could sort according to Column I, etc, to suit your needs.
Hope this does the trick (and makes sense)!
Cool Q, BTW!!!
Do you have, or can you create, a master list of all of the possible cell values? If so, then some simple VLOOKUPs on each of the 4 data columns could give, for each unique cell value, the row number in each column. Add up the 4 results and sort on the total.
If you don't have the master list of unique values, I'd tend to go to Access because it's a pretty easy set of queries to get what you want.
Clarification Needed
When I first came up with this answer I used the same approach that John used in his clever Excel answer, namely to use the sum of the minimum rows per column to produce the rank. That produces the sample result in the question, but consider the following modified test data:
F1 F2 F3 F4 RowNum
--- --- --- --- ------
XXX bar baz bat 1
foo XXX baz bat 2
YYY bar XXX bat 3
foo YYY baz bat 4
foo bar YYY bat 5
foo bar baz YYY 6
foo bar baz bat 7
foo bar baz bat 8
foo bar baz bat 9
foo bar baz XXX 10
XXX appears in rows 1, 2, 3, and 10, so the sum would be 16. YYY appears in rows 3, 4, 5, and 6 so the sum would be 18. Ranking by sum would declare XXX the winner, even though if you started scanning for XXX from row 1 you would have to go all the way to row 10 to reach the last XXX, whereas if you started scanning for YYY from row 1 you would only have to go down to row 6 to reach the last YYY.
In this case should YYY actually be the winner?
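
The two candidate rankings can be compared on the modified test data with a short Python sketch. The helper name first_rows is made up for this illustration, and row numbers are 1-based to match the RowNum column.

```python
# Modified test data from the text above, one list per row (F1..F4).
data = [
    ["XXX", "bar", "baz", "bat"],
    ["foo", "XXX", "baz", "bat"],
    ["YYY", "bar", "XXX", "bat"],
    ["foo", "YYY", "baz", "bat"],
    ["foo", "bar", "YYY", "bat"],
    ["foo", "bar", "baz", "YYY"],
    ["foo", "bar", "baz", "bat"],
    ["foo", "bar", "baz", "bat"],
    ["foo", "bar", "baz", "bat"],
    ["foo", "bar", "baz", "XXX"],
]


def first_rows(item, table):
    """1-based row of the first occurrence of item in each column (None if absent)."""
    result = []
    for col in range(len(table[0])):
        hit = next(
            (i for i, row in enumerate(table, start=1) if row[col] == item),
            None,
        )
        result.append(hit)
    return result


xxx = first_rows("XXX", data)
yyy = first_rows("YYY", data)

# Rank by sum of first rows versus rank by the deepest first row.
print("XXX", xxx, "sum:", sum(xxx), "max:", max(xxx))
print("YYY", yyy, "sum:", sum(yyy), "max:", max(yyy))
```

Ranking by sum favours XXX (16 against 18), while ranking by the deepest row favours YYY (6 against 10), so the choice of metric decides the winner.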
(original answer)
The following code will import the Excel data into Access and add a [RowNum] column
Sub ImportExcelData()
On Error Resume Next '' in case it doesn't already exist
DoCmd.DeleteObject acTable, "ExcelData"
On Error GoTo 0
DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, "ExcelData", "C:\Users\Gord\Documents\ExcelData.xlsx", False
CurrentDb.Execute "ALTER TABLE ExcelData ADD COLUMN RowNum AUTOINCREMENT(1,1)", dbFailOnError
End Sub
So now we have an [ExcelData] table in Access like this
F1 F2 F3 F4 RowNum
--- --- --- --- ------
abc bcd abc def 1
cde fgh def bcd 2
def def bcd abc 3
bcd hji xyz lmn 4
Let's create a saved query named ExcelItems in Access to string the entries out in a long "list"...
SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
UNION ALL
SELECT F2 AS Item, RowNum, 2 AS ColNum FROM ExcelData
UNION ALL
SELECT F3 AS Item, RowNum, 3 AS ColNum FROM ExcelData
UNION ALL
SELECT F4 AS Item, RowNum, 4 AS ColNum FROM ExcelData
...returning...
Item RowNum ColNum
---- ------ ------
abc 1 1
cde 2 1
def 3 1
bcd 4 1
bcd 1 2
fgh 2 2
def 3 2
hji 4 2
abc 1 3
def 2 3
bcd 3 3
xyz 4 3
def 1 4
bcd 2 4
abc 3 4
lmn 4 4
Now we can find the lowest RowNum where Item is found for each ColNum...
TRANSFORM Min(ExcelItems.[RowNum]) AS MinOfRowNum
SELECT ExcelItems.[Item]
FROM ExcelItems
GROUP BY ExcelItems.[Item]
PIVOT ExcelItems.[ColNum] In (1,2,3,4);
...returning...
Item  1  2  3  4
----  -  -  -  -
abc   1     1  3
bcd   4  1  3  2
cde   2
def   3  3  2  1
fgh      2
hji      4
lmn            4
xyz         4
If we save that query as ExcelItems_Crosstab then we can use it to rank the items that appear in all four columns:
SELECT Item, [1]+[2]+[3]+[4] AS Rank
FROM ExcelItems_Crosstab
WHERE ([1]+[2]+[3]+[4]) IS NOT NULL
ORDER BY 2
...returning...
Item Rank
---- ----
def 9
bcd 10
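
For comparison, the same sum-of-first-rows ranking can be sketched without Access in a few lines of Python. This is only an illustration of the logic behind the queries above, using the four sample columns from the question; the variable names are arbitrary.

```python
# The question's sample data, one list per Excel column.
columns = {
    "A": ["abc", "cde", "def", "bcd"],
    "B": ["bcd", "fgh", "def", "hji"],
    "C": ["abc", "def", "bcd", "xyz"],
    "D": ["def", "bcd", "abc", "lmn"],
}

# For every value, remember the first (lowest) 1-based row in each column.
first_seen = {}
for name, values in columns.items():
    for row_num, value in enumerate(values, start=1):
        # setdefault keeps the earliest row if a value repeats in a column.
        first_seen.setdefault(value, {}).setdefault(name, row_num)

# Keep only values present in all columns, then rank by the sum of rows.
ranking = sorted(
    (sum(rows.values()), item)
    for item, rows in first_seen.items()
    if len(rows) == len(columns)
)
for total, item in ranking:
    print(item, total)
```

Values missing from any column (such as abc, which never appears in column B) drop out, which plays the same role as the IS NOT NULL filter in the final query.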

update table from other table without join

Here is the deal. I have a table T with many columns but two of interest: gen_ID, ordernumber.
Records in this table are always by groups of 5 with the gen_ID being the same and the ordernumber being blank.
So in essence, it looks like this:
Gen_ID ordernumber
233
233
233
233
233
234
234
234
234
234
Now I have a query Q that, when executed, randomizes the numbers 1, 2, 3, 4, and 5.
I want to update ordernumber with the random numbers of Q so it looks like this:
Gen_ID ordernumber
233 3
233 4
233 1
233 2
233 5
234 4
234 5
234 3
234 2
234 1
Etc...
Any idea on how to do this using MS Access 2010 SQL?
An update query would be fine, but I cannot join the two since I don't have a common ID.
Any suggestions? Note that I can run this magic query once a set of 5 records are created in the table (I don't need to have that done once I have more than one set).
I don't think this can be achieved by SQL alone and will need some VB running alongside. My approach would be to get your 1 - 5 numbers in a random order stored in an "Array", you can then open up a recordset to "T" and step through one by one assigning a number from your array. You could also loop this process to begin again whenever it detects a new Gen_ID in "T" and thus populate the whole table in one pass.
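
The loop described above could look roughly like the following. It is written in Python rather than VBA purely to show the logic; the list gen_ids stands in for a recordset on T ordered by Gen_ID, and the function name is hypothetical.

```python
import random

# Stand-in for the rows of T, ordered by Gen_ID (two groups of five).
gen_ids = [233] * 5 + [234] * 5


def assign_order_numbers(ids, rng=random):
    """Walk the rows once, dealing out a freshly shuffled 1..5 per Gen_ID."""
    assigned = []
    current, pool = None, []
    for gen_id in ids:
        if gen_id != current:
            # New group detected: start again with a new random order.
            current = gen_id
            pool = [1, 2, 3, 4, 5]
            rng.shuffle(pool)
        # Assumes every group has exactly five rows, as stated in the question.
        assigned.append((gen_id, pool.pop()))
    return assigned


result = assign_order_numbers(gen_ids)
for gen_id, ordernumber in result:
    print(gen_id, ordernumber)
```

In Access the same steps would run against a recordset, writing each number back to ordernumber as the loop advances.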

SQL comparing two tables with common id but the id in table 2 could be in two different columns

Given the following SQL tables:
Administrators:
id Name rating
1 Jeff 48
2 Albert 55
3 Ken 35
4 France 56
5 Samantha 52
6 Jeff 50
Meetings:
id originatorid Assistantid
1 3 5
2 6 3
3 1 2
4 6 4
I would like to generate a table from Ken's point of view (id=3) therefore his id could be possibly present in two different columns in the meetings' table. (The statement IN does not work since I introduce two different field columns).
Thus the output would be:
id originatorid Assistantid
1 3 5
2 6 3
If you really just need to see which column Ken's id is in, you only need an OR. The following will produce your example output exactly.
SELECT * FROM Meetings WHERE originatorid = 3 OR Assistantid = 3;
If you need to take the complex route and list names along with meetings, an OR in your join's ON clause should work here:
SELECT
Administrators.name,
Administrators.id,
Meetings.originatorid,
Meetings.Assistantid
FROM Administrators
JOIN Meetings
ON Administrators.id = Meetings.originatorid
OR Administrators.id = Meetings.Assistantid
WHERE Administrators.name = 'Ken';
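
The simple OR filter can be tried against the sample tables with a short sqlite3 sketch in Python. The in-memory SQLite database is an assumption for illustration only. The second query shows that IN can still be used here if the constant goes on the left and the two columns go in the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Administrators (id INTEGER, Name TEXT, rating INTEGER)")
conn.execute(
    "CREATE TABLE Meetings (id INTEGER, originatorid INTEGER, Assistantid INTEGER)"
)
conn.executemany(
    "INSERT INTO Administrators VALUES (?, ?, ?)",
    [(1, "Jeff", 48), (2, "Albert", 55), (3, "Ken", 35),
     (4, "France", 56), (5, "Samantha", 52), (6, "Jeff", 50)],
)
conn.executemany(
    "INSERT INTO Meetings VALUES (?, ?, ?)",
    [(1, 3, 5), (2, 6, 3), (3, 1, 2), (4, 6, 4)],
)

# Ken's id may sit in either column, so test both with OR.
ken_meetings = conn.execute(
    "SELECT id, originatorid, Assistantid FROM Meetings "
    "WHERE originatorid = ? OR Assistantid = ? ORDER BY id",
    (3, 3),
).fetchall()

# Equivalent form: the value on the left, the two columns inside IN (...).
same_with_in = conn.execute(
    "SELECT id, originatorid, Assistantid FROM Meetings "
    "WHERE ? IN (originatorid, Assistantid) ORDER BY id",
    (3,),
).fetchall()

print(ken_meetings)
print(same_with_in)
```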