Count Frequency of Subgroup with Proc SQL

I am trying to find a way to select a frequency count for rows of a subgroup with no distinct identifiers (well, I guess the distinct identifier is a combination of statuses). Consider the sample data:
data have;
input Series $ Game Name $ Points;
datalines;
A 1 LeBron 2
A 1 LeBron 3
A 1 LeBron 2
A 1 LeBron 2
A 2 LeBron 2
A 2 LeBron 2
A 2 LeBron 3
A 3 LeBron 2
;
run;
Each row here is a shot LeBron took in a game within a series. I want the series/game summary, with a count of the number of shots, like this:
Series Game Name Freq Sum 2pt 3pt
A 1 LeBron 4 9 3 1
A 2 LeBron 3 7 2 1
A 3 LeBron 1 2 1 0
I have to use Proc SQL here, rather than proc means, because I am pulling the data in from multiple tables. Also, I will have several thousand "Series" and many more "Games" and "Names", so please keep the answer general. Here is what I have:
proc sql;
create table want as
select Series,
Game,
Name,
sum(points) as totalpoints
from have
group by 1,2,3;
quit;
Thanks.
Pyll

There's no particular reason you couldn't use PROC MEANS when pulling from multiple tables - you can always create a view (either in SQL or in the data step). But anyway:
proc sql;
create table want as
select Series,
Game,
Name,
sum(points) as totalpoints,
count(points) as numbershotsmade
from have
group by 1,2,3;
quit;
You can also use the n function which does the same thing.
count(points) will count the non-null points values; count(1) will count the total number of rows even if points is null.
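If you also want the 2pt and 3pt columns from your desired output, one way (a sketch, assuming Points only ever takes the values 2 and 3) is to sum CASE expressions inside the same query. The names Total, TwoPt, and ThreePt are stand-ins, since SAS column names cannot start with a digit:
proc sql;
create table want as
select Series,
Game,
Name,
count(points) as Freq, /* number of shots */
sum(points) as Total, /* total points */
sum(case when points = 2 then 1 else 0 end) as TwoPt, /* 2-point shots */
sum(case when points = 3 then 1 else 0 end) as ThreePt /* 3-point shots */
from have
group by 1,2,3;
quit;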

Related

SQL - Update in a cross apply query

UPDATE Table1
SET SomeColumn = X.SomeOtherColumn
FROM Table1 T1
CROSS APPLY
(SELECT TOP 1 SomeOtherColumn
FROM Table2 T2
WHERE T2.SomeJoinColumn = T1.SomeJoinColumn
ORDER BY CounterColumn) AS X
I want to increase CounterColumn by 1 each time the cross apply query runs. Is there any way I could achieve this?
Some context and sample data
I have a table containing information about companies. I want to anonymize the company numbers in this table. To do this, I want to use data from another table, containing synthesized data. That table has a much smaller sample size, so I have to reuse the same synthetic companies multiple times. For each row in the table I anonymize, I want to pick a synthetic company of the same type, and I want to use all the synthetic companies. That's where the counter comes in, counting how many times I've used each synthetic company. By sorting by this counter, I was hoping to always pick the synthetic company that's been used the least.
Company table (Table1)
CompanyNumber Type
67923 2
82034 2
90238 7
29378 2
92809 5
72890 2
Synthetic company table (Table2)
SyntheticCompanyNumber Type Counter
08366 5 0
12588 2 0
33823 2 0
27483 7 0
Expected output of Company table:
CompanyNumber Type
12588 2
33823 2
27483 7
12588 2
08366 5
33823 2
Expected output of synthetic company table
SyntheticCompanyNumber Type Counter
08366 5 1
12588 2 2
33823 2 2
27483 7 1
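A single UPDATE statement can only modify one target table, so the counter cannot be incremented as a side effect of the CROSS APPLY itself. One set-based alternative (a sketch, not a tested answer) is to number both tables per Type with ROW_NUMBER(), pair them with a modulo so each synthetic company is reused as evenly as possible, and then recompute the counters afterwards. The exact pairing may differ from the example output, but the usage counts come out balanced. Table and column names are taken from the question, and the sketch assumes CompanyNumber is unique in Table1:
/* Assign each company a synthetic company of the same Type, cycling
   through the synthetic companies so they are reused evenly. */
WITH NumberedCompanies AS (
    SELECT CompanyNumber, Type,
           ROW_NUMBER() OVER (PARTITION BY Type ORDER BY CompanyNumber) - 1 AS rn
    FROM Table1
),
NumberedSynthetic AS (
    SELECT SyntheticCompanyNumber, Type,
           ROW_NUMBER() OVER (PARTITION BY Type ORDER BY SyntheticCompanyNumber) - 1 AS rn,
           COUNT(*) OVER (PARTITION BY Type) AS cnt
    FROM Table2
)
UPDATE T1
SET CompanyNumber = S.SyntheticCompanyNumber
FROM Table1 T1
JOIN NumberedCompanies C
  ON C.CompanyNumber = T1.CompanyNumber AND C.Type = T1.Type
JOIN NumberedSynthetic S
  ON S.Type = C.Type AND S.rn = C.rn % S.cnt;

/* Recompute the counters from the anonymized table. */
UPDATE T2
SET Counter = U.UsedCount
FROM Table2 T2
JOIN (SELECT CompanyNumber, COUNT(*) AS UsedCount
      FROM Table1
      GROUP BY CompanyNumber) U
  ON U.CompanyNumber = T2.SyntheticCompanyNumber;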

How to select last element for each ID

I would like to select some elements from the last row of each id.
Here is an example of what I have:
id money
1 200
1 150
1 500
3 50
4 40
4 300
5 110
Here is what I would like:
1 500
3 50
4 300
5 110
So as you can see, I take the last row for each id and the money that corresponds to it.
I tried a group by id with order by id descending and limit 1, but LIMIT is not available in PROC SQL in SAS, so it doesn't work.
Thanks in advance
Unlike SAS datasets, SQL tables represent unordered sets. In your case, it looks like you want the maximum value in the second column, in which case you can use aggregation:
proc sql;
select id, max(money)
from t
group by id;
If you actually mean the last row per id based on the ordering in the SAS dataset, I would suggest using a data step instead.
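A minimal data-step sketch of that route (assuming the dataset is named t, as above, and its rows are already grouped by id):
data want;
set t;
by id; /* requires the data to be sorted or grouped by id */
if last.id; /* keep only the final observation for each id */
run;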

Using SQL in SAS, how do I create a new column that counts/indicates the uniqueness of values in an existing column?

My data is as follows:
ID
1
2
3
3
4
5
6
6
I want to create a column that indicates the uniqueness of a value in the ID column as such:
ID COUNT
1 1
2 1
3 1
3 0
4 1
5 1
6 1
6 0
I'd like to do this without creating a temporary table, via a subquery or something. Any assistance would be much appreciated.
One option would be to use BY-group functionality in the data step:
data have;
input ID;
datalines;
1
2
3
3
4
5
6
6
;run;
data want;
set have;
by ID;
if first.ID then count = 1;
else count = 0;
run;
That type of logic is not really amenable to SQL, since the order of observations is not guaranteed. In a more modern SQL dialect you could use windowing functions (like ROW_NUMBER() with PARTITION BY) to impose a row number.
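For reference, a rough sketch of what that could look like in a SQL dialect that supports window functions (this will not run in PROC SQL); which duplicate row receives the 1 is arbitrary here, since there is no other column to order by within each ID:
select ID,
case when row_number() over (partition by ID order by ID) = 1
then 1 else 0 end as count_flag
from have;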
If you really wanted to do it just in PROC SQL, you might need to resort to the undocumented MONOTONIC() function. But even then, to keep the optimizer from eliminating the duplicate rows, you might need to make a temporary table with the row counter first.
data have;
input ID ##;
datalines;
1 2 3 3 4 5 6 6
;
proc sql ;
create table _temp_ as select id,monotonic() as row from have;
create table want as
select a.id
, b.row=min(b.row) as FLAG
from have a,_temp_ b
where a.id=b.id
group by a.id
order by 1,2
;
quit;

SQL Creating new variables

I am fairly inexperienced with SQL, but am trying to condense my code into one query so that it is more efficient. Below is a simplified example of a much more complex problem I have. I am having trouble with the syntax for creating the summary groups and variables. In my case, the data are housed in several different tables, but the joins are not a problem for me, so I have only created one table here.
This is the data I have:
Name Class Wk Score ExCred X
Joe A 1 35 ? 3
Hal A 1 50 5 4
Sal A 1 45 ? 3
Kim B 1 30 5 6
Cal B 1 40 ? 6
Joe A 2 50 ? 2
Hal A 2 40 ? 3
Sal A 2 40 ? 4
Kim B 2 40 5 5
Cal B 2 40 ? 4
The table I am trying to create will look like this:
Class Wk Avg_Score Sum_X
A 1 45 10
B 1 37.5 12
A 2 43.3 9
B 2 42.5 9
So, the data are summarized by class and week. Avg_Score is the average, across students, of the sum of 'Score' and 'ExCred'. Sum_X is simply the sum of X for each class and week.
I have had success doing this with multiple proc means steps, but that is clunky and seems to take a really long time. There has to be a more elegant way to do this. I know it probably involves the group by statement... Help?
Thanks. Pyll
I see no particular reason not to use proc means here. It should be significantly faster than proc sql on datasets of substantial size.
proc means data=have;
class class wk;
types class*wk;
var score x;
output out=want mean(score)= sum(x)=;
run;
Just preprocess the data to include ExCred into the Score variable; if execution time is an issue use a view to do so.
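For example, a data step view along these lines would fold ExCred into Score before summarizing (a sketch; it assumes a missing ExCred counts as zero, which is what the expected averages imply):
data have_v / view=have_v;
set have;
Score = sum(Score, ExCred); /* the data step SUM function ignores missing ExCred values */
run;
PROC MEANS can then read have_v in place of have.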
If you did want to do it in sql, you would indeed use a group by.
proc sql;
create table want as
select class, wk, mean(score + coalesce(ExCred, 0)) as Avg_Score, sum(x) as Sum_X
from have
group by class, wk;
quit;

Conditioning on multiple rows in a column in Teradata

Suppose I have a table that looks like this:
id attribute
1 football
1 NFL
1 ball
2 football
2 autograph
2 nfl
2 blah
2 NFL
I would like to get a list of distinct ids where the attribute column contains the terms "football", "NFL", and "ball". So 1 would be included, but 2 would not. What's the most elegant/efficient way to do this in Teradata?
The number of attributes can vary for each id, and terms can repeat. For example, NFL appears twice for id 2.
You can use the following:
select id
from yourtable
where attribute in ('football', 'NFL', 'ball')
group by id
having count(distinct attribute) = 3
See SQL Fiddle with Demo (the fiddle is showing MySQL, but this should work in Teradata)