SQL: Creating new variables

I am fairly inexperienced with SQL, but am working to condense my code into one query so that it is more efficient. Below is a simplified example of a much more complex problem I have. I am having problems with the syntax for creating the summary groups and variables. In my case, the data are housed in several different tables, but the joins are not a problem for me, so I have only created one table here.
This is the data I have:
Name  Class  Wk  Score  ExCred  X
Joe   A      1   35     ?       3
Hal   A      1   50     5       4
Sal   A      1   45     ?       3
Kim   B      1   30     5       6
Cal   B      1   40     ?       6
Joe   A      2   50     ?       2
Hal   A      2   40     ?       3
Sal   A      2   40     ?       4
Kim   B      2   40     5       5
Cal   B      2   40     ?       4
The table I am trying to create will look like this:
Class  Wk  Avg_Score  Sum_X
A      1   45         10
B      1   37.5       12
A      2   43.3       9
B      2   42.5       9
So, the data are summarized by class and week. Avg_Score is the average, across students, of each student's Score plus ExCred for that week (a missing ExCred adds nothing). Sum_X is simply the sum of X for each class and week.
I have had success with this in SAS SQL by using multiple proc means statements, but this is clunky and seems to take a really long time. There has to be a more elegant way to do this. I know it probably involves the group by statement..... Help?
Thanks. Pyll

I see no particular reason not to use proc means here. It should be significantly faster than proc sql on datasets of substantial size.
proc means data=have;
class class wk;
types class*wk;
var score x;
output out=want mean(score)= sum(x)=;
run;
Just preprocess the data to fold ExCred into the Score variable; if execution time is an issue, use a view to do so, as sketched below.
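For example, a view along these lines would do it (a sketch only: the view name have_adj is made up, and COALESCE is used on the assumption that a missing ExCred should count as zero):
proc sql;
  /* hypothetical view folding extra credit into the score; a missing ExCred counts as 0 */
  create view have_adj as
  select name, class, wk,
         score + coalesce(excred, 0) as score,
         x
  from have;
quit;
Point the PROC MEANS step above at data=have_adj and the rest stays the same.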
If you did want to do it in sql, you would indeed use a group by.
proc sql;
create table want as
select class, wk,
       mean(score + coalesce(excred, 0)) as avg_score,
       sum(x) as sum_x
from have
group by class, wk;
quit;

Related

How to pivot wider in SQL or use a more "dynamic" form of the LEAD function?

I have a table that looks as follows:
Policy Number  Benefit Code  Transaction Code
1              A             2
1              B             1
2              A             3
3              A             2
1              C             2
For analysis purposes, it would be much more convenient to have the table in the following form:
PN  BC 1  TC 1  BC 2  TC 2  BC 3  TC 3
1   A     2     B     1     C     2
2   A     3     NULL  NULL  NULL  NULL
3   A     2     NULL  NULL  NULL  NULL
I believe this can be done, for example, in R using the tidyverse package, where the concept is basically pivoting the table from long-form to wide-form. Now, I know that I could possibly use the LEAD function in SQL, but the problem/issue is that I do not know how many benefit codes and transaction codes each policy has (i.e. they are not fixed).
Thus, my questions are:
How can I "pivot wider" my table to achieve something like the above?
Other than "pivoting wider", is there a more dynamic form of the LEAD function in SQL, where it takes all subsequent rows of a group (in my case, each policy number) and puts them in new columns?
Any intuitive explanations or suggestions will be greatly appreciated :)
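One common approach, offered here only as a sketch, is conditional aggregation over a ROW_NUMBER. The table and column names (policies, policy_number, benefit_code, transaction_code), the cap of three pairs, and the ORDER BY benefit_code are all assumptions, since the question does not specify an ordering column:
select policy_number as pn,
       max(case when rn = 1 then benefit_code     end) as bc_1,
       max(case when rn = 1 then transaction_code end) as tc_1,
       max(case when rn = 2 then benefit_code     end) as bc_2,
       max(case when rn = 2 then transaction_code end) as tc_2,
       max(case when rn = 3 then benefit_code     end) as bc_3,
       max(case when rn = 3 then transaction_code end) as tc_3
from (
    select policy_number, benefit_code, transaction_code,
           /* stand-in ordering; in practice use a real sequence or date column */
           row_number() over (partition by policy_number
                              order by benefit_code) as rn
    from policies
) numbered
group by policy_number;
Because a SELECT list must name a fixed set of columns, a truly dynamic number of BC/TC pairs needs dynamic SQL (or a PIVOT with a generated column list) rather than a more flexible LEAD.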

Export query results from SAS to Excel

I'm importing an input.txt into SAS.
The content of the file is:
SUBJECT GENDER HEIGHT WEIGHT
1 M 68.5 -155
2 F 61.2 99
3 F 63.0 115
4 M 70.0 -205
5 M 68.6 170
6 F 65.1 -125
7 M 72.4 220
8 F 72.4 220
I want to export to Excel the following results, based on the WEIGHT column (if they are negative or not):
TOTAL NEGATIVE % NEGATIVE
8 -3 37,5%
I imagined the easiest way to do this is by creating 3 SELECT COUNT(*) queries, putting the result of each one into a variable, and then printing these variables to Excel, but I don't know how to do this exactly.
Also, there might be an easier way.
By the way, I'm new to SAS; I've only been working with it for a couple of days.
Any insights?
In regards to SQL, no need for 3 separate queries. You should be able to do all this in a single query with CASE:
select count(*) totalcount,
count(case when weight < 0 then 1 end) negativecount,
count(case when weight < 0 then 1 end) * 1.0 / count(*) negativepercentage
from yourtable
SQL Fiddle Demo
Should be easy enough to format the percentage as needed.
PROC SQL;
create table WANT as
select count(*) as total,
sum(weight<0) as negative,
calculated negative/calculated total as percent format=percent8.2
from have;
quit;
The export portion depends on your environment. You can generate the code by going to File > Export and selecting Excel as the destination.
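If you would rather do it in code than through the wizard, a minimal PROC EXPORT sketch could look like this (the output path is a placeholder, and DBMS=XLSX assumes a SAS release and licensing that support it):
proc export data=want                       /* table produced by the PROC SQL step above */
    outfile="/path/to/weight_summary.xlsx"  /* placeholder location and file name */
    dbms=xlsx
    replace;
run;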

Count Frequency of subgroup Proc SQL

I am trying to find a way to select a frequency count for rows of a subgroup with no distinct identifiers (well, I guess the distinct identifier is a combination of statuses). Consider the sample data:
data have;
input Series $ Game Name $ Points;
datalines;
A 1 LeBron 2
A 1 LeBron 3
A 1 LeBron 2
A 1 LeBron 2
A 2 LeBron 2
A 2 LeBron 2
A 2 LeBron 3
A 3 LeBron 2
;
run;
Each row here is a shot LeBron took in a game within a series. I want the series/game summary, with a count of the number of shots, like this:
Series  Game  Name    Freq  Sum  2pt  3pt
A       1     LeBron  4     9    3    1
A       2     LeBron  3     7    2    1
A       3     LeBron  1     2    1    0
I have to use Proc SQL here, rather than proc means, because I am pulling the data in from multiple tables. Also, I will have several thousand "Series" and many more "Games" and "Names", so please keep the answer general. Here is what I have:
proc sql;
create table want as
select Series,
Game,
Name,
sum(points) as totalpoints
from have
group by 1,2,3;
run;
Thanks.
Pyll
No particular reason you couldn't use PROC MEANS pulling from multiple tables - you can always create a view (either in SQL or in the data step). But anyway,
proc sql;
create table want as
select Series,
Game,
Name,
sum(points) as totalpoints,
count(points) as numbershotsmade
from have
group by 1,2,3;
quit;
You can also use the N() function, which does the same thing.
count(points) counts the non-null Points values; count(1) (or count(*)) counts the total number of rows even if Points is null.
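If you also want the 2pt/3pt breakdown from the desired output, one possible extension (a sketch: it assumes Points only ever takes the values 2 and 3, and the column names TwoPt/ThreePt are made up, since 2pt/3pt are not valid SAS names by default) uses the fact that a SAS boolean expression evaluates to 1 or 0:
proc sql;
  create table want as
  select Series,
         Game,
         Name,
         count(points)   as Freq,        /* number of shots        */
         sum(points)     as TotalPoints, /* total points scored    */
         sum(points = 2) as TwoPt,       /* count of 2-point shots */
         sum(points = 3) as ThreePt      /* count of 3-point shots */
  from have
  group by 1,2,3;
quit;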

SUM by Combination in SAS

I want to get from this table:
[ProductCode] [ClientNO] [Fund]
11 3 100
12 4 45
11 3 18
12 4 5
To this one:
[ProductCode] [ClientNO] [Fund]
11 3 118
12 4 50
So basically sum FUND when all the given variables match.
I'm almost there with this statement:
Proc sql;
create table SumByCombination as
select *, sum(Fund) as Total
from FundsData
group by ProductCode,ClientNO
;
quit;
But with this I get all the rows (duplicates) with a SUM column.
Edit: This is what I get.
[ProductCode] [ClientNO] [_SUM_]
11 3 118
12 4 50
11 3 118
12 4 50
I know this should be a no-brainer but I keep getting stuck.
What would be the easiest way to do this in Proc SQL ? What about other methods ?
Thanks
Stop using SELECT * in your queries. You should explicitly identify the columns that you want the SELECT to return. In SAS PROC SQL, selecting columns that are not in the GROUP BY alongside an aggregate makes SAS remerge the summary statistic back onto every original row, which is exactly the duplication you are seeing.
Select * is nasty and evil and should very, very rarely, if ever, be used.
Here is the SQL Fiddle, which returns your expected result:
select ProductCode
,ClientNO
,sum(Fund) as Total
from FundsData
group by
ProductCode
,ClientNO
You're using SAS, so do it the SAS way - PROC MEANS.
proc means data=fundsdata;
var fund;
class productcode clientno;
types productcode*clientno;
output out=sumbycombination sum(fund)=fund;
run;

How does a SQL query work?

How does a SQL query work?
How does it get compiled?
Is the from clause compiled first to see if the table exists?
How does it actually retrieve data from the database?
How and in what format are the tables stored in a database?
I am using phpMyAdmin; is there any way I can peek into the files where the data is stored?
I am using MySQL
SQL execution order:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT
A SQL query mainly works in three phases.
1) Row filtering, phase 1: done by the FROM, WHERE, GROUP BY and HAVING clauses.
2) Column filtering: columns are filtered by the SELECT clause.
3) Row filtering, phase 2: done by the DISTINCT, ORDER BY and LIMIT clauses.
Here I will explain with an example. Suppose we have a students table as follows:
id_  name_     marks  section_
1    Julia     88     A
2    Samantha  68     B
3    Maria     10     C
4    Scarlet   78     A
5    Ashley    63     B
6    Abir      95     D
7    Jane      81     A
8    Jahid     25     C
9    Sohel     90     D
10   Rahim     80     A
11   Karim     81     B
12   Abdullah  92     D
Now we run the following SQL query:
select section_,sum(marks) from students where id_<10 GROUP BY section_ having sum(marks)>100 order by section_ LIMIT 2;
Output of the query is:
section_  sum
A         247
B         131
But how did we get this output?
I have explained the query step by step. Please read below:
1. FROM, WHERE clause execution
The FROM clause runs first, so from students where id_<10 eliminates the rows whose id_ is greater than or equal to 10. The following rows remain after executing from students where id_<10:
id_  name_     marks  section_
1    Julia     88     A
2    Samantha  68     B
3    Maria     10     C
4    Scarlet   78     A
5    Ashley    63     B
6    Abir      95     D
7    Jane      81     A
8    Jahid     25     C
9    Sohel     90     D
2. GROUP BY clause execution
Now the GROUP BY clause comes into play; after executing GROUP BY section_ the rows are grouped like below:
id_  name_     marks  section_
9    Sohel     90     D
6    Abir      95     D
1    Julia     88     A
4    Scarlet   78     A
7    Jane      81     A
2    Samantha  68     B
5    Ashley    63     B
3    Maria     10     C
8    Jahid     25     C
3. HAVING clause execution
having sum(marks)>100 eliminates groups: sum(marks) of group D is 185, of group A is 247, of group B is 131, and of group C is 35. Group C's sum is not greater than 100, so group C is eliminated and the table looks like this:
id_  name_     marks  section_
9    Sohel     90     D
6    Abir      95     D
1    Julia     88     A
4    Scarlet   78     A
7    Jane      81     A
2    Samantha  68     B
5    Ashley    63     B
4. SELECT clause execution
select section_, sum(marks) only decides which columns to print; here it is decided to print the section_ and sum(marks) columns.
section_  sum
D         185
A         247
B         131
5. ORDER BY clause execution
order by section_ sorts the rows in ascending order of section_.
section_  sum
A         247
B         131
D         185
6. LIMIT clause execution
LIMIT 2 prints only the first 2 rows.
section_  sum
A         247
B         131
This is how we got our final output.
Well...
First you have a syntax check, followed by the generation of an expression tree - at this stage you can also test whether elements exist and "line up" (i.e. the fields do exist WITHIN the table). This is the first step - any error here and you just tell the submitter to get real.
Then you have... analysis. A SQL query is different from a program in that it does not say HOW to do something, just WHAT THE RESULT IS. Set-based logic. So a query analyzer comes in (ranging from bad to good depending on the product - Oracle long had crappy ones, DB2 some of the most sophisticated, even measuring disc speed) to decide how best to approach this result. This is a really complicated beast - it may try dozens or hundreds of approaches to find the one it believes to be fastest (cost based, basically using statistics).
Then that gets executed.
The query analyzer, by the way, is where you see huge differences. Not sure about MySQL - SQL Server (Microsoft) shines not because it has the best one (though it is one of the good ones), but because it has really nice visual tools to SHOW the query plan and compare the analyzer's estimates to the real needs (if they differ too much, table statistics may be off, so the analyzer THINKS a large table is small). It presents that nicely, visually.
DB2 had a great optimizer for some time, measuring - as I already said - disc speed to feed into its estimates. Oracle went "left to right" (no real analysis) for a long time and relied on user-provided query hints (a crap approach). I think MySQL was VERY primitive at the start too - not sure where it is now.
Table format in the database etc. - that is really something you should not care about. It is documented (clearly, especially for an open-source database), but why should you care? I have done SQL work for nearly 15 years and never had that need, and that includes quite high-end work in some areas. Unless you are trying to build a database file repair tool... it makes no sense to bother.
The order of SQL clause execution:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
My answer is specific to Oracle Database, which provides tutorials pertaining to your questions. When the SQL engine processes any SQL query/statement, it first parses it, and within parsing it performs three checks: syntax, semantic and shared pool. To learn how these checks work, follow the link below.
Once parsing is done, it moves on to the execution plan. But the database engine is smart enough to check whether this SQL query has already been parsed (a soft parse); if so, it jumps straight to the execution plan, otherwise it goes deeper and optimizes the query (a hard parse). During a hard parse it also uses the row source generator, which produces the iterative execution plan received from the optimizer. See the SQL query processing stages below.
Note - before the execution plan, it also performs bind operations for variable values, and once the query is executed it performs a fetch to obtain the records and finally store them in the result set. So, in short, the order is:
PARSE -> BIND -> EXECUTE -> FETCH
And for in-depth details, this tutorial is waiting for you.
This may be helpful to someone.
If you're using SSMS for SQL Server and want to know where your data files are stored, you can use this query:
SELECT
mdf.database_id,
mdf.name,
mdf.physical_name as data_file,
ldf.physical_name as log_file,
db_size = CAST((mdf.size * 8.0)/1024 AS DECIMAL(8,2)),
log_size = CAST((ldf.size * 8.0 / 1024) AS DECIMAL(8,2))
FROM (SELECT * FROM sys.master_files WHERE type_desc = 'ROWS' ) mdf
JOIN (SELECT * FROM sys.master_files WHERE type_desc = 'LOG' ) ldf
ON mdf.database_id = ldf.database_id
Here's a copy of the output
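Since the original question mentions MySQL rather than SQL Server, the rough equivalent there is to ask the server for its data directory, where each database keeps its table files (shown only as a pointer, not a full answer):
-- MySQL: show where the server stores its data files
SHOW VARIABLES LIKE 'datadir';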