SAS - Proc Sql - Creating and merging using a new variable - sql

I want to create a new variable based on certain conditions in a table and merge it to another table using the newly created variable in a single data or a proc sql step.
eg)
table 1
var new_var
x 3x
y 4y
z 5z
table 2
new_var additional_var
3x a
3x a
4y z
and merging both the tables using the new_var in a single step
Thanks

You can accomplish this using a join with an inline view. Inline views can save you coding time and reduce I/O operations, which are often the biggest cause of slowdowns, particularly on non-SSD drives. Replace <conditions> below with the logic that defines your new variable.
proc sql noprint;
create table want as
select var, t1.new_var, additional_var
from table1 as t1
LEFT JOIN
(select new_var,
CASE
when(<conditions>) then 'a'
else 'z'
END as additional_var
from table2) as t2
ON t1.new_var = t2.new_var;
quit;
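For anyone who wants to try the inline-view pattern without a SAS session, here is a minimal sketch using SQLite through Python's sqlite3 module. The flag column and the condition flag = 1 are assumptions standing in for <conditions>. The sample rows mirror the question.

```python
import sqlite3

# Hypothetical stand-ins for table1 and table2 from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (var TEXT, new_var TEXT);
    CREATE TABLE table2 (new_var TEXT, flag INTEGER);
    INSERT INTO table1 VALUES ('x', '3x'), ('y', '4y'), ('z', '5z');
    INSERT INTO table2 VALUES ('3x', 1), ('3x', 1), ('4y', 0);
""")

# The inline view derives additional_var, and the outer query joins on
# new_var in the same statement. "flag = 1" is an assumed condition.
rows = conn.execute("""
    SELECT t1.var, t1.new_var, t2.additional_var
    FROM table1 AS t1
    LEFT JOIN (
        SELECT new_var,
               CASE WHEN flag = 1 THEN 'a' ELSE 'z' END AS additional_var
        FROM table2
    ) AS t2
    ON t1.new_var = t2.new_var
    ORDER BY t1.var
""").fetchall()

for row in rows:
    print(row)
```

Because this is a left join, the row for z is kept with a missing additional_var, and x appears twice because table2 holds two rows for 3x.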

Related

Left Joining table B on table A is instantaneous when B has matches to A, but takes forever (> 1 minute) when there are no matches. Why is this?

I want to:
SELECT T.TournamentId, Sum(Score) as Score
FROM Tournament T
LEFT JOIN Scores S on S.TournamentId = T.TournamentId
WHERE T.TournamentId = x
GROUP BY T.TournamentId
When I choose a TournamentId "x" in the WHERE clause that hasn't started yet, the query takes forever to run. When I choose an "x" for a tournament that has started, it runs instantly.
The real query is a bit more complicated than this. But this is the root of the issue. Why would this be and what can I do to speed it up? I'd like to be able to use the same query for both cases. But if there's nothing I can do, I'll create a second query to run when the Tournament hasn't started.
If any of the tables are very big, this behavior makes sense. In that case you should restrict the rows involved as early as possible (some people will claim the server does this by itself, but that is not always the case).
Try for instance
SELECT
T.TournamentId, SUM(Score) AS Score
FROM
Tournament T
LEFT JOIN
Scores S ON S.TournamentId = T.TournamentId AND S.TournamentId = x
WHERE
T.TournamentId = x
GROUP BY
T.TournamentId
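A quick way to convince yourself that moving the filter into the join condition does not change the result is to run both forms side by side. The sketch below uses SQLite through Python's sqlite3 module with made-up rows. It only checks the results and says nothing about how your server will plan either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tournament (TournamentId INTEGER PRIMARY KEY);
    CREATE TABLE Scores (TournamentId INTEGER, Score INTEGER);
    INSERT INTO Tournament VALUES (1), (2);
    INSERT INTO Scores VALUES (1, 10), (1, 5);  -- tournament 2 has no scores yet
""")

FILTER_IN_WHERE = """
    SELECT T.TournamentId, SUM(S.Score) AS Score
    FROM Tournament T
    LEFT JOIN Scores S ON S.TournamentId = T.TournamentId
    WHERE T.TournamentId = ?
    GROUP BY T.TournamentId
"""
FILTER_IN_JOIN = """
    SELECT T.TournamentId, SUM(S.Score) AS Score
    FROM Tournament T
    LEFT JOIN Scores S ON S.TournamentId = T.TournamentId AND S.TournamentId = ?
    WHERE T.TournamentId = ?
    GROUP BY T.TournamentId
"""

def run_both(tournament_id):
    """Return the results of both query forms for one tournament."""
    a = conn.execute(FILTER_IN_WHERE, (tournament_id,)).fetchall()
    b = conn.execute(FILTER_IN_JOIN, (tournament_id, tournament_id)).fetchall()
    return a, b

started = run_both(1)       # tournament with scores
not_started = run_both(2)   # tournament without scores
print(started, not_started)
```

Both forms return one row per tournament, with a NULL score when nothing has been played yet.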
Otherwise you can write a stored procedure that creates a temporary table containing only the rows of S where TournamentId = x. Joins can be expensive, and shrinking the tables before joining them can speed things up considerably.
As pointed out, you can also use an index to speed things up. The index needs to be on the columns you join on, and you should consider rebuilding it on a regular basis (easy in MSSQL, painful in MySQL).
To speed things up even more, you can add custom indexes to your temporary tables. This helps most when you have many huge tables; with only a few hundred or a few thousand rows the impact will be negligible or even negative.
# Make sure the tables are not already there
DROP TEMPORARY TABLE IF EXISTS tmp_Tournament;
DROP TEMPORARY TABLE IF EXISTS tmp_Scores;
CREATE TEMPORARY TABLE tmp_Tournament
SELECT * from Tournament WHERE TournamentId = x;
CREATE INDEX tmp_Tournament_TournamentId ON tmp_Tournament (TournamentId);
CREATE TEMPORARY TABLE tmp_Scores
SELECT * FROM Scores WHERE TournamentId = x;
CREATE INDEX tmp_Scores_TournamentId ON tmp_Scores (TournamentId);
SELECT
T.TournamentId, SUM(Score) AS Score
FROM
tmp_Tournament T
LEFT JOIN
tmp_Scores S ON S.TournamentId = T.TournamentId AND S.TournamentId = x
WHERE
T.TournamentId = x
GROUP BY
T.TournamentId;
# Just some cleanup
DROP TEMPORARY TABLE IF EXISTS tmp_Tournament;
DROP TEMPORARY TABLE IF EXISTS tmp_Scores;
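The same shrink-first idea can be tried out in SQLite through Python's sqlite3 module. This is only a sketch with made-up rows. SQLite's temp table syntax differs slightly from the MySQL statements above, and the helper name total_via_temp_table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Scores (TournamentId INTEGER, Score INTEGER);
    INSERT INTO Scores VALUES (1, 10), (1, 5), (3, 7);
""")

def total_via_temp_table(tournament_id):
    """Copy only the matching rows into an indexed temp table, then aggregate."""
    conn.execute("DROP TABLE IF EXISTS tmp_Scores")
    conn.execute(
        "CREATE TEMP TABLE tmp_Scores (TournamentId INTEGER, Score INTEGER)"
    )
    conn.execute(
        "INSERT INTO tmp_Scores "
        "SELECT TournamentId, Score FROM Scores WHERE TournamentId = ?",
        (tournament_id,),
    )
    conn.execute(
        "CREATE INDEX tmp_Scores_TournamentId ON tmp_Scores (TournamentId)"
    )
    (total,) = conn.execute("SELECT SUM(Score) FROM tmp_Scores").fetchone()
    conn.execute("DROP TABLE tmp_Scores")  # cleanup, also drops the index
    return total

print(total_via_temp_table(1), total_via_temp_table(2))
```

On a table this small the temp table is pure overhead, so it is worth measuring against the plain join on your real data.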

How to force evaluation of subquery before joining / pushing down to foreign server

Suppose I want to query a big table with a few WHERE filters. I am using Postgres 11 and a foreign table; foreign data wrapper (FDW) is clickhouse_fdw. But I am also interested in a general solution.
I can do so as follows:
SELECT id,c1,c2,c3 from big_table where id=3 and c1=2
My FDW is able to do the filtering on the remote foreign data source, ensuring that the above query is quick and doesn't pull down too much data.
The above works the same if I write:
SELECT id,c1,c2,c3 from big_table where id IN (3,4,5) and c1=2
I.e., all of the filtering is sent downstream.
However, if the filtering I'm trying to do is slightly more complex:
SELECT bt.id,bt.c1,bt.c2,bt.c3
from big_table bt
join lookup_table l on bt.id=l.id
where c1=2 and l.x=5
then the query planner decides to filter on c1=2 remotely but apply the other filter locally.
In my use case, calculating which ids have l.x=5 first and then sending those off to be filtered remotely will be much quicker, so I tried to write it the following way:
SELECT id,c1,c2,c3
from big_table
where c1=2
and id IN (select id from lookup_table where x=5)
However, the query planner still decides to perform the second filter locally on ALL of the results from big_table that satisfy c1=2, which is very slow.
Is there some way I can "force" (select id from lookup_table where x=5) to be pre-calculated and sent as part of a remote filter?
Foreign data wrapper
Typically, joins or any derived tables from subqueries or CTEs are not available on the foreign server and have to be executed locally. I.e., all rows remaining after the simple WHERE clause in your example have to be retrieved and processed locally like you observed.
If all else fails you can execute the subquery SELECT id FROM lookup_table WHERE x = 5 and concatenate results into the query string.
More conveniently, you can automate this with dynamic SQL and EXECUTE in a PL/pgSQL function. Like:
CREATE OR REPLACE FUNCTION my_func(_c1 int, _l_id int)
RETURNS TABLE(id int, c1 int, c2 int, c3 int) AS
$func$
BEGIN
RETURN QUERY EXECUTE
'SELECT id,c1,c2,c3 FROM big_table
WHERE c1 = $1
AND id = ANY ($2)'
USING _c1
, ARRAY(SELECT l.id FROM lookup_table l WHERE l.x = _l_id);
END
$func$ LANGUAGE plpgsql;
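The same two-step idea (resolve the ids first, then send them as plain values) can also be done from client code. Below is a minimal sketch using Python's sqlite3 module, with local tables standing in for lookup_table and the foreign big_table. The function name fetch_filtered and the sample rows are assumptions, and the sketch only shows the shape of the queries, not whether a particular FDW pushes the list down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup_table (id INTEGER, x INTEGER);
    CREATE TABLE big_table (id INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER);
    INSERT INTO lookup_table VALUES (3, 5), (4, 5), (9, 1);
    INSERT INTO big_table VALUES (3, 2, 0, 0), (4, 1, 0, 0), (9, 2, 0, 0);
""")

def fetch_filtered(c1, x):
    # Step 1: resolve the ids with a separate query.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM lookup_table WHERE x = ?", (x,))]
    if not ids:
        # An empty IN () list is not valid in standard SQL or PostgreSQL.
        return []
    # Step 2: pass the ids as bound values, one placeholder per id.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, c1, c2, c3 FROM big_table "
        f"WHERE c1 = ? AND id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, [c1, *ids]).fetchall()

result = fetch_filtered(2, 5)
print(result)
```

Only the placeholder list is built with string formatting. The values themselves are still bound as parameters, which avoids quoting problems.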
Related:
Table name as a PostgreSQL function parameter
Or try this search on SO.
Or you might use the meta-command \gexec in psql. See:
Filter column names from existing table for SQL DDL statement
Or this might work (although feedback says it does not):
SELECT id,c1,c2,c3
FROM big_table
WHERE c1 = 2
AND id = ANY (ARRAY(SELECT id FROM lookup_table WHERE x = 5));
Testing locally, I get a query plan like this:
Index Scan using big_table_idx on big_table (cost= ...)
Index Cond: (id = ANY ($0))
Filter: (c1 = 2)
InitPlan 1 (returns $0)
-> Seq Scan on lookup_table (cost= ...)
Filter: (x = 5)
The parameter $0 in the plan inspires hope. The generated array might be something Postgres can pass on to be used remotely. I don't see a similar plan with any of your other attempts or some more I tried myself. Can you test with your fdw?
Related question concerning postgres_fdw:
postgres_fdw: possible to push data to foreign server for join?
General technique in SQL
That's a different story. Just use a CTE. But I don't expect that to help with the FDW.
WITH cte AS (SELECT id FROM lookup_table WHERE x = 5)
SELECT id,c1,c2,c3
FROM big_table b
JOIN cte USING (id)
WHERE b.c1 = 2;
PostgreSQL 12 changed (improved) behavior, so that CTEs can be inlined like subqueries, given some preconditions. But, quoting the manual:
You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query
So:
WITH cte AS MATERIALIZED (SELECT id FROM lookup_table WHERE x = 5)
...
Typically, none of this should be necessary if your DB server is configured properly and column statistics are up to date. But there are corner cases with uneven data distribution ...

How do I perform multiple left joins while maintaining an existing index?

In SAS, I have a large table that I want to increment with information from multiple small tables by performing left joins (or equivalent). My logic requires many steps (i.e. can't join everything at the same time). After each join, I want to keep large_table's existing index, making the best use of it. How can I rewrite the following code to accomplish this?
/*Join 1*/
proc sql;
create table large_table as
select a.*, b.newinfo1
from large_table a
left join small_table1 b on a.id = b.id;
quit;
/*some logic*/
/*Join 2*/
proc sql;
create table large_table as
select a.*, b.newinfo2
from large_table a
left join small_table2 b on a.id = b.id;
quit;
/*...*/
A single query would certainly be better. But if that is not possible, you have a few options.
The most SAS-like is not a SQL query but a MODIFY statement. This performs the equivalent of a left join and modifies the master dataset in place rather than replacing it. All of the variables must already be defined on the master dataset for this to work.
data class(index=(name));
set sashelp.class;
call missing(predict); *define PREDICT so it is available to be updated;
where sex='F';
run;
data classfit(index=(name));
set sashelp.classfit;
run;
data class;
modify class classfit; *only PREDICT will be appended here;
by name;
select (_IORC_); *this processes the 'left' join;
when (%sysrc(_sok)) replace; *if in master then replace;
when (%sysrc(_dsenmr)) delete; *if not in master then delete;
otherwise abort;
end;
run;
proc contents data=class;
run;
You could do something similar in SQL using an UPDATE statement.
proc sql;
update class
set predict = (
select predict from classfit
where class.name=classfit.name
);
quit;
proc contents data=class;
run;
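One thing to watch with the UPDATE form: under ordinary SQL semantics, a scalar subquery that finds no row yields NULL (missing), so rows without a match should have an existing value overwritten. The sketch below illustrates this in SQLite through Python's sqlite3 module, with made-up rows and a hypothetical unmatched row named Zed. Adding a WHERE EXISTS guard limits the update to matched rows. In both cases the table is updated in place, so its key and indexes are left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (name TEXT PRIMARY KEY, predict REAL);
    CREATE TABLE classfit (name TEXT PRIMARY KEY, predict REAL);
    INSERT INTO class VALUES ('Alice', NULL), ('Barbara', NULL), ('Zed', 1.0);
    INSERT INTO classfit VALUES ('Alice', 64.5), ('Barbara', 77.3);
""")

def snapshot():
    return conn.execute("SELECT name, predict FROM class ORDER BY name").fetchall()

# Unguarded form: every row of class is updated, matched or not.
conn.execute("""
    UPDATE class
    SET predict = (SELECT f.predict FROM classfit f WHERE f.name = class.name)
""")
unguarded = snapshot()

# Restore the unmatched row, then update only rows that have a match.
conn.execute("UPDATE class SET predict = 1.0 WHERE name = 'Zed'")
conn.execute("""
    UPDATE class
    SET predict = (SELECT f.predict FROM classfit f WHERE f.name = class.name)
    WHERE EXISTS (SELECT 1 FROM classfit f WHERE f.name = class.name)
""")
guarded = snapshot()

print(unguarded)
print(guarded)
```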
For a single new column, a SAS custom format can serve as the equivalent of a left join. Create one custom format from each small table and, instead of rebuilding the large table, create a view that repeats the id once per format and applies the corresponding format to each copy, so the view exposes the extra information.
Because the solution is built on a view, the extended large_table picks up changes in the small tables as soon as the corresponding formats are rebuilt.
For example
data fmt1 / view=fmt1;
fmtname = 'small_1_concept';
set small_table1(rename=(id=start newinfo1=label));
run;
data fmt2 / view=fmt2;
fmtname = 'small_2_concept';
set small_table2(rename=(id=start newinfo2=label));
run;
proc format cntlin=fmt1;
proc format cntlin=fmt2;
proc sql;
create view large_table_extended_v as
select
large_table.*
, id as id1 format=small_1_concept.
, id as id2 format=small_2_concept.
from
large_table
;
quit;

Why would using a temp table vs a table variable improve the speed of this query?

I currently have a performance issue with a query (that is more complicated than the example below). Originally the query took about 30 seconds to run; when I switched from a table variable to a temp table, the run time dropped to a few seconds.
Here is a trimmed down version using a table variable:
-- Store XML into tables for use in query
DECLARE @tCodes TABLE([Code] VARCHAR(100))
INSERT INTO
@tCodes
SELECT
ParamValues.ID.value('.','VARCHAR(100)') AS 'Code'
FROM
@xmlCodes.nodes('/ArrayOfString/string') AS ParamValues(ID)
SELECT
'SummedValue' = SUM(ot.[Value])
FROM
[SomeTable] st (NOLOCK)
JOIN
[OtherTable] ot (NOLOCK)
ON ot.[SomeTableID] = st.[ID]
WHERE
ot.[CodeID] IN (SELECT [Code] FROM @tCodes) AND
st.[Status] = 'ACTIVE' AND
YEAR(ot.[SomeDate]) = 2013 AND
LEFT(st.[Identifier], 11) = @sIdentifier
Here is the version with the temp table which performs MUCH faster:
SELECT
ParamValues.ID.value('.','VARCHAR(100)') AS 'Code'
INTO
#tCodes
FROM
@xmlCodes.nodes('/ArrayOfString/string') AS ParamValues(ID)
SELECT
'SummedValue' = SUM(ot.[Value])
FROM
[SomeTable] st (NOLOCK)
JOIN
[OtherTable] ot (NOLOCK)
ON ot.[SomeTableID] = st.[ID]
WHERE
ot.[CodeID] IN (SELECT [Code] FROM #tCodes) AND
st.[Status] = 'ACTIVE' AND
YEAR(ot.[SomeDate]) = 2013 AND
LEFT(st.[Identifier], 11) = @sIdentifier
The problem I have with performance is solved with the change but I just don't understand why it fixes the issue and would prefer to know why. It could be related to something else in the query but all I have changed in the stored proc (which is much more complicated) is to switch from using a table variable to using a temp table. Any thoughts?
The differences and similarities between table variables and #temp tables are looked at in depth in my answer here.
Regarding the two queries you have shown (unindexed table variable vs unindexed temp table) three possibilities spring to mind.
1. INSERT ... SELECT to table variables is always serial. The SELECT can be parallelised for temp tables.
2. Temp tables can have column statistics histograms auto created for them.
3. Usually the cardinality of table variables is assumed to be 0 (when they are compiled while the table is empty).
From the code you have shown (3) seems the most likely explanation.
This can be resolved by using OPTION (RECOMPILE) to recompile the statement after the table variable has been populated.

How to merge two SAS datasets on one of two possible variables?

I am trying to merge two large (million+) datasets in SAS. I'm pretty new to SAS and this is my first stackexchange question so hopefully the following makes sense...
SETUP:
All observations in the "Master" dataset have a unique identifier var1 and some have unique identifier var2. Some observations in the "Addition" dataset have unique identifier var1 and some have unique identifier var2; some observations have var2 but not var1.
I want to merge in all matches from the Addition dataset on EITHER var1 or var2 into the Master dataset.
METHODS I HAVE EXPLORED:
Option A: proc sql left join on var1 OR var2. Unfortunately, because there are multiple missing observations on var2 in both Master and Addition this runs into a Cartesian product problem - it works, but is impractically slow with my large datasets.
proc sql;
create table match as
select a.id1, a.id2, varmast, b.varadd
from master a
left join addition b
on (a.id1=b.id1 and a.id2=b.id2) or (a.id2=b.id2 and b.id2 is not null);
quit;
Option B: I'm thinking maybe merge on the first identifier and then use proc sql update to update from the Addition variables using the second identifier? I'm not sure of the syntax.
Option C: I could see probably doing this with a few regular merges & then appending and deduping, but as this would probably take 5+ steps and each step takes a few minutes to run (on a good day) am hoping for something shorter.
I suspect that two left joins are what you want . . . and it should have better performance. The result is something like this:
proc sql;
create table match as
select m.id1, m.id2, varmast, coalesce(a.varadd, a2.varadd) as varadd
from master m left join
addition a
on m.id1 = a.id1 and m.id2 = a.id2 left join
addition a2
on m.id1 = a2.id1 and m.id2 is null and a.id1 is null;
quit;
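If the goal is "match on id1, and fall back to id2 when id1 finds nothing", a variation on the two-join idea is to join once per key and coalesce the results. This is a sketch of that variation, not the exact query above, run in SQLite through Python's sqlite3 module with made-up rows. Note that SQLite never treats NULL = NULL as a match, whereas SAS compares missing values as equal, so the explicit IS NOT NULL guard matters more in PROC SQL than it does here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id1 INTEGER, id2 TEXT, varmast TEXT);
    CREATE TABLE addition (id1 INTEGER, id2 TEXT, varadd TEXT);
    INSERT INTO master VALUES (1, 'A', 'm1'), (2, 'B', 'm2'), (3, NULL, 'm3');
    INSERT INTO addition VALUES (1, NULL, 'by_id1'), (NULL, 'B', 'by_id2');
""")

rows = conn.execute("""
    SELECT m.id1, m.id2, m.varmast,
           COALESCE(a1.varadd, a2.varadd) AS varadd
    FROM master m
    LEFT JOIN addition a1
           ON m.id1 = a1.id1
    LEFT JOIN addition a2
           ON m.id2 = a2.id2
          AND m.id2 IS NOT NULL   -- guard against missing-equals-missing matches
          AND a1.id1 IS NULL      -- only fall back when the id1 join found nothing
    ORDER BY m.id1
""").fetchall()

for row in rows:
    print(row)
```

Each join uses a plain equality on one key, so the engine can use an ordinary hash or index join instead of evaluating an OR condition across all row pairs.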