Optimize SAS proc sql - sql

I am joining 2 tables and creating a mini Cartesian join between them so that all businesses within a city and state are matched up, then I am using some fuzzy logic to try and match business name and street name. There are ~3 million records on the input table and ~25 million records on the output table, so it is taking an extremely long time to run. I have created indexes on all the columns being joined and all columns being used in the where statement.
My next thought was to replace the city/state names with integers, but I'd be adding processing time to create those tables. Does anyone have any other thoughts on decreasing the processing time?
proc sql;
create index output_stname on tbl._output (output_stname);
create index output_namevar on tbl._output (output_namevar);
create index key on tbl._output (key);
create index city on tbl._output (city);
create index state on tbl._output (state);
create index input_stname on tbl._input (input_stname);
create index input_namevar on tbl._input (input_namevar);
create index key_input on tbl._input (key_input);
create index city_input on tbl._input (city_input);
create index state_input on tbl._input (state_input);
;
quit;
proc sql;
create table tbl._level2 as
select distinct
key_input,
name_input,
address_input,
city_input,
state_input,
zip_input,
key,
business_nm1,
address,
city,
state,
zip,
'2 - Street Name & Business Name Match' as matchtype
from tbl._input a
left join tbl._output b on a.city_input=b.city and a.state_input=b.state
where
compged(a.input_stname,b.output_stname) <= 50 and
compged(input_namevar,output_namevar) <= 50
and case
when length(strip(a.input_namevar)) <= 2 then 1
when length(strip(b.output_namevar)) <= 2 then 1
else 0
end = 0
;
quit;

I would start with a composite index on the output table:
proc sql;
create index state_city_stname_namevar on tbl._output (state, city, output_stname, output_namevar);
quit;
This should speed up the joins. However, the select distinct is still suspicious. It is generally better not to need select distinct.

I would suggest not processing this with SQL. The SQL optimizer can't really optimize this very well due to the COMPGED and the CASE statements, as it doesn't really know how often those are going to be true; and the COMPGED is very expensive. As such you're going to get a very slow process in any event.
Most likely, a hash solution is best. It's hard to say without looking at the data (how many city/state pairs are there, for example - are there a huge number of unique ones, or a relatively small number?). But a hash solution will likely be faster, particularly as it avoids the index creation step, assuming you can fit the output table into the hash (or, alternately, fit the input table into the hash) in memory.
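As a rough illustration of the hash idea (written in Python rather than a SAS data step), the sketch below loads the output table into an in-memory lookup keyed by (city, state) and then makes a single pass over the input table. It rests on several assumptions: the row layout and sample values are invented, plain Levenshtein distance stands in for COMPGED (which scores edits differently), and the threshold of 2 is arbitrary.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance; a stand-in for SAS COMPGED, not equivalent to it."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match(input_rows, output_rows, max_dist=2):
    # Build the lookup once: (city, state) -> candidate output rows.
    # Names of length <= 2 are dropped here, mirroring the CASE filter.
    by_city_state = defaultdict(list)
    for row in output_rows:
        if len(row["namevar"].strip()) > 2:
            by_city_state[(row["city"], row["state"])].append(row)

    found = []
    for inp in input_rows:
        if len(inp["namevar"].strip()) <= 2:
            continue
        # Only rows in the same city/state are ever compared.
        for out in by_city_state.get((inp["city"], inp["state"]), []):
            if (edit_distance(inp["stname"], out["stname"]) <= max_dist
                    and edit_distance(inp["namevar"], out["namevar"]) <= max_dist):
                found.append((inp["key"], out["key"]))
    return found


# Hypothetical sample data.
input_rows = [
    {"key": "i1", "city": "AUSTIN", "state": "TX", "stname": "MAIN", "namevar": "ACME HARDWARE"},
    {"key": "i2", "city": "DALLAS", "state": "TX", "stname": "ELM", "namevar": "AB"},
]
output_rows = [
    {"key": "o1", "city": "AUSTIN", "state": "TX", "stname": "MAIN", "namevar": "ACME HARDWRE"},
    {"key": "o2", "city": "AUSTIN", "state": "TX", "stname": "OAK", "namevar": "ZENITH BAKERY"},
    {"key": "o3", "city": "DALLAS", "state": "TX", "stname": "ELM", "namevar": "AB"},
]
matches = match(input_rows, output_rows)
print(matches)
```

The expensive comparison still runs once per candidate pair, so the gain comes from skipping index creation and from cheap lookups, not from fewer fuzzy comparisons.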

Related

Would Creating an Index speed up a query in SAS

I have never created an Index before but I'm thinking it may help here. I have a SAS dataset of approx. 7million records. It is a listing of employee entries along with their respective timestamps. I am identifying if there are any subsequent entries by the same user on the same day and then noting the timestamp.
The data set (Entries) is 3 columns: Storage_ID, User_ID and EventTimestamp.
I'm thinking maybe an index on Storage_ID and User_ID would help speed things along.
If they would help, how/where would I need to go about creating the index?
PROC SQL;
CREATE TABLE sub_ENTRIES AS
SELECT A.*,
(SELECT
MIN(B.EVENTTIMESTAMP)
FROM
ENTRIES B
WHERE
A.STORAGE_ID=B.STORAGE_ID
AND A.USER_ID=B.USER_ID
AND DATEPART(A.EVENTTIMESTAMP)=DATEPART(B.EVENTTIMESTAMP)
AND B.EVENTTIMESTAMP > A.EVENTTIMESTAMP
) AS NEXT_ACCESS FORMAT=DATETIME27.6
FROM
ENTRIES A
;
You can create a composite index (two or more columns) using SQL.
For example:
Proc SQL;
create index STORAGE_USER on ENTRIES (storage_id, user_id);
The general syntax for an index key of n columns is:
create index <index-name>
on <table-name>
( <column-name-1>,
<column-name-2>,
…
<column-name-<n>>
)
The index is most effective / applicable when the query's selection or join criteria involve all the columns of the composite key. Use OPTIONS MSGLEVEL=I to have SAS log index usage.
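The point about composite keys can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it shows SQLite's planner, not SAS's. The table and index names are taken from the question, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (storage_id INTEGER, user_id INTEGER, eventtimestamp TEXT)"
)
conn.execute("CREATE INDEX storage_user ON entries (storage_id, user_id)")


def plan(sql: str) -> str:
    """Return SQLite's query plan for sql as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


# Both key columns are constrained: the composite index applies.
both_columns = plan("SELECT * FROM entries WHERE storage_id = 1 AND user_id = 2")
# Only the second key column is constrained: the index is not used.
second_only = plan("SELECT * FROM entries WHERE user_id = 2")

print(both_columns)
print(second_only)
```

In SAS, the MSGLEVEL=I log output plays the role that EXPLAIN QUERY PLAN plays here.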

dictionary database, one table vs table for each char

I have a very simple database that contains one table T:
wOrig nvarchar(50), not null
wTran nvarchar(50), not null
The table has +50 million rows. I execute a simple query
select wTran from T where wOrig = 'myword'
The query takes about 40 sec to complete. I divided the table based on the first char of wOrig and the execution time is much smaller than before (based on each table new length).
Am I missing something here? Shouldn't the database use a more efficient way to do the search, like binary search?
My question: what changes to the database options, given this situation, could make the search more efficient while keeping all the data in one table?
You should be using an index. For your query, you want an index on T(wOrig). Your query will be much faster:
create index idx_T_wOrig on T(wOrig);
Depending on considerations such as space and insert/update characteristics, a clustered index on (wOrig) or (wOrig, wTran) might be the best solution.
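The binary search the question asks about is roughly what an index provides. A minimal sketch in Python, with a sorted list standing in for the index and a handful of invented word pairs:

```python
from bisect import bisect_left

# Hypothetical (wOrig, wTran) pairs; keeping them sorted on wOrig plays the role of the index.
rows = sorted([("cat", "gato"), ("dog", "perro"), ("bird", "pajaro"), ("fish", "pez")])
keys = [w_orig for w_orig, _ in rows]


def lookup(word: str) -> list:
    """Return every translation of word, located by binary search instead of a full scan."""
    i = bisect_left(keys, word)
    found = []
    while i < len(keys) and keys[i] == word:
        found.append(rows[i][1])
        i += 1
    return found


print(lookup("dog"))
print(lookup("cow"))
```

Without an index the engine has no sorted structure to search, so it reads every row, which matches the 40-second scan described in the question.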

Indexes, EXPLAIN PLAN, and record access in Oracle SQL

I have been learning about indexes in Oracle SQL, and I wanted to conduct a small experiment with a test table to see how indexes really worked. As I discovered from an earlier post made here, the best way to do this is with EXPLAIN PLAN. However, I am running into something which confuses me.
My sample table contains attributes (EmpID, Fname, Lname, Occupation, .... etc). I populated it with 500,000 records using a java program I wrote (random names, occupations, etc). Now, here are some sample queries with and without indexes:
NO INDEX:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
Now I create index:
CREATE INDEX occupation_idx
ON EMPLOYEE (Occupation);
WITH INDEX "occupation_idx":
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
So... the cost is STILL the same, 1169? Now I try this:
WITH INDEX "occupation_idx":
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
INDEX(RANGE SCAN) TEST.OCCUPATION_IDX ANALYZED 67
So, it appears that the index only is utilized when that column is the only one I'm pulling values from. But I thought that the point of an index was to unlock the entire record using the indexed column as the key? The search above is a pretty pointless one... it searches for values which you already know. The only worthwhile query I can think of which ONLY involves an indexed column's value (and not the rest of the record) would be an aggregate such as COUNT or something.
What am I missing?
Even with your index, Oracle decided to do a full scan for the second query.
Why did it do this? Oracle would have created two plans and come up with a cost for each:-
1) Full scan
2) Index access
Oracle selected the plan with the lower cost. Obviously it came up with the full scan as the lower cost.
If you want to see the cost of the index plan, you can do an explain plan with a hint like this to force the index usage:
SELECT /*+ INDEX(EMPLOYEE occupation_idx) */ Fname
FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
If you do an explain plan on the above, you will see that the cost is greater than the full scan cost. This is why Oracle did not choose to use the index.
A simple way to consider the cost of the index plan is:-
The blevel of the index (how many blocks must be read from top to bottom)
The number of table blocks that must be subsequently read for records matching in the index. This relies on Oracle's estimate of the number of employees that have an occupation of 'DOCTOR'. In your simple example, this would be:
number of rows / number of distinct values
More complicated considerations include the clustering factor and index cost adjustments, which both reflect the likelihood that a block being read is already in memory and hence does not need to be read from disk.
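The simple row estimate above is easy to work through. In the sketch below the 500,000 rows come from the question, while the figure of 50 distinct occupations is purely an assumption for illustration.

```python
def estimated_matching_rows(num_rows: int, num_distinct: int) -> float:
    """Rows expected for `column = value`, assuming values are evenly distributed."""
    return num_rows / num_distinct


num_rows = 500_000  # from the question
num_distinct = 50   # hypothetical number of distinct occupations
estimate = estimated_matching_rows(num_rows, num_distinct)
print(estimate)
```

Under that assumption the index plan would have to fetch about 10,000 rows from the table, and if those rows are scattered across many blocks the full scan can easily come out cheaper.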
Perhaps you could update your question with the results from your query with the index hint and also the results of this query:-
SELECT COUNT(*), COUNT(DISTINCT( Occupation ))
FROM EMPLOYEE;
This will allow people to comment on the cost of the index plan.
I think I see what's happening here.
When you have the index in place, and you do:
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
The execution plan will use the index. This is a no-brainer, cause all the data that's required to satisfy the query is right there in the index, and Oracle never even has to reference the table at all.
However, when you do:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
then, if Oracle uses the index, it will do an INDEX RANGE SCAN followed by a TABLE ACCESS BY ROWID to look up the Fname that corresponds to that Occupation. Now, depending on how many rows have DOCTOR for Occupation, Oracle will have to make one or more trips to the table, to look up the Fname. If, for example, you have a table, and all the employees have Occupation set to 'DOCTOR', the index isn't of much use, and Oracle will simply do a FULL TABLE SCAN of the table. If there are 10,000 employees, and only one is a DOCTOR, then again, it's a no-brainer, and Oracle will use the index.
But there are some subtleties when you're somewhere between those two extremes. People like to talk about 'selectivity', i.e., how many rows are identified by the index vs. the size of the table, when discussing whether the index will be used. But that's not the whole story. What Oracle really cares about is block selectivity. That is, how many blocks does it have to visit to satisfy the query? So, first, how "wide" is the RANGE SCAN? The more limited the range of values specified by the predicate values, the better. Second, when your query needs to do table lookups, how many different blocks will it have to visit to find all the data it needs? That is, how "random" is the data in the table relative to the index order? This is called the CLUSTERING_FACTOR. If you analyze the index to collect statistics, and then look at USER_INDEXES, you'll see that the CLUSTERING_FACTOR is now populated.
So, what's CLUSTERING_FACTOR? CLUSTERING_FACTOR is the "orderedness" of the table, with respect to the index's key column(s). The value of CLUSTERING_FACTOR will always be between the number of blocks in a table and the number of rows in a table. A low CLUSTERING_FACTOR, that is, one that is very near to the number of blocks in the table, indicates a table that's very ordered, relative to the index. A high CLUSTERING_FACTOR, that is, one that is very near to the number of rows in the table, is very unordered, relative to the index.
It's an important concept to understand that the CLUSTERING_FACTOR describes the order of data in the table relative to the index. So, rebuilding an index, for example, will not change the CLUSTERING_FACTOR. It's also important to understand that the same table could have two indexes, and one could have an excellent CLUSTERING_FACTOR, and the other could have an extremely poor CLUSTERING_FACTOR. The table itself can only be ordered in one way.
So, why have I spent so much time describing CLUSTERING_FACTOR? Because when you have an execution plan that does an INDEX RANGE SCAN followed by TABLE ACCESS BY ROWID, you can be sure that the CLUSTERING_FACTOR has been considered by Oracle's optimizer, to come up with the execution plan. For example, suppose you have a 10,000 row table, and suppose 100 of the rows have Occupation = 'DOCTOR'. You write the query above, asking for the Fname of the employees whose occupation is DOCTOR. Well, Oracle can very easily and efficiently determine the rowids of the rows where occupation is DOCTOR. But, how many table blocks will Oracle need to visit, to do the Fname lookup? It could be only 1 or 2 table blocks, if the data is clustered (ordered) by Occupation in the table. But, it could be as many as 100, if the data is very unordered in the table! So, again, 10,000 row table, and, let's assume, (for the purposes of illustration and simple math) that the table has 100 rows/block, and so, 100 blocks. Depending on table order (i.e. CLUSTERING_FACTOR), the number of table block visits could be as few as 1, or as many as 100.
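The 10,000-row example can be checked with a small simulation. This is only a sketch of the idea, not Oracle's implementation: the table is modelled as a Python list of occupation codes, with code 0 standing in for 'DOCTOR', and the clustering factor is computed by walking the index in key order and counting how often the block changes.

```python
ROWS_PER_BLOCK = 100
NUM_ROWS = 10_000


def block_of(row_position: int) -> int:
    """Block number that holds the row at this position in the table."""
    return row_position // ROWS_PER_BLOCK


def clustering_factor(table) -> int:
    """Walk the index in (key, position) order and count block changes."""
    index = sorted((key, pos) for pos, key in enumerate(table))
    factor, last_block = 0, None
    for _, pos in index:
        if block_of(pos) != last_block:
            factor += 1
            last_block = block_of(pos)
    return factor


def blocks_visited(table, key) -> int:
    """Distinct table blocks that must be read to fetch every row with this key."""
    return len({block_of(pos) for pos, k in enumerate(table) if k == key})


# 100 occupation codes with 100 rows each; code 0 stands for 'DOCTOR'.
ordered = [pos // 100 for pos in range(NUM_ROWS)]    # table stored in occupation order
scattered = [pos % 100 for pos in range(NUM_ROWS)]   # one doctor in every block

print(clustering_factor(ordered), blocks_visited(ordered, 0))
print(clustering_factor(scattered), blocks_visited(scattered, 0))
```

The ordered table gives a clustering factor equal to the number of blocks and a single block visit for the doctors; the scattered table gives a clustering factor equal to the number of rows and 100 block visits for the same 100 rows.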
So, I hope this helps you understand why the optimizer may be reluctant to use an index in some cases.
An index is a copy of part of the table, which stores only the following data:
Indexed field(s)
A pointer to the original row (rowid).
Say you have a table like this:
rowid id name occupation
[1] 1 John clerk
[2] 2 Jim manager
[3] 3 Jane boss
Then an index on occupation would look like this:
occupation rowid
boss [3]
clerk [1]
manager [2]
with the records sorted on occupation in a B-Tree.
As you can see, if you only select the indexed fields, you only need the index (the second table).
If you select anything other than occupation:
SELECT *
FROM mytable
WHERE occupation = 'clerk'
then the engine has to do two things: first, find the relevant records in the index; second, find the records in the original table by rowid. It's as if you joined the two tables on rowid.
Since the rowids in the index are not in order, the reads to the original table are not sequential and can be slow. It may be faster to read the original table in sequential order and just filter the records with occupation = 'clerk'.
The engine does not "unlock" the records: it just finds the rowid in the index, and if the index itself does not hold enough data, it looks up the rest in the original table by the rowid found.
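The two-step lookup can be sketched with the three-row table from this answer. This is an illustration in Python, not how Oracle stores anything: a dict keyed by rowid stands in for the table and a sorted list of (occupation, rowid) pairs stands in for the B-Tree.

```python
from bisect import bisect_left

# The table, addressed by rowid.
table = {
    1: {"id": 1, "name": "John", "occupation": "clerk"},
    2: {"id": 2, "name": "Jim", "occupation": "manager"},
    3: {"id": 3, "name": "Jane", "occupation": "boss"},
}

# The index: only the indexed field plus the rowid, sorted on occupation.
index = sorted((row["occupation"], rowid) for rowid, row in table.items())


def rowids_for(occupation: str) -> list:
    """Step 1: range scan of the index for one occupation."""
    i = bisect_left(index, (occupation,))
    rowids = []
    while i < len(index) and index[i][0] == occupation:
        rowids.append(index[i][1])
        i += 1
    return rowids


def select_names(occupation: str) -> list:
    """Step 2: visit the table once per rowid to fetch a non-indexed column."""
    return [table[rowid]["name"] for rowid in rowids_for(occupation)]


print(index)
print(select_names("clerk"))
```

A query that selects only occupation can stop after step 1, which is why it is so much cheaper.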
As a WAG. Analyze the table, and the index, then see if the plan changes.
When you are selecting just the occupation, the entire query can be satisfied from the index. The index literally has a copy of the occupation. The moment you add an additional column to the select, Oracle has to go to the data record, to get it. The optimizer chooses to read all of the data rows instead of all of the index rows, and the data rows. It's cheaper.

Oracle: Full text search with condition

I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But let's say we have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and the following two columns among others, A and B, both numeric. Let's say there are 500 different values of A and 2000 different values of B, and each row is unique.
Now lets consider select ... where A = x and B = y
As far as I can tell, separate indexes on A and B would do an index search on B, which returns 500 different rows, and then do a join/scan on those rows. In any case, at least 500 rows have to be looked at (aside from the database being lucky and finding the required row early).
Whereas an index on (A,B) is much more effective, it finds the one row in one index search.
Putting separate indexes on group_id and the text, I feel, leaves the query generator with only three options:
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).
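The A/B scenario from the edit can be sketched at a smaller scale. The Python below uses dicts as stand-in indexes, with 50 values of A and 200 values of B (10,000 unique rows) instead of 500 and 2000, so the figures are a tenth of those in the question. It counts rows touched, which is only one ingredient of real query cost.

```python
from collections import defaultdict
from itertools import product

A_VALUES, B_VALUES = range(50), range(200)
rows = list(product(A_VALUES, B_VALUES))  # 10,000 unique (A, B) rows

# Single-column index on B: each B value maps to every row that has it.
index_b = defaultdict(list)
for a, b in rows:
    index_b[b].append((a, b))

# Composite index on (A, B): each key identifies exactly one row.
index_ab = {(a, b): (a, b) for a, b in rows}

x, y = 7, 42
candidates = index_b[y]                          # 50 rows come back from the B index
via_b = [row for row in candidates if row[0] == x]  # filter them on A afterwards
via_ab = index_ab[(x, y)]                        # a single probe

print(len(candidates), via_b, via_ab)
```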
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similar to a multi-column index. This seems to be the option (4) you're looking for:
begin
ctx_ddl.create_index_set('my_table_index_set');
ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes, as @NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name Null Type
-------- -------- ---------
ID NOT NULL NUMBER(4)
GROUP_ID NUMBER(4)
TEXT CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID COUNT PCT
---------------------- ---------------------- ----------------------
1 1 1
2 2 1
3 4 3
4 8 6
5 16 12
6 32 24
7 64 47
8 9 7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
select T.pkcol
from T
where group_id = 43
)
If group_id has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select * from T
inner join
(
select T.pkcol
from T
where group_id = 43
) MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows.

SQL Index Performance

We have a table called table1 ...
(c1 int identity,c2 datetime not null,c3 varchar(50) not null,
c4 varchar(50) not null,c5 int not null,c6 int ,c7 int)
on column c1 is primary key(Clustered Index)
on column c2 is index_2(Nonclustered)
on column c3 is index_2(Nonclustered)
on column c4 is index_2(Nonclustered)
on column c5 is index_2(Nonclustered)
It contains 10 million records. We have several procedures pointing to "table1" with different search criteria:
select from table1 where c1=blah..and c2= blah.. and c3=blah..
select from table1 where c2=blah..and c3= blah.. and c4=blah..
select from table1 where c1=blah..and c3= blah.. and c5=blah..
select from table1 where c1=blah..
select from table1 where c2=blah..
select from table1 where c3=blah..
select from table1 where c4=blah..
select from table1 where c5=blah..
What is the best way to create non-clustered index apart from above, or modify existing indexes to get good index performance and reduce the execution time?
And now to actually respond...
The trick here is that you have single-column lookups on any number of columns, as well as composite column lookups. You need to understand with what frequency the queries above are executing - for those that are run very seldom, you should probably exclude them from your indexing considerations.
You may be better off creating single NCIX's on each of the columns being queried. This would likely be the case if the number of rows being returned is very small, as the NCIX's would be able to handle the "single lookup" queries as well as the composite lookups. Alternatively, you could create single-column NCIX's in addition to covering composite indexes - again, the deciding factor being the frequency of execution and the number of results being returned.
This is somewhat tough to answer with just the information you have provided. There are other factors you need to weigh out.
For example:
How often is the table updated and what columns are updated frequently?
You'll be paying a cost on these updates due to index maintenance.
What is the cardinality of your different columns?
What queries are you executing most often and what columns appear in the where clause of those queries?
You need to first figure out what your parameters for acceptable performance are for each of your queries and work from there, taking into account the things I have mentioned above.
With 10 million rows, partitioning your table could make a lot of sense here.
Have you thought about using the Full Text Search component of MSSQL? It might offer the performance that you are looking for.