Would Creating an Index speed up a query in SAS - sql

I have never created an index before, but I'm thinking it may help here. I have a SAS dataset of approximately 7 million records. It is a listing of employee entries along with their respective timestamps. For each entry, I am identifying whether there is a subsequent entry by the same user on the same day and, if so, noting its timestamp.
The data set (Entries) is 3 columns: Storage_ID, User_ID and EventTimestamp.
I'm thinking an index on Storage_ID and User_ID might help speed things along.
If it would help, how and where would I go about creating the index?
PROC SQL;
CREATE TABLE sub_ENTRIES AS
SELECT A.*,
(SELECT
MIN(B.EVENTTIMESTAMP)
FROM
ENTRIES B
WHERE
A.STORAGE_ID=B.STORAGE_ID
AND A.USER_ID=B.USER_ID
AND DATEPART(A.EVENTTIMESTAMP)=DATEPART(B.EVENTTIMESTAMP)
AND B.EVENTTIMESTAMP > A.EVENTTIMESTAMP
) AS NEXT_ACCESS FORMAT=DATETIME27.6
FROM
ENTRIES A
;

You can create a composite index (two or more columns) using SQL.
For example:
proc sql;
create index STORAGE_USER on ENTRIES (storage_id, user_id);
quit;
The general syntax for an index key of n columns is:
create index <index-name>
on <table-name>
( <column-name-1>,
<column-name-2>,
…
<column-name-<n>>
)
The index is most effective when the query's selection or join criteria involve all the columns of the composite key. Use OPTIONS MSGLEVEL=I to have SAS write index usage to the log.
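As a rough illustration of checking whether a composite index is actually picked up, the sketch below uses Python's built-in sqlite3 module instead of SAS (an assumption made purely so the example is self-contained). SQLite's EXPLAIN QUERY PLAN stands in for what MSGLEVEL=I reports in the SAS log; the sample rows and the event_ts column name are made up.

```python
import sqlite3

# Hypothetical stand-in for the ENTRIES dataset (SQLite, not SAS).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (storage_id INTEGER, user_id INTEGER, event_ts TEXT)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (s, u, f"2024-01-01 0{h}:00:00")
        for s in range(3)
        for u in range(3)
        for h in range(3)
    ],
)

# Composite index on the two equality columns of the lookup.
conn.execute("CREATE INDEX storage_user ON entries (storage_id, user_id)")

# Ask the engine how it would run a lookup that filters on both key columns.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT event_ts FROM entries WHERE storage_id = ? AND user_id = ?",
    (1, 2),
).fetchall()

# The last field of each plan row is the human-readable description.
plan_text = " | ".join(row[-1] for row in plan_rows)
print(plan_text)
```

If the plan text names the index, the lookup is using it; a plan that only mentions a scan of the table means the index was ignored for that query.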

Related

Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows, it checks whether the query_time value already exists and, if it does, skips the insert. query_time is a kind of session identifier that is the same for every row in sn_users_main.
This works fine, but now that sn_users_history is reaching 50k rows, the query takes more than 2 minutes to run, which is too long. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) --Dont insert items into history table if they already exist
I think you are missing an extra condition on user_id when inserting into the history table. You have to check the combination of user_id and query_time.
As for your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support holding this kind of historical data. Read about SQL Server Temporal Tables.
If you still want to continue with this approach, I would suggest doing it in batches:
Create a configuration table to hold the last processed query_time:
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig)
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE query_time > @lastProcessedQueryTime
Now you can update the configuration again:
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would suggest creating a clustered index on (user_id, query_time) if possible, or otherwise a non-clustered index, which will improve performance.
Another approach you can consider:
Create a clustered index on (user_id, query_time) in the history table, have the same clustered index on the main table, and perform a MERGE operation.
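The point about checking the (user_id, query_time) combination and backing it with an index can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than SQL Server (an assumption made so it runs anywhere), with the column list cut down to three columns and a hypothetical index name ix_history_key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sn_users_main    (query_time TEXT, user_id INTEGER, sn_name TEXT);
    CREATE TABLE sn_users_history (query_time TEXT, user_id INTEGER, sn_name TEXT);
    -- Composite index (hypothetical name) so the NOT EXISTS probe is a lookup.
    CREATE UNIQUE INDEX ix_history_key ON sn_users_history (query_time, user_id);
    INSERT INTO sn_users_main VALUES
        ('2024-01-01 10:00', 1, 'ann'),
        ('2024-01-01 10:00', 2, 'bob');
""")

# Copy only rows whose (query_time, user_id) pair is not yet in the history.
COPY_NEW_ROWS = """
    INSERT INTO sn_users_history (query_time, user_id, sn_name)
    SELECT snm.query_time, snm.user_id, snm.sn_name
    FROM sn_users_main snm
    WHERE NOT EXISTS (
        SELECT 1
        FROM sn_users_history snh
        WHERE snh.query_time = snm.query_time
          AND snh.user_id    = snm.user_id
    )
"""

def history_count():
    """Return the current number of rows in the history table."""
    return conn.execute("SELECT COUNT(*) FROM sn_users_history").fetchone()[0]

conn.execute(COPY_NEW_ROWS)
after_first = history_count()

# Running the same statement again finds every pair already present.
conn.execute(COPY_NEW_ROWS)
after_second = history_count()

print(after_first, after_second)
```

Because the existence check and the index use the same two columns, rerunning the statement is safe, and the probe cost should not grow with the size of the history table the way an unindexed check does.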

Optimize SAS proc sql

I am joining 2 tables and creating a mini Cartesian join between them so that all businesses within a city and state are matched up, then I am using some fuzzy logic to try and match business name and street name. There are ~3 million records on the input table and ~25 million records on the output table, so it is taking an extremely long time to run. I have created indexes on all the columns being joined and all columns being used in the where statement.
My next thought was to replace the city/state names with integers, but I'd be adding processing time to create those tables. Does anyone have any other thoughts on decreasing the processing time?
proc sql;
create index output_stname on tbl._output (output_stname);
create index output_namevar on tbl._output (output_namevar);
create index key on tbl._output (key);
create index city on tbl._output (city);
create index state on tbl._output (state);
create index input_stname on tbl._input (input_stname);
create index input_namevar on tbl._input (input_namevar);
create index key_input on tbl._input (key_input);
create index city_input on tbl._input (city_input);
create index state_input on tbl._input (state_input);
;
quit;
proc sql;
create table tbl._level2 as
select distinct
key_input,
name_input,
address_input,
city_input,
state_input,
zip_input,
key,
business_nm1,
address,
city,
state,
zip,
'2 - Street Name & Business Name Match' as matchtype
from tbl._input a
left join tbl._output b on a.city_input=b.city and a.state_input=b.state
where
compged(a.input_stname,b.output_stname) <= 50 and
compged(input_namevar,output_namevar) <= 50
and case
when length(strip(a.input_namevar)) <= 2 then 1
when length(strip(b.output_namevar)) <= 2 then 1
else 0
end = 0
;
quit;
I would start with a composite index on the output table:
proc sql;
/* In SAS, a composite index name cannot match a column name, so it is not called output_stname here. */
create index state_city_names on tbl._output (state, city, output_stname, output_namevar);
quit;
This should speed up the joins. However, the select distinct is still suspicious. It is generally better not to need select distinct.
I would suggest not processing this with SQL. The SQL optimizer can't optimize this very well because of the COMPGED calls and the CASE expression: it doesn't know how often those conditions will be true, and COMPGED is very expensive. As such, you're going to get a very slow process in any event.
Most likely, a hash solution is best. It's hard to say without looking at the data (how many city/state pairs are there, for example - are there a huge number of unique ones, or a relatively small number?). But a hash solution will likely be faster, particularly as it avoids the index creation step, assuming you can fit the output table into the hash (or, alternately, fit the input table into the hash) in memory.
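The hash idea can be sketched outside SAS as follows: load one table into an in-memory hash keyed by (city, state), then run the expensive fuzzy comparison only within the matching bucket. This is a rough Python illustration, not a SAS hash-object implementation. The sample rows, the simplified field names and the 0.8 threshold are made up, and difflib's similarity ratio is only a stand-in for COMPGED, which scores differently.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Made-up sample rows; field names are simplified stand-ins for the real columns.
input_rows = [
    {"key_input": 1, "city": "AUSTIN", "state": "TX", "name": "JOES PIZZA", "street": "MAIN"},
    {"key_input": 2, "city": "DALLAS", "state": "TX", "name": "ACME TOOLS", "street": "ELM"},
]
output_rows = [
    {"key": 10, "city": "AUSTIN", "state": "TX", "name": "JOE'S PIZZA", "street": "MAIN"},
    {"key": 11, "city": "AUSTIN", "state": "TX", "name": "BLUE BAKERY", "street": "OAK"},
    {"key": 12, "city": "DALLAS", "state": "TX", "name": "ACME TOOL", "street": "ELM"},
]

def similar(a, b, threshold=0.8):
    """Stand-in for a COMPGED cutoff: True when the strings are close enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Build the hash once: (city, state) -> list of output rows in that city.
buckets = defaultdict(list)
for row in output_rows:
    buckets[(row["city"], row["state"])].append(row)

matches = []
for inp in input_rows:
    if len(inp["name"].strip()) <= 2:  # mirror the CASE filter on short names
        continue
    # Only rows in the same city/state bucket are ever compared.
    for out in buckets.get((inp["city"], inp["state"]), []):
        if len(out["name"].strip()) <= 2:
            continue
        # Cheap-to-fail street check first, then the name check.
        if similar(inp["street"], out["street"]) and similar(inp["name"], out["name"]):
            matches.append((inp["key_input"], out["key"]))

print(matches)
```

Whether this wins in practice depends on how many rows share each city/state pair, since the fuzzy comparisons inside a bucket are still quadratic.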

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially smaller than the number of rows the index covers. In my case, about 1 in 30 days is distinct.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT day FROM data ORDER BY 1 LIMIT 1
)
UNION ALL
SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
FROM cte c
WHERE c.day IS NOT NULL -- exit condition
)
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index
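To see the recursive "skip to the next distinct value" idea run end to end, here is a small sketch using Python's built-in sqlite3 module instead of PostgreSQL (an assumption made for portability). SQLite does not accept the parenthesized ORDER BY ... LIMIT 1 branches, so this variant uses MIN(day) in their place; the sample data and the index name data_day_idx are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, day TEXT, val NUMERIC)")

# Three distinct days, 30 rows each, to mimic "about 1 in 30 is distinct".
conn.executemany(
    "INSERT INTO data (day, val) VALUES (?, ?)",
    [(f"2024-01-{d:02d}", v) for d in (1, 2, 5) for v in range(30)],
)
conn.execute("CREATE INDEX data_day_idx ON data (day)")

# Each recursion step jumps straight to the smallest day above the current one.
LOOSE_SCAN = """
    WITH RECURSIVE cte(day) AS (
        SELECT MIN(day) FROM data
        UNION ALL
        SELECT (SELECT MIN(day) FROM data WHERE day > c.day)
        FROM cte c
        WHERE c.day IS NOT NULL          -- exit condition
    )
    SELECT day FROM cte WHERE day IS NOT NULL
"""
distinct_days = sorted(row[0] for row in conn.execute(LOOSE_SCAN))

# Reference result from a plain DISTINCT over all rows.
reference = sorted(row[0] for row in conn.execute("SELECT DISTINCT day FROM data"))

print(distinct_days)
```

The recursion performs one index probe per distinct day rather than visiting every row, which is where the saving comes from when distinct values are rare.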

Selecting the most optimal query

I have a table in an Oracle database, called my_table for this example. It is a kind of log table. It has an incrementing column named "id" and a column "registration_number" that is unique per registered user. Now I want to get the latest change for each registered user, so I wrote the queries below to accomplish this:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is which of the above queries is recommended, and why? The second is: sometimes there are about 70,000 inserts into this table at once, but mostly the number of inserted rows varies between 0 and 2,000. Is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
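As a quick sanity check that the two versions from the question and the analytic version select the same rows, the sketch below runs all three against a tiny made-up table. It uses Python's built-in sqlite3 module rather than Oracle and assumes an SQLite build with window-function support (3.25 or later). It says nothing about relative speed, which still has to come from execution plans on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, registration_number INTEGER)"
)
# Made-up log rows: users 100 and 200 each have two entries, user 300 has one.
conn.executemany(
    "INSERT INTO my_table (id, registration_number) VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 100), (4, 300), (5, 200)],
)

QUERIES = {
    # First version from the question: correlated MAX subquery.
    "correlated": """
        SELECT t.id FROM my_table t
        WHERE t.id = (SELECT MAX(id) FROM my_table t_m
                      WHERE t_m.registration_number = t.registration_number)
    """,
    # Second version from the question: join against grouped maxima.
    "group_join": """
        SELECT t.id FROM my_table t
        INNER JOIN (SELECT MAX(id) AS m_id FROM my_table
                    GROUP BY registration_number) t_m
            ON t.id = t_m.m_id
    """,
    # Analytic version: rank rows per user, newest first, keep rank 1.
    "row_number": """
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY registration_number ORDER BY id DESC
                   ) AS rn
            FROM my_table
        ) WHERE rn = 1
    """,
}

results = {
    name: sorted(row[0] for row in conn.execute(sql))
    for name, sql in QUERIES.items()
}
print(results)
```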
To find the faster query, check the execution plan and cost of each; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions often make this kind of query run much faster.
If you expect this table to grow very large, I would suggest partitioning it and using local indexes. They can help your queries run faster.
When you insert lots of rows, indexes slow the insertion down because the index also has to be updated with each insert (I would not recommend an index on ID). There are two solutions I can think of for this:
You can drop the index before the insertion and recreate it afterwards.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can affect your queries a bit, so there is a trade-off.
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last action for every user with good performance:
select *
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
Rowid is a physical storage identifier that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backup, etc. If you need that to work, it's simple to add the id column from the my_log table to last_user_action as well, and use one or the other depending on requirements.

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous. The clustered key already points to the leaf data. I recommend you modify your indexes as follows:
CREATE TABLE Workflow
(
Id int IDENTITY,
ReadTime datetime,
-- ... other columns,
CONSTRAINT PK_WorkFlow
PRIMARY KEY CLUSTERED
(
Id
)
)
CREATE INDEX idx_LastModifiedTime
ON WorkFlow
(
LastModifiedTime
)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, the optimizer may conclude that the criterion ID > 1000 on a unique column is not selective, because more than 99.997% of the Ids are > 1000 (if your identity seed started at 1). For an index seek to be considered helpful, the optimizer generally needs to estimate that only a small fraction of the rows would be selected (under about 5% is a common rule of thumb). You can use an index hint to force the issue (as already stated by Dan Andrews). What is the structure of the non-clustered index that was scanned?
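The 99.997% figure follows from simple arithmetic, sketched here in Python under the assumption that the identity values run from 1 to 38,000,000 with no gaps (the real table may well have gaps).

```python
# Assumption: ids are 1..total_rows with no gaps.
total_rows = 38_000_000
threshold_id = 1_000

# Rows satisfying ID > 1000 are ids 1001..total_rows.
matching_rows = total_rows - threshold_id
matching_fraction = matching_rows / total_rows

print(f"{matching_fraction:.5%} of rows satisfy ID > {threshold_id}")
```

A predicate that keeps almost every row gives the optimizer little reason to prefer a seek over a scan on selectivity grounds alone.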