Equivalent of a composite index across multiple tables? - sql

I have a table structure similar to the following:
create table MAIL (
ID int,
"FROM" varchar(200),
SENT_DATE date
);
create table MAIL_TO (
ID int,
MAIL_ID int,
NAME varchar(200)
);
and I need to run the following query:
select m.ID
from MAIL m
inner join MAIL_TO t on t.MAIL_ID = m.ID
where m.SENT_DATE between '07/01/2010' and '07/30/2010'
and t.NAME = 'someone@example.com'
Is there any way to design indexes such that both of the conditions can use an index? If I put an index on MAIL.SENT_DATE and an index on MAIL_TO.NAME, the database will choose to use either one of the indexes or the other, not both. After filtering by the first condition the database always has to do a full scan of the results for the second condition.

Oracle can use both indices. You just don't have the right two indices.
Consider: if the query plan uses your index on mail.sent_date first, what does it get from mail? It gets all the mail.ids where mail.sent_date is within the range you gave in your where clause, yes?
So it goes to mail_to with a list of mail.ids and the name you gave in your where clause. At this point, Oracle decides that it's better to scan the table for matching mail_to.mail_ids than to use the index on mail_to.name.
Indices on varchars are always problematic, and Oracle really prefers full table scans. But if we give Oracle an index containing the columns it really wants to use, and depending on total table rows and statistics, we can get it to use it. This is the index:
create index mail_to_pid_name on mail_to( mail_id, name ) ;
This works where an index just on name doesn't, because Oracle's not looking just for a name, but for a mail_id and a name.
Conversely, if the cost-based analyzer determines it's cheaper to go to table mail_to first, and uses your index on mail_to.name, what does it get? A bunch of mail_to.mail_ids to look up in mail. It needs to find rows with those ids and certain sent_dates, so:
create index mail_id_sentdate on mail( sent_date, id ) ;
Note that in this case I've put sent_date first in the index and id second, since the query filters on a range of sent_dates. (This is more an intuitive thing.)
Again, the take home point is this: in creating indices, you have to consider not just the columns in your where clause, but also the columns in your join conditions.
Update
jthg: yes, it always depends on how the data is distributed, and on how many rows are in the table: if very many, Oracle will do table scans and a hash join; if very few, a table scan is cheaper than using an index anyway. You might reverse the column order of either of the two indices. By putting sent_date first in the second index, we eliminate most needs for an index solely on sent_date.

A materialized view would allow you to index the values across both tables, assuming the stringent materialized view criteria are met.
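For example, in Oracle a fast-refresh join materialized view over the two tables could carry a single composite index covering both predicates. This is only a sketch under the question's schema; the MV, log, and index names are made up, and fast refresh on commit imposes restrictions (materialized view logs with rowids, and the rowids of all base tables in the select list):
create materialized view log on MAIL with rowid;
create materialized view log on MAIL_TO with rowid;

create materialized view MAIL_SEARCH_MV
refresh fast on commit
as
select m.rowid m_rid, -- base table rowids are required for fast refresh
t.rowid t_rid,
m.ID mail_id,
m.SENT_DATE sent_date,
t.NAME name
from MAIL m, MAIL_TO t
where t.MAIL_ID = m.ID;

-- one composite index now covers both predicates:
create index IX_MAIL_SEARCH_MV on MAIL_SEARCH_MV (name, sent_date);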

Which criterion is more selective? The date range or the addressee? I would guess the addressee. If that is highly selective, don't bother with the date index; just let the database find the matching mails by the found mail ids. But do index table MAIL on ID if it isn't already.
On the other hand, some modern optimizers will even make use of both indexes, scanning both tables and then building hash values of the join columns to merge the results of both. I am not absolutely sure if and when Oracle would choose this strategy. I have noticed that SQL Server tends to use hash joins rather often compared to other engines.

In situations where the requirements aren't met for a materialized view, there are these two options:
1) You can create a cross-reference table and keep it updated with triggers.
The concepts would be the same with Oracle, but I only have SQL Server installed at the moment to run the test. See this setup:
create table MAIL (
ID INT IDENTITY(1,1),
[FROM] VARCHAR(200),
SENT_DATE DATE,
CONSTRAINT PK_MAIL PRIMARY KEY (ID)
);
create table MAIL_TO (
ID INT IDENTITY(1,1),
MAIL_ID INT,
[NAME] VARCHAR (200),
CONSTRAINT PK_MAIL_TO PRIMARY KEY (ID)
);
ALTER TABLE [dbo].[MAIL_TO] WITH CHECK ADD CONSTRAINT [FK_MAILTO_MAIL] FOREIGN KEY([MAIL_ID])
REFERENCES [dbo].[MAIL] ([ID])
GO
ALTER TABLE [dbo].[MAIL_TO] CHECK CONSTRAINT [FK_MAILTO_MAIL]
GO
CREATE TABLE CompositeIndex_MailSentDate_MailToName (
[MAIL_ID] INT,
[MAILTO_ID] INT,
SENT_DATE DATE,
MAILTO_NAME VARCHAR(200),
CONSTRAINT PK_CompositeIndex_MailSentDate_MailToName PRIMARY KEY (MAILTO_ID,MAIL_ID)
)
GO
CREATE NONCLUSTERED INDEX IX_MailSent_MailTo ON dbo.CompositeIndex_MailSentDate_MailToName (SENT_DATE,MAILTO_NAME)
CREATE NONCLUSTERED INDEX IX_MailTo_MailSent ON dbo.CompositeIndex_MailSentDate_MailToName (MAILTO_NAME,SENT_DATE)
GO
CREATE TRIGGER dbo.trg_MAILTO_Insert
ON dbo.MAIL_TO
AFTER INSERT AS
BEGIN
INSERT INTO dbo.CompositeIndex_MailSentDate_MailToName ( MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME )
SELECT mailTo.MAIL_ID,mailTo.ID,m.SENT_DATE,mailTo.NAME
FROM
inserted mailTo
INNER JOIN dbo.MAIL m ON m.ID = mailTo.MAIL_ID
END
GO
CREATE TRIGGER dbo.trg_MAILTO_Delete
ON dbo.MAIL_TO
AFTER DELETE AS
BEGIN
-- remove the cross-reference rows for the deleted MAIL_TO rows
DELETE compositeIndex
FROM
dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
INNER JOIN deleted ON compositeIndex.MAILTO_ID = deleted.ID
END
GO
CREATE TRIGGER dbo.trg_MAILTO_Update
ON dbo.MAIL_TO
AFTER UPDATE AS
BEGIN
UPDATE compositeIndex
SET
compositeIndex.MAILTO_NAME = updates.NAME
FROM
dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
INNER JOIN inserted updates ON updates.ID = compositeIndex.MAILTO_ID
END
GO
CREATE TRIGGER dbo.trg_MAIL_Update
ON dbo.MAIL
AFTER UPDATE AS
BEGIN
UPDATE compositeIndex
SET
compositeIndex.SENT_DATE = updates.SENT_DATE
FROM
dbo.CompositeIndex_MailSentDate_MailToName compositeIndex
INNER JOIN inserted updates ON updates.ID = compositeIndex.MAIL_ID
END
GO
INSERT INTO dbo.MAIL ( [FROM], SENT_DATE )
SELECT 'SenderA','2018-10-01'
UNION ALL SELECT 'SenderA','2018-10-02'
INSERT INTO dbo.MAIL_TO ( MAIL_ID, NAME )
SELECT 1,'CustomerA'
UNION ALL SELECT 1,'CustomerB'
UNION ALL SELECT 2,'CustomerC'
UNION ALL SELECT 2,'CustomerD'
UNION ALL SELECT 2,'CustomerE'
SELECT * FROM dbo.MAIL
SELECT * FROM dbo.MAIL_TO
SELECT * FROM dbo.CompositeIndex_MailSentDate_MailToName
You can then use the dbo.CompositeIndex_MailSentDate_MailToName table to JOIN to the rest of your data. This is useful in environments where your rate of inserts and updates is low but your query needs are high, so the relative overhead of implementing the triggers is small.
This has the advantage of being updated transactionally, in real time.
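For illustration, the original query can then be answered from the cross-reference table alone, and either of the two composite indexes can serve it (a sketch, with literals taken from the sample data above):
SELECT c.MAIL_ID
FROM dbo.CompositeIndex_MailSentDate_MailToName c
WHERE c.SENT_DATE BETWEEN '2018-10-01' AND '2018-10-31'
AND c.MAILTO_NAME = 'CustomerC'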
2) If you don't want the performance/management overhead of a trigger, and you only need this for next-day reporting, you can create a view and a nightly process which truncates the table and selects the entire view into a materialized table.
I've used this successfully to index flattened relational data requiring joins across a dozen or so tables, reducing report times from hours to seconds. While it's an expensive query, you can set the job to run off-hours if you have periods of reduced usage.
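A minimal sketch of such a nightly refresh, assuming a flattening view dbo.vw_MailSearch and a target table dbo.MailSearchFlat (both names hypothetical):
TRUNCATE TABLE dbo.MailSearchFlat;

INSERT INTO dbo.MailSearchFlat (MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME)
SELECT MAIL_ID, MAILTO_ID, SENT_DATE, MAILTO_NAME
FROM dbo.vw_MailSearch;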

If your queries are generally for a particular month, then you could partition the data by month.
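In Oracle, for instance, monthly interval partitioning on SENT_DATE might look roughly like this (a sketch only; the column lengths and the initial partition bound are assumptions). Queries restricted to one month then touch only that month's partition:
create table MAIL (
ID number,
"FROM" varchar2(200),
SENT_DATE date
)
partition by range (SENT_DATE)
interval (numtoyminterval(1, 'MONTH'))
(partition p_initial values less than (date '2010-01-01'));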

Related

Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows, it checks whether the column query_time already exists, and if it does, it doesn't insert. query_time is a kind of session identifier that is the same for every row in sn_users_main.
This works fine, but since sn_users_history is reaching 50k rows, running this query takes more than 2 minutes, which is too much. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) -- Don't insert items into the history table if they already exist
I think you are missing an extra condition on user_id when inserting into the history table: you have to check the combination of user_id and query_time.
For your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support this kind of historical data. Read about SQL Server Temporal Tables.
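A minimal sketch of a system-versioned temporal table (SQL Server 2016+); the column list is abbreviated, and the history table name here is an assumption:
CREATE TABLE sn_users_main
(
user_id INT NOT NULL PRIMARY KEY,
sn_name VARCHAR(200),
ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
ValidTo DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.sn_users_history_sv));
SQL Server then maintains the history table automatically on every UPDATE and DELETE, so no manual INSERT ... WHERE NOT EXISTS is needed.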
If you still want to continue with this approach, I would suggest doing it in batches:
Create a configuration table to hold the last processed query_time:
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig)
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE query_time > @lastProcessedQueryTime
Now you can update the configuration again:
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history'
I would also suggest creating a clustered index on (user_id, query_time) if possible (otherwise a non-clustered index), which will improve the performance.
Other approaches you can think of (see the sketch below):
Create a clustered index on (user_id, query_time) in the historical table, have (user_id, query_time) as the clustered index on the main table as well, and perform a MERGE operation.
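A hedged sketch of that MERGE, with the column list abbreviated to three columns to keep it short:
MERGE sn_users_history AS tgt
USING sn_users_main AS src
ON tgt.user_id = src.user_id
AND tgt.query_time = src.query_time
WHEN NOT MATCHED BY TARGET THEN
INSERT (query_time, user_id, sn_name)
VALUES (src.query_time, src.user_id, src.sn_name);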

Optimization - sql: How to show all data that exists in multiple tables

I have two tables. I want to find all the rows in table One that exist in table Two, and vice versa. I have an answer, but I want it faster.
Example:
Create table One (ID INT, Value INT, location VARCHAR(10))
Create table Two (ID INT, Value INT, location VARCHAR(10))
INSERT INTO One VALUES(1,2,'Hanoi')
INSERT INTO One VALUES(2,1,'Hanoi')
INSERT INTO One VALUES(1,4,'Hanoi')
INSERT INTO One VALUES(3,5,'Hanoi')
INSERT INTO Two VALUES(1,5,'Saigon')
INSERT INTO Two VALUES(4,6,'Saigon')
INSERT INTO Two VALUES(5,7,'Saigon')
INSERT INTO Two VALUES(2,8,'Saigon')
INSERT INTO Two VALUES(2,8,'Saigon')
And answers:
SELECT * FROM One WHERE ID IN (SELECT ID FROM Two)
UNION ALL
SELECT * FROM Two WHERE ID IN (SELECT ID FROM One)
With this query, the system scans the tables four times.
I want the system to scan each table only once (table One once, table Two once).
Am I crazy?
You can try something like:
-- CREATE TABLES
IF OBJECT_ID ( 'tempdb..#One' ) IS NOT NULL
DROP TABLE #One;
IF OBJECT_ID ( 'tempdb..#Two' ) IS NOT NULL
DROP TABLE #Two;
CREATE TABLE #One (ID INT, Value INT, location VARCHAR(10))
CREATE TABLE #Two (ID INT, Value INT, location VARCHAR(10))
-- INSERT DATA
INSERT INTO #One VALUES(1,2,'Hanoi')
INSERT INTO #One VALUES(2,1,'Hanoi')
INSERT INTO #One VALUES(1,4,'Hanoi')
INSERT INTO #One VALUES(3,5,'Hanoi')
INSERT INTO #Two VALUES(1,5,'Saigon')
INSERT INTO #Two VALUES(4,6,'Saigon')
INSERT INTO #Two VALUES(5,7,'Saigon')
INSERT INTO #Two VALUES(2,8,'Saigon')
INSERT INTO #Two VALUES(2,8,'Saigon')
-- CREATE INDEX
CREATE NONCLUSTERED INDEX IX_One ON #One (ID) INCLUDE (Value, location)
CREATE NONCLUSTERED INDEX IX_Two ON #Two (ID) INCLUDE (Value, location)
-- SELECT DATA
SELECT o.ID
,o.Value
,o.location
FROM #One o
WHERE EXISTS (SELECT 1 FROM #Two t WHERE o.ID = t.ID)
UNION ALL
SELECT t.ID
,t.Value
,t.location
FROM #Two t
WHERE EXISTS (SELECT 1 FROM #One o WHERE t.ID = o.ID)
but it depends on how "big" your data is. If the data is really big (millions of rows) and you are running the Enterprise edition of SQL Server, you may consider using columnstore indexes.
The reason you're scanning the tables twice is that you are reading from table X and looking up the corresponding value from table Y. Once that's finished, you do the same but starting from table Y and looking for matches in table X. After that, both results are combined and returned to the caller.
In a way, that's not a bad thing, although if tables are 'wide' and contain a lot of columns you don't need, then you are doing a lot of IO for no good reason. Additionally, in your example, the quest for a matching ID in the other table requires scanning the whole table because there is no 'logic' to the ID field; it simply is a list of values. To speed things up you should add an index on the ID field that helps the system find a particular ID value MUCH MUCH faster. Additionally, this also limits the amount of data that needs to be read for the lookup phase: the server will only read from the index, which only contains the ID values (**), and not all the other, unneeded fields.
To be honest, I find your requirement a bit strange, but I'm guessing it's mostly been simplified to make it understandable here on SO. My first reaction was to suggest a JOIN between both tables, but since the ID fields are non-unique this results in duplicates! To work around that I added a DISTINCT, but then things slowed down severely. In the end, doing just the WHERE ID IN (...) turned out to be the most efficient approach.
Adding indexes on the ID field made it faster although the effect wasn't as big as I expected, probably because there are few other fields and the gain in IO is negligible (read: it all fits in memory even though I tried this on 5 million rows).
FYI: Personally I prefer the construction WHERE EXISTS() over WHERE IN (...), but here they're equivalent and actually produced the exact same query plan.
(**: aside from the indexed fields, every index also contains the clustered index fields -- usually the Primary Key of the table -- in its leaf data. For more information, Kimberly L. Tripp has some interesting articles about indexes and how they work.)

redshift select distinct returns repeated values

I have a database where each object property is stored in a separate row. The attached query does not return distinct values in a Redshift database, but works as expected when tested in any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug and behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on tables, but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The only difference here is that when you run a SELECT DISTINCT query against a column that doesn't have a primary key declared, it will scan the whole column and get the unique values; if you run the same against a column that has a primary key constraint, it will just return the output without performing the unique-list filtering. This is how you can get duplicate entries if you insert them.
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if you don't need to check constraint validity for every row that you copy or insert. If you want, you can declare a primary key constraint as part of your data model, but you will need to explicitly support it by removing duplicates or by designing your ETL so that no duplicates are produced.
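For instance, one common way to keep a primary-key-declared Redshift table free of duplicates at load time is an anti-join from a staging table. A hedged sketch, with staging, target_table, and pk_col as illustrative names:
INSERT INTO target_table
SELECT s.*
FROM staging s
LEFT JOIN target_table t ON t.pk_col = s.pk_col
WHERE t.pk_col IS NULL; -- skip rows whose key is already present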
More information, with specific examples, is in the Heap blog post Redshift Pitfalls And How To Avoid Them.
Perhaps you can solve this by using appropriate joins.
For example: I have duplicate values in table1 and I want the values of table1 by joining it to table2, where there is some logic behind joining the two tables according to your conditions.
So I can do something like this:
select distinct table1.col1 from table1 left outer join table2 on table1.col1 = table2.col1
This worked very well for me; I got unique values from table1 and could remove the duplicates.

Selecting the most optimal query

I have a table in an Oracle database which is called my_table, for example. It is a kind of log table. It has an incremental column named "id" and a "registration_number" column which is unique per registered user. Now I want to get the latest changes for registered users, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? And the second: if there are sometimes about 70,000 inserts into this table, but mostly the number of inserted rows varies between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
In order to check for the faster query, you should look at the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They will definitely help you form faster queries.
In cases where you want to insert lots of rows, indexes slow down insertion, since the index also has to be updated with each insert (so I would not recommend an index on ID alone). There are two solutions I can think of for this:
You can drop the index before insertion and then recreate it after insertion.
Use reverse key indexes, as sketched below. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit, so there is a trade-off.
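For reference, a reverse key index on the log table would look like this (the index name is made up); it scatters sequential key values across leaf blocks, reducing insert contention:
create index ix_my_table_id on my_table (id) reverse;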
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last actions for every user with good performance:
select
lua.registration_number, l.action_id
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
Rowid is a physical storage identity directly addressing a storage block; you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.

how to select the newly added rows in a table efficiently?

I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder if this will be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.
this test/update will be done quite frequently and by many clients
this could lead to unexpected race conditions in cache generation
I would suggest the following (a sketch follows below):
upon each new addition to your table, add the newest id into a queue table
use something like crontab to trigger the cache generation by checking the queue table
once a new cache is generated, delete the id from the queue table
As you stress that the majority of the attempts will bring no data, the above will only trigger when there is a new addition,
and the queue table concept can even be expanded to cover updates and deletes.
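A minimal sketch of the queue-table idea in MySQL, assuming the watched table is main_table with an auto-increment column sn (all names here are illustrative):
create table cache_queue (
sn int unsigned not null primary key
);

create trigger trg_enqueue_sn after insert on main_table
for each row
insert ignore into cache_queue (sn) values (new.sn);

-- the cron job rebuilds the cache only when cache_queue has rows,
-- then deletes the ids it has processed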
I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because select count(*) may require a table scan. Just for clarification: do you never delete rows from this table?
SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index on SN, and looking at very few index nodes may suffice. Don't count items if you don't need the exact total; it's expensive.
You don't need to use SELECT COUNT(*).
There are two solutions (the second is sketched below):
You can use a helper table that has one field containing the last count of your table, and create an AFTER INSERT trigger on your table that increments that field.
You can use a helper table that has one field containing the last SN of your table that was cached, and create an AFTER INSERT trigger on your table that updates that field.
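A hedged sketch of the second option in MySQL (table and trigger names are illustrative):
create table last_sn (sn int unsigned not null);
insert into last_sn values (0);

create trigger trg_track_sn after insert on main_table
for each row
update last_sn set sn = new.sn;

-- clients read the single row (select sn from last_sn) and compare it
-- with their cached SN before deciding to fetch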
There's not much to this, really:
drop table if exists foo;
create table foo
(
foo_id int unsigned not null auto_increment primary key
)
engine=innodb;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;