SQL range conditions: less than, greater than, and between

What I would like to accomplish: check whether the 'Email OCR In' and 'Universal Production' rows total the same amount in the documents_created column as the 'Email OCR' documents_created. If not, pull that batch. Finally, if the attachment count is less than 7 entries after the Email OCR In and Universal Production files are pulled, then return said result.
Current query below:
use N
SELECT id,
type,
NAME,
log_time ,
start_time ,
documents_created ,
pages_created,
processed,
processed_time 
FROM N_LF_OCR_LOG
WHERE
-- Log time is current day
log_time between  CONVERT(date, getdate()) AND CONVERT(datetime,floor(CONVERT(float,getdate()))) + '23:59:00' 
-- Documents created is NULL or non zero
AND (documents_created IS NULL OR documents_created <> 0)
or  ( documents_created is null and log_time between  CONVERT(date, getdate()) AND CONVERT(datetime,floor(CONVERT(float,getdate()))) + '23:59:00')
-- Filter for specific types
AND type IN ('Email OCR In',
'Universal Production')
-- Filter to rows where number of pages and documents created are not equal
AND documents_created <2 and pages_created >2
ORDER BY log_time
,id asc
,processed_time asc
Any idea how to incorporate that? I'm a novice. Thanks.

When creating an index, you just specify the columns to be indexed. There is no difference between creating an index for a range query and one for an exact match. You can add multiple columns to the same index so that all of them benefit from it, because only one index per table can be selected to support a query at a time.
You could create an index just covering your where-clause:
alter table N_LF_OCR_LOG add index test1(log_time, documents_created, type, pages_created);
Or also add the columns required for the ordering to the index. The order of the columns in the index is important and must be the same as the ordering in the query:
alter table N_LF_OCR_LOG add index test1(log_time, id, processed_time, documents_created, type, pages_created);
Or add a covering index that also contains the returned columns, so you do not have to load any values from your table and can answer the complete query using just the index. This gives the best response time for the query, but the index takes up more space on disk.
alter table N_LF_OCR_LOG add index test1(log_time, id, processed_time, documents_created, type, pages_created, NAME, start_time, processed);
Use the EXPLAIN keyword in front of your query to see how well your index performs.
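Note that the query in the question uses SQL Server functions (GETDATE, CONVERT), while ALTER TABLE ... ADD INDEX and EXPLAIN are MySQL syntax. If SQL Server is actually the target engine, a minimal equivalent sketch (reusing the index name test1 and the same column list as the first example above) would be:
CREATE INDEX test1
ON N_LF_OCR_LOG (log_time, documents_created, type, pages_created);
-- In SQL Server, inspect the plan with SET SHOWPLAN_ALL ON
-- (or the graphical execution plan in SSMS) instead of MySQL's EXPLAIN.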

Related

OrientDB: slow query, need help creating index to speed it up

I'm using an SQL query to retrieve money transactions from my OrientDB database (v2.1.16).
The query is running slowly and I'd like to know how to create the index that will speed it up.
The query is:
SELECT timestamp, txId
FROM MoneyTransaction
WHERE (
out("MoneyTransactionAccount").in("AccountMoneyProfile")[accountId] = :accountId
AND moneyType = :moneyType
AND :registerType IN registerQuantities.keys()
)
ORDER BY timestamp DESC, #rid DESC
I also have another variant that resumes the list from a specific point in time:
SELECT timestamp, txId
FROM MoneyTransaction
WHERE (
out("MoneyTransactionAccount").in("AccountMoneyProfile")[accountId] = :accountId
AND moneyType = :moneyType
AND :registerType IN registerQuantities.keys()
)
AND timestamp <= :cutoffTimestamp
AND txId NOT IN :cutoffTxIds
ORDER BY timestamp DESC, #rid DESC
The difficulty I have is trying to figure out how to create an index with the more complex fields, namely the accountId field which doesn't reside within the same vertex, and the registerType field which is to be found within an EMBEDDEDMAP field.
Which index would you create to speed up this query? Or how would you rewrite this query?
My structure is as follows:
[Account] --> (1 to 1) AccountMoneyProfile --> [MoneyProfile]
[MoneyTransaction] --> (n to 1) MoneyTransactionAccount --> [MoneyProfile]
Important fields:
Account.accountId STRING
MoneyTransaction.registerQuantities EMBEDDEDMAP
MoneyTransaction.timestamp DATETIME
The account I'm fetching right now has about 500 MoneyTransaction vertices attached to it.
About the index choice, it depends on the size of your dataset:
If the dataset isn't very large, you could use an SB-TREE index, because it maintains sorting and allows range operations;
If the dataset is very large, you could instead use a HASH INDEX, which works better on large numbers of entries and consumes fewer resources than other indexes, but doesn't support range operations.
In your case you could create, for example, an SB-TREE UNIQUE INDEX on the accountId (e.g. Account.accountId) and rewrite your query so that the target directly matches the index and reads as few records as possible. Example:
SELECT timestamp, txId
FROM (
SELECT expand(out("AccountMoneyProfile").in("MoneyTransactionAccount"))
FROM Account
WHERE accountId = :accountId
)
WHERE moneyType = :moneyType AND :registerType IN registerQuantities.keys()
ORDER BY timestamp DESC, #rid DESC
In this way you directly select the Account records you're looking for (by using the index previously created) and then you can retrieve only the connected MoneyTransaction records.
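For reference, the SB-TREE UNIQUE index mentioned above could be created like this (OrientDB SQL; this assumes accountId is already defined as a schema property on the Account class):
CREATE INDEX Account.accountId UNIQUE
UNIQUE (and NOTUNIQUE) indexes use the SB-Tree engine by default; the hash-based variants are selected with the UNIQUE_HASH_INDEX / NOTUNIQUE_HASH_INDEX types instead.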
You can find more detailed information about indexes in the OrientDB official documentation.
Another way, based on the fact that you said the MoneyProfile class doesn't contain important data (if I've understood correctly), could be to change the structure to make the search more direct, e.g. by creating a new AccountMoneyTransaction edge class and connecting the vertices through it (the before/after structure diagrams are omitted here).
Hope this has been helpful.

Create index for lower case of a to z characters

I have a database column which allows us to store lowercase a to z characters and spaces. How can we create an index with more specific expressions?
We need this specific index to improve ORDER BY clause performance.
The performance problem is that ORDER BY over a large number of rows is slow. If the ORDER BY column is a date or integer it is faster, but not for a varchar column.
We want to make the query faster by adding a specific index to the varchar column, or by taking another approach.
First, I suggest taking a look at this page: http://www.postgresql.org/docs/9.4/static/indexes-expressional.html
Second, is the ORDER BY on the field, or on an expression? I mean, if your ORDER BY is:
ORDER BY col1
You just need to index like:
CREATE INDEX idx_table_col1 ON yourtable (col1);
If your ORDER BY is:
ORDER BY lower(col1)
Then:
CREATE INDEX idx_lower_col1_table ON yourtable (lower(col1));
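As a quick check (a minimal sketch; yourtable and col1 are the placeholder names from above), the query has to reference exactly the same expression for the planner to consider the expression index, and you can verify that with EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT col1
FROM yourtable
ORDER BY lower(col1)
LIMIT 100;  -- a LIMIT makes an index scan more likely to beat a full sort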
Anyway, to improve your question, I suggest you:
Show your query
Show the execution plan retrieved with EXPLAIN ANALYZE
Show your table and indexes

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially less than the number of rows the index covers. In my case, about 1 in 30 days is distinct.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
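If you go that route, a minimal sketch of the lookup table, foreign key, and maintenance trigger could look like the following (names are assumed; on PostgreSQL 9.5+ you could use INSERT ... ON CONFLICT DO NOTHING instead of the NOT EXISTS guard):
CREATE TABLE days (
day DATE PRIMARY KEY
);
INSERT INTO days SELECT DISTINCT day FROM data WHERE day IS NOT NULL;  -- backfill before adding the FK
ALTER TABLE data ADD CONSTRAINT data_day_fkey FOREIGN KEY (day) REFERENCES days (day);
CREATE OR REPLACE FUNCTION add_day() RETURNS trigger AS $$
BEGIN
INSERT INTO days (day)
SELECT NEW.day
WHERE NEW.day IS NOT NULL
AND NOT EXISTS (SELECT 1 FROM days WHERE day = NEW.day);  -- not safe under heavy concurrency
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER data_day_trigger
BEFORE INSERT ON data
FOR EACH ROW EXECUTE PROCEDURE add_day();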
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT day FROM data ORDER BY 1 LIMIT 1
)
UNION ALL
SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
FROM cte c
WHERE c.day IS NOT NULL -- exit condition
)
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
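For completeness, that plain index would be something along these lines (index name assumed):
CREATE INDEX data_day_idx ON data (day);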
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index

Creating index on timestamp column for query which uses year function

I have a HISTORY table with 9 million records. I need to find the number of records created per year and per month. I was using query no. 1 below; however, it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add a WHERE clause on year(created); this time the query took 30 minutes (yes, it takes that long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on the created timestamp column; however, before doing so, I need opinions (since it's going to take a long time to index such a huge table).
Will adding an index on the created (timestamp) column improve performance, considering the year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is =, <=, >=, <>, BETWEEN, IN, etc.; and constant is either a direct constant, such as 42, or an operation that can be executed once and the result cached, such as getdate().
Like this:
where created >= DateFromParts( @year, 1, 1 )
and created < DateFromParts( @year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
This is called sargability and you can google all kinds of good information on it.
P.S. This is in Sql Server format but you should be able to calculate "beginning of specified year" and "beginning of year after specified year" in whatever DBMS you're using.
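Applied to the second query in the question, and written in MySQL (which the MONTHNAME function suggests), the sargable version would look roughly like this, with the bounds for 2010 spelled out as constants:
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
created >= '2010-01-01'
AND created < '2011-01-01'
GROUP BY
year(created), MONTHNAME(created);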
An index will be used when it helps narrow down the number of rows read.
It will also be used when it avoids reading the table at all. This is the case when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reduce the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test whether an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
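If the table is MySQL (as MONTHNAME suggests), the same sampling idea would use LIMIT rather than FETCH FIRST; a rough sketch, with the table name testhistory taken from the example above:
CREATE TABLE testhistory LIKE HISTORY;
INSERT INTO testhistory
SELECT *
FROM HISTORY
LIMIT 100000;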
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 is automatically limiting results to 12 per year. The start of the range is trivially located with the index (given the SARGable data), and the end of the range as well (yes, doing math on a column generally disqualifies indices... on the side the math is used. If the optimizer is at all smart, it's going to generate a virtual intermediate table containing the start/end of each range).
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables, that can be used for range queries (Calendar Tables being one of the most useful).
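One way to seed such a minimal Calendar table, sketched here for MySQL 8+ (recursive CTEs; the date range and the cte_max_recursion_depth value are arbitrary choices for this example, and on older MySQL you would generate the rows with a numbers table or a script instead):
SET SESSION cte_max_recursion_depth = 20000;  -- the default of 1000 is too low for ~30 years of days
INSERT INTO Calendar (isoDate, dayOfMonth, month, year)
WITH RECURSIVE d AS (
SELECT DATE '2000-01-01' AS dt
UNION ALL
SELECT dt + INTERVAL 1 DAY FROM d WHERE dt < DATE '2030-12-31'
)
SELECT dt, DAY(dt), MONTH(dt), YEAR(dt) FROM d;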
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend using only the created column as the index key, and not including any additional columns from the History table, because they will be unused and will only result in more reads than necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
As has been stated before, in your case an index won't be used, because the index is created on the column created and you are querying on year(created).
What you can do is add two generated columns year_gen = year(created) and month_gen = MONTHNAME(created) to your table and index these two columns. The DB2 Query Optimizer will automatically use these two generated columns and it will also use the indexes created on them.
The code should be something like this (but I'm not 100% sure, since I have no DB2 to test on):
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATED)),
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (MONTHNAME(CREATED)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY (YEAR_GEN ASC) CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY (MONTH_GEN ASC) #
Just a side note: the SET INTEGRITY ... OFF is mandatory to add generated columns. Your table is inaccessible until you reset the integrity to CHECKED and force the re-calculation of the generated columns (this might take a while in your case).
Setting integrity off without CASCADE DEFERRED will set every table with a foreign key to the HISTORY table to OFF too. You will have to manually reset the integrity of those tables as well. If I remember correctly, using CASCADE DEFERRED in combination with incoming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.

Mysql improve SELECT speed

I'm currently trying to improve the speed of SELECTS for a MySQL table and would appreciate any suggestions on ways to improve it.
We have over 300 million records in the table, and the table has the structure tag, date, value. The primary key is a combined key of tag and date. The table contains information for about 600 unique tags, most containing an average of about 400,000 rows, but the counts range from 2,000 to over 11 million rows per tag.
The queries run against the table are:
SELECT date,
value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
...and there are very few, if any, INSERTs.
I have tried partitioning the data by tag into various numbers of partitions, but this seems to give little increase in speed.
Take the time to read my answer here (it has similar volumes to yours):
500 million rows, 15 million row range scan in 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
Then amend your table engine to InnoDB as follows:
create table tag_date_value
(
tag_id smallint unsigned not null, -- i prefer ints to chars
tag_date datetime not null, -- can we make this date vs datetime ?
value int unsigned not null default 0, -- or whatever datatype you require
primary key (tag_id, tag_date) -- clustered composite PK
)
engine=innodb;
you might consider the following as the primary key instead:
primary key (tag_id, tag_date, value) -- adding value saves some I/O
but only if value isn't some LARGE varchar type!
query as before:
select
tag_date,
value
from
tag_date_value
where
tag_id = 1 and
tag_date between 'x' and 'y'
order by
tag_date;
hope this helps :)
EDIT
Oh, forgot to mention - don't use ALTER TABLE to change the engine type from MyISAM to InnoDB, but rather dump the data out into CSV files and re-import it into a newly created, empty InnoDB table.
Note I'm ordering the data during the export process - clustered indexes are the KEY!
Export
select * into outfile 'tag_dat_value_001.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 1 and 50
order by
tag_id, tag_date;
select * into outfile 'tag_dat_value_002.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 51 and 100
order by
tag_id, tag_date;
-- etc...
Import
Import back into the table in the correct order!
start transaction;
load data infile 'tag_dat_value_001.dat'
into table tag_date_value
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
(
tag_id,
tag_date,
value
);
commit;
-- etc...
What is the cardinality of the date field (that is, how many different values appear in that field)? If the date BETWEEN 'x' AND 'y' is more limiting than the tag = 'a' part of the WHERE clause, try making your primary key (date, tag) instead of (tag, date), allowing date to be used as an indexed value.
Also, be careful how you specify 'x' and 'y' in your WHERE clause. There are some circumstances in which MySQL will cast each date field to match the non-date implied type of the values you compare to.
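For example (a hedged sketch with placeholder dates; table is the question's placeholder name, so it needs quoting), pass the bounds as unambiguous date literals or explicit casts so the comparison happens in the date domain rather than casting the column row by row:
SELECT date, value
FROM `table`
WHERE tag = 'a'
AND date BETWEEN CAST('2015-01-01' AS DATE) AND CAST('2015-03-31' AS DATE)
ORDER BY date;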
I would do two things - first throw some indexes on there around tag and date as suggested above:
alter table table add index (tag, date);
Next, break your query into a main query and a sub-select, where the sub-select narrows your results down before you get into your main query:
SELECT date, value
FROM table
WHERE date BETWEEN 'x' and 'y'
AND tag IN ( SELECT tag FROM table WHERE tag = 'a' )
ORDER BY date
Your query is asking for a few things - and with that high number of rows, the shape of the data can change what the best approach is.
SELECT date, value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
There are a few things that can slow down this select query.
A very large result set that has to be sorted (order by).
A very large result set. If tag and date are in the index (and let's assume that's as good as it gets) every result row will have to leave the index to lookup the value field. Think of this like needing the first sentence of each chapter of a book. If you only needed to know the chapter names, easy - you can get it from the table of contents, but since you need the first sentence you have to go to the actual chapter. In certain cases, the optimizer may choose just to flip through the entire book (table scan in query plan lingo) to get those first sentences.
Filtering by the wrong where clause first. If the index is in the order tag, date... then tag should (for a majority of your queries) be the more stringent of the two columns. So basically, unless you have more tags than dates (or maybe more tags than distinct dates in a typical date range), dates should be the first of the two columns in your index.
A couple of recommendations:
Consider if it's possible to truncate some of that data if it's too old to care about most of the time.
Try playing with your current index - i.e. change the order of the items in it.
Do away with your current index and replace it with a covering index (has all 3 fields in it)
Run some EXPLAINs and make sure the query is using your index at all.
Switch to some other data store (mongo db?) or otherwise ensure this monster table is kept as much in memory as possible.
I'd say your only chance to further improve it is a covering index with all three columns (tag, date, value). That avoids the table access.
I don't think that partitioning can help with that.
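If you try that, the covering index could be created along these lines (index name assumed; table is again the question's placeholder name):
ALTER TABLE `table` ADD INDEX idx_tag_date_value (tag, date, value);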
I would guess that adding an index on (tag, date) would help:
alter table table add index (tag, date);
Please post the result of an explain on this query (EXPLAIN SELECT date, value FROM ......)
I think that the value column is at the root of your performance issues. It is not part of the index, so there will be table access. Further, I think that the ORDER BY is unlikely to impact performance so severely, since it is part of your index and the data should already be ordered.
I will back up my suspicion about the value column with the fact that the partitioning does not really reduce the execution time of the query. Could you execute the query without value and give us some results, as well as the EXPLAIN? Do you really need it for each row, and what kind of column is it?
Cheers!
Try inserting just the needed dates into a temporary table and then finishing with a select on the temporary table for the tags and ordering.
CREATE temporary table foo
SELECT date, value
FROM table
WHERE date BETWEEN 'x' and 'y' ;
ALTER TABLE foo ADD INDEX tag_idx ( tag );
SELECT date, value
FROM foo
WHERE tag = "a"
ORDER BY date;
If that doesn't work, try creating foo off the tag selection instead.
CREATE temporary table foo
SELECT date, value
FROM table
WHERE tag = "a";
ALTER TABLE foo ADD INDEX date_idx ( date );
SELECT date, value
FROM foo
WHERE date BETWEEN 'x' and 'y'
ORDER BY date;