OrientDB: slow query, need help creating index to speed it up - sql

I'm using an SQL query to retrieve money transactions from my OrientDB database (v2.1.16)
The query is running slowly and I'd like to know how to create the index that will speed it up.
The query is:
SELECT timestamp, txId
FROM MoneyTransaction
WHERE (
out("MoneyTransactionAccount").in("AccountMoneyProfile")[accountId] = :accountId
AND moneyType = :moneyType
AND :registerType IN registerQuantities.keys()
)
ORDER BY timestamp DESC, @rid DESC
I also have another variant that resumes the list from a specific point in time:
SELECT timestamp, txId
FROM MoneyTransaction
WHERE (
out("MoneyTransactionAccount").in("AccountMoneyProfile")[accountId] = :accountId
AND moneyType = :moneyType
AND :registerType IN registerQuantities.keys()
)
AND timestamp <= :cutoffTimestamp
AND txId NOT IN :cutoffTxIds
ORDER BY timestamp DESC, @rid DESC
The difficulty I have is trying to figure out how to create an index with the more complex fields, namely the accountId field which doesn't reside within the same vertex, and the registerType field which is to be found within an EMBEDDEDMAP field.
Which index would you create to speed up this query? Or how would you rewrite this query?
My structure is as follows:
[Account] --> (1 to 1) AccountMoneyProfile --> [MoneyProfile]
[MoneyTransaction] --> (n to 1) MoneyTransactionAccount --> [MoneyProfile]
Important fields:
Account.accountId STRING
MoneyTransaction.registerQuantities EMBEDDEDMAP
MoneyTransaction.timestamp DATETIME
The account I'm fetching right now has about 500 MoneyTransaction vertices attached to it.

The index choice depends on the size of your dataset:
If the dataset isn't very large, you could use an SB-TREE index, because SB-TREE indexes maintain sort order and allow range operations.
If the dataset is very large, you could use a HASH INDEX, which performs better on large volumes and consumes fewer resources than other index types, but doesn't support range operations.
In your case you could create, for example, an SB-TREE UNIQUE INDEX on the accountId (e.g. Account.accountId) and rewrite your query so that the query target directly matches the index and reads as few records as possible. Example:
SELECT timestamp, txId
FROM (
SELECT expand(out("AccountMoneyProfile").in("MoneyTransactionAccount"))
FROM Account
WHERE accountId = :accountId
)
WHERE moneyType = :moneyType AND :registerType IN registerQuantities.keys()
ORDER BY timestamp DESC, @rid DESC
In this way you directly select the Account records you're looking for (by using the index previously created) and then you can retrieve only the connected MoneyTransaction records.
You can find more detailed information about indexes in the OrientDB official documentation.
Another way, based on the fact that you specified that the MoneyProfile class doesn't contain important data (if I've understood correctly), could be to change the structure to make the search more direct. E.g.:
Before:
After (I've previously created a new AccountMoneyTransaction edge class):
Hope to have been helpful

Related

SQL range conditions less than, greater than and between

What I would like to accomplish: check whether the 'Email OCR In' and 'Universal Production' rows have documents_created values that total the same amount as the "Email OCR" documents_created. If not, pull that batch. Finally, if the attachment count is less than 7 entries after the 'Email OCR In' and 'Universal Production' files are pulled, return that result.
current query below:
use N
SELECT id,
type,
NAME,
log_time ,
start_time ,
documents_created ,
pages_created,
processed,
processed_time 
FROM N_LF_OCR_LOG
WHERE
-- Log time is current day
log_time between  CONVERT(date, getdate()) AND CONVERT(datetime,floor(CONVERT(float,getdate()))) + '23:59:00' 
-- Documents created is NULL or non zero
AND (documents_created IS NULL OR documents_created <> 0)
or  ( documents_created is null and log_time between  CONVERT(date, getdate()) AND CONVERT(datetime,floor(CONVERT(float,getdate()))) + '23:59:00')
-- Filter for specific types
AND type IN ('Email OCR In',
'Universal Production')
-- Filter to rows where number of pages and documents created are not equal
AND documents_created <2 and pages_created >2
ORDER BY log_time
,id asc
,processed_time asc
Any idea how to incorporate that? I'm a novice. Thanks.
When creating an index, you just specify the columns to be indexed. There is no difference between creating an index for a range query and for an exact match. You can add multiple columns to the same index so all columns can benefit from the index, because only one index per table at a time can be selected to support a query.
You could create an index just covering your where-clause:
create index test1 on N_LF_OCR_LOG (log_time, documents_created, type, pages_created);
Or also add the required columns for the ordering into the index. The ordering of the columns in the index is important and must be the same as for the ordering in the query:
create index test1 on N_LF_OCR_LOG (log_time, id, processed_time, documents_created, type, pages_created);
Or add a covering index that also contains the returned columns, so you do not have to load any values from the table and can answer the complete query by just using the index. This gives the best response time for the query, but the index takes up more space on disk.
create index test1 on N_LF_OCR_LOG (log_time, id, processed_time, documents_created, type, pages_created, NAME, start_time, processed);
Use the EXPLAIN keyword in front of your query (or your DBMS's execution-plan tool) to see how well your index performs.
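To make the covering-index idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module rather than the asker's DBMS. The table name ocr_log, the index name idx_cover and the reduced column list are assumptions for illustration only, and the plan text is SQLite's, so other engines will word it differently.

```python
import sqlite3

# Hypothetical, cut-down version of the log table from the question.
QUERY = (
    "SELECT log_time, type, documents_created, pages_created "
    "FROM ocr_log WHERE log_time >= ? AND log_time < ?"
)


def build_demo():
    """Create an in-memory table plus an index holding every column QUERY reads."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ocr_log (id INTEGER PRIMARY KEY, type TEXT, "
        "log_time TEXT, documents_created INTEGER, pages_created INTEGER)"
    )
    # Filtered column first, then the remaining columns the query returns.
    conn.execute(
        "CREATE INDEX idx_cover ON ocr_log "
        "(log_time, type, documents_created, pages_created)"
    )
    return conn


def plan_details(conn, sql, params=()):
    """Return the 'detail' text of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


if __name__ == "__main__":
    demo = build_demo()
    for line in plan_details(demo, QUERY, ("2024-01-01", "2024-01-02")):
        print(line)
```

When the plan mentions a covering index, the engine answered the query from the index alone and never touched the table rows.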

SQL pagination based on last record retrieved

I need to implement pagination which is semi-resilient to data changing between paginations. The standard pagination relies on SQL's LIMIT and OFFSET, however offset has potential to become inaccurate as new data points are created or their ranking shifts in the sort.
One idea is to hold onto the last data point requested from the API and get the following elements. I don't really know SQL (we're using postgres), but this is my (certainly flawed) attempt at doing something like that. I am trying to store the position of the last element as 'rownum' and then use it in the following query.
WITH rownum AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY rank ASC, id) AS rownum
WHERE id = #{after_id}
FROM items )
SELECT * FROM items
OFFSET rownum
ORDER BY rank ASC, id
LIMIT #{pagination_limit}
I can see some issues with this, like if the last item changes significantly in rank. If anyone can think of another way to do this, that would be great. But I would like to confine it to a single DB query if possible, since this is the application's most frequently hit API.
Your whole syntax doesn't quite work. OFFSET comes after ORDER BY. FROM comes before WHERE etc.
This simpler query would do what I think your code is supposed to do:
SELECT *
FROM items
WHERE (rank, id) > (
SELECT (rank, id)
FROM items
WHERE id = #{after_id}
)
ORDER BY rank, id
LIMIT #{pagination_limit};
Comparing the composite type (rank, id) guarantees identical sort order.
Make sure you have two indexes:
A multicolumn index on (rank, id).
Another one on just (id) - you probably have a pk constraint on the column doing that already. (A multicolumn index with leading id would do the job as well.)
More about indexes:
Is a composite index also good for queries on the first field?
If rank is not volatile it would be more efficient to parameterize it additionally instead of retrieving it dynamically - but the volatility of rank seems to be the point of your deliberations ...
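The "continue after the last row seen" query above can be tried out with a small sketch. It uses Python's sqlite3 module instead of Postgres (SQLite also supports row-value comparison), and the sample rows, the index name and the next_page helper are assumptions made for the demonstration.

```python
import sqlite3


def build_demo():
    """Create a small items table with a multicolumn index on (rank, id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, rank INTEGER NOT NULL)")
    conn.execute("CREATE INDEX idx_items_rank_id ON items (rank, id)")
    conn.executemany(
        "INSERT INTO items (id, rank) VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30)],
    )
    return conn


def next_page(conn, after_id, limit):
    """Return ids of the rows that sort after row `after_id` in (rank, id) order."""
    rows = conn.execute(
        "SELECT id FROM items "
        "WHERE (rank, id) > (SELECT rank, id FROM items WHERE id = ?) "
        "ORDER BY rank, id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    demo = build_demo()
    print(next_page(demo, 2, 2))
```

Because the page boundary is the last row's (rank, id) pair rather than a numeric offset, rows inserted earlier in the ordering between two requests do not shift the next page.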
I now think the best way to solve this problem is by storing the datetime of the original query and filtering out results after that moment on subsequent queries, thus ensuring the offset is mostly correct. Maybe a persistent database could be used to ensure that the data is at the same state it was when the original query was made.

Creating index on timestamp column for query which uses year function

I have a HISTORY table with 9 million records. I need to find year-wise, month-wise records created. I was using query no 1, However it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add where year(created), this time the query took 30 mins (yes it takes so long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on created timestamp column, however before doing so, I need the opinion (since its going to take a long time to index such a huge table).
Will adding an index on created(timestamp) column improve performance, considering year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is one of =, <=, >=, <>, BETWEEN, IN, etc.; and constant is either a direct constant such as 42, or an operation that can be executed once and the result cached, such as getdate().
Like this:
where created >= DateFromParts( @year, 1, 1 )
and created < DateFromParts( @year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
This is called sargability and you can google all kinds of good information on it.
P.S. This is in Sql Server format but you should be able to calculate "beginning of specified year" and "beginning of year after specified year" in whatever DBMS you're using.
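The difference between the two WHERE clause shapes can be seen in a query plan. The sketch below uses Python's sqlite3 module as a stand-in, with strftime playing the role of year(); the table contents and index name are made up for the example, and the SCAN/SEARCH wording is specific to SQLite.

```python
import sqlite3

# Function applied to the column: the index cannot be used to seek.
NON_SARGABLE = (
    "SELECT COUNT(*) FROM history WHERE strftime('%Y', created) = '2010'"
)
# Bare column compared with constants: the index can seek to the range.
SARGABLE = (
    "SELECT COUNT(*) FROM history "
    "WHERE created >= '2010-01-01' AND created < '2011-01-01'"
)


def build_demo():
    """Create an in-memory history table with an index on created."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, created TEXT NOT NULL)")
    conn.execute("CREATE INDEX idx_history_created ON history (created)")
    conn.executemany(
        "INSERT INTO history (created) VALUES (?)",
        [
            ("2009-12-31 23:59:59",),
            ("2010-01-01 00:00:00",),
            ("2010-06-15 12:00:00",),
            ("2011-01-01 00:00:00",),
        ],
    )
    return conn


def first_plan_step(conn, sql):
    """Return the detail text of the first EXPLAIN QUERY PLAN row."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]


if __name__ == "__main__":
    demo = build_demo()
    print(first_plan_step(demo, NON_SARGABLE))
    print(first_plan_step(demo, SARGABLE))
```

Both queries return the same count; only the range form lets the engine seek into the index instead of walking every entry.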
An index will be used, when it helps narrow down the number of rows read.
It will also be used, when it avoids reading the table at all. This is the case, when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reducing the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test, if an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 automatically limits results to 12 per year. The start of the range is trivially located with the index (given the SARGable data), and the end of the range as well (yes, doing math on a column generally disqualifies indices... on the side the math is used. If the optimizer is at all smart, it's going to generate a virtual intermediate table containing the start/end of range).
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables, that can be used for range queries (Calendar Tables being one of the most useful).
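As a rough illustration of the calendar-table join, here is a sketch using Python's sqlite3 module. The snake_case column names, the sample rows and SQLite's date(..., '+1 month') in place of "+ 1 MONTH" are assumptions; a real calendar table would hold one row per day.

```python
import sqlite3


def build_demo():
    """Create a tiny calendar table and a history table with three rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calendar (iso_date TEXT, day_of_month INTEGER, "
        "month INTEGER, year INTEGER)"
    )
    conn.execute(
        "CREATE INDEX idx_calendar ON calendar (day_of_month, month, year, iso_date)"
    )
    conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, created TEXT)")
    conn.execute("CREATE INDEX idx_history_created ON history (created)")
    # Only a few calendar rows: three month starts and one mid-month day.
    conn.executemany(
        "INSERT INTO calendar VALUES (?, ?, ?, ?)",
        [
            ("2010-01-01", 1, 1, 2010),
            ("2010-01-15", 15, 1, 2010),
            ("2010-02-01", 1, 2, 2010),
            ("2010-03-01", 1, 3, 2010),
        ],
    )
    conn.executemany(
        "INSERT INTO history (created) VALUES (?)",
        [("2010-01-05 10:00:00",), ("2010-01-20 08:30:00",), ("2010-03-01 00:00:00",)],
    )
    return conn


def monthly_counts(conn):
    """Count history rows per month by joining on a half-open date range."""
    return conn.execute(
        "SELECT c.year, c.month, COUNT(*) "
        "FROM calendar c JOIN history h "
        "  ON h.created >= c.iso_date "
        " AND h.created < date(c.iso_date, '+1 month') "
        "WHERE c.day_of_month = 1 "
        "GROUP BY c.year, c.month ORDER BY c.year, c.month"
    ).fetchall()


if __name__ == "__main__":
    print(monthly_counts(build_demo()))
```

The history column is only ever compared against values computed from the calendar side, so an index on it stays usable. Months without any history rows are absent from the result because this is an inner join.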
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend only using the 'created' column as a key for the index and to not include any additional columns from the History table because they will be unused and only result in more reads than what is necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
As has been stated before, in your case an index won't be used, because the index is created on the column 'created' and you are querying on 'year(created)'.
What you can do is add two generated columns, year_gen = year(created) and month_gen = MONTHNAME(created), to your table and index these two columns. The DB2 Query Optimizer will automatically use these two generated columns and it will also use the indices created on these columns.
The code should be something like (but not 100% sure since I have no DB2 to test)
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATED))
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (MONTHNAME(CREATED)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY (YEAR_GEN ASC) CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY (MONTH_GEN ASC) #
Just a side note: setting integrity off is mandatory to add generated columns. Your table is inaccessible until you reset the integrity to checked and force the recalculation of the generated columns (this might take a while in your case).
Setting integrity off without cascade deferred will set every table with a foreign key to the HISTORY table to OFF too. You will have to manually reset the integrity of these tables too. If I remember correctly, using cascade deferred in combination with incoming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.

SQL Server slow select from large table

I have a table with about 20+ million records.
Structure is like:
EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT
Table is receiving about 100k+ records each day.
I have indexes on each column except MetaId, as it is not used in 'where' clauses
The problem is when i want to pick up eg. latest 100 records for desired SourceUserId
Query sometimes takes up to 4 minutes to execute, which is not acceptable.
Eg.
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
OR
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC
I can't do partitioning etc as I am using Standard version of SQL Server and Enterprise is too expensive.
I also think that the table is quite small to be that slow.
I think the problem is with ORDER BY clause as db must go through much bigger set of data.
Any ideas how to make it quicker ?
Perhaps relational database is not a good idea for that kind of data.
Data is always being picked up ordered by CreatedAt DESC
Thank you for reading.
PabloX
You'll likely want to create a composite index for this type of query. When the query runs slowly, it is most likely choosing to scan down an index on the CreatedAt column and perform a residual filter on the SourceUserId value, when what you really want is to jump directly to all records for a given SourceUserId, already in the proper order. To achieve this, create a composite index primarily on SourceUserId (performing an equality check) and secondarily on CreatedAt (to preserve the order within a given SourceUserId value). You may want to try adding the TypeId in as well, depending on the selectivity of this column.
So, the 2 that will most likely give the best repeatable performance (try them out and compare) would be:
Index on (SourceUserId, CreatedAt)
Index on (SourceUserId, TypeId, CreatedAt)
As always, there are also many other considerations to take into account with determining how/what/where to index, as Remus discusses in a separate answer one big consideration is covering the query vs. keeping lookups. Additionally you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc., etc.
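The effect of putting the equality column first and the ORDER BY column second can be checked with a small experiment. This sketch uses Python's sqlite3 module, not SQL Server, with a simplified events table and hypothetical index names; the plan wording ("USE TEMP B-TREE FOR ORDER BY") is SQLite's way of saying a separate sort step is needed.

```python
import sqlite3

QUERY = (
    "SELECT * FROM events WHERE source_user_id = ? "
    "ORDER BY created_at DESC LIMIT 100"
)


def plan_with_index(index_sql):
    """Build a fresh events table with one index and return the plan for QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_id TEXT, source_user_id TEXT, "
        "created_at TEXT, type_id INTEGER)"
    )
    conn.execute(index_sql)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("u1",)).fetchall()
    return [r[3] for r in rows]


# Index on the filter column only: matching rows still have to be sorted.
SINGLE = "CREATE INDEX idx_src ON events (source_user_id)"
# Composite index: rows for one user are already stored in created_at order.
COMPOSITE = "CREATE INDEX idx_src_created ON events (source_user_id, created_at)"

if __name__ == "__main__":
    print(plan_with_index(SINGLE))
    print(plan_with_index(COMPOSITE))
```

With the composite index the engine can read the newest rows for one user straight off the index and stop after 100, which is the behaviour the slow query is missing.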
"I have indexes on each column except MetaId"
Non-covering indexes will likely hit the 'tipping point', and the query will revert to a table scan. Just adding an index on every column because it is used in a where clause does not equate to good index design. To take your query for example, a good 100% covering index would be:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)
The following index is also useful, although it is still going to cause lookups:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId)
And finally, an index without any included columns may help, but is just as likely to be ignored (depending on the column statistics and cardinality estimates):
INDEX ON (SourceUserId , CreatedAt)
But a separate index on SourceUserId and one on CreatedAt are basically useless for your query.
See Index Design Basics.
The fact that the table has indexes built on GUID values, indicates a possible series of problems that would affect performance:
High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in a sequential order and the nodes are spread unevenly.
High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance that a new value won't fit in the remaining space available in a page.
Slow value comparison: comparing two GUIDs is a relatively slow operation because all 16 bytes must be compared.
Here are a couple of resources on how to investigate and resolve these problems:
How to Detect Index Fragmentation in SQL Server 2000 and 2005
Reorganizing and Rebuilding Indexes
How Using GUIDs in SQL Server Affect Index Performance
I would recommend getting the data into 2 separate tables:
INSERT INTO #Table1
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
)
INSERT INTO #Table2
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
Then apply a union of the two selects, ordered and with TOP. Limit the data from the get-go.
I suggest using a UNION:
SELECT TOP 100 x.*
FROM (SELECT a.*
FROM EVENTS a
WHERE a.typeid IN (2, 3, 4)
UNION ALL
SELECT b.*
FROM EVENTS b
WHERE b.typeid = 60
AND b.srcmemberid != b.dstmemberid) x
WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
ORDER BY x.createdat DESC
We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
I would make sure CreatedAt is indexed properly.
You could split the query in two with a UNION to avoid the OR (which can cause your index not to be used), something like:
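SELECT TOP 100 * FROM (
-- newest 100 rows of the first kind
SELECT * FROM (SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)
ORDER BY CreatedAt DESC) a
UNION ALL
-- newest 100 rows of the second kind
SELECT * FROM (SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId = 60 AND SrcMemberId != DstMemberId
ORDER BY CreatedAt DESC) b
) x
ORDER BY CreatedAt DESC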
Also, check that the uniqueidentifier indexes are not CLUSTERED.
If there are 100K records added each day, you should check your index fragmentation.
And rebuild or reorganize it accordingly.
More info :
SQLauthority

How to find the *position* of a single record in a limited, arbitrarily ordered record set?

MySQL
Suppose you want to retrieve just a single record by some id, but you want to know what its position would have been if you'd encountered it in a large ordered set.
Case in point is a photo gallery. You land on a single photo, but the system must know what its offset is in the entire gallery.
I suppose I could use custom indexing fields to keep track of positions, but there must be a more graceful way in SQL alone.
So, first you create a virtual table with the position # ordered by whatever your ORDER BY is, then you select the highest one from that set. That's the position in the greater result set. You can run into problems if you don't order by a unique value/set of values...
If you create an index on (photo_gallery_id, date_created_on) it may do an index scan (depending on the distribution of photos), which ought to be faster than a table scan (provided your gallery_id isn't 90% of the photos or whatnot).
SELECT @row := 0;
SELECT MAX( position )
FROM ( SELECT @row := @row + 1 AS position
FROM photos
WHERE photo_gallery_id = 43
AND date_created_on <= 'the-date-time-your-photo-was'
ORDER BY date_created_on ) positions;
Not really. I think Oracle gives you a "ROWID" or something like that, but most don't give you one. A custom ordering, like a column in your database that tells you what position the entry has in the gallery, is good because you can never be sure that SQL will put things in the table in the order you think they should be in.
As you are not specific about what database you're using, in SQL Server 2005 you could use
SELECT
ROW_NUMBER() OVER (ORDER BY PhotoID)
, PhotoID
FROM dbo.Photos
You don't say what DBMS you are using, and the "solution" will vary accordingly. In Oracle you could do this (but I would urge you not to!):
select photo, offset
from
( select id, photo
, row_number() over (partition by gallery_id order by photo_seq) as offset
from photos
)
where id = 123
That query will select all photos (full table scan) and then pick out the one you asked for - not a performant query!
I would suggest if you really need this information it should be stored.
Assuming the position is determined solely by the id, would it not be as simple as counting all records with a smaller id value?:
select
po.[id]
...
((select count(pi.[id]) from photos pi where pi.[id] < po.[id]) + 1) as [index]
...
from photos po
...
I'm not sure what the performance implications of such a query would be, but I would think returning a lot of records could be a problem.
You must understand the difference between a "application key" and a "technical key".
The technical key exists for the sole purpose of making an item unique. It's usually an INTEGER or BIGINT, generated (identity, whatever). This key is used to locate objects in the database, quickly figure out if an object has already been persisted (IDs must be > 0, so an object with the default ID == 0 is not in the DB yet), etc.
The application key is something which you need to make sense of an object within the context of your application. In this case, it's the ordering of the photos in the gallery. This has no meaning whatsoever for the database.
Think ordered list: This is the default in most languages. You have a set of items, accessed by an index. For a database, this index is an application key since sets in the database are unordered (or rather the database doesn't guarantee any ordering unless you specify ORDER BY). For the very same reason, paging through results from a query is such a pain: Databases really don't like the idea of "position".
So what you must do is add an index column (i.e. an INTEGER which says at which position in the gallery your image is; not a database index for quicker access, even though you should create an index on this column ...) and maintain that. For every insertion, you must UPDATE index = index + 1 where index >= insertion_point, etc.
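A minimal sketch of maintaining such a position column by hand, using Python's sqlite3 module. The photos table, the position column name and the insert_at helper are hypothetical; a real gallery would also scope the UPDATE to a single gallery id.

```python
import sqlite3


def build_demo():
    """Create a photos table whose display order is stored in `position`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photos (id INTEGER PRIMARY KEY, name TEXT, position INTEGER)"
    )
    conn.executemany(
        "INSERT INTO photos (name, position) VALUES (?, ?)",
        [("a", 1), ("b", 2), ("c", 3)],
    )
    conn.commit()
    return conn


def insert_at(conn, name, position):
    """Shift later photos down by one, then insert the new one, in one transaction."""
    with conn:
        conn.execute(
            "UPDATE photos SET position = position + 1 WHERE position >= ?",
            (position,),
        )
        conn.execute(
            "INSERT INTO photos (name, position) VALUES (?, ?)", (name, position)
        )


def ordered_names(conn):
    """Return photo names in gallery order."""
    rows = conn.execute("SELECT name FROM photos ORDER BY position").fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    demo = build_demo()
    insert_at(demo, "x", 2)
    print(ordered_names(demo))
```

Doing the shift and the insert in one transaction keeps readers from ever seeing two photos at the same position.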
Yes, it sucks. The only solution I know of: Use an ORM framework which solves this for you.
There's no need for an extra table, why not just count the records instead?
You know the order in which they are displayed (which can vary), but you know it.
You also know the ID of the current record; let's say it's ordered on date:
The offset of the record is the total number of records with a date earlier than that record's date.
SELECT COUNT(1) FROM ... WHERE date < "the-date"
This gives you the number you can use as the offset for the other queries...
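The counting approach can be sketched as follows with Python's sqlite3 module. The photos table, the created_on column and the offset_of helper are assumptions for illustration, and the sketch assumes dates are unique; with ties, a second column such as the id would be needed to break them.

```python
import sqlite3


def build_demo():
    """Create a photos table whose ids are not in date order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, created_on TEXT)")
    conn.execute("CREATE INDEX idx_photos_created ON photos (created_on)")
    conn.executemany(
        "INSERT INTO photos (id, created_on) VALUES (?, ?)",
        [(10, "2020-01-03"), (11, "2020-01-01"), (12, "2020-01-02")],
    )
    return conn


def offset_of(conn, photo_id):
    """Return how many photos come before `photo_id` when ordered by date."""
    row = conn.execute(
        "SELECT COUNT(*) FROM photos "
        "WHERE created_on < (SELECT created_on FROM photos WHERE id = ?)",
        (photo_id,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    demo = build_demo()
    print(offset_of(demo, 10))
```

The returned number is zero-based, so it can be passed directly as an OFFSET to a paging query that uses the same ordering.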