How to model and query objects in relational databases?

I have a complex database scheme for a dictionary. Each object (essentially a translation) is similar to this:
Entry {
keyword;
examples;
tags;
Translations;
}
with
Translation {
text;
tags;
examples;
}
and
Example {
text;
translation;
phonetic_script;
}
i.e. tags (e.g. grammar) can belong either to the keyword itself or to the translation (grammar of the foreign language), and similarly, examples can belong either to the translation itself (i.e. explaining the foreign word) or to the text in the entry. I ended up with this kind of relational design:
entries(id,keyword,...)
tags(tag)
examples(id,text,...)
entrytags(entry_id,tag)
entryexamples(entry_id,example_id)
translations(id,belongs_to_entry,...)
translationtags(transl_id, tag)
translationexamples(transl_id,example_id)
My main task is querying this database. Say I search for "foo", my current way of handling is:
query all entries with foo, get ids A
foreach id in A
query all examples belonging to id
query all tags belonging to id
query all translations belonging to A, store their ids in B
foreach tr_id in B
query all tags belonging to tr_id
query all examples belonging to tr_id
to rebuild my objects. This looks cumbersome to me, and is slow. I do not see how I could significantly improve this by using joins, or otherwise. I have a hard time mapping these objects to relations in the database. Is this a proper design?
How can I make this more efficient to improve query time?

Each query called in the loop takes at least some base amount of time to execute, even for trivial queries. Many environmental factors contribute to that base duration, but for now let's assume it is 10 milliseconds. If the first query matches 100 entries, then at a minimum 301 queries are issued in total, each taking 10 ms, for a total of about 3 seconds. The number of loop iterations varies, which can cause substantial variation in performance.
Restructuring the queries with joins makes each individual query more complex, but the total number of queries can be reduced to a fixed count: 4 in the queries below. Suppose each query now takes 50 ms to execute because it is more complex; the total duration becomes 200 ms, a substantial decrease from 3000 ms.
The 4 queries shown below should come close to achieving the desired result. There are other ways to write them, such as using subqueries or listing the tables in the FROM clause, but these show how to do it with JOINs. The condition entries.keyword = 'foo' stands in for whatever condition the original query uses to select the entries.
It is worth noting that if the foo condition on entries is very expensive to compute, other optimizations may be needed to further improve performance. In these examples the condition is a simple equality comparison, which is quick to look up in an index; a LIKE condition that forces a full table scan may not work as well with these queries.
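For the joins below to be efficient, the join and filter columns also need index support. Something along these lines would cover them (a sketch; the index names are illustrative, and several of these columns may already be indexed as primary or foreign keys):
CREATE INDEX idx_entries_keyword ON entries (keyword);
CREATE INDEX idx_entryexamples_entry ON entryexamples (entry_id);
CREATE INDEX idx_entrytags_entry ON entrytags (entry_id);
CREATE INDEX idx_translations_entry ON translations (belongs_to_entry);
CREATE INDEX idx_translationtags_transl ON translationtags (transl_id);
CREATE INDEX idx_translationexamples_transl ON translationexamples (transl_id);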
The following query selects all examples matching the original query. The condition from the original query is expressed as a WHERE clause on the entries.keyword column.
SELECT entries.id, examples.text
FROM entries
INNER JOIN entryexamples
ON (entries.id = entryexamples.entry_id)
INNER JOIN examples
ON (entryexamples.example_id = examples.id)
WHERE entries.keyword = 'foo';
This query selects tags matching the original query. It joins only two tables because the entrytags.tag column already provides what is needed; joining with tags would only return the same value.
SELECT entries.id, entrytags.tag
FROM entries
INNER JOIN entrytags
ON (entries.id = entrytags.entry_id)
WHERE entries.keyword = 'foo';
This query selects the translation tags for the original query. This is similar to the previous query to select the entrytags but another layer of joins is used here for the translations.
SELECT entries.id, translationtags.tag
FROM entries
INNER JOIN translations
ON (entries.id = translations.belongs_to_entry)
INNER JOIN translationtags
ON (translations.id = translationtags.transl_id)
WHERE entries.keyword = 'foo';
The final query does the same as the first query for the examples but also includes the additional joins. It's getting to be a lot of joins but in general should perform significantly better than looping through and executing individual queries.
SELECT entries.id, examples.text
FROM entries
INNER JOIN translations
ON (entries.id = translations.belongs_to_entry)
INNER JOIN translationexamples
ON (translations.id = translationexamples.transl_id)
INNER JOIN examples
ON (translationexamples.example_id = examples.id)
WHERE entries.keyword = 'foo';
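If round trips are still a concern, the first and last of these queries can also be folded into one with UNION ALL and a literal discriminator column, at the cost of a slightly wider result set to pull apart in application code. A sketch (the source column is just an illustrative label):
SELECT entries.id AS entry_id, NULL AS translation_id, examples.text, 'entry' AS source
FROM entries
INNER JOIN entryexamples
ON (entries.id = entryexamples.entry_id)
INNER JOIN examples
ON (entryexamples.example_id = examples.id)
WHERE entries.keyword = 'foo'
UNION ALL
SELECT entries.id, translations.id, examples.text, 'translation' AS source
FROM entries
INNER JOIN translations
ON (entries.id = translations.belongs_to_entry)
INNER JOIN translationexamples
ON (translations.id = translationexamples.transl_id)
INNER JOIN examples
ON (translationexamples.example_id = examples.id)
WHERE entries.keyword = 'foo';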

Related

How can this change be making my query slow (OR vs UNION) and can I fix it?

I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause conditioning on either a property of 1 table OR the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE @UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId) OR Bookings.MixedDealBroker in (@UserId))
--Execution time: ~4000ms
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the 2nd form was equivalent and would have used that compilation approach if it were available.
Some context notes:
I've checked, and IN (@UserId) vs = @UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s of records, and the filter cuts that down to ~100.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas as to how this comes about?
What (if anything) can I do to fix the performance without just re-writing the query as a series of UNIONs (not viable for a variety of reasons)?
This is a common limitation of SQL engines, not just SQL Server but other database systems as well. The OR complicates the predicate enough that the selected execution plan isn't always ideal. This relates to the fact that, for the most part, only one index can be seeked per instance of a table object at a time, to the fact that in your specific case the OR predicate spans two different tables, and to other aspects of how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, each of which can be seeked separately in the most efficient way possible. That allows the SQL engine to pick a better execution plan to serve your query.
This is pretty much one of those things that is the way it is because that's how the engine works; remember the UNION workaround for future encounters with this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
Essentially, a new execution plan is generated whenever one doesn't already exist in the plan cache for a given query. The engine determines whether a plan is already cached by hashing the exact query statement, so even an extra space character at the end of the query can result in a new plan being generated, and that plan can theoretically be different. So a differently written query (despite being logically the same) can certainly result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as changes in the data, and the statistics on that data, in the tables referenced by the query between executions. But these reasons don't really apply to your question above.
As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select makes the UNION more efficient. The UNION can also be changed to a yet more efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> @UserId (allowing for NULLs) in the first part, to avoid overlapping results.
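A sketch of that UNION ALL variant, keeping only the IDs (the IS NULL check is there because MixedDealBroker may be nullable; adjust the anti-overlap condition to the actual schema):
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE BookingPricings.[Owner] in (@UserId)
AND (Bookings.MixedDealBroker <> @UserId OR Bookings.MixedDealBroker IS NULL)
UNION ALL
SELECT Bookings.ID, BookingPricings.ID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE Bookings.MixedDealBroker in (@UserId)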

Abap subquery Where Cond [duplicate]

I have a requirement to pull records that do not have history in an archive table. Two fields of one record need to be checked for in the archive.
In a technical sense my requirement is a left join where the right side is null (a.k.a. an excluding join), which in ABAP Open SQL is commonly implemented like this (for my scenario anyway):
Select * from xxxx //xxxx is a result for a multiple table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use the results of a sub-query to check more than 1 field, or to use another form of excluding join logic on multiple fields using SQL (without involving the application server).
I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance is acceptable (processing time is about 5 seconds), although it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is a result for a multiple table inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = xxxx~foreign_key
)
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as XXXX in my example contained a single left join ( where main to secondary table relation is 1-* ) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors without financial records in 3 tables.
Additional requirements: also exclude based on alternative payers (1-* relationship). This is what the example above is based on.
More requirements: also exclude based on alternative payees (*-* relationship between payer and payee).
Many-to-many join exponentially increased the record count within the construct I labeled XXXX, which in turn produces a lot of unnecessary work. For instance: a single customer with 3 payers, and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving left-joined tables from the main query into sub-queries and splitting them gave significantly better performance than any smarter-looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both NOT IN and NOT EXISTS work. One might be better than the other, depending on the filters you use.
When exclusion is based on 2 or more fields and you don't have many-to-many join in main query, not exists ( select .. from table where id = a.id or id = b.id or... ) appears to be the best.
The moment your exclusion criteria implement a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with one good sub-query, however good it looks).
Anyways, any additional insight into this is welcome.
EDIT 2: Although it's slightly off topic, given that my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution to expand it, and I learned that a proper excluding join does work; I had just failed horribly at implementing it the first time.
select header~key
from headers left join items on headers~key = items~key
where items~key is null
if it is possible to use the results of a sub-query to check more than 1 field or use another form of excluding join logic on multiple fields
No, it is not possible to check two columns in a subquery, as SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar
subquery.
Scalar is the keyword here, i.e. it should return exactly one column.
Your subquery can have a multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
FROM saplane AS plane
WHERE seatsmax < @wa-seatsmax AND
seatsmax >= ALL ( SELECT seatsocc
FROM sflight
WHERE carrid = @wa-carrid AND
connid = @wa-connid )
However, you say that these two fields should be checked against different tables:
Those 2 fields are also checked against two more tables
so it's not the case for you. Your only choice seems to be multi-join.
P.S. FOR ALL ENTRIES does not support negation logic; you cannot just use some sort of NOT IN FOR ALL ENTRIES, so it won't be that easy.

PostgreSQL: SQL JOIN with "Lookahead" Condition in ON or WHERE Clause

I am not sure how best to describe the problem I have, but it feels very much like an SQL query needs a lookahead condition such as those in regular expressions :). Pardon my verbosity as I try to find a way to express this problem.
I have a relatively complex schema of medical services, patient info, patient payment types (including insurances, workers comp info, etc), and medical sales commission earnings. Naturally there are sales reporting tools. What follows is a very oversimplified version of a query involving varying JOIN conditional cases. Most of this query is just context for the clause my question focuses on, the 2nd-to-last WHERE condition (which defines JOIN conditions):
SELECT vendor_services.*, patients.*, payments.*, vendor_products.*, products.*, vendor_product_commissions.*, vendor_commissions.*, commission_earners.*, users.*
FROM vendor_services
JOIN patients ON patients.vendor_id = vendor_services.vendor_id
JOIN payments ON payments.patient_id = patients.id
JOIN payment_types ON payment_types.id = payments.payment_type_id
JOIN vendor_products ON vendor_products.id = vendor_services.vendor_product_id
JOIN products ON products.id = vendor_products.product_id
JOIN vendor_product_commissions ON vendor_product_commissions.vendor_product_id = vendor_products.id
JOIN vendor_commissions ON vendor_commissions.id = vendor_product_commissions.vendor_commissions_id
LEFT JOIN commission_earners ON commission_earners.id = vendor_commissions.commission_earners_id
JOIN users ON commission_earners.user_id = users.id
WHERE
vendor_services.state != 'In Progress'
AND
vendor_services.date BETWEEN :datetime_1 AND :datetime_2
AND
vendor_commissions.start_date > :datetime_1
AND
vendor_commissions.end_date < :datetime_2
AND
vendor_product_commissions.payment_type = payment_types.type
AND
payments.transaction_type = 'Paid'
GROUP BY
....
Again, this is very oversimplified: the SELECT clause is far more complex, as are the GROUP BY and ORDER clauses, performing CASE statements and aggregate calculations, etc. I have left out many other tables which represent other systems within the overall application, and focused just on the data and clauses that are relevant. My question is in regards to a needed change to this particular WHERE condition regarding the following JOIN:
WHERE ... AND vendor_product_commissions.payment_type = payment_types.type
A new possible vendor_product_commissions.payment_type value has been introduced that is not a member of the payment_types.type values. With the SQL query exactly as is, it no longer selects rows in most cases, as much of the LEFT table will be using this new value. When an OR clause is added, duplicate rows are selected where only one row should be:
WHERE ... AND vendor_product_commissions.payment_type = payment_types.type OR vendor_product_commissions.payment_type = 'DEFAULTVALUE'
What I need is to JOIN only on the row where vendor_product_commissions.payment_type = payment_types.type, unless that produces NULL, in which case I need to perform the JOIN on vendor_product_commissions.payment_type = 'DEFAULTVALUE'.
I can do this programmatically with the ORM code surrounding this query, but that is very inefficient for a very large reporting system (essentially, query first for the specific type, then if none is returned, query again for the "default" type).
I don't believe this feature exists in PostgreSQL, but that's why I am describing it as a "lookahead JOIN" problem - I need a sort of CASE statement such that if the first JOIN condition produces a NULL relation, the subsequent JOIN (OR) condition is used to match against this newly introduced value ('DEFAULTVALUE'). Can this be done in raw SQL? Or do I need to break this whole query apart and perform the selection of services and related data, and then programmatically / iteratively relate it (via ORM/application language code) to the sales commission data? I have a strong hunch that the query can be modified to do this, but without being knowledgeable of a particular label or term for this problem, I am having a hard time searching for a possible SQL-based solution.
This is for a Ruby on Rails 4 application, using ActiveRecord, though the SQL JOIN statements are all in plaintext / strings since AR doesn't provide methods for LEFT JOIN (again, there are more types of JOIN statements than those listed above). I am not sure if Rails is relevant to my question, but I figured I would mention it.
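For what it is worth, this kind of "exact match, else default" fallback can often be expressed in plain SQL with two left joins against the same table plus a COALESCE, roughly like the sketch below. It uses only the simplified table and column names from the query above, and a CROSS JOIN to keep it self-contained, so it is an illustration of the pattern rather than a drop-in replacement for the full query:
SELECT vp.id AS vendor_product_id,
       pt.type AS payment_type,
       COALESCE(vpc_exact.id, vpc_default.id) AS vendor_product_commission_id
FROM vendor_products vp
CROSS JOIN payment_types pt
LEFT JOIN vendor_product_commissions vpc_exact
       ON vpc_exact.vendor_product_id = vp.id
      AND vpc_exact.payment_type = pt.type
LEFT JOIN vendor_product_commissions vpc_default
       ON vpc_default.vendor_product_id = vp.id
      AND vpc_default.payment_type = 'DEFAULTVALUE'
      AND vpc_exact.id IS NULL;
-- each (vendor_product, payment_type) pair carries the exact-match commission row
-- if one exists, otherwise the 'DEFAULTVALUE' row (or NULLs if neither exists)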

Optimize a query that is using multiple left joins on the same tables

I've come across a query that is taking "too long". The query has 50+ left joins between 10 or so tables. To give a brief overview of the database model, the tables joined are tables that store data for a particular data type (ex: date_fields, integer_fields, text_fields, etc.) and each has a column for the value, a "datafield" id, and a ticket id. The query is built programmatically based on an association table between a "ticket" and its "data fields".
The join statements look something like the following:
...FROM tickets t
LEFT JOIN ticket_text_fields t001 ON(t.id=t001.ticket_id AND t001.textfield_id=7)
...
LEFT JOIN ticket_date_fields t056 ON(t.id=t056.ticket_id AND t056.datafield_id=434)
Using EXPLAIN on the query shows the following (columns: id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra):
1 SIMPLE t ref idx_dataset_id idx_dataset_id 5 const 2871 Using where; Using temporary; Using filesort
1 SIMPLE t001 ref idx_ticket_id,idx_datafield_id idx_ticket_id 5 t.id 5
...
1 SIMPLE t056 ref idx_ticket_id,idx_datafield_id idx_ticket_id 5 t.id 8
What direction can I take to tune this query? All the indexes seem to be in place. Perhaps the row count of the t table (tickets, 2871) should be reduced. How many left joins are too many? Should the datafield tables be joined only once and then queried for each piece of data that is required?
You're using a variation of the terrible antipattern called Entity-Attribute-Value. You're storing attributes on separate rows, so if you want to reconstruct something that looks like a conventional row of data, you need to make one join per attribute.
It's not surprising this creates a query with 50 joins. This is far too many for most databases to run efficiently (you haven't identified which database you're using). Eventually you'll want a few more attributes and you might exceed some architectural limit of the database on the number of joins it can do.
The solution is: don't reconstruct the row in SQL.
Instead, query the attributes as multiple rows, instead of trying to combine them onto a single row.
SELECT ... FROM tickets t
INNER JOIN ticket_text_fields f ON t.id=f.ticket_id
WHERE f.textfield_id IN (7, 8, 9, ...)
UNION ALL
SELECT ... FROM tickets t
INNER JOIN ticket_date_fields d ON t.id=d.ticket_id
WHERE d.datafield_id IN (434, 435, 436, ...)
Then you have to write a function in your application to loop over the resulting rowset, and collect the attributes one by one into an object in application space, so then you can use it as if it's a single entity.
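If it helps that application-side loop, each branch of the UNION ALL can also carry a literal discriminator and the attribute value, so every row is self-describing. A sketch, assuming each *_fields table stores its payload in a column named value (that column name is a guess; adjust it to the real schema):
SELECT t.id AS ticket_id, 'text' AS field_kind, f.textfield_id AS field_id,
       f.value AS field_value
FROM tickets t
INNER JOIN ticket_text_fields f ON t.id = f.ticket_id
WHERE f.textfield_id IN (7, 8, 9)
UNION ALL
SELECT t.id, 'date', d.datafield_id,
       CAST(d.value AS CHAR)  -- cast so both branches return a comparable type
FROM tickets t
INNER JOIN ticket_date_fields d ON t.id = d.ticket_id
WHERE d.datafield_id IN (434, 435, 436);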
For a clearer query I would use something like this:
SELECT ... FROM tickets as t
JOIN ticket_text_fields as txt ON t.id = txt.ticket_id
JOIN ticket_date_fields as dt ON t.id = dt.ticket_id
WHERE txt.textfield_id IN (...)
AND dt.datefield_id IN (...)
The joins would probably be LEFT JOINs, but it depends on the structure of your data.
There is no UNION in the query and there are only two joins.

Sub-query Optimization Talk with an example case

I need advice and want to share my experience with query optimization. This week, I found myself stuck in an interesting dilemma.
I'm a novice with MySQL (2 years of theory, less than one of practice).
Environment :
I have a table that contains articles with a column 'type', another table article_version that contains the date when an article was added to the DB, and a third table that contains all the article types along with type labels and such...
The first 2 tables are huge (800,000+ rows and growing daily); the 3rd one is naturally small. The article tables have a lot of columns, but we will only need 'ID' and 'type' in articles and 'dateAdded' in article_version to simplify things...
What I want to do :
A query that, for a specified 'dateAdded', returns the number of articles for each type (there are ~50 types to scan).
What was already in place was 50 separate counts, one for each document type oO (not efficient, and slow: ~5 sec in general).
I wanted to do it all in one query and I came up with this:
SELECT type,
(SELECT COUNT(DISTINCT articles.ID)
FROM articles
INNER JOIN article_version
ON article_version.ARTI_ID = articles.ID
WHERE type = td.NEW_ID
AND dateAdded = '2009-01-01 00:00:00') AS nbrArti
FROM type_document td
WHERE td.NEW_ID != ''
GROUP BY td.NEW_ID;
The outer select (type_document) allows me to get the 55 document types I need.
The sub-Query is counting the articles for each type_document for the given date '2009-01-01'.
A common result is like :
* type * nbrArti *
*************************
* 123456 * 23 *
* 789456 * 5 *
* 16578 * 98 *
* .... * .... *
* .... * .... *
*************************
This query gets the job done, but the join in the sub-query makes it extremely slow. The reason, if I'm right, is that the join is made by the server for each type, so 50+ times; this solution is even slower than running the 50 queries independently for each type, awesome :/
A Solution
I came up with a solution myself that drastically improves the performance with the same result: I just created a view corresponding to the sub-query, making the join on IDs for each type... And boom, it's f.a.s.t.
I think, correct me if I'm wrong, that the reason is the server only runs the JOIN statement once.
This solution is ~5 times faster than the solution that was already there, and ~20 times faster than my first attempt. Sweet.
Questions / thoughts
With yet another view, I'll now need to check that I don't lose more than I gain when documents get inserted...
Is there a way to improve the original query by getting the JOIN out of the sub-query (and getting rid of the view)?
Any other tips/thoughts? (In Server Optimizing for example...)
Apologies for my approximate English; it is not my primary language.
You cannot create a single index on (type, date_added), because these fields are in different tables.
Without the view, the subquery most probably selects articles as the leading table and uses the index on type, which is not very selective.
By creating the view, you force the subquery to calculate the sums for all types first (using the selective index on date) and then use a join buffer (which is fast enough for only 55 types).
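For reference, a view along the lines the asker describes might look roughly like this (a sketch using the column names from the question; the actual view may well differ):
CREATE VIEW articles_per_type_and_date AS
SELECT articles.type, article_version.dateAdded, COUNT(DISTINCT articles.ID) AS nbrArti
FROM articles
INNER JOIN article_version
ON article_version.ARTI_ID = articles.ID
GROUP BY articles.type, article_version.dateAdded;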
You can achieve similar results by rewriting your query as this:
SELECT new_id, COALESCE(cnt, 0) AS cnt
FROM type_document td
LEFT JOIN
(
SELECT type, COUNT(DISTINCT article_id) AS cnt
FROM article_versions av
JOIN articles a
ON a.id = av.article_id
WHERE av.date = '2009-01-01 00:00:00'
GROUP BY
type
) q
ON q.type = td.new_id
Unfortunately, MySQL is not able to do table spools or hash joins, so to improve the performance you'll need to denormalize your tables: add type to article_version and create a composite index on (date, type).
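A sketch of that denormalization, using the table and column names from the rewritten query above (the INT column type is a placeholder; it should match the real type column):
ALTER TABLE article_versions ADD COLUMN type INT;
-- backfill the new column from the parent articles table
UPDATE article_versions av
JOIN articles a ON a.id = av.article_id
SET av.type = a.type;
CREATE INDEX ix_article_versions_date_type ON article_versions (date, type);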