I am wondering which is the best way to query data from a large database.
Say I have a requirement to get a list of all users who live in the United States, along with their orders and the products belonging to those orders. For simplicity, assume a User table with a CountryId column, an Order table with a UserId column, and an OrderProduct table linking many products to an order (and many orders can contain the same product).
My question is: would it be better to first create a temp table with something like
SELECT userId FROM dbo.User WHERE CountryId = #CountryId
We now have the relevant users in a temp table.
Then, do a
Select p.ProductDescription ...
From #TempTable tmp
INNER JOIN Order o
ON o.UserId = tmp.UserId
INNER JOIN OrderProduct op
ON op.OrderID = o.OrderId
INNER JOIN Product p
ON p.ProductId = op.ProductId
So what I am doing is getting the users I need, moving them into a temp table, and then using that temp table to filter the data for the main query.
Or is it just as efficient, if not more so, to do it all in one query:
Select ... from User u
INNER JOIN Order o
....
WHERE u.CountryId = #CountryId
?
In general you want to write your entire request in one query, because that gives the database query optimizer the best possible chance of coming up with the most efficient plan. A decent database will generally do a wonderful job at this, and any effort on your part to help it along is more likely to hurt than help.
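To make this concrete, a single-statement version of the query from the question might look something like the sketch below (table and column names are taken from the question; @CountryId stands in for the #CountryId placeholder, and User and Order are bracketed because they are reserved words):
SELECT u.UserId,
       o.OrderId,
       p.ProductDescription
FROM dbo.[User] u
INNER JOIN dbo.[Order] o
    ON o.UserId = u.UserId
INNER JOIN dbo.OrderProduct op
    ON op.OrderId = o.OrderId
INNER JOIN dbo.Product p
    ON p.ProductId = op.ProductId
WHERE u.CountryId = @CountryId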
If the database is insufficiently fast, first look into things like whether you have the right indexes, tuning your database, etc. Those are the most common causes of slowness, and addressing them should clear up most problems fairly quickly.
Only after you have given the database every chance to get the right answer in the right way, should you consider trying to use temp tables to force a particular query plan. (There are other reasons to use temp tables. But for getting good query plans, it should be a last resort.)
There is an old pair of rules about optimization that applies here in spades.
Don't.
(For experts only.) Not yet.
You could create a view that holds the data you need.
A view is created like this:
CREATE VIEW view_name AS
SELECT column_name(s)
FROM table_name
WHERE condition
You can then query the view you created just as you would query a table.
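For the scenario in the question, such a view might look roughly like this (a sketch only; the view name is made up, and the table and column names come from the question):
CREATE VIEW UserOrderProducts AS
SELECT u.UserId,
       u.CountryId,
       o.OrderId,
       p.ProductId,
       p.ProductDescription
FROM dbo.[User] u
INNER JOIN dbo.[Order] o ON o.UserId = u.UserId
INNER JOIN dbo.OrderProduct op ON op.OrderId = o.OrderId
INNER JOIN dbo.Product p ON p.ProductId = op.ProductId
You would then query it with, for example, SELECT ProductDescription FROM UserOrderProducts WHERE CountryId = @CountryId.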
Consider a products table where some info, like TITLE and BRAND, is kept in a huge "translations" table in multiple languages.
tblTranslations
ROW_ID
COL_NAME
LANG
VALUE
Since the set of languages in use is dynamic, keeping titles for the desired languages in the main table as columns like TITLE_EN and TITLE_FR is not an option. So we have complicated queries like the one below:
SELECT
P.ID,
(...)
T1.VALUE AS TITLE,
T2.VALUE AS BRAND
FROM
tblProducts P
-- first join for TITLEs in the selected language
LEFT JOIN
tblTranslations T1 ON T1.ROW_ID = P.ID
AND T1.COL_NAME = 'TITLE'
AND T1.LANG = '{$selectedLang}'
-- second join for BRANDs
LEFT JOIN
tblTranslations T2 ON T2.ROW_ID = P.ID
AND T2.COL_NAME = 'BRAND'
AND T2.LANG = '{$selectedLang}'
WHERE (...)
This was an overly simplified example; our real-life queries have many other tables joined for dynamic attributes etc., and this makes our websites slow to a crawl.
My question: is it a better approach to dynamically create a table and use it for SELECTs only? This table will be updated when the main data is updated, and would be re-created when a new language is added.
tblProductsDynamic
ID
TITLE_EN
TITLE_FR
BRAND_EN
BRAND_FR
(...)
SELECT
ID,
TITLE_{$selectedLang} AS TITLE,
BRAND_{$selectedLang} AS BRAND
FROM
tblProductsDynamic
Will this horizontally expanded table perform better, since it avoids all those joins?
Obviously, pre-computing values is going to be a performance win when you query the data. You have to balance that against the cost:
Maintaining the pivoted table is going to be somewhat expensive.
It is going to be especially expensive when a new language is added.
You may have data integrity problems, caused by lags in the construction of the summary table.
Maintaining the triggers, stored procedures or whatever for the summary table complicates the code base.
That said, your left joins should not be particularly expensive with the right indexes. I would first want to investigate a solution using the base tables you have described. Only then would I think about options for summarizing the data.
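For example, a composite index along these lines (a sketch in SQL Server syntax; the index name is made up and the exact syntax depends on your database) would let each of those LEFT JOINs be satisfied by an index seek rather than a scan:
CREATE INDEX IX_tblTranslations_RowColLang
    ON tblTranslations (ROW_ID, COL_NAME, LANG)
    INCLUDE (VALUE);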
I'm trying to optimize a query. Basically, there are 3 parts to a transaction that can be repeated. I log all communications, but want to get the "freshest" of the 3 parts. The 3 parts are all linked through a single intermediate table (unfortunately) which is what is slowing this whole thing down (too much normalization?).
There is the center of the "star", "Transactions"; then the middle spokes (all represented by "TransactionDetails"), which refer to the hub using the "Transactions" primary key; then the outer spokes ("PPGDetails", "TicketDetails" and "CompletionDetails"), all of which refer to "TransactionDetails" by its primary key.
Each of "PPGDetails", "TicketDetails" and "CompletionDetails" will have exactly one row in "TransactionDetails" that they link to, by primary key. There can be many of each of these pairs of objects per transaction.
So, in order to get the most recent TicketDetails for a transaction, I use this view:
CREATE VIEW [dbo].[TicketTransDetails] AS
select *
from TicketDetails tkd
join (select MAX(TicketDetail_ID) as TicketDetail_ID
from TicketDetails temp1
join TransactionDetails temp2
on temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
group by temp2.Transaction_ID) qq
on tkd.TicketDetail_ID = qq.TicketDetail_ID
join TransactionDetails td
on tkd.TransactionDetail_ID = td.TransactionDetail_ID
GO
The other 2 detail types have similar views.
Then, to get all of the transaction details I want, one row per transaction, I use:
select *
from Transactions t
join CompletionTransDetails cpd
on t.Transaction_ID = cpd.Transaction_ID
left outer join TicketTransDetails tkd
on t.Transaction_ID = tkd.Transaction_ID
left outer join PPGTransDetails ppd
on t.Transaction_ID = ppd.Transaction_ID
where cpd.DateAndTime between '2/1/2017' and '3/1/2017'
It is by design that I want ONLY transactions that have at least 1 "CompletionDetail", but 0 or more "PPGDetail" or "TicketDetail".
This query returns the correct results, but takes 40 seconds to execute on decent server hardware, and a "Merge Join (Left Outer Join)" immediately before the SELECT accounts for 100% of the execution plan cost.
If I take out the join to either PPGTransDetails or TicketTransDetails in the final query, it brings the execution time down to ~20 seconds, so a marked improvement, but still doing a Merge Join over a significant number of records (many extraneous, I assume).
When just a single transaction is selected (via the WHERE clause), the query only takes about 4 seconds, and its final step is then a "Nested Loops" join which also takes a large portion of the time (96%). I would like this query to take less than a second.
Since the views don't have a primary key, I assume that is causing the Merge Join to proceed. That said, I am having trouble creating a query that emulates this functionality - much less one that is more efficient.
Can anyone help me recognize what I may be missing?
Thanks!
--mobrien118
Edit: Adding more info -
Here is the effective data model:
Essentially, for a single transaction, there can be MANY PPGDetails, TicketDetails and CompletionDetails, but each one will have its own TransactionDetails (they are one-to-one, but not enforced in the model, just in software).
There are currently:
1,619,307 "Transactions"
3,564,518 "TransactionDetails"
512,644 "PPGDetails"
1,471,826 "TicketDetails"
1,580,043 "CompletionDetails"
There are currently no foreign key constraints or indexes set up on these items.
First a quick remark:
which also takes a large portion of the time (96%).
This is a bit of a (common) misconception. The 96% there is an estimate of how much of the resources that "block" will need. It by no means indicates that 96% of the time inside the query was spent on it. I've had situations where operators that accounted for more than half of the query time were attributed virtually no cost.
Additionally, you seem to be assuming that when you query or join to the view, the system will first prepare the data from the view and only later use that result to work out the rest of the query. This is not the case: the system will "expand" the view and optimize the combined query, taking everything into account.
For us to understand what's going on you'll need to provide us with the query plan (a .sqlplan file if you use SqlSentry Plan Explorer); either that, or a full description of the table layout, indexes, foreign keys, etc., plus some information about the data (total rows, expected matches between tables, and so on).
PS: even though everybody seems to be touting 'hash joins' as the solution to everything, nested loops and merge joins often are more efficient.
(Trying to understand your queries: is this view equivalent to yours?)
[edit: incorrect view removed to avoid confusion]
Second try: (think I have it right this time)
CREATE VIEW [dbo].[TicketTransDetails] AS
SELECT td.Transaction_ID, tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
JOIN (SELECT MAX(TicketDetail_ID) as max_TicketDetail_ID, temp2.Transaction_ID
FROM TicketDetails temp1
JOIN TransactionDetails temp2
ON temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
GROUP BY temp2.Transaction_ID) qq
ON qq.max_TicketDetail_ID = tkd.TicketDetail_ID
AND qq.Transaction_ID = td.Transaction_ID
It might not be any faster when querying the entire table, but it should be when fetching specific records from the Transactions table.
Indexing-wise you probably want a unique index on TicketDetails (TransactionDetail_ID, TicketDetail_ID)
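In SQL Server syntax, that might be created like this (the index name is just an example):
CREATE UNIQUE INDEX UX_TicketDetails_TransactionDetail_TicketDetail
    ON TicketDetails (TransactionDetail_ID, TicketDetail_ID);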
You'll need similar constructs for the other tables, of course.
Thinking it through a bit further I think this would work too:
CREATE VIEW [dbo].[TicketTransDetails]
AS
SELECT *
FROM (
SELECT td.Transaction_ID,
TicketDetail_ID_rownr = ROW_NUMBER() OVER (PARTITION BY td.Transaction_ID ORDER BY tkd.TicketDetail_ID DESC),
tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
) xx
WHERE TicketDetail_ID_rownr = 1 -- we want the "first one from the end" only
It looks quite a bit more readable, but I'm not sure whether it would be faster; you'll have to compare timings and query plans.
I have two tables. VEHICLES and OWNERSHIP. I am trying to make a query that will give me a list of all VEHICLES NOT in the OWNERSHIP table. I basically need a report of my available VEHICLE inventory. I tried this query:
SELECT VEHICLE.*
FROM VEHICLE, OWNERSHIP
WHERE (VEHICLE.VEH_ID <> OWNERSHIP.VEH_ID);
I'm getting:
When I do an equal I get all vehicles which are listed in the ownership so that works. But the NOT Equal does not. Any ideas?
The <> join does not do what you might expect: it pairs every vehicle with every OWNERSHIP row that has a different VEH_ID, so nearly every vehicle still comes back. What you want are the vehicles with no matching OWNERSHIP row at all. Try
SELECT VEHICLE.*
FROM VEHICLE
WHERE NOT EXISTS
(SELECT NULL FROM OWNERSHIP WHERE VEHICLE.VEH_ID= OWNERSHIP.VEH_ID);
The NOT EXISTS approach can be slow if your tables contain many rows. An alternative approach which can be much faster is to use a LEFT JOIN with a WHERE clause to return only the rows where the right-hand join field is Null.
SELECT VEHICLE.*
FROM
VEHICLE AS v
LEFT JOIN OWNERSHIP AS o
ON v.VEH_ID = o.VEH_ID
WHERE o.VEH_ID Is Null;
You could use Access' "Find Unmatched Query Wizard" to create a similar query.
If both tables are small you probably won't notice a difference. But it should be easy to check whether the difference is noticeable. And this approach will serve you better if your tables grow substantially over time.
We have a table-valued function that returns a list of people you may access, and we have a relation between a search and a person called a search result.
What we want to do is select all the people from the search and present them.
The query looks like this
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is an index. The query plan tells me 97% of the total cost is spent doing "Key Lookup" with the seek predicate
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? And why is it taking so long?
The total number of people is only 25; with the index, this should be a scan of the QueryMembership rows that have QueryID = 1234 and then a JOIN on the 25 people that exist in the table-valued function, which, by the way, only has to be evaluated once and completes in less than 1 second.
If you want to avoid the key lookup, use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add any other columns you are going to select to the INCLUDE list.
As for why the key lookup on the PK is so slow, try DBCC FREEPROCCACHE, ALTER INDEX ALL ON QueryMembership REBUILD, or ALTER INDEX ALL ON QueryMembership REORGANIZE.
This may help if the PK's index is disabled or the plan cache is holding on to a bad plan.
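For reference, those maintenance commands are shown below; be careful running them on a busy production server, since clearing the plan cache affects the whole instance and rebuilding locks the table:
-- Clear cached execution plans so a fresh plan is compiled
DBCC FREEPROCCACHE;
-- Rebuild all indexes on the table (also re-enables a disabled index)
ALTER INDEX ALL ON QueryMembership REBUILD;
-- Or the lighter-weight alternative
ALTER INDEX ALL ON QueryMembership REORGANIZE;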
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes, it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your wuery.
You need indexes on the columns in your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join: it is probably doing a table scan on the joined table because of the sort. Put an index on the columns in your WHERE clause first and you will be able to see whether that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
, PersonAccess As
(
Select PersonId
From dbo.GetPersonAccess(1)
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised that you couldn't combine that concept into a CTE or a query with a derived table.
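For comparison, a minimal sketch of the temp-table variant being referred to (the temp table name is made up) might be:
-- Materialize the small, filtered subset first
SELECT qm.PersonID, p.FullName
INTO #QueryMembershipPerson
FROM QueryMembership qm
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234;

-- Then join that subset to the table-valued function
SELECT qmp.PersonID, qmp.FullName
FROM #QueryMembershipPerson qmp
INNER JOIN dbo.GetPersonAccess(1) pa ON pa.PersonID = qmp.PersonID;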
I've been trying to come up with a good design pattern for mapping data contained in relational databases to the business objects I've created but I keep hitting a wall.
Consider the following tables:
TYPE: typeid, description
USER: userid, username, usertypeid->TYPE.typeid, imageid->IMAGE.imageid
IMAGE: imageid, location, imagetypeid->TYPE.typeid
I would like to gather all the information regarding a specific user. Creating a query for this isn't too difficult.
SELECT u.*, ut.*, i.*, it.* FROM user u
INNER JOIN type ut ON ut.typeid = u.usertypeid
INNER JOIN image i ON i.imageid = u.imageid
INNER JOIN type it ON it.typeid = i.imagetypeid
WHERE u.userid = #userid
The problem is that the field names collide and then I'm forced to alias every single field which gets out of hand very quickly.
Does anyone have a decent design pattern for this kind of thing?
I've thought about retrieving multiple results from a single stored procedure and then using a dataset to iterate through each one but I'm worried that some performance issues might bite me in the butt later. For example instead of the above query something like:
SELECT u.*, t.* FROM user u
INNER JOIN type t ON t.typeid = u.usertypeid
WHERE u.userid = #userid;
SELECT i.*, t.* FROM image i
INNER JOIN type t ON t.typeid = i.imagetypeid
INNER JOIN user u ON u.imageid = i.imageid
WHERE u.userid = #userid;
Does that seem like a decent solution? Can anyone foresee any issues with this approach?
Never use the SQL * wildcard in production code. Always spell out all the columns you want to retrieve.
Then aliasing some of them doesn't seem like such a huge amount of extra work.
Re your comment asking for background and reasoning:
Sometimes you don't really need every column from all tables, and fetching them can be needlessly costly (especially for large strings and blobs). There is no SQL syntax for "all columns except the following exceptions."
You can't alias columns that you fetch using the wildcard. Once you need to alias any of the columns, you need to expand the wildcard to list all the columns explicitly.
If the table structure changes, e.g. columns are renamed, reordered, dropped, or added, then the wildcard fetches them all, by position as defined in the tables. This may seem like a convenience, but not when your application depends on columns being in the result set by a given name or in a given position. You can get mysterious bugs where your application displays columns in the wrong order (if referencing columns by position), or shows them as blank (if referencing columns by name).
However, if the SQL query names columns explicitly, you can employ the "Fail Early" principle. This helps debugging, because it leads you directly to the SQL query that needs to be edited to account for the schema change.
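Applied to the query in the question, that means spelling out the columns and aliasing only where names actually collide; a sketch (the alias names are just illustrative) might be:
SELECT u.userid,
       u.username,
       ut.description AS usertype_description,
       i.imageid,
       i.location     AS image_location,
       it.description AS imagetype_description
FROM user u
INNER JOIN type ut ON ut.typeid = u.usertypeid
INNER JOIN image i ON i.imageid = u.imageid
INNER JOIN type it ON it.typeid = i.imagetypeid
WHERE u.userid = #userid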