I am facing the following issue with SQLite.
I have a view which is supposed to normalize the table (convert names to numbers). For this purpose it requires many LEFT JOINs and becomes really slow without indexes. However, the indexes I create are not used by the query, which I can see from the query plan.
The view:
SELECT
STORENAME.ID AS _STORENAME_ID,
ARTICLEID.ID AS _ARTICLEID_ID,
ARTICLENAME.ID AS _ARTICLENAME_ID,
SECTION.ID AS _SECTION_ID,
....
SALESQUANTITY,
SALESAMOUNT
FROM
_SO_BUFFER_SHACOOP AS MAIN
LEFT JOIN _t_so_branch AS STORENAME ON MAIN.STORENAME = STORENAME.NAME
LEFT JOIN _t_so_itemnumber AS ARTICLEID ON MAIN.ARTICLEID = ARTICLEID.NAME
LEFT JOIN _t_so_itemname AS ARTICLENAME ON MAIN.ARTICLENAME = ARTICLENAME.NAME
LEFT JOIN _t_so_sections AS SECTION ON MAIN.SECTION = SECTION.NAME
...
WHERE
NOT (
salesamount is null or (ARTICLENAME is null and (storename is null or storename = 'TOTAL'))
)
When I examine the query plan, I can see the following:
SQLite is using SCAN TABLE instead of SEARCH TABLE for _t_so_itemnumber (ARTICLEID). This is really strange, since I have two indexes covering this join: one on the MAIN table and one on the _t_so_itemnumber table (see the indexes below).
SQLite is not using any of my indexes; it uses automatic indexes only.
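For reference, this is how I inspect the plan (the view name NORMALIZED_SALES here is a placeholder for my actual view):
EXPLAIN QUERY PLAN SELECT * FROM NORMALIZED_SALES;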
I have the following indexes (among many others):
CREATE INDEX _ix_tsoitemnumber__itemname on _t_so_itemnumber(name)
CREATE INDEX _ix_sobufffershacoop_articleid ON _SO_BUFFER_SHACOOP(ARTICLEID)
I am not sure what's wrong here. Why is SQLite ignoring the indexes that were created specifically to cover this join?
I have read multiple sources and still don't understand where a big performance difference comes from in a query I have for Microsoft SQL Server.
I need to count different alerts linked to vehicles (IdMateriel is the id of the vehicle) based on type (CodeAlerte), state (Etat), and a Top true/false column, but before the different counts I need to select the data.
TL;DR: There are two parameters: the current date, as an SQL DATETIME, and a VARCHAR(MAX) string of entity codes separated by commas, which I split using STRING_SPLIT so the codes can be used either in a WHERE ... IN clause or in an INNER JOIN. Using them in the WHERE ... IN clause is ~10x faster than the INNER JOIN, although the two seem equivalent to me. Why?
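For clarity, STRING_SPLIT just turns the CSV parameter into a one-column rowset; a minimal illustration with made-up codes:
SELECT value FROM STRING_SPLIT('E01,E02,E03', ',');
-- returns three rows: E01, E02, E03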
First, the queries are based on the view created as follows:
CREATE OR ALTER VIEW [dbo].[AlertesVehicules]
WITH SCHEMABINDING
AS
SELECT dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte,
dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat,
Vehicule.Top, Vehicule.EtatVehicule, COUNT_BIG(*) AS COUNT
FROM dbo.Alerte
INNER JOIN dbo.Materiel on dbo.Alerte.IdMateriel = dbo.Materiel.Id
INNER JOIN dbo.Vehicule on dbo.Vehicule.Id = dbo.Materiel.Id
INNER JOIN dbo.TypeAlerte on dbo.Alerte.IdTypeAlerte = dbo.TypeAlerte.Id
WHERE dbo.Materiel.EntiteGestion is NOT NULL
AND dbo.TypeAlerte.CodeAlerte IN ('P07','P08','P09','P11','P12','P13','P14')
GROUP BY dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte,
dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat,
Vehicule.Top, Vehicule.EtatVehicule
GO
CREATE UNIQUE CLUSTERED INDEX IX_AlerteVehicule
ON dbo.AlertesVehicules (EntiteGestion,IdMateriel,CodeAlerte,Etat,TopAlerteMajeure);
GO
This first version of the query takes ~100ms:
SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTabAlertes
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
INNER JOIN (SELECT value AS entiteGestion FROM STRING_SPLIT(@entiteGestion, ',')) eg
ON a.EntiteGestion = eg.entiteGestion
WHERE
a.CodeAlerte IN ('P08','P09')
AND @currentDate <= ISNULL(ase.DateFin, @currentDate)
According to SQL Sentry Plan Explorer, the actual execution plan starts with an index seek on dbo.Alerte taking ~30% of the time, with the predicate Alerte.IdTypeAlerte = TypeAlerte.Id. It outputs 369 000 rows of Etat, IdMateriel, and IdTypeAlerte, which are then filtered directly down to 7 742 based on the predicate PROBE(Opt_Bitmapxxxx, Alerte.IdMateriel). It then performs an inner join, taking ~25% of the time, with the 2 results of another index seek, this one on TypeAlerte with the predicates TypeAlerte.CodeAlerte = N'P08' and = N'P09'. Just these two parts take > 50ms, and I don't understand why there are so many initial results.
The second version takes ~10ms:
SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTab
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
WHERE
a.EntiteGestion IN (SELECT value FROM STRING_SPLIT(@entiteGestion, ','))
AND a.CodeAlerte IN ('P08','P09')
AND (@currentDate <= ISNULL(ase.DateFin, @currentDate))
For this one, SQL Sentry Plan Explorer shows the plan starting with a View Clustered Index Seek directly on the view AlertesVehicules, with Seek Predicates AlertesVehicules.EntiteGestion > Exprxxx1 and AlertesVehicules.EntiteGestion < Exprxxx2, and the Predicate AlertesVehicules.CodeAlerte = N'P08' and = N'P09'.
Why are those two treated so differently when it seems to me that they are exactly equivalent?
For reference, here are some threads I already looked into but didn't find an explanation in (other than that there shouldn't be a difference):
SQL JOIN vs IN performance?
Performance gap between WHERE IN (1,2,3,4) vs IN (select * from STRING_SPLIT('1,2,3,4',','))
I have a problem with the performance of this query. If I remove the ORDER BY clause everything works well, but I really need the ordering. I have tried many indexes but without any result. Can you help me, please?
SELECT *
FROM "refuel_request" AS "refuel_request"
LEFT OUTER JOIN "user" AS "user" ON "refuel_request"."user_id" = "user"."user_id"
LEFT OUTER JOIN "bill_qr" AS "bill_qr" ON "refuel_request"."bill_qr_id" = "bill_qr"."bill_qr_id"
LEFT OUTER JOIN "car" AS "order.car" ON "refuel_request"."car_id" = "order.car"."car_id"
LEFT OUTER JOIN "refuel_request_status" AS "refuel_request_status" ON "refuel_request"."refuel_request_status_id" = "refuel_request_status"."refuel_request_status_id"
WHERE
refuel_request."refuel_request_status_id" IN ( '1', '2', '3')
ORDER BY "refuel_request".created_at desc
LIMIT 10
Here is the EXPLAIN for this query:
EXPLAIN (ANALYZE, BUFFERS)
Primary Keys and/or Foreign Keys
pk_refuel_request_id
refuel_request_bill_qr_id_fkey
refuel_request_user_id_fkey
All outer-joined tables are 1:n related to refuel_request. This means your query is looking for the last ten created refuel requests with status 1 to 3.
You are outer joining the tables because not every refuel_request is related to a user, a bill_qr, a car, and a status (or you outer join mistakenly). Either way, none of the joins changes the number of retrieved rows; it's still one row per refuel request. In order to join the other tables' rows, the DBMS just needs their primary key indexes. Nothing to worry about.
The only thing we must care about is finding the top refuel_request rows for the statuses you are interested in as quickly as possible.
Use a partial index that only contains data for the statuses in question. The column to index is created_at, so that the top 10 rows can be read off immediately.
CREATE INDEX idx ON refuel_request (created_at DESC)
WHERE refuel_request_status_id IN (1, 2, 3);
Partial indexes are explained here: https://www.postgresql.org/docs/current/indexes-partial.html
You cannot have an index that supports both the WHERE condition and the ORDER BY, because you are using IN and not =.
The fastest option is to split the query into three parts, so that each part compares refuel_request.refuel_request_status_id with =. Combine these three queries with UNION ALL. Each of the queries has ORDER BY and LIMIT 10, and you wrap the whole thing in an outer query that has another ORDER BY and LIMIT 10.
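A sketch of that rewrite, using the table and status values from the query above (each branch can then use an index on (refuel_request_status_id, created_at)):
SELECT *
FROM (
    (SELECT * FROM refuel_request
     WHERE refuel_request_status_id = '1'
     ORDER BY created_at DESC LIMIT 10)
    UNION ALL
    (SELECT * FROM refuel_request
     WHERE refuel_request_status_id = '2'
     ORDER BY created_at DESC LIMIT 10)
    UNION ALL
    (SELECT * FROM refuel_request
     WHERE refuel_request_status_id = '3'
     ORDER BY created_at DESC LIMIT 10)
) AS top_per_status
ORDER BY created_at DESC
LIMIT 10;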
Then you need these indexes:
CREATE INDEX ON refuel_request (refuel_request_status_id, created_at);
CREATE INDEX ON "user" (user_id);
CREATE INDEX ON bill_qr (bill_qr_id);
CREATE INDEX ON car (car_id);
CREATE INDEX ON refuel_request_status (refuel_request_status_id);
You need at least the indexes for the joins (do you really need LEFT joins?)
LEFT OUTER JOIN "user" AS "user" ON "refuel_request"."user_id" = "user"."user_id"
So, refuel_request.user_id must be in the index
LEFT OUTER JOIN "bill_qr" AS "bill_qr" ON "refuel_request"."bill_qr_id" =
LEFT OUTER JOIN "car" AS "order.car" ON "refuel_request"."car_id" =
bill_qr_id and car_id too
LEFT OUTER JOIN "refuel_request_status" AS "refuel_request_status" ON "refuel_request"."refuel_request_status_id" =
and refuel_request_status_id
WHERE
refuel_request."refuel_request_status_id" IN ( '1', '2', '3')
refuel_request_status_id must be the first key in the index as we need it in the WHERE
ORDER BY "refuel_request".created_at desc
and then created_at, since it's in the ORDER BY clause. This will not improve performance per se, but it allows the ORDER BY to run without requiring access to the table data, the same reason we put the other non-WHERE columns in there. Of course, a partial index is even better: we shift the WHERE condition into the partiality clause and use created_at for the rest (the LIMIT 10 then means we can do without the extra columns in the index, since retrieving those few 1:N rows costs very little; in a different situation we might find it useful to keep them).
So, one index that contains, in this order:
refuel_request_status_id, created_at, bill_qr_id, car_id, user_id
^ WHERE                   ^ ORDER BY  ^ used by the JOINs
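Spelled out as DDL (the index names are mine):
-- One composite index covering WHERE, ORDER BY, and the join columns
CREATE INDEX idx_refuel_request_covering
ON refuel_request (refuel_request_status_id, created_at DESC, bill_qr_id, car_id, user_id);
-- Or the partial variant described above
CREATE INDEX idx_refuel_request_recent
ON refuel_request (created_at DESC)
WHERE refuel_request_status_id IN ('1', '2', '3');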
However, do you really need SELECT *? I believe you'd get better performance if you only included the fields you're really going to use.
The most effective index for this query would be on refuel_request (refuel_request_status_id, created_at DESC) so that both the main filtering and the ordering can be done using the index. You also want indexes on the columns you're joining, but those tables are small and inconsequential at the moment. In any case, the index I suggest isn't actually going to help much with the performance pain points you're having right now. Here are some suggestions:
Don't use SELECT * unless you really need all of the columns from all of these tables you're joining. Specifying only the necessary columns means postgres can load less data into memory, and work over it faster.
Postgres is spending a lot of time on the joins, joining about a million rows each time, when you're really only interested in ten of those rows. We can encourage it to do the order/limit first by rearranging the query somewhat:
WITH refuel_request_subset AS MATERIALIZED (
SELECT *
FROM refuel_request
WHERE refuel_request_status_id IN ('1', '2', '3')
ORDER BY created_at DESC
LIMIT 10
)
SELECT *
FROM refuel_request_subset AS refuel_request
LEFT OUTER JOIN "user" ON refuel_request_subset.user_id = "user".user_id
LEFT OUTER JOIN bill_qr ON refuel_request.bill_qr_id = bill_qr.bill_qr_id
LEFT OUTER JOIN car AS "order.car" ON refuel_request.car_id = "order.car".car_id
LEFT OUTER JOIN refuel_request_status ON refuel_request.refuel_request_status_id = refuel_request_status.refuel_request_status_id;
Note: This assumes that the LEFT JOINS will not add rows to the result set, as is the case with your current dataset.
This trick only really works if you have a fixed number of IDs, but you can do the refuel_request_subset query separately for each ID and then UNION the results, as opposed to using the IN operator. That would allow postgres to fully use the index mentioned above.
I have two tables, ClaimPaymentHistory and RemittanceHistory, which I am currently joining with the following query.
select rh."EventHistory"
from "ClaimPaymentHistory" ph, jsonb_array_elements(ph."EventHistory") payments
inner join "RemittanceHistory" rh
on payments ->> 'rk' = rh."RemittanceRefKey"::text
where ph."ClaimRefKey" = @ClaimRefKey
I wanted to improve this query using the following index:
CREATE INDEX claim_payment_history_gin_idx ON "ClaimPaymentHistory"
USING gin ("EventHistory" jsonb_path_ops)
But I don't appear to get any improvement with this. However, I can see this index being leveraged if I query the EventHistory column of that table using the @> containment operator, for example like so:
select * from "ClaimPaymentHistory" where "EventHistory" @> '[{"rk": 637453920516771103}]';
So my question is, am I able to create a join using that contains operator? I've been playing with the syntax but can't get anything to work.
If I am unable to create a join with that operator, what would be my best options for indexing?
That index could be used if you wrote the query like this:
select rh."EventHistory"
from "RemittanceHistory" rh join "ClaimPaymentHistory" ph
on ph."EventHistory" @> jsonb_build_array(jsonb_build_object('rk', rh."RemittanceRefKey"))
where ph."ClaimRefKey" = 5;
However, this is unlikely to have good performance unless "RemittanceHistory" has few rows in it.
...what would be my best options for indexing?
The obvious choice, if you don't have them already, would be regular (btree) indexes on rh."RemittanceRefKey" and ph."ClaimRefKey".
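For instance (the index names are mine):
CREATE INDEX remittance_history_refkey_idx ON "RemittanceHistory" ("RemittanceRefKey");
CREATE INDEX claim_payment_history_claimrefkey_idx ON "ClaimPaymentHistory" ("ClaimRefKey");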
Also, look at (and show us) the EXPLAIN (ANALYZE, BUFFERS) for the original query you want to make faster.
I wound up refactoring the table structure. Instead of a join through RemittanceRefKey I added a JSONB column to RemittanceHistory called ClaimRefKeys. This is simply an array of integer values and now I can lookup the desired rows with:
select "EventHistory" from "RemittanceHistory" where "ClaimRefKeys" @> @ClaimRefKey;
This combined with the following index gives pretty fantastic performance.
CREATE INDEX remittance_history_claimrefkeys_gin_idx ON "RemittanceHistory" USING gin ("ClaimRefKeys" jsonb_path_ops);
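For example, with a literal in place of the parameter (jsonb containment treats a top-level array as containing a matching scalar):
select "EventHistory" from "RemittanceHistory" where "ClaimRefKeys" @> '637453920516771103';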
I have a complex query that I want to use as the Source of a Merge into a table. This will be executed over millions of rows. Currently I am trying to apply constraints to the data by inserting it into a temp table before the merge.
The operations are:
Filter out duplicate data.
Join some tables to pull in additional data
Insert into the temp table.
Here is the query.
-- Get all Orders that aren't in the system
WITH Orders AS
(
SELECT *
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact ofct
ON ofct.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND ofct.DataSetId = o.DataSetId
AND ofct.IsDelete = 0
)
)
INSERT INTO #VendorBOrders
(
CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,BillingProviderId
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,RelatedOrderId
,DataSetId
)
SELECT
vc.CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,bp.Id
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,ro.Id
,o.DataSetId
FROM
Orders o
-- Join related orders to match orders sharing same instance
JOIN Maps.VendorBRelatedOrder ro
ON ro.OrderControlNumber = o.OrderControlNumber
AND ro.EquitableCustomerId = o.EquitableCustomerId
AND ro.DataSetId = o.DataSetId
JOIN BillingProvider bp
ON bp.ProviderNPI = o.ProviderNPI
-- Join on customers and fail if the customer doesn't exist
LEFT OUTER JOIN [Maps].VendorBCustomer vc
ON vc.ExtenalCustomerId = o.ExtenalCustomerId
AND vc.VendorId = o.VendorId;
I am wondering if there is anything I can do to optimize it for time. I have tried using the Database Engine Tuning Advisor, but this query takes 100x more CPU time than the other queries I am running. Is there anything else I can look into, or can the query not be improved further?
A CTE is just syntax, and that CTE is evaluated (run) as part of that join.
First, just run it as a SELECT statement (no insert).
If the SELECT is slow:
Move the CTE to a #temp table so it is evaluated once and materialized (see the sketch below).
Put an index (a PK if applicable) on the three join columns.
If the SELECT is not slow, then the time is going into the insert on #VendorBOrders:
First create only the PK, and sort the insert on the PK so as not to fragment that clustered index.
Then, AFTER the insert is complete, build any other necessary indexes.
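A sketch of the materialization step, reusing the tables and columns from the question (the temp table and index names are mine):
-- Evaluate the CTE once and materialize it
SELECT o.*
INTO #Orders
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact ofct
ON ofct.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND ofct.DataSetId = o.DataSetId
AND ofct.IsDelete = 0
);
-- Index the three columns used to join Maps.VendorBRelatedOrder
CREATE CLUSTERED INDEX ix_Orders_JoinKeys
ON #Orders (OrderControlNumber, EquitableCustomerId, DataSetId);
The INSERT then selects FROM #Orders AS o instead of from the CTE.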
Generally, when I do speed testing, I check the individual parts of the SQL to see where the problem lies. Turn on the execution plan and see where most of the time is going. Also, if you want to be quick and dirty, highlight just your CTE and run it on its own. Is it fast? Yes? Move on.
I have at times found that a single index being off throws off the whole complex logic of joins, merely because the database does one part of something large and then has to find that piece.
Another idea: if you have a fast tempdb on a production environment or the like, dump your CTE to a temp table as well. Put an index on it and see if that speeds things up. Sometimes CTEs, table variables, and temp tables lose some performance at joins. I have found that creating an index on such a partial object improves performance at times, but you are also putting more load on tempdb when you do this, so keep that in mind.
We have a table-valued function that returns a list of people you may access, and we have a relation between a search and a person called a search result.
We want to select all the people from the search and present them.
The query looks like this:
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is indexed. The query plan tells me 97% of the total cost is spent doing a "Key Lookup" with the seek predicate:
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? And why is it taking so long?
The people total only 25. With the index, this should be a scan over the QueryMembership rows that have QueryID = 1234, and then a JOIN against the 25 people that the table-valued function returns, which, by the way, only has to be evaluated once and completes in less than a second.
If you want to avoid the "key lookup", use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add any further columns that you are going to select to the INCLUDE arguments.
As for why the PK's "key lookup" is working so slowly, try DBCC FREEPROCCACHE, ALTER INDEX ALL ON QueryMembership REBUILD, or ALTER INDEX ALL ON QueryMembership REORGANIZE.
These may help if your PK's index is disabled or the cache keeps a wrong plan.
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Engine Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your query.
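As an illustration of that temp-table trick (a sketch only, since the exact temp-table solution isn't shown; table names follow the query above):
-- Materialize the known-small subset first
SELECT qm.PersonID
INTO #QueryMembers
FROM QueryMembership qm
WHERE qm.QueryID = 1234;
-- Then join against the tiny temp table
SELECT t.PersonID, p.FullName
FROM #QueryMembers t
INNER JOIN dbo.GetPersonAccess(1) pa ON pa.PersonID = t.PersonID
INNER JOIN Person p ON p.PersonID = t.PersonID;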
You need indexes on the columns in your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join. I bet it is doing the table scan on your joined table because of the sort. By putting an index on the columns in your WHERE clause first, you will be able to see if that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
, PersonAccess As
(
Select PersonId
From dbo.GetPersonAccess(1)
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised if you couldn't combine that concept into a CTE or a query with a derived table.