Optimizing Mysql Table Indexing for Substring Queries

Optimizing Mysql Table Indexing for Substring Queries - sql

I have a MySQL indexing question for you guys.
I've got a very large table (~100Million Records) in MySQL that contains information about files. Most of the Queries I do on it involve substring operations on the file path column.
Here's the table ddl:
CREATE TABLE `filesystem_data`.`$tablename` (
`file_id` INT( 14 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`file_name` VARCHAR( 256 ) NOT NULL ,
`file_share_name` VARCHAR ( 100 ) NOT NULL,
`file_path` VARCHAR( 900 ) NOT NULL ,
`file_size` BIGINT( 14 ) NOT NULL ,
`file_tier` TINYINT(1) UNSIGNED NULL,
`file_last_access` DATETIME NOT NULL ,
`file_last_change` DATETIME NOT NULL ,
`file_creation` DATETIME NOT NULL ,
`file_extension` VARCHAR( 50 ) NULL ,
INDEX ( `file_path`, `file_share_name` )
) ENGINE = MYISAM
};
So for example ill have a row with a file_path like:
'\\Server100\share2\Home\Zenshai\My Documents\'
And I'll extract the User's name (Zenshai in this example) with something like
SELECT substring_index(substring_index(fp.file_path,'\\',6),'\\',-1) as Username
FROM (SELECT '\\\\Server100\\share2\\Home\\Zenshai\\My Documents\\' as file_path) fp
It gets a bit ugly, but that's not really my concern right now.
What I'd like some advice on is what kind of index (if any at all) can help speed up these types of queries on this table. Any other suggestions are welcome too.
Thanks.
PS. Although the table gets very large there is enough space for indexes.

You cannot use indices with your current table design.
You may add a column called USERNAME, fill it in the INSERT/UPDATE trigger with the expression you use in SELECT, and search on this column.
P. S. Just curious, you really have 100 mln+ files on your server?

I'd create a tiny (columns, not record count) subtable that would have the file path broken out and stored like so:
FK_TO_PARENT PATH_PART
1 Server100
1 share2
1 Home
1 Zenshai
1 My Documents
And then just index PATH_PART. Of course if the parent table is 100 Million plus, then this would be going into the billions of records.

Related

Does Adding Indexes speed up String Wildcard % searches?

We are conducting a wildcard search on a database table with column string. Does creating a non-clustered index on columns help with wildcard searches? Will this improve performance?
CREATE TABLE [dbo].[Product](
[ProductId] [int] NOT NULL,
[ProductName] [varchar](250) NOT NULL,
[ModifiedDate] [datetime] NOT NULL,
...
CONSTRAINT [PK_ProductId] PRIMARY KEY CLUSTERED
(
[ProductId] ASC
)
)
Proposed Index:
CREATE NONCLUSTERED INDEX [IX_Product_ProductName] ON [dbo].[Product] [ProductName])
for this query
select * from dbo.Product where ProductName like '%furniture%'
Currently using Microsoft SQL Server 2019.

Creating a normal index will not help(*), but a full-text index will, though you would have to change your query to something like this:
select * from dbo.Product where ProductName CONTAINS 'furniture'
(* -- well, it can be slightly helpful, in that it can reduce a scan over every row and column in your table into a scan over merely every row and only the relevant columns. However, it will not achieve the orders of magnitude performance boost that we normally expect from indexes that turn scans into single seeks.)

For a double ended wildcard search as shown, an index cannot help you by restricting the rows SQL Server has to look at - a full table scan will be carried out. But it can help with the amount of data that has to be retrieved from disk.
Because in ProductName like '%furniture%', ProductName could start or end with any string, so no index can reduce the rows that have to be inspected.
However if a row in your Product table is 1,000 characters and you have 10,000 rows, you have to load that much data. But if you have an index on ProductName, and ProductName is only 50 characters, then you only have to load 10,000 * 50 rather than 10,000 * 1000.
Note: If the query was a single ended wildcard search with % at end of 'furniture%', then the proposed index would certainly help.

First you can use FTS to search words into sentences even partially (beginning by).
For those ending by or for those containing you can use a rotative indexing technic:
CREATE TABLE T_WRD
(WRD_ID BIGINT IDENTITY PRIMARY KEY,
WRD_WORD VARCHAR(64) COLLATE Latin1_General_100_BIN NOT NULL UNIQUE,
WRD_DROW AS REVERSE(WRD_WORD) PERSISTED NOT NULL UNIQUE,
WRD_WORD2 VARCHAR(64) COLLATE Latin1_General_100_CI_AI NOT NULL) ;
GO
CREATE TABLE T_WORD_ROTATE_STRING_WRS
(WRD_ID BIGINT NOT NULL REFERENCES T_WRD (WRD_ID),
WRS_ROTATE SMALLINT NOT NULL,
WRD_ID_PART BIGINT NOT NULL REFERENCES T_WRD (WRD_ID),
PRIMARY KEY (WRD_ID, WRS_ROTATE));
GO
CREATE OR ALTER TRIGGER E_I_WRD
ON T_WRD
FOR INSERT
AS
SET NOCOUNT ON;
-- splitting words
WITH R AS
(
SELECT WRD_ID, TRIM(WRD_WORD) AS WRD_WORD, 0 AS ROTATE
FROM INSERTED
UNION ALL
SELECT WRD_ID, RIGHT(WRD_WORD, LEN(WRD_WORD) -1), ROTATE + 1
FROM R
WHERE LEN(WRD_WORD) > 1
)
SELECT *
INTO #WRD
FROM R;
-- inserting missing words
INSERT INTO T_WRD (WRD_WORD, WRD_WORD2)
SELECT WRD_WORD, LOWER(WRD_WORD) COLLATE SQL_Latin1_General_CP1251_CI_AS
FROM #WRD
WHERE WRD_WORD NOT IN (SELECT WRD_WORD
FROM T_WRD);
-- inserting cross reference words
INSERT INTO T_WORD_ROTATE_STRING_WRS
SELECT M.WRD_ID, ROTATE, D.WRD_ID
FROM #WRD AS M
JOIN T_WRD AS D
ON M.WRD_WORD = D.WRD_WORD
WHERE NOT EXISTS(SELECT 1/0
FROM T_WORD_ROTATE_STRING_WRS AS S
WHERE S.WRD_ID = M.WRD_ID
AND S.WRS_ROTATE = ROTATE);
GO
Then now you can insert into the first table all the words you want from your sentences and finding it by ending of partially in querying those two tables...
As an example, word:
WITH
T AS (SELECT 'électricité' AS W)
INSERT INTO T_WRD
SELECT W, LOWER(CAST(W AS VARCHAR(64)) COLLATE SQL_Latin1_General_CP1251_CI_AS) AS W2
FROM T;
You can now use :
SELECT * FROM T_WRD;
SELECT * FROM T_WORD_ROTATE_STRING_WRS;
To find those partial words

It depends on the optimizer. Like usually requires a full table scan. if the optimizer can scan an index for matches than it will do an index scan which is faster than a full table scan.
if the optimizer does not select an index scan you can force it to use an index. You must measure performance times to determine if using an index scan decreases search time
Use with (index(index_name)) to force an index scan e.g.
select * from t1 with (index(t1i1)) where v1 like '456%'
SQL Server Index - Any improvement for LIKE queries?
If you use %search% pattern, the optimizer will always perform a full table scan.
Another technique for speeding up searches is to use substrings and exact match searches.

Yes, the part before the first % is matched against the index. Of course however, if your pattern starts with %, then a full scan will be performed instead.

SQL Server Query intermittent performance Issue

Recently we have run into performance issues with a particular query on SQL Server (2016). The problem I'm seeing is that the performance issues are incredibly inconsistent and I'm not sure how to improve this.
The table details:
CREATE TABLE ContactRecord
(
ContactSeq BIGINT NOT NULL
, ApplicationCd VARCHAR(2) NOT NULL
, StartDt DATETIME2 NOT NULL
, EndDt DATETIME2
, EndStateCd VARCHAR(3)
, UserId VARCHAR(10)
, UserTypeCd VARCHAR(2)
, LineId VARCHAR(3)
, CallingLineId VARCHAR(20)
, DialledLineId VARCHAR(20)
, ChannelCd VARCHAR(2)
, SubChannelCd VARCHAR(2)
, ServicingAgentCd VARCHAR(7)
, EucCopyTimestamp VARCHAR(30)
, PRIMARY KEY (ContactSeq)
, FOREIGN KEY (ApplicationCd) REFERENCES ApplicationType(ApplicationCd)
, FOREIGN KEY (EndStateCd) REFERENCES EndStateType(EndStateCd)
, FOREIGN KEY (UserTypeCd) REFERENCES UserType(UserTypeCd)
)
CREATE TABLE TransactionRecord
(
TransactionSeq BIGINT NOT NULL
, ContactSeq BIGINT NOT NULL
, TransactionTypeCd VARCHAR(3) NOT NULL
, TransactionDt DATETIME2 NOT NULL
, PolicyId VARCHAR(10)
, ProductId VARCHAR(7)
, EucCopyTimestamp VARCHAR(30)
, Detail VARCHAR(1000)
, PRIMARY KEY (TransactionSeq)
, FOREIGN KEY (ContactSeq) REFERENCES ContactRecord(ContactSeq)
, FOREIGN KEY (TransactionTypeCd) REFERENCES TransactionType(TransactionTypeCd)
)
Current record counts:
ContactRecord 20million
TransactionRecord 90million
My query is:
select
UserId,
max(StartDt) as LastLoginDate
from
ContactRecord
where
ContactSeq in
(
select
ContactSeq
from
TransactionRecord
where
ContactSeq in
(
select
ContactSeq
from
ContactRecord
where
UserId in
(
'1234567890',
'1234567891' -- Etc.
)
)
and TransactionRecord.TransactionTypeCd not in
(
'122'
)
)
and ApplicationCd not in
(
'1',
'4',
'5'
)
group by
UserId;
Now the query isn't great and could be improved using joins, however it does fundamentally work.
The problem I'm having is that our data job takes an input of roughly 7100 userIds. These are then broken up into groups of 500. For each 500 these are used in the IN clause in this query. The first 14 executions of this query with 500 items in the IN clause execute fine. Results are returned in roughly 15-20 seconds for each.
The issue is with the remaining 100 give or take for the last execution of this query. It never seems to complete. It just hangs. In our data job it is timing out after 10 minutes. I have no idea why. I'm not an expert with SQL Server so I'm not really sure how to debug this. I have executed each sub query independently and then replaced the contents of the sub query with the returned data. Doing this for each sub query works fine.
Any help is really appreciated here as I'm at a loss to how this works so consistently with larger amounts of parameters but just doesn't work with only a fraction.
EDIT
I've got three example of execution plans here. Please note that each of these were executed on a test server and all executed almost instantly as there is very little data on this test equivalent.
This is the execution plan for 500 arguments which executes fine in production, returning in roughly 15-20 seconds:
This is the execution plan for 119 arguments which is timing out in our data job after 10 minutes:
This is the execution plan for 5 arguments which executes fine. This query is not explicitly being executed in the data job but just for comparison:
In all instances SSMS has given the following warning:
/*
Missing Index Details from SQLQuery2.sql
The Query Processor estimates that implementing the following index could improve the query cost by 26.3459%.
*/
/*
USE [CloasIvr]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[TransactionRecord] ([TransactionTypeCd])
INCLUDE ([ContactSeq])
GO
*/
Is this the root cause to this problem?

Without seeing what's going on, it is hard to know exactly what's going on - especially with the ones that are failing. Execution plans for 'good' runs can help a bit but we're just guessing what's going wrong in bad runs.
My initial guess (similar to my comment) is that the estimates of what it expects is very wrong, and it creates a plan that is very bad.
Your TransactionRecord table in particular, with the detail column that is 1000 characters, would could have big issues with an unexpected large number of nested loops.
Indexes
The first thing I would suggest is indexing - particularly to a) only include a subset of the data you need for these, and b) to have them ordered in a useful manner.
I suggest the following two indexes would appear to help
CREATE INDEX IX_ContactRecord_User ON ContactRecord
(UserId, ContactSeq)
INCLUDE (ApplicationCD, Startdt);
CREATE INDEX IX_TransactionRecord_ContactSeq ON TransactionRecord
(ContactSeq, TransactionTypeCd);
These are both 'covering indexes', as well as being sorted in ways that can help.
Alternatively, you could replace the first one with a slightly modified version (sorting first on ContactSeq) but I think the above version would be more useful.
CREATE INDEX IX_ContactRecord_User2 ON ContactRecord
(ContactSeq)
INCLUDE (ApplicationCD, Startdt, UserId);
Also, regarding the index on TransactionRecord - if this is the only query that would be using that index, you could improve it by creating the following index instead
CREATE INDEX IX_TransactionRecord_ContactSeq_Filtered ON TransactionRecord
(ContactSeq, TransactionTypeCd)
WHERE (TransactionTypeCD <> '122');
The above is a filtered index that matches what's specified in the WHERE clause of your statement. The big thing about this is that it has already a) removed the records where the type <> '122', and b) has sorted the records already on ContactSeq so it's then easy to look them up.
By the way - given you asked about adding indexes on Foreign Keys on principle - the use of these really depends on how you read the data. If you are only ever referring to the referenced table (e.g., you have an FK to a status table, and only ever use it to report, in English, the statuses) then an index on the original table's Status_ID wouldn't help. On the other hand, if you want to find all the rows with Status_ID = 4, then it would help.
To help understanding indexes, I strongly recommend Brent Ozar's How to think like an SQL Server Engine - it really helped me to understand how indexes work in practice.
Use a sorted temp table
This may help but is unlikely to be the primary fix. If you pre-load the relevant UserIDs into a temporary table (with a primary key on UserID) then it may help with the relevant JOIN. It may also be easier for you to modify each run rather than have to modify the middle of the query.
CREATE TABLE #Users (UserId VARCHAR(10) PRIMARY KEY);
INSERT INTO #Users (UserID) VALUES
('1234567890'),
('1234567891');
Then replace the middle section of your query with
where
ContactSeq in
(
select
ContactSeq
from
ContactRecord CR
INNER JOIN #Users U ON CR.UserID = U.UserID
)
and TransactionRecord.TransactionTypeCd not in
(
'122'
)
Simplify the query
I had a go at simplifying the query, and got it to this:
select CR.UserId,
max(CR.StartDt) as LastLoginDate
from ContactRecord CR
INNER JOIN TransactionRecord TR ON CR.ContactSeq = TR.ContactSeq
where TR.TransactionTypeCd not in ('122')
AND CR.ApplicationCd not in ('1', '4', '5')
AND CR.UserId in ('1234567890', '1234567891') -- etc
group by UserId;
or alternatively (with the temp table)
select CR.UserId,
max(CR.StartDt) as LastLoginDate
from ContactRecord CR
INNER JOIN #Users U ON CR.UserID = U.UserID
INNER JOIN TransactionRecord TR ON CR.ContactSeq = TR.ContactSeq
where TR.TransactionTypeCd not in ('122')
AND CR.ApplicationCd not in ('1', '4', '5')
group by UserId;
One advantage of simplifying the query, is that it also helps SQL Server get good estimates; which in turn help it get good execution plans.
Of course, you would need to test that the above returns exactly the same records in your circumstances - I don't have a data set to test on, so I cannot be 100% sure these simplified versions match the original.

Optimize SQL Query (If possible) using CONVERT(INT, SUBSTRING( and LEN FUNCTION

My situation is like that :
I have these tables:
CREATE TABLE [dbo].[HeaderResultPulser]
(
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[ReportNumber] CHAR(255) NOT NULL,
[ReportDescription] CHAR(255) NOT NULL,
[CatalogNumber] NCHAR(255) NOT NULL,
[WorkerName] NCHAR(255) DEFAULT ('') NOT NULL,
[LastCalibrationDate] DATETIME NOT NULL,
[NextCalibrationDate] DATETIME NOT NULL,
[MachineNumber] INT NOT NULL,
[EditTime] DATETIME NOT NULL,
[Age] NCHAR(255) DEFAULT ((1)) NOT NULL,
[Current] INT DEFAULT ((-1)) NOT NULL,
[Time] BIGINT DEFAULT ((-1)) NOT NULL,
[MachineName] NVARCHAR(MAX) DEFAULT ('') NOT NULL,
[BatchNumber] NVARCHAR(MAX) DEFAULT ('') NOT NULL,
CONSTRAINT [PK_HeaderResultPulser]
PRIMARY KEY CLUSTERED ([Id] ASC)
);
CREATE TABLE [dbo].[ResultPulser]
(
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[ReportNumber] CHAR(255) NOT NULL,
[BatchNumber] CHAR(255) NOT NULL,
[DateTime] DATETIME NOT NULL,
[Ocv] FLOAT(53) NOT NULL,
[OcvMin] FLOAT(53) NOT NULL,
[OcvMax] FLOAT(53) NOT NULL,
[Ccv] FLOAT(53) NOT NULL,
[CcvMin] FLOAT(53) NOT NULL,
[CcvMax] FLOAT(53) NOT NULL,
[Delta] BIGINT NOT NULL,
[DeltaMin] BIGINT NOT NULL,
[DeltaMax] BIGINT NOT NULL,
[CurrentFail] BIT DEFAULT ((0)) NOT NULL,
[NumberInTest] INT NOT NULL
);
For every row in HeaderResultPulser I have multiple rows in ResultPulser
my key is the [HeaderResultPulser].[ReportNumber] to get a list of data in ResultPulser, and for every a lot of row with the same [ResultPulser].[ReportNumber]
It has multiple [ResultPulser].[NumberInTest] values
For example: in the ResultPulser table the data can look like this:
ReportNumber | NumberInTest
-------------+-------------
0000006211 | 1
0000006211 | 2
0000006211 | 3
0000006211 | 4
0000006211 | 5
0000006211 | 6
0000006212 | 1
0000006212 | 2
0000006212 | 3
0000006212 | 4
0000006212 | 5
NumberInTest can be 200, 500, 10000 and sometime even more..
The report number column contains two the first 7 chars are a number of machine and the rest is an incrementing number.
For example, 0000006212 is [0000006][212] == [the machine number][the incrementing number]
My query for example :
select
[HeaderResultPulser].[ReportNumber],
max(NumberInTest) as TotalCells
from
ResultPulser, HeaderResultPulser
where
((([ResultPulser].[ReportNumber] like '0000006%' and
CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8, LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815')
and ([HeaderResultPulser].[ReportNumber] = [ResultPulser].[ReportNumber])))
group by
[HeaderResultPulser].[ReportNumber]
Actually I want to get all the rows on the machine number 0000006 that number was 211 to 815 (include both)
This query takes about 6-7 seconds
There is a lot of data (in the hundreds of millions and billions and in the future can be more and can be much more in table ResultPulser), and it can get Tens of thousands of rows in HeaderResultPulser table
And In getting receive I only receive on select a few hundred in the worst case a thousand or about two thousand if I want to go far... but (in numbers) to get the max(NumberInTest) from ResultPulser I take about (It can get to a few millions of rows)
There is any way to optimize my query? Or when It's so much data it's just must this time? (That just the way it is)

The way you are doing joins is no longer standard. It's also hard to read, and dangerous if you ever need to use left joins. Instead of joining this way:
select *
from T1, T2
where T1.column = T2.column
Use ANSI-92 join syntax instead:
select *
from T1
join T2 on T1.column = T2.column
You said that your "key" was ReportNumber. Why isn't that declared in your schema? It sounds like you want a unique constraint on HeaderResultPulser.ReportNumber, and a foreign key on the the ReportPulser table, such that ReportNumber references HeaderResultPulser (ReportNumber)
Since your report number column seems to contain two different values, your table is not in First Normal Form. This is making things difficult for you. Why not split the two parts of the "report number" into two different columns when the data is entered? This will significantly improve your query performance, because you no longer need to perform an expression against the data in the table at query time to separate the ReportNumber into atomic values.
Your comment says that the first 7 characters of the ReportNumber are the MachineNumber. But you already have MachineNumber in the HeaderReportPulser table. So why not just add a separate column for Increment? If you still need ReportNumber to exist as a column, you can make it a calculated column, as the concatenation of MachineNumber and Increment.
If you don't want to touch the "existing" schema, we can do a similar thing in reverse. Your query will not be completely sargable unless you can do something to the schema, because you have to perform some kind of expression on the data in the ReportNumber column. But maybe you have the option to use a calculated column to do this up front:
alter table HeaderReportPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
Now we have the increment as a column in its own right. But it's still being calculated at query time, because it's not persisted. We can make it persisted:
alter table HeaderReportPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7) persisted;
We can also index a computed column. Since your required expression is deterministic and precise (see Indexes on Computed Columns), we don't actually have to mark it as persisted:
alter table HeaderReportPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
create index ix_headerreportpulser_increment on HeaderReportPulser(Increment);
You could do a similar set of operations to create the Increment and MachineNumber on the ReportPulser table. If you always want to use both values, create an index on the combination of (MachineNumber, Increment)

The biggest performance gain might be eliminating the outer group by by using a correlated subquery or lateral join:
select hrp.[ReportNumber],
(select max(rp.NumberInTest)
from ResultPulser rp
where rp.ReportNumber = hrp.ReportNumber and
right(rp.ReportNumber, 3) between '211' and '815'
) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber like '0000006%';
Your logic looks like it only wants the last three characters of the ReportNumber, so I simplified the logic. I'm not 100% that is the case -- it just seems reasonable. Regardless, there is no need to convert the values to integers and then compare as strings. And similar logic can be used even for longer report numbers.
You also want an index on ResultPulser(ReportNumber, NumberInTest) :
create index idx_resultpulser_reportnumber_numberintest on ResultPulser(ReportNumber, NumberInTest)
EDIT:
Actually, I notice that the report number matches between the two tables. So this seems simplest:
select hrp.[ReportNumber],
(select max(rp.NumberInTest)
from ResultPulser rp
where rp.ReportNumber = hrp.ReportNumber
) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber >= '0000006211' and
hrp.ReportNumber <= '0000006815';
You still want to be sure you have the above index on ResultPulser.
If the ReportNumber is not a fixed 10 digits, then you can use:
where hrp.ReportNumber >= '0000006211' and
hrp.ReportNumber <= '0000006815' and
len(hrp.ReportNumber) = 10
This should also use the index and return exactly what you want.

Performance Optimization of any query depends on many factors including environment you are hosting and running your query. Hardware and Software play important part in optimization of heavy running database queries. In your case you can look into following things:
USE ANSI 92 JOIN syntax instead of default cross join
e.g
select *
from T1
join T2 on T1.column = T2.column
Put indexes on columns like
[ReportNumber]
[NumberInTest]
Note: You may need index for each column in the join area which is not primary key.
Remember use of MAX is always heavy and that could be the main problem in your query.
Finally you can further look into optimizing your query syntax using following online tool where you can specify your actual query and environment you are using:
https://www.eversql.com/
Hope it help you.

If you really want to optimize performance, I propose to add a bit of logic beyond SQL structures.
Is it possible that particular value of ReportNumber is present in table ResultPulser, but not in table HeaderResultPulser? If not, and I ssupose so, there is no reason to join table HeaderResultPulser.
Then, I propose to take advantage from fact, that the condition on ReportNumber can be expressed equivalently without dividing in substrings. For your example, the condition
([ResultPulser].[ReportNumber] like '0000006%' and
CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8,
LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815')
is equivalent to:
([ResultPulser].[ReportNumber] BETWEEN '0000006211' and '0000006815')
So the proposal is:
Create index on table ResultPulser(ReportNumber, NumberInTest)
Use selections similar to this:
select ReportNumber, max(NumberInTest) as TotalCells
from ResultPulser
where
ReportNumber BETWEEN '0000006211' and '0000006815'
group by
ReportNumber
(Please, add brackets or double quotes and capitalizations as necessary for MS SQL Server and your taste)
I would expect that good database will execute this query by index-only access, and it will be optimal from execution point of view.
Performance depends on not only on execution path, but also on setup and hardware. Please, make sure that your database has enough cache and fast disk accesses. Also concurrent load is very important.
Simple splitting the field ReportNumber into [the machine number] and [the incrementing number] will probably not improve performance of the query in form proposed by me. But it may be very convenient for other forms of access (other WHERE classes). And it will reflect the structure of the case. Even more important: It will release you from imposed limits. Currently, you have 3 digits for the [the incrementing number]. Are you sure, it will never be necessary to have more than 999 of them for single [the machine number]?
Why the field ReportNumber has type char(255), when only 10 characters are used? char(255) has fixed length, so it will be terrible wasting of space. Only database compression can help. Used space has strong influence on performance – Please, consider the above remark about the database cache.
If both these fields, [the machine number], [the incrementing number], are intergers, why not split ReportNumber and use integer type for them?
Side remark: Field names suggest that you search the total number of rows in table ResultPulser, which belong to single entry in table HeaderResultPulser. The proposed query will deliver this, only if numbers in NumberInTest are consecutive, without gaps. If this is not supplied, you have to count them rather than seek the maximum.

Faster Sqlite insert from another table

I have an Sqlite DB which I am doing updates on and its very slow. I am wondering if I am doing it the best way or is there a faster way. My tables are:
create table files(
fileid integer PRIMARY KEY,
name TEXT not null,
sha256 TEXT,
created INT,
mtime INT,
inode INT,
nlink INT,
fsno INT,
sha_id INT,
size INT not null
);
create table fls2 (
fileid integer PRIMARY KEY,
name TEXT not null UNIQUE,
size INT not null,
sha256 TEXT not null,
fs2,
fs3,
fs4,
fs7
);
Table 'files' is actually in an attached DB named ttb. I am then doing this:
UPDATE fls2
SET fs3 = (
SELECT inode || 'X' || mtime || 'X' || nlink
FROM
ttb.files
WHERE
ttb.files.fsno = 3
AND
fls2.name = ttb.files.name
AND
fls2.sha256 = ttb.files.sha256
);
So the idea is, fls2 has values in 'name' which are also present in ttb.files.name. In ttb.files there are other parameters which I want to insert into the corresponding rows in fls2. The query works but I assume the matching up of the two tables is taking the time, and I wonder if theres a more efficient way to do it. There are indexes on each column in fls2 but none on files. I am doing it as a transaction, and pragma journal = memory (although sqlite seems to be ignoring that because a journal file is being created).
It seems slow, so far about 90 minutes for around a million rows in each table.
One CPU is pegged so I assume its not disk bound.
Can anyone suggest a better way to structure the query?
EDIT: EXPLAIN QUERY PLAN
|--SCAN TABLE fls2
`--CORRELATED SCALAR SUBQUERY 1
`--SCAN TABLE files
Not sure what that means though. It carries out the SCAN TABLE files for each SCAN TABLE fls2 hit?
EDIT2:
Well blimey, Crtl-C the query which had been running 2.5 hours at that point, exit Sqlite, run sqlite with the files DB, create index (sha256, name) - 1 minute or so. Exit that, run Sqlite with the main DB. Explain shows that now the latter scan is done with the index. Run the update - takes 150 seconds. Compared to >150 minutes, thats a heck of a speed up. Thanks for the assistance.
TIA, Pete

There are indexes on each column in fls2
Indexes are used for faster selection. They slow down inserts and updates. Maybe removing the one for fls2.fs3 helps?

Not an expert on sqlite, but on some databases it is more performant to insert the data into temporary table, delete them, then insert them from the temp table.
Insert into tmptab
Select fileid,
name,
size,
sha256,
fs2,
inode || 'X' || mtime || 'X' || nlink,
fs4,
fs7
From fls2
Inner join files on
fls2.name = ttb.files.name
AND
fls2.sha256 = ttb.files.sha256
delete from
Fls2 where exists (select 1 from tmptab where tmptab.<primary key> = fls2.<primary key>)
Insert into fls2 select * from tmptab

Fastest way to find not null filed in sqlite

I have sqlite table
CREATE TABLE IF NOT EXISTS [app_status](
[id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
[status] TEXT DEFAULT NULL
)
This table is having multiple records like
1 "success"
2 NULL
where NULL is sqlite NULL
What is the fastest way to find out if table at-least one row where status IS NOT NULL?
Can I create some index or something else which I can use to count Not NULL fields ?
I have written following query
SELECT 1 \
FROM [app_status]\
WHERE [status] IS NOT NULL
But it is taking 3 ms to 50 ms. I want to further optimize this time. How can I do that ?

Add an index on that column if it doesn't already have one, and/or limit your select.
select 1 from [app_status] where [status] is not null limit 1;
You don't need to go through the whole table if you're just checking if the column contains at least one non-null field.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas