Why does PostgreSQL search a text index faster than an int index? - sql

CREATE TABLE index_test
(
id int PRIMARY KEY NOT NULL,
text varchar(2048) NOT NULL,
value int NOT NULL
);
CREATE INDEX idx_index_value ON index_test ( value );
CREATE INDEX idx_index_value_and_text ON index_test ( value, text );
CREATE INDEX idx_index_text_and_value ON index_test ( text, value );
CREATE INDEX idx_index_text ON index_test ( text );
The table is populated with 10,000 random rows; the 'value' column holds integers from 0 to 100, and the 'text' column holds a random 128-bit MD5 hash. Sorry for using bad column names.
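(A minimal sketch of how such a table can be populated, assuming PostgreSQL's md5() and generate_series(); this script is my reconstruction, not the asker's original:)
INSERT INTO index_test (id, text, value)
SELECT g, md5(random()::text), (random() * 100)::int
FROM generate_series(1, 10000) AS g;
ANALYZE index_test;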
My searches are:
select * from index_test r where r.value=56;
select * from index_test r where r.value=56 and r.text='dfs';
select * from index_test r where r.text='sdf';
Anytime I run one of these searches...
whether only the single-column indexes on the 'text' and/or 'value' columns are present,
or whether the combined ('text' and 'value' together) indexes are present,
... I always see the following picture:
The search on the integer column 'value' is slower: it is combined from 2 operations, a *Bitmap Heap Scan on index_test* driven by a *Bitmap Index Scan on idx_index_value*.
The search on the varchar column 'text' is faster: it always uses a plain index scan.
Why is searching for a string easier than searching for an integer?
Why do the search plans differ in that way?
Are there similar situations where this effect can be reproduced and be helpful for developers?

As the text is a hash, unique by definition, there will be only one row among the 10k rows of the table matching that text.
The value 56 will occur about 100 times inside the 10k rows, and those rows will be scattered all over the table. So the planner first goes to the index to find the pages where those rows are (the Bitmap Index Scan). Then it visits each of those scattered pages (the Bitmap Heap Scan) to retrieve the rows.
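A minimal sketch to see both plans side by side, assuming the table, indexes and data above (after ANALYZE):
EXPLAIN SELECT * FROM index_test r WHERE r.value = 56; -- expect Bitmap Index Scan + Bitmap Heap Scan
EXPLAIN SELECT * FROM index_test r WHERE r.text = 'sdf'; -- expect a single Index Scan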

Related

Does Adding Indexes speed up String Wildcard % searches?

We are conducting a wildcard search on a database table with a string column. Does creating a non-clustered index on that column help with wildcard searches? Will this improve performance?
CREATE TABLE [dbo].[Product](
[ProductId] [int] NOT NULL,
[ProductName] [varchar](250) NOT NULL,
[ModifiedDate] [datetime] NOT NULL,
...
CONSTRAINT [PK_ProductId] PRIMARY KEY CLUSTERED
(
[ProductId] ASC
)
)
Proposed Index:
CREATE NONCLUSTERED INDEX [IX_Product_ProductName] ON [dbo].[Product] ([ProductName])
for this query
select * from dbo.Product where ProductName like '%furniture%'
Currently using Microsoft SQL Server 2019.
Creating a normal index will not help(*), but a full-text index will, though you would have to change your query to something like this:
select * from dbo.Product where CONTAINS(ProductName, 'furniture')
(* -- well, it can be slightly helpful, in that it can reduce a scan over every row and column in your table to a scan over merely every row and only the relevant columns. However, it will not achieve the orders-of-magnitude performance boost that we normally expect from indexes that turn scans into single seeks.)
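For completeness, a minimal sketch of the full-text setup such a query needs (my assumption: it presumes a default full-text catalog already exists, and it reuses the PK_ProductId key index from the table above):
CREATE FULLTEXT INDEX ON dbo.Product(ProductName)
KEY INDEX PK_ProductId;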
For a double-ended wildcard search as shown, an index cannot help you by restricting the rows SQL Server has to look at: a full table scan will be carried out. But it can help with the amount of data that has to be retrieved from disk.
Because in ProductName like '%furniture%', ProductName could start or end with any string, so no index can reduce the rows that have to be inspected.
However, if a row in your Product table is 1,000 characters and you have 10,000 rows, a table scan has to load 10,000 * 1,000 characters' worth of data. But if you have an index on ProductName, and ProductName is only 50 characters, then you only have to load 10,000 * 50.
Note: if the query were a single-ended wildcard search, with the % only at the end ('furniture%'), then the proposed index would certainly help.
First, you can use FTS to search for words within sentences, even partially (matching the beginning of a word).
For words ending with, or containing, a given fragment, you can use a rotating-index technique:
CREATE TABLE T_WRD
(WRD_ID BIGINT IDENTITY PRIMARY KEY,
WRD_WORD VARCHAR(64) COLLATE Latin1_General_100_BIN NOT NULL UNIQUE, -- the word itself
WRD_DROW AS REVERSE(WRD_WORD) PERSISTED NOT NULL UNIQUE, -- reversed word, for "ends with" seeks
WRD_WORD2 VARCHAR(64) COLLATE Latin1_General_100_CI_AI NOT NULL); -- case/accent-insensitive copy
GO
CREATE TABLE T_WORD_ROTATE_STRING_WRS
(WRD_ID BIGINT NOT NULL REFERENCES T_WRD (WRD_ID), -- the original word
WRS_ROTATE SMALLINT NOT NULL, -- how many leading characters were stripped
WRD_ID_PART BIGINT NOT NULL REFERENCES T_WRD (WRD_ID), -- the suffix, itself stored as a word in T_WRD
PRIMARY KEY (WRD_ID, WRS_ROTATE));
GO
CREATE OR ALTER TRIGGER E_I_WRD
ON T_WRD
FOR INSERT
AS
SET NOCOUNT ON;
-- splitting words
WITH R AS
(
SELECT WRD_ID, TRIM(WRD_WORD) AS WRD_WORD, 0 AS ROTATE
FROM INSERTED
UNION ALL
SELECT WRD_ID, RIGHT(WRD_WORD, LEN(WRD_WORD) -1), ROTATE + 1
FROM R
WHERE LEN(WRD_WORD) > 1
)
SELECT *
INTO #WRD
FROM R;
-- inserting missing words
INSERT INTO T_WRD (WRD_WORD, WRD_WORD2)
SELECT WRD_WORD, LOWER(WRD_WORD) COLLATE SQL_Latin1_General_CP1251_CI_AS
FROM #WRD
WHERE WRD_WORD NOT IN (SELECT WRD_WORD
FROM T_WRD);
-- inserting cross reference words
INSERT INTO T_WORD_ROTATE_STRING_WRS
SELECT M.WRD_ID, ROTATE, D.WRD_ID
FROM #WRD AS M
JOIN T_WRD AS D
ON M.WRD_WORD = D.WRD_WORD
WHERE NOT EXISTS(SELECT 1/0
FROM T_WORD_ROTATE_STRING_WRS AS S
WHERE S.WRD_ID = M.WRD_ID
AND S.WRS_ROTATE = ROTATE);
GO
Now you can insert into the first table all the words you want from your sentences, and find them by suffix or by partial match by querying those two tables...
As an example, inserting a word:
WITH
T AS (SELECT 'électricité' AS W)
INSERT INTO T_WRD
SELECT W, LOWER(CAST(W AS VARCHAR(64)) COLLATE SQL_Latin1_General_CP1251_CI_AS) AS W2
FROM T;
You can now use:
SELECT * FROM T_WRD;
SELECT * FROM T_WORD_ROTATE_STRING_WRS;
to inspect the stored words and their rotations.
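As a hedged sketch of a lookup under this scheme (my own query, not part of the original answer): to find words containing 'ctri', prefix-match the stored suffixes, which a B-tree index can seek, and join back to the original words:
SELECT W.WRD_WORD
FROM T_WORD_ROTATE_STRING_WRS AS S
JOIN T_WRD AS D ON D.WRD_ID = S.WRD_ID_PART -- the suffix entry
JOIN T_WRD AS W ON W.WRD_ID = S.WRD_ID -- the original word
WHERE D.WRD_WORD LIKE 'ctri%'; -- sargable prefix search; matches 'électricité'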
It depends on the optimizer. LIKE usually requires a full table scan. If the optimizer can scan an index for matches, then it will do an index scan, which is faster than a full table scan.
If the optimizer does not select an index scan, you can force it to use an index. You must measure performance times to determine whether using the index scan actually decreases search time.
Use with (index(index_name)) to force an index scan e.g.
select * from t1 with (index(t1i1)) where v1 like '456%'
SQL Server Index - Any improvement for LIKE queries?
If you use a %search% pattern, the optimizer will always perform a full table scan.
Another technique for speeding up searches is to use substrings and exact-match searches.
Yes, the part before the first % is matched against the index. However, if your pattern starts with %, then a full scan will be performed instead.
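To illustrate the last point with the Product example from above (a hedged sketch):
SELECT * FROM dbo.Product WHERE ProductName LIKE 'furniture%' -- sargable: the prefix before % can seek the index
SELECT * FROM dbo.Product WHERE ProductName LIKE '%furniture' -- not sargable: a leading % forces a scan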

Faster Sqlite insert from another table

I have an SQLite DB which I am doing updates on, and it's very slow. I am wondering if I am doing it the best way, or is there a faster way. My tables are:
create table files(
fileid integer PRIMARY KEY,
name TEXT not null,
sha256 TEXT,
created INT,
mtime INT,
inode INT,
nlink INT,
fsno INT,
sha_id INT,
size INT not null
);
create table fls2 (
fileid integer PRIMARY KEY,
name TEXT not null UNIQUE,
size INT not null,
sha256 TEXT not null,
fs2,
fs3,
fs4,
fs7
);
Table 'files' is actually in an attached DB named ttb. I am then doing this:
UPDATE fls2
SET fs3 = (
SELECT inode || 'X' || mtime || 'X' || nlink
FROM
ttb.files
WHERE
ttb.files.fsno = 3
AND
fls2.name = ttb.files.name
AND
fls2.sha256 = ttb.files.sha256
);
So the idea is: fls2 has values in 'name' which are also present in ttb.files.name. In ttb.files there are other parameters which I want to insert into the corresponding rows in fls2. The query works, but I assume the matching up of the two tables is taking the time, and I wonder if there's a more efficient way to do it. There are indexes on each column in fls2 but none on files. I am doing it as a transaction, with pragma journal = memory (although SQLite seems to be ignoring that, because a journal file is being created).
It seems slow, so far about 90 minutes for around a million rows in each table.
One CPU is pegged, so I assume it's not disk-bound.
Can anyone suggest a better way to structure the query?
EDIT: EXPLAIN QUERY PLAN
|--SCAN TABLE fls2
`--CORRELATED SCALAR SUBQUERY 1
`--SCAN TABLE files
I'm not sure what that means, though. Does it carry out the SCAN TABLE files once for each row of the SCAN TABLE fls2?
EDIT2:
Well blimey. Ctrl-C the query, which had been running 2.5 hours at that point; exit SQLite; run SQLite with the files DB; create an index on (sha256, name) - 1 minute or so. Exit that, run SQLite with the main DB. EXPLAIN shows that the latter scan is now done with the index. Run the update - it takes 150 seconds. Compared to >150 minutes, that's a heck of a speed-up. Thanks for the assistance.
TIA, Pete
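For reference, a sketch of the fix described in EDIT2 (the index name is my own; the columns match the correlated subquery's WHERE clause):
CREATE INDEX idx_files_sha_name ON files(sha256, name);
-- EXPLAIN QUERY PLAN should now show a SEARCH on files using the index instead of a SCAN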
There are indexes on each column in fls2
Indexes are used for faster selection. They slow down inserts and updates, so maybe removing the one on fls2.fs3 helps.
I'm not an expert on SQLite, but on some databases it is more performant to insert the matched data into a temporary table, delete the original rows, then re-insert them from the temp table:
INSERT INTO tmptab
SELECT fls2.fileid,
fls2.name,
fls2.size,
fls2.sha256,
fls2.fs2,
ttb.files.inode || 'X' || ttb.files.mtime || 'X' || ttb.files.nlink,
fls2.fs4,
fls2.fs7
FROM fls2
INNER JOIN ttb.files
ON fls2.name = ttb.files.name
AND fls2.sha256 = ttb.files.sha256
AND ttb.files.fsno = 3; -- same filter as the original UPDATE

DELETE FROM fls2
WHERE EXISTS (SELECT 1 FROM tmptab WHERE tmptab.fileid = fls2.fileid);

INSERT INTO fls2 SELECT * FROM tmptab;

Where Clause Index Scan - Index Seek

I have below table:
CREATE TABLE Test
(
Id int IDENTITY(1,1) NOT NULL,
col1 varchar(37) NULL,
testDate datetime NULL
)
insert Test (col1)
select null
go 700000
insert Test (col1)
select cast(NEWID() as varchar(37))
go 300000
And below indexes:
create clustered index CIX on Test(ID)
create nonclustered index IX_RegularIndex on Test(col1)
create nonclustered index IX_RegularDateIndex on Test(testDate)
When I query on my table:
SET STATISTICS IO ON
select * from Test where col1=NEWID()
select * from Test where TestDate=GETDATE()
The first does an index scan whereas the second does an index seek. I expected both of them to do an index seek. Why does the first one do an index scan?
There is an implicit convert generated, because the NEWID() function returns a value of the uniqueidentifier datatype, which is different from the VARCHAR datatype declared for the column.
Just try hovering your mouse over the SELECT part of the plan, where there is a "warning" sign.
Due to the mismatch between the compared datatypes, the optimizer can't look at the statistics and estimate how many rows with that NEWID() value there are in the table.
Because of the implicit convert, the optimizer thus decides that it is better to go and get all the rows (hence the SCAN), then pass them through the FILTER operation, where it converts the value of col1 to the uniqueidentifier datatype and removes the rows that do not match the filter condition.
As opposed to GETDATE(), which returns a datetime value of the same datatype as your testDate column, so no datatype conversion is needed and the values can be compared as they are.
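A hedged sketch of a fix (my own example): cast the generated value to the column's type on the other side of the comparison, so col1 is left untouched and the optimizer can seek on IX_RegularIndex:
DECLARE @v varchar(37) = CAST(NEWID() AS varchar(37));
SELECT * FROM Test WHERE col1 = @v; -- no implicit convert on col1, so an index seek is possible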

Find position(s) in array matching a given sub-array

Given this table:
CREATE TABLE datasets.travel(path integer[], path_timediff double precision[]);
INSERT INTO datasets.travel
VALUES (array[50,49,49,49,49,50], array[NULL,438,12,496,17,435]);
I am looking for some kind of function or query in PostgreSQL that, for a given input array[49,50], will find the matching consecutive index values in path, which is [5,6] here, and the corresponding element in path_timediff, which is 435 in the example (array index 6).
My ultimate purpose is to find all such occurrences of [49,50] in path and all the corresponding elements in path_timediff. How can I do that?
Assuming you have a primary key in your table you did not show:
CREATE TABLE datasets.travel (
travel_id serial PRIMARY KEY
, path integer[]
, path_timediff float8[]
);
Here is one way with generate_subscripts() in a LATERAL join:
SELECT t.travel_id, i+1 AS position, path_timediff[i+1] AS timediff
FROM (SELECT * FROM datasets.travel WHERE path @> ARRAY[49,50]) t
, generate_subscripts(t.path, 1) i
WHERE path[i:i+1] = ARRAY[49,50];
This finds all matches, not just the first.
i+1 works for a sub-array of length 2. Generalize with i + array_length(sub_array, 1) - 1, as sketched below.
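A hedged sketch of the generalized query, written with the literal ARRAY[49,50] standing in for an arbitrary sub-array:
SELECT t.travel_id
     , i + array_length(ARRAY[49,50], 1) - 1 AS position
     , path_timediff[i + array_length(ARRAY[49,50], 1) - 1] AS timediff
FROM (SELECT * FROM datasets.travel WHERE path @> ARRAY[49,50]) t
   , generate_subscripts(t.path, 1) i
WHERE path[i : i + array_length(ARRAY[49,50], 1) - 1] = ARRAY[49,50];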
The subquery is not strictly necessary, but it can use a GIN index on (path) for a fast pre-selection:
(SELECT * FROM datasets.travel WHERE path @> ARRAY[49,50])
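A minimal sketch of such an index (the index name is my own; the default GIN operator class for arrays supports @>):
CREATE INDEX travel_path_gin ON datasets.travel USING gin (path);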
Related:
How to access array internal index with postgreSQL?
Parallel unnest() and sort order in PostgreSQL
PostgreSQL unnest() with element number

Full-Text Search Not working (FREETEXT - CONTAINS)

FREETEXT does not return only the rows matching all of the words, and CONTAINS does not work at all.
I have one row whose mycolumn includes "Life of a King".
I tried 2 methods.
First, CONTAINS:
SELECT * FROM MYTABLE WHERE CONTAINS(MYCOLUMN,'Life NEAR of NEAR a NEAR King')
It returns NOTHING
Second:
SELECT * FROM MYTABLE WHERE FREETEXT(MYCOLUMN,'Life of a King')
It returns 237 rows!
These include:
"Life of Pie","It's a Wonderfull Life","The Lion King","King Arthur","Life Story","Life of a King" etc...
I want to return only the row which includes the words "Life"+"of"+"a"+"King" together.
Thanks for replies!
I am assuming the full-text field is nvarchar.
Here is my example:
CREATE TABLE [dbo].[FullTextTable](
[ID] [int] NOT NULL CONSTRAINT [PK_FullTextTable] PRIMARY KEY,
[FullTextField] [nvarchar](max) NOT NULL
);
GO
-- assumes a default full-text catalog already exists in the database
CREATE FULLTEXT INDEX ON FullTextTable([FullTextField])
KEY INDEX [PK_FullTextTable]
WITH STOPLIST = SYSTEM;
GO
The following query returns the exact row:
SELECT FullTextField
FROM FullTextTable
WHERE
CONTAINS
(FullTextField, N'"Life NEAR of NEAR a NEAR King"' );
GO
You must consider the points below:
The column on which you are searching should have a FULLTEXT INDEX.
Check that the search term exists in the table:
SELECT * FROM sys.dm_fts_index_keywords(DB_ID('your_DB_Name'),
OBJECT_ID('_your_table_Name'))
where display_term like '%your_searching_keyword%'
The "Change Tracking" property should be set to "automatic". If after creating index you are going to add or delete rows from the table or data in the table is not static.