Using several analyzers in a GIN index in Postgres

I want to create a GIN index for Postgres full-text search. Is it possible to store the analyzer name for each row in a separate column called lang, and use it to create a GIN index with a different analyzer for each row, taken from that column?
This is what I use now. The analyzer is 'english', and it is the same for every row in the indexed table.
CREATE INDEX CONCURRENTLY IF NOT EXISTS
decription_fts_gin_idx ON api_product
USING GIN(to_tsvector('english', description))
I want to do something like this:
CREATE INDEX CONCURRENTLY IF NOT EXISTS
decription_fts_gin_idx ON api_product
USING GIN(to_tsvector(api_product.lang, description))
(it doesn't work)
in order to retrieve the analyzer configuration from the lang field and use its name to populate the index.
Is it possible to do this somehow, or is it only possible to use one analyzer for the whole index?
DDL, just in case:
-- auto-generated definition
create table api_product
(
id serial not null
constraint api_product_pkey
primary key,
name varchar(100) not null,
standard varchar(40) not null,
weight integer not null
constraint api_product_weight_check
check (weight >= 0),
dimensions varchar(30) not null,
description text not null,
textsearchable_index_col tsvector,
department varchar(30) not null,
lang varchar(25) not null
);
alter table api_product
owner to postgres;
create index textsearch_idx
on api_product (textsearchable_index_col);
Query to run for search:
SELECT *,
ts_rank_cd(to_tsvector('english', description),
to_tsquery('english', %(keyword)s), 32) as rnk
FROM api_product
WHERE to_tsvector('english', description) @@ to_tsquery('english', %(keyword)s)
ORDER BY rnk DESC, id
where 'english' would be replaced by the analyzer name from the lang field (english, french, etc.)

If you know ahead of time the language you are querying against, you could create a series of partial indexes:
CREATE INDEX CONCURRENTLY ON api_product
USING GIN(to_tsvector('english', description)) where lang='english';
Then in your query you would add the language you are searching in:
SELECT *,
ts_rank_cd(to_tsvector('english', description),
to_tsquery('english', %(keyword)s), 32) as rnk
FROM api_product
WHERE to_tsvector('english', description) @@ to_tsquery('english', %(keyword)s)
and lang='english'
ORDER BY rnk DESC, id
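If there are more than a couple of languages, the series of partial indexes could be generated rather than typed by hand. A minimal sketch, assuming the set of languages is known up front; the helper name partial_index_ddl and the index naming pattern are made up for illustration and are not part of the question:

```python
import re


def partial_index_ddl(languages, table="api_product", column="description"):
    """Build one partial-index CREATE INDEX statement per configuration name."""
    statements = []
    for lang in languages:
        # The name is interpolated into DDL, so accept plain identifiers only.
        if not re.fullmatch(r"[a-z_]+", lang):
            raise ValueError(f"unexpected configuration name: {lang!r}")
        statements.append(
            # Index name pattern is hypothetical.
            f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {column}_fts_{lang}_idx "
            f"ON {table} USING GIN (to_tsvector('{lang}', {column})) "
            f"WHERE lang = '{lang}'"
        )
    return statements


for statement in partial_index_ddl(["english", "french"]):
    print(statement + ";")
```

Each generated statement still has to be run against the database separately, since CREATE INDEX CONCURRENTLY cannot run inside a transaction block.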

What you asked about is definitely possible, but you have the wrong type for the lang column:
create table api_product(description text, lang regconfig);
create index on api_product using gin (to_tsvector(lang, description));
insert into api_product VALUES ('the description', 'english');

Related

How to make human readable autoincrement column in PostgreSQL?

I need a column to store the serial number of orders in an online shop.
Currently, I have this:
CREATE TABLE public.orders
(
id SERIAL PRIMARY KEY NOT NULL,
title VARCHAR(100) NOT NULL
);
CREATE UNIQUE INDEX orders_id_uindex ON public.orders (id);
But I need a special alphanumeric format for storing this number,
like 5CC806CF751A2.
How can I create this format using Postgres?
You can create a view that simply converts the ID to a hex value:
create view readable_orders
as
select id,
to_hex(id) as readable_id,
title
from orders;
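If the application also has to produce or parse the readable value, the same conversion is easy to mirror on the client side. A small sketch; the function names are invented here, and the upper-casing is only an assumption to match the style of 5CC806CF751A2 (to_hex itself returns lowercase digits):

```python
def readable_id(order_id: int) -> str:
    """Hex form of a non-negative id, upper-cased (to_hex gives lowercase)."""
    if order_id < 0:
        raise ValueError("order ids are expected to be non-negative")
    return format(order_id, "X")


def order_id_from_readable(readable: str) -> int:
    """Inverse of readable_id: parse the hex string back into the integer id."""
    return int(readable, 16)


print(readable_id(255))
print(order_id_from_readable("5CC806CF751A2"))
```

Note that small ids give short strings ("1", "A", "FF"), so this alone does not give a fixed 13-character value like the one in the question.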

How to create GIN index with LOWER in PostgreSQL?

First of all, I use a JPA ORM (EclipseLink) which doesn't support ILIKE, so I am looking for a way to do a case-insensitive search. I did the following:
CREATE TABLE IF NOT EXISTS users (
id SERIAL NOT NULL,
name VARCHAR(512) NOT NULL,
PRIMARY KEY (id));
CREATE INDEX users_name_idx ON users USING gin (LOWER(name) gin_trgm_ops);
INSERT INTO users (name) VALUES ('User Full Name');
However, this query returns the user:
SELECT * FROM users WHERE name ILIKE '%full%';
But this one doesn't:
SELECT * FROM users WHERE name LIKE '%full%';
So, how to create GIN index with LOWER in PostgreSQL?
I'm not sure I understand the question, because you mention GIN, insert one row, and expect it to be returned by a case-insensitive comparison. As a wild guess, maybe you are looking for citext?
t=# create extension citext;
CREATE EXTENSION
t=# CREATE TABLE IF NOT EXISTS users (
id SERIAL NOT NULL,
name citext NOT NULL,
PRIMARY KEY (id));
CREATE TABLE
t=# INSERT INTO users (name) VALUES ('User Full Name');
INSERT 0 1
t=# SELECT * FROM users WHERE name LIKE '%full%';
id | name
----+----------------
1 | User Full Name
(1 row)
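Update:
An expression-based index requires the same expression in the query. The index above is on LOWER(name), so the query has to filter on LOWER(name) as well, e.g. WHERE LOWER(name) LIKE '%full%'.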
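The rule about expression indexes can be seen in a query plan. The sketch below uses SQLite from Python's standard library purely as a stand-in, because it runs without a server. It is not Postgres and has no trigram GIN support, so it uses an equality lookup on a plain lower(name) index. The index name users_lower_name_idx is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# Expression index, analogous in spirit to the LOWER(name) index in the question.
conn.execute("CREATE INDEX users_lower_name_idx ON users (lower(name))")
conn.execute("INSERT INTO users (name) VALUES ('User Full Name')")


def query_plan(sql):
    """Return the planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


# The query repeats the indexed expression, so the index can be used.
with_expr = query_plan("SELECT id FROM users WHERE lower(name) = 'user full name'")
# The query filters on the bare column, so the expression index does not apply.
without_expr = query_plan("SELECT id FROM users WHERE name = 'User Full Name'")

print(with_expr)
print(without_expr)
```

The exact plan text differs between SQLite versions, but only the first plan should mention the index.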

SQLite performance tuning for paginated fetches

I am trying to optimize the query I use for fetching paginated data from a database with large data sets.
My schema looks like this:
CREATE TABLE users (
user_id TEXT PRIMARY KEY,
name TEXT,
custom_fields TEXT
);
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
organizer_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE SET NULL ON UPDATE CASCADE,
name TEXT NOT NULL,
type TEXT NOT NULL,
start_time INTEGER,
duration INTEGER
-- more columns here, omitted for the sake of simplicity
);
CREATE INDEX events_organizer_id_start_time_idx ON events(organizer_id, start_time);
CREATE INDEX events_organizer_id_type_idx ON events(organizer_id, type);
CREATE INDEX events_organizer_id_type_start_time_idx ON events(organizer_id, type, start_time);
CREATE INDEX events_type_start_time_idx ON events(type, start_time);
CREATE INDEX events_start_time_desc_idx ON events(start_time DESC);
CREATE INDEX events_start_time_asc_idx ON events(IFNULL(start_time, 253402300800) ASC);
CREATE TABLE event_participants (
participant_id TEXT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE ON UPDATE CASCADE,
event_id TEXT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE ON UPDATE CASCADE,
role INTEGER NOT NULL DEFAULT 0,
UNIQUE (participant_id, event_id) ON CONFLICT REPLACE
);
CREATE INDEX event_participants_participant_id_event_id_idx ON event_participants(participant_id, event_id);
CREATE INDEX event_participants_event_id_idx ON event_participants(event_id);
CREATE TABLE event_tag_maps (
event_id TEXT NOT NULL REFERENCES events(event_id) ON DELETE CASCADE ON UPDATE CASCADE,
tag_id TEXT NOT NULL,
PRIMARY KEY (event_id, tag_id) ON CONFLICT IGNORE
);
CREATE INDEX event_tag_maps_event_id_tag_id_idx ON event_tag_maps(event_id, tag_id);
The events table has around 1,500,000 entries, and event_participants has around 2,000,000.
Now, a typical query would look something like:
SELECT
EVTS.event_id,
EVTS.type,
EVTS.name,
EVTS.time,
EVTS.duration
FROM events AS EVTS
WHERE
EVTS.organizer_id IN(
'f39c3bb1-3ee3-11e6-a0dc-005056c00008',
'4555e70f-3f1d-11e6-a0dc-005056c00008',
'6e7e33ae-3f1c-11e6-a0dc-005056c00008',
'4850a6a0-3ee4-11e6-a0dc-005056c00008',
'e06f784c-3eea-11e6-a0dc-005056c00008',
'bc6a0f73-3f1d-11e6-a0dc-005056c00008',
'68959fb5-3ef3-11e6-a0dc-005056c00008',
'c4c96cf2-3f1a-11e6-a0dc-005056c00008',
'727e49d1-3f1b-11e6-a0dc-005056c00008',
'930bcfb6-3f09-11e6-a0dc-005056c00008')
AND EVTS.type IN('Meeting', 'Conversation')
AND(
EXISTS (
SELECT 1 FROM event_tag_maps AS ETM WHERE ETM.event_id = EVTS.event_id AND
ETM.tag_id IN ('00000000-0000-0000-0000-000000000000', '6ae6870f-1aac-11e6-aeb9-005056c00008', '6ae6870c-1aac-11e6-aeb9-005056c00008', '1f6d3ccb-eaed-4068-a46b-ec2547fec1ff'))
OR NOT EXISTS (
SELECT 1 FROM event_tag_maps AS ETM WHERE ETM.event_id = EVTS.event_id)
)
AND EXISTS (
SELECT 1 FROM event_participants AS EPRTS
WHERE
EVTS.event_id = EPRTS.event_id
AND participant_id NOT IN('79869516-3ef2-11e6-a0dc-005056c00008', '79869515-3ef2-11e6-a0dc-005056c00008', '79869516-4e18-11e6-a0dc-005056c00008')
)
ORDER BY IFNULL(EVTS.start_time, 253402300800) ASC
LIMIT 100 OFFSET @Offset;
Also, for fetching the overall count of the query-matching items, I would use the above query with count(1) instead of the columns and without the ORDER BY and LIMIT/OFFSET clauses.
I experience two main problems here:
1) The performance decreases drastically as I increase the @Offset value. The difference is very significant: from almost immediate to a number of seconds.
2) The count query takes a long time (a number of seconds) and produces the following execution plan:
0|0|0|SCAN TABLE events AS EVTS
0|0|0|EXECUTE LIST SUBQUERY 1
0|0|0|EXECUTE LIST SUBQUERY 1
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE event_tag_maps AS ETM USING COVERING INDEX event_tag_maps_event_id_tag_id_idx (event_id=? AND tag_id=?)
1|0|0|EXECUTE LIST SUBQUERY 2
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE event_tag_maps AS ETM USING COVERING INDEX event_tag_maps_event_id_tag_id_idx (event_id=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE event_participants AS EPRTS USING INDEX event_participants_event_id_idx (event_id=?)
Here I don't understand why a full table scan is performed instead of an index scan.
Additional info and SQLite settings used:
I use the System.Data.SQLite provider (I have to, because of its support for custom functions)
Page size = cluster size (4096 in my case)
Cache size = 100000
Journal mode = WAL
Temp store = 2 (memory)
No transaction is open for the query
Is there anything I could do to change the query/schema or settings in order to get as much performance improvement as possible?
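For the first problem, one commonly suggested alternative to a growing OFFSET is keyset ("seek") pagination: remember the sort key of the last row already returned and ask only for rows after it. The sketch below is a reduced illustration, not a drop-in replacement. It assumes event_id can serve as a tie-breaker, leaves out all the filters from the query above, and the function name fetch_page is invented here. The sort expression is written out literally so that it has the same form as the one in events_start_time_asc_idx.

```python
import sqlite3

# Same expression as in the question's ORDER BY, kept literal on purpose.
SORT_KEY = "IFNULL(start_time, 253402300800)"


def fetch_page(conn, page_size, after=None):
    """Fetch one page ordered by (sort_key, event_id).

    `after` is the (sort_key, event_id) pair of the last row already seen,
    or None for the first page. Returns (rows, cursor_for_next_page).
    """
    sql = f"SELECT event_id, name, {SORT_KEY} AS sort_key FROM events"
    params = {"n": page_size}
    if after is not None:
        # Seek past the last row instead of skipping rows with OFFSET.
        sql += (
            f" WHERE {SORT_KEY} > :key"
            f" OR ({SORT_KEY} = :key AND event_id > :id)"
        )
        params.update(key=after[0], id=after[1])
    sql += " ORDER BY sort_key ASC, event_id ASC LIMIT :n"
    rows = conn.execute(sql, params).fetchall()
    cursor = (rows[-1][2], rows[-1][0]) if rows else None
    return rows, cursor
```

A caller would pass the returned cursor back in as `after` for the next page. The trade-off is that pages can only be walked in order; jumping straight to page N is no longer possible.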

Any way to achieve fulltext-like search on InnoDB

I have a very simple query:
SELECT ... WHERE row LIKE '%some%' OR row LIKE '%search%' OR row LIKE '%string%'
to search for some search string, but as you can see, it searches for each string individually and it's also not good for performance.
Is there a way to recreate a fulltext-like search using LIKE on an InnoDB table? Of course, I know I can use something like Sphinx to achieve this, but I'm looking for a pure MySQL solution.
Use a MyISAM fulltext table to index back into your InnoDB tables, for example:
Build your system using InnoDB:
create table users (...) engine=innodb;
create table forums (...) engine=innodb;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
user_id int unsigned not null,
subject varchar(255) not null, -- gonna want to search this... !!
created_date datetime not null,
next_reply_id int unsigned not null default 0,
view_count int unsigned not null default 0,
primary key (forum_id, thread_id) -- composite clustered PK index
)
engine=innodb;
Now the fulltext search table which we will use just to index back into our innodb tables. You can maintain rows in this table either by using a trigger or nightly batch updates etc.
create table threads_ft
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
subject varchar(255) not null,
fulltext (subject), -- fulltext index on subject
primary key (forum_id, thread_id) -- composite non-clustered index
)
engine=myisam;
Finally the search stored procedure which you call from your php/application:
drop procedure if exists ft_search_threads;
delimiter #
create procedure ft_search_threads
(
in p_search varchar(255)
)
begin
select
t.*,
f.title as forum_title,
u.username,
match(tft.subject) against (p_search in boolean mode) as rank
from
threads_ft tft
inner join threads t on tft.forum_id = t.forum_id and tft.thread_id = t.thread_id
inner join forums f on t.forum_id = f.forum_id
inner join users u on t.user_id = u.user_id
where
match(tft.subject) against (p_search in boolean mode)
order by
rank desc
limit 100;
end #
delimiter ;
call ft_search_threads('+innodb +clustered +index');
Hope this helps :)
Using PHP to construct the query. This is a horrible hack. Once seen, it can't be unseen...
$words=dict($userQuery);
$numwords = sizeof($words);
$innerquery="";
for($i=0;$i<$numwords;$i++) {
$words[$i] = mysql_real_escape_string($words[$i]);
if($i>0) $innerquery .= " AND ";
$innerquery .= "
(
field1 LIKE \"%$words[$i]%\" OR
field2 LIKE \"%$words[$i]%\" OR
field3 LIKE \"%$words[$i]%\" OR
field4 LIKE \"%$words[$i]%\"
)
";
}
SELECT fields FROM table WHERE $innerquery AND whatever;
dict is a dictionary function
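The same query-building idea can be written with bound parameters instead of escaping and string interpolation. This is a sketch only, in Python rather than PHP. The function name build_like_clause is made up, the %s placeholder style is an assumption (it is what the common MySQL drivers for Python accept), and the field names are expected to come from the code, never from user input.

```python
def build_like_clause(words, fields):
    """Return (sql_fragment, params): every word must match at least one field."""
    groups, params = [], []
    for word in words:
        # One OR-group per word, one placeholder per field.
        alternatives = " OR ".join(f"{field} LIKE %s" for field in fields)
        groups.append(f"({alternatives})")
        # LIKE wildcards inside the word itself are not escaped in this sketch.
        params.extend([f"%{word}%"] * len(fields))
    return " AND ".join(groups), params


clause, params = build_like_clause(["some", "search"], ["field1", "field2"])
print(clause)
print(params)
```

The fragment goes after WHERE in the final statement, and the parameter list is passed to the driver alongside it. As with the PHP version, the leading wildcards mean no ordinary index can help.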
InnoDB full-text search (FTS) is finally available, as of the MySQL 5.6.4 release.
These indexes are physically represented as entire InnoDB tables, which are acted upon by SQL keywords such as the FULLTEXT clause of the CREATE INDEX statement, the MATCH() ... AGAINST syntax in a SELECT statement, and the OPTIMIZE TABLE statement.
From FULLTEXT Indexes

Using MySQL's "IN" function where the target is a column?

In a certain TABLE, I have a VARTEXT field which includes comma-separated values of country codes. The field is named cc_list. Typical entries look like the following:
'DE,US,IE,GB'
'IT,CA,US,FR,BE'
Now given a country code, I want to be able to efficiently find which records include that country. Obviously there's no point in indexing this field.
I can do the following
SELECT * from TABLE where cc_list LIKE '%US%';
But this is inefficient.
Since the "IN" function is supposed to be efficient (it bin-sorts the values), I was thinking along the lines of
SELECT * from TABLE where 'US' IN cc_list
But this doesn't work - I think the 2nd operand of IN needs to be a list of values, not a string. Is there a way to convert a CSV string to a list of values?
Any other suggestions? Thanks!
SELECT *
FROM MYTABLE
WHERE FIND_IN_SET('US', cc_list)
"In a certain TABLE, I have a VARTEXT field which includes comma-separated values of country codes."
If you want your queries to be efficient, you should create a many-to-many link table:
CREATE TABLE table_country (cc CHAR(2) NOT NULL, tableid INT NOT NULL, PRIMARY KEY (cc, tableid));
SELECT *
FROM table_country tc
JOIN mytable t
ON t.id = tc.tableid
WHERE tc.cc = 'US'
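To make the link-table idea concrete, here is a small end-to-end sketch. It uses SQLite from Python's standard library only so that it can be run without a MySQL server. The sample rows (including the third one) and the helper name rows_for_country are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, cc_list TEXT NOT NULL);
    CREATE TABLE table_country (
        cc CHAR(2) NOT NULL,
        tableid INT NOT NULL,
        PRIMARY KEY (cc, tableid)
    );
    INSERT INTO mytable VALUES
        (1, 'DE,US,IE,GB'), (2, 'IT,CA,US,FR,BE'), (3, 'DE,FR');
""")

# Populate the link table by splitting each comma-separated list once.
for row_id, cc_list in conn.execute("SELECT id, cc_list FROM mytable").fetchall():
    conn.executemany(
        "INSERT INTO table_country (cc, tableid) VALUES (?, ?)",
        [(cc, row_id) for cc in cc_list.split(",")],
    )


def rows_for_country(cc):
    """Look up rows through the link table; the (cc, tableid) key leads with cc."""
    return conn.execute(
        "SELECT t.id, t.cc_list FROM table_country tc "
        "JOIN mytable t ON t.id = tc.tableid "
        "WHERE tc.cc = ? ORDER BY t.id",
        (cc,),
    ).fetchall()


print(rows_for_country("US"))
```

Because cc is the leading column of the primary key, the lookup by country code can use that key instead of scanning every cc_list string.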
Alternatively, you can set ft_min_word_len to 2, create a FULLTEXT index on your column and query like this:
CREATE FULLTEXT INDEX fx_mytable_cclist ON mytable (cc_list);
SELECT *
FROM MYTABLE
WHERE MATCH(cc_list) AGAINST('+US' IN BOOLEAN MODE)
This only works for MyISAM tables and the argument should be a literal string (you won't be able to join on this condition).
The first rule of normalization says you should change multi-value columns such as cc_list into a single-value field for this very reason.
Preferably, move the country codes into their own table with an ID for each code, plus a pivot table to support a many-to-many relationship.
CREATE TABLE my_table (
my_id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
mystuff VARCHAR(255) NOT NULL, -- VARCHAR needs a length in MySQL; 255 is an arbitrary choice
PRIMARY KEY(my_id)
);
# this is the pivot table
CREATE TABLE my_table_countries (
my_id INT(11) UNSIGNED NOT NULL,
country_id SMALLINT(5) UNSIGNED NOT NULL,
PRIMARY KEY(my_id, country_id)
);
CREATE TABLE countries (
country_id SMALLINT(5) UNSIGNED NOT NULL AUTO_INCREMENT,
country_code CHAR(2) NOT NULL,
country_name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_id)
);
Then you can query it making use of indexes:
SELECT * FROM my_table JOIN my_table_countries USING (my_id) JOIN countries USING (country_id) WHERE country_code = 'DE'
SELECT * FROM my_table JOIN my_table_countries USING (my_id) JOIN countries USING (country_id) WHERE country_code IN('DE','US')
You may have to group the results by my_id.
find_in_set seems to be the MySQL function you want. If you could actually store those comma-separated strings as MySQL sets (no more than 64 possible countries, or splitting countries into two groups of no more than 64 each), you could keep using find_in_set and go a bit faster.
There's no efficient way to find what you want. A table scan will be necessary. Putting multiple values into a single text field is a terrible misuse of relational database technology. If you refactor (if you have access to the database structure) so that the country codes are properly stored in a separate table you will be able to easily and quickly retrieve the data you want.
One approach that I've used successfully before (not on mysql, though) is to place a trigger on the table that splits the values (based on a specific delimiter) into discrete values, inserting them into a sub-table. Your select can then look like this:
SELECT * from TABLE where cc_list IN
(
select cc_list_name from cc_list_subtable
where cc_list_subtable.table_id = TABLE.id
)
where the trigger parses cc_list in TABLE into separate entries in column cc_list_name in table cc_list_subtable. It involves a bit of work in the trigger, too, as every change to TABLE means that the associated rows in cc_list_subtable have to be deleted/updated/inserted as appropriate. Still, it is an approach that works in situations where the original table TABLE has to retain its structure but you are free to adapt the query as you see fit.