I have a situation where I am trying to JOIN two tables based on partially matching text data. I have read the question Using Full-Text Search in SQL Server 2005 across multiple tables, columns, and it appears that my best option is to create a VIEW and add a full-text index on the VIEW.
Let me start by giving a little background on the situation. I have an Excel spreadsheet from which I need to calculate some pricing for drugs, but the drug names in the spreadsheet do not exactly match the database from which I am pulling the pricing information. So I figured that using full-text search may be the way to go.
What I have done so far is export the spreadsheet as a CSV file and use BULK INSERT to import the data into my database. Now, my drug database has a primary key on NDC, but unfortunately that information is not available in the spreadsheet, or my job would be much easier.
Basically, I need to be able to match 'AMLODIPINE TAB 5MG' and 'AMLODIPINE BESYLATE 5MG TAB'. This is just one example, but the other drugs are similar. My issue is that I'm not even sure how I would create a VIEW that includes both columns, given that they don't match.
Is there a way to use a full-text search in a JOIN statement, something like:
SELECT i.Description, m.ProdDescAbbr
FROM dbo.ImportTable i
LEFT JOIN dbo.ManufNames m ON m.ProdDescAbbr <something similar to> i.Description
EDIT:
Not all of the drug names will contain extra words; another example that I am trying to match is: 'ACYCLOVIR TAB 800MG' and 'ACYCLOVIR 800MG TAB'
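The two example pairs suggest that the names differ only in word order and extra words, so a word-token comparison would pair them up. A rough sketch of that idea in Python (outside of SQL), where `names_match` is a hypothetical helper, not part of any library:

```python
def names_match(spreadsheet_name: str, db_name: str) -> bool:
    """True if every word in the spreadsheet name appears in the DB name,
    regardless of word order."""
    return set(spreadsheet_name.upper().split()) <= set(db_name.upper().split())

print(names_match("AMLODIPINE TAB 5MG", "AMLODIPINE BESYLATE 5MG TAB"))  # True
print(names_match("ACYCLOVIR TAB 800MG", "ACYCLOVIR 800MG TAB"))         # True
print(names_match("ACYCLOVIR TAB 800MG", "AMLODIPINE BESYLATE 5MG TAB")) # False
```

This subset test would match both example pairs, though it could over-match on very generic names (e.g. strengths shared between drugs), so it is only an illustration of the matching rule, not a complete solution.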
In my work I came across this (fancy, to me) function CONTAINSTABLE, which uses a full-text index. It may be a more complicated function than this situation calls for, but I wanted to share it.
Returns a table of zero, one, or more rows for those columns containing precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches
Overall, I see that you will need to prepare the search condition (build it as text) before searching for it.
Example:
SELECT select_list
FROM table AS FT_TBL
INNER JOIN CONTAINSTABLE(table, column, contains_search_condition) AS KEY_TBL
ON FT_TBL.unique_key_column = KEY_TBL.[KEY];
source http://msdn.microsoft.com/en-us/library/ms189760.aspx
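One way to read "prepare the search condition" is to build the contains_search_condition string from the words of a drug description before passing it to CONTAINSTABLE. A minimal sketch in Python; the AND-of-quoted-words form is standard CONTAINS/CONTAINSTABLE syntax, while `build_condition` itself is a hypothetical helper:

```python
def build_condition(description: str) -> str:
    """Turn a drug description into a CONTAINSTABLE search condition,
    e.g. 'AMLODIPINE TAB 5MG' -> '"AMLODIPINE" AND "TAB" AND "5MG"'."""
    return " AND ".join(f'"{word}"' for word in description.split())

print(build_condition("AMLODIPINE TAB 5MG"))
# "AMLODIPINE" AND "TAB" AND "5MG"
```

The resulting string would then be supplied as the third argument to CONTAINSTABLE (ideally via a parameter, not string concatenation into the SQL).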
You can add a
CREATE VIEW view_name WITH SCHEMABINDING
AS
in front of your SQL to create the view. Then you could
CREATE UNIQUE CLUSTERED INDEX idx_name
ON view_name(Description, ProdDescAbbr)
Then you can
CREATE FULLTEXT INDEX ON view_name (Description, ProdDescAbbr)
KEY INDEX idx_name
That will let you run a search with
WHERE CONTAINS( (Description, ProdDescAbbr), 'search_term')
Related
I have a brand_name column in my brand_names table and a product_name column in my product_names table.
At the moment, I have two separate SELECTs (one on brand_names.brand_name and one on product_names.product_name) and I use a UNION to OR the two resultsets together. However, when a search is made for "SomeBrandName Some Product Name", even though such a product exists, my SQL returns zero results (this is because the terms - SomeBrandName Some Product and Name - don't all appear in brand_names.brand_name and they don't all appear in product_names.product_name).
So I need help to work out SQLite / FTS3 equivalent of something like...
SELECT
(brand_names.brand_name || ' ' || product_names.product_name) AS brand_and_product_name
FROM
brand_names, product_names
WHERE
brand_and_product_name MATCH 'SomeBrandName Some Product Name'
What is the actual SQLite / FTS3 SQL that I need to achieve this?
In terms of research, I have read through the SQLite FTS3 guide but it doesn't mention multiple tables.
I've also seen a similar question which is a bit more advanced and so may well be overkill for the simple search I am trying to achieve here.
An FTS search can be done only against an FTS index.
If you want to have a result for "Brand Product", you have to create an FTS table that contains these words in a single row.
(To reduce storage, try using an external content table on a view.)
For that, you first have to create virtual (FTS) tables for both of your tables.
Then you can apply an indexed (FTS) search on them using a join, the same way you would join ordinary tables.
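A small end-to-end sketch using Python's stdlib sqlite3 module, assuming the bundled SQLite library was built with FTS support (most are). It follows the "single row per brand+product" idea above: the FTS table name `brand_product_fts` is my own, the source table names come from the question, and FTS4 is used here (the FTS3 syntax is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# The two ordinary tables from the question
c.execute("CREATE TABLE brand_names (id INTEGER PRIMARY KEY, brand_name TEXT)")
c.execute("CREATE TABLE product_names (id INTEGER PRIMARY KEY, brand_id INTEGER, product_name TEXT)")
c.execute("INSERT INTO brand_names VALUES (1, 'SomeBrandName')")
c.execute("INSERT INTO product_names VALUES (1, 1, 'Some Product Name')")

# FTS table holding the concatenated text in a single row per product
c.execute("CREATE VIRTUAL TABLE brand_product_fts USING fts4(brand_and_product_name)")
c.execute("""
    INSERT INTO brand_product_fts(rowid, brand_and_product_name)
    SELECT p.id, b.brand_name || ' ' || p.product_name
    FROM product_names p JOIN brand_names b ON p.brand_id = b.id
""")

# Now all the terms live in one row, so the combined search matches
rows = c.execute(
    "SELECT rowid FROM brand_product_fts "
    "WHERE brand_and_product_name MATCH 'SomeBrandName Some Product Name'"
).fetchall()
print(rows)
```

The returned rowid can then be joined back to product_names to fetch the full record.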
I have a postgresql view that is comprised as a combination of 3 tables:
create view search_view as
select u.first_name, u.last_name, a.notes, a.summary, a.search_index
from user as u, assessor as a, connector as c
where a.connector_id = c.id and c.user_id = u.id;
However, I need to concatenate tsvector fields from 2 of the 3 tables into a single tsvector field in the view, which provides full-text search across 4 fields: 2 from one table and 2 from another.
I've read the documentation stating that I can use the concat operator to combine two tsvector fields, but I'm not certain what this looks like syntactically, and also whether there are potential gotchas with this implementation.
I'm looking for example code that concats two tsvector fields from separate tables into a view, and also commentary on whether this is a good or bad practice in postgresql land.
I was wondering the same thing. I don't think we are supposed to combine tsvectors from multiple tables like this. The best solution is to:
create a new tsv column in each of your tables (user, assessor, connector)
update the new tsv column in each table with all of the text you want to search. For example, in the user table you would update the tsv column of all records by concatenating the first_name and last_name columns.
create an index on the new tsv column, this will be faster than indexing on the individual columns
Run your queries as usual, and let Postgres do the "thinking" about which indexes to use. It may or may not use all indexes in queries involving more than one table.
use the ANALYZE and EXPLAIN commands to look at how Postgres is utilizing your new indexes for particular queries, and this will give you insight into speeding things up further.
This will be my approach, at least. I too have been doing lots of reading and have found that people aren't combining data from multiple tables into tsvectors. In fact, I don't think this is possible; it may only be possible to use the columns of the current table when creating a tsvector.
Concatenating tsvectors works, but as per the comments, the index is probably not used this way (I'm not an expert, so I can't say whether it is or isn't).
SELECT * FROM newsletters
LEFT JOIN campaigns ON newsletters.campaign_id=campaigns.id
WHERE newsletters.tsv || campaigns.tsv @@ to_tsquery(unaccent(?))
The reason you'd want this is to search for an AND string like txt1 & txt2 & txt3, which is a very common usage scenario. If you simply split the search with an OR, WHERE campaigns.tsv @@ to_tsquery(unaccent(?)), this won't work, because it will try to match all 3 tokens against each tsv column, but the tokens could be in either column.
One solution which I found is to use triggers to insert and update the tsv column in table1 whenever the table2 changes, see: https://dba.stackexchange.com/questions/154011/postgresql-full-text-search-tsv-column-trigger-with-many-to-many but this is not a definitive answer and using that many triggers is error prone and hacky.
Official documentation and some tutorials also show concatenating all the wanted columns into a tsvector on the fly, without using a tsv column. But it is unclear how much slower the on-the-fly approach is versus the tsv-column approach; I can't find a single benchmark or explanation about this. The documentation simply states:
Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches. (This is more important when using a GiST index than a GIN index; see Section 12.9.) The expression-index approach is simpler to set up, however, and it requires less disk space, since the tsvector representation is not stored explicitly.
All I can tell from this is that tsv columns are probably a waste of resources and just complicate things, but it'd be nice to see some hard numbers. Still, if you can concatenate tsv columns like this, then I guess it's no different from doing it in a WHERE clause.
I built sqlite3 from source to include FTS3 support and then created a new table in an existing sqlite database containing 1.5 million rows of data, using
CREATE VIRTUAL TABLE data USING FTS3(codes text);
Then used
INSERT INTO data(codes) SELECT originalcodes FROM original_data;
Then queried each table with
SELECT * FROM original_data WHERE originalcodes='RH12';
This comes back instantly, as I have an index on that column.
The query on the FTS3 table
SELECT * FROM data WHERE codes='RH12';
takes almost 28 seconds.
Can someone help explain what I have done wrong? I expected this to be significantly quicker.
The documentation explains:
FTS tables can be queried efficiently using SELECT statements of two different forms:
Query by rowid. If the WHERE clause of the SELECT statement contains a sub-clause of the form "rowid = ?", where ? is an SQL expression, FTS is able to retrieve the requested row directly using the equivalent of an SQLite INTEGER PRIMARY KEY index.
Full-text query. If the WHERE clause of the SELECT statement contains a sub-clause of the form "&lt;column&gt; MATCH ?", FTS is able to use the built-in full-text index to restrict the search to those documents that match the full-text query string specified as the right-hand operand of the MATCH clause.
If neither of these two query strategies can be used, all queries on FTS tables are implemented using a linear scan of the entire table.
For an efficient query, you should use
SELECT * FROM data WHERE codes MATCH 'RH12'
but this will find all records that contain the search string.
To do 'normal' queries efficiently, you have to keep a copy of the data in a normal table.
(If you want to save space, you can use a contentless or external content table.)
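The two query strategies can be tried side by side from Python's stdlib sqlite3 module, assuming its bundled SQLite was built with FTS support (FTS4 here; the behaviour described above is the same for FTS3). On a small table both queries return the same rows; only on millions of rows does the linear scan behind `=` become the 28-second problem:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE VIRTUAL TABLE data USING fts4(codes)")
c.executemany("INSERT INTO data(codes) VALUES (?)",
              [("RH12",), ("RH13",), ("AB99",)])

# MATCH uses the built-in full-text index
fast = c.execute("SELECT rowid FROM data WHERE codes MATCH 'RH12'").fetchall()

# = falls back to a linear scan of the whole FTS table
slow = c.execute("SELECT rowid FROM data WHERE codes = 'RH12'").fetchall()

print(fast, slow)  # same rows, very different query plans
```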
You should read the documentation more carefully.
Any query against a virtual FTS table using WHERE col = 'value' will be slow (except for queries against ROWID), but a query using WHERE col MATCH 'value' will use the full-text index and be fast.
I'm not an expert on this, but here are a few things to think about.
Your test is flawed (I think). You are contrasting a scenario where you have an exact text match (the index on original_data can be used - nothing is going to outperform this scenario) with an equality test on the FTS3 table (I'm not sure that FTS3 even comes into play in this type of query). If you want to compare apples to apples (to see the benefit of FTS3), you will want to compare a LIKE operation on original_data against the FTS3 MATCH operation on data.
I've got a table with close to 5 million rows. Each one of them has one text column where I store my XML logs.
I am trying to find out if there's some log having
<node>value</node>
I've tried with
SELECT top 1 id_log FROM Table_Log WHERE log_text LIKE '%<node>value</node>%'
but it never finishes.
Is there any way to improve this search?
PS: I can't drop any log
A wildcarded query such as '%<node>value</node>%' will result in a full table scan (ignoring indexes), as it can't determine where within the field it will find the match. The only real way I know of to improve this query as it stands (short of things like partitioning the table, which should be considered if the table is logging constantly) would be to add a full-text catalog and index to the table in order to provide a more efficient search over that field.
Here is a good reference that should walk you through it. Once this has been completed you can use things like the CONTAINS and FREETEXT operators that are optimised for this type of retrieval.
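The full-table-scan behaviour of a leading-wildcard LIKE is not specific to SQL Server; it can be demonstrated quickly with SQLite from Python (the exact plan text varies by SQLite version, but the plan always reports a scan rather than an index search):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, log_text TEXT)")
c.execute("CREATE INDEX idx_log_text ON logs(log_text)")

# The leading % means the engine cannot seek into the index,
# so the plan falls back to scanning every row.
plan = c.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM logs WHERE log_text LIKE '%<node>value</node>%'"
).fetchall()
print(plan)  # the detail column reports a full scan despite the index
```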
Apart from implementing full-text search on that column and indexing the table, maybe you can narrow the results by other parameters (date, etc.).
Also, you could add a table field (varchar type) called "Tags" which you can populate when inserting a row. This field would register "keywords, tags" for this log. This way, you could change your query with this field as condition.
Unfortunately, about the only way I can see to optimize that is to implement full-text search on that column, but even that will be hard to construct to where it only returns a particular value within a particular element.
I'm currently doing some work where I'm also storing XML within one of the columns. But I'm assuming any queries needed on that data will take a long time, which is okay for our needs.
Another option has to do with storing the data in a binary column, and then SQL Server has options for specifying what type of document is stored in that field. This allows you to, for example, implement more meaningful full-text searching on that field. But it's hard for me to imagine this will efficiently do what you are asking for.
You are using a LIKE query.
No index involved = no good.
Unfortunately, there is nothing you can do with what you have currently to speed this up.
I don't think it will help but try using the FAST x query hint like so:
SELECT id_log
FROM Table_Log
WHERE log_text LIKE '%<node>value</node>%'
OPTION(FAST 1)
This should optimise the query to return the first row.
Table A has millions of rows of indexed phrases (1-5 words). I'm looking for matches to about 20-30 phrases, e.g., ('bird', 'cat', 'cow', 'purple rain', etc.). I know that the IN operator is generally a bad idea when the search set is large - so the solution is to create a temp table (in memory) and JOIN it against the table I'm searching.
I can create a TEMP TABLE B using my search phrases, and I know that if I do the join, the SQL engine will work against the Table A indices. Does it make any difference at all to index TEMP TABLE B phrases?
Edit... I just realized you're asking about sqlite. I'd say the same principle of keeping the very small joined table in cache would still apply, though.
When joining tables, SQL Server will put the relevant contents of one table in cache, if possible. Your 20 to 30 phrases will certainly fit in cache, so there would really be no point in indexing. Indexing is useful for looking up values, but SQL Server will already have these values in cache. Also, since SQL Server reads data a page at a time (a page is 8K), it will be able to read that entire table in one read.
When you make your temp table, make sure to use the same datatype so SQL Server doesn't have to convert values to match.
Why would IN be a bad idea when the search terms are many?
From what I understand when I read about the SQLite query planner, a list of IN(1,2,3,4,5,6,N) would generate the same query plan as a join against a temporary table with the same rows.
An index on a temporary search term table will not make the query any faster since you process all terms. Going via index only adds processing time.
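The claim that an IN list and a temp-table join end up with comparable plans can be checked with SQLite's EXPLAIN QUERY PLAN from Python. In this sketch, both queries probe the index on the large table (a SEARCH of phrases), while the tiny term table is simply scanned (the exact plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE phrases (phrase TEXT)")        # stand-in for Table A
c.execute("CREATE INDEX idx_phrase ON phrases(phrase)")
c.execute("CREATE TEMP TABLE search_terms (term TEXT)")  # stand-in for Table B
c.executemany("INSERT INTO search_terms VALUES (?)",
              [("bird",), ("cat",), ("purple rain",)])

plan_in = c.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT phrase FROM phrases WHERE phrase IN ('bird', 'cat', 'purple rain')"
).fetchall()
plan_join = c.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT p.phrase FROM phrases p JOIN search_terms s ON p.phrase = s.term"
).fetchall()

# Both plans use the phrases index; indexing search_terms changes nothing,
# since every one of its rows is visited either way.
print(plan_in)
print(plan_join)
```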