Wildcard text matching (joining): what database system is best suited? - sql

I have a question for text analysis and database experts. I would like to match person names from one database table to text articles in another table. For example:
SELECT text FROM article
INNER JOIN person
ON article.text LIKE '%' || person.name || '%'
This method is very slow on every database I have tried, including Netezza, Redshift, and traditional RDBMSs like MySQL or SQL Server.
What system is best suited for queries like this?

It is slow because you don't have an index, so every query ends up doing multiple full table scans. You can stay with an RDBMS if you like; the only thing you need to create is an index table.
This table looks like this:
word varchar(n),
document_id int
The idea is to create one entry in this table for every word in a document, where document_id points to the row in your source table.
Then you create a database index on the word column, and you end up with O(n log n) time complexity for your query.
You can also try IBM DB2 Text Search or similar tools from other vendors, which basically do the same thing.
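A minimal sketch of that index table, assuming the source table is article with an integer primary key id (all names besides article and person are illustrative):
-- one row per (word, document) pair
CREATE TABLE word_index (
    word        VARCHAR(100) NOT NULL,
    document_id INT          NOT NULL   -- points at article.id
);
CREATE INDEX idx_word_index_word ON word_index (word);

-- once word_index is populated, the name lookup becomes an indexed equality join
-- instead of a LIKE scan:
SELECT a.text
FROM person p
JOIN word_index w ON w.word = p.name
JOIN article a ON a.id = w.document_id;
Note that this assumes a name is a single token; multi-word names would have to be tokenized the same way the documents are.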

Related

SQLite FTS3 - Full-Text Search over multiple tables

I have a brand_name column in my brand_names table and a product_name column in my product_names table.
At the moment, I have two separate SELECTs (one on brand_names.brand_name and one on product_names.product_name) and I use a UNION to OR the two result sets together. However, when a search is made for "SomeBrandName Some Product Name", my SQL returns zero results even though such a product exists. This is because the terms - SomeBrandName, Some, Product and Name - don't all appear in brand_names.brand_name, and they don't all appear in product_names.product_name.
So I need help to work out SQLite / FTS3 equivalent of something like...
SELECT
(brand_names.brand_name || ' ' || product_names.product_name) AS brand_and_product_name
FROM
brand_names, product_names
WHERE
brand_and_product_name MATCH 'SomeBrandName Some Product Name'
What is the actual SQLite / FTS3 SQL that I need to achieve this?
In terms of research, I have read through the SQLite FTS3 guide but it doesn't mention multiple tables.
I've also seen a similar question which is a bit more advanced and so may well be overkill for the simple search I am trying to achieve here.
An FTS search can be done only in FTS indexes.
If you want to have a result for "Brand Product", you have to create an FTS table that contains these words in a single row.
(To reduce storage, try using an external content table on a view.)
For that, you first have to prepare virtual (FTS) tables for both of your tables. Then you can apply an indexed search (FTS search) on them using a join, just as you would join ordinary tables.
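A hedged sketch of the combined FTS table suggested in the first answer, assuming product_names has a brand_id column linking it to brand_names (that relationship is an assumption, not from the question):
-- one FTS row per product, containing the brand and product words together
CREATE VIRTUAL TABLE brand_product_fts USING fts3(brand_and_product_name);

INSERT INTO brand_product_fts (rowid, brand_and_product_name)
SELECT p.id, b.brand_name || ' ' || p.product_name
FROM product_names p
JOIN brand_names b ON b.id = p.brand_id;   -- assumed relationship

SELECT brand_and_product_name
FROM brand_product_fts
WHERE brand_and_product_name MATCH 'SomeBrandName Some Product Name';
Setting rowid to the product id lets a match be joined back to product_names afterwards.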

JOIN two tables using full-text search in SQL Server

I have a situation where I am trying to JOIN two tables based on partially matching text data. I have read the question Using Full-Text Search in SQL Server 2005 across multiple tables, columns and it appears that my best option is to create a VIEW and add a full-text index on the VIEW.
Let me start by giving a little background of the situation. I have an Excel spreadsheet that I need to calculate some pricing for drugs, but the drug names in the spreadsheet do not match exactly to the database where I am pulling the pricing information. So I figured that using full-text search may be the way to go.
What I have done so far is export the spreadsheet as a CSV file and use BULK INSERT to import the data into my database. Now, my drug database has a primary key on NDC, but unfortunately that information is not available in the spreadsheet, or my job would be much easier.
I need to basically be able to match 'AMLODIPINE TAB 5MG' and 'AMLODIPINE BESYLATE 5MG TAB'. This is just one example, but the other drugs are similar. My issue is that I'm not even sure how I would be able to create a VIEW in order to add both columns, without them matching.
Is there a way to use a full-text search in a JOIN statement, something like:
SELECT i.Description, m.ProdDescAbbr
FROM dbo.ImportTable i
LEFT JOIN dbo.ManufNames m ON m.ProdDescAbbr <something similar to> i.Description
EDIT:
Not all of the drug names will contain extra words; another example that I am trying to match is 'ACYCLOVIR TAB 800MG' and 'ACYCLOVIR 800MG TAB'.
In my work I came across this (fancy, for me) function CONTAINSTABLE, which uses a full-text index. It may be a more complicated function than this situation needs, but I wanted to share it.
Returns a table of zero, one, or more rows for those columns containing precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches.
Overall, I see that you will need to prepare the search condition (make it text) before looking for it.
Example:
SELECT select_list
FROM table AS FT_TBL
INNER JOIN CONTAINSTABLE(table, column, contains_search_condition) AS KEY_TBL
ON FT_TBL.unique_key_column = KEY_TBL.[KEY];
source http://msdn.microsoft.com/en-us/library/ms189760.aspx
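A hedged sketch of how that template could look with the tables from the question, assuming a full-text index already exists on dbo.ManufNames(ProdDescAbbr) and that NDC is its unique key column (both are assumptions):
-- the search condition is prepared as text first, e.g. built from i.Description
DECLARE @term nvarchar(200);
SET @term = 'AMLODIPINE AND 5MG AND TAB';

SELECT m.ProdDescAbbr, k.[RANK]
FROM CONTAINSTABLE(dbo.ManufNames, ProdDescAbbr, @term) AS k
INNER JOIN dbo.ManufNames AS m ON m.NDC = k.[KEY]
ORDER BY k.[RANK] DESC;
The [RANK] column can be used to pick the closest match when several rows qualify.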
You can add a
CREATE VIEW view_name WITH SCHEMABINDING
AS
in front of your SQL to create the view. Then you could
CREATE UNIQUE CLUSTERED INDEX idx_name
ON view_name(Description, ProdDescAbbr)
Then you can
CREATE FULLTEXT INDEX ON view_name
That will let you run a search with
WHERE CONTAINS( (Description, ProdDescAbbr), 'search_term')

SQLite FTS3 Query Slower than Standard Table

I built sqlite3 from source to include FTS3 support and then created a new table in an existing SQLite database containing 1.5 million rows of data, using
CREATE VIRTUAL TABLE data USING FTS3(codes text);
Then used
INSERT INTO data(codes) SELECT originalcodes FROM original_data;
Then queried each table with
SELECT * FROM original_data WHERE originalcodes='RH12';
This comes back instantly, as I have an index on that column.
The query on the FTS3 table
SELECT * FROM data WHERE codes='RH12';
takes almost 28 seconds.
Can someone help explain what I have done wrong, as I expected this to be significantly quicker?
The documentation explains:
FTS tables can be queried efficiently using SELECT statements of two different forms:
Query by rowid. If the WHERE clause of the SELECT statement contains a sub-clause of the form "rowid = ?", where ? is an SQL expression, FTS is able to retrieve the requested row directly using the equivalent of an SQLite INTEGER PRIMARY KEY index.
Full-text query. If the WHERE clause of the SELECT statement contains a sub-clause of the form "<column> MATCH ?", FTS is able to use the built-in full-text index to restrict the search to those documents that match the full-text query string specified as the right-hand operand of the MATCH clause.
If neither of these two query strategies can be used, all queries on FTS tables are implemented using a linear scan of the entire table.
For an efficient query, you should use
SELECT * FROM data WHERE codes MATCH 'RH12'
but this will find all records that contain the search string.
To do 'normal' queries efficiently, you have to keep a copy of the data in a normal table.
(If you want to save space, you can use a contentless or external content table.)
You should read the documentation more carefully.
Any query against a virtual FTS table using WHERE col = 'value' will be slow (except for a query against ROWID), but a query using WHERE col MATCH 'value' will use the full-text index and be fast.
I'm not an expert on this, but here are a few things to think about.
Your test is flawed (I think). You are contrasting a scenario where you have an exact text match (the index can be used on original_data; nothing is going to outperform this scenario) with an equality on the FTS3 table (I'm not sure that FTS3 would even come into play in this type of query). If you want to compare apples to apples (to see the benefit of FTS3), you want to compare a LIKE operation on original_data against the FTS3 MATCH operation on data.
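For example, an apples-to-apples comparison along those lines (using the table and column names from the question) would be:
-- substring search on the ordinary table: the index on originalcodes cannot be used
SELECT * FROM original_data WHERE originalcodes LIKE '%RH12%';

-- token lookup on the FTS3 table: uses the built-in full-text index
SELECT * FROM data WHERE codes MATCH 'RH12';
The two are not exactly equivalent: MATCH finds rows containing the token RH12, while LIKE '%RH12%' matches any substring.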

How to speed up sqlite search through over 500,000 rows of international characters

I have a table called search_terms with a column called term of type TEXT. There are also columns called id and popularity. The term column contains international characters as well as ASCII characters (Japanese and English).
I'm trying to search through this table quickly using sqlite. Unfortunately, searches are taking well over 5 seconds. I search with something similar to:
SELECT id from search_terms where term LIKE 'ka%' order by popularity
I understand the SQLite LIKE operator is slow because it doesn't take advantage of indexes (bummer). Also, FTS can't help me here because I'm not searching for full words; the search starts with the first letter and may continue as the user types (a live-search paradigm).
Other things to note. The data in the database is static. It won't change. I can add tables to speed things up, possibly, I'd just like some suggestions. This is heading into an embedded device, so it needs to be as quick as possible. Assume space is not an issue.
According to the documentation, SQLite can use an index for a prefix LIKE pattern such as 'ka%', but you need to create an index of course, and the LIKE optimization only kicks in under certain conditions (see the "LIKE optimization" section of the query optimizer documentation).
SQLite FTS does support prefix searches:
MATCH 'ka*'
instead of
LIKE 'ka%'
assuming the table is changed to an FTS table.
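A sketch of that approach, assuming the terms are copied from search_terms into an FTS table (the FTS table name and the join back through rowid are illustrative):
CREATE VIRTUAL TABLE search_terms_fts USING fts3(term);

-- copy the static data once, keeping the original id as the rowid
INSERT INTO search_terms_fts (rowid, term)
SELECT id, term FROM search_terms;

SELECT s.id
FROM search_terms_fts f
JOIN search_terms s ON s.id = f.rowid
WHERE f.term MATCH 'ka*'           -- prefix query
ORDER BY s.popularity;
The default FTS tokenizer is geared towards ASCII text, so the Japanese terms may need a different tokenizer (for example the ICU tokenizer) to be searchable this way.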
If it is possible, use indexes.
create clustered index ix1 on TABLE (term)
or
create nonclustered index ix1 on TABLE (term)
Try It.

efficiency of SQL 'LIKE' statement with large number of clauses

I need to extract information from a text field which can contain one of many values. The SQL looks like:
SELECT fieldname
FROM table
WHERE bigtextfield LIKE '%val1%'
OR bigtextfield LIKE '%val2%'
OR bigtextfield LIKE '%val3%'
.
.
.
OR bigtextfield LIKE '%valn%'
My question is: how efficient is this when the number of values approaches the hundreds, and possibly thousands? Is there a better way to do this?
One solution would be to create a new table/column with just the values I'm after and doing the following:
SELECT fieldname
FROM othertable
WHERE value IN ('val1', 'val2', 'val3', ... 'valn')
I imagine this is a lot more efficient, as it only has to do exact string matching. The problem with this is that it will be a lot of work to keep this table up to date.
BTW, I'm using MS SQL Server 2005.
This functionality is already present in most SQL engines, including MS SQL Server 2005. It's called full-text indexing; here are some resources:
developer.com: Understanding SQL Server Full-Text Indexing
MSDN article: Introduction to Full-Text Search
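As a hedged sketch of what full-text indexing looks like on SQL Server 2005 (the catalog name, table name, and primary-key index name are placeholders):
CREATE FULLTEXT CATALOG ft_catalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.MyTable (bigtextfield)
    KEY INDEX PK_MyTable;          -- must reference a single-column unique index

-- one predicate instead of hundreds of ORed LIKE clauses:
SELECT fieldname
FROM dbo.MyTable
WHERE CONTAINS(bigtextfield, '"val1" OR "val2" OR "val3"');
Note that CONTAINS matches whole words (or prefixes such as "val*"), not arbitrary substrings, so it is not a drop-in replacement for LIKE '%val%'.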
I don't think the main problem is the number of criteria values - but the sheer fact that a WHERE clause with bigtextfield LIKE '%val1%' can never really be very efficient - even with just a single value.
The trouble is the fact that if you have a placeholder like "%" at the beginning of your search term, all the indices are out the window and cannot be used anymore.
So you're basically just searching each and every entry in your table, doing a full table scan in the process. Your performance then basically just depends on the number of rows in your table...
I would support intgr's recommendation - if you need to do this frequently, have a serious look at fulltext indexing.
This will inevitably require a full scan (over the table or over an index) with a filter.
The IN condition won't help here, since it does not work with LIKE.
You could do something like this:
SELECT *
FROM master
WHERE EXISTS
(
SELECT NULL
FROM values
WHERE name LIKE '%' + value + '%'
)
, but this will hardly be more efficient.
All literal conditions will be transformed into a CONSTANT SCAN, which is just like selecting from the same table, but built in memory.
The best solution is to redesign: get rid of the field that stores multiple values and move them into a related table instead. Storing multiple values in one field violates one of the first rules of database design, and dead-slow queries like this are the reason why.
If you can't do that, then full-text indexing is your only hope.