I am looking for an efficient way to query a PostgreSQL database by removing the right-most character in a string until a match is found. For example, if my dialing number is 442079285200, it should strip characters from the end of the sequence, eventually matching UNITED KINGDOM-LONDON on 44207.
442079285200 -> No match
44207928520 -> No match
4420792852 -> No match
442079285 -> No match
44207928 -> No match
4420792 -> No match
442079 -> No match
44207 -> Matches UNITED KINGDOM-LONDON
v_destination_rates

destination            | dialing_code | current_rate | rounding_rule
INMARSAT               | 870          | 10.8239      | 1-1-1
INTERNATIONAL NETWORKS | 882          | 10.8239      | 1-1-1
INTERNATIONAL NETWORKS | 883          | 10.8239      | 1-1-1
IRIDIUM                | 521844207    | 5.1167       | 1-1-1
UNITED KINGDOM-LONDON  | 44207        | 0.0056       | 1-1-1
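The stripping walk described above can be sketched in plain Python (the rates dict is a stand-in for the v_destination_rates table, using the sample data from the question):

```python
# Longest-prefix lookup by stripping the right-most character until a match.
# The rates dict stands in for the v_destination_rates table.
rates = {
    "870": "INMARSAT",
    "882": "INTERNATIONAL NETWORKS",
    "883": "INTERNATIONAL NETWORKS",
    "521844207": "IRIDIUM",
    "44207": "UNITED KINGDOM-LONDON",
}

def match_destination(dialing_number):
    """Strip the right-most character until a dialing code matches."""
    candidate = dialing_number
    while candidate:
        if candidate in rates:
            return rates[candidate]
        candidate = candidate[:-1]   # drop the right-most character
    return None

print(match_destination("442079285200"))  # UNITED KINGDOM-LONDON
```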
I know one way of doing this is to loop over the number of characters in the dialing number (n) and run a select query on the left-most n characters. I haven't successfully run my query yet, but I believe it would look something like:
DO
$do$
DECLARE
    m varchar := '442079285200';
    dest varchar;
BEGIN
    -- REVERSE is needed to count down from length(m) to 1
    FOR counter IN REVERSE length(m)..1 LOOP
        -- inside PL/pgSQL, SELECT needs an INTO target (or PERFORM)
        SELECT destination INTO dest
        FROM v_destination_rates
        WHERE dialing_code = left(m, counter);
        EXIT WHEN dest IS NOT NULL;  -- stop at the first (longest) match
    END LOOP;
END
$do$;
I'm wondering if there is a more efficient way of performing this query, perhaps with the LIKE wildcard operator? We have thousands of dialing numbers to match against approximately 20,000 dialing_codes, so a less expensive operation would be preferred.
You haven't said whether dialing_number is coming from a table / sanitized user input / something else.
For simplicity I'll assume it comes from a table contacts and that you want to return everything in contacts and every column in v_destination_rates joined as you describe.
Without using PL/pgSQL:
SELECT
*
FROM contacts c
LEFT JOIN v_destination_rates vdr
ON c.dialing_number::TEXT LIKE vdr.dialing_code::TEXT || '%'
I've tested this on a table of 9,000 records, which I assume is about as large as or larger than the lookup table v_destination_rates, and matched 16 sample numbers in under a tenth of a second.
You could possibly get even better performance if the dialing code is already type TEXT, and indexed lexicographically since that's how you're searching here.
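As a quick sanity check of the join-on-prefix idea, here is a minimal sketch using an in-memory SQLite database; the contacts table is the hypothetical source of dialing numbers assumed by the answer:

```python
import sqlite3

# Sketch of the set-based prefix join, with the dialing number on the left
# of LIKE so that it matches when it starts with a dialing_code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v_destination_rates (destination TEXT, dialing_code TEXT);
    INSERT INTO v_destination_rates VALUES
        ('IRIDIUM', '521844207'),
        ('UNITED KINGDOM-LONDON', '44207');
    CREATE TABLE contacts (dialing_number TEXT);
    INSERT INTO contacts VALUES ('442079285200'), ('5218442071');
""")
rows = conn.execute("""
    SELECT c.dialing_number, vdr.destination
    FROM contacts c
    LEFT JOIN v_destination_rates vdr
      ON c.dialing_number LIKE vdr.dialing_code || '%'
""").fetchall()
for number, destination in rows:
    print(number, "->", destination)
```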
I generally avoid regular expressions and LIKE/SIMILAR TO searches whenever possible; they are slow. In this case they are completely avoidable: instead you can use substring and length and do an equality match (although at two function calls versus one regex, it is arguably a toss-up). The following does just that, and when multiple matches occur it selects the longest match. (see demo)
with dialing_number (dn) as
( values ('442079285200') )
select dr.*
from dialing_number
join v_destination_rates dr
on dr.dialing_code = substring(dn,1, length(dr.dialing_code))
order by length(dr.dialing_code) desc
limit 1 ;
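The same substring/length join can be reproduced in an in-memory SQLite database (SQLite's substr and length behave like their PostgreSQL counterparts here); I've added a hypothetical shorter '44' code to show the longest match winning:

```python
import sqlite3

# The substring/length join, with ORDER BY length(...) DESC LIMIT 1
# picking the longest matching dialing code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v_destination_rates (destination TEXT, dialing_code TEXT);
    INSERT INTO v_destination_rates VALUES
        ('UNITED KINGDOM', '44'),
        ('UNITED KINGDOM-LONDON', '44207'),
        ('IRIDIUM', '521844207');
""")
row = conn.execute("""
    SELECT dr.destination, dr.dialing_code
    FROM (SELECT '442079285200' AS dn) dialing_number
    JOIN v_destination_rates dr
      ON dr.dialing_code = substr(dn, 1, length(dr.dialing_code))
    ORDER BY length(dr.dialing_code) DESC
    LIMIT 1
""").fetchone()
print(row)  # ('UNITED KINGDOM-LONDON', '44207')
```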
Concerning performance, a search set of 20,000 entries is very small. Out of curiosity, I generated random dialing_code values until the above query took more than one second; that occurred at 4,755,006 rows searched, in 1.07 s.
I'm wondering if there is a more efficient way of performing this query
Yes, parse the number to get the country and dialing code. There are any number of existing libraries to do this. Then concatenate them and search.
For example, 442079285200 is the country code 44 and the dialing code 20 (207 is obsolete). Then you'd search for '4420'.
Note: 870, 882, and 883 are not dialing codes; they are country codes. And Iridium is 881. Mixing up country codes with dialing codes will probably cause more problems down the road, so you may be better off separating them in your table.
Related
I need to execute a query that will join two tables on fields named a.PatientAddress and b.ADDRESS. The issue is that b.ADDRESS needs to be standardized and formatted to match the standardized address found in a.PatientAddress. I don't have control over the incoming data format, so having the data scrubbed before it comes into my b table is not an option. Example:
a.PatientAddress may equal something like 1234 Someplace Cool Dr. Apt 1234 while the matching address in b.ADDRESS may equal something like 1234 Someplace Cool Dr. #1234 (in reality that is just one of many possibilities). The Apartment number (if existent in the address) is the area of fluctuation that needs formatting in order to join properly.
Some possible Apt variations I've seen in the data set are:
1234 Someplace Cool Dr. #1234
1234 Someplace Cool Dr. Apt 1234
1234 Someplace Cool Dr. Apt #1234
1234 Someplace Cool Dr. Apt # 1234
Now, for what I've already tried:
SELECT vgi.VisitNo
,vgi.AdmitDate
,vgi.ChargesTotal
,MONTH(vgi.AdmitDate) AS AdmitMonth
,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
,YEAR(vgi.AdmitDate) AS AdmitYear
,vgi.PatientAddress
,mm.MAIL_DATE
,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN MARKETING_MAILING mm ON vgi.AdmitDate >= mm.MAIL_DATE
AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
AND (
-- IF APT IS NOT FOUND, THEN ADDRESS SHOULD DIRECTLY EQUAL ANOTHER ADDRESS
( mm.ADDRESS NOT LIKE '%[$0-9]'
AND UPPER(vgi.PatientAddress) = UPPER(mm.ADDRESS)
)
OR
(
mm.ADDRESS LIKE '%[$0-9]'
AND UPPER(vgi.PatientAddress) =
-- PATIENT ADDRESS SHOULD EQUAL THE FORMATTED ADDRESS OF THE MAIL RECIPIENT
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(mm.ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS)-1)),'#',''),mm.ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS)-1)),'#','')
)
)
The problem here is that the query takes 20+ minutes to execute, and sometimes doesn't even finish before the operation time expires. I've also tried splitting the two conditions up into UNION statements. I've also tried splitting the street address and apartment number to create a like statement that reads UPPER(vgi.PatientAddress) LIKE UPPER('%1234 Someplace Cool Dr.%1234%') and that doesn't seem to work either. I'm starting to run out of ideas and wanted to see what others could suggest.
Thanks in advance for any pointers or help!
The logic needed to scrub the data is beyond the scope of what we can do for you. You'll likely find that, ultimately, you need some other key for this query to ever work. However, assuming your existing logic is adequate to create a good match (even if slow), we might be able to help improve performance a bit.
One way you can improve things is to join on a projection of the address table that cleans the data. (That means join to a sub query). That projection might look like this:
SELECT Mail_Date, Address,
CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#',''),ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#','')
ELSE UPPER(ADDRESS)
END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
This improves things because it avoids the "OR" condition in your JOIN; you simply match to the projected column. However, this will force the projection over every row in the table (hint: that was probably happening anyway), and so it's still not as fast as it could be. You can get an idea for whether this will help from how long it takes to run the projection by itself.
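To illustrate the shape of that join, here is a simplified sketch with an in-memory SQLite database; the clean() function is a hypothetical stand-in for the CHARINDEX/SUBSTRING expression, normalizing only the apartment-number variants listed in the question:

```python
import re
import sqlite3

def clean(address):
    # Normalize "Apt # 1234", "Apt #1234", and "#1234" to "Apt 1234".
    return re.sub(r"(?:Apt\s*)?#\s*(\d+)$", r"Apt \1", address)

conn = sqlite3.connect(":memory:")
conn.create_function("clean", 1, clean)
conn.executescript("""
    CREATE TABLE marketing_mailing (address TEXT);
    INSERT INTO marketing_mailing VALUES
        ('1234 Someplace Cool Dr. #1234'),
        ('1234 Someplace Cool Dr. Apt # 1234');
    CREATE TABLE visits (patient_address TEXT);
    INSERT INTO visits VALUES ('1234 Someplace Cool Dr. Apt 1234');
""")
# Join through a projection that carries the cleaned column, so the JOIN
# itself is a plain equality with no OR branches.
rows = conn.execute("""
    SELECT v.patient_address, mm.address
    FROM visits v
    JOIN (SELECT address, clean(address) AS address_clean
          FROM marketing_mailing) mm
      ON v.patient_address = mm.address_clean
""").fetchall()
print(len(rows))  # both mailing variants clean to the same address
```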
You can further improve on the projection method by adding the ADDRESS_CLEAN column above as a computed column to your MARKETING_MAILING table. This forces the adjustment to happen at insert time, meaning the work is already done for your slow query. You can even index the column, though at the cost of slower inserts. You might also try a view (or indexed view) on the table, which helps SQL Server save some of the work of computing that extra column across multiple queries. For best results, also think about what WHERE filters you can apply at the time you create the projection, to avoid needing to ever compute the extra column on those rows in the first place.
An additional note: with the default (case-insensitive) collation, you can skip the UPPER() function. It is likely hurting your index use.
Put it all together like this:
SELECT vgi.VisitNo
,vgi.AdmitDate
,vgi.ChargesTotal
,MONTH(vgi.AdmitDate) AS AdmitMonth
,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
,YEAR(vgi.AdmitDate) AS AdmitYear
,vgi.PatientAddress
,mm.MAIL_DATE
,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN
(
SELECT Mail_Date, Address,
CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#',''),ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#','')
ELSE ADDRESS END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
) mm ON vgi.AdmitDate >= mm.MAIL_DATE
AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
AND vgi.PatientAddress = mm.ADDRESS_CLEAN
Another huge factor not yet covered is indexes. What indexes are on your VISIT_GENERAL_INFORMATION table? I'd especially like to see a single index covering both AdmitDate and PatientAddress. Which order they belong in is determined by the cardinality of those fields, and by how clean the MARKETING_MAILING data is and how much of it there is.
Finally, one request of my own: if this helps, I'd like to hear back on just how much it helped. If the query used to take 20 minutes, how long does it take now?
I agree with @TomTom that you would really benefit from "pre-standardizing" into either
a derived column that updates on the fly
or a view or just a temp table in your query process
that gives you a clean match.
And with that, I would use a third-party service or library, ideally, because they have spent a lot of time making it a reliable parse.
Either option works after receiving the data you can't control, so that is not a problem.
What you're doing is creating your own, internal copy that is standardized.
Of course, you're going to need to run the other side, "a", through the same standardization.
Suppose I have a table like this
id | data
1  | 0001
2  | 1000
3  | 2010
4  | 0120
5  | 0020
6  | 0002
sql fiddle demo
id is the primary key; data is a fixed-length string whose characters can be 0, 1, or 2.
Is there a way to build an index so I can quickly find strings which differ from a given string by n characters? E.g. for the string 0001 and n = 1, I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
SQL Fiddle.
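For intuition, here is a pure-Python sketch of the distance the fuzzystrmatch query computes (a textbook dynamic-programming Levenshtein, not the Postgres implementation), filtering the question's sample rows the same way:

```python
# Textbook dynamic-programming Levenshtein distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

rows = ["0001", "1000", "2010", "0120", "0020", "0002"]
matches = [r for r in rows if levenshtein(r, "0001") == 1]
print(matches)  # ['0002']  -- row 6, as in the question
```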
But it is not very fast: slow with big tables, because it can't use an index.
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index, as detailed in the linked manual pages. I did not get anywhere with those, though; the module uses a different definition of "similarity".
Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND (data LIKE '_110' OR
data LIKE '1_10' OR
data LIKE '11_0' OR
data LIKE '111_');
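Writing those wildcard patterns out by hand gets tedious; they can be generated. A small Python sketch (just the pattern list, not the query):

```python
# Generate the LIKE patterns used above: one pattern per character position,
# each with that single position replaced by the one-character wildcard "_".
def one_off_patterns(s):
    return [s[:i] + "_" + s[i + 1:] for i in range(len(s))]

print(one_off_patterns("1110"))  # ['_110', '1_10', '11_0', '111_']
```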
Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.
I tested with 10k and 100k rows, with and without a trigram GIN index. Since ~19% of rows match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5% of the rows (it depends), a query using an index is (much) faster.
I created a table with two columns and inserted two rows:
id | name
1  | narsi reddy
2  | narei sia
One column is a simple number type and the other is a CLOB, so I decided to build a text index on it. I queried it using CONTAINS.
query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected both rows to be returned, but only row 2 came back. If I use the same pattern with LIKE, it gives me what I wanted.
query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
Finally I understood that LIKE searches the whole document or paragraph, whereas CONTAINS looks at individual words.
So how can I get the required output?
LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher: it recognises two wildcards, % and _, which match zero-or-more characters and exactly-one character respectively. In your case, %a%e% matches two records in your table: it looks for zero or more characters, followed by a, followed by zero or more characters, followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched", with no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.
I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.
Use case:
The user searches for a partial postcode such as 'RG20' which should then be displayed in a specific order. The query uses the MATCH AGAINST method in boolean mode where an example of the postcode in the database would be 'RG20 7TT' so it is able to find it.
At the same time it also matches against a list of other postcodes which are in its radius (which is a separate query).
I can't seem to find a way to order by a partial match, e.g.:
ORDER BY FIELD(postcode, 'RG20', 'RG14', 'RG18','RG17','RG28','OX12','OX11')
DESC, city DESC
Because it's not specifically looking for RG20 7TT, I don't think it can make a partial match.
I have tried SUBSTR (postcode, -4) and looked into left and right, but I haven't had any success using 'by field' and could not find another route...
Sorry this is a bit long winded, but I'm in a bit of a bind.
A UK postcode splits into 2 parts, the last section always being 3 characters and within my database there is a space between the two if that helps at all.
Although there is a DESC after the postcodes, I do need them to display in THAT particular order (RG20, RG14 then RG18 etc..) I'm unsure if specifying descending will remove the ordering or not
Order By Case
When postcode Like 'RG20%' Then 1
When postcode Like 'RG14%' Then 2
When postcode Like 'RG18%' Then 3
When postcode Like 'RG17%' Then 4
When postcode Like 'RG28%' Then 5
When postcode Like 'OX12%' Then 6
When postcode Like 'OX11%' Then 7
Else 99
End Asc
, City Desc
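Here is the CASE-based ordering run against an in-memory SQLite table with a few sample postcodes (shortened to three branches for brevity):

```python
import sqlite3

# ORDER BY CASE assigns each postcode prefix an explicit rank, so rows come
# back in the custom order rather than alphabetically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (postcode TEXT);
    INSERT INTO addresses VALUES ('OX12 1AB'), ('RG14 2CD'), ('RG20 7TT');
""")
rows = conn.execute("""
    SELECT postcode FROM addresses
    ORDER BY CASE
        WHEN postcode LIKE 'RG20%' THEN 1
        WHEN postcode LIKE 'RG14%' THEN 2
        WHEN postcode LIKE 'OX12%' THEN 6
        ELSE 99 END ASC
""").fetchall()
print([r[0] for r in rows])  # ['RG20 7TT', 'RG14 2CD', 'OX12 1AB']
```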
You're on the right track, trimming the field down to its first four characters:
ORDER BY FIELD(LEFT(postcode, 4), 'RG20', 'RG14', ...),
-- or SUBSTRING(postcode FROM 1 FOR 4)
-- or SUBSTR(postcode, 1, 4)
Here you don't want DESC.
(If your result set contains postcodes whose prefixes do not appear in your FIELD() ordering list, you'll have a bit more work to do, since those records will otherwise appear before any explicitly ordered records you specify. Before 'RG20' in the example above.)
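The FIELD(LEFT(postcode, 4), ...) idea can be expressed as a Python sort key for illustration, including the caveat about unlisted prefixes (FIELD returns 0 when the value is absent, so those rows sort first):

```python
# Rank each postcode by the position of its four-character prefix in the
# desired ordering list, mimicking MySQL's FIELD(LEFT(postcode, 4), ...).
order = ["RG20", "RG14", "RG18", "RG17", "RG28", "OX12", "OX11"]

def field_rank(postcode):
    prefix = postcode[:4]
    # FIELD() is 1-based and returns 0 for values not in the list,
    # which is why unknown prefixes would sort before everything else.
    return order.index(prefix) + 1 if prefix in order else 0

codes = ["OX11 9ZZ", "RG20 7TT", "RG14 2CD"]
print(sorted(codes, key=field_rank))  # ['RG20 7TT', 'RG14 2CD', 'OX11 9ZZ']
```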
If you want a completely custom sorting scheme, then I only see one way to do it...
Create a table to hold the values upon which to sort, and include a "sequence" or "sort_order" field. You can then join to this table and sort by the sequence field.
One note on the sequence field. It makes sense to create it as an int as... well, sequences are often ints :)
If there is any possibility of changing the sort order, you may want to consider making it alphanumeric... It is a lot easier to insert "5A" between "5" and "6" than it is to insert a number into a sequence of integers.
Another method I use is utilising the charindex function:
order by charindex(substr(postcode,1,4),"RG20RG14RG18...",1)
I think that's the syntax anyway, I'm just doing this in SAS at the moment so I've had to adapt from memory!
But essentially the sooner you hit your desired part of the string, the higher the rank.
If you're trying to rank on a large variety of postcodes then a case statement gets pretty hefty.
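The CHARINDEX idea translates directly to Python's str.find: the earlier the prefix appears in the concatenated priority string, the higher the rank. A small sketch:

```python
# Rank postcodes by where their four-character prefix appears in a single
# concatenated priority string, as with CHARINDEX above.
priority = "RG20RG14RG18RG17RG28OX12OX11"

def rank(postcode):
    pos = priority.find(postcode[:4])
    return pos if pos >= 0 else len(priority)  # unknown prefixes sort last

codes = ["RG18 1AA", "RG20 7TT", "OX11 9ZZ"]
print(sorted(codes, key=rank))  # ['RG20 7TT', 'RG18 1AA', 'OX11 9ZZ']
```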
Given that I have a table with a column of TEXT in it (MySQL or SQlite) is it possible to use the value of that column in a way that I could find similar rows with somewhat related text values?
For example, if I wanted to find related rows to row_3, both 1 & 2 would match:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
I know that I could use FULLTEXT or FTS3 if I had a key word I wanted to MATCH against the column values - but I'm just trying to find text that is somewhat related among the rows.
MySQL supports a fulltext search option called QUERY EXPANSION. The idea is that you search for a keyword, it finds a row, and then it uses the words in that row as keywords, to search for more matching rows.
SELECT ... FROM StudiesTable WHERE MATCH(description_text)
AGAINST ('sports' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION);
Read about it here: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
You're using the wrong hammer to pound that screw in. A single string in a database column isn't the way to store that data. You can't easily get at the part you care about, which is the individual words.
There is a lot of research into the problem of comparison of text. If you're serious about this need, you'll want to start reading about the variety of techniques in that problem domain.
The first clue is that you want to access / index the data not by complete text string, but by word or sentence fragment (unless you're interested in words that are spelled similarly being matched together, which is harder).
As an example of one technique, generate a chain out of your sentences by grabbing overlapping sets of three words, and store the chain. Then you can search for entries that have a large number of chain segments in common. A set of chain segments for your statements above would be:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports

this is about    (3 matches)
is about sports
is about study   (2 matches)
about study and
study and sports
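The chain-segment idea above can be sketched in a few lines of Python: split each row into overlapping three-word segments and count the segments shared with a base row.

```python
# Word-trigram "chains": overlapping three-word segments of each sentence.
def chains(text):
    words = text.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

rows = {
    "row_1": "this is about sports",
    "row_2": "this is about study",
    "row_3": "this is about study and sports",
}
base = chains(rows["row_3"])
overlap = {name: len(chains(text) & base)
           for name, text in rows.items() if name != "row_3"}
print(overlap)  # {'row_1': 1, 'row_2': 2}
```

As the counts show, row_2 shares more chain segments with row_3 than row_1 does, so it would rank as the more related row.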
Maybe it would be enough to take each relevant word in the base row (more than 4 letters? or comparing against a list of common words?), use them as keywords for the fulltext search, and build a tmp table (id, row_matched_id, count) to record the matches for each row, adding 1 to count on each match. At the end the tmp table holds all the rows that matched and how many times they matched (i.e. how many relevant words were the same). If you want to run it once against the whole database and keep the results, use a persisted table, add a column for the id of the base row, and run the search for each new row inserted (or updated) to update the results table.
Using this results table you can quickly find the rows matching the most words of the base row without doing the search again.
Edit: with this you can "score" the results. For example, if you count x relevant words in the base row, you can calculate a score in % as (matches / x * 100) and filter out all results with, say, less than 50% matches. In your example, row_1 and row_2 would each give 50% if you consider relevant only words with more than 4 letters, or 67% if you consider all the words.