Testing phrases to see if they match each other - sql

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?

An ideal algorithm for doing sub-string matching is AhoCorsick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an AhoCorsick implementation in Java:

It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phases to testing": because the individual words would match.. but because the order is different the pairs wouldn't, so it's taking phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, good for use as a 'score' in matching.
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.

When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean with 'WHERE' statement previously? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you'll can remove them from the comparison table, as any next test you'll do will never get a match.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.

Related

Does redshift store failed queries?

I wish to analyze the queries executed on certain redshift warehouse (not mine).
In order to do so I'm using a query with a join on stl_querytext and stl_query.
My question is how come I'm also getting illegal queries (I.E queries with wrong sql syntax)?
When I've tried it in my local redshift I haven't seen those. Also, couldn't find relevant documentation.
Is this a configuration issue? And in case I'm supposed to those queries is there a way to know those are illegal ones?
Thanks,
Nir.
So stl_querytext breaks long queries into parts identified by sequence number. I hope you are reconstructing the parts into the original query as a first step. This can be done with listagg() function as long as the resulting query doesn't over the max tex field (about 320 parts).
Now this is not enough to get valid SQL back in all cases because you need to treat combining the parts differently depending if the section of the query is inside or outside a text string in the query. (Is white space needed between parts or not)
I've done this exact process a bunch so this is doable. I don't have a perfect process on the whitespace question, I get close. Maybe others know the expression to get an exact recreation of the query from stl_querytext.

Is using comma separated field good or not

I have a table named buildings
each building has zero - n images
I have two solutions
the first one (the classic solution) using two tables:
buildings(id, name, address)
building_images(id, building_id, image_url)
and the second solution using olny one table
buildings(id, name, address, image_urls_csv)
Given I won't need to search by image URL obviously,
I think the second solution (using image_urls_csv column) is easier to use, and no need to create another table just to keep the images, also I will avoid the hassle of multiple queries or joining.
the question is, if I don't really want to filter, search or group by the filed value, can I just make it CSV?
On the one hand, by simply having a column of image_urls_list avoids joins or multiple queries, yes. A single round-trip to the db is always a plus.
On the other hand, you then have a string of urls that you need to parse. What happens when a URL has a comma in it? Oh, I know, you quote it. But now you need a parser that is beyond a simple naive split on commas. And then, three months from now, someone will ask you which buildings share a given image, and you'll go through contortions to handle quotes, not-quotes, and entries that are at the beginning or end of the string (and thus don't have commas on either side). You'll start writing some SQL to handle all this and then say to heck with it all and push it up to your higher-level language to parse each entry and tell if a given image is in there, and find that this is slow, although you'll realise that you can at least look for %<url>% to limit it, ... and now you've spent more time trying to hack around your performance improvement of putting everything into a single entry than you saved by avoiding joins.
A year later, someone will give you a building with so many URLs that it overflows the text limit you put in for that field, breaking the whole thing. Or add some extra fields to each for extra metadata ("last updated", "expires", ...).
So, yes, you absolutely can put in a list of URLs here. And if this is postgres or any other db that has arrays as a first-class field type, that may be okay. But do yourself a favour, and keep them separate. It's a moderate amount of up-front pain, and the long-term gain is probably going to make you very happy you did.
Not
"Given I won't need to search by image URL obviously" is an assumption that you cannot make about a database. Even if you never do end up searching by url, you might add other attributes of building images, such as titles, alt tags, width, height, etc, so you would end up having to serialize all this data in that one column, and then you would not be able to index any of it. Plus, if you serialize it with one language, then you or whoever comes after you using a different language will either have to install some 3rd party library to deserialize your stuff or write their own deserialization function.
The only case that I can think of where you should keep serialized data in a database is when you inherit old software that you don't have time to fix yet.

Clean unstructured place name to a structured format

I have around 300k unstructured data as below screen.I'm trying to use Google refine or OpenRefine to make this correct. However, I'm unable to find a proper way to do this. I'm new to this tool. Anyone's help would be greatly appreciated.Also, this tool is quite slow to process 300k records. If I am trying out something its taking lots of time to process and give an output.
OR Please suggest any other opensource tools and techniques do this?
As Owen said in comments, your question is probably too broad and cannot receive acceptable answer. We can just provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US". Not even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure is an organisation name).
On the basis of your fifteen lines, however, we seem to distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character suites without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 numbers and one - between them)
Feel free to check out the Open Refine documentation in case of doubt.

SQL DIFFERENCE function with names bringing too many results

I have a function that uses the SQL DIFFERENCE function to see if the name of a client is similar to a client already in the database
SELECT ID FROM People p
WHERE DIFFERENCE(p.FullName, #fullName) = 4
Being #fullname a variable passed to the function. The issue I'm having is that if I pass "pedro sanchez" as a parameter, the query will bring me all the Peter's in the database, or if I enter "pablo sanchez", it'll bring record "PEOPLE'S CREDIT UNION".
As I understand the DIFFERENCE function should returns 4 when the two strings are almost identical, but the results I'm having say otherwise.
Is there a way to further specify the resemblance to the DIFFERENCE function, or maybe another approach in finding similar names ?
Difference() is based on soundex(), which in turn -- to be frank -- is a lousy system for comparing strings. Let me add a caveat: it is pretty good for the purpose it was designed for, which is matching last names of people in English. You can read about the rules here and you can try it out here. Using the latter link, you can see that "Pedro" and "People" have the same code, P-140.
Soundex encodes the consonants and basically the first four matching consonants the list it cares about. (Some languages, such as Hawaiian and other Polynesian languages are rather light in consonants. One assumes the designers were not thinking about names in such languages.)
When you are looking for proximity among written strings, Levenshtein distance is a common metric. Unfortunately, SQL Server does not have this functionality built-in, but you can easily find implementations on the web. For most real applications, Levenshtein distance is too slow. Happily, the functionality of the full text search component is usually sufficient for most purposes.

SQL Server Text Searching

I have a business requirement where we need to do somce crazy name matching against records stored in the database and I was wondering if there is any easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it's other way round, I could've used wild character but in this case, the name in the database is smaller than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons. The way they store their data internally just isn't fit for really messy matching. Actually even a superpowerful dedicated search engine like Google, that has a LOT of messy matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot the feature request with "technically impossible", or implement a rule set for which messy matches are tried - for your example, you could easily 'hard code' that multiple searches are executed when capitalized words are entered, shortening them so a single letter. No idea if that's a solution to your problem though.
You can do a normal search using the LIKE operator which determines whether a specific character string matches a specified pattern. The problem you will run into is the probability of the returning of multiple records or incorrect people. I've had similar requirement myself for a business app and the best solution to the issue is to require other qualifying values rather then just name. If you do a partial name search without other qualifying data you are certainly going to come across the false positive matches and/or multiple records. In my case I built a web service that checks eligibility allowing text search for first & last name but also added date of birth, primary person SSN, and gender which ensured the matching person was in deed the person intended to search for. If my situation was like yours in which name was the only search criteria my recommendation to the business would be we cannot perform the search until qualifying data is entered into the database otherwise there is no accurate way to query the results they are looking for.