Find similar entries in SQL column and rank by frequency - sql

I have a column of 10k URIs in my SQLite database. I would like to identify which of these URIs are subdomains of the same website.
For instance, for the given set...
1. daiquiri.rum.cu
2. mojito.rum.cu
3. cubalibre.rum.cu
4. americano.campari.it
5. negroni.campari.it
6. hemingway.com
... I would like to run a query that returns:
Website | Occurrences
----------------------------
rum.cu | 3
campari.it | 2
hemingway.com | 1
That is, the domain names / patterns that were matched, ranked by the number of times they were found in the database.
The heuristic I would use is: for every URI with 3+ dot-separated parts, replace the first part with '%' and execute the pseudoquery: COUNT(uris from website where uris LIKE '%.remainderofmyuri').
Note that I don't care much about execution speed (in fact, not at all). The number of entries is within the range of 10k-100k.

The only problem is finding the domain. To arrive at an algorithm, imagine your URLs with an additional dot in front (like '.negroni.campari.it' and '.hemingway.com'). You can then see that the domain is always the string that comes after the second dot from the right. All we have to do is look for that occurrence and strip the part of the string before it. Unfortunately, SQLite's string functions are rather limited: there is no function that gives you the second occurrence of a dot, not even when counting from the left. So the algorithm works well in most DBMSs, but not in SQLite, and we need another approach. (I am writing this anyhow, to show how one would usually approach the problem.)
Here is the SQLite solution: The difference between a domain and a subdomain is that a domain contains exactly one dot, whereas a subdomain contains at least two. So when there is more than one dot, we must remove the first part, including the first dot, to get closer to the domain. Moreover, we want this to work even with subdomains like abc.def.geh.ijk.com, so we must do this recursively.
with recursive cte(uri) as
(
  select uri from uris
  union all
  select substr(uri, instr(uri, '.') + 1) as uri from cte where instr(uri, '.') > 0
)
select uri, count(*)
from cte
where length(uri) = length(replace(uri,'.','')) + 1 -- domains only
group by uri
order by count(*) desc;
Here we generate 'daiquiri.rum.cu', 'rum.cu', and 'cu' from 'daiquiri.rum.cu', and so on. So for every URI we get the domain (here 'rum.cu') plus some other strings. Finally, we filter with LENGTH to keep only the strings that have exactly one dot: the domains. The rest is GROUP BY and COUNT.
Here is the SQL fiddle: http://sqlfiddle.com/#!5/c1f35/37.
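If you want to try the query without the fiddle, here is a minimal sketch using Python's built-in sqlite3 module. The table `uris` and column `uri` come from the query above; the in-memory database, the helper name `count_domains`, and the extra `uri` tie-breaker in the ORDER BY are my own assumptions for illustration. Recursive CTEs need SQLite 3.8.3 or later.

```python
import sqlite3

QUERY = """
WITH RECURSIVE cte(uri) AS (
    SELECT uri FROM uris
    UNION ALL
    -- strip everything up to and including the first dot
    SELECT substr(uri, instr(uri, '.') + 1) FROM cte WHERE instr(uri, '.') > 0
)
SELECT uri, count(*) AS occurrences
FROM cte
WHERE length(uri) = length(replace(uri, '.', '')) + 1  -- exactly one dot
GROUP BY uri
ORDER BY occurrences DESC, uri
"""


def count_domains(uris):
    """Load the given host names into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE uris (uri TEXT)")
        conn.executemany("INSERT INTO uris (uri) VALUES (?)", [(u,) for u in uris])
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Sample rows taken from the question.
sample = [
    "daiquiri.rum.cu",
    "mojito.rum.cu",
    "cubalibre.rum.cu",
    "americano.campari.it",
    "negroni.campari.it",
    "hemingway.com",
]

for domain, occurrences in count_domains(sample):
    print(domain, occurrences)
```

Note that this treats the last two labels as the domain, so a host under something like co.uk would be grouped under 'co.uk'.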

select x.site, count(*)
from mytable a
inner join
(
select 'rum.cu' as site
union all select 'campari.it'
union all select 'hemingway.com'
) x on a.url like '%' + x.site + '%'
group by x.site -- EDIT I missed out the GROUP BY on the first go - sorry!
(This is how I'd do it in SQL-Server; not sure how SQLite differs in syntax.)
'mytable' is your table, which has a column called url containing 'mojito.rum.cu' etc. I haven't put the '%.' in the LIKE because that would miss hemingway.com. However, you could get around that by using this line instead:
) x on a.url like '%.' + x.site + '%' or a.url = x.site
You may not need the final + '%'. I put it in to catch URLs like 'hemingway.com/some-page.html'. If you don't have URLs like that, you can skip it.
EDIT for dynamic names
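select x.site, count(*)
from mytable a
inner join
(
  -- two or more dots: drop the first label and its dot
  select distinct substr(url, instr(url, '.') + 1) as site
  from mytable
  where url like '%.%.%'
  union
  -- exactly one dot: the url is already the site
  select distinct url
  from mytable
  where url like '%.%' and url not like '%.%.%'
) x on a.url like '%' || x.site || '%' -- SQLite concatenates with ||, not +
group by x.site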
Something like that should do it. I haven't tested that the INSTR() function is correct. You may need to add or subtract 1 from the offset it generates when you test it. It may not be the fastest query but it should work.
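For anyone who wants to check the offsets on real data, here is a rough sketch that runs a SQLite flavour of the dynamic query through Python's sqlite3 module. It assumes substr(url, instr(url, '.') + 1) to drop the first label, || for concatenation, and the stricter join condition (exact match, or preceded by a dot). The helper name `count_sites` is made up for the example, and the sample rows are the ones from the question.

```python
import sqlite3

DYNAMIC_QUERY = """
SELECT x.site, count(*) AS occurrences
FROM mytable a
JOIN (
    -- hosts with two or more dots: drop the first label
    SELECT DISTINCT substr(url, instr(url, '.') + 1) AS site
    FROM mytable
    WHERE url LIKE '%.%.%'
    UNION
    -- hosts with exactly one dot are already bare domains
    SELECT DISTINCT url
    FROM mytable
    WHERE url LIKE '%.%' AND url NOT LIKE '%.%.%'
) x ON a.url = x.site OR a.url LIKE '%.' || x.site
GROUP BY x.site
ORDER BY occurrences DESC, x.site
"""


def count_sites(urls):
    """Fill an in-memory mytable(url) and return (site, occurrences) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mytable (url TEXT)")
        conn.executemany("INSERT INTO mytable (url) VALUES (?)", [(u,) for u in urls])
        return conn.execute(DYNAMIC_QUERY).fetchall()
    finally:
        conn.close()


sample_urls = [
    "daiquiri.rum.cu",
    "mojito.rum.cu",
    "cubalibre.rum.cu",
    "americano.campari.it",
    "negroni.campari.it",
    "hemingway.com",
]
print(count_sites(sample_urls))
```

Because only one label is dropped, a host with four or more parts would produce an extra three-part "site" of its own; the sketch does not try to handle that.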

Related

Where clause to exclude specific email domains

I have a list of emails and I want to write a where statement, to exclude rows that only contain the email domains %#icloud.com or %#mac.com
For example, emails list looks like this:
abc#gmail.com; 123#hotmail.com
123#outlook.com; abc#icloud.com
123#icloud.com;
123#icloud.com; abc#mac.com
the desired output should look like this:
abc#gmail.com; 123#hotmail.com
123#outlook.com; abc#icloud.com (this row should be returned because it also contains '#outlook.com' which isn't on my exclude list)
Given that negative lookaheads are not supported, a way to achieve that is to remove the unwanted matches, and then look for an "any email left" match:
SELECT column1
,REGEXP_REPLACE(column1, '#((icloud)|(mac))\\.com', '') as cleaned
,REGEXP_LIKE(cleaned, '.*#.*\\.com.*') as logic
FROM VALUES
('abc#gmail.com; 123#hotmail.com'),
('123#outlook.com; abc#icloud.com'),
('123#icloud.com;'),
('123#icloud.com; abc#mac.com');
gives:
COLUMN1 | CLEANED | LOGIC
----------------------------
abc#gmail.com; 123#hotmail.com | abc#gmail.com; 123#hotmail.com | TRUE
123#outlook.com; abc#icloud.com | 123#outlook.com; abc | TRUE
123#icloud.com; | 123; | FALSE
123#icloud.com; abc#mac.com | 123; abc | FALSE
which can be merged into one line:
,REGEXP_LIKE(REGEXP_REPLACE(column1, '#((icloud)|(mac))\\.com'), '.*#.*\\.com.*') as logic
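The same remove-then-check idea can be tried outside the warehouse. Below is a small Python sketch, not Snowflake code: it uses the `re` module with the two patterns from the query, and the sample strings are copied from the question exactly as they appear there. The function name `has_other_domain` is just something I picked for the example.

```python
import re

# Same two patterns as in the SQL above, written as Python regexes.
EXCLUDED = re.compile(r"#(icloud|mac)\.com")
ANY_EMAIL_LEFT = re.compile(r"#.*\.com")


def has_other_domain(emails):
    """Strip the excluded domains, then check whether any address remains."""
    cleaned = EXCLUDED.sub("", emails)
    return ANY_EMAIL_LEFT.search(cleaned) is not None


rows = [
    "abc#gmail.com; 123#hotmail.com",
    "123#outlook.com; abc#icloud.com",
    "123#icloud.com;",
    "123#icloud.com; abc#mac.com",
]

for row in rows:
    print(row, "->", has_other_domain(row))
```

Like the SQL, this only recognises remaining addresses that end in .com.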
If you prefer a more vanilla approach than Simeon's solution:
where replace(replace(col,'#icloud.com',''), '#mac.com','') like '%#%'
In Snowflake, the replacement string is optional, which shortens that to
where replace(replace(col,'#icloud.com'), '#mac.com') like '%#%'
This is based on the string-split approach from SQL Server, using the split_to_table function; you will probably have to tweak the syntax a little:
select *
from t
where exists (
select *
from split_to_table(t.emails, ';') as sv
where sv.value not like '%#icloud.com'
and sv.value not like '%#mac.com'
)
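To see the split-and-test logic in isolation, here is a plain Python sketch (an illustration of the idea, not Snowflake syntax). One thing it makes visible: a trailing ';' as in '123#icloud.com;' leaves an empty piece after splitting, and an empty string passes both NOT LIKE tests, so the sketch skips empty pieces. Whether your split function produces such a piece is worth checking. The name `keep_row` is an assumption for the example.

```python
EXCLUDED_SUFFIXES = ("#icloud.com", "#mac.com")


def keep_row(emails):
    """True if at least one non-empty address is outside the excluded domains."""
    addresses = [part.strip() for part in emails.split(";")]
    return any(
        bool(address) and not address.endswith(EXCLUDED_SUFFIXES)
        for address in addresses
    )


rows = [
    "abc#gmail.com; 123#hotmail.com",
    "123#outlook.com; abc#icloud.com",
    "123#icloud.com;",
    "123#icloud.com; abc#mac.com",
]

kept = [row for row in rows if keep_row(row)]
print(kept)
```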

BigQuery - Using regexp with LIKE operator (?)

I'd like to get productids from url and I've almost finetuned a query to do it but still there is an issue I cannot solve.
The url usually looks like this:
/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/
or
/harry-potter-es-a-tuz-serlege-2019-m19247107/
As you can see there are two types of ids:
in general, ids start with '-p'
ids of some special products start with '-m'
I created this case when statement:
CASE
WHEN MAX(hits.page.pagePath) LIKE '%-p%'
THEN MAX(REGEXP_REPLACE(REGEXP_EXTRACT(
hits.page.pagePath, '-p[0-9]+/'), '\\-|p|/', ''))
WHEN MAX(hits.page.pagePath) LIKE '%-m%'
THEN MAX(REGEXP_REPLACE(REGEXP_EXTRACT(
hits.page.pagePath, '-m[0-9]+/'), '\\-|m|/', ''))
ELSE NULL
END AS productId
It looks a little complicated at first, but I really needed both regexp_replace and regexp_extract, because the '-p' or '-m' characters don't appear only before the id; they can occur multiple times in a URL.
The problem with my code is that there are some special cases when the url looks like this:
/elveszett-profeciak-2019-m17855487/
As you can see the id starts with '-m' but the url also contains '-p'. In this case the result is empty value in the query.
I think it could be solved by modifying the like operator in the when part of the case when statement: LIKE '%-p%' or LIKE '%-m%'
It would be great to have a regexp expression after or instead of the LIKE operator. Something similar to the parameter of '-p[0-9]+/' what I used in regexp_extract function.
So what I would need is to define in the when part of the statement that if the '-p' or '-m' text is followed by numbers in the urls
I'm not sure it's possible to do or not in BQ.
So what I would need is to define in the when part of the statement that if the '-p' or '-m' text is followed by numbers in the urls
I think you want '-p' and '-m' followed by digits. If so, I think this does what you want:
select regexp_extract(url, '-[pm][0-9]+')
from (select '/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/' as url union all
select '/elveszett-profeciak-2019-m17855487/' union all
select '/harry-potter-es-a-tuz-serlege-2019-m19247107/'
) x
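If you want to sanity-check the pattern before running it in BigQuery, here is a quick Python sketch with the `re` module. It adds a capture group around the digits and a trailing slash, so only the numeric id is returned; as far as I know REGEXP_EXTRACT also returns the capture group when the pattern has one, but treat that as something to verify. The function name `extract_product_id` is made up for the example.

```python
import re

# '-p' or '-m' immediately followed by digits and a closing slash.
PRODUCT_ID = re.compile(r"-[pm]([0-9]+)/")


def extract_product_id(path):
    """Return the numeric id from a page path, or None if there is none."""
    match = PRODUCT_ID.search(path)
    return match.group(1) if match else None


paths = [
    "/xp-pen/toll-spe43-deco-pro-small-medium-spe43-tobuy-p665088831/",
    "/elveszett-profeciak-2019-m17855487/",
    "/harry-potter-es-a-tuz-serlege-2019-m19247107/",
]

for path in paths:
    print(extract_product_id(path))
```

The '-pen', '-pro' and '-profeciak' parts are skipped because no digit follows the letter.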

Like Operator in SQL

I'm trying to write a report using an SQL query
SELECT
r.PostCode,
r.Email
FROM dbo.tbl_RegisterMain_Holding as r
Where r.PostCode LIKE #PostCode AND RegisterType = '2'
When the report is run, you can enter multiple postcodes to search by; however, I don't get any results back.
I only want to search by the first 3 or 4 letters of the postcode.
Any idea why this is not bringing back data? Help would be much appreciated.
Thanks
If #PostCode contains a SINGLE postcode/fragment to search for, and assuming you want to search for matches STARTING with that value, try:
Where r.PostCode LIKE #PostCode + '%' AND RegisterType = '2'
If #PostCode contains MULTIPLE postcodes/fragments to search for (e.g. "AB,AC,AD") then this approach will not work and you need a way to split them out into the individual values first before matching. You can use one of the approaches I've outlined here, except the JOIN would use a LIKE instead of an equals.
Edit:
As they are comma-separated, you need to go down one of the routes outlined in the link above. There are three options for passing multiple values in: CSV (as you are doing), XML, or a table-valued parameter. A table-valued parameter is IMHO the best route. But assuming you stick with CSV, the final SQL would be (this needs the fnSplit function):
SELECT r.PostCode, r.Email
FROM dbo.tbl_RegisterMain_Holding as r
JOIN dbo.fnSplit(#Postcode, ',') t ON r.PostCode LIKE t.Item + '%'
WHERE r.RegisterType = '2' -- keep the filter from the original query
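The split-then-join idea can be tried without SQL Server. The sketch below is a stand-in only: it uses SQLite through Python's sqlite3 module, splits the comma-separated input in Python instead of calling fnSplit, and joins with LIKE on a prefix. The table name `register`, the helper names, and the sample rows are all hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE register (PostCode TEXT, Email TEXT, RegisterType TEXT)")
    conn.executemany(
        "INSERT INTO register VALUES (?, ?, ?)",
        [
            ("AB1 2CD", "a@example.com", "2"),
            ("AC3 4EF", "b@example.com", "2"),
            ("ZZ9 9ZZ", "c@example.com", "2"),
            ("AB5 6GH", "d@example.com", "1"),
        ],
    )
    return conn


def find_by_postcode_prefixes(conn, csv_fragments):
    """Split 'AB,AC' into rows, then join on PostCode LIKE fragment || '%'."""
    fragments = [f.strip() for f in csv_fragments.split(",") if f.strip()]
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS fragments (item TEXT)")
    conn.execute("DELETE FROM fragments")
    conn.executemany("INSERT INTO fragments (item) VALUES (?)", [(f,) for f in fragments])
    return conn.execute(
        """
        SELECT r.PostCode, r.Email
        FROM register r
        JOIN fragments t ON r.PostCode LIKE t.item || '%'
        WHERE r.RegisterType = '2'
        ORDER BY r.PostCode
        """
    ).fetchall()


conn = make_demo_db()
print(find_by_postcode_prefixes(conn, "AB, AC"))
```

If two fragments overlap (say 'AB' and 'AB1'), a row can match both and come back twice, so a DISTINCT may be needed.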
The LIKE operator searches for a pattern, so you must add wildcards to your variable #PostCode.
Try:
SELECT r.PostCode, r.Email FROM dbo.tbl_RegisterMain_Holding as r
WHERE r.PostCode LIKE '%' + #PostCode + '%' AND RegisterType = '2'

Finding Potential Duplicates With hyphens or dashes in SQL server 2008

I am trying to find potential duplicates in my database.
Some people might have a duplicate since they have added a "-" into their name or last name (for whatever reason).
My query currently does not pull people who might have a duplicate of someone with a "-".
What might be the best way to do this?
This is my query so far
SELECT t1.FirstName, t1.LastName, t1.ID, t2.dupeCount
FROM Contact t1
INNER JOIN (
SELECT FirstName, REPLACE(LastName, '-', ' ') as LastName, COUNT(*) AS dupeCount
FROM Contact
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1
) t2 ON ((SOUNDEX(t1.LastName) = SOUNDEX(t2.LastName)
OR SOUNDEX(REPLACE(t1.LastName, '-', ' ')) like '%' + SOUNDEX(t2.LastName) + '%'
OR SOUNDEX(REPLACE(t2.LastName, '-', ' ')) like '%' + SOUNDEX(t1.LastName) + '%' )
AND SOUNDEX(t1.FirstName) = SOUNDEX(t2.FirstName))
ORDER BY t1.LastName, t1.ID
This is a lot more involved than something you can fix in one Select statement. When I run into this, I create a stored procedure and trim off leading and trailing spaces, remove punctuation that shouldn't be there (such as in middle names that are abbreviated some of the time and not other times), and check to see if phone numbers, address/zip code combinations, and/or email addresses point to the same person. Soundex helps, but it isn't enough.
Something like a Levenshtein distance algorithm would be useful. It measures the number of edits you would need to make to a string to turn it into another string. In Oracle there is a built-in function called edit_distance in the utl_match package, but I don't know of a built-in version in SQL Server.
I did a quick Google search for Levenshtein distance and edit distance in SQL Server and found the following Stack Overflow thread, among other results, that may be helpful:
Levenshtein distance in T-SQL
If you are able to create a function that you can call to get the Levenshtein distance then you can just filter the query on whether the distance < x, setting the threshold as you see fit.
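To make the idea concrete, here is a short Python sketch of the classic dynamic-programming Levenshtein distance, plus a naive pairwise filter on distance < threshold. It is an illustration of the technique, not the T-SQL function from the linked thread; the function names, the default threshold of 2, and the sample names are assumptions.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, or substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def likely_duplicates(names, threshold=2):
    """Return pairs of names whose distance is below the threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if levenshtein(first.lower(), second.lower()) < threshold:
                pairs.append((first, second))
    return pairs


print(likely_duplicates(["Smith-Jones", "Smith Jones", "Brown"]))
```

Comparing every pair is quadratic, so on a large Contact table you would probably narrow the candidates first (for example by first name or SOUNDEX) before computing distances.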

Dictionary-like search using SQL Server FULL-TEXT

I am trying to create a dictionary for my website.
Searching for 'server' using FREETEXTTable & Rank DESC returns:
1. name server - A program or server that maps human-readable names..
2. server - One who serves; a waitress or waiter.
3. server - A tray for dishes; a salver.
4. ...
'server' is obviously closer to 'server' than 'name server'. How do I fix the ranking?
I cannot just reverse the order to ASC, because there are even worse matches.
Top 3 results for 'God' are 'act of God', 'Lamb of God', 'Le God'..
Edit: Sorry for any confusion. nameserver, server, server.. are in a single column called 'word' this is the column that is queried with full-text search. The definitions are in the next column 'definition' and returned as query results.
I think you can use a UNION to solve your result-ordering problem, like this:
select * from your_table_name where col_name = 'server'
union
select * from your_table_name where col_name like '%server%' order by col1,col2..
This query should return the exact match first, followed by the partial matches.
Clarification:
Note that col_name stands for whatever column holds your words.
Say your table structure is:
dictionary-
( c_word ,
c_definition,
c_synonyms
)
then you would modify my query to:
select * from Dictionary where c_word = 'server'
union
select * from Dictionary where c_word like '%server%' order by c_definition,c_synonyms
The intent is that this query shows the rows where c_word exactly matches 'server' first, followed by the partial matches.
For a dynamic query, replace 'server' with the variable that holds the requested search keyword.
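One caveat: an ORDER BY at the end of a UNION sorts the combined result, so the exact matches are not guaranteed to stay on top. A way to make "exact match first" explicit is to sort on a computed key instead. Here is a rough sketch using SQLite through Python's sqlite3 module; it reuses the Dictionary / c_word / c_definition names from above, while the helper names and sample rows are assumptions, and SQL Server syntax will differ slightly (for example + instead of ||).

```python
import sqlite3

RANKED_QUERY = """
SELECT c_word, c_definition
FROM Dictionary
WHERE c_word LIKE '%' || :term || '%'
ORDER BY
    CASE WHEN c_word = :term THEN 0 ELSE 1 END,  -- exact matches first
    c_word,
    c_definition
"""


def make_dictionary():
    """Build a small in-memory Dictionary table from the question's examples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Dictionary (c_word TEXT, c_definition TEXT)")
    conn.executemany(
        "INSERT INTO Dictionary VALUES (?, ?)",
        [
            ("name server", "A program or server that maps human-readable names.."),
            ("server", "One who serves; a waitress or waiter."),
            ("server", "A tray for dishes; a salver."),
        ],
    )
    return conn


def search(conn, term):
    """Return matching rows with exact matches ranked ahead of partial ones."""
    return conn.execute(RANKED_QUERY, {"term": term}).fetchall()


for word, definition in search(make_dictionary(), "server"):
    print(word, "-", definition)
```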
I used the PATINDEX function to do this at one point. Something like the following:
SELECT
    Word,
    Definition
FROM
    FREETEXTTABLE(Dictionary, Word, #search, 20) AS Matches
    INNER JOIN Dictionary ON
        Matches.[KEY] = Dictionary.ID -- KEY is a reserved word, so it needs brackets
ORDER BY
    CASE PATINDEX('%' + #search + '%', Word)
        WHEN 0 THEN 1000 -- PATINDEX returns 0 when the pattern is not found
        ELSE PATINDEX('%' + #search + '%', Word)
    END
It doesn't perform too terribly, since you're using the full-text index to get a smaller result set (max 20 as well, in this case). PATINDEX finds a string within an expression. If the search string doesn't exist within the expression, it returns 0. This might occur if you're also searching definitions, if your search matches on a synonym or stemmed word (e.g. you search for "took" so "take" is returned), or if your search involves multiple words. The CASE expression sorts those results to the end.
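The ordering rule itself is easy to try out in a few lines of Python. This is only a sketch of the sort key, not of the full-text search: Python's str.find is zero-based and returns -1 when the substring is absent, and those misses are pushed to the end with a large rank. The function name and the default rank of 1000 are assumptions that mirror the CASE above.

```python
def rank_by_match_position(words, search, not_found_rank=1000):
    """Sort words by where the search term first appears; misses go last."""

    def sort_key(word):
        position = word.lower().find(search.lower())
        return not_found_rank if position == -1 else position

    # sorted() is stable, so words with the same position keep their input order.
    return sorted(words, key=sort_key)


print(rank_by_match_position(["name server", "took", "server"], "server"))
```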