Pattern matching email address in SQL Server

We're getting fraudulent emails in our database, and are trying to make an alert to find them. Some examples of the email addresses we are getting:
addisonsdsdsdcfsd#XXXX.com
agustinasdsdfdf#XXXX.com
I want the query to search for:
a run of consonants longer than 4 characters
Here's what I have so far. I can't figure out how to get it to check the length of the run. Right now it's catching addresses that have even two consonants back to back, which I want to avoid because that catches emails like bobsaget#xxxx.com.
select * from recips
where address like '%[^aeiou]#%'
UPDATE
I think there is some misunderstanding of what I am trying to find. This is not a query for validating emails; we are simply trying to spot patterns of fraudulent emails in our signups.
We are searching on other criteria besides this, such as datelastopened/clicked, but to keep the question simple I only included the part that searches for the pattern. We don't mail to anyone who has hardbounced more than once. However, these emails in particular are bots that still find a way to click/open and don't hardbounce. They are also coming from particular sets of IP blocks where the first octets are the same, and these IP blocks vary.
This is by no means our first line of defense; it is just a catch-all to make sure we catch anything that slips through the cracks.

I'd think your current query is finding bobsaget#xxxx.com because it contains t#, which matches [^aeiou]#. A character class between [] matches only one character unless you quantify it, like so: [^aeiouy]{4,}#
That works in a regular expression, but LIKE does not support quantifiers such as {4,}. From what I'm getting from Googling about using regex in a WHERE clause in SQL Server, you need to define a user-defined function to do this for you. If that is too cumbersome, maybe something like this would do the trick:
WHERE address LIKE '%[^aeiouy][^aeiouy][^aeiouy][^aeiouy]#%'
Side note: just 4 seems strict to me; I know languages where Heinsch would be a valid name. So I'd go for 6 or more, I think, in which case it would be [^aeiouy]{6,}# or repeating the [^aeiouy] part 6 times in the above query.
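To try out different run lengths before committing to a LIKE pattern, the check can be sketched outside the database. This is a minimal Python sketch, not SQL Server code: the function name has_consonant_run and the explicit consonant list are assumptions made for illustration. One difference to keep in mind is that [^aeiouy] in LIKE also matches digits and punctuation, whereas this sketch only counts letters.

```python
import re

# Letters treated as consonants; "y" is counted as a vowel, as in the answer above.
CONSONANTS = "bcdfghjklmnpqrstvwxz"

def has_consonant_run(address, min_run=6, separator="#"):
    """Return True if at least `min_run` consonants sit directly before `separator`."""
    pattern = "[" + CONSONANTS + "]{" + str(min_run) + ",}" + re.escape(separator)
    return re.search(pattern, address.lower()) is not None

samples = [
    "addisonsdsdsdcfsd#XXXX.com",
    "agustinasdsdfdf#XXXX.com",
    "bobsaget#xxxx.com",
    "heinsch#xxxx.com",
]

# With the default of 6, only the two fraudulent-looking samples are flagged.
flagged = [a for a in samples if has_consonant_run(a)]
print(flagged)
```

Calling has_consonant_run("heinsch#xxxx.com", min_run=4) returns True, which illustrates why a threshold of 4 may be too strict.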

Related

Select Distinct for First Part of column values, but last part (ticket number) needs to be wildcard

So I have a database of emails sent and received by our ticket system, Cherwell, version 9.3.2. It uses Microsoft SQL as a backend, we're on version 2012. I'm interested in doing cleanup on old or irrelevant emails. For instance emails 3+ years old, or notices sent to technicians saying they have a new task, or notices we send out that really have no value in retaining in full email stored in the database, as Cherwell also creates rows of plaintext for most of these emails. The table related to mail, TrebuchetMail is this size: 193,883.156 MB.
I'm wondering if it would improve overall performance to reduce this table, as nearly every type of record in Cherwell would access this table. Granted it would only be those rows relevant to the specific record.
Okay so my question: Subject is a column that stores the subject of the email. I have a few types of Subjects identified for removal, one example is this:
--165765
select count(*)
FROM [cherwell].[dbo].[TrebuchetMail]
where subject like 'You have an unacknowledged Task%';
After the 'You have an unacknowledged Task' part of the subject is a number: the individual Task object's ID. So a select distinct treats all 165765 rows as distinct, because they are. Can you use a wildcard with select distinct to group together values that are similar but not exactly the same? Is there another function I could use instead of distinct? I realize the values really are distinct, but surely this problem has come up before. I'm after a "select distinct Subject" query that would group together the rows where Subject is like 'You have an unacknowledged Task%', and likewise the rows where Subject is like 'Ticket #%Created'. Would I always need some criteria? If so, maybe this is pointless, because I would have to look at the full results to come up with the criteria for the select distinct query anyway.
My goal is to identify different Subjects that could be targeted for archival/removal.
I found a 2013 thread that was a similar question, but it had to do with dates. The asker wanted to group together rows from a log that grouped together the days, disregarding the time aspect of the log. I didn't quite understand how I could translate that to work for my situation. I'd be very grateful for an explanation if that would work for me.
I know this might not be the answer you are looking for, since it is low tech and not based on a formula. But since it is most likely a one-time action, why not export the database as a table and sort it by the subject field? All the irrelevant records would be grouped together and could easily be deleted.
After this action simply re-import the table to the database. Of course this only works nicely on a flat database, not on a highly linked up one.
At the same time you would have a backup in case something goes wrong.
What you may want to do (bearing in mind I'm not a SQL expert) is add an expression to your query that stands in as a new column: a truncated version of Subject.
Something like:
SELECT RecID, REPLACE(REPLACE(REPLACE(Subject, '1', ''), '2', ''), '3', '') AS CustomColumn
FROM TrebuchetMail
and so on, until you have stripped out the digits 0-9 wherever they appear in the subject line.
You can then potentially run the distinct on this column, I believe.
I'm sure there's a more elegant way of doing this with a regular expression as well; I'm just too much of a novice for it.
I'm not sure how it works out in practice.
Note: my first instinct was the VB/C# style, Subject.Replace('1', ''), but in SQL it's Replace(expression, 'text to be replaced', 'text to replace with'), which is the form used above.
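The digit-stripping idea can be tried on a sample of exported subjects before writing the full query. The following is a minimal Python sketch, not Cherwell or T-SQL code: the sample subjects are made up, and subject_template and the <N> placeholder are names chosen here for illustration. It collapses every run of digits into a placeholder and counts how many rows share each resulting template.

```python
import re
from collections import Counter

def subject_template(subject):
    """Replace every run of digits with the placeholder <N>."""
    return re.sub(r"\d+", "<N>", subject)

# Made-up subjects standing in for rows of the Subject column.
subjects = [
    "You have an unacknowledged Task 10234",
    "You have an unacknowledged Task 10235",
    "Ticket #5521 Created",
    "Ticket #5522 Created",
    "Ticket #5523 Created",
    "Quarterly report",
]

# Count how many subjects fall under each template, largest group first.
counts = Counter(subject_template(s) for s in subjects)
for template, n in counts.most_common():
    print(n, template)
```

Templates with the highest counts would be the first candidates to review for archival or removal.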

Clean unstructured place name to a structured format

I have around 300k unstructured records like those in the screenshot below. I'm trying to use Google Refine or OpenRefine to clean them up. However, I'm unable to find a proper way to do this; I'm new to this tool. Any help would be greatly appreciated. Also, the tool is quite slow at processing 300k records: whenever I try something out, it takes a lot of time to process and produce output.
Or please suggest any other open-source tools and techniques to do this.
As Owen said in comments, your question is probably too broad and cannot receive acceptable answer. We can just provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", or even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it is an organisation name).
On the basis of your fifteen lines, however, some clear patterns seem to emerge. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any sequence at the end that contains more than 4 digits with one - between them).
Feel free to check out the Open Refine documentation in case of doubt.
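If Open Refine turns out to be too slow for 300k records, the same rule can be prototyped in a script. This is a rough Python sketch of the second GREL formula above, with one difference: it also trims whitespace left behind after removal. The function name and the sample values containing a # are invented for illustration, since the original fifteen lines are not reproduced here.

```python
import re

# Same pattern as in the GREL formulas above: a trailing token that contains '#'.
TRAILING_HASH_TOKEN = re.compile(r"\b\w+#\w+\b$")

def strip_trailing_hash_token(value):
    """Remove a trailing token containing '#', but only if there are more than two words."""
    value = value.strip()
    if len(value.split(" ")) > 2:
        return TRAILING_HASH_TOKEN.sub("", value).strip()
    return value

# Hypothetical inputs: the first has a trailing code, the second is a short name with '#'.
print(strip_trailing_hash_token("Massy Intertech US AB#123"))
print(strip_trailing_hash_token("Acme C#9"))
```

Further patterns can be added as extra functions and applied in sequence, which keeps each rule easy to test on its own.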

Google Places API - RadarSearch results are confusing

I'm running a query vs the Google Places RadarSearch API and don't entirely understand the results. I'm trying to find nearby Tesco Supermarkets. My query is structured like this:
https://maps.googleapis.com/maps/api/place/radarsearch/xml?location=51.503186,-0.126446&types=store&keyword=tesco&name=tesco&radius=5000&key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I've tried a bunch of variations of the fields types, keyword and name. None of the results are Tesco stores. Am I missing something?
The Google docs show the fields as:
keyword — A term to be matched against all content that Google has indexed for this place, including but not limited to name, type, and address, as well as customer reviews and other third-party content.
name — One or more terms to be matched against the names of places, separated by a space character. Results will be restricted to those containing the passed name values. Note that a place may have additional names associated with it, beyond its listed name. The API will try to match the passed name value against all of these names. As a result, places may be returned in the results whose listed names do not match the search term, but whose associated names do.
I always get the maximum of 200 results, which include maybe 1 or 2 Tescos. When I check on Google Maps there are 10 Tescos in the radius I am searching. It's as if the API is ignoring the name field. It doesn't matter what I populate in the name field, I still get the same results.
UPDATE: Seems this is a known bug https://code.google.com/p/gmaps-api-issues/issues/detail?id=7082
Maybe I am wrong, but I believe it is a commercial issue: Google shows all businesses, filtered by criteria whose rules they don't publish. For example, in your search the type was "store", so they return all stores and apply the name or keyword in their own way; who knows which criteria they use internally. There is something else: in the API description, the sample response for radar search shows the name of the place, but in the tests I am doing they are not even sending the name, so you can't iterate over the results and filter them yourself. To get the name, you have to make another call using:
https://maps.googleapis.com/maps/api/place/details/json?placeid=ChIJq4lX1doEdkgR5JXPstgQjc0&key=YOUR_KEY
Maybe there is another way but I don't see it.
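The two-step workaround described above (radar search for place IDs, then one details call per place to get the name) could be sketched roughly as follows in Python. This is an illustration under assumptions, not a tested integration: fetch_json is a hypothetical caller-supplied function that downloads a URL and returns the parsed JSON, the sketch requests JSON rather than XML, and the response keys results, place_id, result and name are assumed to match what the API returns.

```python
from urllib.parse import urlencode

BASE = "https://maps.googleapis.com/maps/api/place"

def radar_search_url(location, radius, key, **filters):
    """Build the radar search URL; filters are extra parameters such as types or keyword."""
    params = {"location": location, "radius": radius, "key": key}
    params.update(filters)
    return BASE + "/radarsearch/json?" + urlencode(params)

def details_url(place_id, key):
    """Build the details URL used to look up a single place."""
    return BASE + "/details/json?" + urlencode({"placeid": place_id, "key": key})

def places_named(term, location, radius, key, fetch_json, **filters):
    """Return names of radar search results whose name contains `term`."""
    found = fetch_json(radar_search_url(location, radius, key, **filters))
    names = []
    for item in found.get("results", []):
        # One extra request per result, because radar search omits the name.
        detail = fetch_json(details_url(item["place_id"], key))
        name = detail.get("result", {}).get("name", "")
        if term.lower() in name.lower():
            names.append(name)
    return names
```

With up to 200 results per radar search, this makes up to 200 additional details requests, so request quotas are worth checking before relying on it.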
I find the radar search is returning strange results today. It worked differently a couple of days ago.
The keyword-parameter has no effect at the moment and I have breaking integration-tests that were working before. I hope this is a temporary issue.
I filed a bug report for it: https://code.google.com/p/gmaps-api-issues/issues/detail?id=7086

SQL Server Text Searching

I have a business requirement where we need to do some crazy name matching against records stored in the database, and I was wondering if there is an easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it were the other way round, I could have used a wildcard character, but in this case the name in the database is shorter than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons. The way they store their data internally just isn't fit for really messy matching. Actually even a superpowerful dedicated search engine like Google, that has a LOT of messy matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot down the feature request as "technically impossible", or implement a rule set defining which messy matches are tried. For your example, you could easily hard-code that multiple searches are executed when capitalized words are entered, shortening them to a single letter. No idea if that's a solution to your problem though.
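The rule of shortening capitalized words to a single letter can be sketched as a small generator of search candidates. This is a minimal Python sketch under assumptions: the function name name_variants is made up, and each resulting string would then be run as its own exact-match or LIKE search against the database.

```python
from itertools import product

def name_variants(full_name):
    """Return the name plus every variant with capitalized words cut to their initial."""
    options = []
    for word in full_name.split():
        if len(word) > 1 and word[0].isupper():
            options.append((word, word[0]))  # keep the word or use its initial
        else:
            options.append((word,))
    return [" ".join(combo) for combo in product(*options)]

variants = name_variants("Austin Kierland")
print(variants)
```

For the example from the question, the candidate list contains "Austin K", which is the value stored in the database. The number of variants doubles with each capitalized word, so long names may need a cap.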
You can do a normal search using the LIKE operator, which determines whether a specific character string matches a specified pattern. The problem you will run into is the likelihood of returning multiple records or the wrong people. I've had a similar requirement myself for a business app, and the best solution is to require other qualifying values rather than just the name. If you do a partial name search without other qualifying data, you are certainly going to come across false positive matches and/or multiple records. In my case I built a web service that checks eligibility, allowing text search for first and last name but also requiring date of birth, primary person SSN, and gender, which ensured the matching person was indeed the person intended. If my situation were like yours, with name as the only search criterion, my recommendation to the business would be that we cannot perform the search until qualifying data is entered into the database; otherwise there is no accurate way to query the results they are looking for.

SQL exact match within a pattern?

I am using qodbc (a quickbooks database connector) It uses an ODBC-like sql language.
I would like to find all the records where a field matches a pattern, but I have a slight dilemma.
The information in my field looks like this:
321-......02/25/10
321-1.....02/26/10
321-2.....03/25/10
321-3.....03/26/10
322-......04/25/10
322-1.....04/26/10
322-2.....05/25/10
322-3.....05/26/10
I would like my query to return only the rows where the pattern matches the first number. So if the user searches for '321' it will only show records that look like 321 but not those that have 321-1 or 321-3. Similarly if the user searched for 321-1 you would not see 321. (that's the easy part)
Right now I have
LIKE '321%'
This finds all of them regardless of whether they are followed by dots or not. Is there a way I can limit the query to specific values, despite that field holding more information than it should?
(P.S. I did not set up this system; it makes me wince to see two data points in one field.
I'm sorry if my title isn't right; suggest a new title if you can.)
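Every sample value has a dash right after the first number, so LIKE '321%' AND NOT LIKE '321-%' would exclude all of them. Assuming the padding dots are always present, match the first dot instead (YourField is a placeholder for the actual column name):
YourField LIKE '321-.%'
Likewise, for 321-1 use LIKE '321-1.%'.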
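One way to check a LIKE pattern against the sample rows is a quick in-memory test. The sketch below uses Python's sqlite3 module as a stand-in for QODBC, which is an assumption: the table name items, the column name ref and the helper search are invented for illustration, and QODBC's LIKE may differ in details. The idea is to include the first padding dot in the pattern, so that 321 matches only values starting with 321-. and 321-1 matches only values starting with 321-1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (ref TEXT)")  # hypothetical table and column
rows = [
    "321-......02/25/10", "321-1.....02/26/10", "321-2.....03/25/10",
    "321-3.....03/26/10", "322-......04/25/10", "322-1.....04/26/10",
]
conn.executemany("INSERT INTO items VALUES (?)", [(r,) for r in rows])

def search(term):
    """Match `term` only when it is directly followed by the dot padding."""
    # A bare number such as "321" is stored as "321-", so add the dash first.
    prefix = term if "-" in term else term + "-"
    cur = conn.execute(
        "SELECT ref FROM items WHERE ref LIKE ? ORDER BY ref", (prefix + ".%",)
    )
    return [row[0] for row in cur]

print(search("321"))
print(search("321-1"))
```

Note that excluding '321-%' would also drop the plain 321 row, because it too starts with 321-.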