Guidance on Regex - regex-lookarounds

I am struggling to complete this regex condition:
"Match anything that is not a legitimate subdomain of rafael.com (including the domain rafael.com)"
For example, these 3 lines should not match because they are legitimate
rafael.com
hello.rafael.com
hi.hello.rafael.com
And all the lines below should match
hello.rafael.xyz
badrafael.com
rafaelbad.com
rafaelbad.xyz
badrafael.xyz
arafael.com
arafael.xyz
rafael.xyz
a.b.rafael.xyz
This expression .*rafael(?!\.com).* gets me part of the way, but it isn't matching, for example,
badrafael.com
arafael.com
I am getting caught up with the lookbehind portion of this regex, I have been staring at this for 3 hours and can't figure it out. Any guidance, suggestions, links to examples would be tremendously appreciated!

I have decided after much troubleshooting that the best way to do this is actually to create separate regexes for some of the most common scenarios that may come up when blocking domain names and their combinations, and create more than 1 rule, as opposed to trying to capture every single possible combination of a domain in 1 regex.

Related

Clean unstructured place name to a structured format

I have around 300k unstructured data as below screen.I'm trying to use Google refine or OpenRefine to make this correct. However, I'm unable to find a proper way to do this. I'm new to this tool. Anyone's help would be greatly appreciated.Also, this tool is quite slow to process 300k records. If I am trying out something its taking lots of time to process and give an output.
OR Please suggest any other opensource tools and techniques do this?
As Owen said in comments, your question is probably too broad and cannot receive acceptable answer. We can just provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US". Not even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure is an organisation name).
On the basis of your fifteen lines, however, we seem to distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character suites without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 numbers and one - between them)
Feel free to check out the Open Refine documentation in case of doubt.

String - Matching Automaton

So I am going to find the occurrence of s in d. s = "infinite" and d = "ininfinintefinfinite " using finite automaton. The first step I need is to construct a state diagram of s. But I think I have problems on identifying the occurrence of the string patterns. I am so confused about this. If someone could explain a little bit on this topic to me, it'll be really helpful.
You should be able to use regular expressions to accomplish your goal. Every programming language I've seen has support for regular expressions and should make your task much easier. You can test your regexs here: https://regexr.com/ and find a reference sheet for different things. For your example, the regex /(infinite)/ should do the trick.

How to set permissions efficiently

I have a list of Exchanges (extract shown below) and I want to give a user WRITE permissions to all but those containing Invoice in the name. Here an extract of the list of Exchanges as an example:
E.BankAccount.Commands
E.BankAccount.Commands.Confirmations
E.BankAccount.Commands.Errors
E.BankAccount.Events
E.Customer.Commands
E.Customer.Commands.Confirmations
E.Customer.Commands.Errors
E.Customer.Events
E.Invoice.Commands
E.Invoice.Commands.Confirmations
E.Invoice.Commands.Errors
E.Payment.Commands
E.Payment.Commands.Confirmations
E.Payment.Commands.Errors
E.Payment.EventsEvents.test
So I can write a regex like this:
E\.(BankAccount|Customer|Payment)\.[a-zA-Z.]+
This works but since I have more than 100 different types it is a lot to type and almost impossible to maintain. Better would be to have a regex which excludes invoice. I tried a couple of things but without any look, e.g.
E\.([a-zA-Z.]+|?!Invoice)\.[a-zA-Z.]+ or
E\.([a-zA-Z.]+|?!(Invoice))\.[a-zA-Z.]+
But that is all syntactically wrong. Another strategy could be to reverse the result of the expression e.g. use
E\.(Invoice)\.[a-zA-Z.]+
and then somehow reverse the entire result but I could not figure out how!
I am not using regex very often so I can't find any better solution then the first one which is impracticable in my production environment.
Does anybody have more experience and a solution to that problem?? Any help would be greatly appreciated!

Pattern matching email address in SQL Server

We're getting fraudulent emails in our database, and are trying to make an alert to find them. Some example of email addresses we are getting:
addisonsdsdsdcfsd#XXXX.com
agustinasdsdfdf#XXXX.com
I want the query to search for:
pattern of consonants and pattern length > 4 characters
Here's what I have so far, I can't figure out how to get it to search for the length of the string. Right now it's catching addresses that have even two consonants back to back, which I want to avoid because that catches emails like bobsaget#xxxx.com.
select * from recips
where address like like '%[^aeiou]#%'
UPDATE
I think there is some misunderstanding of what I am trying to find, this is not a query for validating emails, we are simply trying to spot patterns in our signups for fraudulent emails.
we are searching on other criteria besides this, such as datelastopened/clicked, but for the sake of keeping the question simple I only attached the string that searches for the pattern. We don't mail to anyone who has hardbounced more than once. However, these emails in particular are bots that still find a way to click/open and don't hardbounce. They are also coming from particular sets of IP blocks where the first octets are the same, and these IP blocks vary.
this is by no means our first line of defense, this is just a catch-all to make sure we catch anything that slips through the cracks
I'd think your current query is finding bobsaget#xxxxx.com because it contains t# which matches [^aeiouy]# because that character-class between [] only matches 1 character unless you quantify it like so: [^aeiouy]{4,}#
Maybe that works, but what I'm getting from Googling about the using Regex in the WHERE clause in SQL-Server, you need to define a User Defined Function to do this for you. If that is too cumbersome, maybe doing something like this would do the trick:
WHERE address LIKE '%[^aeiouy][^aeiouy][^aeiouy][^aeiouy]#%'
Side note, just 4 seems strict to me, I know languages where Heinsch would be a valid name. So I'd go for 6 or more I think, in which case it would be [^aeiouy]{6,}# or repeating the [^aeiouy] part 6 times in the above query.

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing sub-string matching is AhoCorsick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an AhoCorsick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phases to testing": because the individual words would match.. but because the order is different the pairs wouldn't, so it's taking phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, good for use as a 'score' in matching.
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean with 'WHERE' statement previously? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you'll can remove them from the comparison table, as any next test you'll do will never get a match.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.