Could you provide a regex that match Twitter usernames?
Extra bonus if a Python example is provided.
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)
I've used this as it disregards emails.
Here is a sample tweet:
#Hello how are #you doing #my_friend, email #000 me # whats.up#example.com #shahmirj
Matches:
#Hello
#you
#my_friend
#shahmirj
It will also work for hashtags, I use the same expression with the # changed to #.
If you're talking about the #username thing they use on twitter, then you can use this:
import re
twitter_username_re = re.compile(r'#([A-Za-z0-9_]+)')
To make every instance an HTML link, you could do something like this:
my_html_str = twitter_username_re.sub(lambda m: '%s' % (m.group(1), m.group(0)), my_tweet)
The regex I use, and that have been tested in multiple contexts :
/(^|[^#\w])#(\w{1,15})\b/
This is the cleanest way I've found to test and replace Twitter username in strings.
#!/usr/bin/python
import re
text = "#RayFranco is answering to #jjconti, this is a real '#username83' but this is an#email.com, and this is a #probablyfaketwitterusername";
ftext = re.sub( r'(^|[^#\w])#(\w{1,15})\b', '\\1\\2', text )
print ftext;
This will return me as expected :
RayFranco is answering to jjconti, this is a real 'username83' but this is an#email.com, and this is a #probablyfaketwitterusername
Based on Twitter specs :
Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease.
A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.
Twitter recently released to open source in various languages including Java, Ruby (gem) and Javascript implementations of the code they use for finding user names, hash tags, lists and urls.
It is very regular expression oriented.
The only characters accepted in the form are A-Z, 0-9, and underscore. Usernames are not case-sensitive, though, so you could use r'#(?i)[a-z0-9_]+' to match everything correctly and also discern between users.
This is a method I have used in a project that takes the text attribute of a tweet object and returns the text with both the hashtags and user_mentions linked to their appropriate pages on twitter, complying with the most recent twitter display guidelines
def link_tweet(tweet):
"""
This method takes the text attribute from a tweet object and returns it with
user_mentions and hashtags linked
"""
tweet = re.sub(r'(\A|\s)#(\w+)', r'\1#\2', str(tweet))
return re.sub(r'(\A|\s)#(\w+)', r'\1#\2', str(tweet))
Once you call this method you can pass in the param my_tweet[x].text. Hope this is helpful.
Shorter, /#([\w]+)/ works fine.
This regex seems to solve Twitter usernames:
^#[A-Za-z0-9_]{1,15}$
Max 15 characters, allows underscores directly after the #, (which Twitter does), and allows all underscores (which, after a quick search, I found that Twitter apparently also does). Excludes email addresses.
I have used the existing answers and modified it for my use case. (username must be longer then 4 characters)
^[A-z0-9_]{5,15}$
Rules:
Your username must be longer than 4 characters.
Your username must be shorter than 15 characters.
Your username can only contain letters, numbers and '_'.
Source: https://help.twitter.com/en/managing-your-account/twitter-username-rules
In case you need to match all the handle, #handle and twitter.com/handle formats, this is a variation:
import re
match = re.search(r'^(?:.*twitter\.com/|#?)(\w{1,15})(?:$|/.*$)', text)
handle = match.group(1)
Explanation, examples and working regex here:
https://regex101.com/r/7KbhqA/3
Matched
myhandle
#myhandle
#my_handle_2
twitter.com/myhandle
https://twitter.com/myhandle
https://twitter.com/myhandle/randomstuff
Not matched
mysuperhandleistoolong
#mysuperhandleistoolong
https://twitter.com/mysuperhandleistoolong
You can use the following regex: ^#[A-Za-z0-9_]{1,15}$
In python:
import re
pattern = re.compile('^#[A-Za-z0-9_]{1,15}$')
pattern.match('#Your_handle')
This will check if the string exactly matches the regex.
In a 'practical' setting, you could use it as follows:
pattern = re.compile('^#[A-Za-z0-9_]{1,15}$')
if pattern.match('#Your_handle'):
print('Match')
else:
print('No Match')
Related
Web.HttpUtility.UrlEncode method in my project. When I am encoding name in English language then I got correct result. For example,
string temp = System.Web.HttpUtility.UrlEncode("Jewelry");
then I got exact result in temp variable. But if I wrote name in Russian language then I got different result.
string temp = System.Web.HttpUtility.UrlEncode("ювелирные изделия");
then I got value in temp variable like "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
Can anyone help me how to achieve exact name as per language?
Thank you!
Actually, the method has "done the right thing" for you!
It encodes non-ASCII characters so that it can be valid in all of the cases and transmit over the Internet. If you put your temp variable in an URL as a parameter, you will get your correct result at server side. That's what UrlEncode means for. Here your question is not a problem at all.
So please have a look at this link for further reading to understand about URL Encoding: http://www.w3schools.com/tags/ref_urlencode.asp
If you input that Russian word to the "URL Encoding Functions" part in the page I have given, it will return the same result as Web.HttpUtility.UrlEncode method does.
Can anyone help me how to achieve exact name as per language?
In short: not with that method, but it might depend on what is your exact goal.
In details:
In general URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]#!$&'()*+,;=. Any other character needs to be encoded with the percent-encoding (%hh).
This is why UrlEncode produces
UrlEncode("Jewelry") -> "Jewelry"
UrlEncode("ювелирные изделия") -> "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
The string of "ювелирные изделия" contains characters that are not allowed in a URL as per RFC 3986.
Today, modern browsers could work with UTF-8 in URL it might be not necessary to use UrlEncode(). See example: http://jsfiddle.net/ybgt96ms/
In Solr I have a field dedicated to URLs. The URL field can be anywhere up to 2000 in length. However, I only ever need to search the first 200 characters.
Example URL:
https://www.google.co.uk/search/2014/here/?q=help+me&oq=stackoverflow&aqs=c
I've experimented over the last 2 weeks with Grams and various combinations of Tokenizers to no avail. I always seem to fall short. I would provide examples but they are all standard so no point cluttering this with non-working types.
The main problem seems to be with how Solr deals with punctuation. It treats non-A-z/0-9 characters as separators. How do I disable this for a field?
For example I can search: 'google' and get the correct result, but when I search 'google.co' nothing comes back. Same problem with most of the non-A-z/0-9 characters, it seems to treat them as a separator.
Everything needs to be *wildcard*searchable from 4char strings up to 200 char strings.
So the following search terms would return the above result. '&aqs','ow&aqs=','ps://www.goo','q=help+','2014/he'... etc
How would you define a field type for the URL wildcard use case?
You can use a string field for your url and use a filter that cuts it off to 200 characters.It can be a regex expressions also to keep only 200 characters for that field.
String field will match the exact tokens
I tried using Ruby's url_encode (doc here.)
It encodes http://www.google.com as http%3A%2F%2Fwww.google.com. But it turns out that I cannot open the latter via a browser. If so, what's the use of this function? What is it useful for, when the URL that it encodes can't even be opened?
A typical use is the HTTP GET method, in where you need a query String.
Query String 1:
valudA=john&valueB=john2
Actual value server get:
valueA : "john"
valueB : "john2"
url_encode is used to make the key-value pair enable to store the string which includes some non-ASCII encoded character such as space and special character.
Suppose the valueB will store my name, code 4 j, you need to encode it because there are some spaces.
url_encode("code 4 j")
code%204%20j
Query string 2:
valueA=john&valueB=code%204%20j
Actual value server get:
valueA: "john"
valueB: "code 4 j"
You can use url_encode to encode for example the keys/values of a GET request.
Here is an example of what a SO search query URL looks after encoding:
https://stackoverflow.com/questions/tagged/c%23+or+.net+or+asp.net
As you can see, url encoding appears to be applied only on the last part of the URL, after the last slash.
In general you cannot use url_encode on your entire URL or you will also encode the special characters in a normal URL like the :// in your example.
You can check a tutorial that explains how it works here: http://www.permadi.com/tutorial/urlEncoding/
I'd like to create a regular expression such that when I compare the a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this by done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe. For example, o['-~]?b will match "ob", "o'b", "o-b", and "o~b".
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of characters (so O'''Brien will match) - for a single instance, change to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
Would match all of the above, add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.
I have url, for example:
http://i.myhost.com/myimage.jpg
I want to change this url to
http://i.myhost.com/myimageD.jpg.
(Add D after image name and before point)
i.e I want add some words after image name and before point using regex.
What is the best way do it using regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D\2. I'm assuming the extension is 3-5 alphanumeric numbers but you can modify it to suit. E.g. if it's just jpg images then you can put that instead of the [a-zA-Z]{3,5}.
Sounds like a homework question given the solution must use a regex, on that assumption here is an outline to get you going.
If all you have is a URL then #mathematical.coffee's solution will suit. However if you have a chunk of text within which is one or more URLs and you have to locate and change just those then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see #mathematical.coffee's solution for an example.
The best way to learn regexs is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.