Usage of url_encode - urlencode

I tried using Ruby's url_encode (doc here.)
It encodes http://www.google.com as http%3A%2F%2Fwww.google.com. But it turns out that I cannot open the latter via a browser. If so, what's the use of this function? What is it useful for, when the URL that it encodes can't even be opened?

A typical use is the HTTP GET method, where you need a query string.
Query string 1:
valueA=john&valueB=john2
Actual values the server gets:
valueA : "john"
valueB : "john2"
url_encode lets a key-value pair carry a string containing characters that cannot appear literally in a query string, such as spaces and reserved special characters.
Suppose valueB should store my name, code 4 j. You need to encode it because it contains spaces.
url_encode("code 4 j")
code%204%20j
Query string 2:
valueA=john&valueB=code%204%20j
Actual values the server gets:
valueA: "john"
valueB: "code 4 j"
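The round trip above can be sketched in Python. This is only an illustration using the standard library's urllib.parse in place of Ruby's url_encode; the parameter names come from the example above.

```python
from urllib.parse import parse_qs, quote

# Percent-encode the value before placing it in the query string.
encoded = quote("code 4 j")  # each space becomes %20

query = "valueA=john&valueB=" + encoded

# What a server-side parser would recover from that query string.
params = parse_qs(query)

print(encoded)
print(params)
```

The server never sees the %20 sequences as data: its query parser decodes them back to spaces.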

You can use url_encode to encode, for example, the keys and values of a GET request.
Here is an example of what an SO search query URL looks like after encoding:
https://stackoverflow.com/questions/tagged/c%23+or+.net+or+asp.net
As you can see, url encoding appears to be applied only on the last part of the URL, after the last slash.
In general you cannot use url_encode on your entire URL or you will also encode the special characters in a normal URL like the :// in your example.
You can check a tutorial that explains how it works here: http://www.permadi.com/tutorial/urlEncoding/
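A minimal Python sketch of the difference between encoding a whole URL and encoding only one component. It uses urllib.parse as a stand-in for Ruby's url_encode, and the tag string is taken from the search URL shown above.

```python
from urllib.parse import quote, quote_plus

# Encoding the whole URL also escapes the delimiters that give it structure,
# so the result is no longer something a browser can open.
whole = quote("http://www.google.com", safe="")

# Encoding only the variable component keeps the URL usable.
base = "https://stackoverflow.com/questions/tagged/"
tags = quote_plus("c# or .net or asp.net")  # '#' -> %23, space -> '+'
url = base + tags

print(whole)
print(url)
```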

Related

URL-parameters input seems inconsistent

I have reviewed multiple instructions on URL parameters, which all suggest two approaches:
Parameters can follow forward slashes (/) or be specified by parameter name and then by parameter value, so either:
1) http://numbersapi.com/42
or
2) http://numbersapi.com/random?min=10&max=20
For the second one, I provide the parameter name and then the parameter value after the ?. I also provide multiple parameters using an ampersand.
Now I have seen the request below, which works fine but does not fit the rules above:
http://numbersapi.com/42?json
I understand that the request sets 42 as a parameter, but why is the ? followed just by a value and not by a parameter name? Also, the ? seems to be used as an ampersand?
From Wikipedia:
Every HTTP URL conforms to the syntax of a generic URI. The URI generic syntax consists of a hierarchical sequence of five components:
URI = scheme:[//authority]path[?query][#fragment]
where the authority component divides into three subcomponents:
authority = [userinfo@]host[:port]
This is represented in a syntax diagram as:
As you can see, the ? ends the path part of the URL and starts the query part.
The query part is usually a &-separated string of name=value pairs, but it doesn't have to be, so json is a valid value for the query part.
Or, as the Wikipedia article puts it:
An optional query component preceded by a question mark (?), containing a query string of non-hierarchical data. Its syntax is not well defined, but by convention is most often a sequence of attribute–value pairs separated by a delimiter.
It is also fairly common for request processors to treat a name=value pair that is missing the = sign as if it were name=.
E.g. if you're writing Servlet code and call servletRequest.getParameter("json"), it would return an empty string ("") for that last URL in the question.
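The same behaviour can be observed with Python's urllib.parse. This is a sketch of one parser's handling and not a statement about how the numbersapi.com server itself is implemented.

```python
from urllib.parse import parse_qs, urlsplit

parts = urlsplit("http://numbersapi.com/42?json")

# keep_blank_values=True keeps names that have no "=value" part
# and reports them with an empty string as the value.
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.path)   # the path component, ended by the "?"
print(parts.query)  # the raw query component
print(params)
```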

Is it possible to use apache's URIBuilder to set a parameter with percentage sign?

I want to build this complete URL:
locahost/some/path?param1=%06
using the org.apache.http.client.utils.URIBuilder method setParameter(final String param, final String value). Its javadoc contains this line:
The parameter name and value are expected to be unescaped and may contain non ASCII characters
But when I use setParameter("param1","%06") I always get ...param1=%2506 instead of ...param1=%06. Looking here, I noticed that the percent sign is 25 in hex.
Should I parse this manually, or is there a way to keep using URIBuilder and keep the parameters as they are?
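The %25 in the output is what any percent-encoder produces when it is handed a value that is already escaped. The sketch below shows the effect with Python's urllib.parse rather than URIBuilder; that URIBuilder behaves the same way for the raw control character is an assumption based on the javadoc line quoted above.

```python
from urllib.parse import quote

# An already-escaped value is escaped again: "%" itself becomes "%25".
double_encoded = quote("%06")

# The unescaped value is the raw character U+0006, which encodes to %06.
encoded_once = quote("\x06")

print(double_encoded)
print(encoded_once)
```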

Add special characters to Uri, Kotlin

So, I have my base URL, which is this:
val GITHUB_BASE_URL: String = "https://api.github.com/search/repositories"
And then I have this code that appends the param q (REPO_NAME_PARAM == query) to the Uri and builds it:
val builtUri: Uri = Uri.parse(GITHUB_BASE_URL).buildUpon()
    .appendQueryParameter(REPO_NAME_PARAM, repoName)
    .build()
Up to here, everything works fine. But when I try to filter the repository search by the language the repositories are written in (for example, the URL should be https://api.github.com/search/repositories?q=hello+language:Kotlin), the + and : characters get replaced by %2B and %3A. As a result, the app does not retrieve the expected results, because the characters were changed in the final URL.
This is the code that I currently have:
val WRITTEN_IN_PARAM: String = "+language:"
val builtUri: Uri = Uri.parse(GITHUB_BASE_URL).buildUpon()
    .appendQueryParameter(REPO_NAME_PARAM, repoName + WRITTEN_IN_PARAM + "Kotlin")
    .build()
2B or not 2B, that is the question. :)
The problem is that the URL parameter is being URL Encoded twice. When we send certain characters in HTTP queries, they need to be encoded. One encoding (considered a shortcut) is to turn a space into a + symbol. The proper way to encode a space is with %20.
However, when the code above gets that already encoded String it doesn't know that the + is already encoded from a space and tries to encode it again (using %2B, the encoding for +).
If you hit the URL you've provided with %20 in place of +, and %3A in place of :, it should work fine. Therefore, the fix is to not send + unless you really want a +, in which case it will be properly encoded to a %2B.
The fix: the library being used appears to encode strings correctly, so just leave the + as a space and it should give you what you need.
Here is a good list of characters and their encoding, if you are interested.
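The double-encoding effect can be reproduced with Python's urllib.parse. This is only an illustration: urlencode writes a space as +, whereas other encoders (possibly including the Android one used above) write it as %20. Either way, a literal + in the input is escaped to %2B.

```python
from urllib.parse import urlencode

# A literal "+" in the value is escaped to %2B, because "+" already
# stands for a space in form-style query encoding.
with_plus = urlencode({"q": "hello+language:Kotlin"})

# Passing a real space lets the encoder produce the escape itself.
with_space = urlencode({"q": "hello language:Kotlin"})

print(with_plus)
print(with_space)
```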

Regular expression to find usernames in NSString Objective C [duplicate]

Could you provide a regex that matches Twitter usernames?
Extra bonus if a Python example is provided.
(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9-_]+)
I've used this as it disregards emails.
Here is a sample tweet:
@Hello how are @you doing @my_friend, email @000 me @ whats.up@example.com @shahmirj
Matches:
@Hello
@you
@my_friend
@shahmirj
It will also work for hashtags; I use the same expression with the @ changed to #.
If you're talking about the @username thing they use on Twitter, then you can use this:
import re
twitter_username_re = re.compile(r'@([A-Za-z0-9_]+)')
To make every instance an HTML link, you could do something like this:
my_html_str = twitter_username_re.sub(lambda m: '<a href="https://twitter.com/%s">%s</a>' % (m.group(1), m.group(0)), my_tweet)
The regex I use, which has been tested in multiple contexts:
/(^|[^@\w])@(\w{1,15})\b/
This is the cleanest way I've found to test and replace Twitter usernames in strings.
#!/usr/bin/python
import re
text = "@RayFranco is answering to @jjconti, this is a real '@username83' but this is an@email.com, and this is a @probablyfaketwitterusername"
ftext = re.sub(r'(^|[^@\w])@(\w{1,15})\b', r'\1\2', text)
print(ftext)
This returns, as expected:
RayFranco is answering to jjconti, this is a real 'username83' but this is an@email.com, and this is a @probablyfaketwitterusername
Based on Twitter specs:
Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease.
A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.
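The two rules quoted above (at most 15 characters; only letters, digits and underscores) can be checked with a short Python sketch. The names MENTION_RE, find_mentions and is_valid_username are hypothetical helpers introduced here for illustration, and re.ASCII is used so that \w means exactly [A-Za-z0-9_].

```python
import re

# "@" followed by 1-15 word characters. The lookbehind rejects an "@" that
# follows a word character (as in an email address) or another "@".
MENTION_RE = re.compile(r'(?<![@\w])@(\w{1,15})\b', re.ASCII)


def find_mentions(text):
    """Return the usernames mentioned in text, without the leading '@'."""
    return MENTION_RE.findall(text)


def is_valid_username(name):
    """Check a bare username against the length and character rules."""
    return re.fullmatch(r'[A-Za-z0-9_]{1,15}', name) is not None


sample = ("@RayFranco is answering to @jjconti, this is a real '@username83' "
          "but this is an@email.com, and this is a @probablyfaketwitterusername")
mentions = find_mentions(sample)
print(mentions)
```

Names longer than 15 characters are skipped entirely instead of being truncated, because the trailing \b cannot match in the middle of a word.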
Twitter recently open-sourced implementations, in various languages including Java, Ruby (gem) and JavaScript, of the code they use for finding user names, hashtags, lists and URLs.
It is very regular expression oriented.
The only characters accepted in the form are A-Z, 0-9, and underscore. Usernames are not case-sensitive, though, so you could use r'(?i)@[a-z0-9_]+' to match everything correctly and also discern between users.
This is a method I have used in a project. It takes the text attribute of a tweet object and returns the text with both the hashtags and user_mentions linked to their appropriate pages on Twitter, complying with the most recent Twitter display guidelines:
import re

def link_tweet(tweet):
    """
    This method takes the text attribute from a tweet object and returns it with
    user_mentions and hashtags linked
    """
    # The link targets below are assumed URL patterns.
    tweet = re.sub(r'(\A|\s)@(\w+)', r'\1@<a href="https://twitter.com/\2">\2</a>', str(tweet))
    return re.sub(r'(\A|\s)#(\w+)', r'\1#<a href="https://twitter.com/hashtag/\2">\2</a>', tweet)
When you call this method, you can pass in my_tweet[x].text as the parameter. Hope this is helpful.
Shorter, /@([\w]+)/ works fine.
This regex seems to solve Twitter usernames:
^@[A-Za-z0-9_]{1,15}$
Max 15 characters, allows underscores directly after the @ (which Twitter does), and allows all underscores (which, after a quick search, I found that Twitter apparently also does). Excludes email addresses.
I have used the existing answers and modified them for my use case (the username must be longer than 4 characters):
^[A-Za-z0-9_]{5,15}$
Rules:
Your username must be longer than 4 characters.
Your username cannot be longer than 15 characters.
Your username can only contain letters, numbers and '_'.
Source: https://help.twitter.com/en/managing-your-account/twitter-username-rules
In case you need to match all of the handle, @handle and twitter.com/handle formats, this is a variation:
import re
match = re.search(r'^(?:.*twitter\.com/|@?)(\w{1,15})(?:$|/.*$)', text)
handle = match.group(1)
Explanation, examples and working regex here:
https://regex101.com/r/7KbhqA/3
Matched
myhandle
@myhandle
@my_handle_2
twitter.com/myhandle
https://twitter.com/myhandle
https://twitter.com/myhandle/randomstuff
Not matched
mysuperhandleistoolong
@mysuperhandleistoolong
https://twitter.com/mysuperhandleistoolong
You can use the following regex: ^@[A-Za-z0-9_]{1,15}$
In Python:
import re
pattern = re.compile('^@[A-Za-z0-9_]{1,15}$')
pattern.match('@Your_handle')
This will check if the string exactly matches the regex.
In a 'practical' setting, you could use it as follows:
pattern = re.compile('^@[A-Za-z0-9_]{1,15}$')
if pattern.match('@Your_handle'):
    print('Match')
else:
    print('No Match')

System.Web.HttpUtility.UrlEncode method gives wrong result with different language value

I am using the System.Web.HttpUtility.UrlEncode method in my project. When I encode a name in English, I get the correct result. For example,
string temp = System.Web.HttpUtility.UrlEncode("Jewelry");
then I get the exact string in the temp variable. But if I write the name in Russian, I get a different result.
string temp = System.Web.HttpUtility.UrlEncode("ювелирные изделия");
then the temp variable contains a value like "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
Can anyone help me how to achieve exact name as per language?
Thank you!
Actually, the method has done the right thing for you!
It encodes non-ASCII characters so that the string is valid in a URL and can be transmitted over the Internet. If you put your temp variable in a URL as a parameter, the server side will decode it back to your original string. That is what UrlEncode is meant for, so there is no actual problem here.
For further reading on URL encoding, have a look at this link: http://www.w3schools.com/tags/ref_urlencode.asp
If you enter that Russian phrase into the "URL Encoding Functions" part of that page, it returns the same result as the System.Web.HttpUtility.UrlEncode method does.
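The round trip can be sketched in Python with urllib.parse, used here as a stand-in for the .NET method. Python writes the hex digits in uppercase while the output in the question is lowercase; both decode to the same text.

```python
from urllib.parse import quote_plus, unquote_plus

name = "ювелирные изделия"

# Each character is converted to its UTF-8 bytes and every byte is written
# as %hh; the space becomes "+".
encoded = quote_plus(name)

# Decoding on the receiving side restores the original text.
decoded = unquote_plus(encoded)

# The lowercase output quoted in the question decodes to the same text.
from_question = unquote_plus(
    "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5"
    "+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
)

print(encoded)
print(decoded)
```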
Can anyone help me how to achieve exact name as per language?
In short: not with that method, but it might depend on what your exact goal is.
In detail:
In general, URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=. Any other character needs to be encoded with percent-encoding (%hh).
This is why UrlEncode produces
UrlEncode("Jewelry") -> "Jewelry"
UrlEncode("ювелирные изделия") -> "%d1%8e%d0%b2%d0%b5%d0%bb%d0%b8%d1%80%d0%bd%d1%8b%d0%b5+%d0%b8%d0%b7%d0%b4%d0%b5%d0%bb%d0%b8%d1%8f"
The string "ювелирные изделия" contains characters that are not allowed in a URL as per RFC 3986.
Today, modern browsers can work with UTF-8 in URLs, so it might not be necessary to use UrlEncode(). See this example: http://jsfiddle.net/ybgt96ms/