Base64 Encoded String for Filename - filenames

I can't think of an OS (Linux, Windows, Unix) where this would cause an issue, but maybe someone here can tell me if this approach is undesirable.
I would like to use a base64 encoded string as a filename. Something like gH9JZDP3+UEXeZz3+ng7Lw==. Is this likely to cause issues anywhere?
Edit: I will likely keep this to a max of 24 characters
Edit: It looks like I have a character that will cause issues. The function that generates my string produces strings like: J2db3/pULejEdNiB+wZRow==
You will notice that this has a / which is going to cause issues.
According to this site the / is a valid base64 character so I will not be able to use a base64 encoded string for a filename.

No. You cannot use a standard base64 encoded string as a filename, because / is a valid base64 character and file systems treat it as a path separator.
https://base64.guru/learn/base64-characters
Alternatives:
You could use base64 and then replace unwanted characters but a better option would be to hex encode your original string using a function like bin2hex().
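A minimal Python sketch of the hex alternative mentioned above. Here `bytes.hex()` plays the role that bin2hex() plays in PHP; the helper name `hex_filename` is hypothetical, and the sample bytes are simply decoded from the example string in the question.

```python
import base64


def hex_filename(raw: bytes) -> str:
    """Hex-encode raw bytes; the result contains only 0-9 and a-f."""
    return raw.hex()


# The 16 raw bytes behind the example string from the question.
raw = base64.b64decode("J2db3/pULejEdNiB+wZRow==")
name = hex_filename(raw)
print(name, len(name))
```

Note the trade-off: hex needs two characters per byte, so these 16 bytes become 32 characters instead of the 24 that base64 needs.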

The official RFC 4648 states:
An alternative alphabet has been suggested that would use "~" as the 63rd character. Since the "~" character has special meaning in some file system environments, the encoding described in this section is recommended instead. The remaining unreserved URI character is ".", but some file system environments do not permit multiple "." in a filename, thus making the "." character unattractive as well.
I also found this on the Server Fault Stack Exchange:
There is no such thing as a "Unix" filesystem. Nor a "Windows" filesystem come to that. Do you mean NTFS, FAT16, FAT32, ext2, ext3, ext4, etc. Each have their own limitations on valid characters in names.
Also, your question title and question refer to two totally different concepts? Do you want to know about the subset of legal characters, or do you want to know what wildcard characters can be used in both systems?
http://en.wikipedia.org/wiki/Ext3 states "all bytes except NULL and '/'" are allowed in filenames.
http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx describes the generic case for valid filenames "regardless of the filesystem". In particular, the following characters are reserved < > : " / \ | ? *
Windows also places restrictions on not using device names for files: CON, PRN, AUX, NUL, COM1, COM2, COM3, etc.
Most commands in Windows and Unix based operating systems accept * as a wildcard. Both the Windows command line and shells for Unix systems use ? as a single-character wildcard.
And this other one:
Base64 only contains A–Z, a–z, 0–9, +, / and =. So the list of characters not to be used is: all possible characters minus the ones mentioned above.
For special purposes . and _ are possible, too.
Which means that instead of the standard / base64 character, you should use _ or .; both on UNIX and Windows.
Many programming languages allow you to replace all / with _ or ., as it's only a single character and can be accomplished with a simple loop.
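Rather than replacing characters by hand, one option is the "URL and Filename Safe Alphabet" from RFC 4648 section 5, which swaps + and / for - and _. A minimal Python sketch, assuming the raw bytes are available (the helper name `filename_safe_b64` is hypothetical):

```python
import base64


def filename_safe_b64(raw: bytes) -> str:
    """Encode with the RFC 4648 section 5 alphabet: '-' and '_' replace '+' and '/'."""
    return base64.urlsafe_b64encode(raw).decode("ascii")


# Re-encode the bytes behind the problematic example from the question.
raw = base64.b64decode("J2db3/pULejEdNiB+wZRow==")
safe = filename_safe_b64(raw)
print(safe)  # J2db3_pULejEdNiB-wZRow==
```

The = padding is kept here. Also keep in mind that base64 is case-sensitive, so on a case-insensitive file system two different encoded names could still collide.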

In Windows, you should be fine as long as you conform to the Windows naming conventions:
https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions.
As far as I know, the only reserved character that a base64 encoded string can contain is /.
The other thing that is probably going to be a problem is the length of the file name.

Related

Saving CSV file with degree symbol and ASCII encoded

I have a string variable txt that contains the "°" degree symbol. I would like to save the string into an ASCII-encoded CSV file. I use the procedure below, but the "°" symbol is converted to "?". Do you have any idea how to save the degree symbol properly?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
    Try
        Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
            OutFile.Write(txt)
        End Using
        Me.Write_Log("Successfully Exported")
    Catch ex As Exception
        Me.Write_Log("Write Error during export")
    End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
    OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may have additional trouble for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your Windows OS, so even when the character is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a Windows application such as Notepad.
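To see how the same character maps to different bytes, here is a small sketch in Python (used only as a language-neutral illustration; it is not the VB.NET code from the answer, and the variable names are hypothetical). It relies on Python's built-in codec tables for ASCII, ISO 8859-1, UTF-8 and IBM437 (`cp437`).

```python
DEGREE = "\u00b0"  # the degree sign

# 7-bit ASCII has no degree sign, so a lossy encode substitutes "?".
ascii_bytes = DEGREE.encode("ascii", errors="replace")

latin1_bytes = DEGREE.encode("iso-8859-1")  # one byte
utf8_bytes = DEGREE.encode("utf-8")         # two bytes
cp437_bytes = DEGREE.encode("cp437")        # one byte, different from ISO 8859-1

# Byte value 167 (0xA7) decodes to different characters per code page.
byte_167 = bytes([167])
as_latin1 = byte_167.decode("iso-8859-1")   # section sign
as_cp437 = byte_167.decode("cp437")         # masculine ordinal indicator

print(ascii_bytes, latin1_bytes, utf8_bytes, cp437_bytes, as_latin1, as_cp437)
```

In these tables byte 167 decodes under IBM437 to the masculine ordinal indicator (º), which looks much like a degree sign, so it is worth checking which character the target system really expects before writing raw bytes.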

Is there any limitation in giving file name in Unix?

We use crontab to schedule jobs, and it was not picking up files whose names contain [ or ] or ¿ for processing. Is there any limitation on file names, or do these characters mean something in UNIX? Are there any other characters like these that we shouldn't use in file names? Thanks in advance.
Following are general rules for both Linux, and Unix (including *BSD) like systems:
All file names are case sensitive, so vivek.txt, Vivek.txt and VIVEK.txt are three different files.
You can use upper and lowercase letters, numbers, "." (dot), and "_" (underscore) symbols.
You can use other special characters such as blank space, but they are hard to use and it is better to avoid them.
In short, filenames may contain any character except / (which is reserved as the separator between files and directories in a pathname, and on its own names the root directory) and the null character.
There is no need to use a . (dot) in a filename, although a dot sometimes improves the readability of filenames.
And you can use dot based filename extension to identify file. For example:
.sh = Shell file
.tar.gz = Compressed archive
Most modern Linux and UNIX systems limit filenames to 255 characters (255 bytes). However, some older versions of UNIX limit filenames to only 14 characters.
A filename must be unique inside its directory. For example, inside the /home/vivek directory you cannot create both a file named demo.txt and a directory named demo.txt. However, other directories may contain files with the same name. For example, you can create a demo.txt directory in /tmp.
Linux / UNIX: Reserved Characters And Words
Avoid using the following characters in file names:
/
>
<
|
:
&
Please note that Linux and UNIX allow white space, <, >, |, \, :, (, ), &, ;, as well as wildcards such as ? and *, in file names, provided they are quoted or escaped using the \ symbol.
It is best to avoid white space in your filenames. It will make your scripting a lot easier.
I got the answer from this link. I am just pasting it here so that this info will be available even if that website goes down.
The only characters that are actually illegal in *nix filenames are / (reserved as the directory separator) and NUL (because it's the C string terminator). Everything else is fair game, although various utilities may fail on certain characters - typically characters that have special meaning to the shell. These will need quoting or escaping to be handled correctly.
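The following Python sketch illustrates both points: the two characters that are actually illegal, and why [ and ] can still trip up tools. The question does not say how the cron job selects its files, so treating the names as shell-style patterns is only an assumption. The helper name `is_valid_unix_name` and the file name are hypothetical, and the 255-byte limit is the common value quoted above, not a universal one.

```python
import fnmatch
import glob


def is_valid_unix_name(name: str) -> bool:
    """Check one path component: non-empty, no '/', no NUL, at most 255 bytes."""
    encoded = name.encode("utf-8")
    return 0 < len(encoded) <= 255 and "/" not in name and "\0" not in name


name = "report[1].txt"

# The name is legal, but used as a pattern "[1]" is a character class
# that matches the single character "1", not the literal text "[1]".
literal_match = fnmatch.fnmatchcase(name, name)

# glob.escape() neutralises the pattern characters so the name matches itself.
escaped = glob.escape(name)
escaped_match = fnmatch.fnmatchcase(name, escaped)

print(is_valid_unix_name(name), literal_match, escaped, escaped_match)
```

In a shell script the equivalent fix is to quote the file name so that the shell does not expand it as a pattern.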

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use: while the results contain both Perez and Pérez occurrences, I also get some results that apparently have nothing to do with the query).
In Spanish this happens quite a lot. Whether out of laziness or something else, people tend NOT to write the accents for this kind of thing, because it's assumed that all these searches work with both options (I guess that since Google allows it, everybody assumes it's supposed to work that way).
Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work; Soundex has always been strange. Since many attributes, including cn, are multi-valued, you could store a second value on the attribute that has the extended characters converted to their base versions. You would still have the original value to go off of when you needed it. You could also get real fancy and prefix the converted value with something, then use the valuesReturnFilter to filter it out of your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
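A minimal Python sketch of the client side of this approach: derive the accent-free form of the search term and build the OR filter shown above. The function names and the `{stripped}` prefix default are illustrative assumptions, and the sketch does not escape filter metacharacters in the term.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_name_filter(term: str, prefix: str = "{stripped}") -> str:
    """Match either the value as typed or the prefixed accent-free value."""
    return f"(|(cn={term})(cn={prefix}{strip_accents(term)}))"


print(build_name_filter("Pérez"))
```

The same `strip_accents` function could be used when populating the second cn value. Note that it also turns ñ into n, which may or may not be what you want for Spanish names.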
Search filters ("queries") are specified by RFC 2254 (since obsoleted by RFC 4515).
Encoding: RFC 2254 indirectly requires a filter to be an OCTET STRING, i.e. a string of 8-bit bytes: AttributeValue is an OCTET STRING, and MatchingRuleId and AttributeDescription are LDAPStrings, where LDAPString is itself an OCTET STRING.
The standard on escaping: use "\<ASCII HEX NUMBER>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples at https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".
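The escaping rule quoted from RFC 4515 can be sketched in Python as follows. This is a minimal illustration rather than a full filter builder; the function name is hypothetical, and a real LDAP client library may already provide an equivalent helper.

```python
# Characters that RFC 4515 requires to be written as a backslash
# followed by the two hex digits of the octet.
_ESCAPES = {
    "\\": "\\5c",
    "*": "\\2a",
    "(": "\\28",
    ")": "\\29",
    "\0": "\\00",
}


def escape_filter_value(value: str) -> str:
    """Escape a user-supplied value before placing it in a search filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


user_input = "Perez (Jr)*"
print(f"(cn=*{escape_filter_value(user_input)}*)")
```

Non-ASCII characters such as é are left as they are, since the filter string as a whole is UTF-8.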

When should space be encoded to plus (+) or %20? [duplicate]

Sometimes the spaces get URL encoded to the + sign, and some other times to %20. What is the difference and why should this happen?
+ means a space only in application/x-www-form-urlencoded content, such as the query part of a URL:
http://www.example.com/path/foo+bar/path?query+name=query+value
In this URL, the parameter name is query name with a space and the value is query value with a space, but the folder name in the path is literally foo+bar, not foo bar.
%20 is a valid way to encode a space in either of these contexts. So if you need to URL-encode a string for inclusion in part of a URL, it is always safe to replace spaces with %20 and pluses with %2B. This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode does in PHP (rawurlencode is safer).
See Also
HTML 4.01 Specification application/x-www-form-urlencoded
So, the answers here are all a bit incomplete. The use of a '%20' to encode a space in URLs is explicitly defined in RFC 3986, which defines how a URI is built. There is no mention in this specification of using a '+' for encoding spaces - if you go solely by this specification, a space must be encoded as '%20'.
The mention of using '+' for encoding spaces comes from the various incarnations of the HTML specification - specifically in the section describing content type 'application/x-www-form-urlencoded'. This is used for posting form data.
Now, the HTML 2.0 specification (RFC 1866) explicitly said, in section 8.2.2, that the query part of a GET request's URL string should be encoded as 'application/x-www-form-urlencoded'. This, in theory, suggests that it's legal to use a '+' in the URL in the query string (after the '?').
But... does it really? Remember, HTML is itself a content specification, and URLs with query strings can be used with content other than HTML. Further, while the later versions of the HTML spec continue to define '+' as legal in 'application/x-www-form-urlencoded' content, they completely omit the part saying that GET request query strings are defined as that type. There is, in fact, no mention whatsoever about the query string encoding in anything after the HTML 2.0 specification.
Which leaves us with the question - is it valid? Certainly there's a lot of legacy code which supports '+' in query strings, and a lot of code which generates it as well. So odds are good you won't break if you use '+'. (And, in fact, I did all the research on this recently because I discovered a major site which failed to accept '%20' in a GET query as a space. They actually failed to decode any percent encoded character. So the service you're using may be relevant as well.)
But from a pure reading of the specifications, without the language from the HTML 2.0 specification carried over into later versions, URLs are covered entirely by RFC 3986, which means spaces ought to be converted to '%20'. And definitely that should be the case if you are requesting anything other than an HTML document.
http://www.example.com/some/path/to/resource?param1=value1
The part before the question mark must use % encoding (so %20 for space), after the question mark you can use either %20 or + for a space. If you need an actual + after the question mark use %2B.
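The path/query distinction can be tried out with Python's standard `urllib.parse` module. This is a small sketch; the sample strings and variable names are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

# Path segment: percent-encoding only, so a space becomes %20.
path_segment = quote("foo bar", safe="")

# Form-style query encoding: a space becomes +, and a literal + becomes %2B.
form_value = quote_plus("foo bar")
literal_plus = quote_plus("1+1")
query = urlencode({"query name": "query value"})

# Decoding: + only means "space" under the form-encoding rules.
plain_decode = unquote("foo+bar")
form_decode = unquote_plus("foo+bar")

print(path_segment, form_value, literal_plus, query, plain_decode, form_decode)
```

The decoder has to know which convention was used: `unquote` leaves the + alone, while `unquote_plus` turns it into a space.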
For compatibility reasons, it's better to always encode spaces as "%20", not as "+".
It was RFC 1866 (HTML 2.0 specification), which specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs. (see paragraph 8.2.1. subparagraph 1.). This way of encoding form data is also given in later HTML specifications, look for relevant paragraphs about application/x-www-form-urlencoded.
Here is an example of a URL string where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses, according to RFC 1866. In other cases, spaces should be encoded to %20. But since it's hard to determine the context, it's the best practice to never encode spaces as "+".
I would recommend to percent-encode all characters except "unreserved" defined in RFC 3986, p.2.3.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
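As a sketch of that recommendation in Python: `urllib.parse.quote` with `safe=""` percent-encodes everything except the unreserved set (this assumes Python 3.7 or later, where "~" is treated as unreserved). The helper name `encode_component` is hypothetical.

```python
import string
from urllib.parse import quote

# The "unreserved" set from RFC 3986 section 2.3.
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def encode_component(text: str) -> str:
    """Percent-encode every character outside the unreserved set."""
    return quote(text, safe="")


encoded = encode_component("a b+c/d~e_f.g-h")
print(encoded)
```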
The only situation when you may want to encode spaces as "+" (one byte) rather than "%20" (three bytes) is when you know for sure how to interpret the context, and when the size of the query string is of the essence.
What's the difference? See the other answers.
When should we use + instead of %20? Use + if, for some reason, you want to make the URL query string (?.....) or hash fragment (#....) more readable. Example: You can actually read this:
https://www.google.se/#q=google+doesn%27t+encode+:+and+uses+%2B+instead+of+spaces
(%2B = +)
But the following is a lot harder to read (at least to me):
https://www.google.se/#q=google%20doesn%27t%20oops%20:%20%20this%20text%20%2B%20is%20different%20spaces
I would think + is unlikely to break anything, since Google uses + (see the 1st link above) and they've probably thought about this. I'm going to use + myself just because readable + Google thinks it's OK.

Are square brackets permitted in URLs?

Are square brackets in URLs allowed?
I noticed that Apache Commons HttpClient (3.0.1) throws an IOException, whereas wget and Firefox accept square brackets.
URL example:
http://example.com/path/to/file[3].html
My HTTP client encounters such URLs but I'm not sure whether to patch the code or to throw an exception (as it actually should be).
RFC 3986 states
A host identified by an Internet
Protocol literal address, version 6
[RFC3513] or later, is distinguished
by enclosing the IP literal within
square brackets ("[" and "]"). This
is the only place where square bracket
characters are allowed in the URI
syntax.
So in theory you should not be seeing such URIs in the wild, as they should arrive encoded.
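For reference, this is what the one permitted use of square brackets looks like in practice. A small Python sketch using the standard `urllib.parse.urlsplit`; the address is from the IPv6 documentation range and the port is arbitrary.

```python
from urllib.parse import urlsplit

# Brackets delimit the IPv6 literal so its colons are not confused
# with the colon that introduces the port.
parts = urlsplit("http://[2001:db8::1]:8080/path/to/file")

host = parts.hostname  # brackets are removed from the parsed host
port = parts.port

print(host, port)
```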
Square brackets [ and ] in URLs are not often supported.
Replace them by %5B and %5D:
Using a command line, the following example is based on bash and sed:
url='http://example.com?day=[0-3][0-9]'
encoded_url="$( sed 's/\[/%5B/g;s/]/%5D/g' <<< "$url")"
Using Java URLEncoder.encode(String s, String enc)
Using PHP rawurlencode() or urlencode()
<?php
echo '<a href="http://example.com/day/',
rawurlencode('[0-3][0-9]'), '">';
?>
output:
<a href="http://example.com/day/%5B0-3%5D%5B0-9%5D">
or:
<?php
$query_string = 'day=' . urlencode('[0-3][0-9]') .
'&month=' . urlencode('[0-1][0-9]');
echo '<a href="http://example.com?',
htmlentities($query_string), '">';
?>
Using your favorite programming language... Please extend this answer by posting a comment or editing directly this answer to add the function you use from your programming language ;-)
For more details, see RFC 3986, which specifies the URI syntax. Appendix A gives the collected grammar, in which the brackets belong to “gen-delims” and so have to be %-encoded in the query string.
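To extend the list with one more language, here is a minimal Python sketch using the standard `urllib.parse` module, with the same sample values as the PHP examples above.

```python
from urllib.parse import quote, urlencode

# Path segment: encode everything that is not unreserved.
path_part = quote("[0-3][0-9]", safe="")
link = "http://example.com/day/" + path_part

# Query string: urlencode() also encodes the brackets in each value.
query_string = urlencode({"day": "[0-3][0-9]", "month": "[0-1][0-9]"})

print(link)
print(query_string)
```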
I know this question is a bit old, but I just wanted to note that PHP uses brackets to pass arrays in a URL.
http://www.example.com/foo.php?bar[]=1&bar[]=2&bar[]=3
In this case $_GET['bar'] will contain array(1, 2, 3).
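The bracket suffix is a PHP convention rather than part of the URL syntax. As a comparison, Python's standard `parse_qs` treats the brackets as ordinary characters of the key and collects repeated keys into a list anyway (a small sketch; the variable name is arbitrary):

```python
from urllib.parse import parse_qs

params = parse_qs("bar[]=1&bar[]=2&bar[]=3")
print(params)  # {'bar[]': ['1', '2', '3']}
```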
Pretty much the only characters not allowed in pathnames are # and ? as they signify the end of the path.
The URI RFC will have the definitive answer:
http://www.ietf.org/rfc/rfc1738.txt
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems. The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character "%" is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are "{", "}", "|", "\", "^", "~",
"[", "]", and "`".
All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.
The answer is that they should be hex encoded, but knowing Postel's law, most things will accept them verbatim.
Any browser or web-enabled software that accepts URLs and is not throwing an exception when special characters are introduced is almost guaranteed to be encoding the special characters behind the scenes. Curly brackets, square brackets, spaces, etc all have special encoded ways of representing them so as not to produce conflicts. As per the previous answers, the safest way to deal with these is to URL-encode them before handing them off to something that will try to resolve the URL.
For using the HttpClient commons class, you want to look into the org.apache.commons.httpclient.util.URIUtil class, specifically the encode() method. Use it to URI-encode the URL before trying to fetch it.
Stack Overflow does not seem to encode them:
https://stackoverflow.com/search?q=square+brackets+[url]
Best to URL encode those, as they are clearly not supported in all web servers. Sometimes, even when there is a standard, not everyone follows it.
According to the URL specification, the square brackets are not valid URL characters.
Here are the relevant snippets:
The "national" and "punctuation" characters do not appear in any
productions and therefore may not appear in URLs.
national { | } | vline | [ | ] | \ | ^ | ~
punctuation < | >
Square brackets are considered unsafe, but the majority of browsers will parse them correctly. Having said that, it is better to replace square brackets with some other characters.