What are the best rules to follow for what characters to allow in a password? - passwords

Without thinking about it at all I just want to say I should allow every character. It gets hashed in any case, and I don't want to limit people who want to create strong passwords.
However, thinking about it more, there are plenty of characters that I have no idea what effect they'd have: foreign characters, ASCII symbols, and so on, to name a couple.
I tried to Google but I can't find any definitive standard for what people do. Even most professional organizations don't seem to know. It seems to be a common practice for many sites to disallow special characters altogether, which is just silly and not what I want to do.
Anyway, are there any standard recommendations for length, allowed characters, and so forth?
I'm not sure if it matters, but I'll be using ASP.NET w/ C#

Any printable, non-whitespace ASCII character (between 33 and 126 inclusive) is typically allowed in passwords. Many security professionals (and SO commenters) advise using a passphrase in place of a password, so you'd have to allow spaces too. The argument is that, due to their length, and since phrases aren't in a dictionary, passphrases are more difficult to crack than passwords. (A passphrase can also be easier to remember, so a legitimate user doesn't have to keep it written on a sticky note stuck to their monitor.)
Some strong password generators use a hash, so I'd put a very high limit on the length (512 or 1024) just to be inclusive. Password generators today often yield strings of 32-128 characters, but who knows what hashes will be used in the next few years.
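For what it's worth, here is a minimal C# sketch of such a policy (the question mentions ASP.NET/C#); the 1024-character cap and the exact allowed range are arbitrary choices for illustration, not a standard:

```csharp
using System.Linq;

static class PasswordPolicy
{
    // Arbitrary but generous upper bound, so generated passphrases and
    // long random passwords are not rejected.
    const int MaxLength = 1024;

    // Allow printable ASCII (33-126) plus the space character (32),
    // so passphrases with spaces are accepted.
    public static bool IsAllowed(string password) =>
        password.Length >= 1 &&
        password.Length <= MaxLength &&
        password.All(c => c >= ' ' && c <= '~');
}
```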

Non-ASCII characters certainly make things harder when it comes to entering the password on limited devices (mobiles, consoles, etc.) - but usually not impossible. Arguably, if the user wants to do that, you should let them. It's easy enough to do a reasonable and consistent thing - encode in UTF-8 before hashing, for example. You'd only get into difficulties if some input device sent the characters as a composition (e.g. e + combining acute accent instead of the precomposed "e acute") - but I suspect that wouldn't happen in real life. (You could decompose everything yourself, but that would be a lot of trouble to go to for an edge case.)
I'd restrict it to printable characters, however. Putting tabs, form feeds etc in a password really is asking for trouble.
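To illustrate the "encode consistently before hashing" idea, here is a rough C# sketch; SHA-256 is used only as a placeholder hash, and a real system would also add a salt and a slow, dedicated password-hashing function:

```csharp
using System.Security.Cryptography;
using System.Text;

static class PasswordHashing
{
    // Encode consistently before hashing so the same password always
    // produces the same bytes, regardless of how the client sent it.
    public static byte[] HashPassword(string password)
    {
        // Canonicalize composed vs. decomposed forms (e.g. "e" + combining
        // acute vs. the precomposed "e acute") to one representation.
        string normalized = password.Normalize(NormalizationForm.FormC);
        byte[] utf8 = Encoding.UTF8.GetBytes(normalized);
        using (var sha = SHA256.Create())
            return sha.ComputeHash(utf8); // placeholder; use a salted, slow KDF in production
    }
}
```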

Not an expert, but I hate it when characters I choose that aren't that bizarre get rejected. So, I think I agree with your gut.

Short answer: allow as much as the system backing it can support. Nowadays there's really no excuse not to use full Unicode support for text entry, and that includes passwords. I don't think you need to worry about problems with characters as long as they're handled literally (but I'm not a pro in this field - beware of SQL injection).
I have a pet peeve against sites that impose restrictions on passwords... any kind of restriction. I like sites that will tell you how strong your password is and recommend you make it stronger, but forcing a user to type at least 8 characters, or to require both letters and numbers, etc. is just plain frustrating.
If you need to have a maximum field size (for example for storing in a database) try to make it large enough for anything that people would type out by hand. There's really no such thing as a too-large password field since there's always the potential to use an automated, generated strong password, but 64 to 128 characters would certainly suffice.

Fundamentally, most of the Unicode character classes should be allowed. Do skip, however, control characters (e.g. 0-31; space itself is fine) and the byte order marks (0xFFFE and 0xFEFF). Further, you want to canonicalize the representation first, to get rid of problems caused by differing representations of the same text. You might issue warnings for characters that seem too hard to enter, but users will guard against that themselves.
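A small C# sketch of that filtering and canonicalization step; the specific checks simply mirror the advice as given and are not a complete validation routine:

```csharp
using System.Text;

static class PasswordInput
{
    // Reject control characters and byte-order marks, then canonicalize.
    // Returns null if the password contains something we refuse to accept.
    public static string Sanitize(string password)
    {
        foreach (char c in password)
        {
            if (char.IsControl(c))              // U+0000-U+001F, U+007F-U+009F
                return null;
            if (c == '\uFEFF' || c == '\uFFFE') // byte order marks
                return null;
        }
        return password.Normalize(NormalizationForm.FormC);
    }
}
```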

Remember: when you are storing passwords, every password should be hashed with a one-way algorithm like MD5 or SHA-1. Since the output of these algorithms is a fixed-length value (usually stored as a hexadecimal string), you don't need to worry about SQL injection or anything like that.
So, as long as you can feed a character into MD5 or SHA-1, it should be accepted.

If you are talking about preventing SQL-injection type of attacks, it is probably a better idea to make sure your code does what it is supposed to do, rather than relying on restricting the input so the problem becomes easier.
For non-ASCII characters, I don't see that as a more difficult problem, as long as your input can be correctly represented as a binary string (and not as text), which is then passed to your hash function or key generator, etc.

Add another vote for "let the user include any and all characters that their interface allows them to enter". I wouldn't even disallow tab or control characters. Your software has the capability to accept arbitrary byte strings and hash them, so accept arbitrary byte strings as passwords. To do otherwise reduces the space which an attacker must search in a brute-force or dictionary attack.
(Of course, even if you do allow everything, 99% of users will still use their pet's name as their password...)

Eventually you may have to include the cleartext password in a confirmation email sent to your users.
P.S.: You might also consider encoding problems in the email: if it's not standard ASCII (e.g. Japanese characters), it's possible that a user won't receive the email in the proper format, or simply can't read it on another system because the fonts aren't installed.
All this weighs in favor of the printable ASCII character range.

Displaying Korean Characters - iOS App

I am trying to display Korean text in my iPhone app. The app appends the Unicode of letters one by one to an NSMutableString and displays the string on the screen after each letter is appended.
I understand that there are some rules for conjoining letters (Jamo).
Is there a function for automatically applying all these rules to a string of letters or do I need to write code to make changes (e.g., changing a consonant to a tail consonant if there is a vowel before it)?
FCA, it's you who sent the email to me, right? Since the more detailed question is here, I will try (my best) to answer here instead of replying to your email.
By reading the whole text you and people wrote here, I figured out that you are making a Korean handwriting recognition software. So, you would not enjoy the luxury of the Korean input method provided by Apple.
There are two things for me to say. Let's go one by one. (I believe you are already aware of one of the two things I'm going to explain.)
How to compose Hangul text.
So, reading your inquiry, it should not be about Unicode composed/decomposed Korean strings (or just a series of Ja (consonants) and Mo (vowels)). Your question looks to be about how to determine whether a consonant a user writes (your term is "tail consonant", right?) is the last consonant of one syllable or the beginning consonant of the next syllable.
Best thing is to learn Korean, but let me briefly explain it.
Let's say you write 소방차 (a Fire dept. car.)
You are to write : ㅅㅗㅂㅏㅇㅊㅏ
(Again I'm not talking about the decomposed form of Unicode. It's about how people write Korean text.)
When you type ㅗ (the 2nd character), the display system temporarily displays 소 by attaching the ㅗ to the preceding ㅅ, and then looks it up in a Korean syllable table. (Although Hangul is assembled in the JoHap style (조합형, the "composite" style), Korean standards also define tables of allowed syllables in the Wansung style (완성형, the "precomposed" style). So you test the "assembled" syllable against the table to see whether such a syllable exists.) Here you will find 소 in the table, so you display 소.
Now the next character, ㅂ, is written. Here it becomes a little complicated. Because there is a syllable 솝 in the table, it first attaches ㅂ to the preceding syllable, so it displays 솝. However, things are not completely determined yet. The user writes the next character, ㅏ. It is certain that there is no syllable without a first/beginning consonant (Ja); it will look up the table, but fail to find a syllable ㅏ.
So it guesses that the ㅂ attached to the previous syllable actually belongs to the 2nd syllable, and it should display 소바. Now ㅇ is typed. It tries to attach the ㅇ to the second syllable, so it displays 소방. (At this moment it can also look up 방 in the table, and it is found.)
Now ㅊ is typed. Internally it can probably test whether ㅇ and ㅊ can sit together under 바 as a compound final consonant (I can't write such a syllable here, because none exists, unlike, say, 밝). Since there is no such syllable, it instantly determines that ㅊ belongs to the next syllable.
Then "ㅏ" is typed. It will assemble ㅊ and ㅏ to make 차. When you press the space key or return key or any other white space key, it will finish composing Hangul.
This is a simple case. In Korean, there are more complicated syllables like 빨, 꼭, 헗, etc. For the first consonants, double consonants (복자음, BokJaUm) like the ㅃ and ㄲ in 빨 and 꼭 are typed by pressing ㅂ and ㄱ with the shift key, which produces ㅃ and ㄲ. So figuring out how many consonants there are and where they belong (previous syllable or next syllable) can be easy if the user types with a keyboard. (However, there are some nice Korean input methods for Windows and Xterm that let you type ㅂ twice to make ㅃ. It's a kind of intelligent feature. But testing text like 빱빠라빱 or 흙을 can be complicated, because you end up testing 3 or 4 consonants grouped like {1,3}, {2,2}, {3,1}.)
The bad news is that because you are writing handwriting recognition, you may need to handle such complicated cases if you feed recognized Hangul characters one by one into a Korean input method engine. However, if you write your own input method in your app, you can maintain your own state machine, so it can be easier. As you can see, it's a trade-off: depending on an existing input method engine and feeding each character into it, versus doing it yourself. (Hmm... wait... maybe the input method engine can handle those complicated cases too.)
FYI, I would like to introduce two open source projects. One is a Korean input method Finder module for Mac, and the other is an input method engine with which you can make a Korean input method. Also, there is a Korean input method for X-Windows hosted here. If you prefer Windows project to look up, here is one.
The latter two were hosted at KLDP.net, a Korean open source project hosting site, but they were moved to Google code. As far as I can remember, "SaeNaRu" and "Nabi" (butterfly) can support typing the same consonant twice to make a double consonant.
For more detailed information, you can look at libhangul and Nabi. (I remember that the input method part of the code was almost the same between libhangul and Nabi at one point, but they were separated and expected to evolve independently, so I guess they are different now.)
OK. The first thing is done.
Now let's move on to the second issue. (This is the part I said you may know about already. But just to complete my explanation, let me explain this also.)
It's about which characters to feed into your probable Korean input method state machine or an engine like libhangul. There are basically two representations of composed (on display) Hangul characters: composed and decomposed. The composed one contains fully composed characters. For example, in 사랑합니다, each syllable 사, 랑, 합, 니, 다 is saved as such; they are not stored as ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
That is the composed representation in Unicode. This representation is usually used by text editors, etc. The other representation is the decomposed one in Unicode. It's like ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
This representation is usually used by file systems. For example, if you put a file name in Hangul on Windows, and access the folder which contains it from Mac, it will be displayed like ㅅㅏㄹㅏㅇㅎㅏㅂㄴㅣㄷㅏ although it is displayed as 사랑합니다 on Windows.
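To make the two representations concrete, here is a small demo using .NET's built-in normalization (C# is used for code samples throughout this page); FormC is the composed form and FormD the decomposed one:

```csharp
using System;
using System.Text;

class NormalizationDemo
{
    static void Main()
    {
        string composed = "사랑합니다";                                        // precomposed syllables (NFC)
        string decomposed = composed.Normalize(NormalizationForm.FormD);      // conjoining jamo (NFD)

        Console.WriteLine(composed.Length);    // 5  (one char per syllable)
        Console.WriteLine(decomposed.Length);  // 12 (one char per jamo)
        Console.WriteLine(composed == decomposed);                                    // False
        Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True
    }
}
```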
However, if memory serves, there is another set of characters (the Hangul Compatibility Jamo block) which is just a list of Hangul consonants and vowels. Although they may look the same as or similar to decomposed jamo, they are actually different in that they are drawn in the middle of the character cell. Their purpose is to present Hangul characters in Korean alphabet tables and the like, for education (or any other) purpose.
So, I'm not sure which characters (i.e. the decomposed jamo or the characters from the list of Hangul consonants and vowels) to feed into the input method state machine or input method engine you choose or implement. If you implement it yourself, it's your choice, but if you use an external library for the engine, you need to figure it out.
Also, as I mentioned in my blog post, there are two variants in each composed and decomposed representation, which are all defined in Unicode standard. So, well.. yeah.. I agree. It's quite a bit of work.
As for me, I tried to make an input method for Mac (when Apple announced they would get rid of the Finder plugin architecture for security reasons), but at that time libhangul (yes, I tried to use it) was changing a lot, so I decided to hold off until it stabilized. Then I became very busy at work and tired when I got home, so I didn't make progress on my own input method. I believe the state of the libhangul project is much better now than ever, so it's worth at least taking a look at it.
Also, if you don't have Windows, it would be good to try hanterm or any xterm derivative which supports Hangul input natively. The source code will be available at their hosting web sites.
Good luck with your project, and if there are more things to ask me, please do so.
Check out these system-level text-input facilities. I've never used them, but they look promising.
http://developer.apple.com/library/ios/#documentation/StringsTextFonts/Conceptual/TextAndWebiPhoneOS/CustomTextProcessing/CustomTextProcessing.html#//apple_ref/doc/uid/TP40009542-CH4-SW8
http://developer.apple.com/library/ios/#documentation/UIKit/Reference/UITextInput_Protocol/Reference/Reference.html#//apple_ref/occ/intf/UITextInput
Because iOS doesn't support system-wide keyboard customization, everybody just uses the system-default input facility. And the handling of Hangul composition differs across operating systems and platforms (MS/Apple/Samsung/LG and others). So the best way is to use a system-supplied facility such as UITextField, for consistency for users. Otherwise you should accurately simulate how your platform OS does it. Of course you can build it yourself, but users won't like it.
Though I'm not an expert on this topic (the Korean Hangul compositor), I don't think there's a simple algorithm without a table lookup. Anyway, if you really want to implement it yourself, these are the core problems you have to handle.
Compositing your visual symbols into the consonants and vowels defined in Unicode.
Determining initial-consonant / final-consonants by placement of vowels.
It wouldn't be so hard, but you will need the ability to modify the preceding character sequence. You cannot implement Korean input with a one-way stream only, unless you have separate keys for initial and final consonants that look the same.
Unicode defines the full set of valid Jamo components. Usually those components are too many to be presented on a device, and it is also inefficient. Most Korean input systems decompose those Jamo again and composite them once more before compositing the final letter. You can also identify and decompose them visually, just like Korean people do.
After you get the initial/final consonants and vowels defined in the Unicode standard, the Unicode normalization feature (such as -[NSString precomposedStringWithCompatibilityMapping]) will do the rest of the job.
libhangul (code.google.com/p/libhangul) does the conversion! It has several functions to handle different types of keyboards (i.e., keyboards with different layouts) and to convert keystrokes to the Unicode code points of Hangul jamo.
It also has several functions which combine the Hanguls to make syllables (they basically implement table lookups that Eonil has mentioned in his response).
Libhangul stores the Hanguls in its buffer as it receives them (it does not output them). After receiving enough Hanguls and successfully converting them into a syllable, it outputs the syllable. Unfortunately, this is quite confusing for the user. The way around this is displaying the buffer content on the screen. After receiving a new Hangul, what has been displayed must be erased. If a syllable has been successfully formed, then the syllable is displayed. Otherwise, the buffer content is displayed again. Note that you can’t just display the new Hangul on the screen. You must erase what you have displayed before and read the previous Hanguls and the new one from the buffer and display them on the screen again.
The reason is that Libhangul may change the code for the previous Hanguls stored in the buffer to make it possible to combine them with the new Hangul. This way, you will get the updated Hanguls.
Also note if the user changes the location of the cursor, the buffer must be emptied.
Additionally, if the user presses backspace, then, the last Hangul displayed on the screen must be erased and must be removed from the buffer.
Libhangul also has some features for correcting typos. For example, if you type ᅡ and ᄉ, it converts them into 사.
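A rough sketch of the buffer-and-redisplay flow described above, written in C# for brevity; the IHangulEngine interface and its members are hypothetical stand-ins for libhangul's input context, not the library's real API:

```csharp
// Hypothetical engine interface standing in for libhangul's input context;
// the member names here are illustrative, not the library's actual API.
interface IHangulEngine
{
    bool Process(char jamo);        // feed one recognized jamo
    string PreeditText { get; }     // syllable(s) still being composed
    string CommittedText { get; }   // syllable(s) finalized by the last call
}

class HangulDisplayLoop
{
    readonly IHangulEngine engine;
    string committed = "";

    public HangulDisplayLoop(IHangulEngine engine) { this.engine = engine; }

    // Called once per recognized jamo. Returns the full text to draw,
    // following the "erase and redraw from the buffer" approach above.
    public string OnJamo(char jamo)
    {
        engine.Process(jamo);
        committed += engine.CommittedText;      // syllables the engine has finalized
        return committed + engine.PreeditText;  // redraw committed text plus the in-progress syllable
    }
}
```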
Thank you JongAm Park and Eonil for your help and thoughtful comments! Since my reputation is less than 15 at this point, I can’t upvote your answers, but I will do when I can.

Password strength check: comparing to previous passwords

Every now and then I come across applications that force you to change passwords once in a while. Almost universally, they have this strange requirement for the new password: it has to be "significantly" different from your previous password(s).
While at first this sounds logical, the next thing I think is: how do they do that? Do they store my passwords in plain text? I would have accepted the answer that they do, if it weren't for the fact that these are the kinds of applications that pretend to care about security so much that they force you to change your password when it expires! Microsoft Exchange is one example of this.
I'm not very good at cryptography and hash functions, so my question is this: Is it possible to enforce this kind of policy without storing passwords in plain text?
Do you know how this policy is implemented in real world applications?
UPDATE: An Example.
I was recently changing my Microsoft Exchange password. I only use Web Access, so it might be a little different - I have no idea.
So, it forces me to change my password. What I sometimes do is change it to something new and then change it back almost immediately. The freaky part is that it did not allow me to even change it back, because of this policy. I tried changing it a little, by adding a letter in front of it or changing one symbol - no luck, it kept complaining.
With a typical hash, the best you can do is see if the new password is exactly equal to previous ones. You can break the password into multiple hashes in order to get more flexible with comparison, for example 3 hashes:
Alpha characters only
Numeric characters only
All other characters
You could, for example, require all three hashes to change for the new password to be accepted, to prevent users from just changing their password from SecretPassword01 to SecretPassword02.
A cryptographic expert may weigh in here on if this could be made as secure as a single hash.
NOTE that this is not as secure as a single hash, so before you go implementing this, make sure you have really done your research.
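A C# sketch of the three-hash idea, with the caveat above kept in mind (splitting by character class leaks structure, so this is illustration, not a recommendation):

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class PartialHashes
{
    static string HashOf(string s)
    {
        using (var sha = SHA256.Create())
            return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(s)));
    }

    // Hash each character class of the password separately, as sketched above.
    public static string[] Split(string password) => new[]
    {
        HashOf(new string(password.Where(char.IsLetter).ToArray())),                 // letters only
        HashOf(new string(password.Where(char.IsDigit).ToArray())),                  // digits only
        HashOf(new string(password.Where(c => !char.IsLetterOrDigit(c)).ToArray()))  // everything else
    };
}
```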
When changing password you're usually asked for the old one to confirm your identity. It's then trivial to compare the old one and the new one to see how much they differ. TBH I don't know how to compare to several previous passwords without storing them, but that's getting into the territory of ridiculous policies anyway.
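A minimal C# sketch of that comparison, assuming the old password is available in plain text at change time; the edit-distance threshold of 3 is an arbitrary example value:

```csharp
using System;

static class PasswordSimilarity
{
    // Since the old password is provided in plain text during a change,
    // a simple edit distance can flag "new password too similar to the old one".
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // e.g. require at least 3 edits between old and new password
    public static bool IsTooSimilar(string oldPw, string newPw) =>
        Levenshtein(oldPw, newPw) < 3;
}
```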

Isn't it difficult to recognize a successful decryption?

When I hear about methods for breaking encryption algorithms, I notice the focus is often on how to decrypt very rapidly and how to reduce the search space. However, I always wonder how you can recognize a successful decryption, and why this doesn't form a bottleneck. Or is it often assumed that an encrypted/decrypted pair is known?
From Cryptonomicon:
There is a compromise between the two extremes of, on the one hand, not knowing any of the plaintext at all, and, on the other, knowing all of it. In the Cryptonomicon that falls under the heading of cribs. A crib is an educated guess as to what words or phrases might be present in the message. For example if you were decrypting German messages from World War II, you might guess that the plaintext included the phrase "HEIL HITLER" or "SIEG HEIL." You might pick out a sequence of ten characters at random and say, "Let's assume that this represented HEIL HITLER. If that is the case, then what would it imply about the remainder of the message?"
...
Sitting down in his office with the fresh Arethusa intercepts, he went to work, using FUNERAL as a crib: if this group of seven letters decrypts to FUNERAL, then what does the rest of the message look like? Gibberish? Okay, how about this group of seven letters?
Generally, you have some idea of the format of the file you expect to result from the decryption, and most formats provide an easy way to identify them. For example, nearly all binary formats such as images, documents, zipfiles, etc, have easily identifiable headers, while text files will contain only ASCII, or only valid UTF-8 sequences.
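A hedged C# sketch of that kind of check, assuming the expected plaintext is either text or one of a couple of common file formats; the header list is obviously not exhaustive:

```csharp
using System.Text;

static class PlaintextHeuristics
{
    // Rough check: does a candidate decryption look like something we expect?
    public static bool LooksLikePlaintext(byte[] candidate)
    {
        // Known binary headers, e.g. PNG (0x89 'P' 'N' 'G') and ZIP ("PK").
        if (candidate.Length >= 4 &&
            candidate[0] == 0x89 && candidate[1] == 'P' && candidate[2] == 'N' && candidate[3] == 'G')
            return true;
        if (candidate.Length >= 2 && candidate[0] == 'P' && candidate[1] == 'K')
            return true;

        // Otherwise, accept only if it decodes as valid UTF-8 text.
        try
        {
            new UTF8Encoding(false, true).GetString(candidate); // throws on invalid bytes
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```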
In asymmetric cryptography you usually have access to the public key. Therefore, any decryption of an encrypted ciphertext can be re-encrypted using the public key and compared to the original ciphertext, thus revealing whether the decryption was successful.
The same is true for symmetric encryption. If you think you have decrypted a cipher, you must also think that you have found the key. Therefore, you can use that key to encrypt your, presumably correct, decrypted text and see if the encrypted result is identical to the original ciphertext.
For symmetric encryption where the key length is shorter than the ciphertext length, you're guaranteed not to be able to produce every possible plaintext. You can probably guess, to some degree, what form your plaintext will take - you probably know whether it's an image, or XML, and if you don't even know that much then you can assume you'll be able to run file on it and not get 'data'. You have to hope that there are only a few keys that would give you even a vaguely sensible decryption, and only one that matches the form you are looking for.
If you have a sample plain-text (or partial plain-text) then this gets a lot easier.

How do you test your app for Iñtërnâtiônàlizætiøn? (Internationalization?)

How do you test your app for Iñtërnâtiônàlizætiøn compliance? I tell people to store the Unicode string Iñtërnâtiônàlizætiøn into each field and then see if it is displayed correctly on output.
This includes output as a cell's content in Excel reports, in RTF format for docs, in XML files, etc.
What other tests should be done?
Added idea from @Paddy:
Also try a right-to-left language. Eg, שלום ירושלים ([The] Peace of Jerusalem). Should look like:
[Image of the Hebrew text rendered right-to-left] (source: kluger.com)
Note: Stackoverflow is implemented correctly. If text does not match the image, then you have a problem with your browser, os, or possibly a proxy.
Also note: you should not have to change or "set up" your already-running app to accept either the Western European characters or the Hebrew example. You should be able to just type those characters into your app and have them come back correctly in your output. In case you don't have a Hebrew keyboard lying around, copy and paste the examples from this question into your app.
Pick a culture where the text reads from right to left and set your system up for that - make sure that it reads properly (easier said than done...).
Use one of the three "pseudo-locales" available since Windows Vista:
The three pseudo-locales are for testing three kinds of locales:
Base: The qps-ploc locale is used for English-like pseudo-localizations. Its strings are longer versions of English strings, using non-Latin and accented characters instead of the normal script. Additionally, simple Latin strings should sort in reverse order with this locale.
Mirrored: qpa-mirr is used for right-to-left pseudo data, which is another area of interest for testing.
East Asian: qps-asia is intended to utilize the large CJK character repertoire, which is also useful for testing.
Windows will start formatting dates, times, numbers, and currencies in a made-up pseudo-locale that looks enough like English that you can work with it, but is obvious enough when you're not respecting the locale:
[Шěđлеśđαỳ !!!], 8 ōf [Μäŕςћ !!] ōf 2006
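If you're on .NET, switching the current thread to a pseudo-locale is short; this sketch assumes the pseudo-locales are actually installed (they ship with Windows Vista and later, and may not resolve on other platforms):

```csharp
using System;
using System.Globalization;
using System.Threading;

class PseudoLocaleDemo
{
    static void Main()
    {
        // "qps-ploc" is the English-like pseudo-locale; it is only available
        // when the underlying Windows version ships it (Vista and later).
        var culture = CultureInfo.GetCultureInfo("qps-ploc");
        Thread.CurrentThread.CurrentCulture = culture;
        Thread.CurrentThread.CurrentUICulture = culture;

        // Dates, numbers and currency now format in the pseudo-locale,
        // which makes hard-coded formatting assumptions stand out.
        Console.WriteLine(DateTime.Now.ToString("D"));
        Console.WriteLine(1234.56.ToString("C"));
    }
}
```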
There is more to internationalization than unicode handling. You also need to make sure that dates show up localized to the user's timezone, if you know it (and make sure there's a way for people to tell you what their time zone is).
One handy fact for testing timezone handling is that there are two timezones (Pacific/Tongatapu and Pacific/Midway) that are actually 24 hours apart. So if timezones are being handled properly, the dates should never be the same for users in those two timezones for any timestamp. If you use any other timezones in your tests, results may vary depending on the time of day you run your test suite.
You also need to make sure dates and times are formatted in a way that makes sense for the user's locale, or failing that, that any potential ambiguity in the rendering of dates is explained (e.g. "05/11/2009 (dd/mm/yyyy)").
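A small C# sketch of that timezone check; it assumes IANA zone IDs are resolvable by TimeZoneInfo (true on non-Windows systems and on .NET 6+), otherwise you'd substitute the corresponding Windows zone names:

```csharp
using System;

class TimeZoneSanityCheck
{
    static void Main()
    {
        // Pacific/Tongatapu (UTC+13) and Pacific/Midway (UTC-11) are 24 hours
        // apart, so the same instant always falls on different calendar dates.
        var utcNow = DateTime.UtcNow;
        var tonga  = TimeZoneInfo.FindSystemTimeZoneById("Pacific/Tongatapu");
        var midway = TimeZoneInfo.FindSystemTimeZoneById("Pacific/Midway");

        var dateInTonga  = TimeZoneInfo.ConvertTimeFromUtc(utcNow, tonga).Date;
        var dateInMidway = TimeZoneInfo.ConvertTimeFromUtc(utcNow, midway).Date;

        Console.WriteLine(dateInTonga != dateInMidway); // should always print True
    }
}
```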
"Iñtërnâtiônàlizætiøn" is a really bad string to test with since all the characters in it also appear in ISO-8859-1, so the string can work completely without any Unicode support at all! I've no idea why it's so commonly used when it utterly fails at its primary function!
Even Chinese or Hebrew text isn't a good choice (though right-to-left is a whole can of worms by itself) because it doesn't necessarily contain anything outside 3-byte UTF-8, which curiously was a very large hole in MySQL's default UTF-8 implementation (which is limited to 3-byte chars), until it was fixed by the addition of the utf8mb4 charset in MySQL 5.5. These days one of the more common uses of >3-byte UTF-8 is Emojis like these: [💝🐹🌇⛔]. If you don't see some pretty little coloured pictures between those brackets, congratulations, you just found a hole in your Unicode stack!
First, learn The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Make sure your application can handle Turkish. It has several quirks that break applications that assume English rules. Because there are four kinds of letter "i" (dotted and dot-less, upper and lower case), applications that assume uppercase(i) => I will break when using Turkish rules, where uppercase(i) => İ.
A common thing to do is check if the user typed the command "exit" by using lowercase(userInput) == "exit" or uppercase(userInput) == "EXIT". This works as expected under English rules but will fail under Turkish rules where "exıt" != "exit" and "EXİT" != "EXIT". To do this correctly, one must use case-insensitive comparison routines which are built into all modern languages.
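A quick C# demonstration of the Turkish-I problem and the culture-insensitive comparison that avoids it:

```csharp
using System;
using System.Globalization;

class TurkishIDemo
{
    static void Main()
    {
        var turkish = CultureInfo.GetCultureInfo("tr-TR");

        // Under Turkish casing rules, "i" uppercases to "İ", not "I".
        Console.WriteLine("exit".ToUpper(turkish));            // EXİT
        Console.WriteLine("exit".ToUpper(turkish) == "EXIT");  // False - the naive check breaks

        // A culture-insensitive comparison works regardless of the user's locale.
        Console.WriteLine(string.Equals("exit", "EXIT",
            StringComparison.OrdinalIgnoreCase));              // True
    }
}
```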
I was thinking about this question from a completely different angle. I can't recall exactly what we did, but on a previous project I think we wound up changing the Regional Settings (in the Regional and Language Options control panel?) to help us ensure the localized strings were working.

Testing common passwords

I have to write a program that will test the strength of our team's passwords after they have chosen them.
I need to write a program that will email them and tell them to choose a better password.
Are there any lists available, legal of course, that I can use to do this?
You ask for lists so I'm guessing you're fine with the programming but are seeking wordlists/dictionaries to use?
To begin, if you have access to a UNIX/Linux/MacOS box there is a list in /usr/dict/words or /usr/share/dict/words.
A list of common passwords is at http://www.openwall.com/passwords/wordlists/password.lst
Also, check here for a large collection of wordlists - http://www.net-comber.com/wordurls.html
However, a list alone isn't sufficient, you'll want to check for words being reversed, repeated letters/numbers, etc etc.
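A short C# sketch of such a check, including the reversed form; the file name and the comment-skipping are assumptions about a wordlist like the Openwall one linked above:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class WordlistCheck
{
    // Check a candidate password against a wordlist, including the reversed
    // form, as suggested above. The default file name is just an example.
    public static bool IsCommonWord(string password, string wordlistPath = "password.lst")
    {
        var words = new HashSet<string>(
            File.ReadLines(wordlistPath)
                .Where(l => !l.StartsWith("#"))          // skip comment lines, e.g. in the Openwall list
                .Select(l => l.Trim().ToLowerInvariant()));

        string lower = password.ToLowerInvariant();
        string reversed = new string(lower.Reverse().ToArray());

        return words.Contains(lower) || words.Contains(reversed);
    }
}
```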
There is a jQuery plugin that will show you password strength. The link also tells you the algorithm it uses (so you could implement it server-side if you want.)
A slightly different (or simpler?) approach may be to measure the password strength based on the diversity of characters used.
For example award one point if:
Password has at least one lower case letter
Password has at least one upper case letter
Password has at least one number
Password has at least one special symbol
Password is at least 6 characters long
Now you have password strength on the scale of 0 to 5....
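A direct C# translation of that scoring scheme:

```csharp
using System.Linq;

static class PasswordScore
{
    // Score a password from 0 to 5 using the criteria listed above.
    public static int Score(string password)
    {
        int score = 0;
        if (password.Any(char.IsLower))                  score++; // lower case letter
        if (password.Any(char.IsUpper))                  score++; // upper case letter
        if (password.Any(char.IsDigit))                  score++; // number
        if (password.Any(c => !char.IsLetterOrDigit(c))) score++; // special symbol
        if (password.Length >= 6)                        score++; // minimum length
        return score;
    }
}
```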
Would it not be better to devise a set of guidelines or requirements (must contain letters, numbers, and symbols, must be over 8 characters long, and must not be your username) or similar? This way you can test against those requirements and remove the ability for people to choose weak passwords such as dictionary words and short strings.