Testing common passwords - passwords

I have to write a program that tests the strength of our team's passwords after they have been chosen.
I need it to email anyone with a weak password and tell them to choose a better one.
Are there any lists available (legal, of course) that I can use to do this?

You ask for lists so I'm guessing you're fine with the programming but are seeking wordlists/dictionaries to use?
To begin, if you have access to a UNIX/Linux/MacOS box there is a list in /usr/dict/words or /usr/share/dict/words.
A list of common passwords is at http://www.openwall.com/passwords/wordlists/password.lst
Also, check here for a large collection of wordlists - http://www.net-comber.com/wordurls.html
However, a list alone isn't sufficient; you'll also want to check for reversed words, repeated letters or numbers, and so on.
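A minimal sketch of those extra checks, assuming the wordlist has already been loaded into a set of lowercase words. The function names and the tiny `words` set are hypothetical, and the variants tried (reversal, stripped leading or trailing digits, a single repeated character) are only a starting point:

```python
import re

def normalise_candidates(password: str) -> set[str]:
    """Return simple variants of the password to look up in a wordlist."""
    lowered = password.lower()
    # Strip leading/trailing digits, e.g. "password123" -> "password".
    stripped = re.sub(r"^\d+|\d+$", "", lowered)
    return {lowered, lowered[::-1], stripped, stripped[::-1]}

def is_weak(password: str, wordlist: set[str]) -> bool:
    """True if the password is a listed word (possibly reversed or
    digit-padded) or consists of one repeated character."""
    if len(set(password)) <= 1:  # "aaaaaa", "111111", or empty
        return True
    return any(c in wordlist for c in normalise_candidates(password))

# Hypothetical in-memory list; in practice read one word per line from a file.
words = {"password", "letmein", "dragon"}
print(is_weak("drowssap", words))
print(is_weak("Dragon99", words))
print(is_weak("c0rrect-h0rse", words))
```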

There is a jQuery plugin that will show you password strength. The link also tells you the algorithm it uses (so you could implement it server-side if you want.)

A slightly different (or simpler?) approach may be to measure the password strength based on the diversity of characters used.
For example award one point if:
Password has at least one lower case letter
Password has at least one upper case letter
Password has at least one number
Password has at least one special symbol
Password is at least 6 characters long
Now you have password strength on the scale of 0 to 5....
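The five rules above translate almost directly into code. This is a sketch only; `strength_score` is an illustrative name, and "special symbol" is assumed here to mean ASCII punctuation:

```python
import string

def strength_score(password: str) -> int:
    """Award one point per satisfied rule, giving a score from 0 to 5."""
    checks = [
        any(c.islower() for c in password),             # lower case letter
        any(c.isupper() for c in password),             # upper case letter
        any(c.isdigit() for c in password),             # number
        any(c in string.punctuation for c in password), # special symbol
        len(password) >= 6,                             # minimum length
    ]
    return sum(checks)

print(strength_score("abcdef"))
print(strength_score("Abc123!"))
```

A score like this measures character diversity only; it says nothing about whether the password appears in a wordlist.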

Would it not be better to devise a set of guidelines or requirements (must contain letters, numbers and symbols, must be over 8 characters long, and must not be your username) or similar? This way you can test against those requirements and remove the ability for people to choose weak passwords such as dictionary words and short strings.

Related

Primary key requirements

Is it a good idea to store phone number as a primary key on RDBMS? They are unique to nearly all of us. But my friend suggests it is not a good idea because of the following reasons.
What if two people in a family share a phone number?
What if a person does not have a phone number?
What are your insights? Please let me know!
I'd be against this idea, for several reasons:
It is personally identifiable information and I'd recommend using it with caution if you're bound to GDPR. Some users might ask you to not use their phone numbers. It might later be required to hash or mask part of the phone number, or even completely get rid of it.
The value depends on user input, even if it is validated. There are several services that lend you a phone number for validation if you're not in the target country of the validator.
A format needs to be defined for the phone number: whether it will contain a country code, parentheses or spaces.
There should be a validation to prevent duplicates and null values.
In summary it is not a good idea to use a field which has a dependency to external facts. As others mentioned, using an autogenerated identifier for the ID and non-unique index for the phone number seems like a better approach.
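As a rough illustration of that layout, here is a sketch using SQLite through Python's `sqlite3` module. The table and column names are made up for the example; the point is only that the key is autogenerated while the phone column is nullable, may repeat, and carries an ordinary (non-unique) index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        name  TEXT NOT NULL,
        phone TEXT                                 -- nullable, may repeat
    );
    CREATE INDEX idx_person_phone ON person (phone);  -- non-unique index
""")

# Two family members share a number; a third person has none.
rows = [("Ann", "+15550100"), ("Bob", "+15550100"), ("Cy", None)]
conn.executemany("INSERT INTO person (name, phone) VALUES (?, ?)", rows)

shared = conn.execute(
    "SELECT id, name FROM person WHERE phone = ? ORDER BY id", ("+15550100",)
).fetchall()
without_phone = conn.execute(
    "SELECT COUNT(*) FROM person WHERE phone IS NULL"
).fetchone()[0]
print(shared, without_phone)
```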
A phone number certainly can make sense as a key but all depends on what you need to identify and how you intend to use it. There is no general right or wrong answer.
Three very good criteria (but not absolute rules) for choosing and designing keys are: Simplicity, Stability, Familiarity. Phone numbers are simple and familiar enough for many purposes. Whether they are stable enough is probably highly dependent on circumstances. For example you might require all your employees to supply a unique phone number for third-factor authentication but probably it's quite acceptable to change that number occasionally.
What is the purpose of having the phone number as the primary key? Is it to identify an individual? If so, one individual can have multiple phone numbers (mobile/home phone), so it is not advisable to use the phone number as a primary key.
Your question is also a fair one: what if a person does not have a phone number?

Password strength check: comparing to previous passwords

Every now and then I come across applications that force you to change passwords once in a while. Almost universally, they have this strange requirement for the new password: it has to be "significantly" different from your previous password(s).
While at first this sounds logical, next thing I think is: how do they do that? Do they store my passwords in plain text? I would have accepted the answer that they do, if it wasn't for the fact that these are kinds of applications that pretend to care about security so much they force you to change your password if it is expired! Microsoft Exchange is one example of this.
I'm not very good at cryptography and hash functions, so my question is this: Is it possible to enforce this kind of policy without storing passwords in plain text?
Do you know how this policy is implemented in real world applications?
UPDATE: An Example.
I was recently changing my Microsoft Exchange password. I only use Web Access, so it might be different a little -- I have no idea.
So, it forces me to change my password. What I do sometimes is change it to something new and then change it back almost immediately. The freaky part is that it did not allow me to change it back. I tried changing it a little, by adding a letter in front of it or changing one symbol -- no luck, it was still complaining.
With a typical hash, the best you can do is see if the new password is exactly equal to previous ones. You can break the password into multiple hashes in order to get more flexible with comparison, for example 3 hashes:
Alpha characters only
Numeric characters only
All other characters
You could for example require all the hashes to change to be accepted, to prevent users from just changing their password from SecretPassword01 to SecretPassword02.
A cryptographic expert may weigh in here on if this could be made as secure as a single hash.
NOTE that this is not as secure as a single hash, so before you go implementing this, make sure you have really done your research.
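To make the three-hash idea concrete, here is a sketch with hypothetical function names. It uses unsalted SHA-256 purely for brevity, which is itself an assumption and not something to copy into production; as the note above says, splitting the password this way is weaker than a single hash:

```python
import hashlib

def class_hashes(password: str) -> tuple[str, ...]:
    """Hash the letters, the digits, and everything else separately."""
    letters = "".join(c for c in password if c.isalpha())
    digits = "".join(c for c in password if c.isdigit())
    others = "".join(c for c in password if not (c.isalpha() or c.isdigit()))
    return tuple(hashlib.sha256(part.encode("utf-8")).hexdigest()
                 for part in (letters, digits, others))

def all_parts_changed(old_hashes: tuple[str, ...], new_password: str) -> bool:
    """Accept the new password only if every one of the three hashes differs."""
    return all(o != n for o, n in zip(old_hashes, class_hashes(new_password)))

old = class_hashes("SecretPassword01")
print(all_parts_changed(old, "SecretPassword02"))  # letters unchanged
print(all_parts_changed(old, "NewPhrase77!"))
```

One side effect of requiring all three hashes to change: two passwords that both contain no symbols share the same third hash, so such a change would always be rejected.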
When changing password you're usually asked for the old one to confirm your identity. It's then trivial to compare the old one and the new one to see how much they differ. TBH I don't know how to compare to several previous passwords without storing them, but that's getting into the territory of ridiculous policies anyway.
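A sketch of that comparison at change time, when both the old and the new password are briefly available in memory. The use of `difflib.SequenceMatcher` and the 0.7 threshold are assumptions for illustration, not how any particular product does it:

```python
from difflib import SequenceMatcher

def too_similar(old: str, new: str, threshold: float = 0.7) -> bool:
    """Compare the two plaintext passwords supplied on the change form.
    Nothing is stored; only the similarity ratio is inspected."""
    ratio = SequenceMatcher(None, old.lower(), new.lower()).ratio()
    return ratio >= threshold

print(too_similar("SecretPassword01", "SecretPassword02"))
print(too_similar("SecretPassword01", "xSecretPassword01"))
print(too_similar("SecretPassword01", "purple-giraffe-42"))
```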

Verifying equivalence of a secret

Alice & Bob are both secret quadruple agents who could be working for the US, Russia or China. They want to come up with a scheme that would:
If they are both working for the same side, prove this to each other so they can talk freely.
If they are working for different sides, not expose any additional information about which side they are on.
Oh, and because of the sensitive nature of what they do, there is no trusted third party who can do the comparison for both of them.
What protocol would be able to satisfy both of these needs?
Ideally, any protocol would also be able to generalize to multiple participants and multiples states but that's not essential.
I've puzzled over it for a while and I can't find a satisfactory solution, mainly owing to condition 2.
edit: Here's the original problem that motivated me to look for a solution. "Charlie" had some personal photos that he shared with me and I later discovered that he had also shared them with "Bob". We both wanted to know if we had the same set of photos but, at the same time, if Charlie hadn't shared a certain photo with either of us, he probably had a good reason not to and we didn't want to leak information.
My first thought would be for each of us to concatenate all the photos and provide the MD5 sum. If they matched, then we had the same photos but if they didn't, neither party would know which photos the other had. However, I realized soon after that this scheme would still leak information because Bob could generate an MD5 for each subset of photos he had and if any of them matched my sum, he would know which photos I didn't have. I've yet to find a satisfactory solution to this particular problem but I thought I would generalize it to avoid people focusing on the particulars of my situation.
For both problems, you could use a secure two-party computation equality algorithm. There are many schemes, for example this one by Damgard, Fitzi, Kiltz, Nielsen and Toft: Unconditionally Secure Constant Round Multi-Party Computation for Equality, Comparison, Bits and Exponentiation.
Of course an agent could try to pose as an agent from another side to get a 1/3 chance to discover the true side of another agent, but that seems unavoidable.
A much simpler scheme for the photo-problem, which should be almost as good as the secure multiparty computation, is the following:
Alice and Bob each sort their pictures and generate a SHA-512 hash.
Alice sends the first bit of her hash to Bob.
Bob compares the bit to the first bit of his hash. If it is different, they know that they have received different photos. Otherwise they continue.
Bob sends the second bit of his hash to Alice.
Alice checks this bit and decides whether to continue.
Continue until the protocol aborts or all bits have been checked.
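The steps above can be simulated locally as follows. This is a sketch, not a networked implementation: both bit strings are passed to one function, and hashing each photo before feeding it to the outer SHA-512 is an added assumption to avoid ambiguity about where one photo ends and the next begins:

```python
import hashlib

def digest_bits(photos: list[bytes]) -> str:
    """SHA-512 over the sorted photos, returned as a 512-character bit string."""
    h = hashlib.sha512()
    for photo in sorted(photos):
        h.update(hashlib.sha512(photo).digest())  # fixed-size framing per photo
    return format(int.from_bytes(h.digest(), "big"), "0512b")

def compare_bit_by_bit(alice_bits: str, bob_bits: str) -> tuple[bool, int]:
    """Reveal one bit per round and stop at the first mismatch.
    Returns (match, number_of_bits_revealed)."""
    for i, (a, b) in enumerate(zip(alice_bits, bob_bits)):
        # Even rounds: Alice sends bit i and Bob checks it; odd rounds: reversed.
        if a != b:
            return False, i + 1
    return True, len(alice_bits)

alice = digest_bits([b"photo-1", b"photo-2", b"photo-3"])
bob_same = digest_bits([b"photo-3", b"photo-1", b"photo-2"])
bob_diff = digest_bits([b"photo-1", b"photo-2"])
print(compare_bit_by_bit(alice, bob_same))
print(compare_bit_by_bit(alice, bob_diff)[0])
```

With different photo sets the exchange typically aborts after only a few bits, which is what limits how much of the hash either side learns.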
So they are guaranteed to be quadruple agents? That is they are guaranteed to be secretly working for one faction while pretending to work for a second while pretending to work for a third while pretending to work for a fourth? They are limited to just the US, Russia or China? If so then that means that there will always be at least one faction they are both pretending to work for and are simultaneously actually working for. That seems to negate their ability to be quadruple agents, because surely one of them can't be working for the Americans while secretly working for the Americans, while secretly working for the Americans, while secretly working for the Americans.
You say that the ideal solution would generalize to arbitrary numbers of states and spy-stacks. Can the degree of secret agent-ness be either higher, equal or lower than the number of states? This might be important. Also, is Alice always guaranteed to have the same degree of agent-ness as Bob? i.e. They will ALWAYS both be triple agents, or ALWAYS both be quintuple agents? The modulo operator springs to mind...
More details please.
As a potential answer, you can enumerate the states into a bitfield. US=1 Russia=2, China=4, Madagascar=8, Tuva=16 etc. Construct a device that is essentially an AND gate. Alice builds and brings one half and Bob builds and brings the other. Separated by a cloth, they each press the button of the state they're really working for. If the output of the AND gate is high, then they're on the same side. If not, then they quietly take down the cloth, and depart with the respective halves of their machine so that the button can't be determined by fingerprint.
This is not theoretical or rigorous, but practical.
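In software, the bitfield and AND gate might be modelled as below. `Side` and `same_side` are illustrative names. Note that a function like this sees both inputs, so it models only the gate itself, not the cloth that keeps each press hidden:

```python
from enum import IntFlag

class Side(IntFlag):
    US = 1
    RUSSIA = 2
    CHINA = 4

def same_side(alice_button: Side, bob_button: Side) -> bool:
    """The AND of the two one-hot values is non-zero only if they match."""
    return bool(alice_button & bob_button)

print(same_side(Side.RUSSIA, Side.RUSSIA))
print(same_side(Side.US, Side.CHINA))
```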
For your photos problem, create hashes for all subsets of your photos; randomly select a subset of these, and shuffle in an agreed quantity of randomly generated hash values. Bob does the same, and you exchange these sets. If the proportion of hashes in what Bob has sent you that matches ones you can generate by hashing subsets of your photos significantly differs from what you expect, it is likely you have a significantly different corpus of photos from him. If the proportion of random hashes you agree on is high, you risk being unable to detect small differences in your collections of photos; if the proportion is low, you risk exposing information about missing photos; you will have to select a suitable point for the tradeoff.
Interesting.
I think, no matter what the scheme, it'll need to involve a component of random failure. This is because of the conflicting requirements. You would need a scheme that, occasionally, even when they are on the same side, doesn't work. Because if it always worked, they would immediately be able to determine they aren't on the same side.
Your point 'B' is also vague. You say you don't want to expose what side they are on. Does that mean that the info can't point to specifically one of the sides? Is it okay if Alice thinks Bob is from either one of the others?
Also, have you tried emailing this to the cryptography mailing list? May get a better response there. It's an interesting one to think about :)
Here's the closest I've come to a solution:
Assume there is a function doubleHash such that
doubleHash(a+doubleHash(b)) == doubleHash(b+doubleHash(a))
Alice generates a 62 bit secret and appends the 2 bit country code to the end of it, hashes it and gives Bob doubleHash(a).
Bob does the same thing and gives Alice doubleHash(b).
Alice appends the original secret to the hash that Bob gave her, hashes it and publishes it as doubleHash(a+doubleHash(b)).
Bob does the same thing and publishes doubleHash(b+doubleHash(a)).
If both the hashes match, then they are from the same country. On the other hand, if they don't match, then Bob can't decipher the hash because he doesn't know Alice's secret and vice versa.
However, such a scheme relies on the existence of a doubleHash function and I'm not sure if such a thing is possible.
The most simple thing I can think of with the photos that would possibly work is as thus:
Hash all the photos with a 4096 bit hash.
Sort the photos by hash value. (Hashes are, after all, just a string representation of a large number.)
Using that sort order, use a streaming system to pipe and hash those photos as if they were a single file.
Share your hashes.
If the hashes match, you have the same files. (There is a very low risk of a false positive match, but with a 4K hash it's quite unlikely.)
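A sketch of these steps in Python. No common hash function has a fixed 4096-bit output, so this example assumes SHAKE-256 (an extendable-output function available in `hashlib`) asked for 512 bytes; the function names are hypothetical:

```python
import hashlib

DIGEST_BYTES = 512  # 4096 bits, requested from SHAKE-256

def photo_hash(data: bytes) -> bytes:
    """Per-photo hash, used only to put the photos in a canonical order."""
    return hashlib.shake_256(data).digest(DIGEST_BYTES)

def collection_hash(photos: list[bytes]) -> str:
    """Sort photos by their hash, then hash them as one continuous stream."""
    ordered = sorted(photos, key=photo_hash)
    stream = hashlib.shake_256()
    for photo in ordered:
        stream.update(photo)  # feed the photos as if they were one file
    return stream.hexdigest(DIGEST_BYTES)

mine = collection_hash([b"img-a", b"img-b", b"img-c"])
yours = collection_hash([b"img-c", b"img-a", b"img-b"])
print(mine == yours)
```

Because the order is derived from the content, both parties arrive at the same value regardless of how their files happen to be named or listed.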
There are of course, a few weaknesses here:
Don't share how many photos you have. Doing so could permit the party with the greater number of photos to intelligently permute the data, removing from the hash set the photos they suspect you don't have, using the number as a guide, and so find (at great computational expense, mind) a set of images that matches your hash.
They can do 1 without the number, but it's harder, and they're out of luck if they actually have fewer photos.
They could create a fake hash, simply with a random number generator, and send it to you, giving you the impression you had different datasets when you really had the same.
The above weaknesses are also present in your country code identification system, except, of course, that you have far less entropy to get in the way, and it's far easier to defraud the system (and thus far, far easier to work out who they are by sheer brute force, or to have yourself worked out by brute force, regardless of how fancy your hash algorithm is).
If this were not the case, you would have already been found out by the very agencies you work for, because something that reliable and secure would be a sure-fire way to do a secure background check.
The Photo Scenario is Impossible to Achieve:
Your scheme is impossible for the reasons that you name.
Consider a function f, which takes two sets of photos, s1 and s2.
f(s1, s2) returns true if s1=s2 and false if s1!=s2.
That is, this function implements the scheme you want.
Bob can always supply a subset of the photos he has, and learn which photos Charlie doesn't have.
There is no way around this: any function which has the property you want cannot have the security you want.
The Spy Scenario is Even More Impossible:
As Kent Fredric pointed out the spy scenario has even greater inherent weaknesses.
It has all problems of the photo scenario, plus the additional weakness of having only four secrets.
In the photo scenario it would be highly unlikely that Bob would randomly guess one of Charlie's photographs.
It is trivial in the spy scenario for Bob to guess Alice's choice (1/4).
The spies only have four countries they can belong to; as they are both quadruple agents, they both know all the secret code words for each country.
Thus, Bob could pretend to be working for the Chinese to test Alice.
A Different Type of Solution:
Some posters have noted, the security can be increased if you weaken the accuracy of f.
Of course if it is not accurate what is the point. I propose a different type of solution.
Do not let them compare the same photographs more than one time.
The party which wishes to initiate the comparison must first show that this is a new comparison and does not use any of the pictures from before.
EDIT: Problems with Double Hash
I am making some assumptions about the doubleHash protocol, but...
For the photograph scheme, the doubleHash protocol is no better than f, because the 62 bit secret must be constructed from a set of photographs for the comparison to be meaningful. The subset attack mentioned in the original question still applies here. Try all subsets of photographs to brute force the secrets you can generate, thus Bob can see if he can generate the same secret as Alice.
Using the doubleHash property Bob can still brute force the secret.
doubleHash(s1+doubleHash(b)) != doubleHash(aliceSecret+doubleHash(a))
doubleHash(s2+doubleHash(b)) != doubleHash(aliceSecret+doubleHash(a))
doubleHash(s3+doubleHash(b)) == doubleHash(aliceSecret+doubleHash(a))
Bingo, aliceSecret == s3.
DoubleHash is only as strong as it is hard to bruteforce either a or b
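The subset attack can be demonstrated with a plain digest standing in for whatever value the parties exchange. This is a sketch with hypothetical names, and SHA-256 is used only as a placeholder; the cost grows as 2^n in the number of photos Bob holds, which is the only thing protecting Alice:

```python
import hashlib
from itertools import combinations

def set_digest(photos) -> str:
    """Order-independent digest of a collection of photos."""
    h = hashlib.sha256()
    for p in sorted(photos):
        h.update(hashlib.sha256(p).digest())
    return h.hexdigest()

def find_matching_subset(my_photos, target_digest):
    """Try every subset of my photos, largest first, until one matches."""
    for size in range(len(my_photos), -1, -1):
        for subset in combinations(my_photos, size):
            if set_digest(subset) == target_digest:
                return set(subset)
    return None

bob = [b"p1", b"p2", b"p3", b"p4"]
alice = [b"p1", b"p3"]
print(find_matching_subset(bob, set_digest(alice)))
```

If Alice holds even one photo that Bob lacks, the search finds nothing, which matches the earlier observation that the party with fewer photos is out of luck.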
Implementing DoubleHash
Instead of doubleHash(a + doubleHash(b)), try doubleHash(a, md5(b)).
DoubleHash(a + doubleHash(b)) is bad because Bob could generate colliding hashes like so:
doubleHash((12 + doubleHash(34)) + doubleHash(5678))
= doubleHash((34 + doubleHash(12)) + doubleHash(5678))
= doubleHash(5678 + doubleHash(12 + doubleHash(34)))
= doubleHash(5678 + doubleHash(34 + doubleHash(12)))
Here is an implementation of doubleHash using the new formulation,
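import hashlib

def doubleHash(a: bytes, hashOfB: bytes) -> bytes:
    # a is this party's secret; hashOfB is the MD5 digest received from the other party.
    hashOfA = hashlib.md5(a).digest()
    # XOR is commutative, so doubleHash(a, md5(b)) == doubleHash(b, md5(a)).
    combinedHash = bytes(x ^ y for x, y in zip(hashOfA, hashOfB))
    return hashlib.md5(combinedHash).digest()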
One could also use the math behind blind signatures to implement a version of doubleHash.
Wouldn't RSA work here? Each nation knows its private key, you publish your public key, and only nations that are the same can decrypt the info. I guess the second person would know that the first isn't on the same side as they are, however.
Hmm.
How about Public Key Cryptography?

What things should we consider while writing a Spell Checker?

I want to write a very simple Spell Checker. The spell checker will try to match the input word with equivalent words from the dictionary.
What can be done to find those 'equivalent words'? What analysis can be performed on two words to mark them as equivalent?
Before investing too much in trying to unravel that, I'd first look at existing implementations like Aspell or netspell, for two main reasons:
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
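For the small-dictionary case in the list above, the precomputed table could look something like this sketch. It assumes lowercase ASCII words and considers only misspellings one edit away (a deletion, transposition, replacement or insertion); `edits1` and `build_lookup` are illustrative names:

```python
import string

def edits1(word: str) -> set[str]:
    """All strings exactly one simple edit away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def build_lookup(dictionary: list[str]) -> dict[str, set[str]]:
    """Map every nearby misspelling (and the word itself) to its dictionary word(s)."""
    table: dict[str, set[str]] = {}
    for word in dictionary:
        for variant in edits1(word) | {word}:
            table.setdefault(variant, set()).add(word)
    return table

table = build_lookup(["cat", "dog"])
print(table["cta"], table["dg"])
```

Lookup is then a single dictionary access per input word, at the cost of a table that grows quickly with word length and dictionary size.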
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words
Phonetic hashing of the "misspelled" word to lookup a hash of identical dictionary hashed real words (for java check out Apache Commons Codec for a suitable library). The phonetic hash of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the English language
Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also override any result with the known common misspellings (i.e. these always float to the top suggested result).
There may be better ways, but this worked pretty well.
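A sketch of that weighting, assuming a plain Levenshtein distance and a hypothetical `probabilities` mapping from dictionary word to its frequency as a percentage. The phonetic pre-filtering step is left out, so this version scans the whole dictionary:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, probabilities, known_misspellings, limit=3):
    """Rank candidates by edit-distance * 100 / probability (lower is better),
    then move any known correction to the front."""
    ranked = sorted(
        probabilities,
        key=lambda cand: edit_distance(word, cand) * 100 / probabilities[cand],
    )
    fix = known_misspellings.get(word)
    if fix is not None:
        ranked = [fix] + [c for c in ranked if c != fix]
    return ranked[:limit]

probs = {"receive": 0.5, "relieve": 0.5, "recipe": 0.2}  # made-up percentages
print(suggest("recieve", probs, {}))
print(suggest("recieve", probs, {"recieve": "receive"}))
```

With these made-up numbers the pure weighting prefers "relieve" (one substitution away), and the known-misspellings override is what brings "receive" to the top.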
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under Linux/Unix you have ispell. Why reinvent the wheel?

What are the best rules to follow for what characters to allow in a password?

Without thinking about it at all I just want to say I should allow every character. It gets hashed in any case, and I don't want to limit people who want to create strong passwords.
However, thinking about it more, there are plenty of characters that I have no idea what effect they'd have on things. Foreign characters, ascii symbols, etc. to name a couple.
I tried to Google but I can't find any definitive standard for what people do. Even most professional organizations don't seem to know. It seems to be a common practice for many sites to disallow special characters altogether, which is just silly and not what I want to do.
Anyway, are there any standard recommendations for length, allowed characters, and so forth?
I'm not sure if it matters, but I'll be using ASP.NET w/ C#
Any printable, non-whitespace ASCII character (between 33 and 126 inclusive) is typically allowed in passwords. Many security professionals (and SO commenters) are advising the use of a passphrase in place of a password, so you'd have to allow spaces. The argument is that due to their length, and since phrases aren't in a dictionary, passphrases are more difficult to crack than passwords. (A passphrase can also be easier to remember, so a legitimate user doesn't have to keep it written down on a sticky-note right on their monitor.)
Some strong password generators use a hash, so I'd put a very high limit on the length (512 or 1024) just to be inclusive. Password generators today often yield strings of 32-128 characters, but who knows what hashes will be used in the next few years.
Non-ASCII characters certainly make things harder when it comes to entering the password on limited devices (mobiles, consoles etc) - but usually not impossible. Arguably if the user wants to do that, you should let them. It's easy enough to do a reasonable and consistent thing - encode in UTF-8 before hashing, for example. You'd only get into difficulties if some input device sent the characters as a composition (e.g. e + acute accent instead of "e acute") - but I suspect that wouldn't happen in real life. (You could decompose everything yourself, but that would be a lot of trouble to go to for an edge case.)
I'd restrict it to printable characters, however. Putting tabs, form feeds etc in a password really is asking for trouble.
Not an expert, but I hate it when characters I choose, which are not that bizarre, are rejected. So I think I agree with your gut.
Short answer: allow as much as the system backing it can support. Nowadays there's really no excuse not to use full unicode support for text entry, and that includes passwords. I don't think you need to worry about problems with characters as long as they're handled literally (but I'm not a pro in this field--beware of sql injection).
I have a pet peeve against sites that impose restrictions on passwords... any kind of restriction. I like sites that will tell you how strong your password is and recommend you make it stronger, but forcing a user to type at least 8 characters, or to require both letters and numbers, etc. is just plain frustrating.
If you need to have a maximum field size (for example for storing in a database) try to make it large enough for anything that people would type out by hand. There's really no such thing as a too-large password field since there's always the potential to use an automated, generated strong password, but 64 to 128 characters would certainly suffice.
Fundamentally, most of the Unicode classes of characters should be allowed. Do, however, skip control characters (e.g. 0-31 besides space) and the byte order mark (0xfffe and 0xfeff). Further, you want to first canonicalize the representation to get rid of problems caused by differing representations. You might issue warnings, though, for characters that seem too hard to enter, but users will guard against that themselves.
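A sketch of that preprocessing with Python's `unicodedata` module. The choice of NFC as the canonical form and the name `prepare_password` are assumptions; the important part is that the same normalization is applied both when the password is set and when it is checked:

```python
import unicodedata

FORBIDDEN = {"\ufeff", "\ufffe"}  # byte order mark and its byte-swapped form

def prepare_password(raw: str) -> bytes:
    """Canonicalize, reject control characters, and return bytes for hashing."""
    normalised = unicodedata.normalize("NFC", raw)
    for ch in normalised:
        # "Cc" is the Unicode general category for control characters.
        if unicodedata.category(ch) == "Cc" or ch in FORBIDDEN:
            raise ValueError(f"character U+{ord(ch):04X} is not allowed")
    return normalised.encode("utf-8")

# "e" + combining acute accent and precomposed "é" yield the same bytes.
print(prepare_password("e\u0301") == prepare_password("\u00e9"))
```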
Remember: when you are storing passwords, all passwords should be hashed with a one-way algorithm like md5 or sha1. Since these algorithms always yield hexadecimal numbers, you don't need to worry about SQL injections or anything like that.
So, as long as you can md5 or sha1 a character, it should be accepted.
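The same "hash anything" reasoning carries over to stronger schemes. Fast, unsalted digests such as MD5 or SHA-1 are generally no longer considered adequate for password storage, so the sketch below assumes PBKDF2 from Python's standard library instead; the iteration count and function names are illustrative and should be tuned to your own hardware and policy:

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor, not a recommendation for every system

def hash_password(password: str, iterations: int = ITERATIONS) -> tuple[bytes, bytes]:
    """Return (salt, digest); any character that encodes to UTF-8 is accepted."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = ITERATIONS) -> bool:
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(digest, expected)  # constant-time comparison

salt, stored = hash_password("pässwörd with spaces ✓", iterations=1000)
print(verify_password("pässwörd with spaces ✓", salt, stored, iterations=1000))
```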
If you are talking about preventing SQL-injection type of attacks, it is probably a better idea to make sure your code does what it is supposed to do, rather than relying on restricting the input so the problem becomes easier.
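For example, a parameterized query keeps hostile input as data no matter which characters it contains. This sketch uses SQLite via Python's `sqlite3` module with a made-up `users` table; most database drivers, including those for .NET, offer an equivalent mechanism:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, password_hash TEXT)")

def add_user(name: str, password_hash: str) -> None:
    # Placeholders (?) pass the values separately from the SQL text.
    conn.execute(
        "INSERT INTO users (name, password_hash) VALUES (?, ?)",
        (name, password_hash),
    )

hostile = "x'); DROP TABLE users; --"
add_user(hostile, "abc123")
stored_name = conn.execute("SELECT name FROM users").fetchone()[0]
print(stored_name == hostile)
```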
For non-ascii characters, I don't see that as a more difficult problem if your input can be correctly represented as a binary string (and not as text), which is then passed to your hash function or key generator, etc.
Add another vote for "let the user include any and all characters that their interface allows them to enter". I wouldn't even disallow tab or control characters. Your software has the capability to accept arbitrary byte strings and hash them, so accept arbitrary byte strings as passwords. To do otherwise reduces the space which an attacker must search in a brute-force or dictionary attack.
(Of course, even if you do allow everything, 99% of users will still use their pet's name as their password...)
Eventually you may have to print out the clear password in a confirmation email sent to your users.
PS: Might consider also encoding problems in the email, if it's not standard ascii (eg. Japanese characters), it's possible that a user will not receive the email in the proper format or simply can't read it on another system due to fonts not being installed.
All this weighs in the "printable" ascii characters range.