How do I implement these strong password requirements?

No keyboard patterns, i.e. runs of keys that are adjacent vertically or horizontally on a keyboard. For example, 'ZXCVBN123' should be rejected.
No commonly used words, and no words written backwards or disguised with special characters. For example, 'Universe1' and 'Un1ver$e' should be rejected.

Well, first you need to define exactly what you want. What are keyboard patterns? Is 'jk' a keyboard pattern, or just 'jkl'? What's the shortest pattern there is? Is 'gy' a pattern? First you need to define what a pattern really is.
Then you should make a list of all the available patterns (there aren't all that many: you have 36 starting points and 4 directions to go from each). When you get a password, try to locate each of the patterns in it. Note that if you decide the shortest pattern is 3 letters long, you don't need to search for 4-letter patterns, since every 4-letter pattern already contains a 3-letter pattern.
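As an illustration, here is a minimal Python sketch that enumerates 3-key horizontal and vertical runs on a QWERTY layout and rejects passwords containing any of them. The row strings, the minimum run length of 3, and the decision to ignore the keyboard's stagger are all assumptions you would adjust to your own definition of a pattern.

# Enumerate 3-key runs that are adjacent horizontally or vertically
# on a QWERTY layout, in both directions, then reject any password
# containing one. Stagger between rows is ignored for simplicity.
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]

def keyboard_patterns(min_len=3):
    patterns = set()
    for row in ROWS:                                # horizontal runs
        for i in range(len(row) - min_len + 1):
            run = row[i:i + min_len]
            patterns.update((run, run[::-1]))
    for col in range(len(ROWS[0])):                 # vertical runs
        column = "".join(row[col] for row in ROWS if col < len(row))
        for i in range(len(column) - min_len + 1):
            run = column[i:i + min_len]
            patterns.update((run, run[::-1]))
    return patterns

PATTERNS = keyboard_patterns()

def has_keyboard_pattern(password):
    lowered = password.lower()
    return any(p in lowered for p in PATTERNS)

print(has_keyboard_pattern("ZXCVBN123"))  # True: 'zxc' and '123' are runs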
As for words, that's easier, but first you need to make a list of all disallowed transformations ($->s, 1->i, etc.). Once you get a word, apply all the transformations to get a 'normalized' word. Compare the normalized password against a dictionary of all legal words twice - the second time, reverse the password.
You will probably need to do something a little more complicated than that, because digits sometimes need to be treated as substitutions and sometimes simply ignored at the end of the word: '1ncredible' can be a substitute for 'incredible' (the 1 stands for an i), even though 'ncredible' with the digit stripped is not a word.
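A rough Python sketch of the normalize-and-compare idea, including the trailing-digits wrinkle; the transformation table and the tiny word list are placeholders for a real dictionary.

# Normalize a password by undoing common character substitutions,
# optionally stripping trailing digits, and compare the result (and
# its reverse) against a dictionary of disallowed words.
TRANSFORMS = str.maketrans({"$": "s", "1": "i", "0": "o", "3": "e", "@": "a"})
DICTIONARY = {"universe", "incredible", "password"}  # placeholder list

def is_disguised_word(password):
    lowered = password.lower()
    candidates = {lowered, lowered.rstrip("0123456789")}
    candidates |= {c.translate(TRANSFORMS) for c in candidates}
    return any(c in DICTIONARY or c[::-1] in DICTIONARY for c in candidates)

print(is_disguised_word("Un1ver$e"))    # True: normalizes to 'universe'
print(is_disguised_word("Universe1"))   # True: trailing digit stripped
print(is_disguised_word("1ncredible"))  # True: '1' -> 'i', not stripped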

If you inspect the code of http://howsecureismypassword.net you can see that the password is compared against a large array of common passwords.
On the page there is a reference to http://xato.net/passwords/more-top-worst-passwords/ which lists the top 10,000 most common passwords.
One approach would be to download that list and check the user's password against it, or at least against the top 100 entries.
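For example, a minimal Python sketch; the file name is hypothetical, use whatever list you downloaded:

# Load one password per line and reject exact (case-insensitive) matches.
def load_common_passwords(path="10k-most-common.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_common(password, common_passwords):
    return password.lower() in common_passwords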


Karp-Rabin algorithm

While doing the course 6.006 Introduction to Algorithms, provided by MIT OCW, I came across the Rabin-Karp algorithm. [Image: Rabin-Karp pseudocode from the 6.006 course materials.]
Can anyone help me understand why the first rs() == rt() check is required? If it's used, then shouldn't we also first check by brute force whether the strings are equal and then move ahead? Why are we not considering equality of strings when hashing is done from t[0], before trying to find other string matches?
In the image, rs() is the hash value, and rs.skip[arg] removes the first character of that string, assuming it is 'arg'.
Can anyone help me understand why the first rs() == rt() check is required?
I assume you mean the one right before the range-loop. If the strings have the same length, then the range-loop will not run (empty range). The check is necessary to cover that case.
If it's used, then shouldn't we also first check by brute force whether the strings are equal and then move ahead?
Not sure what you mean here. The posted code leaves a blank (the ...) after matching hashes are found. Let's not forget that at that point we must compare the actual strings to confirm we really found a match, since hashes can collide. And it's up to the (not shown) implementation to continue searching to the end or not.
Why are we not considering equality of strings when hashing is done from t[0], before trying to find other string matches?
I really don't get this part. Note that the first two loops populate the rolling hashes for the two input strings. Then comes a check for a match at that point, then the loop that updates the rolling hashes pairwise and compares them. The entire t is checked, from start to end.
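For reference, here is a minimal Python sketch of that structure: hash the pattern and the first window, check once before the loop (the rs() == rt() check the question asks about), then roll the hash across the rest of t. The plain polynomial hash and modulus are illustrative choices, not the course's implementation.

def rabin_karp(s, t, base=256, mod=1_000_003):
    n, m = len(s), len(t)
    if n > m:
        return []
    hs = ht = 0
    for i in range(n):                       # populate both rolling hashes
        hs = (hs * base + ord(s[i])) % mod
        ht = (ht * base + ord(t[i])) % mod
    high = pow(base, n - 1, mod)             # weight of the outgoing char
    matches = []
    if hs == ht and t[:n] == s:              # the check before the loop:
        matches.append(0)                    # covers the window at offset 0
    for i in range(n, m):                    # empty range when n == m
        ht = (ht - ord(t[i - n]) * high) % mod    # drop t[i-n] (the 'skip')
        ht = (ht * base + ord(t[i])) % mod        # append t[i]
        if hs == ht and t[i - n + 1:i + 1] == s:  # confirm: hashes can collide
            matches.append(i - n + 1)
    return matches

print(rabin_karp("abc", "xabcabc"))  # [1, 4]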

Is full-text search of PDFs in SQL Server 2008 restricted to embedded text only?

What I'm curious about: let's say I have 100 PDFs, and all of them contain the words "happy apple". Let's say that only 20 of these have embedded text containing "happy apple".
When I do a search for "happy apple", will I receive all 100 docs or only 20? I'm unable to find a clear answer to this question.
Flat out impossible to answer without any further information on your search tool and the actual PDFs.
"Happy apple" will be found if the text is 1. not compressed, 2. not encrypted, 3. not weirdly constructed, 4. not re-encoded, or 5. re-encoded but the translation table to Unicode is present and correct.
ad 1: Usually data streams in a PDF are compressed, using one or more algorithms from the standard set (usually LZW or Flate).
ad 2: PDFs may be encrypted with a password, preventing casual inspection. Levels of security range from mid-difficult to theoretically uncrackable with current technology.
ad 3: Single characters may appear on your page in any order. The software used to create the PDF may, at its whim, split a text string up into separate parts, or even draw each individual character at any position and omit all spaces. Only strict sorting on the absolute x and y coordinates of each text fragment may reveal the original text.
ad 4: If a font gets subsetted, a PDF composer may decide to store 'h' as 0, 'a' as 1, 'p' as 2, and so on. The correct glyphs are still associated with these codes, but the text "happy apple" may now appear as "0 1 2 2 3 4 1 2 2 5 6" in the text stream. Also, even if it does not subset the font, a PDF composer is free to move characters around anyway.
ad 5: To revert this re-encoding, software may include a ToUnicode table. This is to associate character codes back to the original Unicode values again; one table per re-encoded font. If the table is missing, there usually is no straightforward way to create it.
There is even an ad 6 I did not think of: text may be outlined or appear in bitmaps only.
Only the very simplest PDFs can be searched with a general tool such as command-line grep. For anything else, you need a good PDF decoding tool -- and the better it is, the more points on this list it can tick off. Except, that is, #5 and #6.
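To make the "good PDF decoding tool" point concrete: assuming Python and the pdfminer.six library (an assumption, the question never names a toolchain), a search could look like the sketch below. The library copes with points 1 and 4 where the needed tables exist, and fails on 2, 5, and 6 just like any other extractor.

# Extract decoded, decompressed text and do a naive phrase search.
from pdfminer.high_level import extract_text

def pdf_contains(path, phrase):
    text = extract_text(path)  # handles compressed streams and re-encoding
    return phrase.lower() in text.lower()

print(pdf_contains("report.pdf", "happy apple"))  # 'report.pdf' is hypothetical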
(Later edit) Oh wait. You obfuscated your actual question enough to entirely throw me off the target, which (I think!) was "does sql-server-2008 search for entire phrases or for individual words?"
Good thing, then, the above still holds. If you cannot search inside your PDFs anyway, the actual question is moot.

Forward index implementation in Google

I am trying to develop a search engine in my free time modeled after google.
I am using the original google research paper listed here: http://infolab.stanford.edu/~backrub/google.html
However, I am having a few problems. To be exact, I am having trouble developing the forward index.
In the paper it says:
If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words.
Now there are two problems with this statement. First, who decides which words out of the huge lexicon go into the forward barrels? Do all of them go? Second, what is the meaning of "correspond" here? Does it mean words that actually appear in that document after the previous word, or something else?
I am really new to search engines and would really appreciate any Information Retrieval expert helping me with this. If moderators think that this question belongs on some other Stack Exchange site, please move it.
First Question:
The string value of every word is mapped to an integer (by a hash function), because integers are far easier to handle than strings. You can then define ranges (buckets or bins or whatever else you might want to call them) over these integer values, e.g.
term ids 0 to 1000 => Bin-1
term ids 1001 to 2000 => Bin-2
and so on.
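A tiny Python sketch of those two steps: hash each term to an integer id, then assign the id to a bin by range. The toy hash and the bin width of 1000 are illustrative choices mirroring the ranges above (boundaries shifted by one for simplicity).

def term_id(word, table_size=10_000):
    # Toy polynomial hash. Python's built-in hash() is randomized per
    # process, so a real system would persist a lexicon or use a
    # stable hash such as a CRC instead.
    return sum(ord(c) * 31 ** i for i, c in enumerate(word)) % table_size

def bin_of(word, bin_width=1000):
    # term ids 0..999 -> Bin-1, 1000..1999 -> Bin-2, and so on.
    return term_id(word) // bin_width + 1

for w in ("the", "quick", "brown"):
    print(w, term_id(w), "-> Bin-%d" % bin_of(w))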
Second question:
The context information is typically not used. A word is simply a term present in a document, such as the terms "the", "quick", "brown" etc.
Since you said you are new to IR, a good way to start would be to read an introductory IR book, e.g. Introduction to Information Retrieval by Manning, Raghavan, and Schütze.

Check if two NSStrings are similar

I present a tricky question that I am not sure how to approach. I have put together a plist of dictionaries, each containing two objects:
The Country Name
The Plug Size Of The Country
There are only 210 countries/facts though.
I have also enabled searching through a list of many, many countries, for which there may or may not be a fact. But here is my problem: I am using a web service called Geonames, the user can search for countries through a search bar display controller, and the plist country names paired with plug sizes actually come from a Wikipedia article.
Now, the country names in Geonames and in my plist from Wikipedia might differ slightly: maybe an extra space, an extra dash, an extra letter. This is why I want to see whether the Geonames country string is very similar to the one in the plist.
So this would not be isEqualToString:, because that only checks for an exact match; could the compare: method work?
How can I approach this? Here is an example:
Geoname returns (not a real country just an example):
Yiting
But plist may return:
Yitting
So, one extra 't', but there are other cases too. I would like these two to compare as equal, or at least very similar, so I could treat them as a match.
Are there any tutorials, resources, projects etc. you could point me towards?
Thank you! Bye!
The Soundex algorithm is useful in cases like this.
I found a sample implementation on github.
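For reference, a compact sketch of classic Soundex in Python; the question is about Objective-C, but the algorithm translates directly, and this is not the github sample mentioned above.

def soundex(word):
    # Classic Soundex: keep the first letter, encode the rest as digits,
    # skip repeats (h/w do not break a run), pad/trim to 4 characters.
    if not word:
        return ""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":          # vowels reset the run; h/w do not
            prev = code
    return (result + "000")[:4]

print(soundex("Yiting"), soundex("Yitting"))  # Y352 Y352 -> same code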
You need to implement an algorithm for approximate matching of strings. One of the most popular such algorithms is Levenshtein distance, one of several edit distance algorithms. The distance is the number of editing operations required to transform string A into string B; inserting, deleting, or changing a character each count as one edit operation. The closer the strings, the smaller the edit distance between them. You can compute pairwise edit distances and pick the smallest one to identify the match.
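A minimal Python sketch of the standard dynamic-programming recurrence (illustrative; the same two-row scheme is easy to port to Objective-C). To match a Geonames name against the plist, compute the distance to every plist country name and accept the smallest one if it is below a small threshold, say 2.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))          # distances from the empty prefix
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("Yiting", "Yitting"))  # 1: one extra 't'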
You may find this post about auto update/complete useful:
I've tested that UITextView works well when your UIViewController class adheres to the UITextViewDelegate protocol, and it will produce a result similar to what you'll find in the Messages app. I haven't checked whether UITextField and UITextFieldDelegate work as well.

How to implement autocomplete on a massive dataset

I'm trying to implement something like Google Suggest on a website I am building and am curious how to go about doing it on a very large dataset. Sure, if you've got 1,000 items you cache them and just loop through them. But how do you go about it when you have a million items? Further, suppose that the items are not one word.

Specifically, I have been really impressed by Pandora.com. For example, if you search for "wet" it brings back "Wet Sand", but it also brings back Toad The Wet Sprocket. And their autocomplete is FAST. My first idea was to group the items by the first two letters, so you would have something like:
Dictionary<string,List<string>>
where the key is the first two letters. That's OK, but what if I want to do something similar to Pandora and allow the user to see results that match the middle of the string? With my idea, "wet" would never match "Toad the Wet Sprocket", because it would be in the "TO" bucket instead of the "WE" bucket.

So then perhaps you could split the string up so that "Toad the Wet Sprocket" goes into the "TO", "WE", and "SP" buckets (stripping out the word "THE"). But when you're talking about a million entries of, say, a few words each, it seems like you'd quickly start using up a lot of memory. OK, that was a long question. Thoughts?
As I pointed out in How to implement incremental search on a list you should use structures like a Trie or Patricia trie for searching patterns in large texts.
And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.
When I insert some new text into the trie, I just insert it, then remove the first character and insert again, remove the second character and insert again, and so on until the whole text is consumed. Then you can discover every substring of every inserted text with just one search from the root. The resulting structure is essentially a suffix trie (its path-compressed form is the well-known suffix tree), and there are a lot of optimizations available.
And it is really incredibly fast. To find all texts that contain a given sequence of n characters, you have to inspect at most n nodes and perform a search on the list of children of each node. Depending on the implementation of the child collection (array, list, binary tree, skip list), you might be able to identify the required child node in as few as 5 search steps, assuming case-insensitive Latin letters only. Interpolation search might be helpful for large alphabets and for nodes with many children, such as those usually found near the root.
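A small Python sketch of the insert-every-suffix idea: an uncompressed suffix trie whose nodes carry the set of original entries reachable through them, so one walk from the root finds every entry containing the query. Class and method names are illustrative.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.entries = set()

class SubstringIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, text):
        key = text.lower()
        for start in range(len(key)):        # insert every suffix
            node = self.root
            for ch in key[start:]:
                node = node.children.setdefault(ch, TrieNode())
                node.entries.add(text)

    def search(self, query):
        node = self.root
        for ch in query.lower():
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.entries

index = SubstringIndex()
for title in ("Wet Sand", "Toad the Wet Sprocket"):
    index.insert(title)
print(index.search("wet"))  # both titles: matches the middle of the string

Memory grows roughly quadratically with entry length, which is why compressed suffix trees matter at the million-entry scale.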
Don't try to implement this yourself (unless you're just curious). Use something like Lucene or Endeca - it will save you time and hair.
Not algorithmically related to what you are asking, but make sure you have a delay (debounce) of 200 ms or more after the keypress(es), so that the user has stopped typing before you issue the asynchronous request. That way you will reduce redundant HTTP requests to the server.
I would use something along the lines of a trie, and have the value of each leaf node be a list of the possibilities that contain the word represented by the leaf node. You could sort them in order of likelihood, or dynamically sort/filter them based on other words the user has entered into the search box, etc. It will execute very quickly and in a reasonable amount of RAM.
You keep the items on the server side (perhaps in a DB, if the dataset is really large and complex) and you send AJAX calls from the client's browser that return the results using json/xml. You can do this in response to the user typing, or with a timer.
If you don't want a trie and you want matches from the middle of the string, you generally want to run some sort of edit distance function (Levenshtein distance), which gives you a number indicating how well two strings match up. It's not a particularly efficient algorithm, but that doesn't matter too much for things like words, as they're relatively short. If you're running comparisons on, say, 8,000-character strings, it'll probably take a few seconds. Most languages have an implementation, or you can find code/pseudocode for it pretty easily on the internet.
I've built AutoCompleteAPI for this scenario exactly.
Sign up to get a private index, then upload your documents.
Example upload using curl on document "New York":
curl -X PUT -H "Content-Type: application/json" -H "Authorization: [YourSecretKey]" -d '{
"key": "New York",
"input": "New York"
}' "http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]"
After indexing all documents, use the following to get autocomplete suggestions:
http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]?prefix=new
You can use any client autocomplete library to show these results to the user.