How to recognize if a word has no meaning, maybe by some impossible syllables? - cryptography

Initially, I have m arrays of n characters, where each array contains an unknown (to me) character of the needed word (condition: the word has meaning).
For example, m = 4, n = 3: array0 = {'t', 'e', 'c'}, array1 = {'g', 'o', 'a'}, array2 = {'w', 'd', 'y'}, array3 = {'e', 'o', 's'}. Each array contains exactly one correct letter: array0 holds the first letter, array1 the second, and so on. So the probable secret word is 'code': array0[2] = 'c', array1[1] = 'o', array2[1] = 'd', array3[0] = 'e'.
I need to find all existing letter combinations, i.e. exclude the generated meaningless words.
Are there any rules/regularities for 'impossible' syllables/letter combinations in English?
I'm attacking a Vigenère cipher, so I know the length of the key and its probable characters. I'm shuffling my arrays and getting many meaningless words, and the problem is to filter them out. As I understand it, some conditions can help to recognize incorrect words: for example, if the word length is > 4, a word of all vowels or all consonants is wrong. Some letter pairs, such as kk, hh, ww, are generally impossible too. Where can I find such rules?
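To illustrate, here is a minimal sketch of the kind of filter I have in mind (the rule list is only a guess, not complete):

# Heuristic word filter; the pair list below is an assumed example set.
VOWELS = set('aeiouy')
IMPOSSIBLE_PAIRS = {'kk', 'hh', 'ww', 'qq', 'jj'}

def looks_possible(word):
    letters = set(word)
    if len(word) > 4 and (letters <= VOWELS or not (letters & VOWELS)):
        return False  # all vowels or all consonants
    return not any(pair in word for pair in IMPOSSIBLE_PAIRS)

print(looks_possible('code'))   # True
print(looks_possible('tgwsq'))  # False: five letters, no vowels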

I suppose that by "the word has meaning" you mean it is an English dictionary word.
I believe you should approach the problem from the other direction, as GregS suggests, and go through a dictionary. English has many exceptions when it comes to letters and spelling, and the number of words that look English is much greater than the actual number of English words, so you won't be able to cut down your search very much that way.
But because you know the length and the probable characters, you can quickly throw out many dictionary words. Also, if the message isn't too short, it would be very fast to attempt a decoding of the message with possible words and throw out unlikely decodings by letter, digram or trigram frequencies.
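For illustration, a minimal sketch of that dictionary-driven direction (the word-list path is an assumption; any English word list will do):

from itertools import product

# One array of candidate letters per key position, as in the question.
arrays = [['t', 'e', 'c'], ['g', 'o', 'a'], ['w', 'd', 'y'], ['e', 'o', 's']]

# Assumed word list; /usr/share/dict/words exists on many Unix systems.
with open('/usr/share/dict/words') as f:
    dictionary = {line.strip().lower() for line in f}

candidates = (''.join(letters) for letters in product(*arrays))
print([w for w in candidates if w in dictionary])  # includes 'code'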

I'm not sure I follow your strategy for attacking a Vigenère cipher. However, in response to:
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words. Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
Yes, indeed there is a plethora of such rules. There are two ways of learning and implementing them:
1. Carefully study the morphology of English, and meticulously implement the rules.
2. Train a Markov model on a corpus of English text.
Option 1 will be substantially more work for little additional benefit.
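As a rough illustration of option 2, here is a minimal character-bigram sketch (the toy training list, the smoothing scheme and the score floor are all assumptions; a real model would be trained on a large corpus):

from collections import defaultdict
import math

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def train_bigram_model(words):
    # Count character-bigram frequencies over a training word list.
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        w = '^' + word.lower() + '$'  # boundary markers
        for a, b in zip(w, w[1:]):
            counts[a][b] += 1
    # Convert counts to log-probabilities with add-one smoothing.
    model = {}
    for a, nexts in counts.items():
        total = sum(nexts.values()) + len(ALPHABET) + 1
        model[a] = {b: math.log((nexts[b] + 1) / total) for b in ALPHABET + '$'}
    return model

def english_score(word, model, floor=math.log(1e-6)):
    # Average bigram log-probability; higher means more English-like.
    w = '^' + word.lower() + '$'
    return sum(model.get(a, {}).get(b, floor) for a, b in zip(w, w[1:])) / (len(w) - 1)

# Toy training list; a real run would use a large word list or corpus.
model = train_bigram_model(['code', 'coat', 'goes', 'toad', 'eats', 'ways'])
print(english_score('code', model) > english_score('tgwe', model))  # True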

Related

Naming conventions: method or variable names containing numbers as words

I can't find even a couple of words about numbers in the names of variables or methods. Does anyone have any authoritative information about cases such as:
string2map
its4me
etc...
That is, using a number as a word, not a number as a number.
Is it acceptable or not, silly or professional? Please argue your opinion.
I haven't found any information either, but below are my own thoughts.
Using a digit in an identifier which happens 2 be pronounced the same way as a word is just silly word play. It also makes the meaning of the identifier ambiguous: does char2old mean that a character is too old, is it an old version of char2, or is it a conversion? It's fun, however, to come up with names like a10sorFlow, the2lbox and my4mula, but they are best avoided.
When it comes to using numbers 1 to N at the end of identically named identifiers, it is probably better to use an array instead if N > 2. And when N = 2 there are often clearer names that can be used, like leftCircle and rightCircle instead of circle1 and circle2, or currentChar and nextChar instead of char1 and char2.
Here is a good general guide for naming variables:
Identifier kind                        | Word class                 | Example
---------------------------------------|----------------------------|--------------------------
Boolean variable or pure function      | Last word is an adjective  | doorClosed, TablePrepared
Non-boolean variable or pure function  | Last word is a noun        | closedDoor, PreparedTable
Non-pure function (has side effects)   | First word is a verb       | CloseDoor, PrepareTable
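To illustrate the guide (hypothetical names; Python is used here only for illustration, keeping the table's casing):

doorClosed = True              # boolean variable: last word is an adjective

def preparedTable(rows):       # pure function returning a value: last word is a noun
    return sorted(rows)

def prepareTable(table):       # function with side effects: first word is a verb
    table.clear()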

How to treat numbers inside text strings when vectorizing words?

If I have a text string to be vectorized, how should I handle the numbers inside it? Put differently, if I feed a neural network numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In this case all strings become arrays of numbers. How should I handle characters that are numbers? How do I output a vector that does not mix the word index with the number character?
Does converting numbers to strings weaken the information I feed the network?
Expanding on the discussion with @user1735003, let's consider both ways of representing numbers:
Treating a number as a string, considering it as just another word, and assigning it an ID when forming the dictionary; or
Converting the numbers to actual words: '1' becomes 'one', '2' becomes 'two', and so on.
Does the second one change the context in any way? To verify this we can measure the similarity of the two representations using word2vec; the scores will be high if they have similar contexts.
For example,
'1' and 'one' have a similarity score of 0.17, and '2' and 'two' have a similarity score of 0.23. This seems to suggest that the contexts in which they are used are quite different.
By treating the numbers as just another word, you are not changing the context, whereas with any other transformation on those numbers you can't guarantee it's for the better. So it's best to leave them untouched and treat each number as another word.
Note: Both word2vec and GloVe were trained treating numbers as strings (case 1).
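For reference, a sketch of how that similarity check might be run with gensim (the pretrained-vectors file name is an assumption; any word2vec-format file works):

from gensim.models import KeyedVectors

# Assumed: a pretrained word2vec file available locally, e.g. the GoogleNews vectors.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                            binary=True)

print(vectors.similarity('1', 'one'))  # low score => tokens appear in different contexts
print(vectors.similarity('2', 'two'))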
The link you provide suggests that everything resulting from a .split(' ') is indexed: words, but also numbers, possibly smileys, and so on (I would still take care of punctuation marks). Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
    for word in tweet.split(" "):
        if word not in dictionary:
            dictionary[word] = i
            i += 1
print(dictionary)
# {'my': 1, 'car': 2, 'number': 3, '3': 4}
The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf (specifically, page 7).
Before they fall back to an <unknown> tag, they try to replace alphanumeric symbol combinations with common pattern-name tags, such as:
FourDigits (good for years)
I've tried to implement this and it gave great results.
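A minimal sketch of that idea (the pattern set here is an assumed subset of the paper's classes on page 7):

import re

# Ordered (pattern, tag) pairs, checked most specific first.
PATTERNS = [
    (re.compile(r'^\d{4}$'), '<FourDigits>'),  # good for years
    (re.compile(r'^\d+$'), '<OtherNumber>'),
    (re.compile(r'^(?=.*\d)(?=.*[A-Za-z])[A-Za-z\d]+$'), '<ContainsDigitAndAlpha>'),
]

def map_token(token):
    # Replace rare alphanumeric shapes with a shared pattern tag.
    for pattern, tag in PATTERNS:
        if pattern.match(token):
            return tag
    return token

print([map_token(t) for t in 'born in 1984 at flat 4b'.split()])
# ['born', 'in', '<FourDigits>', 'at', 'flat', '<ContainsDigitAndAlpha>']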

How to generate (book) indexes?

I need to create an index for a book. While the task looks easy at first -- group words by their first letter, then sort them -- this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
The DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too tightly tied to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and create a hash, using the words as keys and building up an array of locations (page numbers?) as values.
But indexes are generally a bit more focused than that.
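Still, a sketch of that naive approach, assuming the book is available as (page_number, text) pairs:

from collections import defaultdict

def build_index(pages):
    # pages: iterable of (page_number, text). Returns word -> sorted page list.
    index = defaultdict(set)
    for page_number, text in pages:
        for word in text.lower().split():
            index[word.strip('.,;:!?')].add(page_number)
    return {word: sorted(nums) for word, nums in index.items()}

pages = [(1, 'The quick fox'), (2, 'The lazy dog')]
print(build_index(pages))
# {'the': [1, 2], 'quick': [1], 'fox': [1], 'lazy': [2], 'dog': [2]}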
Well, after responding to the comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to cultures. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
import icu

words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
collator = icu.Collator.createInstance(icu.Locale.getFrance())
for word in sorted(words, key=collator.getSortKey):
    print(word)

Add spaces between words in spaceless string

I'm on OS X, and in Objective-C I'm trying to convert,
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I actually attempt to write the system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically you chop the first letter off your letter collection and add it to the current word you are forming. If that makes a word (e.g. by dictionary lookup), add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, you have a full sentence. But you don't have to stop there: keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    add first letter from l to w;
    if (w in dictionary)
    {
        add w to s;
        FindWords(sentences, s, empty word, l);
        remove w from s;
    }
    FindWords(sentences, s, w, l);
    put last letter from w back onto l;
}
There are, of course, a number of optimizations you could perform to make it go faster, for instance checking whether the current word is a prefix (stem) of any word in the dictionary. But this is the basic approach that will give you all possible sentences.
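For reference, a direct Python translation of the pseudo-code (generator-based; the toy dictionary is an assumption):

def find_sentences(letters, dictionary, word='', sentence=()):
    # Yield every sentence (tuple of dictionary words) using all letters in order.
    if not letters and not word:
        yield sentence
        return
    if not letters:
        return
    word += letters[0]  # move the first remaining letter onto the current word
    rest = letters[1:]
    if word in dictionary:
        # Accept the word and start a fresh one...
        yield from find_sentences(rest, dictionary, '', sentence + (word,))
    # ...and also keep extending the current word.
    yield from find_sentences(rest, dictionary, word, sentence)

words = {'bob', 'ate', 'a', 'tea', 'green', 'apple'}  # assumed toy dictionary
for s in find_sentences('bobateagreenapple', words):
    print(' '.join(s))
# bob a tea green apple
# bob ate a green apple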
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool over the PDF and correct its output than it would to correct what this system might give you, let alone program the system in the first place.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up the most characters (preferably all of them), and then to favor the ones with the longest words, because 2-, 3- or 4-character words can often arise by chance from leftover characters. Most of the time this yields the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with a 50,000-word dictionary).
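The ranking heuristic can be sketched roughly like this (not the CodeProject code itself, just an illustration; the scoring weights are assumptions):

def rank_segmentations(segmentations):
    # Prefer using more of the original characters, then longer words.
    def score(words):
        used = sum(len(w) for w in words)
        long_word_bias = sum(len(w) ** 2 for w in words)  # rewards longer words
        return (used, long_word_bias)
    return sorted(segmentations, key=score, reverse=True)

candidates = [
    ('an', 'other'),      # uses all 7 letters of 'another', longer words
    ('a', 'not', 'her'),  # uses all 7 letters, shorter words
    ('an',),              # leftover characters remain
]
print(rank_segmentations(candidates)[0])  # ('an', 'other')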

Identifying a random substitution cipher key (English text)

Input:
English plaintext (A-Z) encrypted using a randomly generated substitution cipher.
Output:
The key.
Ideas:
Read the whole text, storing the frequencies of each character/bigram/trigram in arrays, and compare them to:
http://en.wikipedia.org/wiki/Letter_frequencies
http://en.wikipedia.org/wiki/Bigram
http://en.wikipedia.org/wiki/Trigram
Cons: letters/bigrams/trigrams with close percentages (like "c" and "u").
My software should guess as many characters as possible from the encrypted text (minimum 2000 characters).
I have to guess at least 18-20 letters.
Questions:
Is there a way or known algorithm to guess all the characters, i.e. the full key?
Or can you give me some useful references or advice on how I could improve the whole guessing process?
I think you're on the right track. The only way you could recover the full key would be if all characters (or all but one) are present in the plaintext.
I'd be thinking along the lines of making some statistical guesses and then statistically checking the plaintext bigrams/trigrams which result, or checking whole words (if you know where the word boundaries are) against a word list.
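To make the first step concrete, a minimal sketch of the initial statistical guess (the frequency order and the input file name are assumptions; bigram/trigram checking would refine this):

from collections import Counter

# Approximate English letter frequency order, most to least common.
ENGLISH_BY_FREQUENCY = 'ETAOINSHRDLCUMWFGYPBVKJXQZ'

def initial_key_guess(ciphertext):
    # Map each ciphertext letter to the English letter of the same frequency rank.
    counts = Counter(c for c in ciphertext.upper() if c.isalpha())
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

key = initial_key_guess(open('ciphertext.txt').read())  # assumed input file
print(key)  # partial key; letters absent from the ciphertext stay unmapped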