How to treat numbers inside text strings when vectorizing words? - tensorflow

If I have a text string to be vectorized, how should I handle numbers inside it? Or if I feed a Neural Network with numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are numbers? How can I output a vector that does not mix the word index with the number character?
Does converting numbers to strings weaken the information I feed the network?

Expanding on the discussion with @user1735003 - let's consider both ways of representing numbers:
Treating a number as a string, considering it just another word, and assigning it an ID when forming the dictionary; or
Converting the numbers to actual words: '1' becomes 'one', '2' becomes 'two', and so on.
Does the second one change the context in any way? To verify this, we can measure the similarity of the two representations using word2vec. The scores will be high if they occur in similar contexts.
For example,
'1' and 'one' have a similarity score of 0.17, and '2' and 'two' have a similarity score of 0.23. These low scores suggest that the contexts in which they are used are quite different.
By treating the numbers as just another word you are not changing the context, but with any other transformation on those numbers you can't guarantee it will be for the better. So it's better to leave them untouched and treat each number as another word.
Note: Both word2vec and GloVe were trained by treating the numbers as strings (case 1).
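If you want to run this kind of similarity check yourself, here is a minimal sketch using gensim and a pretrained word2vec model; the model file name is a placeholder and I'm assuming both tokens exist in its vocabulary:
from gensim.models import KeyedVectors

# Load a pretrained model in word2vec binary format (path is a placeholder).
kv = KeyedVectors.load_word2vec_format('pretrained_word2vec.bin', binary=True)

# Compare digit tokens with their spelled-out forms.
for digit, word in [('1', 'one'), ('2', 'two')]:
    if digit in kv and word in kv:
        print(digit, word, kv.similarity(digit, word))
    else:
        print(digit, word, 'not in the model vocabulary')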

The link you provide suggests that everything resulting from a .split(' ') is indexed -- words, but also numbers, possibly smileys, and so on (I would still take care of punctuation marks). Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
    for word in tweet.split(" "):
        if word not in dictionary: dictionary[word] = i
        i += 1
print(dictionary)
# {'my': 1, '3': 4, 'car': 2, 'number': 3}

The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf
Specifically, page 7.
Before they fall back to an <unknown> tag, they try to replace alphanumeric symbol combinations with common pattern-name tags, such as:
FourDigits (good for years)
I've tried to implement it and it gave great results.
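A rough Python sketch of that idea; the pattern names follow the slides, but the regexes, the vocabulary check and the example tokens are my own illustration:
import re

# Replace rare alphanumeric tokens with coarse pattern tags before building
# the vocabulary, in the spirit of page 7 of the slides. Patterns are illustrative.
PATTERNS = [
    (re.compile(r'^\d{4}$'), 'FourDigits'),        # e.g. years such as 1984
    (re.compile(r'^\d{2}$'), 'TwoDigits'),
    (re.compile(r'^\d+$'), 'OtherNumber'),
    (re.compile(r'^\d+[A-Za-z]+$'), 'DigitsAndLetters'),
]

def normalize(token, vocabulary):
    if token in vocabulary:                        # keep known words as they are
        return token
    for pattern, tag in PATTERNS:
        if pattern.match(token):
            return tag
    return '<unknown>'

print(normalize('1984', {'car', 'number'}))        # -> FourDigits
print(normalize('3', {'car', 'number'}))           # -> OtherNumber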

Related

The set of atomic irrational numbers used to express the character table and corresponding (unitary) representations

I want to calculate the irrational number expressed by the following formula in GAP:
3^(1/7). I've read through the related description here, but I still can't figure out the trick. Will numbers like this appear in the computation of the character table and the corresponding (unitary) representations?
P.S. Basically, I want to figure out the following question: For the computation of the character table and corresponding (unitary) representations, what is the minimum complete set of atomic irrational numbers used to express the results?
Regards,
HZ
You can't do that with GAP's standard cyclotomic numbers, as seventh roots of 3 are not cyclotomic. Indeed, suppose $r$ is such a root, i.e. a root of the polynomial $f = x^7 - 3 \in \mathbb{Q}[x]$. Then $r$ is cyclotomic if and only if the field extension $\mathbb{Q}(r)$ is a subfield of a cyclotomic field. By Kronecker-Weber this is equivalent to that field being contained in an abelian extension, i.e., the Galois group of the splitting field of $f$ being abelian. One can check that this is not the case here (the Galois group is a semidirect product of $C_7$ with $C_6$).
So, $r$ is not cyclotomic.
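For completeness, a short sketch (my own, standard field theory rather than anything GAP-specific) of why that Galois group is nonabelian:
% Splitting field of f = x^7 - 3 over the rationals
K = \mathbb{Q}(\zeta_7,\, 3^{1/7}), \qquad [K : \mathbb{Q}] = 7 \cdot 6 = 42.
% Every automorphism is determined by its action on the two generators:
\sigma_{a,b}\colon\ \zeta_7 \mapsto \zeta_7^{\,a}, \quad 3^{1/7} \mapsto \zeta_7^{\,b}\, 3^{1/7},
\qquad a \in (\mathbb{Z}/7\mathbb{Z})^{\times},\ b \in \mathbb{Z}/7\mathbb{Z}.
% Hence
\mathrm{Gal}(K/\mathbb{Q}) \;\cong\; \mathbb{Z}/7\mathbb{Z} \rtimes (\mathbb{Z}/7\mathbb{Z})^{\times} \;\cong\; C_7 \rtimes C_6,
% which is nonabelian: composing \sigma_{1,1} and \sigma_{3,0} in the two possible
% orders sends 3^{1/7} to \zeta_7^{3}\, 3^{1/7} and to \zeta_7\, 3^{1/7} respectively.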

Use MeCab to separate Japanese sentences into words not morphemes in vb.net

I am using the following code to split Japanese sentences into its words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
    If node.CharType > 0 Then
        Dim features = node.Feature.Split(",")
        Console.Write(node.Surface)
        Console.WriteLine(" (" & features(7) & ") " & features(1))
    End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are some strategies that folks usually combine to improve on this.
Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk together morphemes into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word" but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms or set phrases. It can get complicated because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of MeCab output.
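To make that last idea concrete, here is a rough sketch in Python rather than VB.NET; the small word set stands in for a real JMdict lookup, and the greedy longest-match strategy is my own simplification:
# Greedy longest-match merge of adjacent morphemes into dictionary words.
# 'known_words' stands in for a lookup against JMdict (or its lemma forms).
def merge_morphemes(morphemes, known_words, max_span=4):
    result, i = [], 0
    while i < len(morphemes):
        merged = None
        # Try the longest run of morphemes first, down to a single morpheme.
        for span in range(min(max_span, len(morphemes) - i), 0, -1):
            candidate = ''.join(morphemes[i:i + span])
            if span == 1 or candidate in known_words:
                merged, i = candidate, i + span
                break
        result.append(merged)
    return result

morphemes = ['それ', 'に', '応じ', 'て', '大きく', 'に', 'なり', 'ます', '。']
known = {'応じて', 'なります'}
print(merge_morphemes(morphemes, known))
# -> ['それ', 'に', '応じて', '大きく', 'に', 'なります', '。']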
I have an open-source package that combines all of the above, called Curtiz: it runs text through MeCab, chunks the morphemes into bunsetsu with J.DepP to find groups of morphemes that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my activities in learning Japanese and making Japanese learning tools, but it shows how the above pieces can be combined to get what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

CountVectorizer method get_feature_names() produces codes but not words

I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generated. But instead of words I got a list of codes. What does this mean, and how do I deal with the problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE:
I managed to deal with this problem, but now when I want to look at my words I see many entries that are not actually words, just meaningless sets of letters (see screenshot attached). Does anybody know how to filter these words out before I use CountVectorizer?
You are using min_df = 1, which keeps every word that appears in at least one document, i.e. all the words. min_df can be treated as a hyperparameter in its own right: raising it removes tokens that appear in too few documents (the rare, noisy ones). I would recommend using spaCy to tokenize the text and join the tokens back into strings before giving it as input to the CountVectorizer.
Note: The feature names that you see are actually part of your vocabulary. It's just noise. If you want to remove them, then set min_df >1.
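A minimal sketch of the spaCy preprocessing suggested above; the choice to keep only alphabetic, lowercased tokens and the min_df value are my own, and df['message_encoding'] is assumed to be the same column as in your code:
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')

def clean(text):
    # Keep only alphabetic tokens, dropping numbers, punctuation and code-like strings.
    return ' '.join(tok.text.lower() for tok in nlp(text) if tok.is_alpha)

cleaned = df['message_encoding'].apply(clean)
vectorizer = CountVectorizer(min_df=2, stop_words='english')
X = vectorizer.fit_transform(cleaned)
print(list(vectorizer.vocabulary_.keys())[:20])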
Here is what you can do to get exactly what you want:
vectorizer=CountVectorizer()
vectorizer.fit_transform(df['message_encoding'])
feat_dict=vectorizer.vocabulary_.keys()
instead of vectorizer.get_feature_names() you can write vectorizer.vocabulary_.keys() to get the words.

Wiki API - Parsing sentences from JSON extracts in JavaScript?

Is there a way to have wiki display extracts in an array of sentences?
Or does anyone have any ideas other than using string.split(".") to parse? There are cases where the sentence may include a . and I don't want to split if it occurs mid-sentence.
For example, "The Eagles were No. 1 in the U.S. in 1970" would be split into 4 sentences using str.split(), and that's not what I want.
Wiki must have some sort of determination of what defines a sentence, since it works when you limit the number of sentences in a call (they don't break a sentence on an in-line period). Is there a way to get them individually?
Looking for a solution in JavaScript to parse a JSON excerpt string.
I ended up figuring out a work-around. Using exsentences, I made 10 calls, each with one more sentence than the previous call. I stored the results of each call in an array. So when the 10 calls were complete, I had 10 strings, ranging from one sentence in position 0, up to 10 sentences in the 9th position. Then I just iterated through the array, from 0 to length - 2, subtracting the string in the current position from the string at position [i + 1] (with string[i + 1].slice(string[i].length)), to get the nth string.
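For reference, here is a rough Python sketch of that workaround (the page title is just an example and error handling is omitted); the same slicing idea applies in JavaScript:
import requests

API = 'https://en.wikipedia.org/w/api.php'

def first_sentences(title, n=10):
    # Fetch extracts of 1..n sentences, each one sentence longer than the last.
    extracts = []
    for i in range(1, n + 1):
        params = {
            'action': 'query', 'format': 'json', 'prop': 'extracts',
            'explaintext': 1, 'exsentences': i, 'titles': title,
        }
        pages = requests.get(API, params=params).json()['query']['pages']
        extracts.append(next(iter(pages.values()))['extract'])
    # Each extract extends the previous one by exactly one sentence,
    # so sentence i is the suffix that call i+1 added.
    sentences = [extracts[0]]
    for prev, curr in zip(extracts, extracts[1:]):
        sentences.append(curr[len(prev):].strip())
    return sentences

print(first_sentences('Eagles (band)'))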

How to recognize if word has no meaning, maybe some impossible syllables?

Initially, I have m arrays of n characters, where each array contains an unknown (to me) character of the needed word (condition: the word has meaning).
For example, m = 4, n = 3: array0 = {'t', 'e', 'c'}, array1 = {'g', 'o', 'a'}, array2 = {'w', 'd', 'y'}, array3 = {'e', 'o', 's'}. Each array contains exactly one correct letter: array0 holds the first letter, array1 the second, and so on. So the probable secret word is 'code': array0[2] = 'c', array1[1] = 'o', array2[1] = 'd', array3[0] = 'e'.
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words.
Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
I'm attacking a Vigenère cipher, so I know the length of the key and its probable characters. I'm shuffling my arrays and getting many meaningless words; the problem is filtering them out. As I see it, some conditions can help to recognize incorrect words. For example, if the word length is > 4, a word consisting entirely of vowels or entirely of consonants is wrong. Some letter combinations, such as kk, hh, ww, are in general impossible too. Where can I find such rules?
I'm supposing what you mean by the "word has meaning" is that it is an English dictionary word.
I believe that you should approach the problem from the other direction, as GregS suggests, and go through a dictionary. English has many exceptions when it comes to letters and spelling, and the number of strings that look like English words is much greater than the actual number of English words. You won't be able to cut down your search very much that way.
But because you know the length and the probable characters, you can quickly throw out many dictionary words. Also, if the message isn't too short, it would be very fast to attempt a decoding of the message with possible words and throw out unlikely decodings by letter, digram or trigram frequencies.
I'm not sure I follow your strategy for attacking a Vigenere cipher. However, in response to:
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words. Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
Yes, indeed there is a plethora of such rules. There are two ways of learning and implementing these rules:
1. Carefully study the morphology of English, and meticulously implement the rules.
2. Train a Markov model on a corpus of English text.
Option 1 will be substantially more work for little additional benefit.
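As an illustration of option 2, here is a small Python sketch of a character-bigram model trained on a word list and used to score candidate strings; the tiny word list and the probability floor are placeholders for a real dictionary and proper smoothing:
from collections import defaultdict
import math

def train_bigram_model(words):
    # Count character bigrams, padding each word with ^ and $ boundary markers.
    counts, totals = defaultdict(lambda: defaultdict(int)), defaultdict(int)
    for w in words:
        padded = '^' + w.lower() + '$'
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
            totals[a] += 1
    return counts, totals

def log_prob(word, counts, totals, floor=1e-6):
    # Length-normalized log-probability; low scores suggest un-English strings.
    padded = '^' + word.lower() + '$'
    score = 0.0
    for a, b in zip(padded, padded[1:]):
        p = counts[a][b] / totals[a] if totals[a] else floor
        score += math.log(p if p > 0 else floor)
    return score / (len(padded) - 1)

# 'english_words' would come from a dictionary file; these are placeholders.
english_words = ['code', 'cows', 'toad', 'goes', 'tease']
counts, totals = train_bigram_model(english_words)
for candidate in ['code', 'tgwe', 'cays']:
    print(candidate, round(log_prob(candidate, counts, totals), 2))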