I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards I want to look at the features the vectorizer generates, but instead of words I get a list of codes. What does this mean and how do I deal with the problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE:
I managed to deal with this problem, but now when I look at my words I see many entries that are not actual words, just meaningless strings of letters (see the screenshot attached). Does anybody know how to filter these words out before I use CountVectorizer?
You are using min_df=1, which keeps every token that appears in at least one document, i.e. all of them. min_df can be treated as a hyperparameter in its own right: raising it drops tokens that occur in too few documents (its counterpart max_df drops the most common ones). I would also recommend using spaCy to tokenize the text, filter out the junk tokens, and join the remaining tokens back into strings before passing them to CountVectorizer; a short sketch of that is shown below.
Note: the feature names you see really are part of your vocabulary. The u'' prefix just marks Python 2 unicode strings, and the entries themselves are digit-only tokens from your text, i.e. noise. If you want to remove them, set min_df > 1.
Here is what you can do to get exactly what you want:
vectorizer = CountVectorizer()
vectorizer.fit_transform(df['message_encoding'])
feat_dict = vectorizer.vocabulary_.keys()  # vocabulary_ maps each token to its column index; .keys() gives the tokens
Instead of vectorizer.get_feature_names() you can use vectorizer.vocabulary_.keys() to get the words.
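As a rough sketch of the spaCy preprocessing suggested above: the df['message_encoding'] column comes from the question, while the en_core_web_sm model and the is_alpha/is_stop filters are assumptions chosen for illustration.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean(text):
    # Keep only alphabetic tokens, drop stop words, and lowercase the rest
    doc = nlp(text)
    return " ".join(tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop)

cleaned = df["message_encoding"].astype(str).apply(clean)  # df is the DataFrame from the question

vectorizer = CountVectorizer(min_df=2)  # min_df > 1 drops tokens that occur in only one document
X = vectorizer.fit_transform(cleaned)
print(list(vectorizer.vocabulary_.keys())[:20])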
Related
I tried to use difflib's get_close_matches on tuple data, but it does not work. I have used difflib on a JSON file before, but I couldn't get it working on SQL results. Expected result: I want to find words similar to the given input even if there is a spelling mistake. For example, if the input is treeeee, TREEEEE or Treeea, my program should return the nearest match, that is, tree, similar to Google's "Did you mean?" feature. I also tried SELECT * FROM Dictionary WHERE Expression LIKE '%s but the problem persists. Please help me solve this. Thanks in advance.
SQL functions Soundex and DIFFERENCE look like the closest fit.
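If you would rather stay with difflib, the usual stumbling block is that a SQL cursor returns tuples rather than plain strings, so the rows need to be unpacked first. A minimal sketch, assuming a SQLite database with the Dictionary table and Expression column from the question (the file name and everything else are illustrative):
import sqlite3
from difflib import get_close_matches

conn = sqlite3.connect("dictionary.db")  # hypothetical database file
rows = conn.execute("SELECT Expression FROM Dictionary").fetchall()

# fetchall() returns tuples like ('tree',); unpack them into plain strings
words = [row[0] for row in rows]

user_input = "treeeee"
# Lowercase both sides so TREEEEE also matches
matches = get_close_matches(user_input.lower(), [w.lower() for w in words], n=3, cutoff=0.6)
print(matches)  # e.g. ['tree'] if the table contains it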
I notice that in Django, when a sentence contains PLAZA/MASTERPIECE and I search for masterpiece, that sentence is not found. Is this a limitation of PostgreSQL full-text search, or how can I solve it?
finalquery = SearchQuery("keyword")
vector = SearchVector('thefieldIwanttosearch')
self.search_results = (
    self.search_results
    .annotate(search=vector)
    .filter(search=finalquery)
    .annotate(rank=SearchRank(vector, finalquery))
)
Is there any documentation about this? Thanks!
Yes, this is all documented.
When you write filter(search=finalquery) you're not specifying a lookup type.
As a convenience, when no lookup type is provided (as in Entry.objects.get(id=14)), the lookup type is assumed to be exact.
So you're filtering on an exact match for "masterpiece". What you probably want is contains or icontains.
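A minimal sketch of the contains-style lookup suggested above (the field name thefieldIwanttosearch is copied from the question; the rest of the queryset setup is assumed). Note that this is a plain case-insensitive substring match, not a ranked full-text search, so the SearchRank annotation does not apply here:
# icontains performs a case-insensitive substring match, so "masterpiece"
# is found even inside "PLAZA/MASTERPIECE"
self.search_results = self.search_results.filter(
    thefieldIwanttosearch__icontains="masterpiece"
)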
If I have a text string to be vectorized, how should I handle numbers inside it? Or, if I feed a neural network with numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In that case all strings will become arrays of numbers. How should I handle characters that are digits? How do I output a vector that does not mix word indices with digit characters?
Does converting numbers to strings weaken the information I feed the network?
Expanding on your discussion with @user1735003, let's consider both ways of representing numbers:
Treating the number as a string, considering it just another word, and assigning it an ID when forming the dictionary; or
Converting the numbers to actual words: '1' becomes 'one', '2' becomes 'two', and so on.
Does the second option change the context in any way? To verify this, we can compute the similarity of the two representations using word2vec; the scores will be high if they occur in similar contexts.
For example,
1 and one have a similarity score of 0.17, and 2 and two have a similarity score of 0.23. This suggests that the contexts in which they are used are quite different.
By treating a number as just another word you are not changing the context, whereas with any other transformation of those numbers you cannot guarantee the result is better. So it is better to leave them untouched and treat each number as another word.
Note: both word2vec and GloVe were trained by treating numbers as strings (case 1).
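A rough sketch of how such a comparison might be reproduced, assuming the pretrained glove-wiki-gigaword-100 vectors available through gensim's downloader; the exact scores depend on which vectors you use, so they will not necessarily match the numbers quoted above:
import gensim.downloader as api

# Downloads the pretrained GloVe vectors on first use (a sizeable one-off download)
wv = api.load("glove-wiki-gigaword-100")

for digit, word in [("1", "one"), ("2", "two")]:
    if digit in wv and word in wv:
        print(digit, word, wv.similarity(digit, word))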
The link you provide suggests indexing everything that results from a .split(' '): words, but also numbers, possibly smileys, and so on (I would still take care of punctuation marks). Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
    for word in tweet.split(" "):
        if word not in dictionary:
            dictionary[word] = i
        i += 1
print(dictionary)
# {'my': 1, '3': 4, 'car': 2, 'number': 3}
The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf
Specifically, page 7.
Before falling back to an <unknown> tag, they try to replace alphanumeric symbol combinations with common pattern-name tags, such as:
FourDigits (good for years)
I've tried to implement it and it gave great results.
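A minimal sketch of that idea in Python; FourDigits comes from the slides, while the other pattern names and the exact regular expressions are hypothetical additions for illustration:
import re

# Ordered list of (regex, tag) pairs; the first match wins
PATTERNS = [
    (re.compile(r"^\d{4}$"), "FourDigits"),        # e.g. years like 1984
    (re.compile(r"^\d+$"), "OtherNumber"),
    (re.compile(r"^[A-Za-z]+\d+$"), "LettersAndDigits"),
]

def map_token(token, vocabulary):
    if token in vocabulary:
        return token
    for pattern, tag in PATTERNS:
        if pattern.match(token):
            return tag
    return "<unknown>"

print(map_token("1984", {"hello"}))    # FourDigits
print(map_token("xyz42", {"hello"}))   # LettersAndDigits
print(map_token("qwerty", {"hello"}))  # <unknown>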
I find the tokenization code quite complicated, and I still couldn't find where in the code the sentences are split.
For example, how does the tokenizer know that
Mr. Smitt stayed at home. He was tired
should not be split at "Mr." but should be split before "He"? And where in the code does the split before "He" happen?
(In fact, I am unsure whether I am looking in the right place: if I search for sents in tokenizer.pyx I don't find any occurrence.)
You access the splits via the doc object, with the generator:
doc.sents
The output of the generator is a series of spans.
As for how the splits are chosen: the document is parsed for dependency relationships. Understanding the parser is not trivial; you'll have to read into it if you want the details. It uses a neural network to decide how to construct the dependency trees, and the sentence boundaries are the gaps between tokens that no dependency crosses. This is not simply wherever a full stop appears, and the method is more robust as a result.
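A short sketch using the example from the question (the en_core_web_sm model name is an assumption; any English pipeline with a parser should behave similarly):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. Smitt stayed at home. He was tired")

# doc.sents is a generator of Span objects, one per sentence
for sent in doc.sents:
    print(repr(sent.text))
# expected output:
# 'Mr. Smitt stayed at home.'
# 'He was tired'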
I'm on OS X, and in Objective-C I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long, I'm trying to figure out whether this is feasible before I actually write the system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to search through all the possible words that can be made from the collection of letters.
Basically, you chop off the first letter of your letter collection and add it to the current word you are forming. If that makes a word (e.g. a dictionary lookup succeeds), add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But you don't have to stop there: keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;

    add first letter from l to w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l)
        remove w from s
    }
    FindWords(sentences, s, w, l)
    put last letter from w back onto l
}
There are, of course, a number of optimizations you could perform to make it go faster, for instance checking whether the current fragment is the stem (prefix) of any word in the dictionary. But this is the basic approach that will give you all possible sentences.
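For reference, here is a runnable Python sketch of the same recursive idea; the tiny word set below is purely illustrative, and a real implementation would load a proper dictionary file:
def find_sentences(letters, dictionary, word="", sentence=None, results=None):
    # Enumerate every way of segmenting `letters` into dictionary words
    if sentence is None:
        sentence, results = [], []
    if not letters and not word:
        results.append(" ".join(sentence))
        return results
    if not letters:
        return results
    # Move the next letter from the remaining input onto the current word
    word += letters[0]
    rest = letters[1:]
    if word.lower() in dictionary:
        # Accept the word and start building a fresh one
        find_sentences(rest, dictionary, "", sentence + [word], results)
    # Also try extending the current word with more letters
    find_sentences(rest, dictionary, word, sentence, results)
    return results

words = {"bob", "ate", "a", "at", "tea", "green", "apple"}
print(find_sentences("Bobateagreenapple", words))
# ['Bob a tea green apple', 'Bob ate a green apple']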
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on the PDF and correct its output than it would to correct what this system might give you, let alone to program the system in the first place.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up the most characters (preferably all of them), and then to favor the ones with the longest words, because 2-, 3- or 4-character words can often come up by chance from leftover characters. Most of the time this produces the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50,000 words).