I'm a newbie to text analysis in R. Is there an easy way to apply syuzhet::get_nrc_sentiment to a corpus with x elements (loaded from individual text files)? I'm guessing you need to turn the corpus back into a combined plain text file and use that?
s<-get_nrc_sentiment(.....)
thanks
Solved: use unlist on the corpus, e.g. data.frame(text = unlist(sapply(a, `[`, "content")), stringsAsFactors = FALSE), then run get_nrc_sentiment on the text column of that data frame.
Recently, I came across Harfbuzz for text shaping, specifically for Indic texts. In my previous experience, I used ArabicShaping for shaping Arabic characters. In this case, the input is the pre-shaped text and the output is the shaped one.
In Harfbuzz, however, I can see that the shape method shapes the text and returns glyphs and clusters instead. My objective is to convert the pre-shaped text into a shaped one. I don't want to draw or view the text; I just want a char[] that contains the shaped text (just like in the case of ArabicShaping).
Is there any way the above can be achieved using Harfbuzz? If not, is there any workaround?
Am I using Harfbuzz for solving the correct problem? Is there any other library that I can use to achieve this?
ArabicShaping must have confused you. There's no such thing as "pre-shaped text" in general. What do you mean by "convert the pre-shaped text to a shaped one"? Shaping, which is what HarfBuzz does, converts from characters to glyphs. The reverse is a non-deterministic process that HarfBuzz does NOT provide.
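To make the characters-to-glyphs point concrete, here is a minimal sketch using HarfBuzz's Python bindings (uharfbuzz); the font path is a placeholder and the exact API may vary between versions, so treat it as an illustration rather than a recipe. Note that what comes out is glyph IDs and cluster indices, not a "shaped" character array:
import uharfbuzz as hb

# Load a font; the filename below is only a placeholder.
with open("NotoSansDevanagari-Regular.ttf", "rb") as f:
    blob = hb.Blob(f.read())
face = hb.Face(blob)
font = hb.Font(face)

# Put the Unicode text into a buffer and shape it.
buf = hb.Buffer()
buf.add_str(u"\u0915\u093f")  # DEVANAGARI KA + VOWEL SIGN I, which shaping reorders
buf.guess_segment_properties()
hb.shape(font, buf, {})

# The result is glyph IDs plus the cluster (character index) each glyph maps back to.
for info in buf.glyph_infos:
    print(info.codepoint, info.cluster)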
I am looking for ways to change the paper size throughout a PDF document. I know that I can specify classoption: a3paper for the entire document in the YAML header. I also know that I can change margins with the geometry package (\newgeometry{· · ·} and \restoregeometry) throughout a document. Unfortunately, the geometry package has no option to change the paper size partway through a document.
I would like to do something like this but with paper size instead.
Is it even possible?
I am asking because I have some wide tables in my document where letters and numbers overlap when a4paper (or a5paper/a6paper) is specified. Other tables are narrow and I would like to have them bigger.
My table output is not from kable or any other easily modifiable package output, e.g. xtable, so I can't modify the dimensions of my tables in my code.
Any help is much appreciated. Thank you.
The geometry package knows about a3paper, so the following works for me
---
output: pdf_document
geometry: a3paper
---
test
producing a PDF with page size "841.89 x 1190.55 pts" (A4 would be "595.276 x 841.89 pts"). For readability you should use at least two columns for the text, though.
I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generates, but instead I get a list of codes, not words. What does this mean and how do I deal with the problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE:
I managed to deal with this problem, but now when I look at my words I see many entries that are not actually words but senseless strings of letters (see screenshot attached). Does anybody know how to filter these out before I use CountVectorizer?
You are using min_df=1, which keeps every word that is found in at least one document, i.e. all the words. min_df can be treated as a hyperparameter itself: raising it removes the rarest tokens, those that appear in only a handful of documents. I would also recommend using spaCy to tokenize the words and join them back into strings before giving them as input to CountVectorizer; a sketch of this follows below.
Note: the feature names that you see are actually part of your vocabulary. It's just noise. If you want to remove them, set min_df > 1.
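A minimal sketch of that suggestion, assuming spaCy's small English model is installed and that df['message_encoding'] holds the raw strings from the question:
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

def spacy_tokenize(text):
    # keep lowercased alphabetic, non-stopword tokens and rejoin them into one string
    return " ".join(tok.lower_ for tok in nlp(text) if tok.is_alpha and not tok.is_stop)

cleaned = df['message_encoding'].apply(spacy_tokenize)

# min_df=2 drops tokens that appear in only a single document (most of the noise)
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(cleaned)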
Here is what you can do to get exactly what you want:
vectorizer=CountVectorizer()
vectorizer.fit_transform(df['message_encoding'])
feat_dict=vectorizer.vocabulary_.keys()
Instead of vectorizer.get_feature_names() you can write vectorizer.vocabulary_.keys() to get the words.
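For completeness, a quick toy example (none of this data is from the question) showing what the two calls return; vocabulary_ maps each token to its column index, while get_feature_names() lists the same tokens sorted:
from sklearn.feature_extraction.text import CountVectorizer

toy = ["the cat sat on the mat", "the dog sat"]
vec = CountVectorizer()
vec.fit_transform(toy)

print(vec.get_feature_names())       # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vec.vocabulary_)               # {'the': 5, 'cat': 0, 'sat': 4, 'on': 3, 'mat': 2, 'dog': 1}
print(list(vec.vocabulary_.keys()))  # the same tokens, in insertion order rather than sorted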
If I have a text string to be vectorized, how should I handle numbers inside it? Or if I feed a Neural Network with numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In that case all strings become arrays of numbers. How should I handle characters that are numbers? How do I output a vector that does not mix the word indices with the number characters?
Does converting numbers to strings weaken the information I feed the network?
Expanding on your discussion with #user1735003 - let's consider both ways of representing numbers:
Treating a number as a string, i.e. considering it as just another word and assigning it an ID when forming the dictionary; or
Converting the numbers to actual words: '1' becomes 'one', '2' becomes 'two', and so on.
Does the second one change the context in any way? To verify this, we can measure the similarity of the two representations using word2vec. The scores will be high if they are used in similar contexts.
For example,
1 and one have a similarity score of 0.17, and 2 and two have a similarity score of 0.23. This suggests that the contexts in which they are used are quite different.
By treating the numbers as just another word, you are not changing the context, but by applying any other transformation to those numbers you can't guarantee it is for the better. So it's better to leave them untouched and treat them as another word.
Note: both word2vec and GloVe were trained by treating the numbers as strings (case 1).
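The check above can be reproduced roughly as follows; this sketch uses pretrained GloVe vectors fetched through gensim's downloader as a stand-in for the answer's word2vec model, so the exact scores will differ:
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors, downloaded on first use

print(vectors.similarity("1", "one"))  # low score: the digit and the word occur in different contexts
print(vectors.similarity("2", "two"))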
The link you provide suggests that everything resulting from a .split(' ') is indexed -- words, but also numbers, possibly smileys, and so on (I would still take care of punctuation marks). Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
    for word in tweet.split(" "):
        if word not in dictionary: dictionary[word] = i
        i += 1
print(dictionary)
# {'my': 1, '3': 4, 'car': 2, 'number': 3}
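As a small follow-up sketch, once the dictionary exists any string can be turned into an array of indices; the 0 used for unknown words is an assumption of this sketch, not part of the linked code:
def encode(tweet, dictionary):
    return [dictionary.get(word, 0) for word in tweet.split(" ")]

print(encode('my car number 3', dictionary))
# [1, 2, 3, 4]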
The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf
Specifically, page 7.
Before they fall back to an <unknown> tag, they try to replace alphanumeric symbol combinations with common pattern-name tags, such as:
FourDigits (good for years)
I've tried to implement it and it gave great results.
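A rough sketch of that idea in Python; the tag names and regular expressions here are illustrative choices loosely following the slides, not a fixed standard:
import re

# Ordered list of (pattern, tag); the first match wins.
PATTERNS = [
    (re.compile(r"^\d{4}$"), "FourDigits"),            # e.g. years such as 1984
    (re.compile(r"^\d+$"), "OtherNumber"),
    (re.compile(r"^[A-Za-z]+\d+$"), "LettersDigits"),  # e.g. codes such as x007
]

def normalize_token(token):
    for pattern, tag in PATTERNS:
        if pattern.match(token):
            return tag
    return token

print([normalize_token(t) for t in "built in 1984 by agent x007".split()])
# ['built', 'in', 'FourDigits', 'by', 'agent', 'LettersDigits']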
I'm using the libsvm (3.11) tool to implement SVM classification in my project (text classification using multi-agent). But every time I predict, it gives the same label to all the test documents, i.e. either +1 or -1, even though I'm using different kinds of data.
I'm using the following procedure for running libsvm classification on plain text documents:
-> There is a set of training text documents.
-> I convert these text documents into the libsvm-supported format using TF-IDF weights (I take two folders representing two classes: for the 1st folder I assign the label -1 and for the 2nd folder +1, followed by the TF-IDF values for that text document).
-> After that I put the bag of words into one plain text document, and then using those words I generate the test document vector with some label (I take only one test document, so the IDF will always be 1 and there will be only one vector ... I hope the label doesn't matter).
-> After that I apply the libsvm functions svm_train and svm_predict with default options.
Is my procedure correct? If anything in it is wrong, please feel free to tell me; it will really help me.
And why does libsvm always give only one label as the result? Is this a fault in my procedure, or a problem with the tool?
Thanks in advance.
Why are you using a new criterion to make test documents? The testing and training document sets should all be derived from your original set of "training text documents". I put these in quotes because you could take a subset of them and use them for testing. Ultimately, make sure your training and testing text document sets are distinct and both come from the original set.
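A hedged sketch of that advice using scikit-learn, whose SVC wraps libsvm; the tiny toy corpus below is purely illustrative and stands in for the two labelled folders from the question:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = ["win a free prize now", "cheap meds free offer", "meeting agenda attached",
        "project status report", "free cash prize offer", "see attached project notes"]
labels = [-1, -1, +1, +1, -1, +1]

# Split the ORIGINAL collection into distinct training and testing subsets.
docs_train, docs_test, y_train, y_test = train_test_split(docs, labels, test_size=0.33, random_state=0)

# Fit the TF-IDF weights on the training documents only, then reuse them for the test documents.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(docs_train)
X_test = tfidf.transform(docs_test)

clf = SVC(kernel="linear").fit(X_train, y_train)
print(clf.predict(X_test), y_test)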