How to find ngram frequency of a column in a pandas dataframe? - pandas

Below is the input pandas dataframe I have.
I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below.
How can I do this using nltk or scikit-learn?
I wrote the code below, which takes a string as input. How do I extend it to a Series/DataFrame?
import nltk
from nltk.collocations import BigramCollocationFinder

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # viewitems() is Python 2; use items() on Python 3
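For reference, a hedged sketch of extending this to a whole column with nltk itself (the scikit-learn answer below is an alternative); the df here is a stand-in for your own DataFrame:

import nltk
import pandas as pd
from nltk.collocations import BigramCollocationFinder

# Hypothetical stand-in column; replace with your own DataFrame.
df = pd.DataFrame({'description': ['john is a guy person', 'you him guy person you him']})

# Tokenize every row, then flatten into a single token stream.
tokens_per_row = df['description'].apply(nltk.word_tokenize)
all_tokens = [tok for row in tokens_per_row for tok in row]

# Unigram frequencies over the whole column.
unigram_freq = nltk.FreqDist(all_tokens)

# Bigram frequencies over the whole column (note: bigrams built this way can span row boundaries).
finder = BigramCollocationFinder.from_words(all_tokens)
bigram_freq = dict(finder.ngram_fd.items())

print(unigram_freq.most_common(5))
print(sorted(bigram_freq.items(), key=lambda kv: -kv[1])[:5])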

If your data is like
import pandas as pd
df = pd.DataFrame([
'must watch. Good acting',
'average movie. Bad acting',
'good movie. Good acting',
'pathetic. Avoid',
'avoid'], columns=['description'])
You could use CountVectorizer from scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies,
             index=word_vectorizer.get_feature_names_out(),  # use get_feature_names() on scikit-learn < 1.0
             columns=['frequency'])
Which gives you:
                frequency
acting                  3
average                 1
average movie           1
avoid                   2
bad                     1
bad acting              1
good                    3
good acting             2
good movie              1
movie                   2
movie bad               1
movie good              1
must                    1
must watch              1
pathetic                1
pathetic avoid          1
watch                   1
watch good              1
EDIT
fit just "trains" your vectorizer: it splits the words of your corpus and builds a vocabulary from them. transform then takes a document and creates a frequency vector based on that vocabulary.
Here the set you want to transform is the same as your training set, so you can do both at the same time (fit_transform). Because you have 5 documents, it creates a matrix of 5 row vectors. You want a single global vector, so you have to sum over the rows.
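For illustration only, the same pipeline spelled out in two steps (an equivalent sketch, reusing the df defined above):

from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
word_vectorizer.fit(df['description'])                        # learn the unigram/bigram vocabulary
sparse_matrix = word_vectorizer.transform(df['description'])  # one count row vector per document
print(sparse_matrix.shape)                                    # (5, vocabulary size)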
EDIT 2
For big dataframes, you can speed up the frequency computation by using:
frequencies = sum(sparse_matrix).data
or
frequencies = sparse_matrix.sum(axis=0).T
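If you go with the second variant, the result is a (1, n_features) matrix, so one way (a sketch, reusing the objects above) to get back to the same frequency DataFrame is:

import numpy as np

frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()   # flatten the (1, n_features) result
pd.DataFrame(frequencies,
             index=word_vectorizer.get_feature_names_out(),
             columns=['frequency'])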

Related

Plotting a graph of the top 15 highest values

I am working on a dataset which shows the budget spent on movies. I want to make a plot of the top 15 highest-budget movies.
# sort the 'budget' column in descending order and store it in a new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending=False))
info['original_title'] = dp['original_title']
data = list(map(str, info['original_title']))
# extract the top 10 budget movies from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to the data analysis scene, so I'm confused about how else to go about this problem.
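As an aside, the FutureWarning itself names the fix: use position-based .iloc slicing. A minimal sketch of that change, assuming the data and info variables from the question:

x = list(data[:10])
y = list(info['budget'].iloc[:10])  # positional slice, keeps the old behavior without the warning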
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.
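For example, a possible customization (these arguments are generic pandas/matplotlib plotting options, not something required by this dataset):

import matplotlib.pyplot as plt

ax = df.nlargest(n=15, columns="budget").plot.bar(
    x="movie_title", y="budget",
    figsize=(12, 6),   # wider figure so the titles fit
    color="steelblue",
    legend=False,
    rot=45,            # tilt the x-axis labels
    title="Top 15 movie budgets",
)
ax.set_ylabel("budget")
plt.tight_layout()
plt.show()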

outlier detection with z-score, but

I wrote some code for outlier detection in Python. I used the z-score method to do this. You can see my data and my code below.
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 10, 15, 20, 25, 30, 36, 22]
data.append(180)
data = pd.DataFrame(data, columns=["Data"])
z = np.abs(stats.zscore(data))
print(z)
print(np.where(z > 1.5))
I wrote this code to detect outliers. Actually, I wanted to get the indices of the values with a z-score higher than 1.5, but I think something is wrong with the output.
Data
0 0.649600
1 0.551506
2 0.453412
3 0.355318
4 0.257224
5 0.159130
6 0.041417
7 0.316080
8 2.783688
(array([8], dtype=int64), array([0], dtype=int64))
The 8th element's z-score is higher than 1.5 and it is already shown in the output; I'm okay with that, but the 0th element's z-score is 0.64. What am I doing wrong?
Because data is a one-column DataFrame, np.where returns a pair of (row, column) index arrays; the array([0]) is the column index of the match, not a second flagged value. If you keep the data one-dimensional, you could do something like this:
import numpy as np
from scipy import stats
data =[5,10,15,20,25,30,36,22]
data.append(180)
z = stats.zscore(data)
np.where(z > 1.5)[0]
output:
array([8])
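If you would rather keep the DataFrame, a sketch along the same lines (same numbers as in the question) that flags the outlier rows directly:

import numpy as np
import pandas as pd
from scipy import stats

data = pd.DataFrame([5, 10, 15, 20, 25, 30, 36, 22, 180], columns=["Data"])
z = np.abs(stats.zscore(data["Data"]))  # z-scores of the single column, 1-D
outliers = data.index[z > 1.5]          # row labels with |z| > 1.5
print(outliers.tolist())                # [8]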

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() command, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())  # get_feature_names() on older scikit-learn
When I now output the resulting data frame "count_vect_df", it contains a lot of columns which are empty / contain only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
CountVectorizer returns a sparse matrix, in which most values are zero; the non-zero values represent the number of times a specific term appeared in a particular document.
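If the goal is simply to avoid materialising all those zeros, one option is to keep the counts sparse on the pandas side as well; a sketch, assuming the df from the question and a reasonably recent pandas (0.25+) and scikit-learn (1.0+):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df_x = df["categorized.Hashtags"]
vect = CountVectorizer()
X = vect.fit_transform(df_x)

# Sparse-backed DataFrame: the zero counts are not stored densely in memory.
count_vect_df = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names_out())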

Pandas df to lists and sublists

I have a pandas data frame with one column and 100 rows (each cell is a paragraph). I would like to create a list of sublists to perform LDA and get the topics.
Ex:
S.No Text
0 abc
1 def
2 ghi
3 jkl
4 mno
I want the result to be a list of sublists
"[[abc]
[def]
[ghi]
[jkl]
[mno]]"
So that I can tokenize the sentences into words and perform LDA
Any ideas?
I think you don't need a list of sublists to convert your sentences into tokens. You can do it this way (below) and modify it further from here to get whatever output you want:
import pandas as pd
from nltk.tokenize import word_tokenize

# example
df = pd.DataFrame({'text': ['how are you', 'paris is good', 'fish is in water', 'we play tomorrow']})
# tokenize sentences
df['token_text'] = df.text.apply(word_tokenize)
print(df)
text token_text
0 how are you [how, are, you]
1 paris is good [paris, is, good]
2 fish is in water [fish, is, in, water]
3 we play tomorrow [we, play, tomorrow]
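From here, df['token_text'] already has the list-of-lists shape that LDA tooling usually expects. A hedged sketch of the next step with gensim (gensim is an assumption; the question did not name an LDA library), reusing the df above:

from gensim import corpora, models

texts = df['token_text'].tolist()              # list of token lists, one per row
dictionary = corpora.Dictionary(texts)         # map each token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())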
YOLO's answer is very good and may be what you're looking for. Alternatively, if you are trying to use LDA and want the "list of sublists", it may be better to use arrays, which will work with any numpy function. To do so you can just use:
df.values
or if you only want specific columns you could do
df.loc[:, [col1, col2]].values
If you must have them as a list of lists then you can do
[list(x) for x in df.values]

Log values by SFrame column

Please, can anybody tell me how I can take the logarithm of every value in an SFrame (graphlab) or DataFrame (pandas) column, without iterating through the whole length of the column?
I am especially interested in something similar to the GroupBy aggregators, but for the log function. I couldn't find it myself...
Important: I am not interested in a for-loop iteration over the whole length of the column. I am only interested in a specific function that transforms all values of the whole column into their log values.
I'm also very sorry if this function is in the manual; please just give me a link...
numpy provides implementations of a wide range of basic mathematical transformations. You can use them on all data structures that build on numpy's ndarray.
import pandas as pd
import numpy as np
data = pd.Series([np.exp(1), np.exp(2), np.exp(3)])
np.log(data)
Outputs:
0    1.0
1    2.0
2    3.0
dtype: float64
This example is for pandas data types, but it works for all data structures that are based on numpy arrays.
The same "apply" pattern works for SFrames as well. You could do:
import graphlab
import math
sf = graphlab.SFrame({'a': [1, 2, 3]})
sf['b'] = sf['a'].apply(lambda x: math.log(x))
#cel
I think in my case it could also be possible to use the following pattern.
import numpy
import pandas
import graphlab
df
a b c
1 1 1
1 2 3
2 1 3
....
df['log c'] = df.groupby('a')['c'].apply(lambda x: numpy.log(x))
For an SFrame (an sf object instead of df) it could look a little different:
logvals = numpy.log(sf['c'])
log_sf = graphlab.SFrame(logvals)
sf = sf.join(log_sf, how = 'outer')
Probably with numpy the code fragment is a little bit too long, but it works...
The main problem is of course time performance. I had hoped I could find some specific function to minimize my time...
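For what it's worth, the groupby in the snippet above does not change the result, because numpy.log is applied element-wise anyway; a shorter vectorized sketch (pandas side, with made-up data matching the layout above) would be:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 1], 'c': [1, 3, 3]})
df['log c'] = np.log(df['c'])  # vectorized: no loop and no groupby needed
print(df)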