map vectorised terms to the original dataframe - pandas

I have a dataframe column that contains domain names, e.g. newyorktimes.com. I split each name on '.' and apply CountVectorizer to the first part ("newyorktimes").
The dataframe:
domain             split          country
newyorktimes.com   newyorktimes   usa
newyorkreport.com  newyorkreport  usa
The first part of each domain (e.g. "newyorktimes") is stored in a new dataframe column called 'split'.
I'm able to get the term frequencies:
from sklearn.feature_extraction.text import CountVectorizer

vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['split'])
features = vectoriser.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
count = X.toarray().sum(axis=0)  # total occurrences of each bigram
dic = dict(zip(features, count))
dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)
But I also need the 'country' information from the original dataframe and I don't know how to map the terms back to the original dataframe.
Expected output:
term         country  domain count
new york     usa      2
york times   usa      1
york report  usa      1

I cannot reproduce the example you provided, and I am not sure the input you gave to CountVectorizer is correct. If it is only a matter of adding the count matrix back to the dataframe, you can do it like this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'corpus': ['This is the first document.',
                              'This document is the second document.',
                              'And this is the third one.',
                              'Is this the first document?']})
vectoriser = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
X = vectoriser.fit_transform(df['corpus'])
features = vectoriser.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# Append one column per bigram, aligned on the original index
pd.concat([df, pd.DataFrame(X.toarray(), columns=features, index=df.index)], axis=1)
                                  corpus  document second  second document
0            This is the first document.                0                0
1  This document is the second document.                1                1
2             And this is the third one.                0                0
3            Is this the first document?                0                0
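To get the long format from the question's expected output (one row per term and country), the terms can also be exploded next to the original rows and then grouped. The following is only a sketch under two assumptions that are not in the question: the 'split' column already holds space-separated words (CountVectorizer cannot split "newyorktimes" by itself), and the bigrams are built with plain pandas instead of CountVectorizer, so no stop words are removed. The helper name bigrams is made up for illustration.

```python
import pandas as pd

# Assumption: 'split' is already segmented into words.
df = pd.DataFrame({
    "domain": ["newyorktimes.com", "newyorkreport.com"],
    "split": ["new york times", "new york report"],
    "country": ["usa", "usa"],
})


def bigrams(text):
    """Return the word bigrams of a whitespace-separated string."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


# One row per (original row, bigram); the country column travels along.
terms = (
    df.assign(term=df["split"].map(bigrams))
      .explode("term")
      .dropna(subset=["term"])  # rows with fewer than two words yield no bigram
)

# Count how many rows contain each bigram, per country.
result = (
    terms.groupby(["term", "country"], sort=False)
         .size()
         .reset_index(name="count")
         .sort_values("count", ascending=False)
         .reset_index(drop=True)
)
print(result)
```

Because the country column stays attached to every exploded row, any other column of the original dataframe can be added to the groupby keys in the same way.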

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. I have read the tables using the skiprows and nrows parameters, which works for now, but it will stop working in the future because the table length is dynamic.
Is there any other approach, apart from skiprows and nrows, to read the tables as shown in the picture?
I want to read data1 as one table and data2 as another. In particular, I want the columns "RR", "FF" and "WW" of each table as two separate dataframes.
I would appreciate it if someone could help or guide me.
Method I have tried:
import glob
import pandas as pd

all_files = glob.glob(INPATH + "*sample*")
df1 = pd.read_excel(all_files[0], skiprows=11, nrows=3)
df2 = pd.read_excel(all_files[0], skiprows=23, nrows=3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one of your image, here is one way to do it:
import pandas as pd

df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)

# Setup: locate the row where each table starts
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]

dfs = []
for i in range(len(indices)):
    # Slice df from this table's first row up to the row before the next table
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        # Last table: take everything to the end of the sheet
        data = df.loc[indices[i] :, :]
    # Within the slice, keep only the rows from the header row (the one starting with 'rr') onwards
    r_idx = data.loc[data["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup: promote the header row to column names
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout
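The same idea can be wrapped in a small function so that the table length never has to be hard-coded. This is a sketch, not a tested reader for the actual workbook: split_tables is a hypothetical helper, and the in-memory frame raw is a made-up stand-in for what pd.read_excel("file.xlsx", header=None) might return for a sheet with two marker rows ("Data1", "Data2") each followed by a header row starting with "rr".

```python
import numpy as np
import pandas as pd


def split_tables(raw, markers, header_key):
    """Split a raw sheet into one DataFrame per marker row.

    Assumes `raw` has a default RangeIndex and that markers and the
    header key appear in the first column.
    """
    first_col = raw.iloc[:, 0]
    # Row position where each table begins; the next marker (or the end) bounds it.
    starts = [first_col[first_col == m].index[0] for m in markers]
    stops = starts[1:] + [len(raw)]
    tables = []
    for start, stop in zip(starts, stops):
        block = raw.iloc[start:stop].dropna(how="all")
        is_header = (block.iloc[:, 0] == header_key).to_numpy()
        header_row = block.index[is_header][0]
        # Everything below the header row is data, however many rows there are.
        table = block.loc[header_row + 1:].dropna(how="all", axis=1)
        table.columns = raw.loc[header_row, table.columns].tolist()
        tables.append(table.reset_index(drop=True))
    return tables


# Made-up stand-in for the sheet; the tables have different lengths.
raw = pd.DataFrame([
    ["Data1", np.nan, np.nan],
    ["rr", "ff", "ww"],
    ["car1", 1000000, "sellout"],
    ["car2", 1500000, "to be sold"],
    [np.nan, np.nan, np.nan],
    ["Data2", np.nan, np.nan],
    ["rr", "ff", "ww"],
    ["car3", 1300000, "sellout"],
])

data1, data2 = split_tables(raw, ["Data1", "Data2"], header_key="rr")
print(data1)
print(data2)
```

Since each table is bounded by the next marker row rather than by a row count, adding or removing rows in either table does not require changing the code.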

Use an index and column from one lookup dataframe to create a new column in another dataframe

I have a dataframe for looking up values:
ruralw2 = [[0.1,0.3,0.5], [0.1,0.2,0.8], [0.1,0.2,0.7], [0.1,0,0.3]]
rw2 = pd.DataFrame(data=ruralw2, columns=['city','suburbs','rural'],index=['low','med','high','v-high'])
and then I have another dataframe where I want to get 'p' values based on the data in the rw2 dataframe:
df = pd.DataFrame(columns=['location','income','p'])
df['location'] = ['city','city','suburbs','rural','rural']
df['income'] = ['low','med','high','v-high','med']
What I expect is this:
It's possible to use a for loop, but that is an antipattern in pandas and I think there should be a better way.
for i in np.arange(df.shape[0]):
    df['p'][i] = rw2.loc[df['income'][i], df['location'][i]]
Another possibility is to write a very long np.where(...) expression, but that doesn't feel right either and it wouldn't scale well.
You can use stack on rw2 and reindex with both the income and location columns of df, like:
df['p'] = rw2.stack().reindex(df[['income', 'location']]).to_numpy()
  location  income    p
0     city     low  0.1
1     city     med  0.1
2  suburbs    high  0.2
3    rural  v-high  0.3
4    rural     med  0.8
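If passing a DataFrame directly to reindex does not behave as expected in your pandas version, the lookup keys can be made explicit. The sketch below builds the keys with pd.MultiIndex.from_frame; the variable name keys is only illustrative.

```python
import pandas as pd

rw2 = pd.DataFrame(
    [[0.1, 0.3, 0.5], [0.1, 0.2, 0.8], [0.1, 0.2, 0.7], [0.1, 0, 0.3]],
    columns=["city", "suburbs", "rural"],
    index=["low", "med", "high", "v-high"],
)
df = pd.DataFrame({
    "location": ["city", "city", "suburbs", "rural", "rural"],
    "income": ["low", "med", "high", "v-high", "med"],
})

# stack() turns rw2 into a Series indexed by (income, location) pairs,
# so each row of df can be looked up by an explicit pair of keys.
keys = pd.MultiIndex.from_frame(df[["income", "location"]])
df["p"] = rw2.stack().reindex(keys).to_numpy()
print(df)
```

Pairs that do not exist in rw2 come back as NaN rather than raising an error.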
You can use reset_index to bring the income values into the dataframe, followed by pd.melt to restructure it into your result format. You can then merge this new dataframe with df.
Step 1:
rw2_reset = rw2.reset_index()
rw2_reset
Step 2:
rw2_melt = pd.melt(rw2_reset, id_vars='index', value_vars=['city', 'suburbs', 'rural'])
rw2_melt.rename(columns={'index': 'income', 'variable': 'location', 'value': 'p'}, inplace=True)
rw2_melt
Step 3:
result = pd.merge(df, rw2_melt, on=['location', 'income'], how='left').drop(columns='p_x').rename(columns={'p_y': 'p'})
result
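A further option, sketched here as an alternative to the answers above rather than as their method, is to translate the labels into integer positions and index the underlying NumPy array directly. The names row_pos and col_pos are illustrative.

```python
import pandas as pd

rw2 = pd.DataFrame(
    [[0.1, 0.3, 0.5], [0.1, 0.2, 0.8], [0.1, 0.2, 0.7], [0.1, 0, 0.3]],
    columns=["city", "suburbs", "rural"],
    index=["low", "med", "high", "v-high"],
)
df = pd.DataFrame({
    "location": ["city", "city", "suburbs", "rural", "rural"],
    "income": ["low", "med", "high", "v-high", "med"],
})

# get_indexer returns the integer position of each label, or -1 if it is missing.
row_pos = rw2.index.get_indexer(df["income"])
col_pos = rw2.columns.get_indexer(df["location"])

# Guard against -1, which would otherwise silently select the last row or column.
if (row_pos < 0).any() or (col_pos < 0).any():
    raise KeyError("some income/location values are not present in rw2")

# Pairwise fancy indexing: one value per (row, column) pair.
df["p"] = rw2.to_numpy()[row_pos, col_pos]
print(df)
```

The explicit check matters because a missing label does not raise by itself; it yields position -1.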

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing, and I want to find out which rows contain a question and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
                            QUERY  FREQ
0         hindi movies for adults   595
1             are panda dogs real   383
2        asuedraw winning numbers   478
3         sentry replacement keys   608
4  rebuilding nicad battery packs   541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function that checks whether the QUERY column contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis=1)

def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0

df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But when I check the new dataframe, the new column has been created and yet all its values are 0. I'm not sure whether the logic in my function is wrong. I've used something similar with dataframe columns that hold just one word, where a match outputs 1 and anything else 0. However, the same logic doesn't seem to work when the column contains a phrase or sentence, as in this use case. Any input is really appreciated!
If you wish to check whether any substring from questions_list occurs in a string from the dataframe, you should use the str.contains method:
import re

questions_list = ["what","when","where","which","who","whom","whose","why",
                  "why don't", "how","how far","how long","how many",
                  "how much","how old","how come","?"]
# Generate a regex from your list; re.escape is needed because "?" is a regex metacharacter
pattern = "|".join(map(re.escape, questions_list))
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
    'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
    'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way.
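Plain substring matching also flags words that merely contain a question word (for example "how" inside "showhow"), and str.contains returns booleans rather than the 1/0 the question asks for. A possible refinement, sketched here with made-up sample queries and a shortened word list, is to wrap the escaped words in word boundaries, match a literal question mark separately, and cast the result to int.

```python
import re

import pandas as pd

questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why", "how"]

# Made-up sample data for illustration.
df = pd.DataFrame({
    "QUERY": [
        "how do you like it",
        "what's going on?",
        "quick brown fox",
        "showhow tips",
        "is it real?",
    ]
})

# Escape each word, require word boundaries around it, and also accept a literal "?".
# The group is non-capturing so pandas does not warn about match groups.
words = "|".join(map(re.escape, questions_list))
pattern = rf"\b(?:{words})\b|\?"

df["QUESTIONS"] = df["QUERY"].str.contains(pattern, case=False, regex=True).astype(int)
print(df)
```

Multi-word entries such as "how many" are omitted from this shortened list because they are already covered by their first word.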
IIUC, you need to check whether the first word of the string is in the question list and return 1 if it is, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check whether its first element is in the list.
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0

df['QUESTIONS'] = df.apply(questions, axis=1)
You get
                            QUERY  FREQ  QUESTIONS
0         hindi movies for adults   595          0
1             are panda dogs real   383          1
2        asuedraw winning numbers   478          0
3         sentry replacement keys   608          0
4  rebuilding nicad battery packs   541          0
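The first-word check can also be written without apply, using the vectorised string accessor. This is a sketch on a subset of the sample rows with a shortened word set; the names question_words and first_word are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "QUERY": [
        "hindi movies for adults",
        "are panda dogs real",
        "asuedraw winning numbers",
    ],
    "FREQ": [595, 383, 478],
})

# Shortened set of single-word question starters.
question_words = {"are", "what", "when", "where", "which", "who", "whom", "whose", "why", "how"}

# Split on whitespace and take the first token of every query.
first_word = df["QUERY"].str.split().str[0]
df["QUESTIONS"] = first_word.isin(question_words).astype(int)
print(df)
```

Only single words can match here, since the comparison is against the first token; multi-word entries such as "how many" would never equal it.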

groupby all results without resorting

Sort in groupby does not work the way I thought it would.
In the following example, I do not want to group "USA" together because there is one row of "Russia".
from io import StringIO
myst="""india, 905034 , 19:44
USA, 905094 , 19:33
Russia, 905154 , 21:56
USA, 345345, 45:55
USA, 34535, 65:45
"""
u_cols=['country', 'index', 'current_tm']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
When I use groupby I get the following:
df.groupby('country', sort=False).size()
country
india 1
USA 3
Russia 1
dtype: int64
Is there any way I can get results like this?
country
india 1
USA 1
Russia 1
USA 2
You could try this bit of code instead of a direct groupby:
country = []  # initialise the lists
count = []
# The group key increases by 1 every time the country differs from the previous row
for i, g in df.groupby([(df.country != df.country.shift()).cumsum()]):
    country.append(g.country.tolist()[0])  # name of the country for this run
    count.append(len(g.country.tolist()))  # number of consecutive rows in this run
pd.DataFrame(data={'country': country, 'count': count})  # combine the lists into a dataframe
df.groupby([(df.country != df.country.shift()).cumsum()]) groups the rows by a counter that increases by 1 every time the value in the country column changes.
In the for loop, i is the counter value assigned to each run of a country and g holds the corresponding row(s) of your original dataframe.
g.country.tolist() outputs the list of country names for each run (i.e. each value of i):
['india']
['USA']
['Russia']
['USA', 'USA']
for your given data.
Therefore, the first item is the name of the country and the length represents the number of appearances. This info can then be (recorded in a list and then) put together into a dataframe and give the output you require.
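To see why the shift/cumsum expression separates consecutive runs, it may help to look at the intermediate values. This is a small sketch on the country column only; the names changed and run_id are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"country": ["india", "USA", "Russia", "USA", "USA"]})

# True wherever the country differs from the row above (the first row always differs from NaN).
changed = df["country"] != df["country"].shift()
print(changed.tolist())  # [True, True, True, True, False]

# The cumulative sum of the flags gives every run of equal values its own number.
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 2, 3, 4, 4]
```

The last two rows share run number 4, which is why the two trailing "USA" rows end up in one group while the earlier "USA" row stays separate.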
You could also use list comprehensions rather than the for loop:
cumulative_df = df.groupby([(df.country != df.country.shift()).cumsum()])  # groupby object keyed by the run counter
country = [g.country.tolist()[0] for i, g in cumulative_df]  # country name of each run
count = [len(g.country.tolist()) for i, g in cumulative_df]  # number of rows in each run
Reference: Pandas DataFrame: How to groupby consecutive values
Using the trick given in @user2285236's comment:
df['Group'] = (df.country != df.country.shift()).cumsum()
df.groupby(['country', 'Group'], sort=False).size()
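The run counter can also be combined with named aggregation to build the requested table in one step, without a helper column or a loop. A sketch on the country column only, with the illustrative names run_id and result:

```python
import pandas as pd

df = pd.DataFrame({"country": ["india", "USA", "Russia", "USA", "USA"]})

# One number per run of consecutive equal countries.
run_id = (df["country"] != df["country"].shift()).cumsum().rename("run")

# For each run, keep its country name and the number of rows it spans.
result = (
    df.groupby(run_id, sort=False)
      .agg(country=("country", "first"), count=("country", "size"))
      .reset_index(drop=True)
)
print(result)
```

Grouping by the run counter alone is enough, because every run contains a single country by construction.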

Use a dictionary to replace values within a dataframe column

I have a dataframe column of strings. I want to replace specific words in these strings with values from another dataframe, which holds the meaning of each word to be replaced. I am currently using iterrows(), which takes about 2 minutes for 25000 rows. I would like to know if there is a more efficient way of doing this.
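syn = pd.ExcelFile("C:/Key-Value.xlsx")
df_syn = syn.parse("Keys")
for idx, row in df_syn.iterrows():
    # regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
    df['col'] = df['col'].str.replace(r"\b"+row['synonym']+r"\b", row['word'], regex=True)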
IIUC:
Setup
df_syn = pd.DataFrame(dict(synonym=['hug', 'kiss'], word=['warm', 'tender']))
df = pd.DataFrame(dict(col=['I want a hug', 'a kiss would be great']))
print(df_syn, df, sep='\n\n')
  synonym    word
0     hug    warm
1    kiss  tender

                     col
0           I want a hug
1  a kiss would be great
Solution
mapping = df_syn.assign(
synonym=df_syn.synonym.radd(r'\b').add(r'\b')
).set_index('synonym').word.to_dict()
df.replace({'col': mapping}, regex=True)
                       col
0            I want a warm
1  a tender would be great
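Another way to avoid one pass per synonym, offered as a sketch rather than as a benchmarked improvement, is to compile all synonyms into a single alternation and let a callback pick the replacement. The synonyms are escaped with re.escape on the assumption that they are literal words, not regex patterns; the names lookup and pattern are illustrative.

```python
import re

import pandas as pd

df_syn = pd.DataFrame({"synonym": ["hug", "kiss"], "word": ["warm", "tender"]})
df = pd.DataFrame({"col": ["I want a hug", "a kiss would be great"]})

# Plain dictionary from synonym to replacement word.
lookup = dict(zip(df_syn["synonym"], df_syn["word"]))

# One alternation for all synonyms, longest first so longer entries win over their prefixes.
alternation = "|".join(re.escape(s) for s in sorted(lookup, key=len, reverse=True))
pattern = re.compile(rf"\b(?:{alternation})\b")

# The callback receives each match and returns the word it maps to.
df["col"] = df["col"].str.replace(pattern, lambda m: lookup[m.group(0)], regex=True)
print(df)
```

Each string is scanned once regardless of how many synonyms there are, whereas the loop in the question rescans the whole column for every row of df_syn.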