Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like to keep the first 5 of those words that appear in the course's description.
This is the format of the data:
course_name course_title course_description predicted_word_10 predicted_word_9 predicted_word_8 predicted_word_7 predicted_word_6 predicted_word_5 predicted_word_4 predicted_word_3 predicted_word_2 predicted_word_1
Xmath 32 Precalculus Polynomial and rational functions, exponential... directed scholars approach build african different visual cultures placed global
Xphilos 2 Morality Introduction to ethical and political philosop... make presentation weekly european ways general range questions liberal speakers
My idea is, for each row, to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words, in the order they appear, into additional columns description_word_1 ... description_word_5. (If fewer than 5 predicted words appear in the description, I plan to return NaN in the corresponding columns.)
To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...
I would want to return induction, exponential, logarithmic, polynomial, algebra for that course, in that order, and do the same for the rest of the courses.
My attempt was to define an apply function that takes in a row and iterates from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create the additional columns that hold the correct words for each course. The code below currently only keeps the words of one course for all the rows.
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index = 1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <= 5):
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1

df.apply(find_top_description_words, axis=1)
The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thank you!

If I understand correctly:
Create a new DataFrame with just the predicted words:
pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
Note that each row now holds a list of its predicted words. The order is preserved: the first non-empty predicted word comes first, the second comes second, and so on.
Now let's create a new DataFrame:
pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
And the final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
Hope this works.
EDIT
def common_elements(xx, yy):
    temp = pd.Series(range(0, len(xx)), index=xx)
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)

pred_words_lists = df.apply(lambda x: common_elements(x[2].replace(',', '').split(), list(x[3:].dropna())), axis=1)
Does it satisfy your requirements?
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',', '').split()
    predicted_words_list = list(predicted_words)
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list

df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
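To get from there to the layout described in the question (separate description_word_1 ... description_word_5 columns, NaN-padded when fewer than 5 matches are found), the per-row word lists returned by the apply call can be expanded into columns. This is only a minimal sketch building on the adapted solution above; the intermediate variable names are illustrative:
# Minimal sketch: expand the per-row word arrays into fixed columns,
# padding with NaN when fewer than k words were found.
k = 5
desc_words = df.apply(
    lambda x: get_sorted_descriptions_words(
        x['course_description'], x.filter(regex=r'predicted_word_.*'), k),
    axis=1)

desc_cols = pd.DataFrame(desc_words.tolist(), index=df.index)
desc_cols = desc_cols.reindex(columns=range(k))  # guarantee exactly k columns, NaN-padded
desc_cols.columns = ['description_word_' + str(i + 1) for i in range(k)]
df = df.join(desc_cols)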

Related

Comparing compressed distributions per cohort

How can I easily compare the distributions of multiple cohorts?
Usually, seaborn's distplot (https://seaborn.pydata.org/generated/seaborn.distplot.html) would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, one per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare distributions:
within a cohort 0 vs. 1 of target_y_n
over multiple cohorts
in a way which is visually still understandable and not only a mess.
EDIT
For a single cohort, "Plotting pre aggregated data in python" could be the answer, but how can multiple cohorts be compared (not just in a loop), since that leads to too many plots to compare?
I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
sns.catplot('value','count',data=df, kind='bar',hue='target_y_n',dodge=False,ci=None)
sns.catplot('value','count',data=df, kind='box',hue='target_y_n',dodge=False)
plt.bar(df[df['target_y_n']==0]['value'],df[df['target_y_n']==0]['count'],width=1)
plt.bar(df[df['target_y_n']==1]['value'],df[df['target_y_n']==1]['count'],width=1)
plt.legend(['Target=0','Target=1'])
sns.barplot('value','count',data=df, hue = 'target_y_n',dodge=False,ci=None)
Finally try to have a look at the FacetGrid class to extend your comparison (see here).
g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
In your case you would have something like:
g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
And a qqplot option:
from scipy import stats
def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)

g = sns.FacetGrid(df, col='cohort', hue='target_y_n')
g = g.map(qqplot, 'value', 'count')
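If the nested Row/dict objects first need to be flattened into plain columns before any of the snippets above can be used, something along these lines may help. This is only a rough sketch built on the toy example from the question; the 'cohort' column and the sample values are made up for illustration:
import pandas as pd
import seaborn as sns

# Toy data in the same shape as the example above: one dict of
# pre-aggregated counts per row, plus a (hypothetical) cohort key.
raw = pd.DataFrame({
    'cohort': ['18-25', '18-25', '26-35', '26-35'],
    'baz': [{'target_y_n': 0, 'value': 0.5, 'count': 1000},
            {'target_y_n': 1, 'value': 1.0, 'count': 10000},
            {'target_y_n': 0, 'value': 0.7, 'count': 800},
            {'target_y_n': 1, 'value': 1.2, 'count': 9000}],
})

# Flatten the nested dicts into ordinary columns, keeping the cohort key.
flat = pd.concat([raw.drop(columns='baz'),
                  pd.json_normalize(raw['baz'].tolist())], axis=1)

# The flattened frame can then be fed to the FacetGrid snippets above, e.g.
g = sns.FacetGrid(flat, col='target_y_n', row='cohort', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)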

Data Selection - Finding relations between dataframe attributes

Let's say I have a dataframe of 80 columns and 1 target column;
for example, a bank account table with 80 attributes for each record (account) and 1 target column which decides whether the client stays or leaves.
What steps and algorithms should I follow to select the most effective columns, i.e. the ones with the highest impact on the target column?
There are a number of steps you can take; I'll give some examples to get you started:
A correlation coefficient, such as Pearson's Rho (for parametric data) or Spearman's R (for ordinal data).
Feature importances. I like XGBoost for this, as it includes the handy xgb.ggplot.importance / xgb.plot_importance methods.
One of the many feature selection options, such as Python's sklearn.feature_selection methods (a minimal sketch follows below).
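As a starting point for that last option, here is a minimal sketch using sklearn's SelectKBest with mutual information. The target column name ('target') and k=10 are assumptions for illustration, and it expects numeric, non-missing features:
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X = df.drop(columns=['target'])   # the 80 attribute columns (assumed numeric)
y = df['target']                  # the stays/leaves label

selector = SelectKBest(score_func=mutual_info_classif, k=10)
selector.fit(X, y)

# Rank all columns by their estimated relevance to the target
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(10))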
This is one way to do it using the Pearson correlation coefficient in RStudio. I used it once when exploring the red_wine dataset; my target variable was quality, and I wanted to know the effect of the rest of the columns on it.
The figure below shows the output of the code: blue represents positive correlations and red negative ones, and the closer the value is to 1 or -1, the darker the color.
c <- cor(
  red_wine %>%
    # first we remove unwanted columns
    dplyr::select(-X) %>%
    dplyr::select(-rating) %>%
    mutate(
      # now we translate quality to a number
      quality = as.numeric(quality)
    )
)
corrplot(c, method = "color", type = "lower", addCoef.col = "gray",
         title = "Red Wine Variables Correlations", mar = c(0, 0, 1, 0),
         tl.cex = 0.7, tl.col = "black", number.cex = 0.9)

Creating a function to count the number of POS tags in a pandas instance

I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for a dataframe df, the noun count of the column "reviews" can be saved to a new column "noun_count" using this code:
from nltk import pos_tag, word_tokenize

def NounCount(x):
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways you can do that, and one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, and count the number of 1's you get.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1]== "verb" else 0).sum()
This will return you 2.
I grabbed a sentence from the link you shared:
text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1]== "VBD" else 0).sum()
# this returns 2
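And since the question specifically mentions being verbs: building on the same idea, you can filter on the word itself as well as the tag. This is a rough sketch that assumes a column 'pos_tag' holding lists of (word, POS) tuples; the set of "be" forms is my own addition:
# Count forms of "to be" in each row's list of (word, POS) tuples.
BE_FORMS = {'be', 'am', 'is', 'are', 'was', 'were', 'been', 'being'}

def being_verb_count(tagged):
    return sum(1 for word, pos in tagged
               if pos.startswith('VB') and word.lower() in BE_FORMS)

df['being_verb_count'] = df['pos_tag'].apply(being_verb_count)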

Tukey-Test Grouping and plotting in SciPy

I'm trying to plot results from a Tukey test, but I am struggling with putting the data into groups based on a p-value. This is the equivalent in R which I am trying to replicate. I have been using the SciPy one-way ANOVA tests and the Tukey test from statsmodels, but can't get these groups done in the same way.
Any help is greatly appreciated
I've also just found this another example in R of what I want to do in python
I have been struggling to do the same thing. I found a paper that tells you how to code the letters.
Hans-Peter Piepho (2004) An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons, Journal of Computational and Graphical Statistics, 13:2, 456-466, DOI: 10.1198/1061860043515
Doing the coding was a little tricky, as you need to check and replicate columns and then combine columns. I tried to add some comments to the code. I figured out a method where you can run tukeyhsd and then compute the letters from the results. It should be possible to turn this into a function, or hopefully make it part of tukeyhsd. My data is not posted, but it is a column of data and then a column describing the groups. The groups for me are the five boroughs of NYC. You can also just change the comments and use random data the first time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multicomp import MultiComparison

# Read data. Comment this out and uncomment the next lines to use random data instead.
df = pd.read_excel('anova_test.xlsx')
#n = 1000
#df = pd.DataFrame(columns=['Groups', 'Data'], index=np.arange(n))
#df['Groups'] = np.random.randint(1, 4, size=n)
#df['Data'] = df['Groups']*np.random.random_sample(size=n)

# Define the columns for the data and for the grouping
col_to_group = 'Groups'
col_for_data = 'Data'

# Now take the data and regroup it for the ANOVA
samples = [cols[1] for cols in df.groupby(col_to_group)[col_for_data]]  # I am not sure how this works, but it makes a numpy array for each group
f_val, p_val = stats.f_oneway(*samples)  # I am not sure what the star does, but it passes all the numpy arrays correctly
#print('F value: {:.3f}, p value: {:.3f}\n'.format(f_val, p_val))

# This if statement can be uncommented if you don't want to go further unless p < 0.05
#if p_val < 0.05:  # if the p value is less than 0.05 it then does the Tukey
mod = MultiComparison(df[col_for_data], df[col_to_group])
thsd = mod.tukeyhsd()
#print(mod.tukeyhsd())

# Piepho method: an algorithm for a letter-based representation of all-pairwise comparisons
tot = len(thsd.groupsunique)
# Make an empty dataframe that is a square matrix of the size of the groups; set the first column to 1
df_ltr = pd.DataFrame(np.nan, index=np.arange(tot), columns=np.arange(tot))
df_ltr.iloc[:, 0] = 1
count = 0
df_nms = pd.DataFrame('', index=np.arange(tot), columns=['names'])  # a dummy dataframe to put axis labels into (sd stands for significant difference)

for i in np.arange(tot):  # loop through and make all pairwise comparisons
    for j in np.arange(i+1, tot):
        #print('i=', i, 'j=', j, thsd.reject[count])
        if thsd.reject[count] == True:
            for cn in np.arange(tot):
                if df_ltr.iloc[i, cn] == 1 and df_ltr.iloc[j, cn] == 1:  # if the column contains both i and j, shift and duplicate
                    df_ltr = pd.concat([df_ltr.iloc[:, :cn+1], df_ltr.iloc[:, cn+1:].T.shift().T], axis=1)
                    df_ltr.iloc[:, cn+1] = df_ltr.iloc[:, cn]
                    df_ltr.iloc[i, cn] = 0
                    df_ltr.iloc[j, cn+1] = 0
                    # Now we need to check all columns for absorption
                    for cleft in np.arange(len(df_ltr.columns)-1):
                        for cright in np.arange(cleft+1, len(df_ltr.columns)):
                            if (df_ltr.iloc[:, cleft].isna()).all() == False and (df_ltr.iloc[:, cright].isna()).all() == False:
                                if (df_ltr.iloc[:, cleft] >= df_ltr.iloc[:, cright]).all() == True:
                                    df_ltr.iloc[:, cright] = 0
                                    df_ltr = pd.concat([df_ltr.iloc[:, :cright], df_ltr.iloc[:, cright:].T.shift(-1).T], axis=1)
                                if (df_ltr.iloc[:, cleft] <= df_ltr.iloc[:, cright]).all() == True:
                                    df_ltr.iloc[:, cleft] = 0
                                    df_ltr = pd.concat([df_ltr.iloc[:, :cleft], df_ltr.iloc[:, cleft:].T.shift(-1).T], axis=1)
        count += 1

# Sort so that the first column becomes A
df_ltr = df_ltr.sort_values(by=list(df_ltr.columns), axis=1, ascending=False)

# Assign letters to each column
for cn in np.arange(len(df_ltr.columns)):
    df_ltr.iloc[:, cn] = df_ltr.iloc[:, cn].replace(1, chr(97+cn))
    df_ltr.iloc[:, cn] = df_ltr.iloc[:, cn].replace(0, '')
    df_ltr.iloc[:, cn] = df_ltr.iloc[:, cn].replace(np.nan, '')

# Put all the letters into one string
df_ltr = df_ltr.astype(str)
df_ltr.sum(axis=1)
#print(df_ltr)
#print(df_ltr.sum(axis=1))

# Now plot like R (box plot with jittered scatter)
fig, ax = plt.subplots()
df.boxplot(column=col_for_data, by=col_to_group, ax=ax, fontsize=16, showmeans=True,
           boxprops=dict(linewidth=2.0), whiskerprops=dict(linewidth=2.0))  # this makes the boxplot
ax.set_ylim([-10, 20])
grps = pd.unique(df[col_to_group].values)  # finds the group names
grps.sort()  # this is critical! It puts the groups in alphabetical order so they match the plotting order
props = dict(facecolor='white', alpha=1)
for i, grp in enumerate(grps):  # loop through the groups to make the scatters and build the axis labels
    x = np.random.normal(i+1, 0.15, size=len(df[df[col_to_group] == grp][col_for_data]))
    ax.scatter(x, df[df[col_to_group] == grp][col_for_data], alpha=0.5, s=2)
    name = "{}\navg={:0.2f}\n(n={})".format(grp,
                                            df[df[col_to_group] == grp][col_for_data].mean(),
                                            df[df[col_to_group] == grp][col_for_data].count())
    df_nms['names'][i] = name
    ax.text(i+1, ax.get_ylim()[1]*1.1, df_ltr.sum(axis=1)[i], fontsize=10,
            verticalalignment='top', horizontalalignment='center', bbox=props)
ax.set_xticklabels(df_nms['names'], rotation=0, fontsize=10)
ax.set_title('')
fig.suptitle('')
fig.savefig('anovatest.jpg', dpi=600, bbox_inches='tight')
(Figure: results showing the letters above the box plots, using tukeyhsd.)
Here is a function that returns letter labels if you have a symmetric matrix of p-values from a Tukey test:
import numpy as np

def tukeyLetters(pp, means=None, alpha=0.05):
    '''TUKEYLETTERS - Produce list of group labels for TukeyHSD
    letters = TUKEYLETTERS(pp), where PP is a symmetric matrix of
    probabilities from a Tukey test, returns alphabetic labels
    for each group to indicate clustering. PP may also be a vector
    from PAIRWISE_TUKEYHSD.
    Optional argument MEANS specifies group means, which is used for
    ordering the letters. ("a" gets assigned to the group with lowest
    mean.) Without this argument, ordering is arbitrary.
    Optional argument ALPHA specifies cutoff for treating groups as
    part of the same cluster.'''
    if len(pp.shape) == 1:
        # vector
        G = int(3 + np.sqrt(9 - 4*(2 - len(pp)))) // 2
        ppp = .5*np.eye(G)
        ppp[np.triu_indices(G, 1)] = pp
        pp = ppp + ppp.T
    conn = pp > alpha
    G = len(conn)
    if np.all(conn):
        return ['a' for g in range(G)]
    conns = []
    for g1 in range(G):
        for g2 in range(g1+1, G):
            if conn[g1, g2]:
                conns.append((g1, g2))

    letters = [[] for g in range(G)]
    nextletter = 0
    for g in range(G):
        if np.sum(conn[g, :]) == 1:
            letters[g].append(nextletter)
            nextletter += 1
    while len(conns):
        grp = set(conns.pop(0))
        for g in range(G):
            if all(conn[g, np.sort(list(grp))]):
                grp.add(g)
        for g in grp:
            letters[g].append(nextletter)
        for g in grp:
            for h in grp:
                if (g, h) in conns:
                    conns.remove((g, h))
        nextletter += 1

    if means is None:
        means = np.arange(G)
    means = np.array(means)
    groupmeans = []
    for k in range(nextletter):
        ingroup = [g for g in range(G) if k in letters[g]]
        groupmeans.append(means[np.array(ingroup)].mean())
    ordr = np.empty(nextletter, int)
    ordr[np.argsort(groupmeans)] = np.arange(nextletter)
    result = []
    for ltr in letters:
        lst = [chr(97 + ordr[x]) for x in ltr]
        lst.sort()
        result.append(''.join(lst))
    return result
To make that concrete, here is a full example:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
data = [ 1,2,2,1,4,5,4,5,7,8,7,8,1,3,4,5 ]
group = [ 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3 ]
tuk = pairwise_tukeyhsd(data, group)
letters = tukeyLetters(tuk.pvalues)
This will result in letters containing ['a', 'c', 'b', 'ac']

Apply function with pandas dataframe - POS tagger computation time

I'm very confused by the apply function for pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count[row] and get the correct value for any index but I can't figure out how to make it work with apply how I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
from collections import Counter

def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flattened = [y for z in list_of_lists for y in z]
    parts_of_speech = [row[1] for row in flattened]
    c = Counter(parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger with the left-3-words model, but I have a big problem with computation time. I'm noticing that it calls the .jar file again and again (Java keeps opening and closing in the task manager); maybe that's unavoidable, but it's really taking far too long to run. Is there any way I can speed it up?
I don't know what 'tagger' is, but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2
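On the computation-time concern above: one thing worth trying is tagging all the rows in a single call rather than once per row, since NLTK's StanfordPOSTagger exposes tag_sents for pre-tokenized batches, which should start the .jar once instead of once per row. This is only a rough sketch, assuming st is the tagger instance from the question and df['string'] holds the raw text:
# Batch all rows into one tag_sents call so the Stanford .jar is started once.
tokenized = [s.split() for s in df['string']]
tagged_rows = st.tag_sents(tokenized)  # one list of (word, tag) pairs per row

df['num_nouns'] = [sum(1 for _, tag in tagged if tag.startswith('NN'))
                   for tagged in tagged_rows]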