I have a dataframe column of strings. I want to replace specific words in these strings with values from another dataframe that maps each word to its replacement. I am currently using iterrows(), which takes about 2 minutes for 25,000 rows. Is there a more efficient way of doing this?
syn = pd.ExcelFile("C:/Key-Value.xlsx")
df_syn = syn.parse("Keys")
for idx, row in df_syn.iterrows():
    df['col'] = df['col'].str.replace(r"\b" + row['synonym'] + r"\b", row['word'])
IIUC:
Setup
df_syn = pd.DataFrame(dict(synonym=['hug', 'kiss'], word=['warm', 'tender']))
df = pd.DataFrame(dict(col=['I want a hug', 'a kiss would be great']))
print(df_syn, df, sep='\n\n')
  synonym    word
0     hug    warm
1    kiss  tender

                     col
0           I want a hug
1  a kiss would be great
Solution
# wrap each synonym in word boundaries, then build a {pattern: replacement} dict
mapping = df_syn.assign(
    synonym=df_syn.synonym.radd(r'\b').add(r'\b')
).set_index('synonym').word.to_dict()
df.replace({'col': mapping}, regex=True)
                       col
0            I want a warm
1  a tender would be great
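Note that replace with regex=True treats each key as a regular expression, so a synonym containing regex metacharacters (e.g. '+' or '?') would break the pattern. A minimal sketch that escapes them first, assuming the same df_syn and df as above:

import re

# escape each synonym, then wrap it in word boundaries
mapping = {rf'\b{re.escape(s)}\b': w for s, w in zip(df_syn.synonym, df_syn.word)}
df['col'] = df['col'].replace(mapping, regex=True)

Either way, the mapping is built once and pandas applies it in a single pass, instead of rescanning the column once per synonym as the iterrows() loop does.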
I have two different dataframes that I want to fuzzy match against each other to find and remove duplicates. To make the process faster and more accurate, I want to fuzzy match only records from the same cities. That makes it necessary to create batches based on the cities in one dataframe, then run the fuzzy matcher between each batch and the subset of the other dataframe with matching cities. I can't find another post that does this, and I am stuck. Here is what I have so far. Thanks!
df = pd.DataFrame({'A':[1,1,2,2,2,2,3,3],'B':['Q','Q','R','R','R','P','L','L'],'origin':['file1','file2','file3','file4','file5','file6','file7','file8']})
cols = ['B']
df1 = df[df.duplicated(subset=cols,keep=False)].copy()
df1 = df1.sort_values(cols)
df1['group'] = 'g' + (df1.groupby(cols).ngroup() + 1).astype(str)
df1['duplicate_count'] = df1.groupby(cols)['origin'].transform('size')
df1_g1 = df1.loc[df1['group'] == 'g1']
print(df1_g1)
This will not factor in anything that isn't duplicated, so a value that appears only once is skipped, as is the case with 'P' in column B. It also requires me to hard-code the group each time, which is not ideal. I haven't been able to figure out a for loop or any other method to solve this. Thanks!
You can create the per-group variables through locals():
variables = locals()
for i, j in df1.groupby('group'):
    variables["df1_{0}".format(i)] = j
df1_g1
Out[314]:
   A  B origin group  duplicate_count
6  3  L  file7    g1                2
7  3  L  file8    g1                2
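As a side note, collecting the groups in a plain dict is usually safer than writing into locals(), which is not guaranteed to create real variables inside a function. A minimal sketch of that approach, using the same df1:

# one DataFrame per group, keyed by a generated name
frames = {'df1_{0}'.format(name): frame for name, frame in df1.groupby('group')}
frames['df1_g1']  # same frame as above, without touching locals()

You can then loop over frames.items() to feed each batch to the fuzzy matcher.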
I have the following problem: I have a dataframe that looks like this:
   A      B
0  1    [5]
1  3    [1]
2  3  [118]
3  5   [34]
Now I need column B to contain only numbers, otherwise I can't work with the data. I already tried to use the replace function and simply replace "[]" with "", but that didn't work out.
Is there any other way? Maybe I can convert the whole column to keep only the numbers as integers? That would be even better than just dropping the brackets.
I'm grateful for any help; I've been stuck on this for two hours now.
If your B column contains a string, use:
df['B'] = df['B'].str[1:-1].astype(int)
If your B column contains a list of one element, use:
df['B'] = df['B'].str[0]
Update
df['B'] = df['B'].str.extract(r'\[(.*)\]', expand=False).astype(int)
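For instance, assuming B holds strings such as '[5]' (a hypothetical reconstruction of the data shown in the question), the extract version converts the whole column in one step:

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 5], 'B': ['[5]', '[1]', '[118]', '[34]']})
df['B'] = df['B'].str.extract(r'\[(.*)\]', expand=False).astype(int)
print(df['B'].dtype)  # int64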
As the title says, I'm looking for a clean way to replace exact strings in a series, ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, ls holds the key-value pairs, where the key is the existing value in a dataframe column and the value is the value to replace it with.
The issue is that the service that passes the dictionary always holds the dictionary keys in upper case, while the dataframe might have the values in lowercase.
The expected output is stored in the 'Result' column.
I tried including re.ignore = True, which changes the last two values.
The following code is not working as expected; it also converts values to upper case from the previous iteration.
for k, v in ls.items():
    print(k, v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k: v})
    print(d)
I'd appreciate any help.
Create a mapping series from the given dictionary, transform the index of the mapping series to lower case, then use Series.map to map the values in the Data column to the values in mappings, and finally use Series.fillna to fill the missing values in the mapped series:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
       Data    Result
0       cat       abc
1       dog       def
2       pot       ety
3  Truncate  Truncate
4    HotDog    HotDog
5    ShuPot    ShuPot
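If you'd rather stay with replace, another option is to build case-insensitive, fully anchored patterns from the dictionary. A sketch, assuming the same ls and d as above:

# (?i) makes the match case-insensitive; ^...$ forces an exact, whole-string match
mapping = {r'(?i)^{}$'.format(k): v for k, v in ls.items()}
d['Result'] = d['Data'].replace(mapping, regex=True)

Because the patterns are anchored, 'HotDog' and 'ShuPot' are left untouched, just as in the map-based solution.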
I have a dataframe for looking up values:
ruralw2 = [[0.1,0.3,0.5], [0.1,0.2,0.8], [0.1,0.2,0.7], [0.1,0,0.3]]
rw2 = pd.DataFrame(data=ruralw2, columns=['city','suburbs','rural'],index=['low','med','high','v-high'])
and then I have another dataframe where I want to fill the 'p' column based on the data in the rw2 dataframe:
df = pd.DataFrame(columns=['location','income','p'])
df['location'] = ['city','city','suburbs','rural','rural']
df['income'] = ['low','med','high','v-high','med']
What I expect is p filled with the values 0.1, 0.1, 0.2, 0.3, 0.8 (one lookup per row).
It's possible to use a for loop, but that is an antipattern in pandas, and I think there should be a better way:
for i in np.arange(df.shape[0]):
    df['p'][i] = rw2.loc[df['income'][i], df['location'][i]]
Another possibility is to write a very long np.where(...) expression, but it doesn't feel right either and it wouldn't be very scalable.
You can use stack on rw2 and reindex with a MultiIndex built from the income and location columns of df, like:
df['p'] = rw2.stack().reindex(pd.MultiIndex.from_frame(df[['income', 'location']])).to_numpy()
  location  income    p
0      city     low  0.1
1      city     med  0.1
2   suburbs    high  0.2
3     rural  v-high  0.3
4     rural     med  0.8
You can use reset_index to bring the income values into the data frame, followed by pd.melt to restructure it into your result format. You can then merge this new data frame with df:
Step 1:
rw2_reset = rw2.reset_index()
rw2_reset
Step 2:
rw2_melt = pd.melt(rw2_reset, id_vars='index', value_vars=['city', 'suburbs', 'rural'])
rw2_melt.rename(columns={'index':'income', 'variable':'location','value':'p'}, inplace=True)
rw2_melt
Step 3:
result = pd.merge(df, rw2_melt, on=['location', 'income'], how='left').drop(columns='p_x').rename(columns={'p_y':'p'})
result
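A third option is a plain positional lookup into the underlying array, which avoids reshaping entirely. A sketch, assuming rw2 and df from the question:

# translate the labels into row/column positions, then fancy-index the values
rows = rw2.index.get_indexer(df['income'])
cols = rw2.columns.get_indexer(df['location'])
df['p'] = rw2.to_numpy()[rows, cols]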
I have a TSV file that I loaded into a pandas dataframe to do some preprocessing, and I want to find out which rows contain a question, outputting 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
                            QUERY  FREQ
0        hindi movies for adults    595
1             are panda dogs real   383
2        asuedraw winning numbers   478
3         sentry replacement keys   608
4  rebuilding nicad battery packs   541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis=1)

def questions(row):
    questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why",
                      "why don't", "how", "how far", "how long", "how many", "how much",
                      "how old", "how come", "?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0
df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But when I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if the logic in my function is wrong. I've used something similar with dataframe columns that contain just a single word, where it outputs a 1 or 0 on a match. However, that same logic doesn't seem to work when the column contains a phrase or sentence, as in this use case. Any input is really appreciated!
If you wish to check whether a string from the dataframe contains any substring from questions_list, you should use the str.contains method:
questions_list = ["what","when","where","which","who","whom","whose","why",
"why don't", "how","how far","how long","how many",
"how much","how old","how come","?"]
pattern = "|".join(questions_list) # generate regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
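str.contains returns booleans; since the question asks for 1/0, one extra cast gets you there:

df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern).astype(int)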
Simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0     True
1     True
2    False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
IIUC, you need to check whether the first word of the string is in the question list: if yes, return 1, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check whether the first element is in the question list.
def questions(row):
    questions_list = ["are", "what", "when", "where", "which", "who", "whom", "whose",
                      "why", "why don't", "how", "how far", "how long", "how many",
                      "how much", "how old", "how come", "?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0
df['QUESTIONS'] = df.apply(questions, axis=1)
You get
                            QUERY  FREQ  QUESTIONS
0        hindi movies for adults    595          0
1             are panda dogs real   383          1
2        asuedraw winning numbers   478          0
3         sentry replacement keys   608          0
4  rebuilding nicad battery packs   541          0
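The same first-word check can also be done without apply. A vectorized sketch, assuming the questions_list above (multi-word entries such as 'how far' can never equal a single first word, exactly as in the apply version):

df['QUESTIONS'] = df['QUERY'].str.split().str[0].isin(questions_list).astype(int)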