Find multiple strings in a given column - pandas

I'm not sure whether this is easy to do.
I have 2 dataframes. The first (df1) has a column with texts ('Texts'); the second has two columns, one with short texts ('SubString') and one with a score ('Score').
I want to sum up all the scores in the second dataframe whose 'SubString' value is a substring of the 'Texts' column in the first dataframe.
For example, if I have a dataframe like this:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
    'c': ['C', 'C', 'C', 'C', 'C', 'C'],
    'd': ['D', 'D', 'D', 'D', 'NaN', 'NaN']
}, columns=['ID', 'Texts', 'c', 'd'])
df1
Out[2]:
ID Texts c d
0 1 this is a string C D
1 2 here we have another string C D
2 3 this one is completly different C D
3 4 one more C D
4 5 this is one more C NaN
5 6 and the last one C NaN
And another dataframe like this:
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5]
}, columns=['SubString', 'Score'])
df2
Out[3]:
SubString Score
0 This 0.50
1 one 0.20
2 this is 0.75
3 is one -0.50
I want to get something like this:
df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
    score = 0
    for index2, row2 in df2.iterrows():
        if row2['SubString'] in row1['Texts']:
            score += row2['Score']
    df1.at[index1, 'Score'] = score  # .at replaces set_value, which was removed in pandas 1.0
df1
Out[4]:
ID Texts c d Score
0 1 this is a string C D 0.75
1 2 here we have another string C D 0.00
2 3 this one is completly different C D -0.30
3 4 one more C D 0.20
4 5 this is one more C NaN 0.45
5 6 and the last one C NaN 0.20
Is there a less garbled and faster way to do it?
Thanks!

Option 1
In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
                    for _, x in df2.iterrows()]
                   ).sum(axis=0)
Out[691]: array([ 0.75, 0. , -0.3 , 0.2 , 0.45, 0.2 ])
Option 2
In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0 0.75
1 0.00
2 -0.30
3 0.20
4 0.45
5 0.20
Name: Texts, dtype: float64
Note: apply doesn't get rid of loops, it just hides them.
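A vectorized variant of Option 1, as a rough sketch: build a texts-by-substrings boolean matrix with str.contains, then matrix-multiply it by the score vector (regex=False is assumed to be appropriate here, i.e. the substrings should be matched literally):
import numpy as np
# boolean matrix: one row per text, one column per substring
contains = np.column_stack(
    [df1.Texts.str.contains(s, regex=False) for s in df2.SubString])
# the matrix product with the score vector gives the per-text score sums
df1['Score'] = contains @ df2.Score.to_numpy()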

Related

Pandas - groupby one column and get mean of all other columns

I have a dataframe, with columns:
cols = ['A', 'B', 'C']
If I group by one column, say 'A', like so:
df.groupby('A')['B'].mean()
It works.
But I need to group by one column and then get the mean of all the other columns. I've tried:
df[cols].groupby('A').mean()
But I get the error:
KeyError: 'A'
What am I missing?
Please try:
df.groupby('A').agg('mean')
Sample data:
B C A
0 1 4 K
1 2 6 S
2 4 7 K
3 6 3 K
4 2 1 S
5 7 3 K
6 8 9 K
7 9 3 K
print(df.groupby('A').agg('mean'))
B C
A
K 5.833333 4.833333
S 2.000000 3.500000
You can use df.groupby('col').mean(). For example, to calculate the mean for columns 'A', 'B' and 'C':
A B C D
0 1 NaN 1 1
1 1 2.0 2 1
2 2 3.0 1 1
3 1 4.0 1 1
4 2 5.0 2 1
df[['A', 'B', 'C']].groupby('A').mean()
or
df.groupby('A')[['A', 'B', 'C']].mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
If you need mean for all columns:
df.groupby('A').mean()
Output:
B C D
A
1 3.0 1.333333 1.0
2 4.0 1.500000 1.0
Perhaps the missing column is a string rather than a numeric type?
df = pd.DataFrame({
    'A': ['big', 'small', 'small', 'small'],
    'B': [1, 0, 0, 0],
    'C': [1, 1, 1, 0],
    'D': ['1', '0', '0', '0']
})
df.groupby(['A']).mean()
Output:
         B         C
A
big    1.0  1.000000
small  0.0  0.666667
Here, converting the column to a numeric type such as int or float produces the desired result:
df.D = df.D.astype(int)
df.groupby(['A']).mean()
Output:
         B         C    D
A
big    1.0  1.000000  1.0
small  0.0  0.666667  0.0
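A note for newer pandas versions: since pandas 2.0, calling mean() on a group that still contains non-numeric columns raises a TypeError instead of silently dropping them, so either select the numeric columns first or opt in explicitly:
df.groupby('A').mean(numeric_only=True)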

Calculate a value inside a group

Suppose that I have a Pandas DataFrame named df:
Origin Dest T R
0 N N 100 3
1 N A 2 6
2 A B 356 7
3 A B 789 8
4 B N 999 9
5 B A 345 2
6 N A 456 3
I want to produce a DataFrame that, for each Origin/Dest group, does the following calculation:
sum the values in column 'T', then divide by the sum of the values in 'R'. I want to see this result in an Origin-Dest matrix form.
I am trying the following, but it does not work.
Matrix_Origin =df.pivot_table(values=['T','R'], index='Origin', columns ='Dest', fill_value=0, aggfunc=[lambda x: df['T'].sum()/df['R'].sum() ])
This is what I want to produce:
Origin N A B
N 33.33 50.88 0
A 0 0 76.33
B 111 172.5 0
Any help will be appreciated.
A combination of groupby and unstack can yield your desired outcome:
res = df.groupby(["Origin", "Dest"]).sum().unstack()
# divide column T by column R
outcome = (
    res["T"]
    .div(res["R"])
    .reindex(index=["N", "A", "B"], columns=["N", "A", "B"])
    .fillna(0)
    # optional
    .round(2)
)
outcome
Dest N A B
Origin
N 33.33 50.89 0.00
A 0.00 0.00 76.33
B 111.00 172.50 0.00
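If you would rather stay with pivot_table, a rough equivalent sketch is to aggregate both columns with sum and do the division afterwards, instead of putting it inside aggfunc as the question attempted:
pt = df.pivot_table(values=['T', 'R'], index='Origin', columns='Dest', aggfunc='sum')
# pt has two column blocks, 'T' and 'R', each keyed by Dest; divide block-wise
outcome = (pt['T'] / pt['R']).fillna(0).round(2)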

Splitting string columns when the first part before the splitting pattern is missing

I'm trying to split a string column into different columns and tried
How to split a column into two columns?
The pattern of the strings look like the following:
import pandas as pd
import numpy as np
>>> data = {'ab': ['a - b', 'a - b', 'b', 'c', 'whatever']}
>>> df = pd.DataFrame(data=data)
ab
0 a - b
1 a - b
2 b
3 c
4 whatever
>>> df['a'], df['b'] = df['ab'].str.split('-', n=1).str
ab a b
0 a - b a b
1 a - b a b
2 b b NaN
3 c c NaN
4 whatever whatever NaN
The expected result is
ab a b
0 a - b a b
1 a - b a b
2 b NaN b
3 c NaN c
4 whatever NaN whatever
The method I came up with is
df.loc[~ df.ab.str.contains(' - '), 'b'] = df['ab']
df.loc[~ df.ab.str.contains(' - '), 'a'] = np.nan
Is there more generic/efficient way to do this task?
We can use str.extract as long as we know the specific strings to extract:
df.ab.str.extract(r"(a)?(?:\s-\s)?(b)?")
Out[47]:
0 1
0 a b
1 a b
2 NaN b
3 a NaN
data used:
data = {'ab': ['a - b', 'a - b', 'b','a']}
df = pd.DataFrame(data=data)
With your edit, it seems your aim is to put anything that appears by itself into the second column. You could do:
df.ab.str.extract(r"(\S*)(?:\s-\s)?(\b\S+)")
Out[59]:
   0         1
0  a         b
1  a         b
2            b
3            c
4     whatever
I will use get_dummies:
s=df['ab'].str.get_dummies(' - ')
s=s.mask(s.eq(1),s.columns.tolist()).mask(s.eq(0))
s
Out[7]:
a b
0 a b
1 a b
2 NaN b
Update
df.ab.str.split(' - ',expand=True).apply(lambda x : pd.Series(sorted(x,key=pd.notnull)),axis=1)
Out[22]:
0 1
0 a b
1 a b
2 None b
3 None c
4 None whatever
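One more sketch for the two-column case: split as usual, then shift lone values into the right-hand column with a boolean mask (uses the pd and np imports from the question):
parts = df['ab'].str.split(' - ', expand=True)
lone = parts[1].isna()                   # rows where no delimiter was found
parts.loc[lone, 1] = parts.loc[lone, 0]  # move the lone value right
parts.loc[lone, 0] = np.nan
df[['a', 'b']] = parts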

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
if the required values for C are [1,2,3], the desired output would be:
val
A B C
1 1 1 3
2 0
3 0
2 2 1 4
2 0
3 0
3 3 1 0
2 5
3 0
I know how to achieve this using groupby and applying a function to each group, but I was wondering if there is a cleaner way of doing it with reindex (I couldn't make reindex work for a MultiIndex case, but perhaps I'm missing something).
Use -
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output
df2
val
A B C
1 1 1 3.0
2 0.0
3 0.0
2 2 1 4.0
2 0.0
3 0.0
3 3 1 0.0
2 5.0
3 0.0
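For the reindex route the question asks about, a sketch: cross the existing (A, B) pairs with the required C values and reindex against that index. Unlike update above, fill_value=0 also keeps the integer dtype:
C_reqd = [1, 2, 3]
pairs = df.index.droplevel('C').unique()   # existing (A, B) combinations
full = pd.MultiIndex.from_tuples(
    [(a, b, c) for a, b in pairs for c in C_reqd], names=['A', 'B', 'C'])
df.reindex(full, fill_value=0)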

Assigning to a slice from another DataFrame requires matching column names?

If I want to set (replace) part of a DataFrame with values from another, I should be able to assign to a slice (as in this question) like this:
df.loc[rows, cols] = df2
Not so in this case, it nulls out the slice instead:
In [32]: df
Out[32]:
A B
0 1 -0.240180
1 2 -0.012547
2 3 -0.301475
In [33]: df2
Out[33]:
C
0 x
1 y
2 z
In [34]: df.loc[:,'B']=df2
In [35]: df
Out[35]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
But it does work with just a column (Series) from df2, which is not an option if I want multiple columns:
In [36]: df.loc[:,'B']=df2['C']
In [37]: df
Out[37]:
A B
0 1 x
1 2 y
2 3 z
Or if the column names match:
In [47]: df3
Out[47]:
B
0 w
1 a
2 t
In [48]: df.loc[:,'B']=df3
In [49]: df
Out[49]:
A B
0 1 w
1 2 a
2 3 t
Is this expected? I don't see any explanation for it in docs or Stackoverflow.
Yes, this is expected. Label alignment is one of the core features of pandas. When you use df.loc[:,'B'] = df2, it needs to align the two DataFrames:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN NaN x
1 NaN NaN y
2 NaN NaN z)
The above shows how each DataFrame looks when aligned as a tuple (the first one is df and the second one is df2). If your df2 also had a column named B with values [1, 2, 3], it would become:
df.align(df2)
Out:
( A B C
0 1 -0.240180 NaN
1 2 -0.012547 NaN
2 3 -0.301475 NaN, A B C
0 NaN 1 x
1 NaN 2 y
2 NaN 3 z)
Since B's are aligned, your assignment would result in
df.loc[:,'B'] = df2
df
Out:
A B
0 1 1
1 2 2
2 3 3
When you use a Series, the alignment will be on a single axis (on index in your example). Since they exactly match, there will be no problem and it will assign the values from df2['C'] to df['B'].
You can either rename the labels before the alignment or use a data structure that doesn't have labels (a numpy array, a list, a tuple...).
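A minimal sketch of the rename approach, using the column names from the question; the same idea extends to several columns at once:
# make df2's column label match the target label, then let alignment do the rest
df.loc[:, ['B']] = df2.rename(columns={'C': 'B'})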
You can use the underlying NumPy array:
df.loc[:,'B'] = df2.values
df
A B
0 1 x
1 2 y
2 3 z
Pandas indexing is always sensitive to labeling of both rows and columns. In this case, your rows check out, but your columns do not. (B != C).
Using the underlying NumPy array makes the operation index-insensitive.
The reason this does work when you assign a Series is that a Series has no concept of columns. The only alignment is on the rows, and those match here.