Replace multiple substrings in a dataframe column based on other columns

I'm stuck. I need a (deceptively) simple operation on a tibble...
One of the columns is a string. I also have vars, a character vector whose entries match column names in the tibble.
I need to replace every occurrence of the names in vars within my_tib$thestring with the corresponding value from that row of the tibble.
Here is an example
vars <- c("Yes", "No", "Maybe")
my_tib <- tribble(
  ~Yes, ~No, ~Maybe, ~thestring,
  1, 0, 2, "Sometimes Yes is YES",
  1, 0, 3, "Sometimes Yes others is No or Maybe",
  1, 0, 4, "Sometimes Yes while Maybe...",
  1, 0, 5, "Sometimes Yes is Yes and No and Maybe"
)
# Intended Result
my_tib_result <- tribble(
  ~Yes, ~No, ~Maybe, ~thestring,
  1, 0, 2, "Sometimes 1 is YES",
  1, 0, 3, "Sometimes 1 others is 0 or 3",
  1, 0, 4, "Sometimes 1 while 4...",
  1, 0, 5, "Sometimes 1 is 1 and 0 and 5"
)
I'm sure it's simple (:) or not :))... but I'm not moving from this point, so a push would be most welcome.
Thank you very much for your comments and help.
AC

I have found a method... not the most elegant, but it works, so I'm sharing the solution.
If anyone has a better idea, I would appreciate it.
My solution:
library(tidyverse)  # dplyr, purrr, stringr

# Create a function that replaces one variable name inside 'thestring'
chg_each <- function(str, tb){
  tb %>% mutate(
    # Note the 'as.character' and 'get' ... for map2_chr
    thestring = map2_chr(thestring, as.character(get(str)),
                         ~ if_else(is.na(.x), "",
                                   str_replace_all(.x, str, .y)))
  )
}

# Iterate over all vars to change
end_my_tib <- my_tib
for (var in vars) {
  end_my_tib <- chg_each(var, end_my_tib)
}
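For comparison, the same row-wise replacement can be sketched in pandas (a hypothetical frame mirroring the tibble; like the R version, this uses plain case-sensitive substring replacement, with no word boundaries):

```python
import pandas as pd

# Hypothetical stand-in for the tibble; not the R answer itself
vars_ = ["Yes", "No", "Maybe"]
df = pd.DataFrame({
    "Yes":   [1, 1],
    "No":    [0, 0],
    "Maybe": [2, 3],
    "thestring": ["Sometimes Yes is YES",
                  "Sometimes Yes others is No or Maybe"],
})

def replace_row(row):
    """Replace each name in vars_ with that row's value for the column."""
    s = row["thestring"]
    for v in vars_:
        # plain, case-sensitive substring replacement (like str_replace_all)
        s = s.replace(v, str(row[v]))
    return s

df["thestring"] = df.apply(replace_row, axis=1)
# -> ["Sometimes 1 is YES", "Sometimes 1 others is 0 or 3"]
```

As with the R solution, a pattern like "No" would also match inside longer words; word-boundary regexes would be needed if that matters.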

Related

Adding and updating a pandas column based on conditions of other columns

So I have a dataframe of over 1 million rows.
One column, called 'activity', has numbers from 1 to 12.
I added a new empty column called 'label'.
The column 'label' needs to be filled with 0 or 1, based on the values of the column 'activity':
if activity is 1, 2, 3, 6, 7, or 8, label will be 0; otherwise it will be 1.
Here is what I am currently doing:
import pandas as pd

df = pd.read_csv('data.csv')
df['label'] = ''
for index, row in df.iterrows():
    if (row['activity'] == 1 or row['activity'] == 2 or row['activity'] == 3
            or row['activity'] == 6 or row['activity'] == 7 or row['activity'] == 8):
        df.loc[index, 'label'] = 0
    else:
        df.loc[index, 'label'] = 1
df.to_csv('data.csv', index=False)
This is very inefficient and takes too long to run. Are there any optimizations, possibly using numpy arrays? And is there any way to make the code cleaner?
Use numpy.where with Series.isin:
df['label'] = np.where(df['activity'].isin([1, 2, 3, 6, 7, 8]), 0, 1)
Or map True/False to 0/1 by inverting the mask:
df['label'] = (~df['activity'].isin([1, 2, 3, 6, 7, 8])).astype(int)
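Both one-liners can be checked end to end on a small stand-in frame (the real data comes from data.csv, so the values below are invented):

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the 1M-row CSV
df = pd.DataFrame({'activity': [1, 4, 6, 12, 8, 2]})

# Vectorized labelling: 0 where activity is in the set, 1 elsewhere
df['label'] = np.where(df['activity'].isin([1, 2, 3, 6, 7, 8]), 0, 1)
# -> [0, 1, 0, 1, 0, 0]

# Equivalent: invert the boolean mask and cast to int
df['label2'] = (~df['activity'].isin([1, 2, 3, 6, 7, 8])).astype(int)
```

Unlike the iterrows() loop, both versions operate on whole arrays at once, which is what makes them fast on a million rows.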

find the first element in a list beyond some index and satisfying some condition

I have as input:
A givenIndex
A list
I want to find the index of the first positive element in that list, ignoring all indices that are strictly smaller than givenIndex.
For example, if givenIndex = 2 and the list is listOf(1, 0, 0, 0, 6, 8, 2), the expected output is 4 (where the value is 6).
The following code gives the first positive element, but it doesn't ignore the indices that are smaller than givenIndex.
val numbers = listOf(1, 0, 0, 0, 6, 8, 2)
val output = numbers.indexOfFirst { it > 0 } //output is 0 but expected is 4
val givenIndex = 2
val output = numbers.withIndex().indexOfFirst { (index, value) -> index >= givenIndex && value > 0 }
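The same "first match at or after an index" idea can be sketched in Python for comparison (a hypothetical helper, not part of the Kotlin answer):

```python
def index_of_first_from(xs, given_index, pred):
    """Index of the first element at position >= given_index satisfying pred,
    or -1 if there is none (mirroring Kotlin's indexOfFirst convention)."""
    return next((i for i, v in enumerate(xs) if i >= given_index and pred(v)), -1)

numbers = [1, 0, 0, 0, 6, 8, 2]
result = index_of_first_from(numbers, 2, lambda v: v > 0)  # -> 4
```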

Convert the value of a dictionary in a column into a particular number in pandas

I have a dataframe as shown below
Date Aspect
21-01-2020 {word1:'positive', word2:'negative', word3:'neutral'}
22-01-2020 {word1:'negative', word2:'negative', word3:'neutral', word4:'neutral'}
23-01-2020 {word1:'positive', word2:'positive', word3:'negative'}
I would like to replace positive to 1, negative to -1 and neutral to 0.
Expected Output:
Date Aspect
21-01-2020 {word1:1, word2:-1, word3:0}
22-01-2020 {word1:-1, word2:-1, word3:0, word4:0}
23-01-2020 {word1:1, word2:1, word3:-1}
If the column Aspect is filled with dictionaries, use a dict comprehension that maps values through a helper dict:
d = {'positive':1, 'negative':-1, 'neutral':0}
df['Aspect'] = df['Aspect'].apply(lambda x: {k: d[v] for k, v in x.items()})
#alternative
#df['Aspect'] = [{k: d[v] for k, v in x.items()} for x in df['Aspect']]
print (df)
Date Aspect
0 21-01-2020 {'word1': 1, 'word2': -1, 'word3': 0}
1 22-01-2020 {'word1': -1, 'word2': -1, 'word3': 0, 'word4'...
2 23-01-2020 {'word1': 1, 'word2': 1, 'word3': -1}
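The approach is self-contained enough to check on a minimal frame (built here by hand rather than read from the original data):

```python
import pandas as pd

# Sentiment label -> numeric code
d = {'positive': 1, 'negative': -1, 'neutral': 0}

df = pd.DataFrame({
    'Date': ['21-01-2020', '23-01-2020'],
    'Aspect': [{'word1': 'positive', 'word2': 'negative', 'word3': 'neutral'},
               {'word1': 'positive', 'word2': 'positive', 'word3': 'negative'}],
})

# Map every value inside each row's dict through d
df['Aspect'] = df['Aspect'].apply(lambda x: {k: d[v] for k, v in x.items()})
# -> {'word1': 1, 'word2': -1, 'word3': 0}, {'word1': 1, 'word2': 1, 'word3': -1}
```

Note this raises a KeyError if a dict contains a label not in d; use d.get(v, default) if unknown labels are possible.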

Selecting values with Pandas multiindex using lists of tuples

I have a DataFrame with a MultiIndex with 3 levels:
id foo bar col1
0 1 a -0.225873
2 a -0.275865
2 b -1.324766
3 1 a -0.607122
2 a -1.465992
2 b -1.582276
3 b -0.718533
7 1 a -1.904252
2 a 0.588496
2 b -1.057599
3 a 0.388754
3 b -0.940285
Preserving the id index level, I want to sum over the foo and bar levels, but with different selections for each id.
For example, for id = 0 I want to sum over foo = [1] and bar = ["a", "b"]; for id = 3, over foo = [2] and bar = ["a", "b"]; and for id = 7, over foo = [1, 2] and bar = ["a"]. Giving the result:
id col1
0 -0.225873
3 -3.048268
7 -1.315756
I have been trying something along these lines:
df.loc(axis=0)[[(0, 1, ["a", "b"]), (3, 2, ["a", "b"]), (7, [1, 2], "a")]].sum()
Not sure if this is even possible. Any elegant solution (possibly removing the MultiIndex?) would be much appreciated!
The list of tuples is not the problem. The problem is that each tuple does not correspond to a single index entry, since a list isn't a valid key inside a tuple indexer. If you want to index a DataFrame like this, you need to expand the lists inside each tuple into their own entries.
Define your options as the following list of dictionaries, then expand it with a list comprehension and index using the individual entries.
d = [
    {'id': 0, 'foo': [1],    'bar': ['a', 'b']},
    {'id': 3, 'foo': [2],    'bar': ['a', 'b']},
    {'id': 7, 'foo': [1, 2], 'bar': ['a']},
]
all_idx = [
    (el['id'], i, j)
    for el in d
    for i in el['foo']
    for j in el['bar']
]
# [(0, 1, 'a'), (0, 1, 'b'), (3, 2, 'a'), (3, 2, 'b'), (7, 1, 'a'), (7, 2, 'a')]
df.loc[all_idx].groupby(level=0).sum()
col1
id
0 -0.225873
3 -3.048268
7 -1.315756
A more succinct solution using slicers:
sections = [(0, 1, slice(None)), (3, 2, slice(None)), (7, slice(1,2), "a")]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()
col1
id
0 -0.225873
3 -3.048268
7 -1.315756
Two things to note:
This may be less memory-efficient than the accepted answer since pd.concat creates a new DataFrame.
The slice(None)'s are mandatory; otherwise the index levels of the df.loc[s] results mismatch when calling pd.concat.
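The slicer approach can be checked end to end on a toy frame with the same three index levels (the numbers below are invented for the sketch, not the random values from the question):

```python
import pandas as pd

# Toy frame with the same 3-level MultiIndex layout; values are made up
idx = pd.MultiIndex.from_tuples(
    [(0, 1, 'a'), (0, 2, 'a'), (0, 2, 'b'),
     (3, 1, 'a'), (3, 2, 'a'), (3, 2, 'b'), (3, 3, 'b'),
     (7, 1, 'a'), (7, 2, 'a'), (7, 2, 'b'), (7, 3, 'a'), (7, 3, 'b')],
    names=['id', 'foo', 'bar'])
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
                            7.0, 8.0, 9.0, 10.0, 11.0, 12.0]}, index=idx)

# One slicer tuple per id; slice(None) keeps the index levels aligned
sections = [(0, 1, slice(None)), (3, 2, slice(None)), (7, slice(1, 2), 'a')]
result = pd.concat(df.loc[s, :] for s in sections).groupby('id').sum()
# id 0 -> 1.0; id 3 -> 5.0 + 6.0 = 11.0; id 7 -> 8.0 + 9.0 = 17.0
```

The index must be lexsorted for the slice(1, 2) lookup to work; pd.MultiIndex.from_tuples with already-sorted tuples satisfies that, otherwise call df.sort_index() first.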

How do I sum the coefficients of a polynomial in Maxima?

I came up with this nice thing, which I am calling 'partition function for symmetric groups'
Z[0]:1;
Z[n]:=expand(sum((n-1)!/i!*z[n-i]*Z[i], i, 0, n-1));
Z[4];
6*z[4]+8*z[1]*z[3]+3*z[2]^2+6*z[1]^2*z[2]+z[1]^4
The sum of the coefficients for Z[4] is 6+8+3+6+1 = 24 = 4!
which I am hoping corresponds to the fact that the group S4 has 6 elements like (abcd), 8 like (a)(bcd), 3 like (ab)(cd), 6 like (a)(b)(cd), and 1 like (a)(b)(c)(d)
So I thought to myself, the sum of the coefficients of Z[20] should be 20!
But life being somewhat on the short side, and fingers giving trouble, I was hoping to confirm this automatically. Can anyone help?
This sort of thing points a way:
Z[20],z[1]=1,z[2]=1,z[3]=1,z[4]=1,z[5]=1,z[6]=1,z[7]=1,z[8]=1;
But really...
I don't know a straightforward way to do that; coeff seems to handle only a single variable at a time. But here's a way to get the list you want. The basic idea is to extract the terms of Z[20] as a list, and then evaluate each term with z[1] = 1, z[2] = 1, ..., z[20] = 1.
(%i1) display2d : false $
(%i2) Z[0] : 1 $
(%i3) Z[n] := expand (sum ((n - 1)!/i!*z[n - i]*Z[i], i, 0, n-1)) $
(%i4) z1 : makelist (z[i] = 1, i, 1, 20);
(%o4) [z[1] = 1,z[2] = 1,z[3] = 1,z[4] = 1,z[5] = 1,z[6] = 1,z[7] = 1, ...]
(%i5) a : args (Z[20]);
(%o5) [121645100408832000*z[20],128047474114560000*z[1]*z[19],
67580611338240000*z[2]*z[18],67580611338240000*z[1]^2*z[18],
47703960944640000*z[3]*z[17],71555941416960000*z[1]*z[2]*z[17], ...]
(%i6) a1 : ev (a, z1);
(%o6) [121645100408832000,128047474114560000,67580611338240000, ...]
(%i7) apply ("+", a1);
(%o7) 2432902008176640000
(%i8) 20!;
(%o8) 2432902008176640000
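The same check can be run outside Maxima: with every z[k] set to 1, the recurrence for Z[n] collapses to a purely numeric one, S[0] = 1 and S[n] = sum over i from 0 to n-1 of (n-1)!/i! * S[i], so the coefficient sum can be computed directly (a Python sketch of the identity, not part of the Maxima session):

```python
from math import factorial

# With every z[k] replaced by 1, Z[n]'s coefficient sum S[n] satisfies
# S[0] = 1, S[n] = sum_{i=0}^{n-1} (n-1)!/i! * S[i]
S = [1]
for n in range(1, 21):
    S.append(sum(factorial(n - 1) // factorial(i) * S[i] for i in range(n)))

# S[4] = 24 = 4!, and S[20] matches 20! = 2432902008176640000
```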