Given the following dataframe:
df = pd.DataFrame({'A': ["EQ", "CB", "CB", "FF", "EQ", "EQ", "CB", "CB"],
                   'B': ["ANT", "ANT", "DQ", "DQ", "BQ", "VGQ", "GHB", "VGQ"]})
How can I keep the rows whose column B value appears with both "EQ" and "CB" in column A? For example, I would want to keep ANT because it exists for both EQ and CB, while DQ would be deleted. So the expected output would be:
out = pd.DataFrame({'A': ["EQ", "CB", "EQ", "CB"],
                    'B': ["ANT", "ANT", "VGQ", "VGQ"]})
Thanks!
Let's try groupby().filter():
s = df.groupby('B').filter(lambda x: pd.Series(['EQ', 'CB']).isin(x['A']).all())
Out[7]:
    A    B
0  EQ  ANT
1  CB  ANT
5  EQ  VGQ
7  CB  VGQ
Then keep only the rows whose A value is 'EQ' or 'CB':
s = s[s.A.isin(['EQ', 'CB'])]
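Putting both steps together, a minimal runnable sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': ["EQ", "CB", "CB", "FF", "EQ", "EQ", "CB", "CB"],
                   'B': ["ANT", "ANT", "DQ", "DQ", "BQ", "VGQ", "GHB", "VGQ"]})

# Keep only the groups of B in which both 'EQ' and 'CB' appear in A
s = df.groupby('B').filter(lambda x: pd.Series(['EQ', 'CB']).isin(x['A']).all())

# Drop any remaining rows whose A value is neither 'EQ' nor 'CB'
s = s[s.A.isin(['EQ', 'CB'])]
print(s)
```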
Here is a solution that doesn't use groupby(), if you want code that may be easier to reason about:
equities = df.B[df.A == 'EQ']
bonds = df.B[df.A == 'CB']
both = equities[equities.isin(bonds)]
That gives you:
0 ANT
5 VGQ
Which makes the last part easy:
df[df.B.isin(both)]
Out:
A B
0 EQ ANT
1 CB ANT
5 EQ VGQ
7 CB VGQ
On small data sets this is about 3x faster than groupby().filter().
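The same idea can also be written with plain Python sets, which makes the intersection explicit. This is a sketch of my own, not taken from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'A': ["EQ", "CB", "CB", "FF", "EQ", "EQ", "CB", "CB"],
                   'B': ["ANT", "ANT", "DQ", "DQ", "BQ", "VGQ", "GHB", "VGQ"]})

# B values that occur alongside 'EQ', intersected with those alongside 'CB'
both = set(df.B[df.A == 'EQ']) & set(df.B[df.A == 'CB'])

out = df[df.B.isin(both)]
print(out)
```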
Another way uses transform and boolean slicing:
m = df.groupby('B').A.transform(lambda x: (x.nunique() >= 2)
                                & (x.isin(['EQ', 'CB']).sum() >= 2))
df_final = df[m]
Out[623]:
    A    B
0  EQ  ANT
1  CB  ANT
5  EQ  VGQ
7  CB  VGQ
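For completeness, the transform-based mask can be checked end to end on the sample data (a quick runnable sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': ["EQ", "CB", "CB", "FF", "EQ", "EQ", "CB", "CB"],
                   'B': ["ANT", "ANT", "DQ", "DQ", "BQ", "VGQ", "GHB", "VGQ"]})

# True for rows whose B group contains at least two distinct A values,
# at least two of which are 'EQ' or 'CB'
m = df.groupby('B').A.transform(lambda x: (x.nunique() >= 2)
                                & (x.isin(['EQ', 'CB']).sum() >= 2))
df_final = df[m]
print(df_final)
```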
Thank you for your interest in this question.
I have the data as below.
a<- data.frame("Grade"=c(1, 2, 3, 4), "Prob"=c(0.01, 0.25, 0.45, 0.29))
b<- data.frame("Pot"= c(letters[1:18]))
Based on the code below, I'd like to make a function that draws 4 Grade numbers according to the Prob probabilities (replace=TRUE) and four random letters (replace=FALSE). For instance, one iteration of this loop may look like:
3 2 3 2 d f k g
1 3 4 2 a k r b
I'd like the function to return not only the draws in which every Grade result is lower than 3 and the four letters I selected all appear, but also the number of trials it took to get that result. So, if I want Pot to give "a", "b", "c", and "d", the result will look like:
Trial Grade Pot
15 3 2 1 3 a b c d
39 2 1 2 2 d b a c
2 3 3 3 3 d a b d
77 3 2 3 3 c d b a
I learned the code below thanks to a very kind person, but I can't edit it to get the results I hope to see. Can you please help me?
samplefun <- function(a) {
  c <- sample(a$Grade, size = 4, prob = a$Prob, replace = TRUE)
  res <- tibble(
    Trial = which(c < 3)[1],
    Result = c[which(c < 3)[1]]
  )
  res
}
nsamples <- 1000
x <- map_dfr(1:nsamples, ~ samplefun(a))
Thank you for reading this question.
Here's a solution to what I think you're after. I haven't specified a probability vector when sampling b$Pot, because you didn't give one that was 18 elements long in your question (see my comment).
library(tidyverse)
a<- data.frame(Grade =c(1, 2, 3, 4), Prob = c(0.01, 0.25, 0.45, 0.29))
b<- data.frame(Pot = letters[1:18])
chosenletters <- c("a", "b", "c", "d")
samplefun <- function(a, b, chosenletters) {
  ntrials <- 0
  repeat {
    grades <- sample(a$Grade, size = 4, prob = a$Prob, replace = TRUE)
    chars <- sample(b$Pot, size = 4, replace = FALSE)
    ntrials <- ntrials + 1
    if (all(grades < 4) & all(chars %in% chosenletters)) break
  }
  return( tibble(Trial = ntrials, Grade = list(grades), Letters = list(chars)) )
}
nsamples <- 5
res <- map_dfr(1:nsamples, ~ samplefun(a, b, chosenletters))
This dataframe res gives the correct Grades and Letters embedded in lists inside each dataframe cell, plus the trial at which the result was generated.
# A tibble: 5 x 3
Trial Grade Letters
<dbl> <list> <list>
1 20863 <dbl [4]> <fct [4]>
2 8755 <dbl [4]> <fct [4]>
3 15129 <dbl [4]> <fct [4]>
4 1033 <dbl [4]> <fct [4]>
5 5264 <dbl [4]> <fct [4]>
A better view of the nested lists:
> glimpse(res)
Rows: 5
Columns: 3
$ Trial <dbl> 20863, 8755, 15129, 1033, 5264
$ Grade <list> <3, 3, 3, 3>, <3, 2, 2, 2>, <3, 3, 2, 2>, <3, 3, 2, 3>, <3, 2, 3, 3>
$ Letters <list> <b, a, c, d>, <b, a, c, d>, <c, a, b, d>, <b, d, c, a>, <a, b, d, c>
I have a pandas data frame which is shown below:
>>> x = [[1,2,3,4,5],[1,2,4,4,3],[2,4,5,6,7]]
>>> columns = ['a','b','c','d','e']
>>> df = pd.DataFrame(data = x, columns = columns)
>>> df
a b c d e
0 1 2 3 4 5
1 1 2 4 4 3
2 2 4 5 6 7
I have an array of objects (conditions) as shown below:
[
{
'header' : 'a',
'condition' : '==',
'values' : [1]
},
{
'header' : 'b',
'condition' : '==',
'values' : [2]
},
...
]
and an assignHeader which is:
assignHeader = 'decision'
Now I want to do an operation that builds up all the conditions from the conditions array by looping through it, for example something like this:
pConditions = []
for eachCondition in conditions:
    header = eachCondition['header']
    values = eachCondition['values']
    if eachCondition['condition'] == "==":
        pConditions.append(df[header].isin(values))
    else:
        pConditions.append(~df[header].isin(values))

df[assignHeader] = and(pConditions)  # pseudocode: AND all the masks together
I was thinking of using the all operator in pandas but am unable to work out the right syntax. The conditions list can grow large and is dynamic, which is why I want this nested approach. Does anyone know a way to do this?
Final output:
conditions = [df['a'] == 1, df['b'] == 2]
>>> df['decision'] = (df['a']==1) & (df['b']==2)
>>> df
a b c d e decision
0 1 2 3 4 5 True
1 1 2 4 4 3 True
2 2 4 5 6 7 False
Here the conditions array will be variable. I want a function which takes df, newheadername, and the conditions as input and returns the same output as above, where newheadername = 'decision'.
I was able to solve the problem using the code shown below. I am not sure whether this is a fast way of getting things done, but I would love your input if you have anything to point out.
def andMerging(conditions, mergeHeader, df):
    if len(conditions) != 0:
        df[mergeHeader] = pd.concat(conditions, axis=1).all(axis=1)
    return df
where conditions is a list of boolean pd.Series. The conditions are built as shown below:
def prepareForConditionMerging(conditionsArray, df):
    conditions = []
    for prop in conditionsArray:
        condition = prop['condition']
        values = prop['values']
        header = prop['header']
        if type(values) == str:
            values = [values]
        if condition == "==":
            conditions.append(df[header].isin(values))
        else:
            conditions.append(~df[header].isin(values))
        # More conditions such as greater-than / less-than can be added here.
    return conditions
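Putting the two helpers together on the example frame from the question, a runnable sketch (the condition objects are the ones shown above):

```python
import pandas as pd

def prepareForConditionMerging(conditionsArray, df):
    # Turn each condition object into a boolean mask over df
    conditions = []
    for prop in conditionsArray:
        values = prop['values']
        if isinstance(values, str):
            values = [values]
        mask = df[prop['header']].isin(values)
        conditions.append(mask if prop['condition'] == "==" else ~mask)
    return conditions

def andMerging(conditions, mergeHeader, df):
    # AND all masks column-wise and store the result in a new column
    if len(conditions) != 0:
        df[mergeHeader] = pd.concat(conditions, axis=1).all(axis=1)
    return df

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 2, 4, 4, 3], [2, 4, 5, 6, 7]],
                  columns=['a', 'b', 'c', 'd', 'e'])
conds = [{'header': 'a', 'condition': '==', 'values': [1]},
         {'header': 'b', 'condition': '==', 'values': [2]}]
df = andMerging(prepareForConditionMerging(conds, df), 'decision', df)
print(df)
```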
I have a table that looks like this:
A B C
1 foo
2 foobar blah
3
I want to count up the non empty columns from A, B and C to get a summary column like this:
A B C sum
1 foo 1
2 foobar blah 2
3 0
Here is how I'm trying to do it:
import pandas as pd
df = { 'A' : ["foo", "foobar", ""],
'B' : ["", "blah", ""],
'C' : ["","",""]}
df = pd.DataFrame(df)
print(df)
df['sum'] = df[['A', 'B', 'C']].notnull().sum(axis=1)
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
These last two lines are different ways to get what I want but they aren't working. Any suggestions?
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
Worked. Thanks for the assistance.
This one-liner worked for me :)
import numpy as np
df["sum"] = df.replace("", np.nan).T.count().reset_index().iloc[:, 1]
My dataframe:
Terrain
M1
M2
F
G
S
B1
B2
I want to add another column Terrain_Type and assign values as follows: if Terrain is M1, M2, B1, or B2, Terrain_Type should be Composite; if Terrain is S, Terrain_Type should be Sod; and for F and G I would like to assign Gravel.
I have tried this code:
data['Terrain_Type'] = data['Terrain'].map({['M1','M2','B1','B2']:'Composite', 'S':'Sod',['F','G']:'Gravel'})
But it didn't work. Could anyone suggest how to fix this error in my code?
You need to map with a valid dictionary, and in what you have, you are using a list as a key, which is not hashable. So let's suppose the mapping is defined like this:
import pandas as pd
data = pd.DataFrame({'Terrain':['M1','M2','F','G','S','B1','B2']})
d = {'Composite':['M1','M2','B1','B2'],'Sod':['S'],'Gravel':['F','G']}
We can create a reverse of this, which maps the terrain to the type:
new_dic = {}
for k, v in d.items():
    for x in v:
        new_dic[x] = k
new_dic
{'M1': 'Composite',
'M2': 'Composite',
'B1': 'Composite',
'B2': 'Composite',
'S': 'Sod',
'F': 'Gravel',
'G': 'Gravel'}
Then this will work:
data["Terrain_Type"] = data["Terrain"].map(new_dic)
data
  Terrain Terrain_Type
0      M1    Composite
1      M2    Composite
2       F       Gravel
3       G       Gravel
4       S          Sod
5      B1    Composite
6      B2    Composite
Another option is to build the mapping with dict.fromkeys:
L1 = ['M1','M2','B1','B2']
d1 = dict.fromkeys(L1, 'Composite')
L2 = ['F','G']
d2 = dict.fromkeys(L2, 'Gravel')
L3 = ['S']
d3 = dict.fromkeys(L3, 'Sod')
d = {**d1, **d2, **d3}
Map:
df['Terrain_Type'] = df['Terrain'].map(d)
Output:
Terrain Terrain_Type
0 M1 Composite
1 M2 Composite
2 F Gravel
3 G Gravel
4 S Sod
5 B1 Composite
6 B2 Composite
I believe the following will work for you :)
def get_terrain_type(row):
    if row in ["M1", "M2", "B1", "B2"]:
        return "Composite"
    elif row == "S":
        return "Sod"
    else:
        return "Gravel"

data["Terrain_Type"] = data["Terrain"].map(get_terrain_type)
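As an alternative sketch (not used in any of the answers above), numpy's select expresses the same three-way mapping without building a dictionary or a helper function:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Terrain': ['M1', 'M2', 'F', 'G', 'S', 'B1', 'B2']})

# Each condition maps to the choice at the same position; everything else
# falls through to the default
conditions = [data['Terrain'].isin(['M1', 'M2', 'B1', 'B2']),
              data['Terrain'].eq('S')]
data['Terrain_Type'] = np.select(conditions, ['Composite', 'Sod'], default='Gravel')
print(data)
```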
I would like to select a subset of a dataframe that satisfies multiple conditions on multiple columns. I know I could do this sequentially -- first selecting the subset that matches the first condition, then the portion of those that match the second, etc. -- but it seems like it should be possible in a single step. The following seems like it should work, but doesn't. Apparently it does work like this in other languages' implementations of DataFrame. Any thoughts?
using DataFrames
df = DataFrame()
df[:A]=[ 1, 3, 4, 7, 9]
df[:B]=[ "a", "c", "c", "D", "c"]
df[(df[:A].<5)&&(df[:B].=="c"),:]
type: non-boolean (DataArray{Bool,1}) used in boolean context
while loading In[18], in expression starting on line 5
This is a Julia thing, not so much a DataFrame thing: you want & instead of &&. For example:
julia> [true, true] && [false, true]
ERROR: TypeError: non-boolean (Array{Bool,1}) used in boolean context
julia> [true, true] & [false, true]
2-element Array{Bool,1}:
false
true
julia> df[(df[:A].<5)&(df[:B].=="c"),:]
2x2 DataFrames.DataFrame
| Row | A | B |
|-----|---|-----|
| 1 | 3 | "c" |
| 2 | 4 | "c" |
FWIW, this works the same way in pandas in Python:
>>> df[(df.A < 5) & (df.B == "c")]
A B
1 3 c
2 4 c
I have the same problem now as jwimberley (https://stackoverflow.com/users/5526072/jwimberley), which occurred on my update from Julia 0.5 to 0.6, now using DataFrames v0.10.1.
Update: I made the following change to fix:
r[(r[:l] .== l) & (r[:w] .== w), :] # julia 0.5
r[.&(r[:l] .== l, r[:w] .== w), :] # julia 0.6
but this gets very slow with long chains (time taken proportional to 2^chains),
so maybe Query is the better way now:
# r is a dataframe
using Query
q1 = @from i in r begin
    @where i.l == l && i.w == w && i.nl == nl && i.lt == lt &&
           i.vz == vz && i.vw == vw && i.vδ == vδ &&
           i.ζx == ζx && i.ζy == ζy && i.ζδx == ζδx
    @select {absu=i.absu, i.dBU}
    @collect DataFrame
end
for example. This is fast. It's in the DataFrames documentation.