Is there a way to get the index of the matching tuple element when I use the endswith function?
e.g.:
lst = ('jpg','mp4','mp3')
b = "filemp4"
If I use b.endswith(lst), this gives me True or False,
but I need to find the index of the element in 'lst' that matches the end of the string in 'b'.
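One possible approach, sketched under the assumption that the first matching suffix is the one wanted: loop over the tuple with enumerate and test each suffix on its own. The helper name suffix_index is made up for this illustration and is not a built-in.

```python
lst = ('jpg', 'mp4', 'mp3')
b = "filemp4"

def suffix_index(text, suffixes):
    """Return the index of the first suffix that text ends with, or None."""
    # next() stops at the first match; None is the fallback when nothing matches.
    return next((i for i, suffix in enumerate(suffixes) if text.endswith(suffix)), None)

idx = suffix_index(b, lst)
print(idx)  # 1, because b ends with lst[1] == 'mp4'
```

If several suffixes could match (for example 'p4' and 'mp4'), this returns whichever comes first in the tuple, so the ordering of lst matters.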
I'm working with a dataframe of chemical formulas (str objects). Example
formula
Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...
I want to filter it based on specified elements. For example, I want to produce an output that only contains the formulas with the elements 'Na', 'Cl', 'Rb', so the desired output should be:
formula
Na0.2Cl0.4O0.7Rb1
What I've tried to do is the following:
for i, formula in enumerate(df['formula']):
    if ('Na' and 'Cl' and 'Rb' not in formula):
        df = df.drop(index=i)
but it does not seem to work.
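A likely reason the condition misbehaves: Python parses 'Na' and 'Cl' and 'Rb' not in formula as 'Na' and 'Cl' and ('Rb' not in formula). The non-empty strings 'Na' and 'Cl' are always truthy, so only 'Rb' is ever tested. Here is a minimal plain-Python sketch of the difference; the function names are made up for illustration, and the sample formulas are the ones from the question.

```python
formulas = ["Na0.2Cl0.4O0.7Rb1", "Hg0.04Mg0.2Ag2O4", "Rb0.2AgO"]
elements = ("Na", "Cl", "Rb")

def original_condition(formula):
    # Same expression as in the question: effectively just 'Rb' not in formula.
    return 'Na' and 'Cl' and 'Rb' not in formula

def missing_any(formula):
    # True when at least one of the required elements is absent.
    return not all(el in formula for el in elements)

# Rows that would survive the drop under each condition.
kept_original = [f for f in formulas if not original_condition(f)]
kept_fixed = [f for f in formulas if not missing_any(f)]

print(kept_original)  # ['Na0.2Cl0.4O0.7Rb1', 'Rb0.2AgO']
print(kept_fixed)     # ['Na0.2Cl0.4O0.7Rb1']
```

Note that el in formula is a plain substring test, so a symbol such as 'N' would also match inside 'Na'.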
You can use str.contains with an OR (|) pattern to keep rows that match at least one of several strings:
df[df['formula'].str.contains("Na|Cl|Rb", na=False)]
Or you can use a lookahead pattern with str.contains if you want to match all of them:
df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
Your requirements are unclear, but assuming you want to filter based on a set of elements, here are two options.
Keeping formulas where all elements from the set are used:
s = {'Na','Cl','Rb'}
regex = f'({"|".join(s)})'
mask = (
    df['formula']
    .str.extractall(regex)[0]
    .groupby(level=0).nunique().eq(len(s))
)
df.loc[mask[mask].index]
output:
formula
0 Na0.2Cl0.4O0.7Rb1
Keeping formulas where only elements from the set are used:
s = {'Na','Cl','Rb'}
mask = (
    df['formula']
    .str.extractall('([A-Z][a-z]*)')[0]
    .isin(s)
    .groupby(level=0).all()
)
df[mask]
output: no rows for this dataset
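To see what the two masks are checking, the same logic can be sketched in plain Python with the standard re module. This is only an illustration: it assumes every element symbol is one uppercase letter followed by optional lowercase letters (as in the second regex above), it tokenizes the formula in both cases rather than searching for substrings, and the helper name elements_in is made up.

```python
import re

s = {'Na', 'Cl', 'Rb'}
formulas = ['Na0.2Cl0.4O0.7Rb1', 'Hg0.04Mg0.2Ag2O4', 'Rb0.2AgO']

def elements_in(formula):
    """Return the set of element symbols found in a formula string."""
    return set(re.findall(r'[A-Z][a-z]*', formula))

# "All elements from the set are used": s is a subset of the formula's elements.
uses_all = [f for f in formulas if s <= elements_in(f)]

# "Only elements from the set are used": the formula's elements are a subset of s.
uses_only = [f for f in formulas if elements_in(f) <= s]

print(uses_all)   # ['Na0.2Cl0.4O0.7Rb1']
print(uses_only)  # [] because every formula also contains O
```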
I am looking to return the top 5% of responses in a column using pandas. So, for col_1, basically, I want a list of the responses that make up at least 5% of the responses in that column.
The following returns the list of ALL responses in the col_1 that meet the condition, as well as those that do not (returns boolean True and False):
df['col_1'].value_counts(normalize = True) >= .05
While this is somewhat helpful, I would like to return ONLY those that evaluate to true. Should I use a dictionary and loop? If so, how do I signal that I am using value_counts(normalize = True) >= .05 to append to that dictionary?
Thank you for your help!
If you need to filter, use boolean indexing:
s = df['col_1'].value_counts(normalize = True)
L = s.index[s >= .05].tolist()
print(L)
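For a quick check, here is the same idea on a small made-up column (the values 'a', 'b', 'c' and their counts are invented for the example). Indexing the Series itself with the mask keeps the shares alongside the labels, which may be handy if the proportions are needed too.

```python
import pandas as pd

# Hypothetical data: 'a' is 60% of responses, 'b' 37%, 'c' 3%.
df = pd.DataFrame({"col_1": ["a"] * 60 + ["b"] * 37 + ["c"] * 3})

shares = df["col_1"].value_counts(normalize=True)

# Boolean indexing keeps only the entries where the condition is True.
frequent = shares[shares >= 0.05]
labels = frequent.index.tolist()

print(labels)  # ['a', 'b']
```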
I would like to add a column based on another column and fill it with all the values that do NOT contain "jpg"
so the negation of this:
filter(value.split(","), v, v.contains("jpg")).join("|")
How can I write "does not contain"?
contains gives a boolean output i.e. true or false. So we have:
v = "picture.jpg" -> v.contains("jpg") = TRUE
v = "picture.gif" -> v.contains("jpg") = FALSE
filter finds all values in an array that return TRUE for whatever condition you use in the filter. There are a couple of ways you could filter an array to find the values that don't contain a string, but if you are using contains, the simplest is probably to wrap the condition in not to reverse its result:
filter(value.split(","), v, not(v.contains("jpg"))).join("|")
I have the following dataframe:
df = pd.DataFrame(np.random.randn(4, 1), index=['mark13', 'luisgimenez', 'miguel72', 'luis34'],columns=['probability'])
probability
mark13 -1.054687
luisgimenez 0.081224
miguel72 -0.893619
luis34 -1.576941
I would like to remove the rows where the last character of the index string is not a digit.
The desired output would look something like this
(dropping the row whose index does not end with a digit):
probability
mark13 -1.054687
miguel72 -0.893619
luis34 -1.576941
I am sure boolean indexing is the direction I need to go in, but I do not know how to reference the last character of the index name.
# Use isdigit to check the last character of each index label, and use the result as a boolean mask to filter rows.
df[[e[-1].isdigit() for e in df.index]]
Out[496]:
probability
mark13 -0.111338
miguel72 0.548725
luis34 0.682949
You can use the str accessor to check if the last character is a number:
df[df.index.str[-1].str.isdigit()]
Out:
probability
mark13 -0.350466
miguel72 1.220434
luis34 -0.962123
I'm iterating through a dataframe (called hdf) and applying changes on a row by row basis. hdf is sorted by group_id and assigned a 1 through n rank on some criteria.
# Groupby function creates subset dataframes (a dataframe per distinct group_id).
grouped = hdf.groupby('group_id')
# Iterate through each subdataframe.
for name, group in grouped:
    # This grabs the top index for each subdataframe
    index1 = group[group['group_rank']==1].index
    # If criteria1 == 0, flag all rows for removal
    if(max(group['criteria1']) == 0):
        for x in range(rank1, rank1 + max(group['group_rank'])):
            hdf.loc[x,'remove_row'] = 1
I'm getting the following error:
TypeError: int() argument must be a string or a number, not 'Int64Index'
I get the same error when I try to cast rank1 explicitly:
rank1 = int(group[group['auction_rank']==1].index)
Can someone explain what is happening and provide an alternative?
The answer to your specific question is that index1 is an Int64Index (basically a list), even if it has one element. To get that one element, you can use index1[0].
But there are better ways of accomplishing your goal. If you want to remove all of the rows in the "bad" groups, you can use filter:
hdf = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
If you only want to remove certain rows within matching groups, you can write a function and then use apply:
def filter_group(group):
    if group['criteria1'].max() != 0:
        return group
    else:
        return group.loc[other criteria here]
hdf = hdf.groupby('group_id').apply(filter_group)
(If you really like your current way of doing things, you should know that loc will accept an index, not just an integer, so you could also do hdf.loc[group.index, 'remove_row'] = 1).
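As a rough illustration of the filter and loc suggestions, here is a small sketch on invented data. The column names follow the question, but the values are made up, and remove_row is initialised to 0 here as an assumption about how the flag column is meant to start out.

```python
import pandas as pd

# Hypothetical data: group 1 has criteria1 == 0 everywhere, group 2 does not.
hdf = pd.DataFrame({
    "group_id":   [1, 1, 2, 2, 2],
    "group_rank": [1, 2, 1, 2, 3],
    "criteria1":  [0, 0, 0, 5, 0],
})

# Option 1: drop every group whose maximum criteria1 is 0.
kept = hdf.groupby("group_id").filter(lambda group: group["criteria1"].max() != 0)

# Option 2: keep all rows but flag the "bad" groups, passing the whole
# group index to loc instead of looping over integer positions.
hdf["remove_row"] = 0
for name, group in hdf.groupby("group_id"):
    if group["criteria1"].max() == 0:
        hdf.loc[group.index, "remove_row"] = 1

# A single label can be pulled out of an index with [0].
first_label = hdf[hdf["group_rank"] == 1].index[0]

print(kept["group_id"].tolist())    # [2, 2, 2]
print(hdf["remove_row"].tolist())   # [1, 1, 0, 0, 0]
```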
Call tolist() on the Int64Index object. The resulting list can then be iterated as int values.
Simply add [0] to ensure you get the first value from the index:
rank1 = int(group[group['auction_rank']==1].index[0])