How to return ONLY the top 5% of responses in a column with pandas

I am looking to return the top 5% of responses in a column using pandas. So, for col_1, I basically want a list of the responses that make up at least 5% of the responses in that column.
The following returns a boolean for ALL responses in col_1, both those that meet the condition and those that do not (True and False):
df['col_1'].value_counts(normalize=True) >= .05
While this is somewhat helpful, I would like to return ONLY those that evaluate to True. Should I use a dictionary and a loop? If so, how do I signal that I am using value_counts(normalize=True) >= .05 to append to that dictionary?
Thank you for your help!

If you need to filter by boolean indexing:
s = df['col_1'].value_counts(normalize=True)
L = s.index[s >= .05].tolist()
print(L)
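For illustration, a minimal sketch with made-up data (the threshold keeps 'a' and 'b' and drops the one-off values):
import pandas as pd

# Hypothetical data: 'a' and 'b' each make up well over 5% of the
# 30 responses; 'c' and 'd' appear once each (~3.3%).
df = pd.DataFrame({'col_1': ['a'] * 20 + ['b'] * 8 + ['c', 'd']})

s = df['col_1'].value_counts(normalize=True)
L = s.index[s >= .05].tolist()
print(L)  # ['a', 'b']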


Filter out entries of datasets based on string matching

I'm working with a dataframe of chemical formulas (str objects). Example
formula
Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...
I want to filter it based on specified elements. For example, I want to produce an output that only contains the elements 'Na', 'Cl' and 'Rb'; the desired output should therefore be:
formula
Na0.2Cl0.4O0.7Rb1
What I've tried to do is the following:
for i, formula in enumerate(df['formula']):
    if ('Na' and 'Cl' and 'Rb' not in formula):
        df = df.drop(index=i)
but it seems not to work.
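Two things go wrong here: dropping rows while iterating shifts the positions being enumerated, and, more importantly, the condition does not test each element. Python parses it as 'Na' and 'Cl' and ('Rb' not in formula); non-empty strings are truthy, so the whole expression reduces to 'Rb' not in formula. A quick demonstration:
formula = "Rb0.2AgO"  # has Rb, but no Na or Cl

# The chained `and` only runs the last membership test, so this row
# would be kept even though it lacks Na and Cl:
print('Na' and 'Cl' and 'Rb' not in formula)  # False

# Each element needs its own membership test:
print(not ('Na' in formula and 'Cl' in formula and 'Rb' in formula))  # True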
You can use str.contains with an OR (|) pattern to match rows containing at least one of the elements:
df[df['formula'].str.contains("Na|Cl|Rb", na=False)]
Or you can use a lookahead pattern with contains if you want to match all of them:
df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
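To see what each pattern keeps, here is a sketch run against the question's sample rows:
import pandas as pd

df = pd.DataFrame({'formula': ['Na0.2Cl0.4O0.7Rb1',
                               'Hg0.04Mg0.2Ag2O4',
                               'Rb0.2AgO']})

# OR pattern: any row containing at least one of Na, Cl, Rb.
print(df[df['formula'].str.contains("Na|Cl|Rb", na=False)])
#              formula
# 0  Na0.2Cl0.4O0.7Rb1
# 2           Rb0.2AgO

# Lookahead pattern: only rows containing all three.
print(df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')])
#              formula
# 0  Na0.2Cl0.4O0.7Rb1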
Your requirements are unclear, but assuming you want to filter based on a set of elements.
Keeping formulas where all elements from the set are used:
s = {'Na','Cl','Rb'}
regex = f'({"|".join(s)})'
mask = (df['formula']
        .str.extractall(regex)[0]
        .groupby(level=0).nunique().eq(len(s))
       )
df.loc[mask[mask].index]
output:
formula
0 Na0.2Cl0.4O0.7Rb1
Keeping formulas where only elements from the set are used:
s = {'Na','Cl','Rb'}
mask = (df['formula']
        .str.extractall('([A-Z][a-z]*)')[0]
        .isin(s)
        .groupby(level=0).all()
       )
df[mask]
output: no rows for this dataset

df.groupby() giving me wrong total calculations (pandas, numpy)

So I was just checking the final results after grouping, but the column sums are not matching. Here's my code (the last logical statement is failing even though the sums should be the same):
dfM = pd.concat([df1, df2])
dfM_V = sum(dfM['SumOfPCS'])
A = ['SOLDTO', 'PICKUP', 'ORIZIP3', 'ORIGINFACILITYCODE', 'PRODUCT_ID',
     'ACTUALRECIPIENTCOUNTRY', 'LB_BRK', 'COUNTRY', 'MANIFESTEDDSPPRODUCT']
V = ['SumOfPCS', 'SumOfLBS']
dfM2 = dfM.groupby(A).agg([np.sum])[V]
dfM2 = dfM2.reset_index()
dfM2.columns = dfM2.columns.get_level_values(0)
dfM2_V = sum(dfM2['SumOfPCS'])
print(dfM2_V == dfM_V)
By the way, A + V == list(dfM.columns), and there are no empty rows or cells in the dataset. (When I do the exact same grouping in MS Access, the logical condition tested at the end is met, so there's nothing inherently wrong with the dataset.)
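A classic cause of this symptom is NaN in one of the grouping columns: groupby drops rows whose key is NaN by default, so their SumOfPCS silently disappears from the grouped total (Access, by contrast, groups nulls like any other value). A hedged diagnostic sketch, reusing the names above:
# Check whether any grouping column contains NaN:
print(dfM[A].isna().any())

# If so, keep NaN keys in the aggregation (dropna= requires pandas >= 1.1):
dfM2 = dfM.groupby(A, dropna=False).agg([np.sum])[V]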

How to return different rows from the same table on single query as a list

I have a table which has a boolean column; this column is used to filter some responses. I need to return a response as a tuple like {claimed, unclaimed} (imagine the table is called winnings).
While working on it I've done two separate queries to return claimed then unclaimed rows and manually constructed the response; then I went with returning all rows without checking the boolean column and splitting them outside of the query. Now I'm wondering if there's a way to run a single query on the same table and return both claimed and unclaimed as separate results, mainly for performance, hoping it runs better. I've tried doing it with joins, but it's returning a list of two-item tuples like:
[{claimed, unclaimed}, {claimed, unclaimed}]...
While I want:
{claimed, unclaimed}
# OR
[{claimed, unclaimed}]
At most; no more tuples. Note that I'm not running raw queries but using a library, so excuse me if the terminology is not right.
This is the last query I ran:
SELECT w0."claimed", w1."claimed"
FROM "winnings" AS w0
INNER JOIN "winnings" AS w1 ON TRUE
WHERE (w0."claimed" AND NOT (w1."claimed"))
LIMIT 10;
EDIT: More details.
When I run the query from above this is the result I get:
=> SELECT w0."claimed", w1."claimed" FROM "winnings" AS w0 INNER JOIN "winnings" AS w1 ON TRUE WHERE (w0."claimed" AND NOT (w1."claimed")) LIMIT 10;
 claimed | claimed
---------+---------
 t       | f
 t       | f
 t       | f
 t       | f
 t       | f
 t       | f
 t       | f
 t       | f
 t       | f
 t       | f
(10 rows)
This is converted to the following on Elixir which is the language I'm using:
[
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false
]
This is a keyword list, which internally is a list of tuples such as [{true, false}, {true, false}] - what I want is: [{[true, true], [false, false]}]
That means I want two lists, each with its respective rows: only claimed in one and only unclaimed in the other.
I don't really mind the output type as long as it includes the two lists with their rows as described.
To get the first column from a list of rows, you can use Enum.map/2 to get the first element of each tuple:
Enum.map(rows, &elem(&1, 0))
If you're a newcomer to elixir, the & syntax may be a bit confusing. That code is shorthand for
Enum.map(rows, fn field -> elem(field, 0) end)
You could make that into a function that does that for all your columns like this:
def columnize(rows = [first_row | _]) when is_tuple(first_row) do
  for column <- 1..tuple_size(first_row), do: Enum.map(rows, &elem(&1, column - 1))
end

def columnize([]) do
  []
end
The = [first_row | _] pattern match guarantees the argument is a list with at least one element (binding its head to first_row, like hd/1 would); the when is_tuple(first_row) guard assures at least the first row is a tuple.
If you'd like to know more about the for, it's a comprehension.
Note that this function assumes all rows have the same number of columns and are tuples (which should be true in the results of a query; but may not be in other cases), and will throw an ArgumentError if the first row is wider than any other row. It will also cause errors if other elements of the list are not tuples.
This feels very unidiomatic though. Is this maybe an XY problem?

negative of "contains" in OpenRefine

I would like to add a column based on another column and fill it with all the values that do NOT contain "jpg" - so, the negation of this:
filter(value.split(","), v, v.contains("jpg")).join("|")
How can I write "does not contain"?
contains gives a boolean output, i.e. true or false. So we have:
v = "picture.jpg" -> v.contains("jpg") = TRUE
v = "picture.gif" -> v.contains("jpg") = FALSE
filter finds all the values in an array which return TRUE for whatever condition you use in the filter. There are a couple of ways you could filter an array to find the values that don't contain a string, but with contains the simplest is probably to use not() to reverse the result of your condition:
filter(value.split(","), v, not(v.contains("jpg"))).join("|")

convert Int64Index to Int

I'm iterating through a dataframe (called hdf) and applying changes on a row-by-row basis. hdf is sorted by group_id, and each row is assigned a 1-through-n rank on some criteria.
# Groupby creates a subset dataframe per distinct group_id.
grouped = hdf.groupby('group_id')

# Iterate through each subdataframe.
for name, group in grouped:
    # This grabs the top index for each subdataframe.
    index1 = group[group['group_rank'] == 1].index
    # If criteria1 == 0, flag all rows for removal.
    if max(group['criteria1']) == 0:
        for x in range(rank1, rank1 + max(group['group_rank'])):
            hdf.loc[x, 'remove_row'] = 1
I'm getting the following error:
TypeError: int() argument must be a string or a number, not 'Int64Index'
I get the same error when I try to cast rank1 explicitly:
rank1 = int(group[group['auction_rank']==1].index)
Can someone explain what is happening and provide an alternative?
The answer to your specific question is that index1 is an Int64Index (basically a list), even when it has just one element. To get that one element, you can use index1[0].
But there are better ways of accomplishing your goal. If you want to remove all of the rows in the "bad" groups, you can use filter:
hdf = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
If you only want to remove certain rows within matching groups, you can write a function and then use apply:
def filter_group(group):
    if group['criteria1'].max() != 0:
        return group
    else:
        return group.loc[other criteria here]

hdf = hdf.groupby('group_id').apply(filter_group)
(If you really like your current way of doing things, you should know that loc will accept an index, not just an integer, so you could also do hdf.loc[group.index, 'remove_row'] = 1).
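A toy run of the filter approach, with made-up columns matching the question's names, might look like this:
import pandas as pd

hdf = pd.DataFrame({
    'group_id':   [1, 1, 2, 2],
    'group_rank': [1, 2, 1, 2],
    'criteria1':  [0, 0, 5, 3],
})

# Group 1 has criteria1 == 0 everywhere, so filter() drops it entirely.
kept = hdf.groupby('group_id').filter(lambda g: g['criteria1'].max() != 0)
print(kept)
#    group_id  group_rank  criteria1
# 2         2           1          5
# 3         2           2          3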
Call tolist() on the Int64Index object; the list can then be iterated as int values.
Alternatively, simply add [0] to ensure you get the first value from the index:
rank1 = int(group[group['auction_rank']==1].index[0])