how to sum rows in my dataframe Pandas with specific condition? - pandas

Could anyone help me ?
I want to sum the values with the format:
print (...+....+)
for example:
a b
France 2
Italie 15
Croatie 7
I want to make the sum of France and Croatie.
Thank you for your help !

One of possible solutions:
set column a as the index,
using loc select rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note double square brackets. The outer pair is the "container" of index values
passed to loc.
The inner part, and what is inside, is a list of values.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()

Related

Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the column weight, but the obvious 10,000 is hindering the actual mean. In this situation I cannot change the value but must work around it, this is what I've got so far, but obviously it's including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
my dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity' so how do I go about the mean while working around that value?
Since you added SQL in your tags, In SQL you'd want to exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10,000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace your value with a NAN and skip it in the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter the values above threshold and use the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334

Pandas Sequebtial Count of members within a group and the sum

If I want to have a sequential count within a group I can do something like
df['GID'] = df.groupby(['G_COL1','G_COL2]).cumcount()
I cannot however figure out how to generate a column that contains the total number of values within the group. So if the group had three members df['GID'] would contain 0,1 & 2 and df['COUNT'] would contain the value 3 for each of the three members
df["count_zeros"] = pd.DataFrame((df["GID"]==0)).cumsum()
df["COUNT"] = df.groupby("count_zeros").transform(lambda x: len(x))["GID"]
I think the above gives what you want. The GID column starts from zero whenever a new group starts taking place and then we count how many zeros, i.e. new group "starts" we have with len.
As Scott Boston, commented,
df["COUNT"] = df.groupby("count_zeros")['GID'].transform('count')
works and looks great :)

Need explanation on how pandas.drop is working here

I have a data frame, lets say xyz. I have written code to find out the % of null values each column possess in the dataframe. my code below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
let say i got following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but , i did not understand how does this code works, can anyone please explain it?
the code is doing the following:
xyz.drop( [...], 1)
removes the specified elements for a given axis, either by row or by column. In this particular case, df.drop( ..., 1) means you're dropping by axis 1, i.e, column
xyz.loc[:, ... ].columns
will return a list with the column names resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction is counting the number of nulls, adding them up and normalizing by the number of rows, effectively computing the percentage of nan in each column. Then, the amount is rounded to have only 2 decimal positions and finally you return True is the number of nan is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you're first producing a Boolean array that marks which columns have more than 70% nan, then, using .loc you use Boolean indexing to look only at the columns you want to drop ( nan % > 70%), then using .columns you recover the name of such columns, which then are used by the .drop instruction.
Hopefully this clear things up!
If you code is hard to understand , you can just check dropna with thresh, since pandas already cover this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))

Mark accumulated values on a QlikView column if condition is fulfilled

I have a table in Qlikview with 2 columns:
A B
a 10
b 45
c 30
d 15
Based on this table, I have a formula with full acumulation defined as:
SUM(a)/SUM(TOTAL a)
As a result,
A B D
b 45 45/100=0.45
c 30 75/100=0.75
d 15 90/100=0.90
a 10 100/100=1
My question is. how do I mark in colour the values in column A that have on column D <=0.8)?
The challenge is that D is defined with full accumulation, but if I reference D in a formula, it doesn't consider the full accumulation!
I tried with defining a formula E=if(D>0.8,'Y','N') but this formula doesn't take the visible (accumulated) value for D unfortunately, instead it takes the D with no accumulation. If this worked, I would have tried to hide (not disable) E and reference it from the dimensions column of the table , Text colour option. Any ideas please?? Thanks
You can't get an expression column's value from within a dimension or it's properties, because the expression columns rely on the dimensions provided. It would create an endless loop. Your options are:
Apply your background colour to the expression columns, not the dimensions. This would actually make more sense as the accumulated values would have the colour, not the dimension.
When loading this specific table, have QlikView create a new column that contains the accumulated values of B. This would mean, however, that the order of your chart-table would need to be fixed for the accumulations to make any sense.
Use aggregation to create a temporary table and accumulate the values using RangeSum(). Note this will only accumulate properly if the table is ordered in Ascending order of Column A
=IF(Aggr(RangeSum(Above(Sum(B),0,10)),A)/100>0.8,
rgb(0,0,0),
rgb(255,0,0)
)

Add values from a column when two other columns match

I have an ecology data table with about 12,000 rows. There are three columns: site, species, and value. I need to add up the values for each set of matching site and species - for example, all "red maple" values at "site A". I have the data sorted by site and species, so I can do it by hand, but it's slow going. The number of site/species matches varies, so I can't just add up the values in sets of three or anything.
Similar types of questions have talked about pivot tables, but none have needed to match two columns and add a third column, and I haven't been able to figure out how to extrapolate to my situation.
I'm reasonably comfortable coding and would like to do something that looks like this pseudocode, but I'm not clear on the syntax in VBA:
For each row
if a(x) = a(x+1) and b(x) = b(x+1) then
sum = sum + c(x)
else
d(x) = sum
sum = 0
next
Any ideas?
In a PivotTable, put site in Row Labels and species in Column Labels (or vice versa) and Sum of value in Σ Values: