Count rows across certain columns in a dataframe if they are greater than another value, grouped by another column - pandas

I have a dataframe:
import pandas as pd

df = pd.DataFrame({
    'BU': ['Total', 'Total', 'Total', 'CRS', 'CRS', 'CRS'],
    'Line_Item': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '1Q16': [100, 120, 0, 200, 190, 210],
    '2Q16': [100, 0, 130, 200, 190, 210],
    '3Q16': [200, 250, 0, 120, 0, 190]})
I wish to count the number of rows in 1Q16, 2Q16, and 3Q16, by "BU", that are greater than zero. As I was just shown, to count such rows in 1Q16, 2Q16, and 3Q16 I can use:
cols = ['1Q16','2Q16','3Q16']
df[cols].gt(0).sum()
In addition, I want to group the counts by BU.

With the samples you have shown, try the following.
cols = ['1Q16','2Q16','3Q16']
df[cols].gt(0).groupby(df['BU']).sum()
Output will be as follows:
       1Q16  2Q16  3Q16
BU
CRS     3.0   3.0   2.0
Total   2.0   2.0   2.0
Explanation: a detailed explanation of the above follows.
Create the cols list holding the names of the columns we want to operate on.
Use the gt function to flag the values greater than 0 in the mentioned cols.
Then use groupby, passing df['BU'], to group those flags by the BU column.
Finally apply sum to count the values greater than 0 in each group.
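For reference, a self-contained sketch (not from the original answer) that reaches the same counts by grouping first and counting inside agg:
import pandas as pd

df = pd.DataFrame({
    'BU': ['Total', 'Total', 'Total', 'CRS', 'CRS', 'CRS'],
    'Line_Item': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '1Q16': [100, 120, 0, 200, 190, 210],
    '2Q16': [100, 0, 130, 200, 190, 210],
    '3Q16': [200, 250, 0, 120, 0, 190]})
cols = ['1Q16', '2Q16', '3Q16']
# Group by BU first, then count the positive entries per column within each group.
counts = df.groupby('BU')[cols].agg(lambda s: s.gt(0).sum())
print(counts)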

Related

Compute moving average on a dynamic rolling window

OK, need some help here! I have the following dataframe.
import pandas as pd

df2 = {'Value': [123, 126, 120, 121, 123, 126, 120, 121, 123, 126],
       'Look-back': [2, 3, 4, 5, 3, 6, 2, 4, 2, 1]}
df2 = pd.DataFrame(df2)
df2
I'd like to add a third column that shows the simple moving average of the 'Value' column, with the rolling look-back period taken from the 'Look-back' column. My thought was to do this:
df2['Average'] = df2['Value'].rolling(df2['Look-back']).mean()
Of course this doesn't work because the rolling() function needs an integer key value and I'm supplying a series.
How do I get what I'm after here?
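No answer is attached to this question in the extract. A minimal sketch of one approach (my own, assuming each window ends at and includes the current row): slice per row instead of using rolling, which sidesteps the integer-window requirement.
import pandas as pd

df2 = pd.DataFrame({'Value': [123, 126, 120, 121, 123, 126, 120, 121, 123, 126],
                    'Look-back': [2, 3, 4, 5, 3, 6, 2, 4, 2, 1]})
# For each row, average the last `Look-back` values up to and including it.
df2['Average'] = [
    df2['Value'].iloc[max(0, i - lb + 1): i + 1].mean()
    for i, lb in enumerate(df2['Look-back'])
]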

Pandas columns by given value in last row

Below is my dataframe "df", made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index.
Now, my target is to consider the last row (Date = 3 February 2021). I want to plot ONLY those columns (stock pairs) that have a positive return on the last Date.
I started with:
n = list()
for i in range(len(df.columns)):
    if df.iloc[-1, i] > 0:
        n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, the final step is to create a subset dataframe of 'df' containing only the columns corresponding to the numbers in this list. This is where I have problems. Do you have any idea? Thanks
Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
    if df.iloc[-1, i] > 0:
        n.append(col)
df[n]
Here you are ;)
sample df:
            a    b    c
date
2017-04-01  0.5 -0.7 -0.6
2017-04-02  1.0  1.0  1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
              c
date
2017-04-02  1.3
The logic will be the same for your case.
If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.
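As an aside (not from the thread), the list-building loop can be skipped entirely by using the last row itself as a boolean mask over the columns; a sketch:
# Select every column whose value in the last row is positive.
df_subset = df.loc[:, df.iloc[-1] > 0].copy()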

How to calculate average of values of a column for a particular value in another column?

I have a data frame that looks like this.
How can I get the average doc/duration for each window into another data frame?
I need it in the following way
The dataframe should contain only one column, i.e. mean. If there are 3000 windows, there should be 3000 rows along axis 0, one per window, and mean should hold the average value. If a particular window is not present in the initial data frame, the corresponding value for that window needs to be 0.
Use .groupby() method and then compute the mean:
import pandas as pd
df = pd.DataFrame({'10s_windows': [304, 374, 374, 374, 374, 3236, 3237, 3237, 3237],
                   'doc/duration': [0.1, 0.1, 0.2, 0.2, 0.12, 0.34, 0.32, 0.44, 0.2]})
new_df = df.groupby('10s_windows').mean()
Which results in:
             doc/duration
10s_windows
304                 0.100
374                 0.155
3236                0.340
3237                0.320
Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
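The answer above doesn't cover the requirement that windows absent from the initial frame appear with a 0. A sketch (my own, assuming window IDs are meant to be the consecutive integers 0..max) using reindex:
# Extend the index to the full window range; windows with no rows get 0.
new_df = new_df.reindex(range(new_df.index.max() + 1), fill_value=0)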

reindex group to add missing rows

I am trying to reindex groups to extend dataframes with missing values, much like resample does for time indexes, but for plain integer values.
So, for a group belonging to a certain group key (proID in my case), the maximum existing integer value shall be determined (specifying the end point of the resampling process). The group shall then be extended (I was trying to achieve this with reindex) by the missing values up to that integer value.
I have a dataframe with many rows per proID, an integer bin value which can range from 0 to 100, and some meaningless columns. Basically, the bin values shall be filled in where some are missing, similar to what resample does for time indexes.
import numpy as np
import pandas as pd

def rsmpint(df):
    mx = df.bin.max()  # identify the maximal existing bin value in the dataframe (group)
    no = (mx * 20 / 100).astype(np.int64) + 1  # calculate the number of bin values
    idx = pd.Index(np.linspace(0, mx, no), name='bin')  # define the full bin index for the df (group)
    df.set_index('bin').reindex(idx).ffill().reset_index(drop=True, inplace=True)
    return df

DF.groupby('proID').apply(rsmpint)
Let's assume that for a specific proID there are currently 5 bin values, [0, 15, 20, 40, 65] (i.e. 5 rows in the original proID group). The output shall be an extended proID group with bin values [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65], with the content of the "meaningless" columns filled using ffill().
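No answer is attached to this question in the extract. One note on the attempt above: the chained reset_index(drop=True, inplace=True) returns None and discards the reindexed frame, so each group comes back unchanged. A sketch of a fix (my own, assuming the bins fall on a grid with a step of 5):
import numpy as np
import pandas as pd

def rsmpint(df):
    mx = df['bin'].max()  # maximal existing bin value in the group
    idx = pd.Index(np.arange(0, mx + 1, 5), name='bin')  # full bin grid in steps of 5
    # Reindex onto the full grid and forward-fill the other columns.
    return df.set_index('bin').reindex(idx).ffill().reset_index()

out = DF.groupby('proID', group_keys=False).apply(rsmpint)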

Find subgroups of a numpy array

I have a numpy array like this one:
import numpy as np

A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
              10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
              20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
              28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
              36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
              43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
I would like to write a small script that finds the subgroups of values of the array for which the difference is smaller than a certain threshold, let's say 3, and that returns the highest value of each subgroup. In the case of the A array the output should be:
A_out =([250,3017,5680,8258,10757,13179,...])
Is there a numpy function for that?
Here's a vectorized NumPy approach.
First, the data (in a numpy array) and the threshold:
In [41]: A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
    ...:               10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
    ...:               20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
    ...:               28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
    ...:               36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
    ...:               43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
In [42]: threshold = 3
The following produces the array delta. It is almost the same as delta = np.diff(A), but I want to include one more value that is greater than the threshold at the end of delta.
In [43]: delta = np.hstack((np.diff(A), threshold + 1))
Now the group maxima are simply A[delta > threshold]:
In [46]: A[delta > threshold]
Out[46]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
Or, if you want, A[delta >= threshold]. That gives the same result for this example:
In [47]: A[delta >= threshold]
Out[47]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
There is a case where this answer differs from @DrV's answer. From your description, it isn't clear to me how a set of values such as 1, 2, 3, 4, 5, 6 should be handled. The consecutive differences are all 1, but the difference between the first and last is 5. The numpy calculation above will treat these as a single group. @DrV's answer will create two groups.
Interpretation 1: The value of an item in a group must not differ more than 3 units from that of the first item of the group
This is one of the things where NumPy's capabilities are at their limits. As you will have to iterate through the list, I suggest a pure Python approach:
first_of_group = A[0]
previous = A[0]
group_lasts = []
for a in A[1:]:
    # if this item no longer belongs to the group
    if abs(a - first_of_group) > 3:
        group_lasts.append(previous)
        first_of_group = a
    previous = a
# add the last item separately, because it is always the last one in its group
group_lasts.append(a)
Now you have the group lasts in group_lasts.
Using any NumPy array functionality here does not seem to provide much help.
Interpretation 2: The value of an item in a group must not differ more than 3 units from the previous item
This is easier, as we can form a list of group breaks as in Warren Weckesser's answer. Here NumPy helps a lot.
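The code for this interpretation is cut off in the extract. A sketch (my own, using the same A and threshold of 3 as above, mirroring the vectorized answer):
import numpy as np

# A break occurs wherever the gap to the next item exceeds the threshold;
# the item just before each break is the last of its group.
breaks = np.where(np.diff(A) > 3)[0]
group_lasts = np.concatenate((A[breaks], A[-1:]))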