reindex group to add missing rows - pandas

I am trying to reindex groups to extend dataframes with missing values. Similar to how resample works for time indexes, I am trying to achieve this for plain integer values.
So, for each group belonging to a certain group key (proID in my case), the maximum existing integer value shall be determined (specifying the end point of the resampling process). The group shall then be extended (I was trying to achieve this with reindex) by the missing values up to that maximum.
I have a dataframe with many rows per proID, an integer bin value which can range from 0 to 100, and some meaningless columns. Basically, the missing bin values shall be filled in, similar to what resample does for time indexes.
import numpy as np
import pandas as pd

def rsmpint(df):
    mx = df.bin.max()  # identify the maximal existing bin value in this group
    no = int(mx * 20 / 100) + 1  # number of bin values at a step of 5
    idx = pd.Index(np.linspace(0, mx, no), name='bin')  # full bin index for this group
    # reindex to the full bin grid, forward-fill the other columns, and return
    # the result (reset_index(..., inplace=True) returns None and discards it)
    return df.set_index('bin').reindex(idx).ffill().reset_index()
DF.groupby('proID').apply(rsmpint)
Let's assume that for a specific proID there are currently 5 bin values [0, 15, 20, 40, 65] (i.e. 5 rows in the original proID group). The output shall be an extended proID group with bin values [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65], with the content of the "meaningless" columns filled using ffill().
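As a quick sanity check on hypothetical data (with rsmpint defined as above), a group with existing bins [0, 15, 20, 40, 65] expands to the full step-5 grid:
DF = pd.DataFrame({'proID': ['a'] * 5,
                   'bin': [0, 15, 20, 40, 65],
                   'meaningless': [1.0, 2.0, 3.0, 4.0, 5.0]})

out = DF.groupby('proID').apply(rsmpint)
print(out['bin'].tolist())
# [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0]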

Related

Calculate Average/mean() of a column in Python/Pandas/Numpy based on different values in another column

I'd like to calculate the average of a column using pandas, based on different value ranges in another column.
I have two columns, A and B. I'd like an extra column showing the average of B when the values of A are >= 0 and < 20, >= 20 and < 40, >= 40 and < 60, >= 60 and < 80, >= 80 and < 100, and so on. 100 as a maximum is just an example; let's say it goes up to the max of column A, which could be 20000.
I have tried using an if statement, but that only works for a limited number of values. What if I have 20000 as my max value and want the average for ranges of width 5 over the A values?
Use cut + groupby.transform:
import pandas as pd

bins = [0, 20, 40, 60, 80, 101]
df['C'] = df['B'].groupby(pd.cut(df['A'], bins=bins, right=False)).transform('mean')
If you want to generate the bins programmatically:
import numpy as np
MAX = 100
STEP = 20
bins = np.arange(0, MAX+1, STEP)
bins[-1] += 1
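A quick demonstration on hypothetical sample data (column names A and B as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 12, 25, 47, 51, 78, 95],
                   'B': [1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 10.0]})

bins = np.arange(0, 100 + 1, 20)
bins[-1] += 1  # so A == 100 still falls in the last bin with right=False

df['C'] = df['B'].groupby(pd.cut(df['A'], bins=bins, right=False)).transform('mean')
print(df)
#     A     B     C
# 0   5   1.0   2.0   <- mean of B over 0 <= A < 20 (rows 0 and 1)
# 1  12   3.0   2.0
# 2  25   2.0   2.0
# 3  47   4.0   5.0   <- mean of B over 40 <= A < 60 (rows 3 and 4)
# 4  51   6.0   5.0
# 5  78   8.0   8.0
# 6  95  10.0  10.0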

comparing and removing rows in pandas

I am trying to create a new object by comparing two lists. If the rows match, the row should be removed from splitted_row_list; otherwise it should be appended to a new list containing only the differences between both lists.
results = []
for row in splitted_row_list:
    print(row)
    for row1 in all_rows:
        if row1 == row:
            splitted_row_list.remove(row)
        else:
            results.append(row)
print(results)
However, this code just returns all the rows. Does anyone have a suggestion?
Sample data
all_rows[0]:'1390', '139080', '13980', '1380', '139080', '13080'
splitted_row_list[0]:'35335','53527','353529','242424','5222','444'
As I understand it, you want to compare two lists by index and keep the differences, and you want to do it with pandas (because of the tag):
So here are two lists for example:
ls1=[0,10,20,30,40,50,60,70,80,90]
ls2=[0,15,20,35,40,55,60,75,80,95]
I make a pandas dataframe from these lists, and build a mask to filter out the matching values:
import pandas as pd

df = pd.DataFrame(data={'ls1': ls1, 'ls2': ls2})
mask = df['ls1'] != df['ls2']
I can then call the different values for each list using the mask:
# list 1
df[mask]['ls1'].values
out: array([10, 30, 50, 70, 90])
and
# list 2
df[mask]['ls2'].values
out: array([15, 35, 55, 75, 95])
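If you'd rather see the differing values side by side, the same mask filters the whole frame:
df[mask]
#    ls1  ls2
# 1   10   15
# 3   30   35
# 5   50   55
# 7   70   75
# 9   90   95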

Pandas columns by given value in last row

Below is my dataframe "df", made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index.
Now, my target is to consider the last row (Date = 3 February 2021). I want to plot ONLY those columns (stock pairs) that have a positive return on the last Date.
I started with:
n = list()
for i in range(len(df.columns)):
    if df.iloc[-1, i] > 0:
        n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, the final step is to create a subset dataframe of 'df' containing only the columns belonging to the numbers in this list. This is where I have problems. Do you have any ideas? Thanks
Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
    if df.iloc[-1, i] > 0:
        n.append(col)
df[n]
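As an aside, the same selection can be written in one vectorized step, using the last row itself as a boolean mask over the columns:
df.loc[:, df.iloc[-1] > 0]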
Here you are ;)
sample df:
              a    b    c
date
2017-04-01  0.5 -0.7 -0.6
2017-04-02  1.0  1.0  1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
              c
date
2017-04-02  1.3
The logic will be the same for your case.
If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.
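For example, with hypothetical column names (not from the question):
column_names = ['pair_03', 'pair_11']  # hypothetical names for illustration
df_subset = df.loc[:, column_names].copy()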

How to plot outliers with regard to unique ids

I have an item_code column in my data and another column, sales, which represents the sales quantity for the particular item.
The data can contain a particular item id many times. There are other columns that tell these entries apart.
I want to plot only the outlier sales for each item (because the data has thousands of different item ids, plotting every entry would be difficult).
Since I'm very new to this, what is the right way and tool to do this?
You can use pandas. You should choose a method to detect outliers; here is an example:
If you want to get outliers over all sales (not per group), you can use apply with a function (here a lambda) to get the outlier rows.
import numpy as np
import pandas as pd
%matplotlib inline

df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})

# keep only rows whose sales lie more than one standard deviation from the mean
mask = df.apply(lambda x: np.abs(x.sales - df.sales.mean()) / df.sales.std() > 1, axis=1)
df[mask].set_index('item_id').plot(style='.', color='red')
In this example we generated a data sample and found the indexes of points that lie more than one standard deviation from the mean (you can try another method). Then we just plot them, where y is the sales quantity and x is the item id. This method detected the points 0 and 55. If you want to search for outliers within groups, you can group the data first.
df.groupby('item_id').apply(
    lambda data: data.loc[
        data.apply(lambda x: np.abs(x.sales - data.sales.mean()) / data.sales.std() > 1, axis=1)
    ]
).set_index('item_id').plot(style='.', color='red')
In this example we get the points 30 and 55, because 0 isn't an outlier within the group where item_id = 1, but 30 is.
Is this what you want to do? I hope it helps you get started.
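As a variation (not part of the original answer), the per-group z-score can also be computed with groupby.transform, which avoids the nested apply; a sketch:
z = df.groupby('item_id')['sales'].transform(lambda s: (s - s.mean()) / s.std())
df[z.abs() > 1].set_index('item_id')['sales'].plot(style='.', color='red')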

Pandas: update information based on bins / cutting

I'm working on a dataset which has a large amount of missing information.
I understand I could use fillna, but I'd like to base my updates on the binned values of another column.
Selection of missing data:
missing = train[train['field'].isnull()]
Bin the data (this works correctly):
filter_values = [0, 42, 63, 96, 118, 160]
labels = [1,2,3,4,5]
out = pd.cut(missing['field2'], bins = filter_values, labels=labels)
counts = pd.value_counts(out)
print(counts)
Now, based on the bin assignments, I would like to set the correct bin label in the missing rows of train['field'] for all data assigned to each bin.
IIUC:
You just need to fillna:
train['field'] = train['field'].fillna(out)
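A minimal sketch on hypothetical data showing why this works: fillna aligns on the index, so the labels computed for the missing subset land on the correct rows:
import pandas as pd

train = pd.DataFrame({'field': [10.0, None, 7.0, None],
                      'field2': [50, 30, 90, 130]})  # hypothetical values

filter_values = [0, 42, 63, 96, 118, 160]
labels = [1, 2, 3, 4, 5]

missing = train[train['field'].isnull()]
out = pd.cut(missing['field2'], bins=filter_values, labels=labels)

train['field'] = train['field'].fillna(out)
print(train['field'].tolist())
# [10.0, 1.0, 7.0, 5.0]   (30 -> bin (0, 42] -> label 1; 130 -> bin (118, 160] -> label 5)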