I have a dataframe like this:
data = {'costs': [150, 400, 300, 500, 350], 'month':[1, 2, 2, 1, 1]}
df = pd.DataFrame(data)
I want to use groupby(['month']).sum(), but the first row should not be combined with the fourth and fifth rows; only consecutive rows with the same month should be grouped. The resulting costs would look like this:
list(df['costs']) == [150, 700, 850]
Try:
x = (
    df.groupby((df.month != df.month.shift(1)).cumsum())
    .agg({"costs": "sum", "month": "first"})
    .reset_index(drop=True)
)
print(x)
Prints:
   costs  month
0    150      1
1    700      2
2    850      1
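The key is the grouper passed to groupby: comparing month with its shifted self flags the start of each consecutive run, and cumsum turns those flags into group ids. For the sample data:

```python
import pandas as pd

df = pd.DataFrame({'costs': [150, 400, 300, 500, 350], 'month': [1, 2, 2, 1, 1]})

# A new group starts wherever month differs from the previous row
grouper = (df.month != df.month.shift(1)).cumsum()
print(list(grouper))  # [1, 2, 2, 3, 3]

totals = df.groupby(grouper)['costs'].sum()
print(list(totals))   # [150, 700, 850]
```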
I need each column to contain random integers from a specified range: for column 1 random.randint(1, 50), for column 2 random.randint(51, 100), and so on.
import numpy
import random
import pandas
from random import randint
wsn = numpy.arange(1, 6)
taskn = 3
t1 = numpy.random.randint((random.randint(2, 50),random.randint(51, 100),
random.randint(101, 150),random.randint(151, 200),random.randint(201, 250)),size=(5,5))
t2 = numpy.random.randint((random.randint(2, 50),random.randint(51, 100),
random.randint(101, 150),random.randint(151, 200),random.randint(201, 250)),size=(5,5))
t3= numpy.random.randint((random.randint(2, 50),random.randint(51, 100),
random.randint(101, 150),random.randint(151, 200),random.randint(201, 250)),size=(5,5))
print('\nGenerated Data:\t\n\nNumber \t\t\t Task 1 \t\t\t Task 2 \t\t\t Task 3\n')
ni = len(t1)
for i in range(ni):
    print('\t {0} \t {1} \t {2} \t {3}\n'.format(wsn[i], t1[i], t2[i], t3[i]))
print('\n\n')
It prints the following
Generated Data:
Number Task 1 Task 2 Task 3
1 [ 1 13 16 121 18] [ 5 22 34 65 194] [ 10 68 60 134 130]
2 [ 0 2 117 176 46] [ 1 15 111 116 180] [22 41 70 24 85]
3 [ 0 12 121 19 193] [ 0 5 37 109 205] [ 5 53 5 106 15]
4 [ 0 5 97 99 235] [ 0 22 142 11 150] [ 6 79 129 64 87]
5 [ 2 46 71 101 186] [ 3 57 141 37 71] [ 15 32 9 117 77]
Sometimes it even generates 0, even though I didn't specify 0 in any of the ranges.
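The zeros appear because of how that call is parsed: when numpy.random.randint gets only one positional bound, it is treated as high and low defaults to 0, so every column is drawn from [0, high). A quick check:

```python
import numpy as np

# With a single positional argument, the tuple is interpreted as *high*
# and low defaults to 0, so 0 can legitimately appear in the output.
vals = np.random.randint((50, 100), size=(1000, 2))
assert vals.min() >= 0           # lower bound is 0, not the intended minimum
assert (vals[:, 0] < 50).all()   # column 0 drawn from [0, 50)
assert (vals[:, 1] < 100).all()  # column 1 drawn from [0, 100)
```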
np.random.randint(low, high, size=None) allows low and high to be arrays of length num_intervals.
In that case, when size is not specified, it generates one integer per interval defined by the low and high bounds.
If you want multiple integers per interval, specify the corresponding size argument, whose last dimension must equal num_intervals.
Here it is size=(num_tasks, num_samples, num_intervals).
import numpy as np
bounds = np.array([1, 50, 100, 150, 200, 250])
num_tasks = 3
num_samples = 7
bounds_low = bounds[:-1]
bounds_high = bounds[1:]
num_intervals = len(bounds_low)
arr = np.random.randint(
    bounds_low, bounds_high, size=(num_tasks, num_samples, num_intervals)
)
Checking the properties:
assert arr.shape == (num_tasks, num_samples, num_intervals)
for itvl_idx in range(num_intervals):
    assert np.all(arr[:, :, itvl_idx] >= bounds_low[itvl_idx])
    assert np.all(arr[:, :, itvl_idx] < bounds_high[itvl_idx])
An example of output:
array([[[ 45,  61, 100, 185, 216],
        [ 36,  78, 117, 152, 222],
        [ 18,  77, 112, 153, 221],
        [  9,  70, 123, 178, 223],
        [ 16,  84, 118, 157, 233],
        [ 42,  78, 108, 179, 240],
        [ 40,  52, 116, 152, 225]],

       [[  3,  92, 102, 151, 236],
        [ 45,  89, 138, 179, 218],
        [ 45,  73, 120, 183, 231],
        [ 35,  80, 130, 167, 212],
        [ 14,  86, 118, 195, 212],
        [ 20,  66, 117, 151, 248],
        [ 49,  94, 138, 175, 212]],

       [[ 13,  75, 116, 169, 206],
        [ 13,  75, 127, 179, 213],
        [ 29,  64, 136, 151, 213],
        [  1,  81, 140, 197, 200],
        [ 17,  77, 112, 171, 215],
        [ 18,  75, 103, 180, 209],
        [ 47,  57, 132, 194, 234]]])
I have a dataframe:
df = pd.DataFrame({
'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'201901': [100, 120, 130, 200, 190, 210],
'201902': [100, 120, 130, 200, 190, 210],
'201903': [200, 250, 450, 120, 180, 190],
'202001': [200, 250, 450, 120, 180, 190],
'202002': [200, 250, 450, 120, 180, 190],
'202003': [200, 250, 450, 120, 180, 190]
})
The columns represent year and month (YYYYMM). I would like to sum the month columns into a new column for each year. The result should look like the following:
df = pd.DataFrame({
'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'201901': [100, 120, 130, 200, 190, 210],
'201902': [100, 120, 130, 200, 190, 210],
'201903': [200, 250, 450, 120, 180, 190],
'202001': [200, 250, 450, 120, 180, 190],
'202002': [200, 250, 450, 120, 180, 190],
'202003': [200, 250, 450, 120, 180, 190],
'2019': [400, 490, 710, 520, 560, 610],
'2020': [600, 750, 1350, 360, 540, 570]
})
My actual dataset has a number of years and has 12 months for each year. Hoping not to have to add the columns manually.
Try creating a DataFrame that contains just the year-month columns and convert the column names with to_datetime:
data_df = df.iloc[:, 2:]
data_df.columns = pd.to_datetime(data_df.columns, format='%Y%m')
2019-01-01 2019-02-01 2019-03-01 2020-01-01 2020-02-01 2020-03-01
0 100 100 200 200 200 200
1 120 120 250 250 250 250
2 130 130 450 450 450 450
3 200 200 120 120 120 120
4 190 190 180 180 180 180
5 210 210 190 190 190 190
Then resample the columns by year, sum, and rename the columns to just the year values:
data_df = (
    data_df.resample('Y', axis=1).sum().rename(columns=lambda c: c.year)
)
2019 2020
0 400 600
1 490 750
2 710 1350
3 520 360
4 560 540
5 610 570
Then join back to the original DataFrame:
new_df = df.join(data_df)
new_df:
BU Line_Item 201901 201902 201903 202001 202002 202003 2019 2020
0 AA Revenues 100 100 200 200 200 200 400 600
1 AA EBT 120 120 250 250 250 250 490 750
2 AA Expenses 130 130 450 450 450 450 710 1350
3 BB Revenues 200 200 120 120 120 120 520 360
4 BB EBT 190 190 180 180 180 180 560 540
5 BB Expenses 210 210 190 190 190 190 610 570
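Note that resample(..., axis=1) is deprecated in recent pandas versions. An alternative with the same result (a sketch, assuming every data column is a six-character YYYYMM string) is to transpose and group the column labels by their year prefix:

```python
import pandas as pd

df = pd.DataFrame({
    'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'Line_Item': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    '201901': [100, 120, 130, 200, 190, 210],
    '201902': [100, 120, 130, 200, 190, 210],
    '201903': [200, 250, 450, 120, 180, 190],
    '202001': [200, 250, 450, 120, 180, 190],
    '202002': [200, 250, 450, 120, 180, 190],
    '202003': [200, 250, 450, 120, 180, 190],
})

# Transpose so the YYYYMM labels become the index, group them by the
# first four characters (the year), sum, and transpose back.
yearly = df.iloc[:, 2:].T.groupby(lambda c: c[:4]).sum().T
new_df = df.join(yearly)
print(new_df[['2019', '2020']])
```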
Are the columns you are summing always the same? That is, are there always three 2019 columns with those same names, and three 2020 columns with those names? If so, you can just hardcode the new columns.
df['2019'] = df['201901'] + df['201902'] + df['201903']
df['2020'] = df['202001'] + df['202002'] + df['202003']
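If the set of years varies, the hardcoding can be generalized by deriving the year prefixes from the column names instead (a sketch, assuming all month columns are six-digit YYYYMM strings):

```python
import pandas as pd

df = pd.DataFrame({
    'BU': ['AA', 'AA', 'BB'],
    'Line_Item': ['Revenues', 'EBT', 'Revenues'],
    '201901': [100, 120, 200],
    '201902': [100, 120, 200],
    '201903': [200, 250, 120],
    '202001': [200, 250, 120],
    '202002': [200, 250, 120],
    '202003': [200, 250, 120],
})

# Collect the distinct year prefixes, then sum each year's month columns
years = sorted({c[:4] for c in df.columns if c.isdigit() and len(c) == 6})
for y in years:
    df[y] = df[[c for c in df.columns if len(c) == 6 and c.startswith(y)]].sum(axis=1)
print(df[years])
```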
n_month = [12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132]
df1 = df.loc[df["nth month"].isin(n_month)]
Along with the values given in n_month, I also want to include rows where the nth month column is NaN. How can I include NaN as well?
First idea is to chain Series.isna with | (bitwise OR):
import numpy as np
import pandas as pd

df = pd.DataFrame({'nth month': [12, 3, 108, 7, np.nan, 0]})
n_month = [12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132]
df1 = df.loc[df["nth month"].isin(n_month) | df["nth month"].isna()]
Or add missing values to your list:
df1 = df.loc[df["nth month"].isin(n_month + [np.nan])]
print(df1)

   nth month
0       12.0
2      108.0
4        NaN
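The second form works because Series.isin matches NaN when NaN is in the passed list, unlike a plain equality comparison (np.nan == np.nan is False). A quick check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'nth month': [12, 3, 108, 7, np.nan, 0]})
n_month = [12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132]

# isin treats NaN as a match when NaN is included in the values list
mask = df['nth month'].isin(n_month + [np.nan])
print(list(df.loc[mask].index))  # rows 0 and 2 match the list, row 4 is NaN
```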
I would like to sort a dataframe by certain priority rules.
I've achieved this in the code below but I think this is a very hacky solution.
Is there a more proper Pandas way of doing this?
import pandas as pd
import numpy as np
df=pd.DataFrame({"Primary Metric":[80,100,90,100,80,100,80,90,90,100,90,90,80,90,90,80,80,80,90,90,100,80,80,100,80],
"Secondary Metric Flag":[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0],
"Secondary Value":[15, 59, 70, 56, 73, 88, 83, 64, 12, 90, 64, 18, 100, 79, 7, 71, 83, 3, 26, 73, 44, 46, 99,24, 20],
"Final Metric":[222, 883, 830, 907, 589, 93, 479, 498, 636, 761, 851, 349, 25, 405, 132, 491, 253, 318, 183, 635, 419, 885, 305, 258, 924]})
Primary_List=list(np.unique(df['Primary Metric']))
Primary_List.sort(reverse=True)
df_sorted=pd.DataFrame()
for p in Primary_List:
    lol = df[df["Primary Metric"] == p]
    lol.sort_values(["Secondary Metric Flag"], ascending=False)
    pt1 = lol[lol["Secondary Metric Flag"] == 1].sort_values(by=['Secondary Value', 'Final Metric'], ascending=[False, False])
    pt0 = lol[lol["Secondary Metric Flag"] == 0].sort_values(["Final Metric"], ascending=False)
    df_sorted = df_sorted.append(pt1)
    df_sorted = df_sorted.append(pt0)
df_sorted
The priority rules are:
First sort by the 'Primary Metric', then by the 'Secondary Metric Flag'.
If the 'Secondary Metric Flag' == 1, sort by 'Secondary Value' then the 'Final Metric'.
If == 0, go right to the 'Final Metric'.
Appreciate any feedback.
You do not need a for loop or groupby here; just split the frame and sort_values each part, then concatenate and do a stable sort on 'Primary Metric' so the flag-1 rows stay ahead of the flag-0 rows within each group:
df1 = df.loc[df['Secondary Metric Flag'] == 1].sort_values(by=['Primary Metric', 'Secondary Value', 'Final Metric'], ascending=False)
df0 = df.loc[df['Secondary Metric Flag'] == 0].sort_values(['Primary Metric', 'Final Metric'], ascending=False)
df = pd.concat([df1, df0]).sort_values('Primary Metric', ascending=False, kind='mergesort')
sorted with loc
def k(t):
    p, s, v, f = df.loc[t]
    return (-p, -s, -s * v, -f)

df.loc[sorted(df.index, key=k)]
Primary Metric Secondary Metric Flag Secondary Value Final Metric
9 100 1 90 761
5 100 1 88 93
1 100 1 59 883
3 100 1 56 907
23 100 1 24 258
20 100 0 44 419
13 90 1 79 405
19 90 1 73 635
7 90 1 64 498
11 90 1 18 349
10 90 0 64 851
2 90 0 70 830
8 90 0 12 636
18 90 0 26 183
14 90 0 7 132
15 80 1 71 491
21 80 1 46 885
17 80 1 3 318
24 80 0 20 924
4 80 0 73 589
6 80 0 83 479
22 80 0 99 305
16 80 0 83 253
0 80 0 15 222
12 80 0 100 25
sorted with itertuples
def k(t):
    _, p, s, v, f = t
    return (-p, -s, -s * v, -f)

idx, *tups = zip(*sorted(df.itertuples(), key=k))
pd.DataFrame(dict(zip(df, tups)), idx)
lexsort
p = df['Primary Metric']
s = df['Secondary Metric Flag']
v = df['Secondary Value']
f = df['Final Metric']
a = np.lexsort([
    -p, -s, -s * v, -f
][::-1])
df.iloc[a]
Construct New DataFrame
df.mul([-1, -1, 1, -1]).assign(
    **{'Secondary Value': lambda d: d['Secondary Metric Flag'] * d['Secondary Value']}
).pipe(
    lambda d: df.loc[d.sort_values([*d]).index]
)