import pandas as pd
df = pd.DataFrame({"Employee_ID": [192, 561, 440, 264, 112, 374, 230, 251, 893, 562],
"Name": ["Jose", "Kent", "Carl", "Mary", "Michael", "Cindy", "Greg", "John", "Frank", "Angela"],
"Dept": ["Production", "Marketing", "Operations", "HR", "Finance", "Operations", "Marketing", "Production", "Finance", "HR"],
"Phone": [2725373, 3647364, 3184778, 1927472, 2394723, 874872, 1018374, 2127476, 2973973, 247462],
"Salary": [120000, 140000, 115000, 210000, 172000, 95000, 132000, 127000, 133000, 178000]})
df
I tried the following code to get the names and salaries for employee IDs 264, 374, and 893:
df[(df["Employee_ID"] == 264) & (df["Employee_ID"] == 374) & (df["Employee_ID"] == 893)][["Name", "Salary"]]
I was expecting to get their names and salaries, but the result is empty.
As stated by @abokey in the comments, you can get any desired subset of your data using boolean masking. One way of doing that is .isin():
import pandas as pd
df = pd.DataFrame({"Employee_ID": [192, 561, 440, 264, 112, 374, 230, 251, 893, 562],
"Name": ["Jose", "Kent", "Carl", "Mary", "Michael", "Cindy", "Greg", "John", "Frank", "Angela"],
"Dept": ["Production", "Marketing", "Operations", "HR", "Finance", "Operations", "Marketing", "Production", "Finance", "HR"],
"Phone": [2725373, 3647364, 3184778, 1927472, 2394723, 874872, 1018374, 2127476, 2973973, 247462],
"Salary": [120000, 140000, 115000, 210000, 172000, 95000, 132000, 127000, 133000, 178000]})
df
Output:
Employee_ID Name Dept Phone Salary
0 192 Jose Production 2725373 120000
1 561 Kent Marketing 3647364 140000
2 440 Carl Operations 3184778 115000
3 264 Mary HR 1927472 210000
4 112 Michael Finance 2394723 172000
5 374 Cindy Operations 874872 95000
6 230 Greg Marketing 1018374 132000
7 251 John Production 2127476 127000
8 893 Frank Finance 2973973 133000
9 562 Angela HR 247462 178000
get_ids = [264, 374, 893]
df = df[df["Employee_ID"].isin(get_ids)]
df = df[["Name", "Salary"]]
df
Name Salary
3 Mary 210000
5 Cindy 95000
8 Frank 133000
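For completeness: the original attempt returns an empty frame because & requires a single row to equal all three IDs at once. Chaining the comparisons with | (or) also works, though .isin() is tidier:
# same mask written with | -- a row matches if it equals any one of the IDs
df[(df["Employee_ID"] == 264) | (df["Employee_ID"] == 374) | (df["Employee_ID"] == 893)][["Name", "Salary"]]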
I have a dataframe
df = pd.DataFrame({
'Col1': ['abc', 'qrt', 130, 200, 190, 210],
'Col2': ['xyz','tuv', 130, 200, 190, 210],
'Col3': ['pqr', 'set', 130, 200, 190, 210],})
I wish to take the first two rows of the dataframe, merge them separated by an underscore, and convert the result into a new header. I tried
df.columns = np.concatenate(df.iloc[0], df.iloc[1])
df.columns = new_header
But that does not seem to work. The output should look like
df = pd.DataFrame({
'abc_qrt': [ 130, 200, 190, 210],
'xyz_tuv': [130, 200, 190, 210],
'pqr_set': [ 130, 200, 190, 210],})
Try with:
# make the first two rows a MultiIndex on the columns, then flatten it with '_'
# (set_index([0, 1]) picks the columns labeled 0 and 1 on the transposed frame,
# i.e. the original first two rows, so this assumes a default RangeIndex)
df = df.T.set_index([0, 1]).T
df.columns = df.columns.map('_'.join)
df
Out[308]:
abc_qrt xyz_tuv pqr_set
2 130 130 130
3 200 200 200
4 190 190 190
5 210 210 210
You can take the first two rows, join them with _, and then set the result as the columns of the remaining rows:
df.iloc[2:].set_axis(df.iloc[:2].agg("_".join), axis=1)
to get
abc_qrt xyz_tuv pqr_set
2 130 130 130
3 200 200 200
4 190 190 190
5 210 210 210
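A caveat on both answers (my assumption, not stated in the originals): the join only works if the first two rows hold strings. If they might not, cast first:
# safeguard: cast the header rows to str before joining
new_cols = df.iloc[:2].astype(str).agg("_".join)
out = df.iloc[2:].set_axis(new_cols, axis=1)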
I have a dataframe:
df = pd.DataFrame({
'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'201901': [100, 120, 130, 200, 190, 210],
'201902': [100, 120, 130, 200, 190, 210],
'201903': [200, 250, 450, 120, 180, 190],
'202001': [200, 250, 450, 120, 180, 190],
'202002': [200, 250, 450, 120, 180, 190],
'202003': [200, 250, 450, 120, 180, 190]
})
The columns represent years and months respectively. I would like to sum the month columns into a new column for each year. The result should look like the following:
df = pd.DataFrame({
'BU': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
'Line_Item': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'201901': [100, 120, 130, 200, 190, 210],
'201902': [100, 120, 130, 200, 190, 210],
'201903': [200, 250, 450, 120, 180, 190],
'202001': [200, 250, 450, 120, 180, 190],
'202002': [200, 250, 450, 120, 180, 190],
'202003': [200, 250, 450, 120, 180, 190],
'2019': [400, 490, 710, 520, 560, 610],
'2020': [600, 750, 1350, 360, 540, 570]
})
My actual dataset covers a number of years, with 12 months for each year. I'm hoping not to have to add the columns manually.
Try creating a DataFrame that contains just the month columns, and convert the column names with to_datetime:
# slice off the non-month columns and parse the YYYYMM headers as dates
data_df = df.iloc[:, 2:]
data_df.columns = pd.to_datetime(data_df.columns, format='%Y%m')
2019-01-01 2019-02-01 2019-03-01 2020-01-01 2020-02-01 2020-03-01
0 100 100 200 200 200 200
1 120 120 250 250 250 250
2 130 130 450 450 450 450
3 200 200 120 120 120 120
4 190 190 180 180 180 180
5 210 210 190 190 190 190
Then resample to sum the columns by year, and rename the columns to just the year values:
data_df = (
data_df.resample('Y', axis=1).sum().rename(columns=lambda c: c.year)
)
2019 2020
0 400 600
1 490 750
2 710 1350
3 520 360
4 560 540
5 610 570
Then join back to the original DataFrame:
new_df = df.join(data_df)
new_df:
BU Line_Item 201901 201902 201903 202001 202002 202003 2019 2020
0 AA Revenues 100 100 200 200 200 200 400 600
1 AA EBT 120 120 250 250 250 250 490 750
2 AA Expenses 130 130 450 450 450 450 710 1350
3 BB Revenues 200 200 120 120 120 120 520 360
4 BB EBT 190 190 180 180 180 180 560 540
5 BB Expenses 210 210 190 190 190 190 610 570
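One detail worth noting: c.year produces integer labels, so the new columns are the ints 2019/2020 rather than the strings '2019'/'2020' shown in the desired output. If string labels matter, convert before joining:
# convert the integer year labels to strings to match the '2019'/'2020' convention
data_df.columns = data_df.columns.astype(str)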
Are the columns you are summing always the same? That is, are there always three 2019 columns with those same names, and three 2020 columns with those names? If so, you can just hardcode the new columns:
df['2019'] = df['201901'] + df['201902'] + df['201903']
df['2020'] = df['202001'] + df['202002'] + df['202003']
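If the years vary, here is a minimal sketch that generalizes the hardcoded version by grouping the month columns on their four-character year prefix (assuming all month columns are digit-only YYYYMM strings as above):
month_cols = [c for c in df.columns if c.isdigit()]             # '201901' ... '202003'
year_sums = df[month_cols].T.groupby(lambda c: c[:4]).sum().T   # sum the months within each year
df = df.join(year_sums)                                         # adds '2019' and '2020'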
I have a huge Dataframe like this:
df = pandas.DataFrame({'date': ["2020-10-1 12:00:00", "2020-10-2 12:00:00", "2020-10-3 12:00:00", "2020-10-4 12:00:00",
"2020-10-5 12:00:00", "2020-10-6 12:00:00", "2020-10-7 12:00:00", "2020-10-8 12:00:00",
"2020-10-9 12:00:00"],
'revenue_A': [100, 250, 300, 300, 300, 300, 200, 100, 300],
'revenue_B': [100, 200, 200, 200, 200, 300, 250, 100, 200]})
I want to split the Dataframe if revenue_A and revenue_B don't change for at least a certain number of consecutive hours (e.g. 48). The expected result would be:
date revenue_A revenue_B
0 2020-10-1 12:00:00 100 100
1 2020-10-2 12:00:00 250 200
2 2020-10-3 12:00:00 300 200
3 2020-10-4 12:00:00 300 200
4 2020-10-5 12:00:00 300 200
and
date revenue_A revenue_B
5 2020-10-6 12:00:00 300 300
6 2020-10-7 12:00:00 200 250
7 2020-10-8 12:00:00 100 100
8 2020-10-9 12:00:00 300 200
Any idea how this can be done efficiently? (The Dataframe has millions of rows.)
I don't know if it's efficient enough, but here is one way to do this:
# compute row indices at which to split
splits = [i for i in range(2, len(df))
          if (df.revenue_A[i-2] == df.revenue_A[i-1] == df.revenue_A[i]
              and df.revenue_B[i-2] == df.revenue_B[i-1] == df.revenue_B[i])]
# sort in descending order
splits.sort(reverse=True)
# initialize list of subframes
subframes = []
# traverse rows backwards, so that we can add the subframe after the split point
# to the subframe list and drop it from the main frame without messing up the
# remaining indices
for split in splits:
    if split < len(df) - 1:
        subframes.append(df[split+1:])
        df = df.drop(range(split+1, len(df)))
# also add the last remaining subframe to the list
subframes.append(df)
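For millions of rows, a vectorized sketch of the same idea may be faster. It uses the same three-consecutive-equal-rows criterion, though runs longer than three rows may be carved up slightly differently than by the loop above:
# True where a row equals the previous row in both revenue columns
same = (df[['revenue_A', 'revenue_B']].diff() == 0).all(axis=1)
# a split point is a row whose two predecessors were also equal to it
split_points = same & same.shift(1, fill_value=False)
# start a new group on the row after each split point
group_id = split_points.shift(1, fill_value=False).cumsum()
subframes = [g for _, g in df.groupby(group_id)]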
I would like to sort a dataframe by certain priority rules.
I've achieved this in the code below but I think this is a very hacky solution.
Is there a more proper Pandas way of doing this?
import pandas as pd
import numpy as np
df=pd.DataFrame({"Primary Metric":[80,100,90,100,80,100,80,90,90,100,90,90,80,90,90,80,80,80,90,90,100,80,80,100,80],
"Secondary Metric Flag":[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0],
"Secondary Value":[15, 59, 70, 56, 73, 88, 83, 64, 12, 90, 64, 18, 100, 79, 7, 71, 83, 3, 26, 73, 44, 46, 99,24, 20],
"Final Metric":[222, 883, 830, 907, 589, 93, 479, 498, 636, 761, 851, 349, 25, 405, 132, 491, 253, 318, 183, 635, 419, 885, 305, 258, 924]})
Primary_List=list(np.unique(df['Primary Metric']))
Primary_List.sort(reverse=True)
df_sorted=pd.DataFrame()
for p in Primary_List:
    lol = df[df["Primary Metric"] == p]
    lol.sort_values(["Secondary Metric Flag"], ascending=False)
    pt1 = lol[lol["Secondary Metric Flag"] == 1].sort_values(by=['Secondary Value', 'Final Metric'], ascending=[False, False])
    pt0 = lol[lol["Secondary Metric Flag"] == 0].sort_values(["Final Metric"], ascending=False)
    df_sorted = df_sorted.append(pt1)
    df_sorted = df_sorted.append(pt0)
df_sorted
The priority rules are:
- First sort by 'Primary Metric', then by 'Secondary Metric Flag'.
- If 'Secondary Metric Flag' == 1, sort by 'Secondary Value', then by 'Final Metric'.
- If == 0, go straight to 'Final Metric'.
Appreciate any feedback.
You do not need a for loop or groupby here; just split the frame and use sort_values:
df1=df.loc[df['Secondary Metric Flag']==1].sort_values(by=['Primary Metric','Secondary Value', 'Final Metric'], ascending=[True,False, False])
df0=df.loc[df['Secondary Metric Flag']==0].sort_values(["Primary Metric","Final Metric"],ascending = [True,False])
df=pd.concat([df1,df0]).sort_values('Primary Metric')
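One caution (my note, not the answerer's): sort_values is not stable by default, so the flag-1 rows are not guaranteed to stay ahead of the flag-0 rows within each 'Primary Metric' group after the final sort. Passing kind='stable' makes that explicit:
df = pd.concat([df1, df0]).sort_values('Primary Metric', kind='stable')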
sorted with loc
def k(t):
    p, s, v, f = df.loc[t]
    return (-p, -s, -s * v, -f)

df.loc[sorted(df.index, key=k)]
Primary Metric Secondary Metric Flag Secondary Value Final Metric
9 100 1 90 761
5 100 1 88 93
1 100 1 59 883
3 100 1 56 907
23 100 1 24 258
20 100 0 44 419
13 90 1 79 405
19 90 1 73 635
7 90 1 64 498
11 90 1 18 349
10 90 0 64 851
2 90 0 70 830
8 90 0 12 636
18 90 0 26 183
14 90 0 7 132
15 80 1 71 491
21 80 1 46 885
17 80 1 3 318
24 80 0 20 924
4 80 0 73 589
6 80 0 83 479
22 80 0 99 305
16 80 0 83 253
0 80 0 15 222
12 80 0 100 25
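A practical note (my observation): calling df.loc[t] once per row inside the key function is slow on large frames; the itertuples and lexsort variants below avoid that per-row lookup.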
sorted with itertuples
def k(t):
    _, p, s, v, f = t
    return (-p, -s, -s * v, -f)

# rebuild the frame from the sorted tuples: zip(df, tups) pairs column names with columns
idx, *tups = zip(*sorted(df.itertuples(), key=k))
pd.DataFrame(dict(zip(df, tups)), idx)
lexsort
p = df['Primary Metric']
s = df['Secondary Metric Flag']
v = df['Secondary Value']
f = df['Final Metric']
# np.lexsort treats the LAST key as the primary sort key, hence the [::-1]
a = np.lexsort([
    -p, -s, -s * v, -f
][::-1])
df.iloc[a]
Construct New DataFrame
Negate the keys that should sort descending, fold the flag into 'Secondary Value', sort ascending on all columns, then reindex the original frame with the resulting order:
df.mul([-1, -1, 1, -1]).assign(
    **{'Secondary Value': lambda d: d['Secondary Metric Flag'] * d['Secondary Value']}
).pipe(
    lambda d: df.loc[d.sort_values([*d]).index]
)