I am trying to subset a large dataframe (5000+ rows and 15 columns) based on unique values from two columns (both are dtype = object). I want to exclude rows of data that meet the following criteria:
A column called 'Record' equals "MO" AND a column called 'Year' equals "2017" or "2018".
Here is an example of the dataframe:
df = pd.DataFrame({'A': [1001,2002,3003,4004,5005,6006,7007,8008,9009], 'Record' : ['MO','MO','I','I','MO','I','MO','I','I'], 'Year':[2017,2019,2018,2020,2018,2018,2020,2019,2017]})
print(df)
A Record Year
0 1001 MO 2017
1 2002 MO 2019
2 3003 I 2018
3 4004 I 2020
4 5005 MO 2018
5 6006 I 2018
6 7007 MO 2020
7 8008 I 2019
8 9009 I 2017
I would like any row with both "MO" and "2017", as well as both "MO" and "2018" taken out of the dataframe.
Example where the correct rows (0 and 4 in the dataframe above) are deleted:
df = pd.DataFrame({'A': [2002,3003,4004,6006,7007,8008,9009], 'Record' : ['MO','I','I','I','MO','I','I'], 'Year':[2019,2018,2020,2018,2020,2019,2017]})
print(df)
A Record Year
0 2002 MO 2019
1 3003 I 2018
2 4004 I 2020
3 6006 I 2018
4 7007 MO 2020
5 8008 I 2019
6 9009 I 2017
I have tried the following code, but it does not work (I tried at first for just one year):
df = df[(df['Record'] != "MO" & df['Year'] != "2017")]
I believe you're just missing some parentheses.
df = df[(df['Record'] != "MO") & (df['Year'] != "2017")]
Edit:
After some clarification:
df = df[~((df['Record'] == 'MO') &
          ((df['Year'] == '2017') | (df['Year'] == '2018')))]
Note the extra parentheses around the two Year comparisons: & binds more tightly than |, so without them the filter would also drop every 2018 row regardless of Record.
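As a quick check on the example frame above (where Year is stored as integers, so the comparisons use numbers rather than the strings from the question), a minimal sketch of the same filter that drops rows 0 and 4:

import pandas as pd

df = pd.DataFrame({'A': [1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009],
                   'Record': ['MO', 'MO', 'I', 'I', 'MO', 'I', 'MO', 'I', 'I'],
                   'Year': [2017, 2019, 2018, 2020, 2018, 2018, 2020, 2019, 2017]})

# drop rows where Record is 'MO' and Year is 2017 or 2018
mask = (df['Record'] == 'MO') & (df['Year'].isin([2017, 2018]))
print(df[~mask].reset_index(drop=True))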
I have a rather simple request and have not found a suitable solution online. I have a DF like the one below, and I need to add a new column to it holding the cumulative deviation. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the calculation for the first row would be =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, for the next row =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on, with the last row being =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1.
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
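To attach the result to the frame under the column label used in the question (assuming "Cum Dev" is the name you want), you can assign it directly:

df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1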
I am trying to divide line items with a start and end date into multiple rows based on months.
Values should be calculated based on the number of days in each month.
For instance, data of 1 line item:
id   StartDate   EndDate     Annual
abc  12/12/2018  01/12/2019  120,450
expected output:
id   Month  Year  Monthly volume
abc  12     2018  6,600
abc  1      2019  10,230
abc  2      2019  9,240
abc  3      2019  10,230
abc  4      2019  9,900
abc  5      2019  10,230
abc  6      2019  9,900
abc  7      2019  10,230
abc  8      2019  10,230
abc  9      2019  9,900
abc  10     2019  10,230
abc  11     2019  9,900
A few things for next time you ask:
- This is a case where there are existing answers, so always try a search first to reduce duplication. The other post is referenced in the code below.
- Always include the code you have already tried; SO doesn't like to do your homework, but we will help you with it.
- Include a more readily reproduced dataframe. I shouldn't have to copy and paste to build it, as in the code below.
- You are clearly doing something to convert the Annual total to a monthly volume, but you do not explain what, so do not expect it to be done for you.
- Lastly, this code doesn't produce separate month and year columns, but once you have the date, that should be trivial for you to do (or to look up); see the sketch after the code below.
import pandas as pd

df = pd.DataFrame(
    data=[['abc', '12/12/2018', '12/01/2019', 120450]],
    columns=['id', 'startDate', 'EndDate', 'Annual'],
)
df['startDate'] = pd.to_datetime(df['startDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

# melt the start/end dates into a single 'date' column, one row per boundary
df_start_end = df.melt(id_vars=['id', 'Annual'], value_name='date')

# credit to u/gen
# https://stackoverflow.com/questions/42151886/expanding-pandas-data-frame-with-date-range-in-columns
df = (
    df_start_end.groupby('id')
    .apply(lambda x: x.set_index('date').resample('M').ffill())  # ffill replaces the deprecated pad()
    .drop(columns=['id', 'variable'])
    .reset_index()
)
print(df)
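If you do want separate Month/Year columns and a day-weighted split of Annual, note that the expected output above is consistent with a flat daily rate of Annual / 365 (i.e. 330 per day), with the end date excluded. A minimal sketch along those lines; the 365-day assumption and the column names in the result are mine, not taken from the question:

import pandas as pd

df = pd.DataFrame(
    data=[['abc', '12/12/2018', '12/01/2019', 120450]],
    columns=['id', 'startDate', 'EndDate', 'Annual'],
)
df['startDate'] = pd.to_datetime(df['startDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

rows = []
for _, r in df.iterrows():
    # one entry per calendar day in the range, end date excluded
    days = pd.date_range(r['startDate'], r['EndDate'] - pd.Timedelta(days=1), freq='D')
    daily_rate = r['Annual'] / 365  # assumption: flat 365-day daily rate
    # count how many of those days fall in each calendar month
    per_month = pd.Series(days).dt.to_period('M').value_counts().sort_index()
    for period, n_days in per_month.items():
        rows.append({'id': r['id'], 'Month': period.month, 'Year': period.year,
                     'Monthly volume': n_days * daily_rate})

out = pd.DataFrame(rows)
print(out)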
I have a dataframe 'df1' that has 2 columns, and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
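The same logic can also be written out step by step, which may be easier to follow (equivalent to the one-liner above):

df['ER12'] = df['ER12'].shift()   # push every value down one row; row 0 becomes NaN
df = df.dropna()                  # drop the 2017 row that now holds the NaN
df = df.reset_index(drop=True)    # renumber the index from 0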
This works on your example:
import pandas as pd
df = pd.DataFrame({'year':[2017,2018,2019,2020,2021], 'ER12':[-2.05,1.05,-0.04,-0.6,-99.99]})
df['year'] = df['year'].shift(-1)
df = df.dropna()
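One caveat with this second approach (my observation, not part of the original answer): shift(-1) introduces a NaN, which turns year into a float column, and the surviving rows keep their old index labels. Both are easy to restore:

df['year'] = df['year'].astype(int)   # back to integer years
df = df.reset_index(drop=True)        # renumber rows 0..3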
I'm trying to calculate variability statistics from two df's: one with current data, and one with average data for the month. Suppose I have a df "DF1" that looks like this:
Name year month output
0 A 1991 1 10864.8
1 A 1997 2 11168.5
2 B 1994 1 6769.2
3 B 1998 2 3137.91
4 B 2002 3 4965.21
and a df called "DF2" that contains monthly averages from multiple years such as:
Name month output_average
0 A 1 11785.199
1 A 2 8973.991
2 B 1 8874.113
3 B 2 6132.176667
4 B 3 3018.768
and I need a new DF, call it "DF3", that looks like the one below, with the calculation done separately for each value in the "Name" column and for each "month":
Name year month Variability
0 A 1991 1 -0.078097875
1 A 1997 2 0.24454103
2 B 1994 1 -0.237197002
3 B 1998 2 -0.488287737
4 B 2002 3 0.644782
I have tried options like the one below, but I get errors about a duplicated axis or key errors:
DF3['variability'] =
((DF1.output/DF2.set_index('month'['output_average'].reindex(DF1['name']).values)-1)
Thank you for your help in learning Python row calculations, coming from MATLAB!
When matching on two columns, you are better off using merge instead of set_index:
df3 = df1.merge(df2, on=['Name','month'], how='left')
df3['variability'] = df3['output']/df3['output_average'] - 1
Output:
Name year month output output_average variability
0 A 1991 1 10864.80 11785.199000 -0.078098
1 A 1997 2 11168.50 8973.991000 0.244541
2 B 1994 1 6769.20 8874.113000 -0.237197
3 B 1998 2 3137.91 6132.176667 -0.488288
4 B 2002 3 4965.21 3018.768000 0.644780
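If you only want the columns shown in the desired DF3 (and not the two output columns that the merge carries along), you can select them afterwards; the lowercase 'variability' name follows the answer's code:

DF3 = df3[['Name', 'year', 'month', 'variability']]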
I have a data set
id  Category  Date
1   Sick      2016-10-10 12:10:21
2   Active    2017-09-08 11:09:06
3   Weak      2018-11-12 06:10:04
Now I want to add a new column that contains only the year, using pandas.
You could do:
import pandas as pd
data = [[1, 'Sick ', '2016-10-10 12:10:21'],
        [2, 'Active', '2017-09-08 11:09:06'],
        [3, 'Weak ', '2018-11-12 06:10:04']]
df = pd.DataFrame(data=data, columns=['id', 'category', 'date'])
df['year'] = pd.to_datetime(df['date']).dt.year
print(df)
Output
id category date year
0 1 Sick 2016-10-10 12:10:21 2016
1 2 Active 2017-09-08 11:09:06 2017
2 3 Weak 2018-11-12 06:10:04 2018
You can also just do df['year'] = pd.DatetimeIndex(df['Date']).year
Output:
id category Date year
0 1 Sick 2016-10-10 12:10:21 2016
1 2 Active 2017-09-08 11:09:06 2017
2 3 Weak 2018-11-12 06:10:04 2018
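If the Date column has already been parsed as a datetime64 dtype (an assumption about your data; in the sample above it is still a plain string), the .dt accessor works directly:

df['year'] = df['Date'].dt.year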