I am looking to add two columns with different date ranges:
column 1 = values with date index 2 Nov to 23 Nov
column 2 = values with date index 27 Oct to 17 Nov
Resultant = sum of the values in column 1 and column 2 over 27 Oct to 23 Nov
Column 1 of dataframe A has data from 2 Nov to 23 Nov; each element has value 100.
Column 2 of dataframe B has data from 27 Oct to 17 Nov; each element has value 200.
The result should be the sum of these columns, with all dates included.
df1:
        Date  Value
0  2-11-2020   21.0
1  3-11-2020    4.0
2  4-11-2020    6.0
df2:
        Date  Value
0  3-11-2020    2.0
1  4-11-2020    2.0
2  5-11-2020    7.0
You can align the two frames on Date by setting it as the index, then add with fill_value=0 so a date missing from either frame counts as 0:
df = df1.set_index('Date').add(df2.set_index('Date'), fill_value=0).reset_index()
df:
        Date  Value
0  2-11-2020   21.0
1  3-11-2020    6.0
2  4-11-2020    8.0
3  5-11-2020    7.0
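A minimal end-to-end sketch of the same idea (the Date strings here happen to sort correctly as text; with real ranges like 27 Oct to 23 Nov you would parse them with pd.to_datetime first so the union of dates sorts chronologically):

import pandas as pd

df1 = pd.DataFrame({'Date': ['2-11-2020', '3-11-2020', '4-11-2020'],
                    'Value': [21.0, 4.0, 6.0]})
df2 = pd.DataFrame({'Date': ['3-11-2020', '4-11-2020', '5-11-2020'],
                    'Value': [2.0, 2.0, 7.0]})

# align on Date; fill_value=0 treats a date missing from either frame as 0
df = df1.set_index('Date').add(df2.set_index('Date'), fill_value=0).reset_index()
print(df)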
I have a rather simple request and have not found a suitable solution online. I have a DF that looks like the one below, and I need to add a new column to it holding the cumulative deviation. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And I need the new cumulative deviation column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the calculation for the first row is =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, for the next row =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on, until the last row: =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1.
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
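To store this as a column matching the desired 'Cum Dev' header (a one-liner, assuming the column names shown above):

df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1

This reproduces the expanding =SUM(...)/SUM(...)-1 pattern because cumsum gives the running sum of each column.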
I have a dataframe that looks like this:
   year  month  valueCounts
0  2019      1    73.411285
1  2019      2    53.589128
2  2019      3    71.103842
3  2019      4    79.528084
I want the valueCounts column's values to be rolled up one row, like this:
   year  month  valueCounts
0  2019      1    53.589128
1  2019      2    71.103842
2  2019      3    79.528084
3  2019      4          NaN
I can do this by dropping the first row of the dataframe and assigning NaN to the last row, but that doesn't look efficient. Is there a simpler method?
Thanks.
Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN
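shift(-1) moves every value up one row and leaves NaN at the end. If your real data spans several years and values should not cross year boundaries, a hedged variant (assuming a year column as above) is:

df['valueCounts'] = df.groupby('year')['valueCounts'].shift(-1)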
I want to drop the rows in a pandas dataframe where the value in one column (account) is not duplicated while the value in another column (recharge_number) is duplicated, i.e. where the same recharge_number appears under different accounts. An illustrative example:
data = {'account': [43, 43, 43, 43, 45, 45],
        'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
        'year': [2021, 2021, 2021, 2021, 2020, 2020],
        'month': [2, 3, 5, 6, 2, 9]}
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2
Another method is to drop the offending rows instead of selecting the ones to keep:
>>> df.drop(df[~df.duplicated(['account', 'recharge_number'], keep=False)
...            & df.duplicated('recharge_number', keep=False)].index)
   account  recharge_number  year  month
0       43            17777  2021      2
1       43            17777  2021      3
3       43            17888  2021      6
4       45            17222  2020      2
The first condition protects all duplicated ('account', 'recharge_number') records. The second condition removes all records whose 'recharge_number' is duplicated.
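As a self-contained sketch using the question's own data:

import pandas as pd

data = {'account': [43, 43, 43, 43, 45, 45],
        'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
        'year': [2021, 2021, 2021, 2021, 2020, 2020],
        'month': [2, 3, 5, 6, 2, 9]}
df = pd.DataFrame(data)

# keep duplicated (account, recharge_number) pairs; drop rows whose
# recharge_number repeats under a different account
print(df.drop(df[~df.duplicated(['account', 'recharge_number'], keep=False)
                 & df.duplicated('recharge_number', keep=False)].index))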
Basically, you want "the ('account', 'recharge_number') pair is duplicated" or "recharge_number is not duplicated".
You can use duplicated:
df[df[['account', 'recharge_number']].duplicated(keep=False) | ~df['recharge_number'].duplicated(keep=False)]
Output:
   account  recharge_number  year  month
0       43            17777  2021      2
1       43            17777  2021      3
3       43            17888  2021      6
4       45            17222  2020      2
Solution with pd.crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
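The crosstab counts, for each recharge_number, how many distinct accounts it appears under, and the mask selects the numbers shared by more than one account. An alternative idiom (not from the original answers) does the same count with groupby/transform:

df[df.groupby('recharge_number')['account'].transform('nunique').eq(1)]

This keeps a row only when its recharge_number belongs to exactly one account, which matches the desired output above.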
I need to apply an index from one dataframe df (not shown) built from a True/False condition on whether each row of df contains NaN. I have found that index; it is called rowswithnan and is shown below as a series. I need to use this index to set row values of a different dataframe df2 to zero (0). I have tried lots of things and am getting different errors.
The series rowswithnan looks like this:
0 False
1 True
2 True
3 False
and df2 looks like this:
plant_name power_kwh hour day month year
0 AREC 32963.4 23 31 12 2020
1 AREC 35328.2 22 31 12 2020
2 AREC 37523.6 21 31 12 2020
3 AREC 36446.0 20 31 12 2020
After applying the rowswithnan index, I need df2 to look like this, with zeros in the power_kwh column:
df2
plant_name power_kwh hour day month year
0 AREC 32963.4 23 31 12 2020
1 AREC 0 22 31 12 2020
2 AREC 0 21 31 12 2020
3 AREC 36446.0 20 31 12 2020
Thank you.
Since you have a series that describes where the NaN values are in the dataframe, I'm guessing there's a simpler way to handle this whole situation. However, here's an answer to your request in a comment to drop the rows flagged by rowswithnan.
First make that series a column of your dataframe:
df2['contains_nan'] = rowswithnan
Then use boolean filtering:
df2 = df2.loc[~df2.contains_nan]
To set the power_kwh values to zero instead of dropping the rows, do this:
df2.loc[df2.contains_nan, 'power_kwh'] = 0
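If rowswithnan and df2 share the same index, you can also skip the helper column entirely; a minimal sketch (assuming aligned indexes) is:

df2.loc[rowswithnan, 'power_kwh'] = 0

Boolean indexing with .loc selects exactly the rows where the series is True and assigns 0 to their power_kwh.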
I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
You can pad the month value in a series and then reformat to get a datetime for all of the values:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(
    pd.Series([int('%d%s' % (x, y)) for x, y in zip(df['Year'], month)]),
    format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can reformat the date as a string to match your desired format exactly:
df['MonthIdx'] = df['MonthIdx'].apply(lambda x: x.strftime('%Y/%m/%d'))
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01
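A shorter alternative (not from the original answer) builds the string directly, avoids the intermediate list, and preserves df's original index (38, 65, 92 in the question):

df['MonthIdx'] = pd.to_datetime(
    df['Year'].astype(str) + df['Month'].astype(str).str.zfill(2),
    format='%Y%m').dt.strftime('%Y/%m/%d')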