I have a boolean series called "rowswithnan", built from one dataframe "df" (not shown), that is True/False depending on whether the corresponding row of df contains NaN. I need to use this series to set the row values of a different dataframe "df2" to zero (0). I have tried lots of things and am getting different errors.
The series rowswithnan looks like this:
0 False
1 True
2 True
3 False
and df2 looks like this:
plant_name power_kwh hour day month year
0 AREC 32963.4 23 31 12 2020
1 AREC 35328.2 22 31 12 2020
2 AREC 37523.6 21 31 12 2020
3 AREC 36446.0 20 31 12 2020
After using the index of "rowswithnan", I need df2 to look like this with the "zero" replacements in the column "power_kwh":
df2
plant_name power_kwh hour day month year
0 AREC 32963.4 23 31 12 2020
1 AREC 0 22 31 12 2020
2 AREC 0 21 31 12 2020
3 AREC 36446.0 20 31 12 2020
Thank you.
Since you have a series that describes where the NaN values are in the dataframe, I'm guessing there's a simpler way to handle this whole situation. However, here's an answer to your request in a comment to drop the rows flagged in the series rowswithnan (the filter below keeps only the rows where it is False).
First make that series a column of your dataframe:
df['contains_nan'] = rowswithnan
Then use boolean filtering:
df = df.loc[~df.contains_nan]
To set the values power_kwh to zero instead of dropping the rows, do this:
df.loc[df.contains_nan, 'power_kwh'] = 0
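As a minimal sketch of the same idea applied to df2 from the question, the boolean series can also be passed straight to .loc without adding a helper column. This assumes rowswithnan and df2 share the same index labels (0 to 3 here); the data is copied from the question.

```python
import pandas as pd

# Boolean mask from the question: True where the matching row of df had a NaN.
rowswithnan = pd.Series([False, True, True, False])

df2 = pd.DataFrame({
    "plant_name": ["AREC"] * 4,
    "power_kwh": [32963.4, 35328.2, 37523.6, 36446.0],
    "hour": [23, 22, 21, 20],
    "day": [31] * 4,
    "month": [12] * 4,
    "year": [2020] * 4,
})

# .loc aligns the boolean Series with df2 by index label,
# then overwrites power_kwh only on the rows where the mask is True.
df2.loc[rowswithnan, "power_kwh"] = 0
print(df2)
```

If the two indexes do not line up (for example after a filter or a reset on one of them), passing rowswithnan.to_numpy() applies the mask by position instead, provided the lengths match.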
I have a rather simple request and have not found a suitable solution online. I have a DF that looks like this below and I need to find the cumulative deviation as shown in a new column to the DF. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the formula for the first row is =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, for the next row =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on down to the last row: =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
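To store the result as the new column the question asks for, the expression can be assigned directly. A small sketch using only the first three rows of the question's data; the column name "Cum Dev" follows the question's desired output.

```python
import pandas as pd

# First three rows of the question's data.
df = pd.DataFrame({
    "year": [2022, 2022, 2022],
    "month": [1, 2, 3],
    "Curr Yr": [667590.5985, 701655.5967, 667260.5368],
    "LT Avg": [594474.2003, 585753.1173, 575550.6112],
})

# Running total of each column, divided element-wise, minus 1.
# This mirrors the Excel formula =SUM(Z$3:Zn)/SUM(AA$3:AAn)-1.
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1
print(df)
```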
I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468      45.593     1
1         1    2  2003   46.538     46.66    46.47   46.673      45.593     1
2         1    3  2003   46.717    46.781    46.53   46.750      45.593     1
3         1    4  2003   46.815    46.843    46.68   46.750      45.593     1
4         1    5  2003   46.935    47.000    46.56   46.593      45.593     1
...     ...  ...   ...      ...       ...      ...      ...         ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019     389.019    43
7258     10   27  2022   383.07    385.00  379.329   379.98     389.019    43
7259     10   28  2022  379.869   389.519   379.67  389.019     389.019    43
7260     10   31  2022   386.44   388.399   385.26  386.209      385.24    44
7261     11    1  2022   390.14    390.39   383.29  384.519      385.24    44
I want to create a new column titled 'Prior_Week_Close' which will reference the prior week's 'Week Close' value (and the last week of the prior year for the first week of every year). For example, row 7260's value for Prior_Week_Close should equal 389.019
I'm trying:
SPY['prior_week_close'] = np.where(SPY['Week'].shift(1) == (SPY['Week'] - 1), SPY['Week_Close'].shift(1), np.nan)
TypeError: boolean value of NA is ambiguous
I thought about just using shift and creating a new column but some weeks only have 4 days and that would lead to inaccurate values.
Any help is greatly appreciated!
I was able to solve this by creating a new column called 'Overall_Week' (the week number in the entire data set, not just the calendar year) and using the following code:
    def fn(s):
        # s holds the Overall_Week values of one group; look up the week before it
        result = SPY[SPY.Overall_Week == (s.iloc[0] - 1)]['Week_Close']
        if result.shape[0] > 0:
            # repeat the prior week's close for every row of the group
            return np.broadcast_to(result.iloc[0], s.shape)
        else:
            # no earlier week (first week of the data set)
            return np.broadcast_to(np.nan, s.shape)

    SPY['Prior_Week_Close'] = SPY.groupby('Overall_Week')['Overall_Week'].transform(fn)
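A possible alternative, sketched under a few assumptions: the rows are in chronological order, all rows of a week are contiguous, and the column is named 'Week_Close' as in the code above (the printed table shows 'Week Close'). The way 'Overall_Week' is built here is a guess, since the original post does not show it. The idea is to reduce the data to one close per week, shift that weekly series by one, and map it back onto the daily rows.

```python
import pandas as pd

# Last five rows of the question's table (column name as used in the code).
SPY = pd.DataFrame({
    "Year": [2022] * 5,
    "Week": [43, 43, 43, 44, 44],
    "Week_Close": [389.019, 389.019, 389.019, 385.24, 385.24],
})

# Hypothetical construction of Overall_Week: increase a counter each time
# the Week value changes from one row to the next.
SPY["Overall_Week"] = SPY["Week"].ne(SPY["Week"].shift()).cumsum()

# One Week_Close per overall week, indexed by Overall_Week.
weekly_close = SPY.groupby("Overall_Week")["Week_Close"].first()

# Shift by one week, then look up each row's week in the shifted series.
SPY["Prior_Week_Close"] = SPY["Overall_Week"].map(weekly_close.shift())
print(SPY)
```

Because the shift happens on the weekly series rather than on the daily rows, weeks with only four trading days do not throw the lookup off.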
I want to drop every row of a pandas data frame whose value in one column (recharge_number) also appears under a different value of another column (account). Rows that repeat the same recharge_number within the same account should be kept. An illustrative example:
data = {'account': [43,43,43,43,45,45],
'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999] ,
'year': [2021,2021,2021,2021,2020,2020],
'month': [2,3,5,6,2,9]}
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
input data
output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2
output data
Another method is to drop rows instead of keeping them:
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
& df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
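The first condition protects all records whose ('id', 'number') pair is duplicated. The second condition selects all records whose 'number' appears more than once. Rows that match both (a repeated 'number' without a repeated pair) are dropped.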
Basically, you want "the full row (or the two columns if larger dataframe) is duplicated" or "number is not duplicated"
You can use duplicated:
df[df[['id', 'number']].duplicated(keep=False) | ~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40
Solution with .crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
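Another way to express the rule, sketched here on the question's own data: keep a row only when its recharge_number belongs to exactly one account. This is a different technique from the answers above (groupby with nunique rather than duplicated or crosstab).

```python
import pandas as pd

data = {
    "account": [43, 43, 43, 43, 45, 45],
    "recharge_number": [17777, 17777, 17999, 17888, 17222, 17999],
    "year": [2021, 2021, 2021, 2021, 2020, 2020],
    "month": [2, 3, 5, 6, 2, 9],
}
df = pd.DataFrame(data)

# For every row, count how many distinct accounts share its recharge_number.
accounts_per_number = df.groupby("recharge_number")["account"].transform("nunique")

# Keep only recharge numbers that belong to a single account.
result = df[accounts_per_number == 1]
print(result)
```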
I am looking to add two columns that cover different date ranges:
column 1 = values with date index 2 Nov to 23 Nov
column 2 = values with date index 27 Oct to 17 Nov
Resultant = addition of values in column 1 and column 2 of 27 Oct to 23 Nov
Column 1 of dataframe A has data from 2 Nov to 23 Nov; each element has value 100.
Column 2 of dataframe B has data from 27 Oct to 17 Nov; each element has value 200.
The result should be the sum of these columns, with all dates included.
df1:
Date Value
0 2-11-2020 21.0
1 3-11-2020 4.0
2 4-11-2020 6.0
df2:
Date Value
0 3-11-2020 2.0
1 4-11-2020 2.0
2 5-11-2020 7.0
It should be:
df = df1.set_index('Date').add(df2.set_index('Date'), fill_value=0).reset_index()
df:
Date Value
0 2-11-2020 21.0
1 3-11-2020 6.0
2 4-11-2020 8.0
3 5-11-2020 7.0
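One caveat when the ranges span a month boundary, as in the question (27 Oct to 23 Nov): if 'Date' is left as a string, the combined index sorts alphabetically, so '2-11-2020' comes before '27-10-2020'. A minimal sketch that parses the dates first, assuming the day-month-year format shown above; the frame names dfA and dfB and the two-row samples are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frames: A starts 2 Nov (value 100), B starts 27 Oct (value 200).
dfA = pd.DataFrame({"Date": ["2-11-2020", "3-11-2020"], "Value": [100.0, 100.0]})
dfB = pd.DataFrame({"Date": ["27-10-2020", "2-11-2020"], "Value": [200.0, 200.0]})

# Parse the strings as real dates so sorting is chronological.
for frame in (dfA, dfB):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%d-%m-%Y")

# Align on Date; dates present in only one frame count the other as 0.
total = (
    dfA.set_index("Date")
    .add(dfB.set_index("Date"), fill_value=0)
    .sort_index()
    .reset_index()
)
print(total)
```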
I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
You can zero-pad the month, concatenate it with the year, and parse the result as a datetime. Building the string from the existing columns keeps the original index, so the assignment lines up row by row:
month = df['Month'].astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(df['Year'].astype(str) + month, format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can reformat the date to be a string to match exactly your format:
df['MonthIdx'] = df['MonthIdx'].apply(lambda x: x.strftime('%Y/%m/%d'))
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01
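As a sketch of a shorter route, pd.to_datetime can also assemble dates from a frame with year, month and day columns, which avoids string building altogether. The sample frame below reuses the index labels 38, 65 and 92 from the question; the intermediate name parts is just for illustration.

```python
import pandas as pd

# Same values and index labels as the slice shown in the question.
df = pd.DataFrame({"Year": [2020, 2021, 2022], "Month": [4, 4, 4]},
                  index=[38, 65, 92])

# to_datetime assembles a date from columns named year, month and day.
# The day is fixed to 1 so every value is the first of its month.
parts = pd.DataFrame({"year": df["Year"], "month": df["Month"], "day": 1})
df["MonthIdx"] = pd.to_datetime(parts)
print(df)

# Optional: render as text in the exact format from the question.
print(df["MonthIdx"].dt.strftime("%Y/%m/%d"))
```

Since parts is built from df's own columns, it carries the same index, so the new column aligns with the original rows.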