Find first and last positive value of every season over 50 years - pandas

I've seen some similar questions but can't figure out how to handle my problem.
I have a dataset with everyday total snow values from 1970 till 2015.
Now I want to find out when there was the first and the last day with snow.
I want to do this for every season.
One season runs, for example, from 01.06.2000 to 30.05.2001; this season is then Season 2000/2001.
I have already set my date column as the index (format year-month-day, e.g. 2006-04-24).
When I select a specific range with
df_s = df["2006-04-04":"2006-04-15"]
I am able to find the first and last day with snow in this period with
firstsnow = df_s[df_s['Height'] > 0].head(1)
lastsnow = df_s[df_s['Height'] > 0].tail(1)
I want to do this now for the whole dataset, so that I'm able to compare each season and see how the time of first snow changed.
My dataframe looks like this (here you see a selected period with values). Height is the snow height, Diff is the difference from the previous day. Height and Diff are float64.
Height Diff
Date
2006-04-04 0.000 NaN
2006-04-05 0.000 0.000
2006-04-06 0.000 0.000
2006-04-07 16.000 16.000
2006-04-08 6.000 -10.000
2006-04-09 0.001 -5.999
2006-04-10 0.000 -0.001
2006-04-11 0.000 0.000
2006-04-12 0.000 0.000
2006-04-13 0.000 0.000
2006-04-14 0.000 0.000
2006-04-15 0.000 0.000
(12, 2)
<class 'pandas.core.frame.DataFrame'>
I think I have to work with the groupby function, but I don't know how to apply it in this case.

You can use the trick of creating a new column that keeps only the positive values and is None otherwise, then use ffill and bfill within each group to get the head and tail.
Sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a1','a2','a3','a4','a5','b1','b2','b3','b4','b5'],
                   'gr': [1]*5 + [2]*5,
                   'val1': [None,-1,2,1,None,-1,4,7,3,-2]})
Input:
name gr val1
0 a1 1 NaN
1 a2 1 -1.0
2 a3 1 2.0
3 a4 1 1.0
4 a5 1 NaN
5 b1 2 -1.0
6 b2 2 4.0
7 b3 2 7.0
8 b4 2 3.0
9 b5 2 -2.0
Set positive, then ffill and bfill within each group:
# Keep only the positive values; everything else becomes None (NaN)
df['positive'] = np.where(df['val1'] > 0, df['val1'], None)
# Forward- and back-fill within each group so the first and last rows of a
# group carry that group's first and last positive value
df['positive'] = df.groupby('gr')['positive'].ffill()
df['positive'] = df.groupby('gr')['positive'].bfill()
Check result:
df.groupby('gr').head(1)
df.groupby('gr').tail(1)
name gr val1 positive
0 a1 1 NaN 2.0
5 b1 2 -1.0 4.0
name gr val1 positive
4 a5 1 NaN 1.0
9 b5 2 -2.0 3.0
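For the original snow data, one way to get the first and last snow day of each season directly is to label every day with its season and group on that label. This is only a minimal sketch, not from the original post: it assumes the snow dataframe (called df_snow here to avoid clashing with the sample above) has the DatetimeIndex and Height column described in the question, and that seasons start on 1 June.
import numpy as np

# Season label: June-December belong to the season starting that year,
# January-May belong to the season that started the previous year.
start_year = np.where(df_snow.index.month >= 6, df_snow.index.year, df_snow.index.year - 1)
df_snow['season'] = [f'{y}/{y + 1}' for y in start_year]

snow_days = df_snow[df_snow['Height'] > 0]
firstsnow = snow_days.groupby('season').head(1)   # first snow day of each season
lastsnow = snow_days.groupby('season').tail(1)    # last snow day of each season
Because the index is sorted by date, head(1)/tail(1) per season group give the earliest and latest day with Height > 0 in that season.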

Related

Convert value counts of multiple columns to pandas dataframe

I have a dataset in this form:
Name Batch DXYR Emp Lateral GDX MMT CN
Joe 2 0 2 2 2 0
Alan 0 1 1 2 0 0
Josh 1 1 2 1 1 2
Max 0 1 0 0 0 2
These columns can have only three distinct values, i.e. 0, 1 and 2.
So, I need the percent of value counts for each column in a pandas dataframe.
I simply made a loop like:
for i in df.columns:
    print((df[i].value_counts() / df[i].count()) * 100)
I am getting the output like:
0 90.608831
1 0.391169
2 9.6787899
Name: Batch, dtype: float64
0 95.545455
1 2.235422
2 2.6243553
Name: MX, dtype: float64
and so on...
These outputs are correct, but I need them in a pandas dataframe like this:
Batch DXYR Emp Lateral GDX MMT CN
Count_0_percent 98.32 52.5 22 54.5 44.2 53.4 76.01
Count_1_percent 0.44 34.5 43 43.5 44.5 46.5 22.44
Count_2_percent 1.3 64.3 44 2.87 12.6 1.88 2.567
Can someone please suggest how to get it?
You can melt the data, then use pd.crosstab:
melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')
Or a bit faster (yet more verbose) with melt and groupby().value_counts():
(df.melt('Name')
   .groupby('variable')['value'].value_counts(normalize=True)
   .unstack('variable', fill_value=0)
)
Output:
variable Batch CN DXYR Emp Lateral GDX MMT
value
0 0.50 0.5 0.25 0.25 0.25 0.50
1 0.25 0.0 0.75 0.25 0.25 0.25
2 0.25 0.5 0.00 0.50 0.50 0.25
Update: apply also works:
df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
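If the exact layout from the question is needed (rows labelled Count_0_percent etc. and values in percent), the crosstab result can be scaled and relabelled. A small sketch building on the melt variable defined above:
out = pd.crosstab(melt['value'], melt['variable'], normalize='columns') * 100
out.index = ['Count_{}_percent'.format(v) for v in out.index]
print(out)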

How to extract a database based on a condition in pandas?

Please help me; the problem below is what I need to solve.
Write an expression to extract a new dataframe containing those days where the temperature reached at least 70 degrees, and assign that to the variable at_least_70. (You might need to think some about what the different columns in the full dataframe represent to decide how to extract the subset of interest.)
After that, write another expression that computes how many days reached at least 70 degrees, and assign that to the variable num_at_least_70.
This is the original DataFrame:
Date Maximum Temperature Minimum Temperature \
0 2018-01-01 5 0
1 2018-01-02 13 1
2 2018-01-03 19 -2
3 2018-01-04 22 1
4 2018-01-05 18 -2
.. ... ... ...
360 2018-12-27 33 23
361 2018-12-28 40 21
362 2018-12-29 50 37
363 2018-12-30 37 24
364 2018-12-31 35 25
Average Temperature Precipitation Snowfall Snow Depth
0 2.5 0.04 1.0 3.0
1 7.0 0.03 0.6 4.0
2 8.5 0.00 0.0 4.0
3 11.5 0.00 0.0 3.0
4 8.0 0.09 1.2 4.0
.. ... ... ... ...
360 28.0 0.00 0.0 1.0
361 30.5 0.07 0.0 0.0
362 43.5 0.04 0.0 0.0
363 30.5 0.02 0.7 1.0
364 30.0 0.00 0.0 0.0
[365 rows x 7 columns]
The code I wrote for the above problem is:
at_least_70 = dfc.loc[dfc['Minimum Temperature']>=70,['Date']]
print(at_least_70)
num_at_least_70 = at_least_70.count()
print(num_at_least_70)
The result it shows:
Date
204 2018-07-24
240 2018-08-29
245 2018-09-03
Date 3
dtype: int64
But when I run the test case it shows...
Incorrect!
You are not correctly extracting the subset.
As suggested by @HenryYik, remove the single-column selector and filter on the maximum temperature:
at_least_70 = dfc.loc[dfc['Maximum Temperature'] >= 70,
                      ['Date', 'Maximum Temperature']]
num_at_least_70 = len(at_least_70)
Or use boolean indexing, and to count the True values of the mask use sum:
mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]
num_at_least_70 = mask.sum()

Fill missing values in DataFrame

I have a dataframe that is missing either two values in two columns, or one value in one column.
Date 30 45 60 90
0 2004-01-02 0.88 0.0 0.0 0.93
1 2004-01-05 0.88 0.0 0.0 0.91
...
20 2019-12-24 1.55 0 1.58 1.58
21 2019-12-26 1.59 0 1.60 1.58
I would like to fill in all the zero values in the dataframe by some simple linear method. Here is the thing: if there is a value in the 60 column, use the average of the 60 and the 30 for the 45; otherwise, use some simple method to compute both the 45 and the 60.
What is the pandas way to do this? [Prefer no loops]
EDIT 1
As per the suggestions in the comment, I tried
df.replace(0, np.nan, inplace=True)
df=df.interpolate(method='linear', limit_direction='forward', axis=0)
But the df still contains all the np.nan values.
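One possible reading of the requirement (the 45 should become the average of the 30 and the 60) is to interpolate across the maturity columns rather than down the rows. This is only a sketch, not from the original post; it assumes the column labels are the strings '30', '45', '60', '90', that the zeros really mean "missing", and that adjacent columns can be treated as equally spaced:
import numpy as np

cols = ['30', '45', '60', '90']               # adjust if the labels are integers
df[cols] = (df[cols].replace(0, np.nan)       # treat zeros as missing
                    .interpolate(method='linear', axis=1))
Under the equal-spacing assumption a missing 45 between a known 30 and 60 becomes their average; if both the 45 and the 60 are missing, they are filled linearly between the 30 and the 90.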

How to manipulate data in arrays using pandas (and resetting evaluations)

I've revised the question for clarity and removed artifacts and inconsistencies - please reopen for consideration by the community. One contributor already thinks a solution might be possible with groupby in combination with cummax.
I have a dataframe in which the max between prior value of col3 and current value of col2 is evaluated through a cummax function recently offered by Scott Boston (thanks!) as follows:
df['col3'] = df['col2'].shift(-1).cummax().shift()
The resulting dataframe is shown below. I have also added the desired logic, which compares col2 against a setpoint that is a float value.
result of operating cummax:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 3.1 1.55 2.75
5 6 4.5 2.25 2.75
6 7 5.5 2.75 2.75
7 8 1.2 0.6 2.75
8 9 5.8 2.90 2.90
The desire is to flag True when col3 >= setpoint (2.71 in the above example), every time col3's most recent row exceeds the setpoint.
The problem: the cummax solution does not reset once the setpoint is reached. I need a solution that resets the cummax calculation every time it breaches the setpoint. For example, in the table above, after the first True when col3 exceeds the setpoint (i.e. the col2 value is 2.75), there is a second time when it should satisfy the same condition, shown in the extended data table where I have deleted col3's value in row 4 to illustrate the need to 'reset' the cummax calc. In the if statement, I am using subscript [-1] to target the last row in the df (i.e. the most recent). Note: col2 = current value of col1 * constant1, where constant1 == 0.5.
Code tried so far (note that col3 is not resetting properly):
if self.constant is not None: setpoint = self.constant * (1 - self.temp)  # suppose setpoint == 2.71

df = pd.DataFrame({'col0': [1,2,3,4,5,6,7,8,9],
                   'col1': [5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8],
                   'col2': [2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9],
                   'col3': [np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
                   })

if df['col3'].iloc[-1] >= setpoint:
    self.log('setpoint hit')
    return True
The cummax solution needs tweaking: col3 is supposed to be evaluated based on the values of col2 and col3, and once the setpoint is breached (2.71 for col3), the next col3 value should reset to NaN and start a new cummax. The correct output for col3 should be [NaN, 2.45, 2.75, NaN, 1.55, 2.25, 2.75, NaN, 2.9], and the code should return True again and again whenever the last row of col3 breaches the setpoint value 2.71.
Desired result of operating cummax and additional tweaking for col3 (possibly with groupby that references col2?): return True every time setpoint is breached. Here's one example of the resulting col3:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 NaN
4 5 3.1 1.55 1.55
5 6 4.5 2.25 2.25
6 7 5.5 2.75 2.75
7 8 1.2 0.60 NaN
8 9 5.8 2.90 2.90
Open to suggestions on whether NaN is returned on the row where the breach occurs or on the next row, as shown above (the key desire is for the if statement to resolve to True as soon as the setpoint is breached).
Try:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col0': [1,2,3,4,5,6,7,8,9],
                   'col1': [5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8],
                   'col2': [2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9],
                   'col3': [np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
                   })
threshold = 2.71
grp = df['col2'].ge(threshold).cumsum().shift().bfill()
df['col3'] = df['col2'].groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 NaN
4 5 3.1 1.55 1.55
5 6 4.5 2.25 2.25
6 7 5.5 2.75 2.75
7 8 1.2 0.60 NaN
8 9 5.8 2.90 2.90
Details:
Create a grouping using greater-or-equal-to-threshold, then apply the same logic to each group within the dataframe using groupby with transform.
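To get the True signal whenever the most recent row breaches the setpoint, the last value of the reset col3 can then be checked directly. A small sketch on top of the df produced above (a NaN in the last row simply compares as False):
setpoint = 2.71
if df['col3'].iloc[-1] >= setpoint:
    print('setpoint hit')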

Pandas DataFrame Transpose and Matrix Multiplication

I am looking for a way to perform a matrix multiplication on two sets of columns in a dataframe. One set of columns will need to be transposed and then multiplied with the other set. Then I need to take the resulting matrix, do an element-wise product with another matrix of scalars, and add everything up. Below is an example:
Data for testing:
import pandas as pd
import numpy as np
dftest = pd.DataFrame(data=[['A',0.18,0.25,0.36,0.21,0,0.16,0.16,0.64,0.04,0,0],['B',0,0,0.5,0.5,0,0,0,0.25,0.75,0,0]],columns = ['Ticker','f1','f2','f3','f4','f5','p1','p2','p3','p4','p5','multiplier'])
Starting dataframe with data for Tickers. f1 through f5 represent one set of categories and p1 through p5 represent another.
dftest
Out[276]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 0
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 0
For each row, I need to transpose columns p1 through p5 and then multiply them with columns f1 through f5. I think I have found the solution using the code below.
dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(lambda x: x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]))
Out[408]:
f1 f2 f3 f4 f5
Ticker
A p1 0.0288 0.04 0.0576 0.0336 0.0
p2 0.0288 0.04 0.0576 0.0336 0.0
p3 0.1152 0.16 0.2304 0.1344 0.0
p4 0.0072 0.01 0.0144 0.0084 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
B p1 0.0000 0.00 0.0000 0.0000 0.0
p2 0.0000 0.00 0.0000 0.0000 0.0
p3 0.0000 0.00 0.1250 0.1250 0.0
p4 0.0000 0.00 0.3750 0.3750 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
Next I need to do an element-wise product of the above matrix against another 5x5 matrix (m, held in another DataFrame) and then add up the columns or rows (you get the same result either way). If I extend the above statement as below, I get the result I want.
dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns=m.columns, index=m.index).sum().sum())
Out[409]:
Ticker
A 2.7476
B 1.6250
dtype: float64
So far so good, I think. I'm happy to hear about a better and faster way to do this. The next question is where I am stuck:
How do I take this and update the "multiplier" column on my original dataFrame?
If I try to do the following:
dftest['multiplier'] = dftest.groupby('Ticker')[['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns=m.columns, index=m.index).sum().sum())
I get NaNs in the multiplier column.
dftest
Out[407]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 NaN
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 NaN
I suspect it has to do with indexing and whether all the indices after grouping translate back to the original dataframe. Second, do I need a groupby statement for this one? Since it is a row-by-row solution, can't I just do it without grouping, or group by the index? Any suggestions on that?
I need to do this without iterating row by row, because the whole code will iterate due to some optimization I have to do. So I need to run this whole process, look at the results, and if they are outside some constraints, calculate new f1 through f5 and p1 through p5 and run the whole thing again.
I posted a question on this earlier but it was confusing, so this is a second attempt. Hope it makes sense.
Thanks in advance for all your help.
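A possible sketch for writing the per-Ticker result back into the multiplier column (not from the original post): the groupby/apply result is a Series indexed by Ticker, while dftest has a plain integer index, so aligning by Ticker with map avoids the NaNs. Here m is assumed to be the 5x5 DataFrame of scalars mentioned above.
cols = ['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5']
# One scalar per Ticker: element-wise product with m, summed over all cells
res = dftest.groupby('Ticker')[cols].apply(
    lambda x: (m.values * x[['p1','p2','p3','p4','p5']].T
                           .dot(x[['f1','f2','f3','f4','f5']]).values).sum())
# Align the per-Ticker scalars back onto the original rows
dftest['multiplier'] = dftest['Ticker'].map(res)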