Fill missing values in DataFrame - pandas

I have a dataframe that is either missing two values in two columns, or one value in one column.
Date 30 45 60 90
0 2004-01-02 0.88 0.0 0.0 0.93
1 2004-01-05 0.88 0.0 0.0 0.91
...
20 2019-12-24 1.55 0 1.58 1.58
21 2019-12-26 1.59 0 1.60 1.58
I would like to fill in all the zero values in the dataframe by some simple linear method. Here is the thing: if there is a value in the 60 column, use the average of the 60 and the 30 for the 45. Otherwise use some simple method to compute both the 45 and the 60.
What is the pandas way to do this? [Prefer no loops]
EDIT 1
As per the suggestions in the comment, I tried
df.replace(0, np.nan, inplace=True)
df=df.interpolate(method='linear', limit_direction='forward', axis=0)
But the df still contains all the np.nan values.
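A likely cause, given the sample above, is the direction: axis=0 interpolates down each column, and a column that is zero everywhere (like the 45 here) becomes all NaN after the replace, leaving nothing to interpolate between. The rule described in the question (the 45 as the average of the 30 and the 60) is interpolation across the columns of each row, i.e. axis=1. A minimal sketch of that row-wise variant, on made-up data shaped like the frame above:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-12-24', '2019-12-26']),
    '30': [1.55, 1.59], '45': [0.0, 0.0],
    '60': [1.58, 0.0], '90': [1.58, 1.58],
})
cols = ['30', '45', '60', '90']
rates = df[cols].replace(0, np.nan)
# axis=1 interpolates along each row: a lone missing 45 becomes the
# mean of the 30 and the 60; a missing 45 and 60 are filled on the
# straight line between the 30 and the 90.
df[cols] = rates.interpolate(method='linear', axis=1)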

Related

Find first and last positive value of every season over 50 years

I've seen some similar questions but can't figure out how to handle my problem.
I have a dataset with everyday total snow values from 1970 till 2015.
Now I want to find out when the first and the last day with snow occurred.
I want to do this for every season.
One season runs, for example, from 01.06.2000 to 30.05.2001; that season is then Season 2000/2001.
I have already set my date column as the index (format year-month-day, e.g. 2006-04-24).
When I select a specific range with
df_s = df["2006-04-04" : "2006-04-15"]
I am able to find out the first and last day with snow in this period with
firstsnow = df_s[df_s['Height'] > 0].head(1)
lastsnow = df_s[df_s['Height'] > 0].tail(1)
I want to do this now for the whole dataset, so that I'm able to compare each season and see how the time of first snow changed.
My dataframe looks like this (here you see a selected period with values); Height is the snow height, Diff is the difference to the previous day. Height and Diff are float64.
Height Diff
Date
2006-04-04 0.000 NaN
2006-04-05 0.000 0.000
2006-04-06 0.000 0.000
2006-04-07 16.000 16.000
2006-04-08 6.000 -10.000
2006-04-09 0.001 -5.999
2006-04-10 0.000 -0.001
2006-04-11 0.000 0.000
2006-04-12 0.000 0.000
2006-04-13 0.000 0.000
2006-04-14 0.000 0.000
2006-04-15 0.000 0.000
(12, 2)
<class 'pandas.core.frame.DataFrame'>
I think I have to work with the groupby function, but I don't know how to apply it in this case.
You can use the trick of creating a new column that holds the value only where it is positive, and NaN otherwise. Then use ffill and bfill within each group, so the group's head and tail carry its first and last positive values.
Sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['a1','a2','a3','a4','a5','b1','b2','b3','b4','b5'],
                   'gr': [1]*5 + [2]*5,
                   'val1': [None,-1,2,1,None,-1,4,7,3,-2]})
Input:
name gr val1
0 a1 1 NaN
1 a2 1 -1.0
2 a3 1 2.0
3 a4 1 1.0
4 a5 1 NaN
5 b1 2 -1.0
6 b2 2 4.0
7 b3 2 7.0
8 b4 2 3.0
9 b5 2 -2.0
Mark the positive values (NaN otherwise), then ffill and bfill within each group:
df['positive'] = np.where(df['val1'] > 0, df['val1'], np.nan)
df['positive'] = df.groupby('gr')['positive'].ffill()
df['positive'] = df.groupby('gr')['positive'].bfill()
Check result:
df.groupby('gr').head(1)
df.groupby('gr').tail(1)
name gr val1 positive
0 a1 1 NaN 2.0
5 b1 2 -1.0 4.0
name gr val1 positive
4 a5 1 NaN 1.0
9 b5 2 -2.0 3.0
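The same per-group head/tail idea carries over to the seasonal question. A minimal sketch, assuming df has the DatetimeIndex and Height column shown in the question; the season label is built by shifting dates back five months, so June 2000 through May 2001 all land in season 2000:
import pandas as pd
snow = df[df['Height'] > 0].copy()
snow['season'] = (snow.index - pd.DateOffset(months=5)).year
firstsnow = snow.groupby('season').head(1)  # first snow day per season
lastsnow = snow.groupby('season').tail(1)   # last snow day per season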

Convert value counts of multiple columns to pandas dataframe

I have a dataset in this form:
Name Batch DXYR Emp Lateral GDX MMT CN
Joe 2 0 2 2 2 0
Alan 0 1 1 2 0 0
Josh 1 1 2 1 1 2
Max 0 1 0 0 0 2
These columns can have only three distinct values, i.e. 0, 1 and 2.
So I need the percentage of value counts for each column of the pandas dataframe.
I simply wrote a loop like:
for i in df.columns:
    print((df[i].value_counts() / df[i].count()) * 100)
I am getting the output like:
0 90.608831
1 0.391169
2 9.6787899
Name: Batch, dtype: float64
0 95.545455
1 2.235422
2 2.6243553
Name: MX, dtype: float64
and so on...
These outputs are correct but I need it in pandas dataframe like this:
Batch DXYR Emp Lateral GDX MMT CN
Count_0_percent 98.32 52.5 22 54.5 44.2 53.4 76.01
Count_1_percent 0.44 34.5 43 43.5 44.5 46.5 22.44
Count_2_percent 1.3 64.3 44 2.87 12.6 1.88 2.567
Can someone please suggest how to get it?
You can melt the data, then use pd.crosstab:
melt = df.melt('Name')
pd.crosstab(melt['value'], melt['variable'], normalize='columns')
Or a bit faster (yet more verbose) with melt and groupby().value_counts():
(df.melt('Name')
   .groupby('variable')['value'].value_counts(normalize=True)
   .unstack('variable', fill_value=0)
)
Output:
variable Batch CN DXYR Emp Lateral GDX MMT
value
0 0.50 0.5 0.25 0.25 0.25 0.50
1 0.25 0.0 0.75 0.25 0.25 0.25
2 0.25 0.5 0.00 0.50 0.50 0.25
Update: apply also works:
df.drop(columns=['Name']).apply(pd.Series.value_counts, normalize=True)
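All of these return fractions; to match the exact layout asked for (percentages with Count_x_percent row labels), one possible finishing step, sketched here on top of the crosstab variant:
out = pd.crosstab(melt['value'], melt['variable'], normalize='columns') * 100
out.index = [f'Count_{v}_percent' for v in out.index]
print(out)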

How to extract a database based on a condition in pandas?

Please help me. Below is the problem:
write an expression to extract a new dataframe containing those days where the temperature reached at least 70 degrees, and assign that to the variable at_least_70. (You might need to think some about what the different columns in the full dataframe represent to decide how to extract the subset of interest.)
After that, write another expression that computes how many days reached at least 70 degrees, and assign that to the variable num_at_least_70.
This is the original DataFrame
Date Maximum Temperature Minimum Temperature \
0 2018-01-01 5 0
1 2018-01-02 13 1
2 2018-01-03 19 -2
3 2018-01-04 22 1
4 2018-01-05 18 -2
.. ... ... ...
360 2018-12-27 33 23
361 2018-12-28 40 21
362 2018-12-29 50 37
363 2018-12-30 37 24
364 2018-12-31 35 25
Average Temperature Precipitation Snowfall Snow Depth
0 2.5 0.04 1.0 3.0
1 7.0 0.03 0.6 4.0
2 8.5 0.00 0.0 4.0
3 11.5 0.00 0.0 3.0
4 8.0 0.09 1.2 4.0
.. ... ... ... ...
360 28.0 0.00 0.0 1.0
361 30.5 0.07 0.0 0.0
362 43.5 0.04 0.0 0.0
363 30.5 0.02 0.7 1.0
364 30.0 0.00 0.0 0.0
[365 rows x 7 columns]
The code I wrote for the above problem is:
at_least_70 = dfc.loc[dfc['Minimum Temperature']>=70,['Date']]
print(at_least_70)
num_at_least_70 = at_least_70.count()
print(num_at_least_70)
The result it shows is:
Date
204 2018-07-24
240 2018-08-29
245 2018-09-03
Date 3
dtype: int64
But when I run the test case it shows...
Incorrect!
You are not correctly extracting the subset.
As suggested by @HenryYik, filter on the Maximum Temperature column rather than the Minimum:
at_least_70 = dfc.loc[dfc['Maximum Temperature'] >= 70,
                      ['Date', 'Maximum Temperature']]
num_at_least_70 = len(at_least_70)
Alternatively, use boolean indexing, and to count the True values of the mask use sum:
mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]
num_at_least_70 = mask.sum()
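For illustration, a minimal sketch of that approach on made-up data (the column names follow the question; the numbers are invented):
import pandas as pd
dfc = pd.DataFrame({
    'Date': ['2018-07-24', '2018-08-29', '2018-12-31'],
    'Maximum Temperature': [88, 75, 35],
})
mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]       # keeps the full rows, every column
num_at_least_70 = mask.sum()  # True counts as 1, so this gives 2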

Imputing NaN values by pandas forward fill method with a set pattern

Suppose I am working on a dataset with a column named "F_N" containing numeric values in a sequence like 10, 20, 30, nan, 50, nan, 70. I want these null places to be filled with 40 and 60 in their respective places with pandas' help. I am aware of fillna(method='ffill'), but it would give us the exact values 30 and 50, not the pattern.
Use linear interpolation with interpolate:
df['F_N'] = df['F_N'].interpolate()
>>> df
F_N
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 60.0
6 70.0
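The same call also handles runs of consecutive NaNs, spacing the fills evenly between the surrounding values; a quick sketch on made-up data:
import numpy as np
import pandas as pd
s = pd.Series([10, 20, np.nan, np.nan, 50, np.nan, 70])
print(s.interpolate().tolist())
# [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]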
You describe a sequence with missing values. fillna() can take a Series, so the simplest approach is to fill with the expected values. The code below demonstrates this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"F_N":range(0,101,10)})
df.loc[np.random.choice(df.index,5)] = np.nan
df["fill"] = df["F_N"].fillna(pd.Series(range(0,101,10)))
Output:
    F_N  fill
0   nan     0
1    10    10
2    20    20
3    30    30
4   nan    40
5   nan    50
6    60    60
7    70    70
8   nan    80
9    90    90
10  100   100

How to add a variable instead of a number in a dataframe?

Hi guys, I am trying to select the 2nd value and then add this value to the rest of the array except the 1st value.
This is what I have so far:
Xloc = X.iloc(1) # selecting the second variable
X = X[1:-1] + Xloc # this doesn't work, but if I do + 1.25 it works...
The DataFrame:
X
0
1.25
2.57
4.5
6.9
7.3
Expected Result
X
0
2.5
3.82
5.75
8.15
8.55
Given that this is your original df
N
0 0.00
1 1.25
2 2.57
3 4.50
4 6.90
5 7.30
you can compute the shifted values and use a simple concat to keep the original first value in place:
df['M'] = pd.concat([df["N"].iloc[:1], (df["N"].iloc[1:] + df["N"].iloc[1])])
print(df)
N M
0 0.00 0.00
1 1.25 2.50
2 2.57 3.82
3 4.50 5.75
4 6.90 8.15
5 7.30 8.55
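As an aside, the original attempt fails because X.iloc(1) calls the indexer instead of indexing into it; X.iloc[1] returns the scalar, after which plain broadcasting works. A minimal sketch of that variant on the same data:
import pandas as pd
df = pd.DataFrame({'N': [0.00, 1.25, 2.57, 4.50, 6.90, 7.30]})
second = df['N'].iloc[1]                     # 1.25, a plain scalar
df['M'] = df['N']
df.loc[1:, 'M'] = df['N'].iloc[1:] + second  # leave row 0 untouched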