'NaTType' object has no attribute 'days' - pandas

I have a column in my dataset which represents a date in ms, and sometimes its value is NaN (actually my column is of type str and sometimes its value is the string 'nan'). I want to compute the epoch in days of this column. The problem is that when taking the difference of two dates:
(pd.to_datetime('now') - pd.to_datetime(np.nan)).days
if one is NaN it is converted to NaT, and the difference is of type NaTType, which doesn't have the days attribute.
In my case I would like to have nan as a result.
Another approach I have tried: np.datetime64 cannot be used, since it cannot take nan as an argument. My data cannot be converted to int, since int has no representation for nan.

It will just work if you filter the NaT rows out first:
In [201]:
import datetime as dt
df = pd.DataFrame({'date': [dt.datetime.now(), pd.NaT, dt.datetime(2015, 1, 1)]})
df
Out[201]:
date
0 2015-08-28 12:12:12.851729
1 NaT
2 2015-01-01 00:00:00.000000
In [203]:
df.loc[df['date'].notnull(), 'days'] = (pd.to_datetime('now') - df['date']).dt.days
df
Out[203]:
date days
0 2015-08-28 12:12:12.851729 -1
1 NaT NaN
2 2015-01-01 00:00:00.000000 239
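Note that the Series arithmetic itself already yields NaN for the NaT rows, because .dt.days on a timedelta column returns NaN wherever the value is NaT; the .loc mask above just avoids writing into those rows. A minimal sketch:
import numpy as np
import pandas as pd
dates = pd.to_datetime(pd.Series(['2015-01-01', np.nan]))  # NaN becomes NaT
days = (pd.Timestamp('now') - dates).dt.days               # second entry is NaN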

For me, upgrading from pandas 0.19.2 to 0.20.3 resolved this error.
pip install --upgrade pandas

Related

How to combine multiple series into a dataframe with a series name per column in pandas

I have a list of pd.Series with different date indexes and names as such:
trade_date
2007-01-03 0.049259
2007-01-04 0.047454
2007-01-05 0.057485
2007-01-08 0.059216
2007-01-09 0.055359
...
2013-12-24 0.021048
2013-12-26 0.021671
2013-12-27 0.017898
2013-12-30 0.034071
2013-12-31 0.022301
Name: name1, Length: 1762, dtype: float64
I want to join this list of series into a DataFrame where each Name becomes a column in the DataFrame and any missing indexes are set to NaN.
When I try pd.concat(list_data) I just get one really big Series instead. If I create an empty DataFrame and loop over each series in my list, I get ValueError: cannot reindex from a duplicate axis. How can I join these into a DataFrame?
Use:
pd.concat(map(lambda s: s.groupby(level=0).last(), list_data), axis=1)
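The groupby(level=0).last() step collapses duplicate index entries within each series (keeping the last observation per date) so that concat can align them as columns. A minimal self-contained sketch on hypothetical data:
import pandas as pd
s1 = pd.Series([1.0, 2.0, 3.0], index=['2007-01-03', '2007-01-03', '2007-01-04'], name='name1')
s2 = pd.Series([9.0], index=['2007-01-04'], name='name2')
list_data = [s1, s2]
pd.concat(map(lambda s: s.groupby(level=0).last(), list_data), axis=1)
#             name1  name2
# 2007-01-03    2.0    NaN
# 2007-01-04    3.0    9.0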
older answer
You should use axis=1 in pandas.concat:
pd.concat([series1, series2, series3], axis=1)
Example on your data (assuming s is the series provided above):
pd.concat([s, (s+1).rename('name2').iloc[5:]], axis=1)
output:
name1 name2
trade_date
2007-01-03 0.049259 NaN
2007-01-04 0.047454 NaN
2007-01-05 0.057485 NaN
2007-01-08 0.059216 NaN
2007-01-09 0.055359 NaN
2013-12-24 0.021048 1.021048
2013-12-26 0.021671 1.021671
2013-12-27 0.017898 1.017898
2013-12-30 0.034071 1.034071
2013-12-31 0.022301 1.022301
You need to concat along the columns (combine Series horizontally):
pd.concat(list_data, axis=1)
You probably have multiple rows with the same date index per Series.
To debug this problem, you can do:
for sr in list_data:
    dups = sr[sr.index.value_counts() > 1]  # entries whose date appears more than once
    if len(dups):
        print(f'[{sr.name}]')
        print(dups, end='\n\n')
If there is any output, you can't use concat as-is. Perhaps you have to use merge with how='outer' instead.
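For completeness, a sketch of that merge fallback, assuming each series has a unique name and the index is named trade_date as in the question (note that merge pairs duplicate dates combinatorially, which may or may not be what you want):
from functools import reduce
frames = [sr.reset_index() for sr in list_data]  # each: a trade_date column plus one value column
merged = reduce(lambda left, right: left.merge(right, on='trade_date', how='outer'), frames)
merged = merged.set_index('trade_date').sort_index()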

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the average of the column they occur in?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the NaNs directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
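For example, the dict variant is a drop-in change (a sketch equivalent to passing the Series):
df.fillna(df.mean().to_dict())  # per-column means as a {column: value} dict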
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply per column the mean of that column and fill:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit when you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where required, rather than applying fillna() all over the DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN value treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had a NaN value:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  # apply only on variables with NaN values
    df[i].fillna(df[i].mean(), inplace=True)

# df.isnull().any(axis=0) gives a True/False flag per column (a Boolean Series),
# which, applied to df.columns[], identifies the variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for the long answer! Hope this helps!
If you want to impute missing values with the mean, going column by column, this will impute each column with only that column's mean. It might also be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# To read data from a csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values

# To impute the mean, use the SimpleImputer class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Use df.fillna(df.mean()) directly to fill all null values with the column means.
If you want to fill the null values of a single column with the mean of that column (here Item_Weight is the column name):
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
If you want to fill null values with some string instead (here Outlet_Size is the column name):
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous answers for the mean, but it could be shorter if you want to replace nulls by some other column function.
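For instance, swapping in another column statistic is a one-word change (a sketch using the median):
df = df.apply(lambda x: x.fillna(x.median()))  # per-column median instead of mean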
Using the sklearn library's imputer class:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])
Note: in recent versions the missing_values parameter value changed to np.nan from 'NaN', and the axis parameter was dropped (SimpleImputer, unlike the old preprocessing.Imputer, always imputes column-wise).
I use this method to fill the missing values of each column with that column's average.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent value. This works across different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
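A tiny self-contained demo of that pattern on hypothetical mixed-dtype data (value_counts ignores NaN by default; index[0] would raise IndexError on an all-NaN column):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, np.nan, 1.0], 'b': ['x', 'x', np.nan]})
df.apply(lambda x: x.fillna(x.value_counts().index[0]))
#      a  b
# 0  1.0  x
# 1  1.0  x
# 2  1.0  x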

How to sort a column that has dates in a dataframe?

I have a dataframe like this:
SEMANAS HIDROLOGICAS METEOROLOGICAS
0 02042020 36.00583090379008 31.284418529316522
1 05032020 86.91690962099126 77.01136731748973
2 12032020 87.31778425655976 77.24180581323434
3 19032020 59.2201166180758 54.57343110404338
4 26032020 32.39795918367347 29.049238743116323
I used this code to change df.SEMANAS to datetime
Semanas_Oper['SEMANAS']=pd.to_datetime(Semanas_Oper['SEMANAS'], format='%d%m%Y').dt.strftime('%d/%m/%Y')
SEMANAS HIDROLOGICAS METEOROLOGICAS
02/04/2020 36.01 31.28
05/03/2020 86.92 77.01
12/03/2020 87.32 77.24
19/03/2020 59.22 54.57
26/03/2020 32.4 29.05
But pd.to_datetime is not sorting the dates of the column df.SEMANAS.
Can you tell me how to sort this column? 02/04/2020 should end up in the last row.
dt.strftime() undoes the datetime conversion and brings you back to strings. If you sort on those, you'll get lexicographical sorting, which is not what you want given your format is '%d/%m/%Y' (it would be fine with '%Y/%m/%d').
When working with dates in pandas you should keep the datetime64[ns] dtype. It's the easiest way to perform all datetime operations. Only use .strftime when you need to move to some other library or file output that requires a very specific string format.
df['SEMANAS'] = pd.to_datetime(df['SEMANAS'], format='%d%m%Y')
df.dtypes
#SEMANAS datetime64[ns]
#HIDROLOGICAS object
#METEOROLOGICAS object
df = df.sort_values('SEMANAS')
# SEMANAS HIDROLOGICAS METEOROLOGICAS
#1 2020-03-05 86.91690962099126 77.01136731748973
#2 2020-03-12 87.31778425655976 77.24180581323434
#3 2020-03-19 59.2201166180758 54.57343110404338
#4 2020-03-26 32.39795918367347 29.049238743116323
#0 2020-04-02 36.00583090379008 31.284418529316522
You need to sort it while it's in datetime64[ns] format, and you can change it back to dd/mm/yyyy afterwards if you want:
df['SEMANAS'] = pd.to_datetime(df['SEMANAS'], format='%d%m%Y')
df.sort_values(by=['SEMANAS'], inplace=True)
df['SEMANAS'] = df['SEMANAS'].dt.strftime('%d/%m/%Y')  # already datetime here, so no second to_datetime needed
print(df)
SEMANAS HIDROLOGICAS METEOROLOGICAS
1 05/03/2020 86.916910 77.011367
2 12/03/2020 87.317784 77.241806
3 19/03/2020 59.220117 54.573431
4 26/03/2020 32.397959 29.049239
0 02/04/2020 36.005831 31.284419

Pandas dataframe datetime timestamp from string

I am trying to convert a column in a pandas dataframe from a string to a timestamp.
Due to a slightly annoying constraint (I am limited by my employer's software & IT policy) I am running an older version of Pandas (0.14.1). This version does include pd.Timestamp.
Essentially, I want to pass a dataframe column formatted as a string to pd.Timestamp to create a column of Timestamps. Here is an example dataframe:
'Date/Time String' 'timestamp'
0 2017-01-01 01:02:03 NaN
1 2017-01-02 04:05:06 NaN
2 2017-01-03 07:08:09 NaN
My DataFrame is very big, so iterating through it is really inefficient. But this is what I came up with:
for i in range(len(df['Date/Time String'])):
    df['timestamp'].iloc[i] = pd.Timestamp(df['Date/Time String'].iloc[i])
What would be the sensible way to make this operation much faster?
You can use the vectorized pd.to_datetime, which converts the whole column at once:
import pandas as pd
df['Date/Time Timestamp'] = pd.to_datetime(df['Date/Time String'])
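If the strings all share a single known layout, passing an explicit format can make the parse noticeably faster (a sketch matching the sample data above):
df['timestamp'] = pd.to_datetime(df['Date/Time String'], format='%Y-%m-%d %H:%M:%S')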

What is the functionality of the filling method when reindexing?

When reindexing, say, 1-minute data to daily data (e.g. an index for daily prices at 16:00), if there is no 1-minute data at the 16:00 timestamp on some day, we would want to forward fill from the last non-null 1-minute observation. In the following case, there is no 1-minute data before 16:00 on the 13th, and the last 1-minute data comes from the 10th.
When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows that it is missing, however.
import pandas as pd
import numpy as np
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan
hf.plot()
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')
To fill as one (or rather I) would expect, I need to do this:
daily2 = daily1.fillna(method='ffill')
If this is the case, what is the fill method in reindex actually doing? It is not clear to me from the pandas documentation. It seems to me I should not have to add that extra line.
I'll repeat my comment from GitHub here as well:
The current behavior in my opinion makes more sense. NaN values can be valid "actual" values in some scenarios; the concept of an actual NaN value should be kept distinct from a NaN that appears because of a changing index. If I have a dataframe like this:
A B C
1 1.242 NaN 0.110
3 NaN -0.185 -0.209
5 -0.581 1.483 NaN
and I want to keep all NaNs as NaN, it makes much more sense to have:
df.reindex( [2, 4, 6], method='ffill' )
A B C
2 1.242 NaN 0.110
4 NaN -0.185 -0.209
6 -0.581 1.483 NaN
Just take whatever value there is (NaN or not) and fill it forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.
This is completely different from
df.reindex( [2, 4, 6], method=None )
which produces
A B C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
Here is an example:
np.nan can just mean "not applicable". Say I have hourly data, and on weekends some calculations are just not applicable, so I fill NaN into those columns during the weekends. Now if I reindex to a finer index, say every minute, the reindex would pick the last value from Friday and fill it out for the whole weekend. This is wrong.
In reindexing a dataframe, forward fill means: just take whatever value there is (NaN or not) and fill it forward until the next available index. A NaN value can be an actual valid observation which you want to keep as is.
Reindexing should not enforce a mandatory fillna on the data.
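For readers who do want "fill from the last non-null source observation" semantics, a sketch using the variables from the question: forward-fill the minute data first, then reindex to the daily labels.
daily = hf.ffill().reindex(ind_daily, method='ffill')  # close NaN gaps from real observations, then pick the daily labels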