"None of [Float64Index([56.0, ..\n dtype='float64', length=1057499)] are in the [columns]" Pandas dataframe - pandas

Please excuse any obvious mistakes as I am new to Pandas and coding in general.
I am filtering the original dataframe and creating a copy with chosen columns. This is what my dataframe looks like:
(dataframe filter routine):
df_new=df.filter(['date','location','value','lat_final','lon_final'], axis=1)
df_new = df_new.set_index('date')
print (df_new.head())
The new dataframe:
location value lat_final lon_final
date
2015-06-30 09:40:00+05:30 XYZI 56.0 28.6508 77.3152
2015-06-30 11:00:00+05:30 MNOP 36.0 28.6683 77.1167
2015-06-30 17:10:00+05:30 QRST 71.0 28.6508 77.3152
2015-06-30 11:00:00+05:30 UVWX 98.0 28.6508 77.3152
2015-06-30 09:40:00+05:30 XXYZ 26.0 28.6683 77.1167
While trying to perform some operations on columns in this new dataframe, I am getting the error quoted in the title. These are the operations I am performing:
(This step goes fine)
f=df_new[df_new['value']>=0]
f.drop(f[f['value'] >1500].index, inplace = True)
f.drop(f[f['value'] <2].index, inplace = True)
(The error crops up here):
#Filtering steps:
#Step 1: grouping into 12-hour or n-hour intervals:
diurnal = f[f['value']].resample('12h')
Where am I going wrong?
Any help will be much appreciated.

This: f[f['value']] will give you an error. It tries to use the float values of the value column as column labels, which is exactly the "None of [Float64Index(...)] are in the [columns]" KeyError from the title. If you want to resample the value column, you should select it properly, and also tell resample how you want to aggregate the values (sum? mean?). Something like this:
f['value'].resample('12h').sum()
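Putting the steps together, a minimal sketch of the corrected pipeline, assuming df_new has its DatetimeIndex set from 'date' and a numeric value column as in the question:
# Keep rows with 2 <= value <= 1500 in one boolean mask instead of
# filtering and then dropping in place
f = df_new[(df_new['value'] >= 2) & (df_new['value'] <= 1500)]
# Resample the value column into 12-hour bins; .mean() is a guess at the
# intended aggregation, swap in .sum() or another reducer as needed
diurnal = f['value'].resample('12h').mean()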

Related

Append the values of a particular column from a given dataframe to another dataframe

I have two dataframes similar to the samples beneath. The first, df1, has one column, and the second has two columns. This is time series data.
import pandas as pd

#First dataFrame
data=('2013-01-01','2013-02-01','2013-03-01')
temperature=(-9,-14,5)
df1 = pd.DataFrame({"Data":data,"Temperature":temperature})
df1.set_index('Data',inplace=True)
#Second Dataframe
data2=('2013-04-01','2013-05-01','2013-06-01')
temperature2=(9,15,20)
temperature3=(7,19,22)
df2 = pd.DataFrame({"Data":data2,"Temperature":temperature2,"Temperature2":temperature3})
df2.set_index('Data',inplace=True)
Both dataframes have date-type indexes. I want to append the values from one column of df2 after the values of df1, but I do not know how to do it. It is a really simple thing in practice, but I need to do it in pandas, and I couldn't find any solution on the web. The new dataframe should look like this:
df_new
2013-01-01 -9
2013-02-01 -14
2013-03-01 5
2013-04-01 9
2013-05-01 15
2013-06-01 20
You can use the pd.concat function:
df_new = pd.concat([df1, df2])
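Note that df2 has an extra Temperature2 column, so concatenating the full frames leaves NaN values in that column for df1's rows. If, as in the expected output, only the shared Temperature column is wanted, selecting it first keeps the result clean; a small sketch:
# Concatenate only the shared column so the result matches the expected output
df_new = pd.concat([df1, df2[['Temperature']]])
print(df_new)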
OK! I found a solution to my problem. My DataFrames were imported from xlsx files, and the column headers need to have the same name; once they match, pd.concat works fine. Thanks!
Problem solved!

pandas groupby keeping other columns

This question is similar to this one, but in my case I need to apply a function that returns a Series for each group rather than a single value: that question is about aggregating with sum, whereas I need to use rank (so the difference is like that between agg and transform).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates, len(cols)),
                  index=np.tile(dates, nfirms),
                  columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E within each date (the index), but keeping column id.
If I just use groupby and .rank I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df, but I'm not sure it's safe to merge on the index; I really need to join on the id column as well. Can I assume that the order remains the same?
If the order in the output is the same as in the input, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank() is equivalent to .transform('rank').)
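For what it's worth, one way to reassure yourself is to check the alignment explicitly before assigning; a sketch, on the understanding that groupby rank acts as a transform (one output value per input row, carrying the original index):
ranks = df.groupby(level=0)['E'].rank()
# A transform should return the original index in the original order
assert ranks.index.equals(df.index)
df['ranks'] = ranks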

Pandas dataframe datetime timestamp from string

I am trying to convert a column in a pandas dataframe from a string to a timestamp.
Due to a slightly annoying constraint (I am limited by my employer's software & IT policy) I am running an older version of pandas (0.14.1). This version does include pd.Timestamp.
Essentially, I want to pass a dataframe column formatted as a string to pd.Timestamp to create a column of Timestamps. Here is an example dataframe:
'Date/Time String' 'timestamp'
0 2017-01-01 01:02:03 NaN
1 2017-01-02 04:05:06 NaN
2 2017-01-03 07:08:09 NaN
My DataFrame is very big, so iterating through it is really inefficient. But this is what I came up with:
for i in range(len(df['Date/Time String'])):
    df['timestamp'].iloc[i] = pd.Timestamp(df['Date/Time String'].iloc[i])
What would be the sensible way to make this operation much faster?
You can try this:
import pandas as pd
df['Date/Time Timestamp'] = pd.to_datetime(df['Date/Time String'])
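pd.to_datetime is vectorized, so it converts the whole column in one call rather than row by row, and it is available in 0.14.1. If the strings all share one fixed layout, passing an explicit format can speed parsing up further; a sketch, assuming the layout shown in the sample frame:
# Parse the whole column at once; the format matches '2017-01-01 01:02:03'
df['timestamp'] = pd.to_datetime(df['Date/Time String'], format='%Y-%m-%d %H:%M:%S')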

building a DataFrame of a portfolio of symbols

I'm new to pandas.
I'd like to read the quotes for a number of symbols (e.g. ['SPY', 'IWM', 'QQQ']) from Yahoo (which I do with no problem), and then I'd like to use only the 'Adj Close' columns to build a portfolio of ETFs over a given period of time.
Say that I'd like to start with an empty DataFrame whose index is the dates when the market is open, taken for example from the first df. Subsequently, I'd like to "append" one column at a time on the right, holding the 'Adj Close' of each symbol, renamed with the ticker name.
I'm sure it must be simple, but I can't get it. Can anybody help me? Thank you in advance.
If you are just using the Adj Close column, it is easiest to extract it immediately after reading the data.
import pandas.io.data as web
df = web.DataReader(['F', 'AAPL', 'IBM'], 'yahoo', '2016-05-02', '2016-05-06')['Adj Close']
>>> df
AAPL F IBM
Date
2016-05-02 93.073328 13.62 143.881476
2016-05-03 94.604009 13.43 142.752373
2016-05-04 93.620002 13.31 142.871221
2016-05-05 93.239998 13.32 145.070003
2016-05-06 92.720001 13.44 147.289993
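As a side note, pandas.io.data was later split out of pandas into the separate pandas-datareader package, so on a modern install the equivalent import (assuming that package is installed) would be:
# Modern replacement for `import pandas.io.data as web`
from pandas_datareader import data as web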

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
    # No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
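Once the offending values are located, one way to force the column to a numeric dtype is pd.to_numeric (available in pandas 0.17+), which turns unparseable entries such as '' into NaN; a sketch:
# Convert 'A' to float64, coercing non-numeric entries like '' to NaN
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print(df['A'].dtype)  # float64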