TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index' - pandas

I have a pandas DataFrame and I need to group the columns per quarter using the resample function:
df_copy=df_copy.resample('Q',axis=1).mean()
however when I apply the function I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or
PeriodIndex, but got an instance of 'Index'
Please, can you help me group the years into quarters? This function is just driving me crazy. Before applying it, I converted the column labels (2000-1, 2000-2, 2000-3, ...) from strings to datetime so that I could resample.
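A minimal sketch of one way around the error, assuming the column labels really are year-month strings as described (the frame and its values here are made up): convert the labels to datetimes, then group the transposed frame by quarter, which sidesteps resampling along axis=1 (deprecated in recent pandas).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the original frame: column labels are
# year-month strings like the asker's '2000-1', '2000-2', ...
df_copy = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    columns=['2000-1', '2000-2', '2000-3', '2000-4', '2000-5', '2000-6'],
)

# resample/period grouping only works on a datetime-like index, so
# convert the string labels first, then group the transposed frame
periods = pd.to_datetime(df_copy.columns).to_period('Q')
quarterly = df_copy.T.groupby(periods).mean().T
print(quarterly)
```

Here the result has one column per quarter (2000Q1, 2000Q2), each holding the mean of that quarter's three months.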


Regarding ValueError: NaTType does not support strftime

My dataframe has 2 levels of columns and I need to convert column level[1] from datetime into strings, but the column headers contain some NaT values, so my strftime call fails.
df.columns=['a','b','d','d']+[x.strftime('%m-%d-%Y') for x in df.columns.levels[1][:-1]]
This gives me error that
ValueError: NaTType does not support strftime
Based on discussions on a similar topic, I tried using
[x.datetime.strftime('%m-%d-%Y') for x in df.columns.levels[1][:-1]]
but then I get an error saying
AttributeError: 'Timestamp' object has no attribute 'datetime'
Is there anything that I am missing? Please help.
Thank you!
You can add a condition when converting with:
[x.strftime('%m-%d-%Y') if not pd.isnull(x) else "01-01-1970"
 for x in df.columns.levels[1][:-1]]
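For completeness, a self-contained sketch of that conditional conversion, using made-up dates in place of the asker's column level:

```python
import pandas as pd

# A datetime level that contains a NaT, as in the question;
# pd.to_datetime turns None into NaT
cols = pd.to_datetime(['2021-01-15', '2021-02-15', None])

# Guard each element with pd.isnull before calling strftime,
# substituting a sentinel date for the missing entries
labels = [x.strftime('%m-%d-%Y') if not pd.isnull(x) else '01-01-1970'
          for x in cols]
print(labels)
```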

Error tokenizing data. C error: EOF inside string starting at row 148480 with groupby function (dask)

I'm working with a dask dataframe, and when I apply the groupby method I get the following parser error:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 148480
I'm pretty new to Python, and I don't see how to fix a parsing error that occurs when using a method.
Here's my code:
df1 = df.groupby('date')[['compound score']].mean().compute()
where date (string) and compound score (float) are two columns of df (a dask dataframe).
Each date appears in many rows of the dataframe, that's why I want to use the groupby method.
I was expecting df1 to be a new dask dataframe with only 2 columns, the date and the mean of compound score for each date. Instead I get the parsing error.
I see many people are having this issue with pandas.read_csv(), but none with the groupby method.
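That is very likely why: dask reads the CSV lazily, so the ParserError comes from the underlying read_csv, not from groupby; it only surfaces when .compute() triggers the actual read. A small pandas sketch (with a made-up two-column CSV) reproduces the error and shows one possible fix, which applies equally to dask's read_csv:

```python
import csv
import io
import pandas as pd

# A field with an unterminated quote reproduces the tokenizing error;
# this CSV text is invented to mimic the asker's two columns
raw = 'date,compound score\n2020-01-01,0.5\n2020-01-01,"0.7\n'

try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as e:
    print(e)  # C error: EOF inside string ...

# Turning off quote handling lets the file parse; the stray quote then
# shows up in the data and can be stripped afterwards
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)
df['compound score'] = pd.to_numeric(
    df['compound score'].astype(str).str.strip('"'))
df1 = df.groupby('date')[['compound score']].mean()
print(df1)
```

Depending on how the real file is malformed, other read_csv options (e.g. an explicit quotechar or escapechar) may be the better repair.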

Filtering DataFrame with static date value

I am trying to filter a DataFrame to get all dates greater than '2012-09-15'.
I tried the solution from another post, which suggested using
data.filter(data("date").lt(lit("2015-03-14")))
but I am getting an error:
TypeError: 'DataFrame' object is not callable
What is the solution for this?
You need square brackets around "date", i.e.
data.filter(data["date"] < lit("2015-03-14"))
Calling data("date") treats data as a function rather than a DataFrame; that call syntax belongs to Spark's Scala API, not PySpark.

Performing a conditional statement on a GROUPED data frame in pandas using Jupyter Notebook

I get the following error:
TypeError: '>=' not supported between instances of 'SeriesGroupBy' and 'int'
when I perform a conditional on a column of a grouped data frame:
group_school_data["reading_score"] >= 70
I do not have this problem when I use the same syntax on a regular (non-grouped) data frame. So when I type:
school_data_complete["reading_score"] >= 70
I get a boolean Series that marks every instance of 'reading_score' >= 70 as True, which I can sum up.
However, 'group_school_data' is a grouped data frame created from 'school_data_complete' by grouping on school name as follows:
group_school_data = school_data_complete.groupby(["school_name"])
When I searched Stack Overflow, I did not find any hints. The most popular response shows how to create a grouped data frame based on an if condition, which is not what I am looking for.
I also tried the syntax suggested in the following instructional video for a non-grouped data frame, but I get the same error message: https://www.youtube.com/watch?v=wJhdZfuO2ZA
My code works for a regular data frame and returns a series:
school_data_complete["reading_score"] >= 70
But it does not work for a grouped data frame:
group_school_data["reading_score"] >= 70
and returns:
TypeError: '>=' not supported between instances of 'SeriesGroupBy' and 'int'
'group_school_data' is a grouped data frame created from 'school_data_complete' by grouping on school name as follows:
group_school_data = school_data_complete.groupby(["school_name"])
and calling .head() on the grouping returned a data frame.
I expect to get a list or series when I perform the same comparison on the grouped data frame, but instead I get:
TypeError                                 Traceback (most recent call last)
----> 1 group_school_data["math_score"] >= 70
TypeError: '>=' not supported between instances of 'SeriesGroupBy' and 'int'
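A sketch of one way to get the per-school counts, assuming the column names from the question and using made-up data: a SeriesGroupBy does not support comparison operators, so do the comparison on the plain column first and then group the boolean result.

```python
import pandas as pd

# Minimal stand-in for school_data_complete (column names taken from
# the question, values invented)
school_data_complete = pd.DataFrame({
    'school_name': ['A', 'A', 'B', 'B'],
    'reading_score': [80, 60, 90, 75],
})

# Compare on the ungrouped column, then group the boolean Series;
# summing booleans counts the True values per school
passing = school_data_complete['reading_score'] >= 70
passing_per_school = passing.groupby(school_data_complete['school_name']).sum()
print(passing_per_school)
```

The same idea works for the math scores, or for computing a passing rate with .mean() instead of .sum().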

TypeError: <class 'datetime.time'> is not convertible to datetime

The problem is somewhat simple. My objective is to compute the days difference between two dates, say A and B.
These are my attempts:
df['daydiff'] = df['A']-df['B']
df['daydiff'] = ((df['A']) - (df['B'])).dt.days
df['daydiff'] = (pd.to_datetime(df['A'])-pd.to_datetime(df['B'])).dt.days
These worked for me before, but for some reason I keep getting this error this time:
TypeError: <class 'datetime.time'> is not convertible to datetime
When I export the df to Excel, the dates look just fine. Any thoughts?
Use pd.Timestamp to handle the awkward differences in your formatted times.
df['A'] = df['A'].apply(pd.Timestamp) # will handle parsing
df['B'] = df['B'].apply(pd.Timestamp) # will handle parsing
df['day_diff'] = (df['A'] - df['B']).dt.days
Of course, if you don't want to change the format of the df['A'] and df['B'] within the DataFrame that you are outputting, you can do this in a one-liner.
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
This will give you the days between as an integer.
When I applied the solution offered by emmet02, I got TypeError: Cannot convert input [00:00:00] of type <class 'datetime.time'> to Timestamp as well. It's basically saying that the dataframe contains time-only values, displayed as [00:00:00], and such a value is rejected by the pandas.Timestamp function.
To address this, simply apply a suitable missing-value strategy to clean your data set before using
df.apply(pd.Timestamp)
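A minimal sketch of one such strategy, with made-up data: mask the time-only entries as missing before converting, so the subtraction simply yields NaT for those rows instead of raising.

```python
import datetime

import pandas as pd

# Toy frame: one entry of column A is a bare datetime.time, which is
# the kind of value that triggers "is not convertible to datetime"
df = pd.DataFrame({
    'A': [datetime.date(2021, 3, 1), datetime.time(0, 0),
          datetime.date(2021, 3, 10)],
    'B': [datetime.date(2021, 2, 20), datetime.date(2021, 2, 25),
          datetime.date(2021, 3, 5)],
})

# Replace time-only values with NaT before converting the column
is_time = df['A'].apply(lambda x: isinstance(x, datetime.time))
df.loc[is_time, 'A'] = pd.NaT

df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
df['day_diff'] = (df['A'] - df['B']).dt.days
print(df['day_diff'])
```

Rows with a masked date come out as NaN in day_diff; whether to drop them or fill them depends on what the missing dates mean in the data.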