Performing a conditional statement on a GROUPED data frame in pandas using Jupyter Notebook - pandas

I get the following error:
TypeError: '>=' not supported between instances of 'SeriesGroupBy' and 'int'
when I apply a conditional to a column of a GROUPED data frame:
group_school_data["reading_score"] >= 70
I do not have this problem when I use the same syntax on a regular (non-grouped) data frame. When I type:
school_data_complete["reading_score"] >= 70
I get a Boolean Series that is True wherever 'reading_score' >= 70, which I can then sum.
However, 'group_school_data' is a grouped data frame created from 'school_data_complete' by grouping it by school name, as follows:
group_school_data = school_data_complete.groupby(["school_name"])
When I searched Stack Overflow, I did not find any hints. The most popular answer shows how to create a grouped data frame based on an if condition, which is not what I am looking for.
I also tried the syntax suggested in the following instructional video for a non-grouped data frame, but I get the same error message: https://www.youtube.com/watch?v=wJhdZfuO2ZA
My code works for a regular data frame and returns a Series:
school_data_complete["reading_score"] >= 70
but it does not work for a grouped data frame:
group_school_data["reading_score"] >= 70
and returns:
TypeError: '>=' not supported between instances of 'SeriesGroupBy' and 'int'
Calling .head() on the grouped object returns a data frame.
I expected to get a list or Series when performing the same comparison on the grouped data frame, but instead I get:
TypeError Traceback (most recent call last)
in
----> 1 group_school_data["math_score"] >= 70
TypeError: '>=' not supported between instances of 'SeriesGroupBy' and 'int'
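No answer is shown above, but comparisons like >= are simply not defined on a SeriesGroupBy object. One common workaround is to do the comparison on the plain data frame first and then group the Boolean result. A minimal sketch with made-up toy data (only the column names are taken from the question):

```python
import pandas as pd

# Toy stand-in for school_data_complete (column names from the question).
school_data_complete = pd.DataFrame({
    "school_name": ["A", "A", "B", "B"],
    "reading_score": [80, 60, 90, 75],
})

# Compare on the plain DataFrame first (this yields a Boolean Series),
# then group that Series by school and sum the True values per group.
passing = school_data_complete["reading_score"] >= 70
passing_per_school = passing.groupby(school_data_complete["school_name"]).sum()
print(passing_per_school)
```

The same idea works by filtering the data frame before grouping, e.g. `school_data_complete[school_data_complete["reading_score"] >= 70].groupby("school_name").size()`.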

Related

Error tokenizing data. C error: EOF inside string starting at row 148480 with groupby function (dask)

I'm working with a dask dataframe, and when I apply the groupby method I get the following parse error:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 148480
I'm pretty new to Python, and I don't see how to fix a parsing error that occurs when using a method.
Here's my code:
df1 = df.groupby('date')[['compound score']].mean().compute()
where date (string) and compound score (float) are two columns of df (a dask dataframe).
Each date appears in many rows of the dataframe, that's why I want to use the groupby method.
I was expecting df1 to be a new dask dataframe with only two columns: the date and the mean compound score for each date. Instead I get the parsing error.
I see many people have this issue with pandas.read_csv(), but none with the groupby method.
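For what it's worth, the groupby pattern itself is fine; dask evaluates lazily, so the CSV is only actually parsed when .compute() runs, which is why a read_csv-style parser error can surface at the groupby step. The same aggregation works in plain pandas, shown here as a minimal sketch with toy data (column names from the question):

```python
import pandas as pd

# Toy stand-in for the dask dataframe (column names from the question).
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "compound score": [0.2, 0.4, 0.6],
})

# Same pattern as the dask call, minus .compute(): group by date and
# take the mean of "compound score" for each date.
df1 = df.groupby("date")[["compound score"]].mean()
print(df1)
```

If this pandas version works on a sample of your file, the problem is in the CSV itself (e.g. an unclosed quote near row 148480), not in the groupby.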

Error when filtering pandas dataframe by column value

I am having a problem filtering a pandas dataframe. I am trying to filter a dataframe based on column values being equal to a specific list, but I am getting a length error.
I tried every possible way of filtering a dataframe but got nowhere. Any help would be appreciated, thanks in advance.
Here is my code:
for ind in df_hourly.index:
    timeslot = df_hourly['date_parsed'][ind][0:4]  # list value to filter
    filtered_df = df.loc[df['timeslot'] == timeslot]
Error : ValueError: ('Lengths must match to compare', (5696,), (4,))
(Above image: df; below image: df_hourly.)
In the above image, the dataframe I want to filter is shown. Specifically, I want to filter on the "timeslot" column.
The below image shows the dataframe which includes the value I want to filter by, specifically the "date_parsed" column. In the first line of my code, I iterate through every row in this dataframe and assign the first 4 elements of the list value in df_hourly["date_parsed"] to a variable, and later in the code I try to filter the above dataframe by that variable.
When comparing columns using ==, pandas tries to compare value by value: does the first item equal the first item, the second the second, and so on. This is why you receive this error: pandas expects the two sides to have the same shape.
If you want to check whether each value is inside a list, you can use .isin (documentation):
df.loc[df['timeslot'].isin(timeslot)]
Depending on what timeslot is exactly, you might need timeslot.values or something similar (hard to say without an example of your dataframe).
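A minimal runnable sketch of the .isin fix, with toy data standing in for the dataframes from the question:

```python
import pandas as pd

# Toy stand-ins for df and the list extracted from df_hourly['date_parsed'].
df = pd.DataFrame({"timeslot": [1, 2, 3, 4, 5]})
timeslot = [1, 3, 4, 99]  # like df_hourly['date_parsed'][ind][0:4]

# == would attempt an element-wise comparison and fail on mismatched
# lengths; .isin tests list membership for each row instead.
filtered_df = df.loc[df["timeslot"].isin(timeslot)]
print(filtered_df)
```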

Combining low-frequency values into a single "Other" category using pandas

I am using this line of code, which uses the replace method to combine low-frequency values in the column:
psdf['method_name'] = psdf['method_name'].replace(small_categoris, 'Other')
The error I am getting is:
'to_replace' should be one of str, list, tuple, dict, int, float
So I tried to run this line of code before the replace method:
psdf['layer'] = psdf['layer'].astype("string")
Now the column is of type string, but the same error still appears. For context, I am working with the pandas API on Spark. Also, is there a more efficient way than replace, especially if we want to do the same for more than one column?
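The error message lists the accepted types for to_replace, which suggests small_categoris is a pandas Index or Series (e.g. the output of value_counts) rather than a plain list; converting it with .tolist() is one possible fix. A minimal sketch in plain pandas (the toy data and the frequency threshold of 2 are assumptions):

```python
import pandas as pd

# Toy stand-in for psdf; the real code uses the pandas API on Spark.
psdf = pd.DataFrame({"method_name": ["get", "get", "get", "put", "del"]})

counts = psdf["method_name"].value_counts()
# replace() rejects an Index/Series as to_replace, so convert to a list.
small_categories = counts[counts < 2].index.tolist()
psdf["method_name"] = psdf["method_name"].replace(small_categories, "Other")
print(psdf["method_name"].tolist())
```

For several columns, the same value_counts-then-replace pattern can be run in a loop over the column names.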

Filtering DataFrame with static date value

I am trying to filter a DataFrame to get all dates greater than '2012-09-15'.
I tried the solution from another post, which suggested using
data.filter(data("date").lt(lit("2015-03-14")))
but I am getting an error:
TypeError: 'DataFrame' object is not callable
What is the solution for this?
You need square brackets around "date", i.e.
data.filter(data["date"] < lit("2015-03-14"))
Calling data("date") treats data as a function (rather than a dataframe).

Julia : Dataframes packages having trouble to convert column containing both int and float

I'm an R user with great interest in Julia. I don't have a computer science background. I just tried to read a CSV file in Juno with the following command:
using CSV
using DataFrames
df = CSV.read(joinpath(Pkg.dir("DataFrames"),
"path/to/database.csv"));
and got the following error message
CSV.CSVError("error parsing a `Int64` value on column 26, row 289; encountered '.'")
in read at CSV/src/Source.jl:294
in #read#29 at CSV/src/Source.jl:299
in stream! at DataStreams/src/DataStreams.jl:145
in stream!#5 at DataStreams/src/DataStreams.jl:151
in stream! at DataStreams/src/DataStreams.jl:187
in streamto! at DataStreams/src/DataStreams.jl:173
in streamfrom at CSV/src/Source.jl:195
in parsefield at CSV/src/parsefield.jl:107
in parsefield at CSV/src/parsefield.jl:127
in checknullend at CSV/src/paresefield.jl:56
I looked at the entries indicated in the error: rows 287 and 288 contain 30 and 33 respectively (which seem to be integers), and row 289 contains 30.445 (which is a float).
Is the problem that CSV.jl fills the column with Int values and stops when it sees a Float?
Many thanks in advance
The problem is that the first float appears too late in the data set. By default CSV.jl uses a rows_for_type_detect value of 100, which means only the first 100 rows are used to determine the type of each column in the output. Set the rows_for_type_detect keyword parameter in CSV.read to e.g. 300 and all should work correctly.
Alternatively, you can pass the types keyword argument to set the column type manually (in this case Float64 would be appropriate for that column).