Plotting a line graph from a pandas DataFrame does not work if I do not include .mean(), .sum(), or even .median(). Very confused - pandas

I have a DataFrame with columns for date, city, country, and average temperature in Celsius.
Here is the .info() of this DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16500 entries, 0 to 16499
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   date        16500 non-null  object
 1   city        16500 non-null  object
 2   country     16500 non-null  object
 3   avg_temp_c  16407 non-null  float64
dtypes: float64(1), object(3)
I only want to plot the avg_temp_c for the cities of Toronto and Rome, both lines on one graph. This was actually a practice problem that has the solution, so here is the code for that:
toronto = temperatures[temperatures["city"] == "Toronto"]
rome = temperatures[temperatures["city"] == "Rome"]
toronto.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="blue")
rome.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="green")
My question is: why do I need to include .mean() in lines 3 and 4? I thought the numbers were already in avg_temp_c. I also experimented by replacing .mean() with .sum() and .median(), and both give the same values. However, removing .mean() altogether from both lines just gives a blank plot. I tried to figure out why, but I am very confused and I want to understand. Why doesn't it work without .mean() when the values are already listed in avg_temp_c?
I tried removing .mean(). I tried replacing .mean() with .median() and .sum(), which give the exact same values for some reason. I tried just printing toronto["avg_temp_c"] and rome["avg_temp_c"], which gives me the values, but when I plot without .mean(), .sum(), or .median(), it does not work. I am just trying to figure out why this is the case, and how all three of those methods can give me the same values as if I had just printed the avg_temp_c column.
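For concreteness, here is a tiny made-up example reproducing what I see. (I suspect each date appears only once per city in my data, which might be why all three aggregations agree, but I am not sure.)

```python
import pandas as pd

# Tiny made-up stand-in for the real temperatures DataFrame
temperatures = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-01-01", "2000-02-01"],
    "city": ["Toronto", "Toronto", "Rome", "Rome"],
    "avg_temp_c": [-5.3, -4.1, 8.2, 9.0],
})

toronto = temperatures[temperatures["city"] == "Toronto"]

# Each date occurs once for Toronto here, so every group holds a
# single value and all three aggregations return identical Series
print(toronto.groupby("date")["avg_temp_c"].mean())
print(toronto.groupby("date")["avg_temp_c"].sum())
print(toronto.groupby("date")["avg_temp_c"].median())
```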
Hope my question was clear. Thank you!

Related

How to handle categorical variables in MissForest() during missing value imputation?

I am working on a regression problem using the Bengaluru house price prediction dataset.
I was trying to impute missing values in bath and balcony using MissForest().
Since the documentation says that MissForest() can handle categorical variables via the 'cat_vars' parameter, I tried to use the 'area_type' and 'location' features in the imputer's fit_transform method by passing their indices, as shown below:
df_temp.info()
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   area_type   10296 non-null  object
 1   location    10296 non-null  object
 2   bath        10245 non-null  float64
 3   balcony     9810 non-null   float64
 4   rooms       10296 non-null  int64
 5   tot_sqft_1  10296 non-null  float64
imputer = MissForest()
imputer.fit_transform(df_temp, cat_vars=[0,1])
But I am getting the error below:
could not convert string to float: 'Super built up Area'
Could you please let me know why this happens? Do we need to encode the categorical variables using one-hot encoding first?
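Update: one thing I tried (a sketch on a made-up miniature of df_temp, under the assumption that cat_vars expects the categorical columns to already be label-encoded as integers) is encoding the object columns with pandas before imputing:

```python
import pandas as pd

# Made-up miniature of df_temp; only the two object columns matter here
df_temp = pd.DataFrame({
    "area_type": ["Super built up Area", "Plot Area", "Super built up Area"],
    "location": ["Whitefield", "Rajaji Nagar", "Whitefield"],
    "bath": [2.0, None, 3.0],
})

# Label-encode the object columns so every value is numeric
for col in ["area_type", "location"]:
    df_temp[col] = pd.factorize(df_temp[col])[0]

# df_temp is now all-numeric, so a call like
#   MissForest().fit_transform(df_temp, cat_vars=[0, 1])
# no longer receives raw strings (the call is left as a comment here,
# since it depends on the missingpy package being installed)
print(df_temp.dtypes)
```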

How do I get the column names for the keys in a DataFrameGroupBy object? [duplicate]

This question already has an answer here:
How can I get the name of grouping columns from a Pandas GroupBy object?
(1 answer)
Closed 12 months ago.
Given a grouped DataFrame (obtained by df.groupby([col1, col2])) I would like to obtain the grouping variables (col1 and col2 in this case).
For example, from the GroupBy user guide
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
grouped = df.groupby(["class", "order"])
Given grouped I would like to get class and order. However, grouped.indices and grouped.groups contain only the values of the keys, not the column names.
The column names must be in the object somewhere, because if I run grouped.size() for example, they are included in the indices:
class   order
bird    Falconiformes     1
        Psittaciformes    1
mammal  Carnivora         2
        Primates          1
dtype: int64
And therefore I can run grouped.size().index.names which returns FrozenList(['class', 'order']). But this is doing an unnecessary calculation of .size(). Is there a nicer way of retrieving these from the object?
The ultimate reason I'd like this is so that I can do some processing for a particular group, and associate it with a key-value pair which defines the group. That way I would be able to amalgamate different grouped datasets with arbitrary levels of grouping. For example I could have
group max_speed
class=bird,order=Falconiformes 389.0
class=bird,order=Psittaciformes 24.0
class=bird 206.5
foo=bar 45.5
...
Very similar to your own suggestion, you can extract the grouped-by column names using:
grouped.dtypes.index.names
It is not shorter, but you avoid calling a method.
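A further note, as a sketch rather than documented public API: I believe grouped.dtypes has been deprecated on newer pandas versions, but the keys you passed to groupby are also stored directly on the GroupBy object:

```python
import pandas as pd
import numpy as np

# Same example DataFrame as in the question
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
grouped = df.groupby(["class", "order"])

# The argument passed to groupby is kept on the object itself
print(grouped.keys)  # ['class', 'order']
```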
A grouped DataFrame (obtained by df.groupby([col1, col2])) is a pandas.core.groupby.generic.DataFrameGroupBy object, so we have to convert it back into a DataFrame in order to get the column names.
df2 = grouped.size().reset_index(name="Group_Count")
print(df2)
Output:
class order Group_Count
0 bird Falconiformes 1
1 bird Psittaciformes 1
2 mammal Carnivora 2
3 mammal Primates 1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   class        4 non-null      object
 1   order        4 non-null      object
 2   Group_Count  4 non-null      int64
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
I think this would solve your problem of selecting the column names from the grouped data. df.groupby(cols) returns a DataFrameGroupBy object, and to turn it back into a DataFrame you apply an aggregation to it, such as mean(), count(), or size().

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are arrays. I simply want to drop all of the rows whose "ORIGIN_AIRPORT_ID" value is an array (object dtype, I believe), but I have not been able to figure out how after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up as a list are the ones I want to remove. The dataset is a couple million rows, so I just need code that removes all of the array-like values in that specific column, if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
Next time, consider giving us a data sample as text instead of a figure; it's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pandas to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows now just use .dropna
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
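If the offending cells turn out to be actual Python list objects rather than strings, pd.to_numeric may raise instead of coercing; a sketch of an alternative (column names taken from the example above) is to filter on the cell's type directly:

```python
import pandas as pd

# Made-up reconstruction of the sample data, with a real list object
df = pd.DataFrame({
    "ITIN_ID": [20194146, 20194147, 20194148, 20194149, 20194150],
    "ORIGIN_AIRPORT_ID": [10397, 10397, 10397,
                          [10397, 10398, 10399, 10400], 10397],
})

# Keep only rows whose ORIGIN_AIRPORT_ID is a plain scalar, not a list
df = df[df["ORIGIN_AIRPORT_ID"].apply(lambda v: not isinstance(v, list))]
print(df)
```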

When plotting a pandas dataframe, the y-axis values are not displayed correctly

I have a dataframe (see link for image) and I've listed its info below. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of spanning the minimum and maximum values of the M and F columns.
To verify that it's not my environment, I created a simple dataframe with just a few values and plotted the line graph for it, and that works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I can try other graphing utilities like matplotlib, but I'm curious why it's not working for such a simple dataframe and dataset.
Visit my github page for a working example <git#github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   name    1690784 non-null  object
 1   sex     1690784 non-null  object
 2   births  1690784 non-null  int64
 3   year    1690784 non-null  Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   F       131 non-null    int64
 1   M       131 non-null    int64
dtypes: int64(2)
memory usage: 3.1+ KB
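For reference, here is a tiny made-up version of the pivot I am doing; this small case plots fine for me, and the problem only shows up with the real data:

```python
import pandas as pd

# Tiny made-up stand-in for the births data
df = pd.DataFrame({
    "name": ["Mary", "John", "Anna", "James"],
    "sex": ["F", "M", "F", "M"],
    "births": [7065, 9655, 2604, 5927],
    "year": [1880, 1880, 1881, 1881],
})

new_df = df.pivot_table(values="births", index="year", columns="sex", aggfunc="sum")
print(new_df)
# new_df.plot()  # the y-axis should span the actual birth counts
```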

Pandas df.head() Error

I'm having a basic problem with df.head(). When the function is executed, it usually displays a nicely HTML-formatted table of the top 5 rows, but now it only appears to slice the dataframe and outputs this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
survived    5 non-null values
pclass      5 non-null values
name        5 non-null values
sex         5 non-null values
age         5 non-null values
sibsp       5 non-null values
parch       5 non-null values
fare        5 non-null values
embarked    5 non-null values
dtypes: float64(2), int64(4), object(3)
After looking at this thread I tried
pd.util.terminal.get_terminal_size()
and received the expected output (80, 25). Manually setting the print options with
pd.set_printoptions(max_columns=10)
yields the same sliced dataframe output as above.
This was confirmed after diving into the documentation here and using the
get_option("display.max_rows")
get_option("display.max_columns")
and getting the correct default 60 rows and 10 columns.
I've never had a problem with df.head() before, but now it's an issue in all of my IPython Notebooks.
I'm running pandas 0.11.0 and IPython 0.13.2 in google chrome.
In pandas 0.11.0, I think the minimum of display.height and display.max_rows (and of display.width and display.max_columns) is used, so you need to manually change those too.
I don't like this, I posted this github issue about it previously.
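A sketch of what I mean by changing them manually, using pd.set_option (on 0.11 you would also bump display.height and display.width; I only show the max_rows/max_columns pair here since the other two were removed in later pandas versions):

```python
import pandas as pd

# Raise the display limits so head() renders as a table again
pd.set_option("display.max_rows", 60)
pd.set_option("display.max_columns", 20)

print(pd.get_option("display.max_columns"))  # 20
```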
Try using the following to display the top 10 items
from IPython.display import HTML
HTML(users.head(10).to_html())
I think pandas 0.11's head function is totally unintuitive; it should simply have remained head() and given you the HTML.