How do you iterate two variables in DataFrame, one of which is autoincrementing year? - pandas

import pandas as pd

df = pd.DataFrame(
    [['New York', 1995, 160000],
     ['Philadelphia', 1995, 115000],
     ['Boston', 1995, 145000],
     ['New York', 1996, 167500],
     ['Philadelphia', 1996, 125000],
     ['Boston', 1996, 148000],
     ['New York', 1997, 180000],
     ['Philadelphia', 1997, 135000],
     ['Boston', 1997, 185000],
     ['New York', 1998, 200000],
     ['Philadelphia', 1998, 145000],
     ['Boston', 1998, 215000]],
    index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    columns=['city', 'year', 'average_price'])
def percent_change(d):
    y1995 = float(d['average_price'][d['year'] == 1995].iloc[0])
    y1996 = float(d['average_price'][d['year'] == 1996].iloc[0])
    ratio = str(round(((y1996 / y1995) - 1) * 100, 2)) + '%'
    return ratio
city = df[df['city'] == 'New York']
percent_change(city)

my_final = {}
for c in df['city'].unique():
    city = df[df['city'] == c]
    my_final[c] = percent_change(city)
print(my_final)
My goal is to get the percentage change between each year for each city, so I can chart the percentage changes on a line chart. I can only figure out (crude as it may be) how to do it for one year. Even then, I don't think I'm properly assigning the year to the result. I don't know how to iterate through ALL the years. I'm so confused, but if someone can help me out, I feel like I can truly start to learn.
So, from 1995 to 1996 the percentage change in price is as follows:
{'New York': '4.69%', 'Philadelphia': '8.7%', 'Boston': '2.07%'}
Going through examples was easy, but the data was so abstract to me. Now that I have actual information that I want, I don't know how to process it.

We can use pivoting and rolling windows to achieve the desired output:
relative_changes = (
    df
    .pivot(index='year', columns='city', values='average_price')
    .rolling(window=2)
    .apply(lambda price: price.iloc[1] / price.iloc[0] - 1)
    .dropna()
)
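As a side note, pandas has a built-in pct_change that computes the same current/previous ratio, so the rolling window can be replaced with a one-liner. A sketch using the first two years of the question's data:

```python
import pandas as pd

df = pd.DataFrame(
    [['New York', 1995, 160000], ['Philadelphia', 1995, 115000],
     ['Boston', 1995, 145000], ['New York', 1996, 167500],
     ['Philadelphia', 1996, 125000], ['Boston', 1996, 148000]],
    columns=['city', 'year', 'average_price'])

# pct_change computes (current / previous) - 1 down each column
relative_changes = (
    df
    .pivot(index='year', columns='city', values='average_price')
    .pct_change()
    .dropna()
)
```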
I prefer not to hardcode the formatting inside the data, so that the values can still be used in further calculations. Any formatting can be applied later when needed, for example when displaying data on the screen:
display(
    relative_changes
    .style
    .format("{:.2%}")
    .set_caption("Relative changes")
)
The same with charts:
ax = relative_changes.plot(kind='bar', figsize=(10,6))
ax.xaxis.set_tick_params(labelrotation=0)
ax.yaxis.set_major_formatter(lambda y, pos: f'{y:.0%}')
ax.yaxis.grid(linestyle='--', linewidth=0.8)
ax.set_title("Relative changes of the average price")

Related

pandas groupby count percent of positive category

I have dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'year': [np.random.randint(2015, 2020) for _ in range(12)],
    'sales': [np.random.randint(-100, 100) for _ in range(12)]
})
I want to count, for each state, what percent of the years observed has positive sales.
My current code is
df.groupby(['state', 'year']).agg({'sales': 'sum'})
The desired output would be a dataframe such as CA: 100%, WA: 50%, AZ: 67%, etc.
There are probably more elegant ways to do it, but I would first create a column with a 1 if the sale is positive and 0 if negative. Then I would aggregate on that new column using a mean aggregate function, and finally multiply by 100.
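Building on the per-state-and-year sum the asker already computes, one way to sketch this approach (using deterministic stand-in data, since the original uses np.random):

```python
import pandas as pd

# Deterministic stand-in for the random data in the question
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'year': [2015] * 4 + [2016] * 4 + [2017] * 4,
    'sales': [10, -5, 20, 30, 40, 15, -10, -20, 5, 25, 30, 40],
})

# Sum sales per (state, year), flag positive years (the 1/0 column),
# then take the mean of the flag per state and multiply by 100
yearly = df.groupby(['state', 'year'])['sales'].sum()
pct_positive = yearly.gt(0).groupby('state').mean() * 100
```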

Building Heatmap with two separate series having "Year" and "Month" information

I am working on a dataset
d = {'date_added_month': ['February', 'December', 'October', 'December',
                          'April', 'December', 'March', 'April'],
     'date_added_year': [2014, 2012, 2008, 2009, 2010, 2011, 2012, 2013],
     'title': ['apple', 'ball', 'cat', 'dog', 'elephant', 'fish', 'goat', 'horse'],
     'titles_count': [0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
I want to build a heatmap with years on X-axis and Months on Y-axis and count the number of titles on a particular month and year. How do I count the number of titles month and year wise?
I have counted the titles in both Month and Year basis, like this:
grp_by_yr = df.groupby("date_added_year").size()
grp_by_mn = df.groupby("date_added_month").size()
But I am not sure how to aggregate both this information.
Just fill the titles_count with 1 first, since each row denotes one count.
df['titles_count'] = 1
Then pivot the table like so -
heatmap1_data = pd.pivot_table(df, values='titles_count',
                               index=['date_added_month'],
                               columns='date_added_year')
Then plot using seaborn -
import seaborn as sns
sns.heatmap(heatmap1_data, cmap="YlGnBu")
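As a side note, the helper column can be skipped entirely: pd.crosstab counts rows per month/year pair directly. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'date_added_month': ['February', 'December', 'October', 'December',
                         'April', 'December', 'March', 'April'],
    'date_added_year': [2014, 2012, 2008, 2009, 2010, 2011, 2012, 2013],
    'title': ['apple', 'ball', 'cat', 'dog', 'elephant', 'fish', 'goat', 'horse'],
})

# crosstab tallies occurrences of each (month, year) combination,
# giving exactly the month-by-year matrix a heatmap needs
heatmap_data = pd.crosstab(df['date_added_month'], df['date_added_year'])
```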
Update
Update with grouping as requested
import pandas as pd

d = {'date_added_month': ['February', 'February', 'December', 'October', 'December',
                          'April', 'December', 'March', 'April'],
     'date_added_year': [2014, 2014, 2012, 2008, 2009, 2010, 2011, 2012, 2013],
     'title': ['apple', 'apple-new', 'ball', 'cat', 'dog', 'elephant', 'fish', 'goat', 'horse'],
     'titles_count': [0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
df['titles_count'] = 1

group_by_both = df.groupby(["date_added_year", "date_added_month"]).agg({'titles_count': 'sum'})
heatmap1_data = pd.pivot_table(group_by_both, values='titles_count',
                               index=['date_added_month'],
                               columns='date_added_year')
print(heatmap1_data)

import seaborn as sns
sns_plot = sns.heatmap(heatmap1_data, cmap="YlGnBu")
I also added one more data point to show that aggregation is working (2014 February).

How to convert Multi-Index into a Heatmap

New to Pandas/Python, I have managed to make an index like below;
MultiIndex([( 1,  1, 4324),
            ( 1,  2, 8000),
            ( 1,  3, 8545),
            ( 1,  4, 8544),
            ( 1,  5, 7542),
            ...
            (12, 30, 7854),
            (12, 31, 7511)],
           names=['month', 'day', 'count'], length=366)
I'm struggling to find out how I can store the first number (the 1-12 one) into a list, the second number (1-31 values) into another list, and the third number (scores 0-9000) into another separate list.
I am trying to build a heatmap with Month x Day on the axes, using count as the values, and failing horribly! I am assuming I have to separate Month, Day and Count into separate lists to make the heatmap?
data1 = pd.read_csv("a2data/Data1.csv")
data2 = pd.read_csv("a2data/Data2.csv")
merged_df = pd.concat([data1, data2])
merged_df.set_index(['month', 'day'], inplace=True)
merged_df.sort_index(inplace=True)
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().reset_index()
merged_df2.set_index(['month', 'day', 'count'], inplace=True)
# struggling here to separate out month, day and count in order to make a heatmap
Are you looking for:
# let's start here; note the bracket selection ['count'] -
# attribute access (.count) would resolve to the groupby method instead
merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean()
# use sns
import seaborn as sns
sns.heatmap(merged_df2.unstack('day'))
Output:
Or you can use plt:
import numpy as np
import matplotlib.pyplot as plt

merged_df2 = merged_df.groupby(['month', 'day'])['count'].mean().unstack('day')
plt.imshow(merged_df2)
plt.xticks(np.arange(merged_df2.shape[1]), merged_df2.columns)
plt.yticks(np.arange(merged_df2.shape[0]), merged_df2.index)
plt.show()
which gives:

How to group by multiple columns on a pandas series

The pandas.Series groupby method makes it possible to group by another series, for example:
import pandas as pd

data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])
grade.groupby(df['age']).mean()
However, this approach does not work for a groupby using two columns:
grade.groupby(df[['age','gender']])
ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
In the example, it is easy to add the column to the dataframe and get the desired result as follows:
df['grade'] = grade
y = df.groupby(['gender','age']).mean()
y.to_dict()
{'grade': {('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}}
But that can get quite ugly in real life situations. Is there any way to do this groupby on multiple columns directly on the series?
Since I don't know of any direct way to solve the problem, I've made a function that creates a temporary table and performs the groupby on it.
def pd_groupby(series, group_obj):
    df = pd.DataFrame(group_obj).copy()
    groupby_columns = list(df.columns)
    df[series.name] = series
    return df.groupby(groupby_columns)[series.name]
Here, group_obj can be a pandas Series or a Pandas DataFrame. Starting from the sample code, the desired result can be achieved by:
y = pd_groupby(grade,df[['gender','age']]).mean()
y.to_dict()
{('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}
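For what it's worth, Series.groupby also accepts a list of Series as the grouping keys, which sidesteps the temporary DataFrame entirely. A sketch using the question's data:

```python
import pandas as pd

data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])

# Passing a list of Series groups by each of them in turn,
# producing a MultiIndex result just like groupby on columns
y = grade.groupby([df['gender'], df['age']]).mean()
```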

How to implement where clause in python

I want to replicate what where clause does in SQL, using Python. Many times conditions in where clause can be complex and have multiple conditions. I am able to do it in the following way. But I think there should be a smarter way to achieve this. I have following data and code.
My requirement is: I want to select all columns only when first letter in the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'],
     'Age': [23, 32, 45, 42, 28],
     'YrsOfEducation': [10, 15, 8, 12, 10],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data=d)
df['col1'] = df['Address'].str[0:1]  # new column holding only the first letter of the address
n = df['col1'] == 'N'  # filtering criterion: the first letter equals 'N'
newdata = df[n]  # filtering the dataframe
newdata1 = newdata.drop('col1', axis=1)  # finally dropping the extra column 'col1'
So after 7 lines of code I am getting this output:
My question is how can I do it more efficiently or is there any smarter way to do that ?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
  Address  Age  YrsOfEducation  name
0      NY   23              10  john
1      NJ   32              15   tom
3      NY   42              12  rock
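str.startswith expresses the same condition and reads closer to the SQL intent (WHERE Address LIKE 'N%'). A quick sketch with the question's data:

```python
import pandas as pd

d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'],
     'Age': [23, 32, 45, 42, 28],
     'YrsOfEducation': [10, 15, 8, 12, 10],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df = pd.DataFrame(data=d)

# Boolean mask: True wherever the address starts with 'N'
newdata = df[df['Address'].str.startswith('N')]
```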