Plotting 2 columns as 2 lines and 1 column as x axis on Dataframes - pandas

I'm new to pandas and DataFrames. I would like to know how I could convert my current code to use plt.figure instead. I want to plot two columns (Tourism Receipts, Visitors) as lines, with another column (Quarters) as the x-axis.
The code below seems to work, but I would like to know whether there is a better way, such as plt.plot, that still lets me set Quarters as the x-axis and plot the other two columns as lines.
df1= df.set_index('Quarters').plot(figsize=(10,5), grid=True)
Dataframe (from my csv file):
| Quarters | Tourism Receipts | Visitors |
| 2019 Q1 | 10 | 1 |
| 2019 Q2 | 20 | 2 |
| 2019 Q3 | 30 | 3 |
| 2019 Q4 | 40 | 4 |
I understand the following method:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20,10))
plt.plot(x,y)
plt.title
plt.xlabel
plt.ylabel
Is there a way to convert the 'df.set_index' approach to plt instead?

You can actually combine both: use the pandas .plot method, which saves a lot of effort, and use matplotlib features alongside it to customize the output.
Here is sample code showing how to do this:
from matplotlib import pyplot as plt
import pandas as pd
fig, ax = plt.subplots(1, figsize=(10, 10))
df.set_index('Quarters')[['Tourism Receipts', 'Visitors']].plot(figsize=(10,5), grid=True, ax=ax)
ax.set_yticks(range(-10, 41, 5))
# ax.set_yticklabels( ('{}%'.format(x) for x in range(0, 101, 10)), fontsize=15)
ax.set_xticks(df.Quarters)
ax.set_xticklabels(["{} Q{}".format('2019', x) for x in df.Quarters])
ax.legend(loc='lower left')
You can customize the yticks in the same way.
PS: This assumes df.Quarters holds only the quarter number, without the year, so 2019 is hard-coded in the labels.
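If you would rather call matplotlib directly, a minimal sketch could look like the following. It rebuilds the four-row table from the question as a stand-in for the CSV (an assumption, as are the title and y-axis label), and passes the Quarters column as x to ax.plot, once per line.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the CSV in the question (values copied from the table above).
df = pd.DataFrame({
    "Quarters": ["2019 Q1", "2019 Q2", "2019 Q3", "2019 Q4"],
    "Tourism Receipts": [10, 20, 30, 40],
    "Visitors": [1, 2, 3, 4],
})

fig, ax = plt.subplots(figsize=(10, 5))

# One ax.plot call per column; the string quarters become categorical x positions.
for column in ["Tourism Receipts", "Visitors"]:
    ax.plot(df["Quarters"], df[column], label=column)

ax.set_title("Tourism receipts and visitors by quarter")  # hypothetical title
ax.set_xlabel("Quarters")
ax.set_ylabel("Value")  # hypothetical label
ax.grid(True)
ax.legend()
```

Call plt.show() or fig.savefig(...) afterwards to see the result. Compared with df.set_index('Quarters').plot(), this takes a few more lines but makes each step explicit.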

Related

using pandas dataframe to create matplotlib bar chart

This is an example of what my pandasDF dataframe looks like. It is a movie dataset, and each count of 1 represents one movie title.
| yearOfRelease |count|
| 1989 | 1 |
| 1990 | 1 |
| 1991 | 1 |
| 1992 | 1 |
| 1993 | 1 |
Previously my dataframe was a Spark dataframe, but I converted it to a pandas dataframe:
pandasDF = movies_DF.toPandas()
pandasDF.head()
This is the plot I have right now.
I am trying to achieve a plot that looks like this.
This is my code:
x = list(pandasDF[pandasDF['yearOfRelease'] >= '1990']['yearOfRelease'])
import matplotlib.pyplot as plt
plt.figure()
pd.value_counts(x).plot.bar(title="number of movies by year of release ")
plt.xlabel("yearOfRelease")
plt.ylabel("numOfMovies")
plt.show()
You have to sort your values by index (yearOfRelease):
import matplotlib.pyplot as plt
from matplotlib import ticker  # needed for MultipleLocator below
# Keep years from 1990 onward, then convert to int so they sort numerically
sr = pandasDF.loc[pandasDF['yearOfRelease'] >= '1990', 'yearOfRelease'].astype(int)
ax = (sr.value_counts().sort_index()
        .plot.bar(title='number of movies by year of release',
                  xlabel='yearOfRelease', ylabel='numOfMovies',
                  rot=-90))
ax.xaxis.set_major_locator(ticker.MultipleLocator(2))
ax.yaxis.set_major_locator(ticker.MultipleLocator(500))
plt.tight_layout()
plt.show()
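To see why the sort matters, here is a small sketch on made-up release years (hypothetical data, stored as strings as in the question). value_counts orders its result by frequency, so the bars come out in count order unless the index is sorted first.

```python
import pandas as pd

# Hypothetical release years stored as strings, as in the question.
years = pd.Series(["1992", "1990", "1991", "1990", "1989", "1992", "1990"],
                  name="yearOfRelease")

# Keep 1990 onward, then convert to int so the index sorts numerically.
recent = years[years >= "1990"].astype(int)

by_frequency = recent.value_counts()   # ordered by count, most frequent first
by_year = by_frequency.sort_index()    # ordered by year, ready for .plot.bar()

print(by_year)
```

Calling by_year.plot.bar() then draws the bars in chronological order.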

Matplotlib plot for loop issue

I'm trying to plot time series data in matplotlib using a for loop. The goal is to dynamically plot 'n' years' worth of daily closing price data. If I load 7 years of data, I should get 7 unique plots. I have created a summary of the start and end dates for a data set, yearly_date_ranges (date is the index), and I use it to populate the start and end dates. The code I've written so far produces 7 plots of all the daily data instead of 7 unique plots, one for each year. Any help is appreciated. Thanks in advance!
yearly_date_ranges
           start         end
Date
2014  2014-04-01  2014-12-31
2015  2015-01-01  2015-12-31
2016  2016-01-01  2016-12-31
2017  2017-01-01  2017-12-31
2018  2018-01-01  2018-12-31
2019  2019-01-01  2019-12-31
2020  2020-01-01  2020-05-28
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(12,20))
for i in range(len(yearly_date_ranges)):
    ax = fig.add_subplot(len(yearly_date_ranges),1,i + 1)
    for row in yearly_date_ranges.itertuples(index=False):
        start = row.start
        end = row.end
        subset = data[start:end]
        ax.plot(subset['Close'])
plt.show()
To do this dynamically, you could group the data by year and draw each group on its own axes:
fig, axes = plt.subplots(7,1, figsize=(12,20))
years = data.index.year
for ax, (k,d) in zip(axes.ravel(), data.groupby(years)):
    d.plot(y='Close', ax=ax)
This worked! Thank you for the help.
fig, axes = plt.subplots(7,1, figsize=(12,20))
years = data.index.year
for ax, (k,d) in zip(axes.ravel(), data['Close'].groupby(years)):
    d.plot(x='Close', ax=ax)
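The snippets above hard-code 7 subplots. A sketch that derives the count from the data could look like this; the synthetic price series is a hypothetical stand-in for `data`, which is assumed to have a DatetimeIndex and a 'Close' column.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily closing prices (hypothetical data standing in for `data`).
dates = pd.date_range("2018-04-01", "2020-05-28", freq="D")
rng = np.random.default_rng(0)
data = pd.DataFrame({"Close": 100 + rng.normal(size=len(dates)).cumsum()}, index=dates)

# One group per calendar year; the number of subplots follows from the data.
groups = data.groupby(data.index.year)
fig, axes = plt.subplots(groups.ngroups, 1,
                         figsize=(12, 4 * groups.ngroups), squeeze=False)

# Pair each axes with one (year, rows-for-that-year) group.
for ax, (year, yearly) in zip(axes.ravel(), groups):
    yearly.plot(y="Close", ax=ax, title=str(year), legend=False)

fig.tight_layout()
```

Passing squeeze=False keeps axes a 2-D array even when there is only one year, so axes.ravel() works for any number of groups.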

Best way of showing dates in a Bar plot with Pandas

I create a bar plot like this:
But since each x-axis label is one day of January (for example 1, 3, 4, 5, 7, 8, etc.), I think the best way of showing this is something like:
__________________________________________________ x axis
Jan 1 3 4 5 7 8 ...
2019
But I don't know how to do this with pandas.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('solved.xlsx', sheet_name="Jan")
fig, ax = plt.subplots()
df_plot = df.plot(x='Date', y=['Solved', 'POT'], ax=ax, kind='bar',
                  width=0.9, color=('#ffc114','#0098c9'), label=('Solved','POT'))
def line_format(label):
    """
    Convert time label to the format of pandas line plot
    """
    month = label.month_name()[:3]
    if month == 'Jan':
        month += f'\n{label.year}'
    return month
ax.set_xticklabels(map(lambda x: line_format(x), df.Date))
The function was a solution provided here: Pandas bar plot changes date format
I don't know how to modify it to get the axis I want.
My data example solved.xlsx:
Date Q A B Solved POT
2019-01-04 Q4 11 9 14 5
2019-01-05 Q4 9 11 14 5
2019-01-08 Q4 11 18 10 6
2019-01-09 Q4 18 19 18 5
I have found a solution:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('solved.xlsx', sheet_name="Jan")
fig, ax = plt.subplots()
df_plot = df.plot(x='Date', y=['Solved', 'POT'], ax=ax, kind='bar',
                  width=0.9, color=('#ffc114','#0098c9'), label=('Solved','POT'))
def line_format(label):
    """
    Label each bar with its day; add the month name under day 2
    """
    day = label.day
    if day == 2:
        day = str(day) + f'\n{label.month_name()[:3]}'
    return day
ax.set_xticklabels(map(lambda x: line_format(x), df.Date))
plt.show()
In my particular case I didn't have the date 2019-01-01, so the first day for me was Jan 2.
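The solution above hard-codes day 2 as the place to print the month. A possible generalization, sketched here with a hypothetical helper named day_labels, attaches the month and year to whichever date comes first in each month, so it does not depend on which days are present.

```python
import pandas as pd

def day_labels(dates):
    """Return day-of-month labels; the first date of each month also gets month and year."""
    labels = []
    previous = None  # (year, month) of the previous date
    for date in dates:
        label = str(date.day)
        if (date.year, date.month) != previous:
            label += f"\n{date.month_name()[:3]}\n{date.year}"
        previous = (date.year, date.month)
        labels.append(label)
    return labels

# Dates taken from the sample rows of solved.xlsx in the question.
dates = pd.to_datetime(["2019-01-04", "2019-01-05", "2019-01-08", "2019-01-09"])
labels = day_labels(dates)
print(labels)
```

The resulting list can be passed to ax.set_xticklabels(labels) in place of the map(...) call.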

Plotting a line that's first value is not at x = 0

I have several ndarrays that I would like to plot on the same graph. Each is the same size. The first contains y data for x=0, x=1, x=2, x=3. The second contains y data for x=1,x=2,x=3,x=4. And the third contains y data for x=2,x=3,x=4,x=5.
Is there any way I can get pyplot to shift each line over, so that they all appear starting at their appropriate x values?
(I know I could just prepend one or two dummy values to the lines, but I don't want that, I want the line to start in the right place)
UPDATE
To explain better, here is the data:
x |y1 |y2 |y3
-------------
0 | 0 | - | -
1 | 1 | 0 | -
2 | 2 | 1 | 0
3 | 3 | 2 | 1
4 | - | 3 | 2
5 | - | - | 3
That is, line y1 is defined starting at x=0, line y2 is defined starting at x=1, and line y3 is defined starting at x=2. Likewise, y2 is defined for x=4 whereas y1 isn't. (You can think of y2 as a translation to the right of y1).
Using plot, all lines start at the same x coordinate, which I don't want.
Typically you just supply a range of x values for each y range to plot:
import matplotlib.pyplot as plt
import numpy as np
x1 = np.arange(4)
x2 = x1+1
x3 = x2+1
for x in (x1,x2,x3):
    plt.plot(x,np.random.randint(10,size=4))
plt.show()
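Applied to the table in the question, where every line holds the same y values 0 to 3 but starts one step further right, the same idea can be sketched as follows. The labels y1, y2, y3 follow the table; the rest is a minimal illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.arange(4)  # the y values 0..3 shared by every line in the table

fig, ax = plt.subplots()
for offset in (0, 1, 2):
    # Shift the x values so this line starts at x = offset.
    ax.plot(np.arange(len(y)) + offset, y, label=f"y{offset + 1}")
ax.legend()
```

No padding values are needed: each line simply covers its own x range.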

Using Pandas groupby to calculate many slopes

Some illustrative data in a DataFrame (MultiIndex) format:
|entity| year |value|
+------+------+-----+
| a | 1999 | 2 |
| | 2004 | 5 |
| b | 2003 | 3 |
| | 2007 | 2 |
| | 2014 | 7 |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, not sure how to approach it in pandas.
A function can be applied to each group of a groupby with the apply method. The function passed in this case is linregress. See below:
In [4]: x = pd.DataFrame({'entity':['a','a','b','b','b'],
                          'year':[1999,2004,2003,2007,2014],
                          'value':[2,5,3,2,7]})
In [5]: x
Out[5]:
  entity  value  year
0      a      2  1999
1      a      5  2004
2      b      3  2003
3      b      2  2007
4      b      7  2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
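The question's data is indexed by (entity, year) rather than stored in flat columns. A sketch that works on that MultiIndex directly, without resetting it, might group by index level and read the years from the index; the helper name fit is hypothetical.

```python
import pandas as pd
from scipy.stats import linregress

# The illustrative data from the question, indexed by (entity, year).
df = pd.DataFrame(
    {"value": [2, 5, 3, 2, 7]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1999), ("a", 2004), ("b", 2003), ("b", 2007), ("b", 2014)],
        names=["entity", "year"],
    ),
)

def fit(group):
    """Fit value against year for one entity; return slope and intercept."""
    result = linregress(group.index.get_level_values("year"), group["value"])
    return pd.Series({"slope": result.slope, "intercept": result.intercept})

# Group on the 'entity' index level; each group keeps its (entity, year) index.
fits = df.groupby(level="entity").apply(fit)
print(fits)
```

Returning a Series from the applied function gives one row per entity with a column per statistic, which is convenient if more than the slope is needed.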
You can also do this by iterating over the groupby object. It seems easier to drop the current index first and then group by 'entity'.
A list comprehension is then an easy way to work through all the groups. Alternatively, use a dict comprehension to keep each label next to its slope (you can then pass the dict to pd.DataFrame easily).
import pandas as pd
import scipy.stats
#This is your data
test = pd.DataFrame({'entity':['a','a','b','b','b'],'year':[1999,2004,2003,2007,2014],'value':[2,5,3,2,7]}).set_index(['entity','year'])
#This creates the groups (a plain string key keeps the group names as 'a' and 'b' rather than 1-tuples)
groupby = test.reset_index().groupby('entity')
#Process groups by list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in groupby]
#Process groups by dict comprehension
slopes = {name:[scipy.stats.linregress(group.year, group.value)[0]] for name, group in groupby}