Create a bar plot in matplotlib when given bin edges and heights - matplotlib

I have the following ranges of bins and their desired heights
Range | Height
-------------------------
0.0-0.0905 | 0.02601
0.0905-0.1811| 0.13678
0.1811-0.2716| 0.22647
0.2716-0.3621| 0.31481
0.3621-0.4527| 0.40681
0.4527-0.5432| 0.50200
0.5432-0.6337| 0.58746
0.6337-0.7243| 0.68153
0.7243-0.8148| 0.76208
0.8148-0.9053| 0.86030
0.9053-0.9958| 0.95027
0.9958-1 | 0.99584
The desired outcome is a histogram/bar plot with the edges according to Range and the heights according to Height.

You can split your Range and explode to get the edges of the bins:
import pandas as pd
from io import StringIO
data = StringIO("""Range | Height
-------------------------
0.0-0.0905 | 0.02601
0.0905-0.1811| 0.13678
0.1811-0.2716| 0.22647
0.2716-0.3621| 0.31481
0.3621-0.4527| 0.40681
0.4527-0.5432| 0.50200
0.5432-0.6337| 0.58746
0.6337-0.7243| 0.68153
0.7243-0.8148| 0.76208
0.8148-0.9053| 0.86030
0.9053-0.9958| 0.95027
0.9958-1 | 0.99584""")
df = pd.read_csv(data, sep=r"\s*\|\s*", engine="python", skiprows=[1])
df['Range'] = df['Range'].str.split('-')
df = df.explode('Range').drop_duplicates('Range').astype(float)
This will give you:
Range Height
0 0.0000 0.02601
0 0.0905 0.02601
1 0.1811 0.13678
2 0.2716 0.22647
3 0.3621 0.31481
4 0.4527 0.40681
5 0.5432 0.50200
6 0.6337 0.58746
7 0.7243 0.68153
8 0.8148 0.76208
9 0.9053 0.86030
10 0.9958 0.95027
11 1.0000 0.99584
Then use plt.stairs, which takes N heights for N+1 edges (iloc[1:] drops the duplicated first height):
import matplotlib.pyplot as plt
plt.stairs(df['Height'].iloc[1:], edges=df['Range'].values, fill=True)
plt.show()
Output:

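If your Matplotlib predates plt.stairs (added in 3.4), the same picture can be sketched with plt.bar, using the left edges and per-bin widths derived from the edges extracted above (values hard-coded here from the question's table):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for headless use
import matplotlib.pyplot as plt
import numpy as np

# Bin edges (N+1 values) and bar heights (N values), from the table above
edges = np.array([0.0, 0.0905, 0.1811, 0.2716, 0.3621, 0.4527, 0.5432,
                  0.6337, 0.7243, 0.8148, 0.9053, 0.9958, 1.0])
heights = np.array([0.02601, 0.13678, 0.22647, 0.31481, 0.40681, 0.50200,
                    0.58746, 0.68153, 0.76208, 0.86030, 0.95027, 0.99584])

widths = np.diff(edges)  # per-bin widths from consecutive edges
# align="edge" anchors each bar at its left bin edge instead of its center
bars = plt.bar(edges[:-1], heights, width=widths, align="edge")
```

With align="edge" the bars tile the x-axis exactly like the stairs plot, including the narrow final bin.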

How to replace individual dates with month labels in Python?

Above is a plot of the dataframe below, with the date on the x-axis and High on the y-axis.
What I want is for dates between 06-09-21 and 31-09-21 the axis to show Sep 21 instead, and likewise for the remaining months, because right now the x-axis is not readable. I don't even know where to start.
Below is the code that I used to plot the graph:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Stock.csv")
x1=df['High'].values.tolist()
r=df['Date'].values.tolist()
plt.plot(r, x1,color="green", label = "High")
You can use pandas.to_datetime with pandas.Series.dt.strftime. Try this:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b-%d")
Output, shown as a new column to illustrate the change:
print(df)
Date Date(new)
0 08-09-21 Sep-08
1 09-09-21 Sep-09
2 13-09-21 Sep-13
3 30-08-21 Aug-30
4 01-09-22 Sep-01
5 02-09-22 Sep-02
6 05-09-22 Sep-05
7 06-09-22 Sep-06
Edit:
To format the tick labels directly on the axis, use matplotlib.dates.DateFormatter:
from matplotlib.dates import DateFormatter
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
f, ax = plt.subplots(figsize=[8, 4])
ax.plot(df["Date"], df["High"])
ax.xaxis.set_major_formatter(DateFormatter("%b-%Y"))
Used input:
Date Open High Low Close AdjClose Volume
0 06-09-21 1579.949951 1580.949951 1561.949951 1565.699951 1547.704712 3938448
1 07-09-21 1562.500000 1582.000000 1555.199951 1569.250000 1551.213989 3622748
2 08-09-21 1571.949951 1580.500000 1565.599976 1576.400024 1558.281860 3362040
3 09-09-21 1574.000000 1579.449951 1561.000000 1568.599976 1550.571411 4125474
4 13-09-21 1562.000000 1584.000000 1553.650024 1555.550049 1537.671509 4479582
5 30-08-22 1446.449951 1489.949951 1443.099976 1486.099976 1486.099976 5067700
6 01-09-22 1464.750000 1489.449951 1459.000000 1472.150024 1472.150024 11201568
7 02-09-22 1472.150024 1490.500000 1465.199951 1485.500000 1485.500000 6019043
8 05-09-22 1486.099976 1499.000000 1484.099976 1495.050049 1495.050049 6065966
9 06-09-22 1498.900024 1506.650024 1487.099976 1502.000000 1502.000000 4066957
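As a self-contained sketch of the string conversion (with a few made-up day-first dates, not the asker's file):

```python
import pandas as pd

# Day-first date strings, in the same format as the question
df = pd.DataFrame({"Date": ["06-09-21", "13-09-21", "30-08-21"]})

# Parse with dayfirst=True, then render as abbreviated month + day
df["Date(new)"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b-%d")
print(df["Date(new)"].tolist())  # ['Sep-06', 'Sep-13', 'Aug-30']
```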

Convert string to column

I have a simple data frame, and I am developing a sentiment analysis.
This is the code and the reproducible example
import pandas as pd
import transformers
from pysentimiento import SentimentAnalyzer
from pysentimiento import EmotionAnalyzer
analyzer = SentimentAnalyzer(lang="en")
emotion_analyzer = EmotionAnalyzer(lang="en")
data = [['Hello world'], ['I am the best'], ['Nice jacket!']]
df2 = pd.DataFrame(data, columns = ['Tweet'])
# print dataframe.
df2["sentiment"] = df2.apply(lambda row : analyzer.predict(row["Tweet"]), axis = 1)
The output of the code above:
Tweet | sentiment
--------------|------------------------------------------------------------------------
Hello world | SentimentOutput(output=POS, probas={POS: 0.999, NEG: 0.001, NEU: 0.000})
I am the best | SentimentOutput(output=POS, probas={POS: 0.999, NEG: 0.001, NEU: 0.000})
Nice jacket! | SentimentOutput(output=POS, probas={POS: 0.999, NEG: 0.001, NEU: 0.000})
I would like to split the sentiment column and have something like this:
Tweet sentiment prob_Pos Prob_Neg Prob_Neu
---------------------|---------------|----------|------------------------------
Hello world | POS | 0.99 | 0.001 | 0.000
I am the best | POS | 0.99 | 0.001 | 0.000
Nice jacket! | POS | 0.99 | 0.001 | 0.000
The results must be converted into a pd.Series and then joined back to the DataFrame. This is easiest to do with a helper function, since the SentimentOutput results cannot be unpacked directly:
analyzer = SentimentAnalyzer(lang="en")
def process(row):
    res = analyzer.predict(row["Tweet"])
    # one row per tweet: the predicted label plus one column per class probability
    return pd.Series({'sentiment': res.output, **res.probas})
df2 = df2.join(df2.apply(process, axis=1))
df2:
Tweet sentiment NEG NEU POS
0 Hello world NEU 0.000446 0.548691 0.450863
1 I am the best POS 0.000660 0.001529 0.997811
2 Nice jacket! POS 0.000224 0.051520 0.948256
This can also be done in a way that the analyzer can be passed as a parameter:
def process_with(predictor):
    def process_(row):
        res = predictor.predict(row["Tweet"])
        return pd.Series({'sentiment': res.output, **res.probas})
    return process_
analyzer = SentimentAnalyzer(lang="en")
df2 = df2.join(df2.apply(process_with(analyzer), axis=1))
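The join pattern itself can be exercised without pysentimiento; a minimal sketch with a stand-in predictor (DummyAnalyzer is invented here purely to mimic the predict/output/probas interface):

```python
import pandas as pd
from types import SimpleNamespace

class DummyAnalyzer:
    """Stand-in for SentimentAnalyzer: always predicts POS."""
    def predict(self, text):
        return SimpleNamespace(output="POS",
                               probas={"POS": 0.99, "NEG": 0.005, "NEU": 0.005})

analyzer = DummyAnalyzer()

def process(row):
    res = analyzer.predict(row["Tweet"])
    # one Series per row: label first, then the per-class probabilities
    return pd.Series({"sentiment": res.output, **res.probas})

df2 = pd.DataFrame({"Tweet": ["Hello world", "Nice jacket!"]})
df2 = df2.join(df2.apply(process, axis=1))
print(df2.columns.tolist())  # ['Tweet', 'sentiment', 'POS', 'NEG', 'NEU']
```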

Convert a dict to a DataFrame in pandas

I am using the following code:
import pandas as pd
from yahoofinancials import YahooFinancials
mutual_funds = ['PRLAX', 'QASGX', 'HISFX']
yahoo_financials_mutualfunds = YahooFinancials(mutual_funds)
daily_mutualfund_prices = yahoo_financials_mutualfunds.get_historical_price_data('2015-01-01', '2021-01-30', 'daily')
I get a dictionary as the output. I would like a pandas DataFrame with the columns date, PRLAX, QASGX, HISFX, where date is the formatted_date and each ticker column holds its Open price.
What you can do is this:
df = pd.DataFrame({
    a: {x['formatted_date']: x['adjclose'] for x in daily_mutualfund_prices[a]['prices']}
    for a in mutual_funds
})
which gives:
PRLAX QASGX HISFX
2015-01-02 19.694817 17.877445 11.852874
2015-01-05 19.203604 17.606575 11.665626
2015-01-06 19.444574 17.316357 11.450289
2015-01-07 19.963596 17.616247 11.525190
2015-01-08 20.260176 18.003208 11.665626
... ... ... ...
2021-01-25 21.799999 33.700001 14.350000
2021-01-26 22.000000 33.139999 14.090000
2021-01-27 21.620001 32.000000 13.590000
2021-01-28 22.120001 32.360001 13.990000
2021-01-29 21.379999 31.709999 13.590000
[1530 rows x 3 columns]
The same pattern works for any of the other price fields in the dict.
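The reshaping itself can be illustrated on a small hand-made dict with the same shape as the YahooFinancials output (tickers, dates, and prices invented here):

```python
import pandas as pd

# Minimal stand-in for get_historical_price_data's return value
daily_prices = {
    "PRLAX": {"prices": [{"formatted_date": "2015-01-02", "adjclose": 19.69},
                         {"formatted_date": "2015-01-05", "adjclose": 19.20}]},
    "QASGX": {"prices": [{"formatted_date": "2015-01-02", "adjclose": 17.88},
                         {"formatted_date": "2015-01-05", "adjclose": 17.61}]},
}
funds = ["PRLAX", "QASGX"]

# One column per fund, indexed by date: the outer comprehension builds a
# {ticker: {date: price}} mapping that pandas turns into a 2-D frame
df = pd.DataFrame({
    a: {x["formatted_date"]: x["adjclose"] for x in daily_prices[a]["prices"]}
    for a in funds
})
print(df)
```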

calculate distance matrix with mixed categorical and numerics

I have a data frame with a mixture of numeric (15 fields) and categorical (5 fields) data.
I can create a complete distance matrix of the numeric fields following the linked question "create distance matrix using own calculation pandas".
I want to include the categorical fields as well.
Using as template:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df2 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': ['cat', 'cat', 'dog', 'bird']})
df2
# w is the weight vector from the linked question
pd.DataFrame(squareform(pdist(df2.values, lambda u, v: np.sqrt((w*(u-v)**2).sum()))), index=df2.index, columns=df2.index)
in the squareform calculation, I would like to include the test np.where(u[2]==v[2], 0, 10) (as well as with the other categorical columns)
How do I modify the lambda function to carry out this test as well?
Here, the distance between rows 0 and 1 is
sqrt((2-1)^2 + (6-5)^2 + (cat - cat)^2) = sqrt(1 + 1 + 0)
and the distance between rows 0 and 2 is
sqrt((3-1)^2 + (7-5)^2 + (dog - cat)^2) = sqrt(4 + 4 + 100)
etc.
Can anyone suggest how I can implement this algorithm?
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform
df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[5,6,7,8],'col3':['cat','cat','dog','bird']})
def fun(u, v):
    # 0 if the categorical values match, 10 otherwise
    const = 0 if u[2] == v[2] else 10
    return np.sqrt((u[0]-v[0])**2 + (u[1]-v[1])**2 + const**2)

pd.DataFrame(squareform(pdist(df2.values, fun)), index=df2.index, columns=df2.index)
Result:
0 1 2 3
0 0.000000 1.414214 10.392305 10.862780
1 1.414214 0.000000 10.099505 10.392305
2 10.392305 10.099505 0.000000 10.099505
3 10.862780 10.392305 10.099505 0.000000
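The same metric can also be checked without scipy; a sketch that generalizes to several categorical columns, keeping the question's 10-unit mismatch penalty per categorical column (the column-index lists and PENALTY constant are choices made here for illustration):

```python
import numpy as np

NUMERIC = [0, 1]      # indices of numeric columns
CATEGORICAL = [2]     # indices of categorical columns
PENALTY = 10          # mismatch cost per categorical column

def mixed_dist(u, v):
    # squared differences for the numeric columns
    num = sum((u[i] - v[i]) ** 2 for i in NUMERIC)
    # fixed squared penalty for each categorical mismatch
    cat = sum(PENALTY ** 2 for i in CATEGORICAL if u[i] != v[i])
    return np.sqrt(num + cat)

rows = [(1, 5, 'cat'), (2, 6, 'cat'), (3, 7, 'dog'), (4, 8, 'bird')]
D = np.array([[mixed_dist(u, v) for v in rows] for u in rows])
print(np.round(D, 6))
```

The resulting matrix matches the squareform/pdist result above.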

Pandas accumulate data for linear regression

I am trying to adjust my data so that total_gross is accumulated per day, e.g.:
`Created` `total_gross` `total_gross_accumulated`
Day 1 100 100
Day 2 100 200
Day 3 100 300
Day 4 100 400
Any idea how I have to change my code to make total_gross_accumulated available?
Here is my data.
my code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)
event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
A list comprehension is one way to do this, though pandas has a built-in cumulative sum that is simpler and faster.
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# sum the rows from the start up to (and including) each row i
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1, n + 1)]
# add the new column to the original dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Or, faster, with the built-in cumulative sum:
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
LONG answer:
Full code using your data:
import pandas as pd
import matplotlib.pyplot as plt

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()
n = event_data.shape[0]
# sum the rows from the start up to (and including) each row i
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1, n + 1)]
# add the new column to the original dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Results:
event_data.head(6)
# total_gross total_gross_accumulated
#created
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
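A quick self-contained check of the cumsum approach, with made-up daily totals standing in for sample-data.csv (values taken from the question's example table):

```python
import pandas as pd

# Four days of constant gross, as in the question's example
event_data = pd.DataFrame(
    {"total_gross": [100, 100, 100, 100]},
    index=pd.date_range("2019-03-01", periods=4, freq="D"),
)
# running total of total_gross over the date index
event_data["total_gross_accumulated"] = event_data["total_gross"].cumsum()
print(event_data["total_gross_accumulated"].tolist())  # [100, 200, 300, 400]
```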