Pyplot limit x axis based on number of elements - facebook-prophet

I'm plotting a facebook-prophet forecast and I only want to plot the last month's data. This 'last month' changes from month to month, so I can't use plt.xlim() with a fixed range.
Is there any way to limit the x-axis by a given number of elements, like plot last 1000 x-axis values no matter what those values are?

I'd say write a function that represents your understanding of the appropriate x limits. For example:
def get_x_limits(data):
    lower_x = min(data)
    # or
    lower_x = min(data) * 0.9  # 10% below the lowest value (if values are positive)
    # or
    lower_x = min(data) - 1  # a little below the lowest value

    # similarly
    upper_x = max(data)
    # or
    upper_x = max(data) * 1.1  # 10% above the highest value (if values are positive)
    # or
    upper_x = max(data) + 1  # a little above the highest value

    # maybe something like this
    if upper_x - lower_x < 1000:
        upper_x = lower_x + 1000

    # and finally
    return lower_x, upper_x
You can then use these values to set your limits:
lower_x, upper_x = get_x_limits(data)
plt.xlim(lower_x,upper_x)
I have to admit that I ignored the question about the number of elements, since I think what matters is mostly what's in your data, not how much data you have. However, you can still work len(data) into the get_x_limits function in whatever way fits your need.
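If you really do want to cut by element count rather than by value, here is a minimal sketch (assuming the x values can be sorted, e.g. the `ds` date column of a Prophet forecast frame — adapt the names to your data):

```python
def last_n_limits(x_values, n=1000):
    """Return (lower, upper) limits spanning only the last n x values,
    whatever those values are."""
    tail = sorted(x_values)[-n:]  # the n largest x values
    return tail[0], tail[-1]

# usage sketch with matplotlib:
# lower_x, upper_x = last_n_limits(forecast['ds'], n=1000)
# plt.xlim(lower_x, upper_x)
```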

Related

Calculate and return the average of positive, negative, and neutral

I have the following dataframe:
I am trying to add three columns that count the instances of 0, -1, and 1 (neutral, negative, and positive, per se). After that, I want to calculate the average sentiment of each user's posts. Any help with appending these averages would be great.
So far I tried the solution below:
def mean_positive(L):
    # Get all positive numbers into another list
    pos_only = [x for x in L if x > 0]
    if pos_only:
        return sum(pos_only) / len(pos_only)
    raise ValueError('No positive numbers in input')
Thank you.
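Since the screenshot of the dataframe is not available, here is a sketch using hypothetical column names ('user' and 'sentiment' coded as -1/0/1) — adapt them to your actual columns. It counts the instances of each sentiment per user and appends the average:

```python
import pandas as pd

# Hypothetical example frame; the real column names come from your data.
df = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b', 'b'],
    'sentiment': [1, -1, 0, 1, 1],
})

# One column per sentiment value, counting occurrences per user.
counts = pd.crosstab(df['user'], df['sentiment'])
counts.columns = ['negative', 'neutral', 'positive']  # -1, 0, 1 in sorted order

# Average sentiment of each user's posts.
counts['avg_sentiment'] = df.groupby('user')['sentiment'].mean()
print(counts)
```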

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have plotted a histogram using some generated data with a very large number of points, stored in the variable vals. I have limited the histogram so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma}$ (GeV)")  # raw string avoids invalid escape sequences
plt.ylabel("Number of entries")
plt.show()
Giving the above plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you either need to pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial=0.0)
or discard the first value of the background_data:
plt.plot(background_data[1:], data_area)
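On the remaining question of estimating A: one way to read the advice is that the integral of A·e^(−x/λ) over the fitted range should equal the area of the histogram, which is (number of entries) × (bin width). A sketch under that reading, reusing the names N_background, lamb and bin_edges from the question:

```python
import numpy as np

def scale_exponential(n_entries, bin_width, lamb, x_lo, x_hi):
    """Return A such that the area under A*exp(-x/lamb) on [x_lo, x_hi]
    equals the histogram area n_entries * bin_width."""
    # Analytic integral of exp(-x/lamb) over [x_lo, x_hi]:
    curve_area = lamb * (np.exp(-x_lo / lamb) - np.exp(-x_hi / lamb))
    return n_entries * bin_width / curve_area

# usage sketch with the names from the question:
# bin_width = bin_edges[1] - bin_edges[0]
# A = scale_exponential(N_background, bin_width, lamb, 104, 120)
# xs = np.linspace(104, 155, 200)
# plt.plot(xs, A * np.exp(-xs / lamb))
```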

Determining optimal bins to bin the data

I have X, Y data which I would like to bin according to the X values.
However, I would like to determine the optimal number of X bins that satisfies a condition based on the resulting bin intervals and the average Y of each bin. For example, if I have
X=[2,3,4,5,6,7,8,9,10]
Y=[120,140,143,124,150,140,180,190,200]
I would like to determine the best number of X bins that will satisfy this condition: the average Y of a bin divided by (8 × the width of the X bin) should be above 20, but as close as possible to 20. The bin counts should also be integers, e.g., [1, 2, ...].
I am currently using:
bin_means, bin_edges, binnumber = binned_statistic(X, Y, statistic='mean', bins=bins)
with bins being pre-defined. However, I would like an algorithm that can determine the optimal bins for me before using this.
One can easily determine it for a small data set, but for hundreds of points it becomes time-consuming.
Thank you
If you NEED to iterate to find the optimal nbins with your minimization function, take a look at numpy.digitize:
https://numpy.org/doc/stable/reference/generated/numpy.digitize.html
And try:
import numpy as np
import pandas as pd

start = min(X)
stop = max(X)
cut_dict = {
    n: np.digitize(X, bins=np.linspace(start, stop, num=n + 1))
    for n in range(min_nbins, max_nbins)}  # supply min_nbins/max_nbins yourself

Y = pd.Series(Y).rename('Y')
avg = {nbins: Y.groupby(cut).mean().mean() for nbins, cut in cut_dict.items()}
avg = pd.Series(avg, name='mean_ybins').to_frame()
Then you can check which result is closest to 20, or whether 20 is the right number at all...
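If you want the search fully automated, here is one sketch of the brute-force loop. Note that "above 20 but as close as possible to 20" is read here as: every bin's mean(Y)/(8·bin width) must be at least 20, and the smallest such ratio should be nearest 20 — adjust the score if you mean something else:

```python
import numpy as np

def best_nbins(X, Y, lo=2, hi=8, target=20):
    """Try each integer bin count in [lo, hi] and keep the one where every
    per-bin ratio mean(Y_bin)/(8*bin_width) is >= target and the smallest
    ratio is closest to target. (One reading of the stated condition.)"""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    best, best_score = None, np.inf
    for n in range(lo, hi + 1):
        edges = np.linspace(X.min(), X.max(), n + 1)
        width = edges[1] - edges[0]
        idx = np.clip(np.digitize(X, edges) - 1, 0, n - 1)
        means = np.array([Y[idx == i].mean()
                          for i in range(n) if np.any(idx == i)])
        ratios = means / (8 * width)
        if np.all(ratios >= target):
            score = ratios.min() - target  # distance above the target
            if score < best_score:
                best, best_score = n, score
    return best

# with the data from the question:
X = [2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [120, 140, 143, 124, 150, 140, 180, 190, 200]
print(best_nbins(X, Y, lo=2, hi=15))  # → 11
```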

What is the mathematics behind the "smoothing" parameter in TensorBoard's scalar graphs?

I presume it is some kind of moving average, but the valid range is between 0 and 1.
It is called an exponential moving average; below is a code explanation of how it is computed.
Assuming all the real scalar values are in a list called scalars, the smoothing is applied as follows:
from typing import List

def smooth(scalars: List[float], weight: float) -> List[float]:  # weight between 0 and 1
    last = scalars[0]  # First value in the plot (first timestep)
    smoothed = list()
    for point in scalars:
        smoothed_val = last * weight + (1 - weight) * point  # Calculate smoothed value
        smoothed.append(smoothed_val)  # Save it
        last = smoothed_val  # Anchor the last smoothed value
    return smoothed
Here is the actual piece of source code that performs that exponential smoothing, with some additional de-biasing (explained in the comments) to compensate for the choice of a zero initial value:
last = last * smoothingWeight + (1 - smoothingWeight) * nextVal
Source: https://github.com/tensorflow/tensorboard/blob/34877f15153e1a2087316b9952c931807a122aa7/tensorboard/components/vz_line_chart2/line-chart.ts#L714
The implementation of EMA smoothing used for TensorBoard can be found here.
The equivalent in Python is actually:
import math

def smooth(scalars: list[float], weight: float) -> list[float]:
    """
    EMA implementation according to
    https://github.com/tensorflow/tensorboard/blob/34877f15153e1a2087316b9952c931807a122aa7/tensorboard/components/vz_line_chart2/line-chart.ts#L699
    """
    last = 0
    smoothed = []
    num_acc = 0
    for next_val in scalars:
        last = last * weight + (1 - weight) * next_val
        num_acc += 1
        # de-bias
        debias_weight = 1
        if weight != 1:
            debias_weight = 1 - math.pow(weight, num_acc)
        smoothed_val = last / debias_weight
        smoothed.append(smoothed_val)
    return smoothed
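A quick sanity check of the de-biased version (restated compactly so the snippet runs standalone):

```python
import math

def smooth(scalars, weight):
    """De-biased EMA, matching the TensorBoard logic above."""
    last, smoothed, num_acc = 0.0, [], 0
    for next_val in scalars:
        last = last * weight + (1 - weight) * next_val
        num_acc += 1
        debias_weight = 1 - math.pow(weight, num_acc) if weight != 1 else 1
        smoothed.append(last / debias_weight)
    return smoothed

print(smooth([1.0, 2.0, 3.0], 0.0))  # → [1.0, 2.0, 3.0] (weight 0: no smoothing)
print(smooth([5.0, 5.0, 5.0], 0.9))  # each value ≈ 5.0 (zero-start bias removed)
```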

find ranges to create Uniform histogram

I need to find ranges in order to create a uniform histogram, i.e. split ages into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
Is there a function that gives me the ranges so the histogram is uniform?
In this case:
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with complex data sets as well.
Thanks
You are looking for the quantiles that split your data up equally. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts in the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
With some minor renaming necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
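Since most of this thread is Python, the same quantile idea can be sketched with pandas.qcut, which cuts into (near-)equal-count bins:

```python
import pandas as pd

data_set = [18, 21, 22, 24, 27, 27, 28, 29, 30, 32, 33, 33, 42, 42, 45, 46]
groups = pd.qcut(data_set, q=4)  # quantile-based cut into 4 bins

print(groups.categories)                           # the four interval ranges
print(pd.Series(groups).value_counts(sort=False))  # 4 points per bin here
```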