Extract ggplot smoothing function and save in dataframe - ggplot2

I am trying to extract the smoothing function from a ggplot and save it as a data frame (hourly data points); the plot is shown here.
What I have tried:
I have already tried different interpolation techniques, but the results are not satisfying.
Linear interpolation produces a zig-zag pattern.
Na_spline produces a weird curved pattern.
The real data behaves more like the geom_smooth() of ggplot. I have tried to reproduce it with the following calls:
loess.data <- stats::loess(Hallwil2018_2019$Avgstemp ~ as.numeric(Hallwil2018_2019$datetime), span = 0.5)
loess.predict <- predict(loess.data, se = TRUE)
But this creates a list that is missing the NA values and is much shorter.

You can pass a newdata argument to predict() to get it to predict a value for every time period you give it. For example (from randomly generated data):
df <- data.frame(date = sample(seq(as.Date('2021/01/01'),
                                   as.Date('2022/01/01'),
                                   by = "day"), 40),
                 var = rnorm(40, 100, 10))
mod <- loess(df$var ~ as.numeric(df$date), span = 0.5)
# the model was fit on numeric dates, so convert newdata the same way
predict(mod, newdata = as.numeric(seq(as.Date('2021/01/01'), as.Date('2022/01/01'), by = "day")))

Related

Getting "ValueError: data type <class 'numpy.object_'> not inexact" error while trying to linear fit a dataset using uncertainities

I am very new to Python, so I am struggling to do what I want; I figured I could ask.
I have an Excel sheet with data columns like period, pdot, and flux values, along with their associated error columns. I want to plot these in Python and do a linear fit that takes the errors into account, then obtain values like the standard deviation or p-value to judge the goodness of the fit. Using this fit, I will then try to predict values based on a missing parameter. I managed to do it without the errors, but now I am trying to do it while propagating my errors, and that is causing problems.
My working code, which does not take errors into account, looks like this:
import math
import numpy as np
import matplotlib.pyplot as plt

dist_array1 = np.multiply(3.08567758128 * 10**21, dist_array)  # 3.0857e21 cm = 1 kpc
dist_array2 = np.multiply(dist_array1, dist_array1)
e1 = np.multiply(4 * math.pi, dist_array2)
L_gamma = np.multiply(e1, flux_array)        # luminosity from flux and distance
Gamma_Eff = np.divide(L_gamma, edot_array)   # efficiency (plotted later as log η)
Tau = np.divide(period_array, pdot_array)    # spin-down timescale P / Pdot
constant = 2.94 * 10**8
t1 = np.power(period_array, -5)
t2 = np.multiply(t1, pdot_array)
t3 = np.power(t2, 1 / 2)
B_LC = np.multiply(constant, t3)             # plotted later as log B_LC (G)
c1 = np.multiply(10**15, pdot_array)
c2 = np.log(c1)
c3 = np.log(period_array)
c4 = 1 - np.multiply(11 / 7, c3) + np.multiply(4 / 7, c2)
c5 = 3.56 - c3 - c2
Zeta1 = 1 + np.divide(c4, c5)
c6 = 0.8 - np.multiply(2 / 7, c3) + np.multiply(2 / 7, c2)
Zeta2 = 1 + np.divide(c6, 1.3)
c8 = 0.6 - np.multiply(11 / 14, c3) + np.multiply(2 / 7, c2)
Zeta3 = 1 + np.divide(c8, 1.3)
# Here I defined the variables I will work with; now I will try to fit.
x1 = np.log(period_array)
y1 = np.log(Gamma_Eff)
coef1, V1 = np.polyfit(x1, y1, 1, cov=True)
poly1d_fn1 = np.poly1d(coef1)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 10))
fig.suptitle('Figure 1')
ax1.plot(x1, y1, 'yo', x1, poly1d_fn1(x1), '-k')
x2 = np.log(Tau)
coef2, V2 = np.polyfit(x2, y1, 1, cov=True)
poly1d_fn2 = np.poly1d(coef2)
ax2.plot(x2, y1, 'yo', x2, poly1d_fn2(x2), '-k')
x3 = np.log(B_LC)
coef3, V3 = np.polyfit(x3, y1, 1, cov=True)
poly1d_fn3 = np.poly1d(coef3)
ax3.plot(x3, y1, 'yo', x3, poly1d_fn3(x3), '-k')
ax1.set(xlabel='log P (s)', ylabel='log η')
ax2.set(xlabel='log τ (yr)', ylabel='log η')
ax3.set(xlabel='log B_LC (G)', ylabel='log η')
# And then obtain the uncertainties (standard errors from the covariance matrices)
sigma_period_1 = np.sqrt(V1[0][0])  # slope error
sigma_period_2 = np.sqrt(V1[1][1])  # intercept error
sigma_Tau_1 = np.sqrt(V2[0][0])
sigma_Tau_2 = np.sqrt(V2[1][1])
sigma_B_LC_1 = np.sqrt(V3[0][0])
sigma_B_LC_2 = np.sqrt(V3[1][1])
Now this works well and I can fit it. The problem is that I cannot get things like the p-value or standard deviation from the fit; I think I need statsmodels for that. I also need to put the errors into the formulas to be more accurate. What I have changed so far is as follows:
period_array = unumpy.uarray(period_array, perioderr_array)  # combine value and error so the error propagates
pdot_array = unumpy.uarray(pdot_array, pdoterr_array)        # same for the second quantity
flux_array = unumpy.uarray(flux_array, flux_err_array)       # and the third
c2 = unumpy.log(c1)            # had to use unumpy instead of np, since np.log fails on uarrays
c3 = unumpy.log(period_array)  # same here
Then I tried to fit using polyfit to see if it works; afterwards I would try to get the same fit with statsmodels.
x1 = unumpy.log(period_array)  # log issue again
y1 = unumpy.log(Gamma_Eff)
coef1, V1 = np.polyfit(x1, y1, 1, cov=True)
The last line gives me the error "ValueError: data type <class 'numpy.object_'> not inexact". I did some digging and understood the problem as "my values are not floats, which is why I am getting the error, so I need to turn them into floats". To do this I tried many things, including x = list(x), but to no avail.
So what am I doing wrong?
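A minimal sketch of one way around this, using hypothetical toy arrays in place of the real data: np.polyfit cannot digest the object arrays that unumpy produces, so extract the plain floats with unumpy.nominal_values() (and the errors with unumpy.std_devs()) before fitting; the same floats can then feed a weighted statsmodels fit, which exposes standard errors and p-values.
import numpy as np
import statsmodels.api as sm
from uncertainties import unumpy

# Hypothetical toy data standing in for period_array and Gamma_Eff.
period_array = unumpy.uarray([1.0, 2.0, 4.0, 8.0], [0.1, 0.1, 0.2, 0.4])
Gamma_Eff = unumpy.uarray([0.5, 0.9, 1.8, 3.5], [0.05, 0.1, 0.2, 0.3])

x1 = unumpy.log(period_array)  # object dtype: each element is a ufloat
y1 = unumpy.log(Gamma_Eff)

# polyfit needs plain floats, so strip the uncertainties off first.
x = unumpy.nominal_values(x1)
y = unumpy.nominal_values(y1)
y_err = unumpy.std_devs(y1)

# Weighted fit: w = 1/sigma downweights the points with larger errors.
coef, V = np.polyfit(x, y, 1, w=1.0 / y_err, cov=True)

# Weighted least squares in statsmodels gives standard errors and p-values.
res = sm.WLS(y, sm.add_constant(x), weights=1.0 / y_err**2).fit()
print(res.params, res.bse, res.pvalues)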

Scaling textplot_wordcloud quanteda

I want to plot features from my quanteda dfm.
When I use textplot_wordcloud (see the code below) I get the error:
In wordcloud(x, min_size, max_size, min_count, max_words, ... : Term x could not be fit on page. It will not be plotted.
dfm_joint <- dfm(tokens_skip)
textplot_wordcloud(dfm_joint, min_size = 2, rotation = 0.25, max_words = 100,
                   color = rev(RColorBrewer::brewer.pal(10, "RdBu")))
I guess the issue lies in the scaling of the plot, but is there any way to adjust the plot size within the textplot_wordcloud function? The "adjust" argument that ships with the function only adapts the size of the words, which doesn't fix the problem.
Thanks very much in advance.

How to display centroids for categorical variables instead of arrows using function ggord?

I really can’t figure out how to display just the centroids for my categorical variables using the function ggord. If anybody could help me, that would be great.
Here is an example of what I’m trying to achieve using the dune data set:
library(vegan)
library(ggord)
library(ggplot2)
ord <- rda(dune ~ Moisture + Management + A1, dune.env)
# first plot
plot(ord)
# second plot
ggord(ord)
# I tried to add the centroids, but somehow the whole plot seems to be scaled differently?
centroids <- ord$CCA$centroids
ggord(ord) +
  geom_point(aes(centroids[, 1], centroids[, 2]), pch = 4, cex = 5, col = "black",
             data = as.data.frame(centroids))
In the first plot, only the centroids (instead of arrows) for Moisture and Management are displayed, while in the ggord plot every variable is displayed with an arrow.
And why do these plots look so different? The scales of the axes are totally different.
Something like this could work - you can use the var_sub argument to retain specific predictors (e.g., continuous), then just plot others on top of the ggord object.
library(vegan)
library(ggord)
library(ggplot2)

data(dune)
data(dune.env)

ord <- rda(dune ~ Moisture + Management + A1, dune.env)

# get centroids for factors
centroids <- data.frame(ord$CCA$centroids)
centroids$labs <- row.names(centroids)

# retain only continuous predictors, then add factor centroids
ggord(ord, var_sub = 'A1') +
  geom_text(data = centroids, aes(x = RDA1, y = RDA2, label = labs))

RobustScaler from scikit-learn not behaving properly

I wanted to fit my data and cut out the outliers, so I used RobustScaler (with data from here):
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(quantile_range=(25.0, 75.0))
df_robust = scaler.fit_transform(df)
df_robust = pd.DataFrame(df_robust, columns=df.columns)
But when I plot the box plot,
df_robust.boxplot(figsize=(25, 25))
plt.show()
it is clear that some data outside the quantile range are still there:
Have you encountered this problem before?
RobustScaler does not remove outliers. When fitted, it computes a center and scale that are robust to outliers. Outliers are then transformed like all other points using those parameters.
In other words, RobustScaler preserves outliers and tries not to let them influence the scaling of the non-outliers.
From the doc:
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range).
So what it does is compute something like this:
iqr = np.nanpercentile(xs, 75) - np.nanpercentile(xs, 25)
median = np.nanmedian(xs)
and standardize like this (check the source code for the exact proportionality constant):
(xs - median) / iqr
There is no step that removes outliers.
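A quick way to see this, as a minimal sketch on synthetic data (not the questioner's dataset): RobustScaler's output matches the manual (x - median) / IQR computation, and the outlier survives, merely rescaled.
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
x = np.append(rng.normal(100, 10, 100), 500.0)  # one extreme outlier

scaled = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(x.reshape(-1, 1)).ravel()

# Manual equivalent: center on the median, scale by the IQR.
iqr = np.nanpercentile(x, 75) - np.nanpercentile(x, 25)
manual = (x - np.nanmedian(x)) / iqr

print(np.allclose(scaled, manual))  # True
print(scaled.max())                 # the outlier is still there, just rescaled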

Convert date/time index of external dataset so that pandas would plot clearly

When you generate a time-series dataset yourself and index it with a native date/time dtype, the index seems to plot cleanly, as here.
But when I already have data files whose date-and-time column comes in its own format, such as [2009-01-01T00:00], is there a way to convert it into an object that the plot can read? Currently my plot looks like the following.
Code:
import glob

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
gen_raw = (pd.read_csv(file, sep='\t', encoding="utf-8") for file in dir)
gen = pd.concat(gen_raw, ignore_index=True)
gen.drop(gen.columns[[1, 2]], axis=1, inplace=True)
# gen['Date/Time'] = gen['Date/Time'][11:]  # caused an error, didn't work

# keep only readings on the hour and half hour
filter = gen[gen['Date/Time'].str.endswith('00') |
             gen['Date/Time'].str.endswith('30')].copy()
filter['rad_tot'] = filter['Direct radiation [W/m**2]'] + filter['Diffuse radiation [W/m**2]']

lis = np.arange(35040)  # number of rows, checked by printing; this covers 2009-2010
plt.xticks(lis, filter['Date/Time'])
plt.plot(lis, filter['rad_tot'], '.')
plt.title('test of generation 2009')
plt.xlabel('Date/Time')
plt.ylabel('radiation total [W/m**2]')
plt.show()
Another approach I had in mind was to use plotly, but its main purpose seems to be feeding data to the web. It would be best if I were familiar with all the modules and could try things myself, but I am learning pandas and matplotlib as I go.
So I would like to ask whether anyone has experienced similar issues.
I think you need to set the labels to not visible in a loop:
ax = df.plot(...)
spacing = 10
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
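For completeness, a self-contained sketch of this trick (with hypothetical half-hourly data standing in for the radiation file, since df.plot(...) above is elided): plot against integer positions, set every timestamp as a tick label, then hide all but every 20th.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the half-hourly radiation readings.
times = pd.date_range('2009-01-01', periods=200, freq='30min')
labels = times.strftime('%Y-%m-%dT%H:%M')
y = np.random.rand(200)

fig, ax = plt.subplots()
ax.plot(np.arange(200), y, '.')
ax.set_xticks(np.arange(200))
ax.set_xticklabels(labels, rotation=90)

# Hide all but every 20th tick label so the axis stays readable.
spacing = 20
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
plt.show()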