How to have partially italicized columns in PDF output?

This question is related to "Creating a data frame that produces partially italicized cells with pkg:sjPlot functions".
I'd like to have partially italicized cells in a kable. I have tried
library(tidyverse); library(kableExtra)
sum_dat_final2 <- list(
  Site = c("Hanauma Bay", "Hanauma Bay", "Hanauma Bay", "Waikiki", "Waikiki", "Waikiki"),
  Coral_taxon = expression(italic(Montipora)~spp.,
                           italic(Pocillopora)~spp.,
                           italic(Porites)~spp.,
                           italic(Montipora)~spp.,
                           italic(Pocillopora)~spp.,
                           italic(Porites)~spp.))

sum_dat_final2 %>%
  as.data.frame() %>%
  kbl(longtable = F, "latex")
and got this error: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘"expression"’ to a data.frame
Many thanks in advance!!

You can italicize specific parts by wrapping them in $ (inline math). For this to render, you need to set escape = F in your kbl() call.
```{r}
library(tidyverse); library(kableExtra)
sum_dat_final2 <- list(
  Site = c("Hanauma Bay", "Hanauma Bay", "Hanauma Bay", "Waikiki", "Waikiki", "Waikiki"),
  Coral_taxon = c("$Montipora$$~$ spp.",
                  "$Pocillopora$$~$spp.",
                  "$Porites$$~$spp.",
                  "$Montipora$$~$ spp.",
                  "$Pocillopora$$~$spp.",
                  "$Porites$$~$spp."))

sum_dat_final2 %>%
  as.data.frame() %>%
  kbl(longtable = F, "latex",
      escape = F,
      col.names = c("Site", "Coral taxonomy"))
```
Output: the rendered LaTeX table shows the genus names in italics.
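An alternative not shown in the original answer: kableExtra's text_spec() can generate \textit{} markup for just the genus name, giving upright-text italics rather than math-mode italics; escape = F is still required. A sketch (sum_dat_final3 is an illustrative name):
```{r}
library(kableExtra)
taxa <- c("Montipora", "Pocillopora", "Porites", "Montipora", "Pocillopora", "Porites")
sum_dat_final3 <- data.frame(
  Site = c("Hanauma Bay", "Hanauma Bay", "Hanauma Bay", "Waikiki", "Waikiki", "Waikiki"),
  # text_spec() wraps only the genus in \textit{}; " spp." stays upright
  Coral_taxon = paste(text_spec(taxa, format = "latex", italic = TRUE), "spp.")
)
kbl(sum_dat_final3, "latex", longtable = F, escape = F,
    col.names = c("Site", "Coral taxonomy"))
```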

Related

stat_cor function in ggplot2: Print R and p-values on two lines

Is it possible to print the correlation coefficient (R) and the p-value on two lines instead of comma-separated, as is the default?
default:
R=0.8, p=0.004
want:
R=0.8
p=0.004
The stat_cor function is from the ggpubr package (not base ggplot2). Regardless, the documentation for the function has your answer: use the label.sep= argument of stat_cor. Set it to "\n" to use a newline as the separator and get the label over two lines. Here is the example from the documentation with that adjustment:
library(ggplot2)
library(ggpubr)
data("mtcars")
df <- mtcars
df$cyl <- as.factor(df$cyl)
sp <- ggscatter(df, x = "wt", y = "mpg",
                add = "reg.line",                                      # Add regression line
                add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
                conf.int = TRUE                                        # Add confidence interval
)
# Add correlation coefficient on two lines
sp + stat_cor(method = "pearson", label.x = 3, label.y = 30, label.sep = '\n')

R dbWriteTable doesn't work with a column of all NA's and class not logical

I am trying to upload a table to SQL from R using dbWriteTable, but I am having issues because some of my columns contain only NA's. I've learned that if the column's class is logical then it works, but if it is anything else it throws an error.
Does anybody have a solution? The columns that are all NA's will vary, so I can't just convert a specific column with as.logical(), and I can't figure out a way to do it using lapply. I also do not want to get rid of these columns.
Works
test <- data.frame(Name = c("Fred","Wilma","George"), Villians = c(2,4,3), Information = c(NA,NA,NA), stringsAsFactors = FALSE)
if (dbExistsTable(con, "test")) {dbRemoveTable(con, "test")}
dbWriteTable(con, name = "test", value = test, row.names = FALSE)
> sapply(test,class)
Name Villians Information
"character" "numeric" "logical"
Throws an error
test <- data.frame(Name = c("Fred","Wilma","George"), Villians = c(2,4,3), Information = c(NA,NA,NA), stringsAsFactors = FALSE)
test$Information <- as.character(test$Information)
if (dbExistsTable(con, "test")) {dbRemoveTable(con, "test")}
dbWriteTable(con, name = "test", value = test, row.names = FALSE)
Warning message:
In max(nchar(as.character(x)), na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
> sapply(test,class)
Name Villians Information
"character" "numeric" "character"
I am using a company server so if there is any configuration that needs to be made on that end, I may or may not be able to get it done.
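(Not part of the original thread: a minimal sketch of the lapply idea mentioned above, which coerces any column that is entirely NA back to logical before writing, whatever its current class.)
# columns that are all NA become logical NA columns again; other columns are untouched
test[] <- lapply(test, function(x) if (all(is.na(x))) as.logical(x) else x)
dbWriteTable(con, name = "test", value = test, row.names = FALSE)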

How to extract stat_smooth curve maxima in ggplot panel (facet_grid)?

I have created this plot with 18 panels using facet_grid and two different fitting equations (one for Jan - Apr, and one for May - Jun). I need help with two things:
(may sound obvious, but) I haven't been able to find working code on the internet to extract the maximum of a curve from a stat_smooth fit. I'd appreciate it if someone could show and explain what the code means. This is the closest I could find, but I am not sure what it means:
gb <- ggplot_build(p1)
curve_max <- gb$data[[1]]$x[which(diff(sign(diff(gb$data[[1]]$y)))==-2)+1]
How to add a vertical line to indicate max value on each curve?
Data file (rlc2 <- read_excel)
Plot
plot <- ggplot(rlc2, aes(par, etr, color = month, group = site)) +
  geom_point() +
  stat_smooth(data = subset(rlc2, rlc2$month != "May" & rlc2$month != "Jun"),
              method = "glm",
              formula = y ~ x + log(x),
              se = FALSE,
              method.args = list(family = gaussian(link = "log"), start = c(a = 0, b = 0, c = 0))) +
  stat_smooth(data = subset(rlc2, rlc2$month == "May" | rlc2$month == "Jun"),
              method = "nlsLM",
              formula = y ~ M * (1 - exp(-(a * x))),
              se = FALSE,
              method.args = list(start = c(M = 0, a = 10))) +
  facet_grid(rows = vars(month), cols = vars(site))
plot
Any other advice is also welcome. I am not educated as a programmer, so my code is probably a bit messy. Thank you for helping.
Try this:
First, fit the data and extract the maximum of the fit.
my.fit <- function(month, site, data) {
  fit <- glm(formula = etr ~ par + log(par),
             data = data,
             family = gaussian(link = "log"))
  # arrange the desired output in a tibble
  tibble(max = max(fit$fitted.values),
         site = site,
         month = month)
}

# Apply the custom function `my.fit` on each subset of data
# (grouped by month and site) using the group_by/nest/pmap method;
# the results are row-bound and returned as a data frame.
my.max <-
  rlc2 %>%
  dplyr::group_by(month, site) %>%
  tidyr::nest() %>%
  purrr::pmap_dfr(my.fit)
Next, join the results back onto your data and add a geom_line:
rlc2 %>%
  dplyr::left_join(my.max) %>%
  ggplot(aes(x = par, y = etr)) +
  geom_point() +
  stat_smooth(data = subset(rlc2, rlc2$month != "May" & rlc2$month != "Jun"),
              method = "glm",
              formula = y ~ x + log(x),
              se = FALSE,
              method.args = list(family = gaussian(link = "log"), start = c(a = 0, b = 0, c = 0))) +
  stat_smooth(data = subset(rlc2, rlc2$month == "May" | rlc2$month == "Jun"),
              method = "nlsLM",
              formula = y ~ M * (1 - exp(-(a * x))),
              se = FALSE,
              method.args = list(start = c(M = 0, a = 10))) +
  geom_line(aes(y = max), col = "red") +
  facet_grid(rows = vars(month), cols = vars(site))
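The second part of the question (a vertical line at each curve's maximum) isn't covered above. A minimal sketch of one way to do it, reusing the same group_by/nest/pmap approach but also recording the par value at the fitted maximum (my.fit2, par_at_max, and my.max2 are names introduced here for illustration; as in the answer, only the glm fit is used):
my.fit2 <- function(month, site, data) {
  fit <- glm(etr ~ par + log(par),
             data = data,
             family = gaussian(link = "log"))
  # x position (par) at which the fitted curve reaches its maximum
  tibble(par_at_max = data$par[which.max(fit$fitted.values)],
         site = site,
         month = month)
}
my.max2 <-
  rlc2 %>%
  dplyr::group_by(month, site) %>%
  tidyr::nest() %>%
  purrr::pmap_dfr(my.fit2)
# dashed vertical line at the fitted maximum of each curve, per facet
plot + geom_vline(data = my.max2, aes(xintercept = par_at_max), linetype = "dashed")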

invalid input '📧' in 'utf8towcs' when using tm and pdftools

My work was going along smoothly, but I encountered problems because some of my PDF files contain weird symbols ("📧").
I have reviewed the older discussion, but none of those solutions worked:
R tm package invalid input in 'utf8towcs'
This is my code so far:
setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50")
getwd()
library(tm)
library(NLP)
library(tidytext)
library(dplyr)
library(pdftools)
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
corp <- Corpus(VectorSource(comments))
corp <- VCorpus(VectorSource(comments));names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(3, Inf))))
Results in: Error in .tolower(txt) : invalid input '📧' in 'utf8towcs'
inspect(Comments.tdm[1:32,])
ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 50.csv")
Any help is much appreciated.
P.S. This code worked perfectly on other PDFs.
I took another look at the earlier discussion; this solution finally worked for me:
myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))
Remember to follow Fransisco's instructions: "Chad's solution wasn't working for me. I had this embedded in a function and it was giving an error about iconv needing a vector as input. So, I decided to do the conversion before creating the corpus."
My code now looks like this:
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
comments <- sapply(comments, function(x) iconv(enc2utf8(x), sub = "byte"))
corp <- Corpus(VectorSource(comments))
corp <- VCorpus(VectorSource(comments));names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(3, Inf))))
inspect(Comments.tdm[1:28,])
ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 44.csv")

Convert topicmodels output to JSON

I use the following function to convert the topicmodels output to JSON for use in LDAvis.
topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
  ## Required packages
  library(topicmodels)
  library(dplyr)
  library(stringi)
  library(tm)
  library(LDAvis)

  ## Find required quantities
  phi <- posterior(fitted)$terms %>% as.matrix
  theta <- posterior(fitted)$topics %>% as.matrix
  vocab <- colnames(phi)
  doc_length <- vector()
  for (i in 1:length(corpus)) {
    temp <- paste(corpus[[i]]$content, collapse = ' ')
    doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
  }
  temp_frequency <- inspect(doc_term)
  freq_matrix <- data.frame(ST = colnames(temp_frequency),
                            Freq = colSums(temp_frequency))
  rm(temp_frequency)

  ## Convert to json
  json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
                                 vocab = vocab,
                                 doc.length = doc_length,
                                 term.frequency = freq_matrix$Freq)
  return(json_lda)
}
but I receive the following error
Error in LDAvis::createJSON(phi = phi, theta = theta, vocab = vocab, doc.length = doc_length, : Length of doc.length not equal
to the number of rows in theta; both should be equal to the number of
documents in the data.
Here is my complete code:
data <- read.csv("textmining.csv")
corpus <- Corpus(DataframeSource(data.frame(data$reasonforleaving)))
# Remove punctuations and numbers because they are generally uninformative.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Convert all words to lowercase.
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords such as "a", "the", etc.
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Use the SnowballC package to do stemming.
library(SnowballC)
corpus <- tm_map(corpus, stemDocument)
# remove extra words
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "still")
corpus <- tm_map(corpus, toSpace, "also")
# Remove excess white spaces between words.
corpus <- tm_map(corpus, stripWhitespace)
# Inspect the first document to see what it looks like.
corpus[[1]]$content
dtm <- DocumentTermMatrix(corpus)
# remove empty documents
library(slam)
dtm = dtm[row_sums(dtm)>0,]
# Use topicmodels package to conduct LDA analysis.
burnin <- 500
iter <- 1000
keep <- 30
k <- 5
result55 <- LDA(dtm, 5)
ldaoutput = topicmodels_json_ldavis(result55,corpus, dtm)
Do you know why I receive the error?
Thanks
I had the same issue with the same code, and found this function here:
topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]],
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}
It is much, much simpler to use; just pass your LDA result as the argument:
result55 <- LDA(dtm, 5)
serVis(topicmodels2LDAvis(result55))
Problem
Your problem is the for (i in 1:length(corpus)) loop in
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
}
Remember, you have removed some "empty" documents from your DocumentTermMatrix in dtm = dtm[row_sums(dtm)>0,],
so your vector length here is going to be too big.
Suggestion
You may want to keep a vector of the empty docs around as it will help you not only to generate the JSON but also to go back and forth between your empty and full document sets.
doc.length = colSums( as.matrix(tdm) > 0 )[!empty.docs]
My suggestion assumes you have the full tdm with empty docs in place
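A minimal sketch of that suggestion, using the dtm built in the question rather than a tdm (empty.docs and doc_length are illustrative names); here row_sums counts the tokens per document, which is what LDAvis::createJSON expects for doc.length:
library(slam)
empty.docs <- slam::row_sums(dtm) == 0           # flag documents with no terms, before any are dropped
doc_length <- slam::row_sums(dtm)[!empty.docs]   # token count for each document that is kept
dtm <- dtm[!empty.docs, ]                        # drop the empty documents, as in the question
# doc_length now lines up with the rows of theta from LDA(dtm, 5)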