invalid input 'ðŸ“§' in 'utf8towcs' when using tm and pdftools

My work was going smoothly until I ran into a problem: some of my PDF files contain unusual symbols ("ðŸ“§").
I have reviewed the older discussion, but none of its solutions worked for me:
R tm package invalid input in 'utf8towcs'
This is my code so far:
setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50")
getwd()
library(tm)
library(NLP)
library(tidytext)
library(dplyr)
library(pdftools)
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
corp <- Corpus(VectorSource(comments))
corp <- VCorpus(VectorSource(comments));names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
removeNumbers = TRUE,
bounds = list(global = c(3, Inf))))
# Results in: Error in .tolower(txt) : invalid input 'ðŸ“§' in 'utf8towcs'
inspect(Comments.tdm[1:32,])
ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 50.csv")
Any help is much appreciated.
P.S. This code worked perfectly on other PDFs.

I took another look at the earlier discussion, and this solution finally worked for me:
myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))
Remember to follow Fransisco's instructions: "Chad's solution wasn't working for me. I had this embedded in a function and it was giving an error about iconv needing a vector as input. So, I decided to do the conversion before creating the corpus."
My code now looks like this:
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
comments <- sapply(comments, function(x) iconv(enc2utf8(x), sub = "byte"))
# Build the corpus once; the earlier Corpus() call was immediately overwritten
corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
removeNumbers = TRUE,
bounds = list(global = c(3, Inf))))
inspect(Comments.tdm[1:28,])
ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 44.csv")
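For anyone wondering what sub = "byte" does here, a minimal sketch follows. It assumes a made-up string containing one byte that is not valid UTF-8, and it passes from/to explicitly (not part of the code above) so that the result does not depend on the session locale:

```r
# Hypothetical input: "\xe9" is a lone Latin-1 byte, which is not valid UTF-8
raw_text <- "caf\xe9 latte"

# sub = "byte" replaces each unconvertible byte with its hex code in angle brackets
cleaned <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "byte")

validUTF8(raw_text)  # check the raw string
validUTF8(cleaned)   # check the cleaned string
tolower(cleaned)     # tolower() now only sees convertible characters
```

The offending symbol is not removed, only replaced by a harmless placeholder, so it may still show up as a term unless it is filtered out later.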

Related

How to solve "Error in knn: 'train' and 'class' have different lengths"

I'm trying to use the knn function (from the class package) on my dataset. It has 12 columns of features, and the 13th is what I want to be able to predict. I'm doing a 67/33 split.
This is my code so far:
nrow(Company_bankruptcy_papernorm)
random <- sample(nrow(Company_bankruptcy_papernorm),
size = 0.33*nrow(Company_bankruptcy_papernorm),replace = FALSE)
Company_bankruptcy_test <- Company_bankruptcy_papernorm[random,]
Company_bankruptcy_train <- Company_bankruptcy_papernorm[-random,]
table(Company_bankruptcy_paper$`Bankrupt?`)
table(Company_bankruptcy_paper$`Bankrupt?`[random])   # -> length 2250
table(Company_bankruptcy_paper$`Bankrupt?`[-random])  # -> length 4569
Bankruptcy_train_labels <- Company_bankruptcy_paper[-random,13]
Bankruptcy_test_labels <- Company_bankruptcy_paper[random,13]
length(Bankruptcy_train_labels)  # -> Answer: NULL
For KNN I tried
KNN_pred1 <- knn(train = Company_bankruptcy_train,
test = Company_bankruptcy_test,
cl = Bankruptcy_train_labels, k=83)
KNN_pred1 <- knn(train = Company_bankruptcy_train,
test = Company_bankruptcy_test,
cl = Bankruptcy_train_labels$`Bankrupt?`, k=83)
But neither call works.
What can I do?
Thank you in advance!
I got the data from: https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction
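One possible cause, offered as a sketch rather than a confirmed diagnosis: knn() compares length(cl) with nrow(train), and a one-column data frame (which is what single-bracket indexing returns for a tibble) has length 1. The example below uses the built-in iris data as a stand-in for the bankruptcy data; the column positions and k = 5 are assumptions for illustration only.

```r
library(class)  # provides knn()

set.seed(1)
df <- iris  # stand-in data: 4 feature columns, label in column 5
test_idx <- sample(nrow(df), size = round(0.33 * nrow(df)))

# Keep only the feature columns in train and test
train_x <- df[-test_idx, 1:4]
test_x  <- df[test_idx, 1:4]

# drop = FALSE mimics tibble behaviour: the result is still a data frame,
# so length() reports the number of columns, not the number of rows
labels_df  <- df[-test_idx, 5, drop = FALSE]

# Double brackets extract the column itself as a factor
labels_vec <- df[[5]][-test_idx]

length(labels_df)   # number of columns
length(labels_vec)  # number of training rows

pred <- knn(train = train_x, test = test_x, cl = labels_vec, k = 5)
```

Note that the label column is also left out of train and test here, so the classifier is not given the answer as a feature.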

How to have partially italicized columns in pdf output?

This question is related to Creating a data frame that produces partially italicized cells with pkg:sjPlot functions
I'd like to have partially italicized cells in a kable. I have tried
library(tidyverse); library(kableExtra)
sum_dat_final2 <- list(Site = c("Hanauma Bay", "Hanauma Bay", "Hanauma Bay", "Waikiki", "Waikiki", "Waikiki"),
Coral_taxon = expression( italic(Montipora)~ spp.,
italic(Pocillopora)~spp.,
italic(Porites)~spp.,
italic(Montipora)~ spp.,
italic(Pocillopora)~spp.,
italic(Porites)~spp.))
sum_dat_final2 %>%
as.data.frame()%>%
kbl(longtable = F, "latex")
and got this error Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘"expression"’ to a data.frame
Many thanks in advance!!

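You can italicize specific parts of a cell by wrapping them in $...$, which LaTeX typesets in math mode (italic by default). For this to work, you need to set escape = F in your kbl call so that the $ characters reach LaTeX unescaped.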
```{r}
library(tidyverse); library(kableExtra)
sum_dat_final2 <- list(Site = c("Hanauma Bay", "Hanauma Bay", "Hanauma Bay", "Waikiki", "Waikiki", "Waikiki"),
"Coral_taxon" = c("$Montipora$$~$ spp.",
"$Pocillopora$$~$spp.",
"$Porites$$~$spp.",
"$Montipora$$~$ spp.",
"$Pocillopora$$~$spp.",
"$Porites$$~$spp."))
sum_dat_final2 %>%
as.data.frame()%>%
kbl(longtable = F, "latex",
escape = F,
col.names = c("Site", "Coral taxonomie"))
```
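As a variation on the same idea, the cell strings could be built with \textit{} instead of math mode. This is only a sketch in base R; the names genus, taxon and tab are made up for illustration, and it still relies on escape = F being set when the data frame is passed to kbl:

```r
# Genus names to italicize (taken from the question)
genus <- c("Montipora", "Pocillopora", "Porites")

# Wrap only the genus in \textit{}; "spp." stays upright
taxon <- sprintf("\\textit{%s} spp.", genus)

# Assemble the table: three taxa per site
tab <- data.frame(
  Site = rep(c("Hanauma Bay", "Waikiki"), each = 3),
  Coral_taxon = rep(taxon, times = 2)
)
tab
```

The backslash is doubled in the R source because R itself uses the backslash as an escape character; the string that reaches LaTeX contains a single one.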

stat_cor function in ggplot2: Print R and p-values on two lines

Is it possible to print the correlation value and the p-value on two lines instead of comma-separated, as is the default?
default:
R=0.8, p=0.004
want:
R=0.8
p=0.004

The stat_cor function is from the ggpubr library (not base ggplot2). Regardless, the documentation for the function has your answer, which is to use the label.sep= argument in stat_cor. You can set that to "\n" to add a new line character as a separation and get the label over two lines. See the example in the documentation with the adjustment:
library(ggplot2)
library(ggpubr)
data("mtcars")
df <- mtcars
df$cyl <- as.factor(df$cyl)
sp <- ggscatter(df, x = "wt", y = "mpg",
add = "reg.line", # Add regression line
add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
conf.int = TRUE # Add confidence interval
)
# Add correlation coefficient
sp + stat_cor(method = "pearson", label.x = 3, label.y = 30, label.sep='\n')

bnlearn error in structural.em

I got an error when trying to use structural.em in the "bnlearn" package.
This is the code:
cut.learn <- structural.em(cut.df, maximize = "hc",
maximize.args = "restart",
fit = "mle", fit.args = list(),
impute = "parents", impute.args = list(), return.all = FALSE,
max.iter = 5, debug = FALSE)
Error in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
warn.if.no.missing = TRUE, : at least one variable has no observed
values.
Has anyone had the same problem? Please tell me how to fix it.
Thank you.

I got structural.em working. I am currently working on a Python interface to bnlearn that I call pybnl, and I also ran into the problem you describe above.
Here is a Jupyter notebook that shows how to use StructuralEM from Python: marks.
The gist of it is described in slides-bnshort.pdf on page 135, "The MARKS Example, Revisited".
You have to create an initial fit from an initial imputed data frame by hand and then provide the arguments to structural.em like so (ldmarks is the latent-discrete-marks data frame where the LAT column only contains missing/NA values):
library(bnlearn)
data('marks')
dmarks = discretize(marks, breaks = 2, method = "interval")
ldmarks = data.frame(dmarks, LAT = factor(rep(NA, nrow(dmarks)), levels = c("A", "B")))
imputed = ldmarks
# Randomly set values of the unobserved variable in the imputed data.frame
imputed$LAT = sample(factor(c("A", "B")), nrow(dmarks), replace = TRUE)
# Fit the parameters over an empty graph
dag = empty.graph(nodes = names(ldmarks))
fitted = bn.fit(dag, imputed)
# Although we've set imputed values randomly, nonetheless override them with a uniform distribution
fitted$LAT = array(c(0.5, 0.5), dim = 2, dimnames = list(c("A", "B")))
# Use whitelist to enforce arcs from the latent node to all others
r = structural.em(ldmarks, fit = "bayes", impute="bayes-lw", start=fitted, maximize.args=list(whitelist = data.frame(from = "LAT", to = names(dmarks))), return.all = TRUE)
You have to use bnlearn 4.4-20180620 or later, because it fixes a bug in the underlying impute function.
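The error message in the question says that at least one variable has no observed values, so it may help to check first which columns are entirely NA. A quick base-R sketch, using a small made-up data frame (toy is hypothetical, not the asker's cut.df):

```r
# Hypothetical data: LAT is a latent column with no observed values
toy <- data.frame(
  A   = factor(c("a", "b", NA)),
  B   = factor(c("x", NA, "y")),
  LAT = factor(rep(NA, 3), levels = c("A", "B"))
)

# Count the non-missing entries per column and keep the names with none
all_missing <- names(toy)[colSums(!is.na(toy)) == 0]
all_missing
```

Any column listed there is either a mistake in the data or a latent variable that needs the hand-made initial fit described above.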

ggplot plotly API mess width stack bar graph

I am using the plotly library to produce an interactive HTML graph from a plot I already generate with ggplot2, but plotly doesn't render the stacked bar graph properly.
Here is my ggplot code:
if(file.exists(filename)) {
data = read.table(filename,sep=",",header=T)
} else {
g <- paste0("=== [E] Error : Couldn't Found File : ",filename)
print (g)
}
ReadChData <- data[data$Channel %in% c("R"),]
#head(ReadChData,10)
# calculate midpoints of bars (simplified using comment by #DWin)
Data <- ddply(ReadChData, .(qos_level),
transform, pos = cumsum(AvgBandwidth) - (0.5 *AvgBandwidth)
)
# library(dplyr) ## If using dplyr...
# Data <- group_by(Data,Year) %>%
# mutate(pos = cumsum(Frequency) - (0.5 * Frequency))
# plot bars and add text
g <- ggplot(Data, aes(x = qos_level, y = AvgBandwidth)) +
scale_x_continuous(breaks = x_axis_break) +
geom_bar(aes(fill = MasterID), stat="identity", width=0.2) +
scale_colour_gradientn(colours = rainbow(7)) +
geom_text(aes(label = AvgBandwidth, y = pos), size = 3) +
theme_set(theme_bw()) +
ylab("Bandwidth (GB/s)") +
xlab("QoS Level") +
ggtitle("Qos Compting Stream")
png(paste0(opt$out,"/",GraphName,".png"),width=6*ppi, height=6*ppi, res=ppi)
print (g)
library(plotly)
p <- ggplotly(g)
# the libdir argument will be used to point to a common lib
htmlwidgets::saveWidget(as.widget(p), selfcontained=FALSE, paste0(opt$out,"/qos_competing_stream.html"))
and here is the HTML output from the plotly library:
http://pasteboard.co/2fHQfJwFu.jpg
Please help.

This is perhaps quite late to answer, but for anyone who might have the issue in the future...
The width parameter of geom_bar is not recognized by the ggplotly function.
Workaround:
A workaround (not a very good one) is to use the parameters colour="white", size = 1. This adds a white line around the bars, producing an effect like white space.
You could try the following:
stat_summary(aes(fill = MasterID), geom="bar", colour="white", size = 1, fun.y = "sum", position = "stack")
Better solution:
Use the bargap parameter of the layout function. The code should be:
ggplotly(type='bar', ...) %>% layout(bargap = 3, autosize=T)
P.S. The code in the question is not executable; it throws an error due to the missing filename.
