Regarding TDM of specific keywords - text-mining

I am working on multiple annual report PDF files and want to extract certain keywords, for example "Finance", "CSR", etc. I wrote the following code to count these keywords in each document:
library(stringr)
library(magrittr)  # provides %>%
filelength <- length(files)
wordlength <- length(keywords)
# One row per file, one column per keyword
word_count <- matrix(0L, nrow = filelength, ncol = wordlength)
for (j in seq_len(filelength)) {
  # Read the PDF (one string per page) and normalise the text
  P1 <- pdftools::pdf_text(pdf = files[j]) %>%
    str_to_lower() %>%
    str_replace_all("\\t", "") %>%
    str_replace_all("\n", " ") %>%
    str_replace_all("[:digit:]", "") %>%
    str_replace_all("[:punct:]", "") %>%
    str_squish()  # collapse repeated spaces and trim both ends
  for (i in seq_len(wordlength)) {
    # Sum the matches over all pages; the text is lower-cased,
    # so the keywords must be lower case as well to match
    word_count[j, i] <- P1 %>% str_count(keywords[i]) %>% sum()
  }
}
word_count <- as.data.frame(word_count)
rownames(word_count) <- files
colnames(word_count) <- keywords
Now my question is this: I have extracted the required words (and their counts), but I want to do further analysis. For example, I want to create a TDM (term-document matrix) of specific keywords and extract only those phrases that are related to the keywords. Is this possible in R?
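To make the two requests concrete, here is a minimal sketch in Python (not R) of a keyword-only term-document matrix and of pulling out the sentences that mention a keyword. The function names keyword_tdm and keyword_sentences and the sample documents are made up for illustration; matching is a plain case-insensitive substring match, which is an assumption about what counts as a hit.

```python
import re


def keyword_tdm(documents, keywords):
    """Return {keyword: [count in each document]}, i.e. a term-document
    matrix restricted to the given keywords (case-insensitive substrings)."""
    return {
        kw: [len(re.findall(re.escape(kw.lower()), doc.lower())) for doc in documents]
        for kw in keywords
    }


def keyword_sentences(document, keyword):
    """Return the sentences of `document` that mention `keyword`."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return [s for s in sentences if keyword.lower() in s.lower()]


# Hypothetical stand-ins for the text of two annual reports
docs = [
    "Finance grew. CSR spending rose. Finance costs fell.",
    "CSR is a priority.",
]
print(keyword_tdm(docs, ["Finance", "CSR"]))
print(keyword_sentences(docs[0], "CSR"))
```

In R, the same shape of result is what a term-document matrix limited to a keyword dictionary would give; the sketch only shows the logic.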

My input function and len function does not return a correct result in some cases

I wrote this script to convert Arabic text into a certain format, but I found a problem: when I type a longer sentence, it adds more dots than there are letters. I will paste the code here and explain the problem.
import itertools
while True:
    input_cloze_deletion = input("\nEnter: text or 'exit'\n> ")
    input_exit_check = input_cloze_deletion.strip().lower()
    if input_exit_check == "exit":
        break
    # copy paste
    interpunction_list = ["الله", ",", ".", ":", "?", "!", "'"]
    # copy paste
    interpunction_list = [ ",", ".", ":", "?", "!", "'", "-", "(", ")", "/", "الله", "اللّٰـه"]
    text_replace_0 = input_cloze_deletion.replace(",", " ,")
    text_replace_1 = text_replace_0.replace(".", " .")
    text_replace_2 = text_replace_1.replace(":", " :")
    text_replace_3 = text_replace_2.replace(";", " ;")
    text_replace_4 = text_replace_3.replace("?", " ?")
    text_replace_5 = text_replace_4.replace("!", " !")
    text_replace_6 = text_replace_5.replace("'", " ' ")
    text_replace_7 = text_replace_6.replace("-", " - ")
    text_replace_8 = text_replace_7.replace("(", " ( ")
    text_replace_9 = text_replace_8.replace(")", " ) ")
    text_replace_10 = text_replace_9.replace("/", " / ")
    text_replace_11 = text_replace_10.replace("الله", "اللّٰـه")
    text_split_list = text_replace_11.split()
    count_number = []
    letter_count_list = []
    index_list = itertools.cycle(range(1, 4))
    for letter_count in text_split_list:
        if letter_count in interpunction_list:
            letter_count_list.append(letter_count)
        elif "ـ" in letter_count:
            letter_count = len(letter_count) - 1
            count_number.append(letter_count)
            print(letter_count)
        else:
            letter_count = len(letter_count)
            count_number.append(letter_count)
            print(letter_count)
        for count in count_number:
            letter_count_list.append(letter_count * ".")
    zip_list = zip(text_split_list, letter_count_list)
    zip_list_result = list(zip_list)
    for word, count in zip_list_result:
        if ((len(word)) == 2 or word == "a" or word == "و") and not word in interpunction_list :
            print(f" {{{{c{(next(index_list))}::{word}::{count}}}}}", end="")
        elif word and count in interpunction_list:
            print(word, end = "")
        else:
            print(f" {{{{c{(next(index_list))}::{word}::{count}}}}}", end="")
When I type كتب عليـ ـنا و علـ ـي
the return is {{c1::كتب::...}} {{c2::عليـ::...}} {{c3::ـنا::...}} {{c1::و::..}} {{c2::علـ::..}} {{c3::ـي::..}}
but it should be
{{c1::كتب::...}} {{c2::عليـ::...}} {{c3::ـنا::..}} {{c1::و::.}} {{c2::علـ::..}} {{c3::ـي::.}}
I added print calls to show the len() results, and they are correct, but an extra dot is still added in some cases.
When I type just a single "و", the result is correct, but when I enter a whole sentence an extra dot is added and I don't know why.
Please help.
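Judging from the reported output, the extra dots seem to come from the "for count in count_number:" loop. It runs once per word and multiplies by letter_count rather than count, so on every pass it appends one entry for each count collected so far, and zip then pairs the words with the wrong entries. A minimal sketch of the intended behaviour follows, assuming exactly one placeholder per token is wanted. The names dots_per_token, TATWEEL and PUNCTUATION are made up for illustration, and unlike the original the sketch subtracts every tatweel rather than just one.

```python
TATWEEL = "\u0640"  # the elongation character "ـ", which should not count as a letter
PUNCTUATION = {",", ".", ":", "?", "!", "'", "-", "(", ")", "/"}  # assumed subset of interpunction_list


def dots_per_token(tokens):
    """Return exactly one placeholder per token, in order."""
    placeholders = []
    for token in tokens:
        if token in PUNCTUATION:
            # Punctuation is passed through unchanged
            placeholders.append(token)
        else:
            # One dot per character, not counting tatweel
            letters = len(token) - token.count(TATWEEL)
            placeholders.append("." * letters)
    return placeholders


tokens = "كتب عليـ ـنا و علـ ـي".split()
for word, dots in zip(tokens, dots_per_token(tokens)):
    print(word, dots)
```

Because the list is built in a single pass with one append per token, it always has the same length as the token list, so zip pairs each word with its own dots.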

ggplot multiple columns per group [duplicate]

I can't figure out how to solve this.
The code at the end of the post produces this plot:
How can I make it so that each year has one column per product group (food & tobacco, personal care, etc...)? That is 5 columns per year.
Many thanks!
library(janitor)
library(tidyverse)
library(ggsci)  # provides scale_fill_npg()
# Format data
us_exp <- clean_names(USPersonalExpenditure)  # overwritten by the next assignment
us_exp <- USPersonalExpenditure %>%
  as.data.frame() %>%
  rownames_to_column(., "type")
us_exp <- us_exp %>%
  pivot_longer(!type, names_to = "year", values_to = "count") %>%
  as_tibble()
# ggplot
ggplot(us_exp) +
  aes(x = year,
      y = count,
      group = type,
      fill = type) +
  geom_col() +  # the "+" was missing here, so the theme and scale below were never applied
  theme_classic(base_family = "helvetica_regular") +
  theme(legend.position = "bottom",
        text = element_text(size = 24)) +
  scale_fill_npg() +
  ggtitle("...") +
  xlab(NULL) +
  ylab(NULL)
Use position = "dodge", or position = position_dodge() if you need to pass additional arguments.
ggplot(us_exp) +
  aes(x = year, y = count, group = type, fill = type) +
  geom_col(position = "dodge")

How to properly clean column header in Power Query and capitalize first letter only without changing other letter?

I would like to clean the column headers of a table. My column headers currently have names like the ones below:
[Space][Space][Space]First Name[Space][Space]
[Space]MaintActType[Space]
TECO date[Space]
FIN Date
ABC indicator
COGS
Created On
And my desired column header names are as below:
First Name
Main Act Type
TECO Date
FIN Date
ABC Indicator
COGS
Created On
My code is as below:
let
    Source = Excel.Workbook(File.Contents("C:\RawData\sample.xlsx"), null, true),
    #"sample_Sheet" = Source{[Item="sample",Kind="Sheet"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(#"sample_Sheet", [PromoteAllScalars=true]),
    #"Trim ColumnSpace" = Table.TransformColumnNames(#"Promoted Headers", Text.Trim),
    #"Split CapitalLetter" = Table.TransformColumnNames(#"Trim ColumnSpace", each Text.Combine(Splitter.SplitTextByPositions(Text.PositionOfAny(_, {"A".."Z"},2)) (_), " ")),
    #"Remove DoubleSpace" = Table.TransformColumnNames(#"Split CapitalLetter", each Replacer.ReplaceText(_, "  ", " ")),
    #"Capitalise FirstLetter" = Table.TransformColumnNames(#"Remove DoubleSpace", Text.Proper),
    #"Remove Space" = Table.TransformColumnNames(#"Capitalise FirstLetter", each Text.Remove(_, {" "})),
    #"Separate ColumnName" = Table.TransformColumnNames(#"Remove Space", each Text.Combine(Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}) (_), " "))
in
    #"Separate ColumnName"
However, I get the result below, which is not what I wanted, because all the capital letters were combined together. How do I change the code to get the desired result? I would really appreciate your help.
First Name
Main Act Type
TECODate
FINDate
ABCIndicator
COGS
Created On
Alternatively, I changed the code to:
let
    Source = Excel.Workbook(File.Contents("C:\RawData\sample.xlsx"), null, true),
    #"sample_Sheet" = Source{[Item="sample",Kind="Sheet"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(#"sample_Sheet", [PromoteAllScalars=true]),
    #"Trim ColumnSpace" = Table.TransformColumnNames(#"Promoted Headers", Text.Trim),
    #"Separate ColumnName" = Table.TransformColumnNames(#"Trim ColumnSpace", each Text.Combine(Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}) (_), " ")),
    #"Capitalise FirstLetter" = Table.TransformColumnNames(#"Separate ColumnName", Text.Proper)
in
    #"Capitalise FirstLetter"
Unfortunately, it returns the following result:
First Name
Main Act Type
Teco Date
Fin Date
Abc Indicator
COGS
Created On
I have no idea how to adjust the code any further.
One way is to mark the existing spaces with something (I used "ZZZ") and restore them back to spaces at the end. Here's your code with a couple of tweaks. Thanks for your code. I was trying to do Text.Proper and your sample helped me!
let
    Source = Input,
    #"Replaced Value" = Table.ReplaceValue(Source,"[Space]"," ",Replacer.ReplaceText,{"Headers"}),
    #"Transposed Table" = Table.Transpose(#"Replaced Value"),
    #"Promoted Headers" = Table.PromoteHeaders(#"Transposed Table", [PromoteAllScalars=true]),
    #"Trim ColumnSpace" = Table.TransformColumnNames(#"Promoted Headers", Text.Trim),
    #"Change space to ZZZ" = Table.TransformColumnNames(#"Trim ColumnSpace", each Replacer.ReplaceText(_, " ", " ZZZ ")),
    #"Split CapitalLetter" = Table.TransformColumnNames(#"Change space to ZZZ", each Text.Combine(Splitter.SplitTextByPositions(Text.PositionOfAny(_, {"A".."Z"},2)) (_), " ")),
    #"Capitalise FirstLetter" = Table.TransformColumnNames(#"Split CapitalLetter", Text.Proper),
    #"Remove Space" = Table.TransformColumnNames(#"Capitalise FirstLetter", each Text.Remove(_, {" "})),
    #"Separate ColumnName" = Table.TransformColumnNames(#"Remove Space", each Text.Combine(Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}) (_), " ")),
    #"Change ZZZ to space" = Table.TransformColumnNames(#"Separate ColumnName", each Replacer.ReplaceText(_, "ZZZ", " ")),
    #"Remove DoubleSpace" = Table.TransformColumnNames(#"Change ZZZ to space", each Replacer.ReplaceText(_, "  ", " "))
in
    #"Remove DoubleSpace"
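For comparison, the transformation the question asks for can be written as three small steps: trim, split at lower-to-upper transitions, then capitalise only the first letter of each word. The sketch below shows that logic in Python rather than M, purely as an illustration; it is a different route from the "ZZZ" placeholder above, and the name clean_header is made up.

```python
import re


def clean_header(name):
    """Trim a header, split camelCase words, and capitalise first letters only."""
    # Insert a space at each lower-case to upper-case transition (e.g. "tA" in "MaintAct")
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name.strip())
    # Upper-case only the first character of each word; the rest is left untouched,
    # so acronyms such as "TECO" survive (unlike with a proper-case function)
    words = [word[0].upper() + word[1:] for word in spaced.split()]
    return " ".join(words)


headers = ["   First Name  ", " MaintActType ", "TECO date ", "FIN Date",
           "ABC indicator", "COGS", "Created On"]
for header in headers:
    print(repr(clean_header(header)))
```

Splitting on whitespace and rejoining with single spaces also removes any doubled spaces, so no separate clean-up step is needed in this version.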

Access VBA error in inserting long text into table

In Access I have 2 tables, table_A and table_B. In col2 from table_A, I have an R function as cell value.
mdPatternChart<-function (x, Str_PathFile)
{
if (!(is.matrix(x) || is.data.frame(x)))
stop("Data should be a matrix or dataframe")
if (ncol(x) < 2)
stop("Data should have at least two columns")
R <- is.na(x)
nmis <- colSums(R)
R <- matrix(R[, order(nmis)], dim(x))
pat <- apply(R, 1, function(x) paste(as.numeric(x), collapse = ""))
sortR <- matrix(R[order(pat), ], dim(x))
if (nrow(x) == 1) {
mpat <- is.na(x)
} else {
mpat <- sortR[!duplicated(sortR), ]
}
if (all(!is.na(x))) { cat(" /\\ /\\\n{ `---' }\n{ O O }\n==> V <==")
cat(" No need for mice. This data set is completely observed.\n")
cat(" \\ \\|/ /\n `-----'\n\n")
mpat <- t(as.matrix(mpat, byrow = TRUE))
rownames(mpat) <- table(pat)
} else {
if (is.null(dim(mpat))) {
mpat <- t(as.matrix(mpat))
}
rownames(mpat) <- table(pat)
}
r <- cbind(abs(mpat - 1), rowSums(mpat))
r <- rbind(r, c(nmis[order(nmis)], sum(nmis)))
png(file=paste(Str_PathFile,".png",sep=""),bg="transparent")
plot.new()
if (is.null(dim(sortR[!duplicated(sortR), ]))) {
R <- t(as.matrix(r[1:nrow(r) - 1, 1:ncol(r) - 1]))
} else {
if (is.null(dim(R))) {
R <- t(as.matrix(R))
}
R <- r[1:nrow(r) - 1, 1:ncol(r) - 1]
}
par(mar = rep(0, 4))
plot.window(xlim = c(-1, ncol(R) + 1), ylim = c(-1, nrow(R) +
1), asp = 1)
M <- cbind(c(row(R)), c(col(R))) - 1
shade <- ifelse(R[nrow(R):1, ], mdc(1), mdc(2))
rect(M[, 2], M[, 1], M[, 2] + 1, M[, 1] + 1, col = shade)
adj = c(0, 0.5)
srt = 90
for (i in 1:ncol(R)) {
text(i - 0.5, nrow(R) + 0.3, colnames(r)[i], adj = adj,
srt = srt)
text(i - 0.5, -0.3, nmis[order(nmis)][i])
}
for (i in 1:nrow(R)) {
text(ncol(R) + 0.3, i - 0.5, r[(nrow(r) - 1):1, ncol(r)][i],
adj = 0)
text(-0.3, i - 0.5, rownames(r)[(nrow(r) - 1):1][i],
adj = 1)
}
text(ncol(R) + 0.3, -0.3, r[nrow(r), ncol(r)])
dev.off()
}
Now I would like to insert this into col2 of table_B. Col2 in both tables is a memo field. It works with:
CurrentDb.Execute "insert into Table_B (Col1,Col2) select Col1,Col2 from Table_A"
But it does not work if I use a DAO.Recordset as below.
CurrentDb.Execute "insert into Table_B (Col1,Col2) values (2,'" & Rs_TableA.Fields("Col2") & "')"
It gave run-time error 3075, saying something is wrong with the syntax. I replaced ! and " in the function, but that did not work. I also tried saving the value in a string variable before inserting, and that did not work either. As I need to loop through table_A, can anyone help? Thanks!
The function text contains apostrophes and quote characters. These characters have special meaning in SQL statements. The SELECT subquery won't have an issue, but the constructed SQL that pulls the value from the recordset tries to process them as special characters rather than as plain text. This makes the compiled statement nonsense to the SQL engine. See "How do I escape a single quote in SQL Server?".
Options for handling:
Replace(Replace([fieldname], "'", "''"), Chr(34), Chr(34) & Chr(34))
Open a source recordset and a target recordset, loop through the source, and use AddNew and Update to write records to the target.
Maybe the SELECT subquery version will actually meet the requirement; if the ID should be supplied dynamically from a textbox:
CurrentDb.Execute "INSERT INTO Table_B (Col1,Col2) SELECT " & Me.tbxID & " As C1, Col2 FROM Table_A"
Also, there are 2 slanted apostrophes where I think there should be normal apostrophes.
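The effect described above can be reproduced in a few lines. The sketch below uses Python's sqlite3 module as a stand-in for Access, which is an assumption: the quoting rules of Access SQL differ in detail, and the sample text is a shortened, hypothetical excerpt of the R function. It shows the naive concatenation failing, the doubled-apostrophe fix, and a parameterised insert that avoids building the literal by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table_B (Col1 INTEGER, Col2 TEXT)")

# Sample text containing both a double quote and an apostrophe
function_text = r'''stop("Data should be a matrix or dataframe"); cat(" `-----'\n")'''

# Naive concatenation: the apostrophe inside the text ends the SQL string literal early
naive_failed = False
try:
    conn.execute("INSERT INTO Table_B (Col1, Col2) VALUES (0, '" + function_text + "')")
except sqlite3.Error:
    naive_failed = True

# Option 1: double every apostrophe before concatenating
doubled = function_text.replace("'", "''")
conn.execute("INSERT INTO Table_B (Col1, Col2) VALUES (1, '" + doubled + "')")

# Option 2: pass the value as a parameter, so no escaping is needed
conn.execute("INSERT INTO Table_B (Col1, Col2) VALUES (?, ?)", (2, function_text))

rows = conn.execute("SELECT Col2 FROM Table_B ORDER BY Col1").fetchall()
print(naive_failed, all(row[0] == function_text for row in rows))
```

The recordset approach with AddNew and Update plays the same role in Access as the parameterised insert here: the value never passes through SQL text, so its quotes are never interpreted.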

Programmatically write variable names along with their data type to Teradata using R?

I am attempting to write a dataframe using R to Teradata. The dataframe is wide in format (over 100 columns) and writing to Teradata implies declaring both the name and class of each variable. Note that the below data is just serving as an example.
iris$integerRandom <- seq_along(iris$Sepal.Length)
iris$Dates <- seq.Date(as.Date("2018-01-01"), by = "d", length.out = nrow(iris))
iris$Dates2 <- seq.Date(as.Date("2019-01-01"), by = "d", length.out = nrow(iris))
iris$Species <- as.character(iris$Species)
iris$characterRandom <- sample(letters, nrow(iris), replace = TRUE)
## Getting Numeric and Integer Names first
names_num <- names(iris)[which(sapply(iris, class) %in% c("integer", "numeric"))]
names_date <- names(iris)[which(sapply(iris, class) %in% "Date")]
names_character <- names(iris)[which(sapply(iris, class) %in% "character")]
## Generating variable names with corresponding variable types
paste(gsub("varchar(300)", '"varchar(300)"', gsub(",", " = varchar(300), ", toString(names_character))), "varchar(300)", sep = " = ")
paste(gsub(",", " = date", toString(names_date)), " = date")
paste(gsub("varchar(300)", '"float"', gsub(",", " = float, ", toString(names_num))), "float", sep = " = ")
Ideally, I would like the output to say
Species = "varchar(300)", characterRandom = "varchar(300)", and so forth. Note that the order of the variables is important, because order matters when declaring the names and types to Teradata (or SQL in this case; the code will probably work for both tools). So the order of the variable names along with their types should start with
Sepal.Length and end with characterRandom.
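One way to keep the original column order is to walk the columns once and look each type up in a table, instead of collecting the names per type and pasting the groups together. The sketch below shows that idea in Python rather than R; TYPE_MAP, declare_types and the sample columns are made up for illustration, and the type strings simply mirror the ones used in the question.

```python
from datetime import date

# Assumed mapping from value type to the declared type string
TYPE_MAP = {
    int: "float",
    float: "float",
    date: "date",
    str: "varchar(300)",
}


def declare_types(columns):
    """Build 'name = "type"' pairs in the columns' own order."""
    declarations = []
    for name, values in columns.items():  # dicts preserve insertion order
        sql_type = TYPE_MAP[type(values[0])]  # type taken from the first value
        declarations.append(f'{name} = "{sql_type}"')
    return ", ".join(declarations)


# Hypothetical miniature of the data frame, one sample value per column
columns = {
    "Sepal.Length": [5.1],
    "Species": ["setosa"],
    "integerRandom": [1],
    "Dates": [date(2018, 1, 1)],
    "characterRandom": ["a"],
}
print(declare_types(columns))
```

In R the same single pass could be done by mapping over the columns of the data frame and looking up each column's class, which keeps names and types aligned without any gsub calls.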