Quanteda: error message while tokenizing "unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’" - tokenize

I have been trying to tokenise and clean my 400 txt documents before using structural topic modelling (STM). I wanted to remove punctuation, stopwords, symbols, etc. However, I get the following error message:
"Error in (function (classes, fdef, mtable): unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’". This is my original code:
answers2 <- tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
ngrams = 1L, verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")
I also tried to tokenize a simple string, just to check whether it was an encoding problem while importing my txt files, but I got the same error message, plus a couple of extra ones when I tried to tokenise the text directly, without converting it to a corpus: "Error: Unable to locate Ciao bella ciao" and "Error: No language specified!". Here is my example code in case someone wants to replicate the error message:
prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- tokens(prova2_corpus)
prova2_tok <- tokens(prova_corpus)
The packages that are loaded are: data.table, ggplot2, quanteda, readtext, stm, stringi, stringr, tm, textstem. Any suggestion on how I could proceed to tokenise and clean my texts?

After several attempts, I managed to find a solution. When several text analysis/topic modelling packages are loaded in RStudio, their "tokens" functions can mask one another. You need to force the call to use quanteda's "tokens", i.e. quanteda::tokens(answers_corpus). Here is the updated code:
answers2 <- quanteda::tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")
And the updated example code too:
prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- quanteda::tokens(prova2_corpus)
prova2_tok <- quanteda::tokens(prova_corpus)

Pipeline generation - passing in simple datastructures like lists/arrays

For a code repository project in Palantir Foundry, I am struggling with re-using some of my transformation logic.
It seems almost trivial, but: is there a way to send an Input to a Transform that is not a dataset/dataframe reference?
In my case I want to pass in strings or lists/arrays.
This is my code:
from pyspark.sql import functions as F
from transforms.api import Transform, Input, Output
def my_computation(result, customFilter, scope, my_categories, my_mappings):
    scope_df = scope.dataframe()
    my_categories_df = my_categories.dataframe()
    my_mappings_df = my_mappings.dataframe()
    filtered_cat_df = (
        my_categories_df
        .filter(F.col('CAT_NAME').isin(customFilter))
    )
    # ... more logic
def generateTransforms(config):
    transforms = []
    for key, value in config.items():
        o = {}
        for outKey, outValue in value['outputs'].items():
            o[outKey] = Output(outValue)
        i = {}
        for inpKey, inpValue in value['inputs'].items():
            i[inpKey] = Input(inpValue)
        i['customFilter'] = Input(value['my_custom_filter'])
        transforms.append(Transform(my_computation, inputs=i, outputs=o))
    return transforms
config = {
    "transform_one": {
        "my_custom_filter": {
            "foo",
            "bar"
        },
        "inputs": {
            "scope": "/my-project/input/scope",
            "my_categories": "/my-project/input/my_categories",
            "my_mappings": "/my-project/input/my_mappings"
        },
        "outputs": {
            "result": "/my-project/output/result"
        }
    }
}
TRANSFORMS = generateTransforms(config)
The concrete question is: how can I send in the values from my_custom_filter into customFilter in the transformation function my_computation?
If I execute it like above, I get the error "TypeError: unhashable type: 'set'"
This looks like a Python issue. Can you point out which line is causing the error?
Reading through your code, I would guess it's this line:
i['customFilter'] = Input(value['my_custom_filter'])
If we unpack your code, you are effectively making this call, which passes a Python set where Input expects a dataset reference:
i['customFilter'] = Input({"foo", "bar"})
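The error text itself can be reproduced in plain Python, independent of Foundry. The sketch below only shows that a set cannot be hashed while a frozenset or a tuple can; where exactly the hashing happens inside Input is not shown in the question, and the helper name is_hashable is made up for illustration.

```python
# The filter values from the question's config: a Python set literal.
custom_filter = {"foo", "bar"}


def is_hashable(value):
    """Return True if value can be hashed (e.g. used as a dict key)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


# A mutable set is not hashable; immutable containers of strings are.
set_ok = is_hashable(custom_filter)
frozen_ok = is_hashable(frozenset(custom_filter))
tuple_ok = is_hashable(tuple(sorted(custom_filter)))

# Capture the message Python raises when something tries to hash the set.
try:
    hash(custom_filter)
    message = ""
except TypeError as exc:
    message = str(exc)
```

Making the value hashable would only silence this particular error; it would not turn a list of strings into a dataset, so the filter values still need to reach the compute function some other way.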
Edit to answer the comment on how to create a Python transform that locks a variable in a closure:
from pyspark.sql import functions as F
from transforms.api import transform
def create_transform(inputs, outputs, my_other_var):
    # my_other_var is captured by the closure and is visible inside compute
    @transform(**inputs, **outputs)
    def compute(input_foo, input_bar, output_foobar, ctx):
        df = input_foo.dataframe()
        df = df.withColumn("mycol", F.lit(my_other_var))
        output_foobar.write_dataframe(df)
    return compute
and now you can call this:
transforms.append(create_transform(inputs, outputs, "foobar"))
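The closure idea does not depend on Foundry, so it can be tried in plain Python first. This is a minimal sketch under stated assumptions: make_filter_step, the list-of-dicts rows and the second config entry transform_two are hypothetical stand-ins for the real transform, dataframes and config.

```python
def make_filter_step(custom_filter):
    """Return a compute function with custom_filter bound in its closure."""
    # Freeze the values once so later changes to the config cannot leak in.
    allowed = frozenset(custom_filter)

    def compute(rows):
        # rows is a hypothetical stand-in for a dataframe: a list of dicts.
        return [row for row in rows if row["CAT_NAME"] in allowed]

    return compute


# Hypothetical config, reduced to the part that is not a dataset reference.
config = {
    "transform_one": {"my_custom_filter": {"foo", "bar"}},
    "transform_two": {"my_custom_filter": {"baz"}},
}

# One compute function per config entry, each with its own bound filter.
steps = {
    name: make_filter_step(cfg["my_custom_filter"])
    for name, cfg in config.items()
}

sample_rows = [{"CAT_NAME": "foo"}, {"CAT_NAME": "baz"}, {"CAT_NAME": "qux"}]
result_one = steps["transform_one"](sample_rows)
result_two = steps["transform_two"](sample_rows)
```

Because the factory is called once per config entry, each returned function keeps its own copy of the filter, which avoids the usual pitfall of a loop variable being shared by every function defined inside the loop.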

Kotlin and SimpleXML - Unable to satisfy ElementList Error

I'm struggling to get SimpleXML and Kotlin to read an XML file properly.
I've got the following Root class:
class ServerConfiguration {
    @field:Root(strict = false, name = "serverConfiguration")
    @field:ElementList(required = true, name = "channels", entry = "channel", inline = true, type=Channel::class)
    lateinit var channels: List<Channel>
    @field:Element(name = "serverSettings", required = true, type = ServerSettings::class)
    lateinit var serverSettings: ServerSettings
}
(The Channel class itself also has Lists, but even if I leave it with simple attributes (i.e. Strings), it won't work.)
The XML contains:
<serverConfiguration version="3.5.2">
<date>2022-07-12 10:57:47</date>
<channelGroups>
[... lots of groups]
</channelGroups>
<channels>
<channel version="3.5.2">
<id>b7cb6bf9-d3a5-4a74-8399-b6689d915a15</id>
<nextMetaDataId>6</nextMetaDataId>
<name>cANACR_1_Fhir2Hl7Omg</name>
<connector class="[...].TcpReceiverProperties" version="3.5.2">
[... more ]
</channel>
[... a lot of channels]
</channels>
[... even more data]
</serverConfiguration>
Since multiple tags in the XML contain a "class" attribute, I understand that I need to use inline = true in my @field:ElementList.
I got through a lot of errors up to this point, which I could resolve by myself, but this one eludes me.
I run the Serializer via:
val serializer: Serializer = Persister()
val dataFetch = serializer.read(ServerConfiguration::class.java, myFile!!, false)
The Error (I cut out classpaths):
org.simpleframework.xml.core.ValueRequiredException: Unable to satisfy @org.simpleframework.xml.ElementList(entry="channel", data=false, inline=true, name="channels", type=Channel.class, required=true, empty=true) on field 'channels' private java.util.List ServerConfiguration.channels for class ServerConfiguration at line 1
If anyone could nudge me in the right direction, I'd be very grateful.
Addendum:
If I set required=false the program runs, but not a single channel is read.
I've tried ArrayList, List, and Set as datatype
I've tried to circumvent lateinit with var channels: List<Channel> = mutableListOf()
I've got it working through adding a wrapper class for the nested lists:
ServerConfiguration.kt:
[...]
@field:Element(name="channels", required = true, type=ChannelList::class)
lateinit var channelList: ChannelList
ChannelList.kt:
class ChannelList {
    @field:ElementList(required = true, inline = true,name = "channels", entry = "channel", type=Channel::class, data = true, empty=false)
    var channels: List<Channel> = mutableListOf()
}
And finally Channel.kt:
class Channel {
    @field:Element(name = "destinationConnectors", required = true, type = DestinationConnectorList::class)
    lateinit var destinationConnectorList: DestinationConnectorList
    @field:Element(name = "exportData", required = true, type=ExportData::class)
    lateinit var exportData: ExportData
[...]
While this is working, I would have expected simpleXML to be able to add the Elements of the List directly without needing to use a wrapper class (ChannelList).
If anyone knows how to achieve this, please feel free to comment or add your solution.
Kind regards,
K

How to read multiple .xls files in one go in r

I tried the code below multiple times, but nothing happens when I run it. I think fread does not read the .xls format, so I tried two other approaches, one with the rio package and another with the openxlsx package. Sorry, I am new to this. There are 38 files, each with a name like "Cust+Txn+Details+Customer (36).xls". Thank you.
## First put all file names into a list
library(data.table)
files <- list.files(path = "F:\\MUMuniv\\machine learning class\\price sensitivty\\PS works\\Customer files",
                    pattern = ".xls", full.names = T)
readdata <- function(fn){
    dt_temp <- fread(fn)
    return(dt_temp)
}
# then using
all.files <- lapply(files, readdata)
final.data <- rbindlist(all.files)
Error I got: " Error in fread(fn) : mmap'd region has EOF at the end "
#Example 2
#rio package
require("rio")
xls <- dir(path = ".", all.files = T)
created <- mapply(convert, xls, gsub(".xlsx", ".csv", "xls"))
unlink(xls)
Error in get_ext(file) : 'file' has no extension
#example 3
# using openxlsx package
require("openxlsx")
# Create a vector of Excel files to read
files.to.read = list.files(path = ".", all.files = T)
# Read each file and write it to csv
lapply(files.to.read, function(f) {
    df = read.xlsx(f, sheet=1)
    write.csv(df, gsub("xlsx", "csv", f), row.names=FALSE)
})
Error in file(con, "r") : invalid 'description' argument In addition: Warning message:
In unzip(xlsxFile, exdir = xmlDir) : error 1 in extracting from zip file

perl CGI parameters not all showing

I am passing about seven fields from an HTML form to a Perl CGI script.
Some of the values are not getting recovered using a variety of methods (POST, GET, CGI.pm or raw code).
That is, this code
my $variable = $q->param('varname');
resulted in about half the variables either being empty or undef, although the latter may have been a coincidental situation from the HTML page, which uses JavaScript.
I wrote a test page on the same platform with a simple form going to a simple CGI, and also got results where only half the parameters were represented. The remaining values were empty after the assignment.
I tried both POST and GET. I also tried GET and printed the query string after attempting to write out the variables; everything was in the query string as it should be. I'm using CGI.pm for this.
I tried to see if the variable values had been parsed successfully by CGI.pm by creating a version of my test CGI code which just displays the parameters on the HTML page. The result is a bunch of odd strings like
CGI=HASH(0x02033)->param('qSetName')
suggesting that assignment of these values results in a cast of some kind, so I was unable to tell if they actually 'contained' the proper values.
My real form uses POST, so I just commented out the CGI.pm code and iterated over STDIN and it had all the name-value pairs as it should have.
Everything I've done points to CGI.pm, so I will try reinstalling it.
Here's the test code that missed half the vars:
#!/usr/bin/perl;
use CGI;
my $q = new CGI;
my $subject = $q->param('qSetSubject');
my $topic = $q->param('qTopicName');
my $userName = $q->param('uName');
my $accessLevel = $q->param('accessLevel');
my $category = $q->param('qSetCat');
my $type = $q->param('qSetType');
print "Content-Type: text/html\n\n";
print "<html>\n<head><title>Test CGI<\/title><\/head>\n<body>\n\n<h2>Here Are The Variables:<\/h2>\n";
print "<list>\n";
print "<li>\$q->param(\'qSetSubject\') = $subject\n";
print "<li>\$q->param(\'qTopicName\') = $topic\n";
print "<li>\$q->param(\'uName\') = $userName\n";
print "<li>\$q->param(\'qSetCat\') = $accessLevel\n";
print "<li>\$q->param(\'qSetType\') = $category\n";
print "<li>\$q->param(\'accessLevel\') = $type\n";
print "<\/list>\n";
The results of ikegami's code are here:
qSetSubject: precalculus
qTopicName: polar coordinates
uName: kjtruitt
accessLevel: private
category: mathematics
type: grid-in
My attempt to incorporate ikegami's code:
%NAMES = (
    seqNum => 'seqNum',
    uName => 'userName',
    qSetName => 'setName',
    accessLevel => 'accessLevel',
    qSetCat => 'category',
    qTopicName => 'topic',
    qSetType => 'type',
    qSetSubject => 'subject',
);
use CGI;
my $cgi = CGI->new();
print "Content-Type:text/html\n\n";
#print($cgi->header('text/plain'));
for my $name ($cgi->param) {
    for ($cgi->param($name)) {
        #print("$name: ".( defined($_) ? $_ : '[undef]' )."\n");
        print "$NAMES{$name} = $_\n";
        ${$NAMES{$name}} = $_;
    }
}
print "<html>\n<head><title>Test CGI<\/title><\/head>\n<body>\n\n<h2>Here Are The Variables:<\/h2>\n";
print "Hello World!\n";
print "<list>\n";
print "<li>\$q->param(\'qSetSubject\') = $subject\n";
print "<li>\$q->param(\'qTopicName\') = $topic\n";
print "<li>\$q->param(\'uName\') = $userName\n";
print "<li>\$q->param(\'qSetCat\') = $accessLevel\n";
print "<li>\$q->param(\'qSetType\') = $category\n";
print "<li>\$q->param(\'accessLevel\') = $type\n";
print "<\/list>\n";
The parameters your script actually receives are
qSetSubject: precalculus
qTopicName: polar coordinates
uName: kjtruitt
accessLevel: private
category: mathematics
type: grid-in
so the form sends those two fields as category and type, not as qSetCat and qSetType. That means
my $category = $q->param('qSetCat');
my $type = $q->param('qSetType');
should be replaced with
my $category = $q->param('category');
my $type = $q->param('type');
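The same check, comparing the names a script asks for against the names that actually arrive, can be sketched outside Perl. The Python below is illustrative only: the query string is a hypothetical reconstruction from the values listed above, and expected_names is taken from the test CGI in the question.

```python
from urllib.parse import parse_qs

# Hypothetical query string modelled on the values listed above.
query_string = (
    "qSetSubject=precalculus&qTopicName=polar+coordinates&uName=kjtruitt"
    "&accessLevel=private&category=mathematics&type=grid-in"
)

# Names the test CGI tries to read.
expected_names = [
    "qSetSubject", "qTopicName", "uName",
    "accessLevel", "qSetCat", "qSetType",
]

# parse_qs returns a dict mapping each received name to a list of values.
received = parse_qs(query_string, keep_blank_values=True)

# Names the script asks for that never arrived.
missing = [name for name in expected_names if name not in received]
# Names that arrived but that the script never reads.
unexpected = sorted(name for name in received if name not in expected_names)
```

Here missing and unexpected line up pairwise, which is the signature of a naming mismatch between the form and the script rather than a parsing fault in the CGI library.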

Calculating the load time of page elements using RCurl? (R)

I started playing with the idea of testing a webpage's load time using R. I have written a small R function to do so:
page.load.time <- function(theURL, N = 10, wait_time = 0.05)
{
    require(RCurl)
    require(XML)
    TIME <- numeric(N)
    for(i in seq_len(N))
    {
        Sys.sleep(wait_time)
        TIME[i] <- system.time(webpage <- getURL(theURL, header=FALSE,
                                                 verbose=TRUE) )[3]
    }
    return(TIME)
}
I would welcome your help in several ways:
Is it possible to do the same, but also to know how long each part of the page took to load? (something like Yahoo's YSlow)
I sometimes run into the following error:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
Failure when receiving data from the peer
Timing stopped at: 0.03 0 43.72
Any suggestions on what is causing this and how to catch such errors and discard them?
Can you think of ways to improve the above function?
Update: I redid the function. It is now painfully slow...
one.page.load.time <- function(theURL, HTML = T, JavaScript = T, Images = T, CSS = T)
{
    require(RCurl)
    require(XML)
    require(stringr) # needed for str_detect() below
    TIME <- NULL
    if(HTML) TIME["HTML"] <- system.time(doc <- htmlParse(theURL))[3]
    if(JavaScript) {
        theJS <- xpathSApply(doc, "//script/@src") # find all JavaScript files
        TIME["JavaScript"] <- system.time(getBinaryURL(theJS))[3]
    } else ( TIME["JavaScript"] <- NA)
    if(Images) {
        theIMG <- xpathSApply(doc, "//img/@src") # find all image files
        TIME["Images"] <- system.time(getBinaryURL(theIMG))[3]
    } else ( TIME["Images"] <- NA)
    if(CSS) {
        theCSS <- xpathSApply(doc, "//link/@href") # find all "link" types
        ss_CSS <- str_detect(tolower(theCSS), ".css") # find the CSS in them
        theCSS <- theCSS[ss_CSS]
        TIME["CSS"] <- system.time(getBinaryURL(theCSS))[3]
    } else ( TIME["CSS"] <- NA)
    return(TIME)
}
page.load.time <- function(theURL, N = 3, wait_time = 0.05,...)
{
    require(RCurl)
    require(XML)
    TIME <- vector(length = N, "list")
    for(i in seq_len(N))
    {
        Sys.sleep(wait_time)
        TIME[[i]] <- one.page.load.time(theURL,...)
    }
    require(plyr)
    TIME <- data.frame(URL = theURL, ldply(TIME, function(x) {x}))
    return(TIME)
}
a <- page.load.time("http://www.r-bloggers.com/", 2)
a
Your getURL call will only make one request and fetch the source HTML for the web page. It won't fetch the CSS, JavaScript or other elements. If this is what you mean by 'parts' of the web page, then you'll have to scrape the source HTML for those parts (in SCRIPT tags, CSS references, etc.) and getURL them separately, timing each one.
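The scrape-then-time approach described in this answer can be sketched as follows. This is an illustration in Python using only the standard library, not a port of the R functions above: ResourceCollector, collect_resources, time_resources and the sample HTML are all hypothetical names and data, and the fetch function is passed in so the sketch runs without network access.

```python
import time
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect script, image and stylesheet URLs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.resources = {"JavaScript": [], "Images": [], "CSS": []}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("src"):
            self.resources["JavaScript"].append(attributes["src"])
        elif tag == "img" and attributes.get("src"):
            self.resources["Images"].append(attributes["src"])
        elif tag == "link" and (attributes.get("href") or "").lower().endswith(".css"):
            self.resources["CSS"].append(attributes["href"])


def collect_resources(html):
    """Return a dict of resource type -> list of URLs found in html."""
    collector = ResourceCollector()
    collector.feed(html)
    return collector.resources


def time_resources(resources, fetch):
    """Time fetch(url) for every URL, grouped by resource type.

    A failed request is recorded as None instead of aborting the run.
    """
    timings = {}
    for kind, urls in resources.items():
        timings[kind] = {}
        for url in urls:
            start = time.perf_counter()
            try:
                fetch(url)
                timings[kind][url] = time.perf_counter() - start
            except Exception:
                timings[kind][url] = None
    return timings


sample_html = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body><img src="/img/logo.png"><script>var x = 1;</script></body></html>
"""
resources = collect_resources(sample_html)
```

For a real page, fetch could wrap urllib.request.urlopen(url).read(), with relative URLs first resolved against the page address via urllib.parse.urljoin. Recording None for a failed request is one way to discard errors like the "Failure when receiving data from the peer" case without losing the other timings.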
Perhaps Spidermonkey from Omegahat could work.
http://www.omegahat.org/SpiderMonkey/