Wordcloud per topic for quanteda textmodel_lda in R

Is there any way to extract word clouds for each topic from a quanteda textmodel_lda in R?
All the approaches I tried failed because I was not able to extract the terms, topics, and frequencies from the model and put them into a word cloud.
Thank you very much!
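One way that could work (a hedged sketch, assuming the model was fitted with seededlda::textmodel_lda, whose fitted object exposes the topic-word probability matrix as phi, and using the wordcloud package):

library(seededlda)   # provides textmodel_lda for quanteda dfm objects
library(wordcloud)

# assuming `lda <- textmodel_lda(my_dfm, k = 5)` has already been fitted;
# lda$phi is a k-by-nfeature matrix of per-topic word probabilities
for (topic in rownames(lda$phi)) {
  probs <- sort(lda$phi[topic, ], decreasing = TRUE)[1:50]  # top 50 terms (assumes >= 50 features)
  wordcloud(words = names(probs), freq = probs,
            scale = c(3, 0.5), random.order = FALSE)
  title(topic)  # label the plot with the topic name
}

The word sizes here are driven by the model's per-topic term probabilities rather than raw corpus frequencies.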


How to list all topics created by me

How can I get a list of all topics that I created?
I think it should be something like
%SEARCH{ "versions[-1].info.author = '%USERNAME%" type="query" web="Sandbox" }%
but that returns 0 results.
With "versions[-1]" I get all topics, and with "info.author = '%USERNAME%'" a list of the topics where the last edit was made by me. Having a list of all topics where any edit was made by me would be fine, too, but "versions.info.author = '%USERNAME%'" again gives 0 results.
I’m using Foswiki-1.0.9. (I know that’s quite old.)
The right syntax would be
%SEARCH{ "versions[-1,info.author='%USERNAME%']" type="query" web="Sandbox"}%
But that doesn't perform well, especially on your old Foswiki install.
Better is to install DBCacheContrib and DBCachePlugin and use
%DBQUERY{"createauthor='%WIKINAME%'"}%
This plugin caches the initial author, so that it does not have to retrieve the information from the revision system for every topic under consideration at query time.

Create a dataframe with names from a text

I am quite new to R and Quanteda. I'm trying to create a dataframe with people who voted in favour and against a legislative proposal based on parliamentary transcripts. I can't figure out how to do this and some help would be greatly appreciated.
The following is an example of what the text could look like:
In favour: van Vliet, Nolens, Bruinmelkanip, Krap, Travagliuo and Lucasse.
Those voted against: Verhey, ter Laan, van Gijn, PÜnacker Hordijk, Röell. Troelstra, Drucker, Schaper en Fox.
I would like to write this as a function, so that I can specify the starting and ending word of each section and then build the dataframe with all the names. As a function, I could apply it to multiple such pieces of text.
Thank you!
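A minimal base-R sketch of such a function (the marker strings and the data-frame layout are only illustrative, and the sample names are shortened):

# Pull the names that sit between a start marker and an end marker,
# then split them into a character vector.
extract_names <- function(text, start, end) {
  m <- regmatches(text, regexpr(paste0(start, "(.*?)", end), text, perl = TRUE))
  body <- gsub(paste0("^", start, "|", end, "$"), "", m)
  trimws(unlist(strsplit(body, ",| and | en ")))  # "en" = Dutch "and"
}

# note: OCR noise (e.g. a stray "." inside the against-list) may need
# cleaning before the end marker "\\." can be relied on
favour  <- extract_names("In favour: van Vliet, Nolens, Krap and Lucasse.",
                         "In favour: ", "\\.")
against <- extract_names("Those voted against: Verhey, ter Laan en Fox.",
                         "Those voted against: ", "\\.")

votes <- data.frame(
  name = c(favour, against),
  vote = rep(c("for", "against"), c(length(favour), length(against)))
)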

How to make spaCy case-insensitive

How can I make spaCy case-insensitive when finding entity names?
Is there a code snippet I should add, or something similar? The questions could mention entities that are not capitalized:
import spacy

nlp = spacy.load("en_core_web_sm")  # any English model with an NER component

def analyseQuestion(question):
    doc = nlp(question)
    entity = doc.ents
    return entity

print(analyseQuestion("what is the best seller of Nicholas Sparks "))
print(analyseQuestion("what is the best seller of nicholas sparks "))
which gives
(Nicholas Sparks,)
()
This is old, but hopefully this will help anyone looking at similar problems.
You can use a truecaser to improve your results:
https://pypi.org/project/truecase/
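For example (a short sketch; truecase and its bundled model are assumed to be installed, and the exact output depends on that model):

import truecase  # pip install truecase

# restore natural casing before handing the text to spaCy's NER
question = truecase.get_true_case("what is the best seller of nicholas sparks")
print(question)  # expected: "What is the best seller of Nicholas Sparks"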
It is very easy. You just need to add a preprocessing step of question.lower() to your function:
def analyseQuestion(question):
    # Preprocess the question to make further analysis case-insensitive
    question = question.lower()
    doc = nlp(question)
    entity = doc.ents
    return entity
This solution is inspired by similar code in the Rasa NLU library. However, for non-English (non-ASCII) text it might not work. In that case you can try:
question = question.decode('utf8').lower().encode('utf8')
(note that this is Python 2 syntax; in Python 3, str.lower() already handles Unicode).
However, the NER module in spaCy depends to some extent on the case of the tokens, so you might still see some discrepancies, since it is a statistically trained model. Refer to this link.

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question; I couldn't find anything in the docs.
Currently my workflow looks something like the rule below: I take a number of input files created earlier in this workflow and summarize them.
Is there a way to avoid the manual regex step that parses the wildcards back out of the filenames?
I thought about an expand() of cross_ids and config["chromosomes"], but I am unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        import pandas as pd

        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom, cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward, since this information must be available in the environment somewhere.
(via Johannes Köster on the Google group)
To answer your question:
expand uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)

How to use org.openimaj.ml.gmm to construct speaker models

I would like to know how I can build a GMM speaker model using the OpenIMAJ library (org.openimaj.ml.gmm.GaussianMixtureModelEM). I have tried the following:
GaussianMixtureModelEM gmm = new GaussianMixtureModelEM(
        DEFAULT_NUMBER_COMPONENTS, GaussianMixtureModelEM.CovarianceType.Diagonal);
MixtureOfGaussians mixture = gmm.estimate(data);
boolean converged = gmm.hasConverged();
hasConverged() returns true, so the model has converged, but I am lost as to where to go from here. Any help or guidance would be appreciated.
Given your comment, mixture.estimateLogProbability(point) should do what you want (see http://www.openimaj.org/apidocs/org/openimaj/math/statistics/distribution/MixtureOfGaussians.html#estimateLogProbability(double[])).
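Speaker identification could then look something like this (a minimal sketch using only the classes from the question and the linked method; the toy data, component count, and class name are placeholders, not real speaker features):

import org.openimaj.math.statistics.distribution.MixtureOfGaussians;
import org.openimaj.ml.gmm.GaussianMixtureModelEM;

public class SpeakerScoring {
    public static void main(String[] args) {
        // toy feature vectors, one row per frame; replace with real MFCC features
        double[][] speakerA = { {0.10, 0.20}, {0.15, 0.22}, {0.12, 0.18}, {0.11, 0.19} };
        double[][] speakerB = { {0.90, 0.80}, {0.85, 0.78}, {0.88, 0.82}, {0.91, 0.79} };

        // one mixture component is enough for this toy data
        MixtureOfGaussians modelA = new GaussianMixtureModelEM(
                1, GaussianMixtureModelEM.CovarianceType.Diagonal).estimate(speakerA);
        MixtureOfGaussians modelB = new GaussianMixtureModelEM(
                1, GaussianMixtureModelEM.CovarianceType.Diagonal).estimate(speakerB);

        // score an unseen frame against each speaker model; higher log-likelihood wins
        double[] sample = { 0.11, 0.21 };
        double scoreA = modelA.estimateLogProbability(sample);
        double scoreB = modelB.estimateLogProbability(sample);
        System.out.println(scoreA > scoreB ? "speaker A" : "speaker B");
    }
}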