Is topic coherence (gensim CoherenceModel) calculated exclusively from my corpus, or from external data as well?

I'm topic modeling a corpus of English 20th-century correspondence using LDA, and I've been using topic coherence (as well as silhouette scores) to evaluate my topics. I use gensim's CoherenceModel with c_v coherence, and the highest score I've ever gotten across all the models I've tested was 0.35, even for the topics that make the most sense to me in qualitative evaluation, and even after extensive pre-processing and hyperparameter comparison.
So I had basically accepted that that's the best I'd get, but now that I need to write about it I've been reading up on topic coherence, and I understand it's a pipeline that models human judgement. One thing I can't seem to find clear information on, though: is it based exclusively on calculations made on my corpus, or on some external data as well? For example, trained on external corpora that might have nothing to do with my domain? Should I use u_mass instead?

Yes: except for u_mass, they all use external reference datasets. That is not necessarily a bad thing, though, as those reference datasets provide richer co-occurrence information.
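If you want to compare the two on the same model, gensim exposes both measures through CoherenceModel. Here is a minimal sketch, assuming you already have a trained lda_model, a dictionary, the tokenized texts, and the bag-of-words corpus:

```python
from gensim.models import CoherenceModel

# c_v: sliding-window co-occurrence statistics over the texts you pass in,
# combined with NPMI and cosine similarity (designed to model human ratings).
cv = CoherenceModel(model=lda_model, texts=texts,
                    dictionary=dictionary, coherence='c_v')

# u_mass: document co-occurrence counts computed directly on your own
# bag-of-words corpus, with no sliding window.
umass = CoherenceModel(model=lda_model, corpus=corpus,
                       dictionary=dictionary, coherence='u_mass')

print('c_v:   ', cv.get_coherence())
print('u_mass:', umass.get_coherence())
```

Note that the two measures live on different scales (u_mass values are typically negative), so compare models within one measure rather than comparing scores across measures.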

Related

Client participation in the federated computation rounds

I am building a federated learning model using TensorFlow Federated (TFF).
Based on what I have read in the tutorials and papers, I understand that the state-of-the-art method (FedAvg) works by selecting a random subset of clients at each round.
My concern is:
I have a small number of clients: 8 in total, of which I select 6 for training and keep 2 for testing.
All of the data is available on my local device, so I am using TFF as a simulation environment.
If I use all 6 clients in every federated communication round, would this be an incorrect execution of the FedAvg method?
Note that I am also planning to run the same experiment used in this paper, which uses different server optimization methods and compares their performance. So, would the all-clients-participating procedure work here or not?
Thanks in advance
This is certainly a valid application of FedAvg and the variants proposed in the linked paper, though one that is only studied empirically in a subset of the literature. On the other hand, many theoretical analyses of FedAvg assume a similar situation to the one you're describing; at the bottom of page 4 of that linked paper, you will see that the analysis is performed in this so-called 'full participation' regime, where every client participates on every round.
Often the setting you describe is called 'cross silo'; see, e.g., section 7.5 of Advances and Open Problems in Federated Learning, which will also contain many useful pointers for the cross-silo literature.
Finally, depending on the application, consider that it may be more natural to literally train on all clients, reserving portions of each client's data for validation and test. Questions around natural partitions of data to model the 'setting we care about' are often thorny in the federated setting.
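For intuition, 'full participation' just means the server aggregates updates from every client in every round. Here is a minimal NumPy sketch of that loop, assuming plain FedAvg on a toy linear model with made-up client data (not TFF-specific):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 6 training clients, each with its own (x, y) data.
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(6)]

def local_update(w, x, y, lr=0.01, epochs=5):
    """A few epochs of local gradient descent on one client's least-squares loss."""
    for _ in range(epochs):
        grad = 2 * x.T @ (x @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(3)
for round_num in range(20):
    # Full participation: every client trains locally in every round.
    local_weights = [local_update(w_global, x, y) for x, y in clients]
    sizes = [len(y) for _, y in clients]
    # FedAvg aggregation: average the client models, weighted by example count.
    w_global = np.average(local_weights, axis=0, weights=sizes)

print(w_global)
```

With only 6 training clients there is no subsampling to do, so a loop like this corresponds to the 'full participation' regime discussed above.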

Why aren't triple stores implemented as native graph stores, the way property-graph stores are?

SPARQL-based stores, or put another way triple stores, are said to be less efficient than property-graph stores, on top of not being able to be distributed while maintaining performance the way property graphs can.
I understand that there is a lot at stake here, such as inferencing and whatnot. Putting distribution and inferencing aside, where we could limit ourselves to RDFS (which can be fully captured via SPARQL), I am wondering why that is.
More specifically, why is the storage the issue? What is stopping a SPARQL-based store from storing data the way a property-graph store does and performing traversals instead of massive join queries? Can't SPARQL simply be translated to Gremlin steps, for instance? What is the limitation there? Can't the joins be avoided?
My assumption is that if SPARQL can be translated into efficient traversal steps, and data is stored the way property-graph stores store it, as JanusGraph does (https://docs.janusgraph.org/latest/data-model.html), then the performance gap would be bridged while maintaining some inference such as RDFS.
That being said, SPARQL is of course not Turing-complete, but at least for what it does, it would do it fast and possibly at scale as well. The goal in my view is not to compete, but to benefit from SPARQL's ease of use and to use a traversal language like Gremlin for things that really require it, e.g. OLAP.
Is there any project in that direction? Has Apache Jena considered any of this?
I saw that Graql from Grakn seems to be going down that road for the reasons I explain above, so what's stopping the triple-store community?
#Michael, I am happy that you stepped in, as you definitely know more than me on this :). I am on a learning journey at this point. At your request, here is one of the papers that inspired my understanding:
arxiv.org/abs/1801.02911 (SPARQL querying of Property Graphs using Gremlin Traversals)
I quote them:
"We present a comprehensive empirical evaluation of Gremlinator and
demonstrate its validity and applicability by executing SPARQL queries
on top of the leading graph stores Neo4J, Sparksee and Apache
TinkerGraph and compare the performance with the RDF stores Virtuoso,
4Store and JenaTDB. Our evaluation demonstrates the substantial
performance gain obtained by the Gremlin counterparts of the SPARQL
queries, especially for star-shaped and complex queries."
They explain, however, that things depend somewhat on the type of query.
Or, as another answer on Stack Overflow put it, Comparison of Relational Databases and Graph Databases would also help understand the issue between sets and paths. My understanding is that triple stores work with sets too. That being said, I am definitely not aware of all the optimization techniques implemented in triple stores lately, and I saw several papers explaining techniques to significantly prune set join operations.
On distribution, it is more of a gut feeling. For instance, doing join operations in a distributed fashion sounds very, very expensive to me. I don't have the papers and my research is not exhaustive on the matter, but from what I have read (and I will have to dig into my Evernote :) to back it up), that is the fundamental problem with distribution. Automated smart sharding does not seem to help alleviate the issue.
#Michael, this is a very, very complex subject. I'm definitely on the journey, and that's why I am helping myself with Stack Overflow to guide my research. You probably have an idea as to why. So feel free to provide pointers indeed.
That being said, I am not saying that there is a problem with RDF and that property graphs are better. I am saying that, when it comes to graph traversal, there are ways of implementing a backend that makes it fast. The data model is not the issue here; the data structure used to support the traversal is the issue. The second thing I am saying is that the choice of query language seems to influence how the "traversal" is performed, and hence the data structure used to back the data model.
That's my understanding so far, and yes, I do understand that there are a lot of other factors at play; feel free to enumerate some of them to guide my journey.
In short, my question comes down to this: is it possible to have RDF stores backed by a so-called native graph storage, and then implement SPARQL in terms of traversal steps rather than joins over sets as per its algebra? Wouldn't that make things a bit faster? It seems to me that this is somewhat the approach taken by https://github.com/graknlabs/grakn, which is primarily backed by JanusGraph for graph-like storage. Although it is not RDF, Graql is the same idea as having RDFS++ plus SPARQL. They claim to just do it better, about which I have my reservations, but that's not the fundamental question of this thread. The bottom line is that they back knowledge representation with the information-retrieval (path traversal) and storage approach that property graphs championed. Let me be clear on this: I am not saying that native graph storage is the property of property graphs. It is just, in my mind, a storage approach optimized for storing graph structure where information retrieval involves (path) traversal: https://docs.janusgraph.org/latest/data-model.html.
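To make the joins-versus-traversals contrast concrete, here is a small Python sketch (using rdflib purely as an illustration of the point above, not of how any particular triple store is implemented; the namespace and toy triples are made up): the same two-hop "friend of a friend" query expressed once as SPARQL over triple patterns, which the engine answers by joining on ?friend, and once as an explicit traversal over an adjacency index built from the same triples.

```python
from collections import defaultdict
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.knows, EX.carol))
g.add((EX.bob, EX.knows, EX.dave))

# 1) SPARQL: two triple patterns, joined on ?friend.
q = """
PREFIX ex: <http://example.org/>
SELECT ?fof WHERE {
  ex:alice ex:knows ?friend .
  ?friend  ex:knows ?fof .
}
"""
print([str(row.fof) for row in g.query(q)])

# 2) Traversal: pre-build an adjacency index (roughly the kind of per-vertex
#    structure a "native graph" backend keeps), then walk it edge by edge.
adj = defaultdict(list)
for s, p, o in g.triples((None, EX.knows, None)):
    adj[s].append(o)

fof = [c for b in adj[EX.alice] for c in adj[b]]
print([str(x) for x in fof])
```

Roughly speaking, a storage layout built around that kind of adjacency index is what lets path-shaped queries be answered by chasing pointers from vertex to vertex instead of joining sets of triples.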
First, I'd love to see the references that back up your claim that RDF-based systems are inherently less efficient than property graph ones, because frankly it's a nonsensical claim. Further, there have been distributed, and I'm assuming you mean scale-out, RDF stores, so the claim that they are not able to be distributed is simply incorrect.
The Property Graph model, and Gremlin, can easily be implemented on top of an RDF-based system. This has been done at least twice to my knowledge, and in one of those implementations reasoning was supported at the Gremlin/Property Graph layer. So you don't need to be a Property Graph based system to support that model. There are a myriad of reasons why systems, RDF and Property Graph alike, make specific implementation choices, from storage to execution and beyond, and those choices are guided in part by the "native" model, the technology chosen for implementation, and perhaps most importantly, the use cases for the system and the problems it aims to solve.
Further, it's unclear what you recommend the authors of RDF-based systems actually do; are you suggesting scale-out is beneficial? Are you stating that your preference for the Property Graph model should be taken as gospel, such that RDF-based systems give up and switch data models? Do you want Property Graph systems to retrofit RDFS?
Finally, to the initial question you asked, I think you have it exactly backwards; the Property Graph model is a hybrid model mixing elements of graph and key-value models, whereas the RDF model is a pure, i.e. native, graph model. Gremlin will be adopting the RDF model, albeit with syntactic sugar around what in the RDF world is called reification, but to everyone else, edge properties. So in a world where your exemplar of the Property Graph model is abandoning said model, I'm not sure what more to tell you, other than that you should do a bit more background research.

IPA (International Phonetic Alphabet) Transcription with TensorFlow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create writing systems for the ones that don't have one already. One of their current methods of accomplishing such a task is three-fold: 1) record a native speaker conversing in the language, 2) listen to that recording and try to transcribe it into the IPA, and 3) from the phonetics, analyze the phonemics and phonotactics of the language to eventually create a writing system for the speakers.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done? And if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? If not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required? At the very least, you would need a large dataset of pre-transcribed data: ideally a large amount of spoken-language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is, the actual neural network architecture implemented in code. And lastly, you would need some computing resources. This is not something you can train casually; you would either have to buy some time on a cloud-based machine learning platform (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know, but I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech. Here is one, for example: http://deeplearning.stanford.edu/lexfree/lexfree.pdf. Since the alphabet you want to transcribe to is specifically designed to capture the way words sound, rather than just write the words down, you might have more success at training such a model.
Is it possible with TensorFlow? Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing some really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought, for part 1 you would have to use a dataset mapping spoken words to their transcriptions, since I expect that the same sound pronounced in isolation would differ from the same sound used within a word.
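For a sense of what "a model" could look like here, one common choice for transcription without frame-level alignments is a recurrent network trained with CTC loss. A minimal TensorFlow/Keras sketch, where the IPA inventory size, the audio feature dimension, and the data pipeline are all placeholders you would replace:

```python
import tensorflow as tf

NUM_IPA_SYMBOLS = 100   # placeholder: size of the IPA symbol inventory you use
NUM_MEL_BINS = 80       # placeholder: log-mel spectrogram features per audio frame

# Frames of audio features in, per-frame distribution over IPA symbols out.
# The extra output class is the CTC "blank" token.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, NUM_MEL_BINS)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dense(NUM_IPA_SYMBOLS + 1),  # logits; blank is the last index
])

def ctc_loss(labels, logits, label_length, logit_length):
    # CTC aligns the unsegmented IPA label sequence to the audio frames,
    # so no frame-level annotation of the recordings is needed.
    return tf.nn.ctc_loss(
        labels=labels, logits=logits,
        label_length=label_length, logit_length=logit_length,
        logits_time_major=False, blank_index=-1)

model.summary()
```

This is only a sketch of the architecture and loss; the real work is assembling the (audio, IPA transcription) dataset the answer above describes.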
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus

How to make testing data manually for clustering of citation records?

I'm doing research on the author name disambiguation problem and want to run some experiments: I want to perform clustering on citation records. My dataset consists of 2000 XML records. I need testing data, but the dataset I'm using is not a popular one, so I have to create the testing data manually, and I don't know how to do that. I need instructions on how to make testing data manually. Note: I want to compare the performance of a set of techniques at solving the author name disambiguation problem, so I must perform testing.
Even though it is not really clear what kind of testing you want to perform, the general answer to the issue at hand (trying to artificially create more data from the data you have) is the bootstrap. It is a technique in which you sample with replacement from your dataset as many times as you want: it repeatedly picks random elements from your data until you have a sample of the size you want. The sample you get can be larger than your original dataset, but it should have similar statistical properties to it. Bootstrap sampling is available in sklearn.
P.S. You need to keep in mind that this solution is not optimal: the best solution to this problem is to actually get more real data somehow.
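As a concrete illustration of that suggestion, scikit-learn's resample utility does with-replacement (bootstrap) sampling; here is a small sketch, with a made-up list of records standing in for your parsed citation records:

```python
from sklearn.utils import resample

# Stand-ins for parsed citation records (e.g. author name + title fields).
records = [{"author": "J. Smith", "title": f"Paper {i}"} for i in range(2000)]

# One bootstrap sample: draw with replacement until we have n_samples records;
# repeat with different seeds to get several resampled test sets.
bootstrap_samples = [
    resample(records, replace=True, n_samples=len(records), random_state=seed)
    for seed in range(10)
]
```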
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have a feature vector for each author / publication. Now you give the classifier two of those feature vectors, and it classifies them as "this is the same author" or "these are different authors".
Training / testing data
Having a binary classification problem, the testing suddenly becomes simple: just use one of the measures so often used in the literature (accuracy, precision, recall, confusion matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically, and the authors have an identifier? Then you can simply generate negative examples by pairing different authors and positive examples by checking whether the identifier is the same.
Otherwise, you can have a look at http://dblp.uni-trier.de/. Although there are likely many publications listed under the same author that should actually be attributed to different people, they do distinguish authors not only by name but also give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.
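To make the pair-classification framing concrete, here is a small scikit-learn sketch (the records and features are made-up placeholders; real features would come from your XML records): build positive pairs from records that share an author identifier and negative pairs from records that don't, train a classifier on the pair features, and evaluate with precision and recall.

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical parsed records: (author_id, feature_vector).
rng = np.random.default_rng(0)
records = [(i % 50, rng.normal(loc=i % 50, size=4)) for i in range(300)]

# Pair feature: absolute difference of the two record vectors.
# Label: 1 if both records carry the same author identifier, else 0.
X, y = [], []
for (id_a, f_a), (id_b, f_b) in itertools.combinations(records, 2):
    X.append(np.abs(f_a - f_b))
    y.append(int(id_a == id_b))
X, y = np.array(X), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```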

Suitability of the Naive Bayes classifier in Mahout for classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
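As an illustration of the idea (sketched in scikit-learn rather than Mahout purely for compactness; in Mahout you would do the analogous thing by emitting the extra attributes as additional terms in the training vectors): combine bag-of-words features from the page body with tokens derived from the URL, then feed the combined vector to a Naive Bayes classifier.

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import FunctionTransformer

# Toy examples: (page text, URL) pairs with a broad category label.
pages = [
    ("breaking news election results coverage", "http://cnn.com/politics"),
    ("latest football scores and match reports", "http://espn.com/soccer"),
    ("stock markets rally as earnings beat", "http://cnn.com/business"),
    ("player transfer rumours and league tables", "http://bbc.com/sport"),
]
labels = ["news", "sports", "news", "sports"]

get_text = FunctionTransformer(lambda rows: [t for t, _ in rows])
get_url = FunctionTransformer(lambda rows: [u for _, u in rows])

features = FeatureUnion([
    # Ordinary bag-of-words over the page body.
    ("body", Pipeline([("pick", get_text), ("bow", CountVectorizer())])),
    # Extra contextual signal: tokens taken from the URL itself.
    ("url", Pipeline([("pick", get_url),
                      ("bow", CountVectorizer(token_pattern=r"[a-z]+"))])),
])

model = Pipeline([("features", features), ("nb", MultinomialNB())])
model.fit(pages, labels)
print(model.predict([("transfer news from the premier league",
                      "http://skysports.com/football")]))
```

The same pattern extends to the other contextual attributes mentioned in the question (header tags, hyperlinks, multi-word phrases): each one just contributes more columns to the feature vector before it reaches the classifier.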