Searching for semantic relatedness tool - semantics

I need a tool that computes the semantic relatedness between two words. Do you have any ideas about a tool or source code that does this? I am trying WordNet Similarity (http://maraca.d.umn.edu/cgi-bin/similarity/similarity.cgi), but it is missing several words; I need something richer in terms of concepts.

What you described is more like a research project. :-)
If they are just words, not phrases, the most recent technology is word embedding. You can think of it as converting words to high-dimensional vectors (from 200 to 1000 dimensions) by training on millions of documents.
https://code.google.com/archive/p/word2vec
The code has been archived for proprietary reasons, but you can still download it and run it yourself. Good luck.
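If you just want word-to-word relatedness scores, a minimal sketch with Gensim might look like the following; the vector file name is a placeholder for whichever pretrained embeddings you download (for example, the GoogleNews word2vec vectors from the link above):

from gensim.models import KeyedVectors

# load pretrained word2vec vectors (placeholder path; e.g. the GoogleNews .bin file)
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# cosine similarity between two words as a relatedness score
print(vectors.similarity("car", "automobile"))
print(vectors.most_similar("car", topn=5))   # nearest neighbours in the embedding space

The similarity score is the cosine similarity of the two word vectors, so it ranges roughly from -1 to 1, with higher values meaning more related.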

Feed large text to PyTextRank

I would like to use PyTextRank for keyphrase extraction. How can I feed 5 million documents (each document consisting of a few paragraphs) to the package?
This is the example I see on the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
ic(phrase.rank, phrase.count, phrase.text)
ic(phrase.chunks)
Is my only option to concatenate the several million documents into a single string and pass it to nlp(text)? I do not think I could use nlp.pipe(texts), because I want to create one network by computing words/phrases from all documents.
No, instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytextrank to extract entities.
For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45
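As a rough sketch of that pattern (not the exact code in the link above), parallelizing with Ray could look something like this; the spaCy model name, the batch size, and the way results are collected are assumptions to adapt for your setup:

import ray
import spacy
import pytextrank

ray.init()

@ray.remote
def extract_phrases(batch_of_texts):
    # each worker builds its own pipeline; loaded spaCy pipelines don't ship well between processes
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    results = []
    for text in batch_of_texts:
        doc = nlp(text)
        results.append([(p.text, p.rank) for p in doc._.phrases[:10]])
    return results

# documents: your millions of texts, split into manageable batches (placeholder batching)
documents = ["first document ...", "second document ..."]
batches = [documents[i:i + 1000] for i in range(0, len(documents), 1000)]
futures = [extract_phrases.remote(batch) for batch in batches]
all_phrases = [phrases for batch in ray.get(futures) for phrases in batch]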
One question would be how you are associating the extracted entities with documents. Are these being collected into a dataset, or perhaps a database or key/value store?
However these results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project kglab https://github.com/DerwenAI/kglab was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/
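For the co-occurrence graph itself, here is a minimal sketch using networkx as a generic stand-in (the kglab notebooks linked above show a richer, RDF-backed approach); phrases_per_doc stands for whatever per-document phrase lists you end up collecting:

from itertools import combinations
import networkx as nx

# phrases_per_doc: one list of extracted phrases per document (placeholder data)
phrases_per_doc = [
    ["linear constraints", "natural numbers", "minimal set"],
    ["linear constraints", "minimal set", "diophantine equations"],
]

graph = nx.Graph()
for phrases in phrases_per_doc:
    for a, b in combinations(sorted(set(phrases)), 2):
        # increment the edge weight each time two phrases co-occur in a document
        weight = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
        graph.add_edge(a, b, weight=weight + 1)

print(graph.edges(data=True))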
FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank and there are several videos online (under Graph Data Science) for previous tutorials at conferences. We also have monthly public office hours through https://www.knowledgegraph.tech/ – message me #pacoid on Tw for details.

How to successfully convert math papers to plain text

Goals:
1. Develop a canonical method to use plain text to uniquely represent STEM papers in general and math papers in particular.
2. Develop software that can convert existing typed STEM papers into that canonical form with 100% accuracy. Note that I can't tolerate any inaccuracy, simply because as a single individual I can't proofread millions of papers to correct conversion errors, even at a rate of 0.001 errors per paper on average.
Problems:
1. All PDF-to-text, TeX-to-text, etc. programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, do not really work, because of math symbols that cannot be processed.
2. PDF is really hard to process.
3. TeX is really hard to process because of the numerous macros STEM paper authors tend to add to their source files, which tend to break LaTeXML and other converters. It is very easy to process my own papers because I don't use a lot of new commands, but there are many authors whose papers contain \def macros that cannot even be processed by de-macro. To actually get TeX to work, assuming I can even get the source files of most papers on arXiv at all, I would pretty much have to write my own variant of a TeX engine that somehow expands all required macros and produces a plain text document.
Is there any other way to solve this problem? Currently, the target format I prefer is pretty much just plain text plus math symbols written in LaTeX, with no formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I could learn to set up a neural network and train it to recognize these printed math symbols, assuming my laptop is sufficiently powerful. There are literally fewer than 200 symbols for the network to learn, and their shapes should be very easy to recognize due to the lack of variation. Shall I do that?
Yes, you can try that: recognition of the symbols, with subsequent transformation of them into LaTeX format (for example, writing \sqrt for every square root).
For the recognition problem itself, you can refer to this paper:
https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y
"Recognition of handwritten symbols" by Torfinn Taxt, Jórunn B. Ólafsdóttir, and Morten Dæhlen
http://neuralnetworksanddeeplearning.com/chap1.html - here you can find out more, with code samples, about implementing a neural network for recognizing handwritten characters.
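As a rough illustration of the kind of classifier suggested above, here is a minimal sketch in Keras; the 200-class symbol inventory, the 32x32 glyph images, and the layer sizes are assumptions, and loading the actual training data is left as a placeholder:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SYMBOLS = 200   # assumed size of the symbol inventory (\sqrt, \mathcal{A}, ...)
IMG_SIZE = 32       # assumed size of each cropped glyph image

def build_symbol_classifier():
    # small CNN: enough capacity for a few hundred low-variation printed glyphs
    model = models.Sequential([
        layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_SYMBOLS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# x_train: (N, 32, 32, 1) arrays of glyph images, y_train: integer symbol ids (placeholders)
# model = build_symbol_classifier()
# model.fit(x_train, y_train, epochs=10, validation_split=0.1)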

IPA (International Phonetic Alphabet) Transcription with Tensorflow

I'm looking into designing a software platform that will aid linguists and anthropologists in their study of previously unstudied languages. Statistics show that around 1,000 languages exist that have never been studied by a person outside of their respective speaker groups.
My goal is to utilize TensorFlow to make a platform that will allow linguists to study and document these languages more efficiently, and to help them create writing systems for the ones that don't have one already. One of their current methods of accomplishing such a task is three-fold: 1) record a native speaker conversing in the language, 2) listen to that recording and try to transcribe it into the IPA, and 3) from the phonetics, analyze the phonemics and phonotactics of the language to eventually create a writing system for the speakers.
My proposed platform would cut that research time down from a minimum of a year to a maximum of six months. Before I start, I have some questions...
What would be required to train TensorFlow to transcribe live audio into the IPA? Has this already been done, and if so, how would I utilize a previous solution for this project? Is a project like this even possible with TensorFlow? If not, what would you recommend using instead?
My apologies for the magnitude of this question. I don't have much experience in the realm of machine learning, as I am just beginning the research process for this project. Any help is appreciated!
I guess I will take a first shot at answering this. Since the question is pretty general, my answer will have to be pretty general as well.
What would be required? At the very least, you would need a large dataset of pre-transcribed data, ideally a large amount of spoken-language audio mapped to characters in the phonetic alphabet, so the system could learn the sound of individual characters rather than whole transcribed words. If such a dataset doesn't exist, a less granular dataset could be used, mapping single words to their transcriptions. Then you would need a model, that is, the actual neural network architecture implemented in code. And lastly you would need some computing resources. This is not something you can train casually; you would either have to buy some time on a cloud-based machine learning platform (like Google Cloud ML) or build a fairly expensive machine to train at home.
Has this been done? I don't know, but I don't think so. There have been published papers reporting various degrees of success at training systems to transcribe speech; here is one, for example: http://deeplearning.stanford.edu/lexfree/lexfree.pdf . Since the alphabet you want to transcribe to is specifically designed to capture the way words sound, rather than just write the words down, you might have more success at training such a model.
Is it possible with TensorFlow? Yes, most likely. TensorFlow is well suited for implementing most modern deep learning architectures. Unless you end up designing a really weird and very original model for this purpose, TensorFlow should work just fine.
Edit: after some thought about part 1, you would probably have to use a dataset mapping spoken words to their transcriptions, since I expect the same sound pronounced in isolation would differ from the same sound used within a word.
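To make the "model" part a bit more concrete, here is a minimal sketch of the kind of sequence model commonly used for this sort of transcription task (audio feature frames in, per-frame scores over IPA symbols out, typically trained with a CTC loss); the feature dimension, symbol inventory size, and layer sizes are assumptions for illustration only:

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_IPA_SYMBOLS = 100   # assumed size of the IPA symbol inventory
FEATURE_DIM = 13        # assumed MFCC feature dimension per audio frame

def build_transcription_model():
    # sequence of audio feature frames -> per-frame distribution over IPA symbols (+ CTC blank)
    return models.Sequential([
        layers.Input(shape=(None, FEATURE_DIM)),                  # variable-length audio
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dense(NUM_IPA_SYMBOLS + 1, activation="softmax"),  # +1 for the CTC blank token
    ])

model = build_transcription_model()
# Training would pair these per-frame outputs with IPA label sequences via a CTC loss
# (e.g. tf.nn.ctc_loss), so the model can learn alignments without frame-level labels.
dummy_audio = tf.random.normal((1, 200, FEATURE_DIM))             # 200 frames of fake features
print(model(dummy_audio).shape)                                   # (1, 200, NUM_IPA_SYMBOLS + 1)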
This has actually been done, albeit in PyTorch, by a group at CMU: https://github.com/xinjli/allosaurus
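If that project fits your use case, its basic usage is quite compact; this is based on the project's README, so check the repository for the current API:

# pip install allosaurus
from allosaurus.app import read_recognizer

model = read_recognizer()               # loads the pretrained universal phone recognizer
print(model.recognize("sample.wav"))    # prints a sequence of IPA phones for the recording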

LSA Similarity interface

I am a PhD student in translation studies and I am currently working on my dissertation. I am using the LSA Similarity interface as a method of analysis in my dissertation. My background is in linguistics, not computer science. I tried to find an easy LSA document categorisation tool but I could not find any. I tried to play with Gensim, but it did not work; I think my problem is with linking my corpus (txt files) to the Gensim tool to do the analysis (I don't know how to do this step). I would greatly appreciate it if anyone could help me with the analysis or direct me to any tool or easy tutorial for doing it with Gensim.
I want to do the following: apply document-document queries to retrieve the 5 documents from the corpus most relevant to the query document.
I have 15 query documents.
I have one corpus of 150 texts; the texts are short stories.
I am desperate, and I was hesitant to post this question here. I am sure that applying LSA in translation studies would add to the field, and this makes me more persistent in finding a way to do my analysis.
The only really easy, user-friendly tool for LSA that is out there right now is http://lsa.colorado.edu/ . Unfortunately, it is a web-based tool only, and it does not allow you to train LSA on your own corpora. But depending on your needs, that may not matter.
If I'm understanding you correctly, you need document-document similarity scores between each of 15 query documents and each of 150 short stories (a total of 15*150=2250 similarity scores). If these query documents and short stories are in English, then you can use the version of LSA that is trained on the TASA corpus used in many studies of LSA as follows:
Go to http://lsa.colorado.edu/
Select One-To-Many Comparison
Copy-paste one of the short stories in the "Main text" box, and the 15 queries separated with a blank line in the "Texts to compare" box
Repeat for each of your short stories. A huge pain? Yes. But if you are desperate...
If you program a little bit in Python or R, other tools for LSA include http://clic.cimec.unitn.it/composes/toolkit/introduction.html and http://cran.r-project.org/web/packages/lsa/lsa.pdf , and would save you the manual labor of the above suggestion. Also, I know you already tried Gensim, but there is a nice tutorial for it at http://radimrehurek.com/gensim/tutorial.html that you might try following if you haven't already.
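Since you mentioned Gensim specifically, here is a minimal sketch of the document-to-document workflow with it, following the Gensim tutorial linked above; the file paths, the crude tokenization, and the number of topics are placeholders you would adapt to your corpus:

import glob
from gensim import corpora, models, similarities

def load_texts(pattern):
    # read each .txt file and tokenize very crudely (lowercased whitespace split)
    texts = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            texts.append(f.read().lower().split())
    return texts

stories = load_texts("corpus/*.txt")    # the 150 short stories (placeholder path)
queries = load_texts("queries/*.txt")   # the 15 query documents (placeholder path)

dictionary = corpora.Dictionary(stories)
corpus = [dictionary.doc2bow(text) for text in stories]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=100)   # 100 topics is a guess
index = similarities.MatrixSimilarity(lsi[corpus])

for qi, query in enumerate(queries):
    sims = index[lsi[dictionary.doc2bow(query)]]
    top5 = sorted(enumerate(sims), key=lambda x: -x[1])[:5]
    print(f"query {qi}: most similar story ids and scores: {top5}")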

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroups test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
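To make "the input is just a feature vector" concrete, here is a small sketch of the idea using scikit-learn rather than Mahout, purely for brevity; the toy data and the trick of appending URL tokens to the page text are just one illustration of injecting extra attributes into the same bag-of-features representation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def page_to_features(url, page_text):
    # append URL tokens as extra "words" so they land in the same feature vector as the text
    url_tokens = " ".join("URL_" + t for t in url.replace("/", ".").split(".") if t)
    return page_text + " " + url_tokens

# toy training data (placeholder examples, not a real dataset)
pages = [
    page_to_features("cnn.com", "breaking news politics world coverage"),
    page_to_features("espn.com", "football scores league match results"),
]
labels = ["news", "sports"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(pages, labels)

print(model.predict([page_to_features("bbc.co.uk", "headlines election coverage")]))

The same idea carries over to Mahout: anything you can turn into a token or a numeric feature (header tags, bigrams, URL parts) can simply be added to the vector before training.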