How to successfully convert math papers to plain text - pdf

Goals:
1. Develop a canonical method of using plain text to uniquely represent STEM papers in general and math papers in particular.
2. Develop software that can convert existing typed STEM papers into that canonical form with 100% accuracy. Note that I can't tolerate any inaccuracy, simply because as a single individual I can't proofread millions of papers to correct conversion errors, even at a rate of 0.001 errors per paper on average.
Problems:
1. All PDF-to-text, TeX-to-text, etc. programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, do not really work, due to math symbols that cannot be processed.
2. PDF is really hard to process.
3. TeX is really hard to process because of the numerous macros STEM paper authors tend to add to their source files, which tend to break LaTeXML and other converters. It is very easy to process my own papers because I don't use a lot of new commands. However, there are many authors whose papers contain \def macros that cannot even be processed by de-macro. To actually get TeX to work, assuming I can even get the source files of most papers on arXiv at all, I would pretty much have to write my own variant of the TeX engine that somehow expands all required macros and produces a plain text document.

Is there any other way to solve this problem? Currently the target format I prefer is pretty much just plain text + math symbols written in LaTeX, without any formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities. I could learn to set up a neural network and train it to recognize these printed math symbols, assuming that my laptop is sufficiently powerful. There are literally fewer than 200 symbols for the network to learn, and their shapes should be very easy to recognize due to the lack of variation. Shall I do that?

Yes, you can try that: recognize the symbols and then transform them into LaTeX form (for example, writing \sqrt for every square root).
For the recognition problem you can refer to this paper:
https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y
"Recognition of handwritten symbols" by Torfinn Taxt, Jórunn B. Ólafsdóttir, and Morten Dæhlen.
http://neuralnetworksanddeeplearning.com/chap1.html - here you can find out more, with code samples, about implementing a neural network for handwritten digit recognition.
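If you do go the neural-network route, a minimal sketch of such a classifier might look like the following. This is only an illustration, not a tested pipeline: it assumes you have already rendered each of the ~200 symbols as 32x32 grayscale crops and collected them into arrays of images and integer labels.

    # Minimal sketch of a small CNN classifier for printed math symbols.
    # Assumptions: ~200 classes, 32x32 grayscale crops, tf.keras installed.
    import tensorflow as tf

    num_classes = 200  # assumption: one class per distinct symbol

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # x_train: float32 array of shape (n, 32, 32, 1); y_train: integer labels.
    # model.fit(x_train, y_train, epochs=10, validation_split=0.1)

In practice the hard part is usually not the per-symbol classifier but segmenting a full page into isolated symbols and recovering their layout (subscripts, fractions, matrices).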

Related

QR code-like alternative with extremely low error rate and ability to read bent codes

I'm trying to find an alternative to QR codes (I'd also be willing to accept an entirely novel solution and implement it myself) that meets certain specifications.
First, the codes will often end up on thin pipes, and so need to be readable around a cylinder. The advantage to this is that the effect on the image from wrapping it around a cylinder is easy to express geometrically, and the codes will never be placed on a very irregular shape.
Second, read accuracy must be very high, as any read mistake would be extremely costly. If this means larger codes with more redundancy for better error correction, so be it.
Third, ability to be read by the average smartphone camera from a few inches out.
Fourth, storage space of around half a kilobyte per code.
Do you know of such a code?
The Data Matrix Rectangular Extension (DMRE) improves upon the standard set of rectangular Data Matrix symbol sizes in an algorithmically compatible manner, thus increasing the range of suitable applications with no real downsides.
Reliable cylindrical marking is a primary use case.
Regardless of symbology you will be unable to approach sufficient data density to achieve 0.5KB of binary data in a single compact, narrow symbol scanned using a standard camera phone. However, most 2D symbologies (DMRE included) support a feature called Structured Append that allows chaining of multiple symbols that can be scanned in any order to produce a single read when all components are accounted for.
If the data to be encoded is known to be highly structured (e.g. mostly numeric or alphanumeric) then the internal encoding process of Data Matrix will be more optimised than for general binary data. For example, the largest DMRE symbol (26×64) will provide up to 236 numeric characters, ~175 alphanumeric characters and only 116 bytes.
If the default error recovery rate is insufficient then including a checksum in the data may be appropriate.
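For example, a minimal sketch (assuming a CRC-32 is adequate for your application) of appending and verifying such a checksum around whatever payload you encode into the symbol:

    import zlib

    def add_checksum(payload: bytes) -> bytes:
        # Append a 4-byte CRC-32 of the payload before encoding it in the symbol.
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def verify_checksum(data: bytes) -> bytes:
        # After a scan, reject the read if the trailing CRC-32 does not match.
        payload, crc = data[:-4], data[-4:]
        if zlib.crc32(payload).to_bytes(4, "big") != crc:
            raise ValueError("checksum mismatch - reject this scan")
        return payload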
DMRE has just been voted to be accepted as an ISO/IEC project and will likely become an international standard enjoying broad hardware and software support in due course.
Another option may be to investigate PDF417, which has a broader range of symbol sizes; however, its data density is somewhat lower than that of Data Matrix.
DMRE references: AIM specification and explanatory notes.

Searching for semantic relatedness tool

I need a tool that computes the semantic relatedness between two words. Do you have an idea of a tool or some source code that does this? I have been trying WordNet similarity (http://maraca.d.umn.edu/cgi-bin/similarity/similarity.cgi), but it is missing several words; I need something richer in terms of concepts.
What you described is more like a research project. :-)
If they are just words, not phrases, the most recent technology is word embedding. You can think of it as converting words to high-dimensional vectors (from 200 to 1000 dimensions) by training on millions of documents.
https://code.google.com/archive/p/word2vec
The code has been archived for proprietary reasons, but you can still download it and run it yourself. Good luck.
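Once you have a set of pre-trained vectors (for example the GoogleNews vectors distributed alongside the word2vec code), a minimal sketch of computing relatedness with gensim might look like this; the file name and example words are just illustrative:

    from gensim.models import KeyedVectors

    # Load pre-trained vectors (the GoogleNews binary file is several GB).
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # Cosine similarity between two word vectors, a rough measure of relatedness.
    print(vectors.similarity("car", "automobile"))
    print(vectors.most_similar("professor", topn=5))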

How to build short sentences with a small letter set restriction?

I'm looking for a way to write a program that creates short german sentences with a restricted letter set. The sentences can be nonsense but should grammatically be correct. The following examples only contain the letters "aeilmnost":
"Antonia ist mit Tina im Tal."
"Tamina malt mit lila Tinte Enten."
"Tina nimmt alle Tomaten mit."
For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programmatically creating text is NLG - Natural Language Generation. On the NLG-Wiki I found a large table of NLG systems. I picked two from the list, which could be appropriate:
SimpleNLG - a Java API, which has also an adaption for the german language
KOMET - multilingual generation, from University Bremen
Have you worked with an NLG library, and do you have advice on which one to use for building short sentences with a letter-set restriction?
Can you recommend a paper to this topic?
Grammatical correctness is a pretty fuzzy area, since grammar is not as strictly defined as one might think. What you really want here, though, is a part-of-speech tagger and a Markov chain.
Specifically, a Markov chain says that given a certain state (the first word, for instance) there is a certain chance of moving on to another state (the next word). They are relatively easy to write from scratch, but I've got a gist here in Python that shows how they work if you want an example.
Once you've got that, I would suggest a part-of-speech-based Markov chain, combined with simply checking whether words are built from your desired character set. In general the algorithm would go something like this (a rough sketch in code follows the list):
1. Pick the first word at random, checking that it is constructed solely from your desired set of characters.
2. Use the Markov chain to predict the next word.
3. Check whether that word is an appropriate part of speech and whether it conforms to the desired character set.
4. If not, predict another word until one fits.
5. If so, repeat from step 2 until the sentence is complete.
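Here is a rough Python sketch of that loop. It assumes you already have a trained word-level chain (a dict mapping each word to a list of possible next words); the part-of-speech check is left as a stub to plug your tagger into:

    import random

    ALLOWED = set("aeilmnost")

    def uses_allowed_letters(word):
        return set(word.lower()) <= ALLOWED

    def pos_is_acceptable(word, position):
        # Stub: plug a real part-of-speech tagger / grammar check in here.
        return True

    def generate(chain, max_len=8):
        # Step 1: pick a first word that uses only the allowed letters.
        word = random.choice([w for w in chain if uses_allowed_letters(w)])
        sentence = [word]
        while len(sentence) < max_len:
            # Steps 2-4: predict candidates and keep only acceptable ones.
            candidates = [w for w in chain.get(word, [])
                          if uses_allowed_letters(w)
                          and pos_is_acceptable(w, len(sentence))]
            if not candidates:
                break
            word = random.choice(candidates)
            sentence.append(word)
        return " ".join(sentence)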
Hope that's what you're looking for. Let me know if you have any more questions.
As Slater Tyranus already said, Markov chains certainly form the basis of this task. I am going to suggest a more heavy-duty approach. It is considerably more work, but is likely to give much better results in terms of grammatical correctness.
Language model based on PCFG parse trees: A language model works by assigning a probability to a sequence of words. However, it first needs training data to be built from. In your case, the training process should disregard words containing letters outside the limited set.
While in theory a language model based on parse trees is much more likely to serve your purpose, there is one caveat: given the kind of letter-based restriction you have, data sparsity will certainly raise its ugly head. Backoff techniques (e.g. Katz's backoff model) can help a bit, but it will essentially depend on whether or not you can train on enough data.
As far as readily available parsers are concerned, the Stanford NLP group provides a German parser based on the Negra corpus, as mentioned on their home page.
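To illustrate what a PCFG-based model does, here is a toy example with NLTK; the grammar and probabilities below are invented purely for illustration, whereas a real model would be induced from a treebank such as Negra:

    from nltk import PCFG
    from nltk.parse import ViterbiParser

    # Toy grammar; probabilities per left-hand side must sum to 1.
    grammar = PCFG.fromstring("""
        S   -> NP VP   [1.0]
        NP  -> Det N   [0.4] | N [0.6]
        VP  -> V NP    [1.0]
        Det -> 'alle'  [1.0]
        N   -> 'Tina'  [0.5] | 'Tomaten' [0.5]
        V   -> 'nimmt' [1.0]
    """)

    parser = ViterbiParser(grammar)
    for tree in parser.parse("Tina nimmt alle Tomaten".split()):
        print(tree)
        print("probability:", tree.prob())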

Operating speed of processing planar images vs interleaved images

I am reading a book on advanced iOS programming which says: "It's generally much faster to work on planar formats than interleaved formats. If at all possible, get your data into planar format as soon as you can, and leave it in that format through the entire transformation process."
I got to wondering why that is. The only reason I could come up with was that as you iterate over your pixels, you can use the ++ operator instead of things like w*y+x to calculate your offsets.
I am familiar with planar vs. interleaved as well as RGBA vs. YUV. I've written a lot of C code to do things like image format conversion, rotations, flips, resizing, etc. The conversions were typically either to or from I420.
Is there something I'm overlooking?
Sweeping generalizations like the one you quote are often of limited practical applicability (i.e. almost, but not quite, entirely generalized from one or two data points ;-)
You say you have written a lot of C code for image processing. Good, take it to the next step: learn how to use a profiler, write microbenchmarks to unit-test the performance of your algorithms, especially when you are exploring the solution space.
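As a rough illustration of the kind of microbenchmark meant here (sketched in Python/NumPy rather than C purely to keep it short; the absolute numbers will vary with your machine and workload):

    import numpy as np
    import timeit

    h, w = 1080, 1920

    # Interleaved layout: shape (h, w, 4), RGBA packed per pixel.
    interleaved = np.random.randint(0, 256, (h, w, 4), dtype=np.uint8)

    # Planar layout: shape (4, h, w), one contiguous plane per channel.
    planar = np.ascontiguousarray(interleaved.transpose(2, 0, 1))

    def halve_red_interleaved():
        # Touching one channel means strided access across every pixel.
        return interleaved[:, :, 0] // 2

    def halve_red_planar():
        # The same operation runs over one contiguous plane.
        return planar[0] // 2

    print("interleaved:", timeit.timeit(halve_red_interleaved, number=200))
    print("planar:     ", timeit.timeit(halve_red_planar, number=200))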

Small embedded synthesized speech libraries/suggestions

Are there any easy-to-use free or cheap speech synthesis libraries for PIC and/or ARM embedded systems where code size is more important than speech quality? Nowadays it seems that a 1 meg package is considered "compact", but a lot of microcontrollers are smaller than that. Back in the 1980's Apple hired a contractor to produce Macintalk, which offered reasonable-quality speech in a 26K package which ran on a 7.16MHz 68000, and a program called SAM could produce speech that wasn't quite as good, but still serviceable, with a 16K package that ran on a 1MHz 6502. The SpeakJet runs a speech-synthesis algorithm on some type of PIC.
I probably wouldn't particularly need to produce speech, but would want to be able to speak messages formed from a number of pre-set words. Obviously it would be possible to simply prerecord all the messages, but with a vocabulary of e.g. 100 words, I would think that storing 16K worth of code plus maybe 1K worth of phonetic strings would be more compact than storing audio for 100 words.
Alternatively, if I wanted to store audio for 100 words, what would be the best way of generating a set of words that would flow naturally together? On older-style speech synthesizers, any given word could be spoken three ways: neutral inflection, falling inflection (as if followed by a period), or rising inflection (followed by a question mark). Words with neutral inflection could be spliced together in any order and sound fine. The text-to-wave tools I've found, though, seem to like to add finer details of inflection which sound "off" if words are cut apart and resequenced. Are there any tools which are designed for producing waves that can be concatenated and spliced nicely? If I do use such a tool, what audio format would be best for storing the waves so as to allow efficient decoding on a small microcontroller?
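For concreteness, the kind of splicing I have in mind is just concatenating the raw samples of prerecorded word files, along these lines (a rough sketch, assuming all word files share the same sample rate and sample width; the file names are made up):

    import wave

    def splice(word_files, out_path):
        # Concatenate the audio frames of several WAV files into one message.
        params = None
        frames = []
        for path in word_files:
            with wave.open(path, "rb") as w:
                if params is None:
                    params = w.getparams()
                frames.append(w.readframes(w.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    # e.g. splice(["tank.wav", "pressure.wav", "low.wav"], "message.wav")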
Last time I did this I was able to add hardware like http://www.sparkfun.com/products/9578. There may be patent liabilities in your environment, like the ones I ran into, that force a commercial software stack or an off-the-shelf chip.
Otherwise, I've used http://www.speech.cs.cmu.edu/flite/ for more lenient projects, and it worked well.