I've found a grammar online which I want to rewrite to BNF so I can use it in a grammatical evolution experiment. From what I've read online BNF is given by this form:
<symbol> ::= <expression> | <term>
...but I don't see where probabilities factor into it.
In a probabilistic context-free grammar (PCFG), every production is also assigned a probability. How you choose to write this probability is up to you; I don't know of a standard notation.
Generally, the probabilities are learned rather than assigned, so the representation issue doesn't come up; the system is given a normal CFG as well as a large corpus with corresponding parse trees, and it derives probabilities by analysing the parse trees.
Note that PCFGs are usually ambiguous. Probabilities are not used to decide whether a sentence is in the language but rather which parse is correct, so with an unambiguous grammar, the probabilities would be of little use.
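If it helps to see one concrete notation, NLTK's PCFG class writes the probability of each production in square brackets after the rule, with the probabilities for each left-hand side summing to 1. A minimal sketch with a purely illustrative toy grammar:

# Toy PCFG in NLTK's bracketed-probability notation.
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | N [0.4]
    VP -> V NP [0.7] | V [0.3]
    Det -> 'the' [1.0]
    N -> 'dog' [0.5] | 'cat' [0.5]
    V -> 'chased' [0.6] | 'slept' [0.4]
""")

parser = ViterbiParser(grammar)  # picks the most probable parse
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)

NLTK also provides induce_pcfg, which estimates these probabilities from a treebank's productions, matching the "learned from parse trees" setup described above.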
So I've read the paper named Self-Governing Neural Networks for On-Device Short Text Classification which presents an embedding-free approach to projecting words into a neural representation. To quote them:
The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters. [...] our method is a truly embedding-free approach unlike majority of the widely-used state-of-the-art deep learning techniques in NLP
Basically, from what I understand, they proceed as follows:
You'd first need to compute n-grams on the words' characters (side question: is that skip-gram in the old sense, or skip-gram as in word2vec? I assume the former for what follows) to obtain a featurized representation of the words in a text. For example, with 4-grams you could get a roughly 1M-dimensional sparse feature vector per word. Fortunately, it's sparse, so memory needn't be fully used for that, because it's almost one-hot (or count-vectorized, or tf-idf-vectorized n-grams with lots of zeros).
Then you'd need to hash those sparse n-gram vectors using locality-sensitive hashing (LSH); from what I've understood, they use random projection. Also, instead of full n-gram vectors, they use tuples of (n-gram feature index, value) for the non-zero n-gram features (which is, by definition, a "sparse matrix" computed on the fly, e.g. from a default dictionary of non-zero features instead of a full vector).
I found an implementation of random projection in scikit-learn. From my tests, it doesn't seem to yield a binary output, although the whole thing does use sparse on-the-fly computations within scikit-learn's sparse matrices, as expected for a memory-efficient (non-zero, dictionary-like features) implementation, I guess.
What I can't work out, and where my question lies, is how they end up with binary features from the sparse projection (the hashing). They seem to be saying that the hashing is done at the same time as computing the features, which is confusing: I would have expected the hashing to come in the order I wrote above, as separate steps, but their steps 1 and 2 seem to be somehow merged.
My confusion arises mostly from the paragraphs starting with the phrase "On-the-fly Computation." on page 888 (page 2 of the PDF), in the right column of the paper.
I'd like to bring my school project to success (I'm trying to mix BERT with SGNNs instead of using word embeddings). So, how would you demystify that? More precisely, how could a similar random hashing projection be achieved with scikit-learn, TensorFlow, or PyTorch? Trying to connect the dots here, I've researched quite a bit, but their paper doesn't give implementation details, which is what I'd like to reproduce. I at least know that the SGNN uses 80 fourteen-dimensional LSHes on character-level n-grams of words (is my understanding right in the first place?).
Thanks!
EDIT: after starting to code, I realized that the output of scikit-learn's SparseRandomProjection() looks like this:
[0.7278244729081154,
-0.7278244729081154,
0.0,
0.0,
0.7278244729081154,
0.0,
...
]
For now, this looks fine: it's closer to binary, and it could be cast to an integer instead of a float by using the right scaling factor in the first place. I still wonder about the skip-gram thing; I assume n-grams of the characters of words for now, but that's probably wrong. Will post code to GitHub soon.
EDIT #2: I coded something here, but with n-grams instead of skip-grams: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
More discussion threads on this here: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues?q=is%3Aissue
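For reference, here is a minimal sketch of how I currently understand steps 1 and 2 with scikit-learn (the char n-gram settings and the sign binarization at the end are my own assumptions, not something the paper states explicitly):

# Step 1: character n-gram count features; Step 2: sparse random projection.
# The final binarization by sign is my assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

sentences = ["What time is my flight?", "Book a table for two."]

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
X = vectorizer.fit_transform(sentences)  # scipy sparse matrix of n-gram counts

projector = SparseRandomProjection(n_components=80 * 14, random_state=0)  # 80 x 14 dims, as I read the paper
X_proj = projector.fit_transform(X)

X_bin = (X_proj.toarray() > 0).astype(int)  # ~binary features via the sign
print(X_bin.shape)  # (2, 1120)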
First of all, thanks for your implementation of the projection layer, it helped me get started with my own.
I read your discussion with @thinline72, and I agree with him that the features are calculated on the whole line of text, char by char, not word by word. I am not sure this difference in features is very relevant, though.
Answering your question: my reading is that they do steps 1 and 2 separately, as you suggested and did. It's true that, in the excerpt you include, they talk about hashing both in feature construction and in projection, but I think those are two different hashes. And my interpretation is that the first hashing (feature construction) is handled automatically by the CountVectorizer.
Feel free to take a look at my implementation of the paper, where I built the end-to-end network and trained it on the SwDA dataset, with the same split as in the SGNN paper. I obtain a maximum of 71% accuracy, which is somewhat lower than what the paper claims. I also used the binary hasher that @thinline72 recommended, and nltk's implementation of skipgrams (I am quite certain the SGNN paper is talking about "old" skipgrams, not word2vec-style skipgrams).
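To make that concrete, here is a rough sketch of the feature side as I describe it (char-level "old" skipgrams over the whole line, hashed into binary buckets per projection); the hashing trick below is a simple stand-in of my own, not necessarily the exact hasher thinline72 suggested:

# Char-level skipgrams of a whole line, hashed into d projections of t binary buckets each.
import hashlib
from nltk.util import skipgrams

def binary_projection(line, d=80, t=14, n=3, k=2):
    # d, t, n, k are assumed hyperparameters (80 projections of 14 bits, 3-skip-grams with skip 2).
    feats = set(skipgrams(list(line), n, k))  # char-level skipgrams over the whole line
    out = [0] * (d * t)
    for gram in feats:
        key = "".join(gram).encode("utf8")
        for j in range(d):
            h = int(hashlib.md5(key + bytes([j])).hexdigest(), 16)  # salt each projection differently
            out[j * t + h % t] = 1  # set one bit in projection j
    return out

vec = binary_projection("what time is my flight tomorrow")
print(sum(vec), len(vec))  # number of active bits out of 1120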
According to documentation:
spaCy's small models (all packages that end in sm) don't ship with
word vectors, and only include context-sensitive tensors. [...]
individual tokens won't have any vectors assigned.
But when I use the de_core_news_sm model, the tokens do have entries for x.vector, and x.has_vector is True.
It looks like these are context vectors, but as far as I understood the documentation, only word vectors are accessible through the vector attribute, and sm models should have none. Why does this work for a "small model"?
has_vector behaves differently than you expect.
This is discussed in the comments on an issue raised on GitHub. The gist is that, since vectors are available, has_vector is True, even though those vectors are context vectors. Note that you can still use them, e.g. to compute similarity.
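For example, with spaCy 2.x and de_core_news_sm installed, a quick check looks something like this:

# Small German model: no word vectors, but context tensors exposed via .vector
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Berlin ist eine schöne Stadt.")
token = doc[0]

print(token.has_vector)           # True, even though this is a context tensor, not a word vector
print(token.vector.shape)         # the context-sensitive tensor row behind .vector
print(doc[0].similarity(doc[4]))  # similarity still works on top of these vectors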
Quote from spaCy contributor Ines:
We've been going back and forth on how the has_vector should behave in
cases like this. There is a vector, so having it return False would be
misleading. Similarly, if the model doesn't come with a pre-trained
vocab, technically all lexemes are OOV.
Version 2.1.0 has been announced to include German word vectors.
I'm quite new to text mining and I'm challenging myself to do sentiment analysis today. But I've encountered some problems while doing it.
In my language, a word can have several different meanings. For example, "setan" means: 1) devil, 2) a curse word. How do I handle this ambiguity in sentiment analysis?
Also, for everyone's information, the algorithm I use is a Naive Bayes classifier, and for tooling I'm using RapidMiner.
I need your help. Any tips would be great. Thank you!
Training a Naive Bayes classifier on your data makes the model assign a probability to each word for every class you are trying to classify. In your case, since it's sentiment analysis, if you have Positive and Negative as the two classes, you would have a probability of setan appearing in Positive text and a probability of it appearing in Negative text.
Keeping this in mind, if a word has multiple meanings that could account for both positive and negative sentiment, make sure to include both kinds of instances in your data, so that while training the model the corresponding probabilities are learned and then used to classify new text into the Positive or Negative class.
In your case, it seems like both meanings of setan have a negative connotation, so this really shouldn't be a problem. Words like "the" and "a", which are present in both Positive and Negative instances and are famously called stopwords, should be removed since they don't really contribute to the classification.
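If it helps to see the idea outside RapidMiner, here is a tiny sketch with scikit-learn on a made-up toy corpus, just to show that training produces per-class word probabilities for an ambiguous word like "setan":

# Toy, made-up examples (not real data) with "setan" appearing in both classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "dasar setan kau",              # cursing usage  -> negative
    "film tentang setan itu seru",  # "devil" usage  -> positive
    "pelayanannya buruk sekali",    # negative
    "saya sangat puas",             # positive
]
labels = ["neg", "pos", "neg", "pos"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["setan, lambat sekali"]))  # uses the learned per-class word probabilities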
If you are trying to train the model using the word meanings specifically, you can refer to this paper: https://pdfs.semanticscholar.org/fc01/b42df3077a512620456d8a2714951eccbd67.pdf.
I am new to the Semantic Web Rule Language (SWRL) and I am writing some rules to calculate probabilities for discrete and continuous distributions.
I know that with SWRL I can do subtraction, addition, multiplication, and division.
But what about exponentiation, summation, or the evaluation of mathematical functions? Is there a way to do this in SWRL?
Just an example to frame my question:
For example, for the triangular distribution we only need basic arithmetic (subtraction and division), but for the beta distribution we need exponentiation and evaluation of the beta function.
Is there a way to do this in SWRL?
Thanks
The standard describes what math functions should be available, and these include exponentiation:
8.2. Math Built-Ins
…
swrlb:pow
Satisfied iff the first argument is equal to the result of the second argument raised to the third argument power.
There's no built in for the Beta function, though. You'd need to look into the reasoner that you're using and see whether you can implement additional mathematical builtins.
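For example, with hypothetical class and property names, a rule that computes a parameter raised to the third power with swrlb:pow could look like this:

Distribution(?d) ^ hasAlpha(?d, ?a) ^ swrlb:pow(?p, ?a, 3) -> hasAlphaCubed(?d, ?p)

Here ?p is bound to ?a raised to the power 3, following the built-in's argument order quoted above.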
summation, calculation of mathematical functions
For summations, you may find the aggregate functions in SPARQL useful, but only if the terms you need to sum are available individually. You won't easily be able to express arbitrary sums like ∑_{i=1}^{n} i². You might find support for extension functions in SPARQL implementations, too.
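For instance, if each term you need is stored as its own value (the :value property below is hypothetical), an aggregate query can sum their squares:

PREFIX : <http://example.org/>
SELECT (SUM(?v * ?v) AS ?sumOfSquares)
WHERE { ?x :value ?v }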
I'm thinking so, because the upper bound would be 2^n, and given that these are both finite machines, the equivalence between the n-state NFA and a DFA with 2^n or fewer states will hold.
Am I wrong here?
You're right: 2^n is an upper limit, so the generated DFA can't have more states than that. But that's the worst-case scenario. In most common scenarios there are fewer states than that in the resulting DFA. Sometimes there can even be fewer than in the original NFA.
But as far as I know, an algorithm to predict how many states the resulting DFA will actually have doesn't exist yet. So if you find one, please let me know ;)
That is correct. As you probably already know, both DFAs and NFAs accept exactly the regular languages, which means they are equivalent in the languages they can accept. Also, the most basic way of transforming an NFA into a DFA is subset construction (also called powerset construction), where you simply create a DFA state for every reachable combination of NFA states. These combinations come from the powerset of the NFA's states, so there can be at most 2^n of them.
But, as mentioned by @SasQ, that is the worst-case scenario. Typically you will not end up with that many states, and you can reduce the result further by minimizing it with Hopcroft's algorithm or Brzozowski's algorithm.
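For illustration, here is a minimal sketch of subset construction, assuming an epsilon-free NFA given as a dict mapping state -> symbol -> set of next states:

# Subset (powerset) construction: only reachable subsets become DFA states.
def nfa_to_dfa(transitions, start, accepting, alphabet):
    start_set = frozenset([start])
    dfa_states = {start_set}
    dfa_trans = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        for sym in alphabet:
            nxt = frozenset(s for q in current for s in transitions.get(q, {}).get(sym, set()))
            dfa_trans[(current, sym)] = nxt
            if nxt not in dfa_states:
                dfa_states.add(nxt)
                worklist.append(nxt)
    dfa_accepting = {subset for subset in dfa_states if subset & accepting}
    return dfa_states, dfa_trans, dfa_accepting

# Example NFA over {a, b} accepting strings ending in "ab": 3 NFA states,
# only 3 reachable DFA states, far below the 2^3 = 8 upper bound.
nfa = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
}
states, trans, acc = nfa_to_dfa(nfa, start=0, accepting={2}, alphabet={"a", "b"})
print(len(states))  # 3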