I am new to CFGs. Can someone give me tips on creating a CFG that generates a given language?
For example:
L = { w v w~ : |v| is even }, where the alphabet is {a, b} and w~ is the reverse of
the string w.
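As a general tip: split the job between one nonterminal that builds the symmetric outer part and another that builds the middle. Here is a sketch of such a grammar, assuming w~ means the reverse of w and that both w and v may be empty:
S -> a S a | b S b | V
V -> a a V | a b V | b a V | b b V | ε
S generates w and its reverse symbol by symbol around V, and V extends v two symbols at a time, so |v| always stays even.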
I'm totally new to DXL scripting and am trying to get deeper into it. Unfortunately, DXL is not the best-documented language.
First of all, let me describe the situation. I have three different folders. The first folder contains just a start module. The other two folders each contain several more modules.
I want to start in the start module and go through all of its objects, looking for incoming links. If there is an incoming link, I want to follow it to its source object, which is an object in a module in the second folder,
and switch to that object. From there I want to repeat the same procedure until objects in folder 3 are reached, or rather check whether the folder 2 module objects have incoming links from folder 3 module objects.
The reason I want to do this is to check whether each link chain is complete and to count how many chains are complete or incomplete. That means: a folder 1 module object has an incoming link from a folder 2 module object,
and that folder 2 module object in turn has an incoming link from a folder 3 module object.
That's what I've got so far:
pragma runLim, 0

int iAcc = 0
int iRej = 0
int linkChainComplete = 0
int linkChainNotComplete = 0
string startModulPath = "/X/X.1"

Module m
m = read(startModulPath)
Object o
Link l

filtering off
Filter f = hasLinks(linkFilterIncoming, "*")

for o in m do {
    bool isComplete = true
    set(m, f, iAcc, iRej)
    if (iAcc == 0) {
        isComplete = false
    }
    else {
        for l in o <- "*" do {
            Object src = source l
            set(m, f, iAcc, iRej)
            if (iAcc == 0) {
                isComplete = false
            }
        }
    }
    if (isComplete) {
        linkChainComplete++
    }
    else {
        linkChainNotComplete++
    }
}
The first question is, am I on the right path?
One question, or rather problem, is that I want to check whether there are incoming links by using the hasLinks function (see the set(m, f, iAcc, iRej) and if(iAcc == 0) parts above).
But this function refers to the module (m) instead of to the individual objects (o). Is there another way to check whether an object has incoming links or not?
I hope someone can help me. Thank you very much indeed.
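Regarding that last question, here is a minimal standalone sketch that only uses constructs already present in the script above and counts the incoming links on each object directly, instead of going through hasLinks/set on the module. It assumes m is the start module already opened with read(). Note that incoming links are only enumerated this way when the source modules are loaded, and following the chain on into folder 2 and folder 3 would still mean opening the source module of each link (source l) and repeating the check there.
// Standalone sketch: per-object check for incoming links, no module-level filter needed
Object o
Link l
int nIncoming
int linkChainComplete = 0
int linkChainNotComplete = 0

for o in m do {
    nIncoming = 0
    // count the incoming links attached directly to this object
    for l in o <- "*" do {
        nIncoming++
    }
    if (nIncoming == 0) {
        // no incoming link at all, so this chain cannot be complete
        linkChainNotComplete++
    } else {
        linkChainComplete++
    }
}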
I would like to use Lucene to run a nearest-neighbour search. I'm using Lucene 9.0.0 on JVM 11. I did not find much documentation and mainly tried to piece things together from the existing tests.
I wrote a small test which prepares a HnswGraph, but so far the search does not yield the expected result. I set up a set of random vectors and add a final vector (0.99f, 0.01f) which is very close to my search target.
The search unfortunately never returns the expected value. I'm not sure where my error is. I assume it may be related to the insert and document id order.
Maybe someone who is more familiar with Lucene can provide some feedback. Is my approach correct? I'm using Documents only for persistence.
HnswGraphBuilder builder = new HnswGraphBuilder(vectors, similarityFunction, maxConn, beamWidth, seed);
HnswGraph hnsw = builder.build(vectors);

// Run a search
NeighborQueue nn = HnswGraph.search(
    new float[] { 1, 0 },
    10,
    10,
    vectors.randomAccess(), // ? Why do I need to specify the graph values again?
    similarityFunction, // ? Why can I specify a different similarityFunction for search. Should that not be the same that was used for graph creation?
    hnsw,
    null,
    new SplittableRandom(RandomUtils.nextLong()));
The whole test source can be found here:
https://gist.github.com/Jotschi/cea21a72412bcba80c46b967e9c52b0f
I managed to get this working.
Instead of using the HnswGraph API directly, I now use LeafReader#searchNearestVectors. While debugging I noticed that the Lucene90HnswVectorsWriter, for example, invokes extra steps using the HnswGraph API. I assume this is done to create a correlation between the inserted vectors and the document ids. The node ids I retrieved using HnswGraph#search never matched up with the vector ids. I don't know whether extra steps are needed to set up the graph or whether the correlation needs to be created afterwards somehow.
The good news is that the LeafReader#searchNearestVectors method works. I have updated the example, which now also makes use of Lucene documents.
@Test
public void testWriteAndQueryIndex() throws IOException {
    // Persist and read the data
    try (MMapDirectory dir = new MMapDirectory(indexPath)) {
        // Write index
        int indexedDoc = writeIndex(dir, vectors);
        // Read index
        readAndQuery(dir, vectors, indexedDoc);
    }
}
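The readAndQuery step boils down to a per-leaf call to LeafReader#searchNearestVectors. Here is a rough sketch of what such a call can look like; it is not the code from the gist, the field name "field" is an assumption on my part, and the exact searchNearestVectors parameters may differ between 9.x releases:
try (IndexReader reader = DirectoryReader.open(dir)) {
    for (LeafReaderContext ctx : reader.leaves()) {
        // acceptDocs = null: do not filter out any documents
        TopDocs results = ctx.reader().searchNearestVectors("field", new float[] { 0.98f, 0.01f }, 10, null);
        for (ScoreDoc sdoc : results.scoreDocs) {
            System.out.println(sdoc.doc + " => " + sdoc.score);
        }
    }
}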
Vector 7 with [0.97|0.02] is very close to the search query target [0.98|0.01].
Test vectors:
0 => [0.13|0.37]
1 => [0.99|0.49]
2 => [0.98|0.57]
3 => [0.23|0.64]
4 => [0.72|0.92]
5 => [0.08|0.74]
6 => [0.50|0.27]
7 => [0.97|0.02]
8 => [0.90|0.21]
9 => [0.89|0.09]
10 => [0.11|0.95]
Doc Based Search:
Searching for NN of [0.98 | 0.01]
TotalHits: 11
7 => [0.97|0.02]
9 => [0.89|0.09]
Full example:
https://gist.github.com/Jotschi/d8a91758c84203d172f818c8be4964e4
Another way to solve this is to use the KnnVectorQuery.
try (IndexReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    System.out.println("Query: [" + String.format("%.2f", queryVector[0]) + ", " + String.format("%.2f", queryVector[1]) + "]");
    TopDocs results = searcher.search(new KnnVectorQuery("field", queryVector, 3), 10);
    System.out.println("Hits: " + results.totalHits);
    for (ScoreDoc sdoc : results.scoreDocs) {
        Document doc = reader.document(sdoc.doc);
        StoredField idField = (StoredField) doc.getField("id");
        System.out.println("Found: " + idField.numericValue() + " = " + String.format("%.1f", sdoc.score));
    }
}
Full example:
https://gist.github.com/Jotschi/7d599dff331d75a3bdd02e62f65abfba
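For this query to find anything, the documents have to be indexed with a KnnVectorField under the same field name. A minimal sketch of the indexing side, not taken from the gist: the names "field" and "id" mirror the query code above, similarityFunction is the same one used when building the graph, and vectorData is a hypothetical plain float[][] holding the test vectors:
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
    for (int i = 0; i < vectorData.length; i++) {
        Document doc = new Document();
        // the vector itself, indexed for approximate nearest-neighbour search
        doc.add(new KnnVectorField("field", vectorData[i], similarityFunction));
        // a stored id so a hit can be mapped back to the original vector
        doc.add(new StoredField("id", i));
        writer.addDocument(doc);
    }
}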
Is there a way to convert a CoNLL file into a list of Doc objects without having to parse the sentences using the nlp object? I have a list of annotations that I have to pass to an automatic component that uses Doc objects as input. I have found a way to create the doc:
doc = Doc(nlp.vocab, words=[...])
I also found that I can use the from_array function to recreate the other linguistic features. The array can be built using index values from the StringStore object. I have successfully created a Doc object with LEMMA and TAG information, but I cannot recreate the HEAD data. My question is how to pass the HEAD data to a Doc object using the from_array method.
The confusing thing about HEAD is that for a sentence with this structure:
Ona 2
je 2
otišla 2
u 4
školu 2
. 2
The output of this code snippet:
from spacy.attrs import TAG, HEAD, DEP
doc.to_array([TAG, HEAD, DEP])
is:
array([[10468770234730083819, 2, 429],
[ 5333907774816518795, 1, 405],
[11670076340363994323, 0, 8206900633647566924],
[ 6471273018469892813, 1, 8110129090154140942],
[ 7055653905424136462, 18446744073709551614, 435],
[ 7173976090571422945, 18446744073709551613, 445]],
dtype=uint64)
I cannot correlate the center column of the to_array output with the dependency tree structure given above.
Thanks in advance for the help,
Daniel
Ok, so I finally cracked it. The middle HEAD column stores the relative offset head - index as an unsigned 64-bit integer, so when the head comes before the token the negative offset wraps around to 18446744073709551616 - (index - head); that is why token 4 with head 2 shows up as 18446744073709551614 above. This is the function that I used, in case anyone else needs it:
import numpy as np
from spacy.attrs import TAG, HEAD, DEP
from spacy.tokens import Doc

docs = []
for sent in sents:
    generated_doc = Doc(doc.vocab, words=[word["word"] for word in sent])
    heads = []
    for idx, word in enumerate(sent):
        if word["pos"] not in doc.vocab.strings:
            doc.vocab.strings.add(word["pos"])
        if word["dep"] not in doc.vocab.strings:
            doc.vocab.strings.add(word["dep"])
        # HEAD is the relative offset head - index; negative offsets wrap
        # around in the unsigned 64-bit representation
        if word["head"] >= idx:
            heads.append(word["head"] - idx)
        else:
            heads.append(18446744073709551616 - (idx - word["head"]))
    np_array = np.array(
        [np.array([doc.vocab.strings[word["pos"]], heads[idx], doc.vocab.strings[word["dep"]]], dtype=np.uint64)
         for idx, word in enumerate(sent)],
        dtype=np.uint64)
    generated_doc.from_array([TAG, HEAD, DEP], np_array)
    docs.append(generated_doc)
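To sanity-check the result, it can help to print the reconstructed dependency structure of one of the generated docs and compare it with the original CoNLL heads; a quick ad-hoc check, not part of the conversion itself:
# token index, text, head index and dependency label of the reconstructed doc
for token in docs[0]:
    print(token.i, token.text, token.head.i, token.dep_)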
I am currently attempting to train an NER model centered around property descriptions. I got a fully trained model to function to my liking; however, I now want to add a retokenize pipe to the model so that I can set the model up to train other things.
From there, I am having issues getting the retokenize pipe to actually work. Here is the definition:
def retok(doc):
    ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
    with doc.retokenize() as retok:
        string_store = doc.vocab.strings
        for start, end, label in ents:
            retok.merge(
                doc[start:end],
                attrs=intify_attrs({'ent_type': label}, string_store))
    return doc
I am adding it into my training like this:
nlp.add_pipe(retok, after="ner")
and I am adding it into the Language Factories like this:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
The issue I keep getting is "AttributeError: 'English' object has no attribute 'ents'". I am assuming I am getting this error because the parameter being passed to this function is not a Doc but the nlp model itself. I am not really sure how to get a doc to flow into this pipe during training. At this point I don't really know where to go from here to get the pipe to function the way I want.
Any help is appreciated, thanks.
You can potentially use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities
The example copied from the docs:
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:
def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc
P.S. You are passing nlp to retok() below, which is where the error is coming from:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
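A sketch of what the registration could look like instead, so that the factory hands back the component itself and retok only ever receives Doc objects (this follows the spaCy v2 factory convention; adjust the name if you register it differently when loading the model):
# the factory should return the doc -> doc callable, not call it with nlp
Language.factories['retok'] = lambda nlp, **cfg: retok
nlp.add_pipe(retok, after="ner")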
See a related question: Spacy - Save custom pipeline
I've got a ByteString that I need to send over XHR in GHCJS, but I can't for the life of me figure out how to get that ByteString into XHR's RequestData
data RequestData = NoData
                 | StringData JSString
                 | TypedArrayData (forall e. SomeTypedArray e Immutable)
                 | FormData [(JSString, FormDataVal)]
Obviously TypedArrayData is what I need to use, but I'm having no luck at all figuring out how to convert a ByteString to fit in there. I looked at this, and I tried something like this.
setData r bs = do
  let (b, _, _) = fromByteString $ toStrict bs
  return r { XHR.reqData = XHR.TypedArrayData $ getUint8Array b }
But for some reason, I'm getting a weird error with the kinds.
Couldn't match kind ‘AnyK’ with ‘*’
Expected type: GHCJS.Buffer.Types.SomeBuffer
                 'ghcjs-base-0.2.0.0:GHCJS.Internal.Types.Immutable
  Actual type: Buffer
In the first argument of ‘getUint8Array’, namely ‘b’
In the second argument of ‘($)’, namely ‘getUint8Array b’
As far as I can tell, there's no reason these types should be incompatible.