How to convert a TriG file to a Turtle file so that I can upload it into Sesame?

{
    ex:repository ex:createdBy ex:repOwner; ex:title "Rep_1".
}
ex:books
{
    ex:book_1 a ex:Science; ex:size "100"; ex:title "Science book 1".
    ex:book_2 a ex:Science; ex:size "1000"; ex:title "Science book 2".
    ex:book_3 a ex:Fantasy; ex:size "100"; ex:title "Fantasy book 1".
}

It is not necessary to convert a TriG file to Turtle to upload it to Sesame, as Sesame supports the TriG format.
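In fact, a TriG file can be loaded into a repository directly; a minimal sketch, assuming an open RepositoryConnection named conn and a base URI string baseURI (both hypothetical names):
File trigFile = new File("/path/to/file.trig");
// the RDFFormat argument tells Sesame to parse the file as TriG, named graphs included
conn.add(trigFile, baseURI, RDFFormat.TRIG);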
Moreover, conversion from TriG to Turtle loses data: TriG is a format that can record quads, so you can put several named graphs in one file, while Turtle only records triples. If you convert TriG to Turtle, you will remove all the named graph information.
Having said all that, conversion from one format to another is simple in Sesame:
// writing to System.out as an example; change to a FileOutputStream to write to a file
RDFWriter turtleWriter = Rio.createWriter(RDFFormat.TURTLE, System.out);
RDFParser trigParser = Rio.createParser(RDFFormat.TRIG);
// link the parser with the writer
trigParser.setRDFHandler(turtleWriter);
File trigFile = new File("/path/to/file.trig");
trigParser.parse(new FileInputStream(trigFile), trigFile.getAbsolutePath());

Related

How to convert PDF with images which I don't care about to text?

I'm trying to convert PDFs to text files. The problem is that those PDFs contain images, which I don't care about (this is the type of file I want to extract: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I copy/paste with my mouse, it works quite well (except for the line breaks), so I'd guess it's possible. Most of the answers I found online work pretty well on dummy PDFs with text only, but give especially bad results on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I get a lot of trash like this:
ERY
CTR
3
CH
A
which appear because of the map.
Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it in a very similar thread on Stack Overflow, but there is no answer):
import sys

import pytesseract as pt
from PIL import Image
from pdf2image import convert_from_path  # this import was missing

def convert(name):
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page' + str(idx) + '.jpg', 'JPEG')
        quote = Image.open('page' + str(idx) + '.jpg')
        text = pt.image_to_string(quote, lang="fra")
        file_ex = open('page' + str(idx) + '.text', "w")
        file_ex.write(text)
        file_ex.close()

if __name__ == '__main__':
    convert(sys.argv[1])
Finally, I tried to remove the images first and then use one of the solutions above, but it didn't work any better:
from tika import parser  # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader

# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
for i in range(src.getNumPages()):
    output.addPage(src.getPage(i))
output.removeImages()
output.write(outputStream)
outputStream.close()

# Read from the PDF without images
raw = parser.from_file('test3.pdf')
print(raw['content'])
Do you know how to solve this? It can be in any language.
Thanks.
One approach you could try is to use a toolkit capable of parsing the text characters in the PDF, then use the object properties to remove the unwanted map labels while keeping the text you need.
For example, the ParsePages method from the LEADTOOLS PDF toolkit (which is what I am familiar with, since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
    PDFParsePagesOptions options = PDFParsePagesOptions.All;
    document.ParsePages(options, 1, -1);
    using (StreamWriter writer = File.CreateText(txtFileName))
    {
        IList<PDFObject> objects = document.Pages[0].Objects;
        writer.WriteLine("Objects: {0}", objects.Count);
        foreach (PDFObject obj in objects)
        {
            if (obj.TextProperties.IsEndOfLine)
                writer.WriteLine(obj.Code);
            else
                writer.Write(obj.Code);
        }
        writer.WriteLine("---------------------");
    }
}
This will obtain all the text in the PDF for the first page, with the unwanted results as you mentioned. Here is an excerpt below:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text can be filtered out. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code sample can be altered to skip any text smaller than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
    if (obj.TextProperties.FontHeight > 7.25)
    {
        if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
        else
            writer.Write(obj.Code);
    }
}
The extracted output would give a better result; an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, with this approach you will have to experiment and come up with good criteria that filter out the unwanted text without removing the text you need to keep.
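For instance, here is a sketch of how a couple of the properties shown above might be combined into one filter; the Bounds threshold is made up purely for illustration, only the FontHeight value was observed in this file:
foreach (PDFObject obj in objects)
{
    // keep reasonably sized text that lies inside the main text column;
    // the Left > 0 cut-off is a hypothetical value to tune per document
    bool largeEnough = obj.TextProperties.FontHeight > 7.25;
    bool insideTextColumn = obj.Bounds.Left > 0;

    if (largeEnough && insideTextColumn)
    {
        if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
        else
            writer.Write(obj.Code);
    }
}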

Adding a Retokenize pipe while training NER model

I am currently attempting to train an NER model centered around Property Descriptions. I could get a fully trained model to function to my liking; however, I now want to add a retokenize pipe to the model so that I can set it up to train other things.
From here, I am having issues getting the retokenize pipe to actually work. Here is the definition:
from spacy.attrs import intify_attrs

def retok(doc):
    ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
    with doc.retokenize() as retok:
        string_store = doc.vocab.strings
        for start, end, label in ents:
            retok.merge(
                doc[start:end],
                attrs=intify_attrs({'ent_type': label}, string_store))
    return doc
I am adding it into my training like this:
nlp.add_pipe(retok, after="ner")
and I am adding it into the Language Factories like this:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
The issue I keep getting is "AttributeError: 'English' object has no attribute 'ents'". I am assuming I am getting this error because the parameter being passed into this function is not a doc but actually the NLP model itself. I am not really sure how to get a doc to flow into this pipe during training. At this point I don't really know where to go from here to get the pipe to function the way I want.
Any help is appreciated, thanks.
You can potentially use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities
The example copied from the docs:
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:
def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc
P.S. You are passing nlp to retok() below, which is where the error is coming from:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
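A minimal sketch of the fix, assuming spaCy v2's factory API: the factory should return the component function itself, and spaCy will then call it with each Doc during processing:
# return the component callable instead of calling it with the nlp object
Language.factories['retok'] = lambda nlp, **cfg: retok
# then add it to the pipeline after the NER component, as before
nlp.add_pipe(retok, after="ner")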
See a related question: Spacy - Save custom pipeline

How do I convert simple training style data to spaCy's command line JSON format?

I have the training data for a new NER type in the "Training an additional entity type" section of the spaCy documentation.
TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("Do they bite?", {
        'entities': []
    }),
    ("horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("horses pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("they pretend to care about your feelings, those horses", {
        'entities': [(48, 54, 'ANIMAL')]
    }),
    ("horses?", {
        'entities': [(0, 6, 'ANIMAL')]
    })
]
I want to train an NER model on this data using the spacy command line application. This requires data in spaCy's JSON format. How do I write the above data (i.e. text with labeled character offset spans) in this JSON format?
After looking at the documentation for that format, it's not clear to me how to manually write data in this format. (For example, do I have to partition everything into paragraphs?) There is also a convert command line utility that converts from non-spaCy data formats to spaCy's format, but that doesn't take a spaCy format like the one above as input.
I understand the examples of NER training code that uses the "Simple training style", but I'd like to be able to use the command line utility for training. (Though as is apparent from my previous spaCy question, I'm unclear when you're supposed to use that style and when you're supposed to use the command line.)
Can someone show me an example of the above data in "spaCy's JSON format", or point to documentation that explains how to make this transformation?
There's a built-in function in spaCy that will get you most of the way there:
from spacy.gold import biluo_tags_from_offsets
That takes in the "offset" type annotations you have there and converts them to the token-by-token BILOU format.
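For instance, on the first training example the offsets map to one tag per token, roughly as follows (a sketch, assuming nlp is a loaded English pipeline):
doc = nlp("Horses are too tall and they pretend to care about your feelings")
tags = biluo_tags_from_offsets(doc, [(0, 6, 'ANIMAL')])
# tags looks like ['U-ANIMAL', 'O', 'O', ...]: "Horses" becomes a single-token
# (U) entity and every other token is tagged O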
To put the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill out the other slots the data requires:
sentences = []
for t in TRAIN_DATA:
    doc = nlp(t[0])
    tags = biluo_tags_from_offsets(doc, t[1]['entities'])
    ner_info = list(zip(doc, tags))
    tokens = []
    for n, i in enumerate(ner_info):
        token = {"head": 0,
                 "dep": "",
                 "tag": "",
                 "orth": i[0].string,
                 "ner": i[1],
                 "id": n}
        tokens.append(token)
    sentences.append(tokens)
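To finish the file, those per-sentence token lists still need the outer document/paragraph wrapping that spacy train expects; a minimal sketch, with the field names assumed from spaCy v2's JSON training-format documentation:
import json

train_json = [{
    "id": 0,
    "paragraphs": [
        {"raw": t[0], "sentences": [{"tokens": toks}]}
        for t, toks in zip(TRAIN_DATA, sentences)
    ]
}]

with open("train.json", "w") as f:
    json.dump(train_json, f, indent=2)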
Make sure that you disable the non-NER pipelines before training with this data.
I've run into some issues using spacy train on NER-only data. See #1907 and also check out this discussion on the Prodigy forum for some possible workarounds.

Increase the length of text returned by Highlighter

Initially, I thought setMaxDocCharsToAnalyze(int) would increase the output length, but it does not.
Currently the output generated by my search (the String fragment) is less than a line long and hence makes no sense as a preview.
Can the output generated by getBestFragment() be increased, by some mechanism, to at least one sentence or more? (It does not matter if it is one and a half sentences; I just need it long enough to make some sense.)
Indexing:
Document document = new Document();
document.add(new TextField(FIELD_CONTENT, content, Field.Store.YES));
document.add(new StringField(FIELD_PATH, path, Field.Store.YES));
indexWriter.addDocument(document);
Searching
QueryParser queryParser = new QueryParser(FIELD_CONTENT, new StandardAnalyzer());
Query query = queryParser.parse(searchQuery);
QueryScorer queryScorer = new QueryScorer(query, FIELD_CONTENT);
Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);
Highlighter highlighter = new Highlighter(queryScorer); // Set the best scorer fragments
highlighter.setMaxDocCharsToAnalyze(100000); // "HAS NO EFFECT"
highlighter.setTextFragmenter(fragmenter);
// STEP B
File indexFile = new File(INDEX_DIRECTORY);
Directory directory = FSDirectory.open(indexFile.toPath());
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(indexReader); // not shown in the original excerpt
// STEP C
System.out.println("query: " + query);
ScoreDoc scoreDocs[] = searcher.search(query, MAX_DOC).scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs)
{
    Document document = searcher.doc(scoreDoc.doc);
    String title = document.get(FIELD_CONTENT);
    TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader,
            scoreDoc.doc, FIELD_CONTENT, document, new StandardAnalyzer());
    String fragment = highlighter.getBestFragment(tokenStream, title); // this String is the output whose length should increase
    System.out.println(fragment + "-------");
}
Sample Output
query: +Content:canada +Content:minister
|Liberal]] [[Prime Minister of Canada|Prime Minister]] [[Pierre Trudeau]] led a [[Minority-------
. Thorson, Minister of National War Services, Ottawa. Printed in Canada Description: British lion-------
politician of the [[New Zealand Labour Party| Labour Party]], and a cabinet minister. He represented-------
|}}}| ! [[Minister of Finance (Canada)|Minister]] {{!}} {{{minister-------
, District of Franklin''. Ottawa: Minister of Supply and Services Canada, 1977. ISBN 0660008351
25]], [[1880]] – [[March 4]], [[1975]]) was a [[Canada|Canadian]] provincial and federal-------
-du-Quebec]] region, in [[Canada]]. It is named after the first French Canadian to become Prime-------
11569347, Cannon_family_(Canada) ::: {{for|the American political family|Cannon family-------
minister of [[Guyana]] and prominent Hindu politician in [[Guyana]]. He also served, at various times-------
11559743, Mohammed_Hussein_Al_Shaali ::: '''Mohammed Hussein Al Shaali''' is the former Minister-------
The Fragmenter is the piece that controls this behavior. You can pass an int into the SimpleSpanFragmenter constructor to control the size of the fragments it produces (in bytes). The default size is 100. For example, to double that:
Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer, 200);
As far as splitting on sentence boundaries goes, there isn't a fragmenter for that out of the box. Someone posted their implementation of one here. It's an extremely naive implementation, but you may find it helpful if you want to go down that particular rabbit hole.
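If a single longer fragment is still not enough for a preview, another option (not mentioned above, but part of the same Highlighter API) is to concatenate several of the top fragments; a short sketch:
// join the best 3 fragments into one preview string, separated by an ellipsis
String preview = highlighter.getBestFragments(tokenStream, title, 3, " ... ");
System.out.println(preview + "-------");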

ResourceWriter data formatting

I have a .resx file in which I want to update some data. I can read the data from the file via a ResXResourceSet object, but when I save the data back, the saved format is unrecognizable. How do I edit .resx files? Thanks.
ResXResourceSet st = new ResXResourceSet(@"thepath");
entries = new List<DictionaryEntry>();
DictionaryEntry curEntry;
foreach (DictionaryEntry ent in st)
{
    if (ent.Key.ToString() == "Page.Title")
    {
        curEntry = ent;
        curEntry.Value = "change this one";
        entries.Add(curEntry);
    }
    else
    {
        entries.Add(ent);
    }
}
st.Close();
System.Resources.ResourceWriter wr = new ResourceWriter(@"thepath");
foreach (DictionaryEntry entry in entries)
{
    wr.AddResource(entry.Key.ToString(), entry.Value.ToString());
}
wr.Close();
Hi again, I searched and found the following:
ResourceWriter writes data in binary format
ResourceReader reads data in binary format
ResXResourceWriter writes data in XML format
ResXResourceReader reads data in XML format
So the example above, using ResXResourceWriter and ResXResourceReader instead of ResourceWriter and ResourceReader, will manipulate the resources as XML.
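A minimal sketch of that change, reading the existing entries with ResXResourceReader and writing them back with ResXResourceWriter (the path and key are taken from the question above):
using System.Collections;         // DictionaryEntry
using System.Collections.Generic; // List<T>
using System.Resources;           // ResXResourceReader / ResXResourceWriter (System.Windows.Forms.dll)

var entries = new List<DictionaryEntry>();
using (var reader = new ResXResourceReader(@"thepath"))
{
    foreach (DictionaryEntry ent in reader)
    {
        if (ent.Key.ToString() == "Page.Title")
            entries.Add(new DictionaryEntry(ent.Key, "change this one"));
        else
            entries.Add(ent);
    }
}
using (var writer = new ResXResourceWriter(@"thepath"))
{
    foreach (DictionaryEntry entry in entries)
        writer.AddResource(entry.Key.ToString(), entry.Value);
}   // disposing the writer generates and saves the XML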