Convert CONLL file to a list of Doc objects - spacy

Is there a way to convert a CoNLL file into a list of Doc objects without having to parse the sentences with the nlp object? I have a list of annotations that I need to pass to a downstream component that takes Doc objects as input. I have found a way to create the doc:
doc = Doc(nlp.vocab, words=[...])
I also found that I can use the from_array method to restore the other linguistic features. The array's values can be built from the hash values in the StringStore object. I have successfully created a Doc with LEMMA and TAG information, but I cannot recreate the HEAD data. My question is how to pass HEAD data to a Doc using the from_array method.
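For context, the part that already works looks roughly like this; a minimal sketch with a blank pipeline and made-up tag values, just to illustrate that each cell in the array is the StringStore hash of the attribute's string value:
import numpy as np
import spacy
from spacy.attrs import TAG
from spacy.tokens import Doc

nlp = spacy.blank("xx")  # hypothetical blank multi-language pipeline
doc = Doc(nlp.vocab, words=["Ona", "je", "otišla", "u", "školu", "."])
# hypothetical tags; add() interns each string and returns its hash
tags = [nlp.vocab.strings.add(t)
        for t in ["PRON", "AUX", "VERB", "ADP", "NOUN", "PUNCT"]]
doc.from_array([TAG], np.asarray(tags, dtype=np.uint64).reshape(-1, 1))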
The confusing thing about HEAD is that for a sentence with this (token, absolute head index) structure, where the root "otišla" (index 2) is its own head:
Ona 2
je 2
otišla 2
u 4
školu 2
. 2
The output of this code snippet:
from spacy.attrs import TAG, HEAD, DEP
doc.to_array([TAG, HEAD, DEP])
is:
array([[10468770234730083819, 2, 429],
[ 5333907774816518795, 1, 405],
[11670076340363994323, 0, 8206900633647566924],
[ 6471273018469892813, 1, 8110129090154140942],
[ 7055653905424136462, 18446744073709551614, 435],
[ 7173976090571422945, 18446744073709551613, 445]],
dtype=uint64)
I cannot correlate the middle column of the to_array output to the dependency tree structure given above.
Thanks in advance for the help,
Daniel

OK, so I finally cracked it: the middle column stores the relative offset head - index as an unsigned 64-bit integer, so the value is head - index when the head is at or after the token, and wraps around to 18446744073709551616 - (index - head) (i.e. 2**64 minus the distance) when the head comes before it. For example, "školu" (index 4, head 2) is stored as 2**64 - 2 = 18446744073709551614, which matches the array above. This is the function I used, in case anyone else needs it:
import numpy as np
from spacy.tokens import Doc
from spacy.attrs import TAG, HEAD, DEP

docs = []
for sent in sents:
    generated_doc = Doc(nlp.vocab, words=[word["word"] for word in sent])
    heads = []
    for idx, word in enumerate(sent):
        # make sure the tag and dep strings are interned in the StringStore
        if word["pos"] not in nlp.vocab.strings:
            nlp.vocab.strings.add(word["pos"])
        if word["dep"] not in nlp.vocab.strings:
            nlp.vocab.strings.add(word["dep"])
        # HEAD is the relative offset head - idx; negative offsets wrap
        # around into unsigned 64-bit space
        if word["head"] >= idx:
            heads.append(word["head"] - idx)
        else:
            heads.append(18446744073709551616 - (idx - word["head"]))
    np_array = np.array(
        [[nlp.vocab.strings[word["pos"]], heads[idx], nlp.vocab.strings[word["dep"]]]
         for idx, word in enumerate(sent)],
        dtype=np.uint64)
    generated_doc.from_array([TAG, HEAD, DEP], np_array)
    docs.append(generated_doc)
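To sanity-check the encoding, the stored offsets can be reinterpreted as signed 64-bit integers and added back to the token indices; a quick check with numpy, using the middle column of the array above:
import numpy as np

stored = np.array([2, 1, 0, 1, 18446744073709551614, 18446744073709551613],
                  dtype=np.uint64)
relative = stored.view(np.int64)                # [ 2  1  0  1 -2 -3]
absolute = relative + np.arange(len(relative))  # [2 2 2 4 2 2]
print(absolute)  # matches the head indices of the example sentence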

Related

get the index of string element in np1 while it has a substring in np2

My code is below:
import numpy as np
keywordlist = ['cpp-4.8.5', 'CUnit-2.1.3', 'CUnit-devel', 'doxygen-1.8.5', 'e2fsprogs-1.42.9', 'e2fsprogs-libs', 'epel-release', 'fuse3-devel', 'fuse3-libs', 'gcc-4.8.5', 'gcc-c++', 'gcc-gfortran', 'ghc-array', 'ghc-base', 'ghc-bytestring', 'ghc-containers', 'ghc-deepseq', 'ghc-directory', 'ghc-filepath', 'ghc-json', 'ghc-mtl', 'ghc-old', 'ghc-parsec', 'ghc-pretty', 'ghc-regex', 'ghc-regex', 'ghc-ShellCheck', 'ghc-syb', 'ghc-text', 'ghc-time', 'ghc-transformers', 'ghc-unix', 'git-1.8.3.1', 'graphviz-2.30.1', 'help2man-1.41.1', 'ibacm-22.4', 'keyutils-libs', 'krb5-devel', 'krb5-libs', 'krb5-workstation', 'lcov-1.13', 'libaio-devel', 'libblkid-2.23.2', 'libcom_err-1.42.9', 'libcom_err-devel', 'libgcc-4.8.5', 'libgfortran-4.8.5', 'libgomp-4.8.5', 'libibumad-22.4', 'libibverbs-22.4', 'libiscsi-devel', 'libkadm5-1.15.1', 'libmount-2.23.2', 'libpmem-1.5.1', 'libpmemblk-1.5.1', 'libpmemblk-devel', 'libpmem-devel', 'libquadmath-4.8.5', 'libquadmath-devel', 'librdmacm-22.4', 'libselinux-2.5', 'libselinux-devel', 'libselinux-python', 'libselinux-utils', 'libsepol-devel', 'libsmartcols-2.23.2', 'libss-1.42.9', 'libstdc++-4.8.5', 'libstdc++-devel', 'libunwind-1.2', 'libunwind-devel', 'libuuid-2.23.2', 'libuuid-devel', 'libverto-devel', 'libXaw-1.0.13', 'libXScrnSaver-1.2.2', 'make-3.82', 'nasm-2.10.07', 'numactl-devel', 'numactl-libs', 'openssl-1.0.2k', 'openssl-devel', 'openssl-libs', 'pcre-devel', 'perl-Digest', 'perl-Digest', 'perl-GD', 'perl-Git', 'python-2.7.5', 'python2-pycodestyle', 'python-libs', 'rdma-core', 'rdma-core', 'sg3_utils-1.37', 'sg3_utils-libs', 'ShellCheck-0.3.8', 'util-linux', 'zlib-devel']
np1 = np.array(keywordlist)
# ['cpp-4.8.5' 'CUnit-2.1.3' 'CUnit-devel' 'doxygen-1.8.5' ... 'ShellCheck-0.3.8' 'util-linux' 'zlib-devel']
result = ['epel-release-7-12.noarch', 'rdma-core-22.4-5.el7.x86_64', 'cpp-4.8.5-44.el7.x86_64', 'doxygen-1.8.5-4.el7.x86_64', 'ghc-base-4.6.0.1-26.4.el7.x86_64', 'libuuid-2.23.2-65.el7.x86_64', 'python-libs-2.7.5-89.el7.x86_64', 'libkadm5-1.15.1-50.el7.x86_64', 'libmount-2.23.2-65.el7.x86_64', 'libquadmath-4.8.5-44.el7.x86_64', 'util-linux-2.23.2-65.el7.x86_64', 'libss-1.42.9-19.el7.x86_64', 'keyutils-libs-1.5.8-3.el7.x86_64', 'e2fsprogs-libs-1.42.9-19.el7.x86_64', 'ghc-pretty-1.1.1.0-26.4.el7.x86_64', 'libXaw-1.0.13-4.el7.x86_64', 'libselinux-2.5-15.el7.x86_64', 'libibverbs-22.4-5.el7.x86_64', 'libselinux-utils-2.5-15.el7.x86_64', 'libgomp-4.8.5-44.el7.x86_64', 'libblkid-2.23.2-65.el7.x86_64', 'gcc-c++-4.8.5-44.el7.x86_64', 'e2fsprogs-1.42.9-19.el7.x86_64', 'CUnit-devel-2.1.3-8.el7.x86_64', 'make-3.82-24.el7.x86_64', 'numactl-libs-2.0.12-5.el7.x86_64', 'perl-Git-1.8.3.1-23.el7_8.noarch', 'openssl-libs-1.0.2k-19.el7.x86_64', 'gcc-4.8.5-44.el7.x86_64', 'CUnit-2.1.3-8.el7.x86_64', 'ghc-syb-0.4.0-35.el7.x86_64', 'gcc-gfortran-4.8.5-44.el7.x86_64', 'libselinux-python-2.5-15.el7.x86_64', 'sg3_utils-libs-1.37-19.el7.x86_64', 'fuse3-libs-3.6.1-4.el7.x86_64', 'libquadmath-devel-4.8.5-44.el7.x86_64', 'libgfortran-4.8.5-44.el7.x86_64', 'krb5-workstation-1.15.1-50.el7.x86_64', 'librdmacm-22.4-5.el7.x86_64', 'sg3_utils-1.37-19.el7.x86_64', 'libsmartcols-2.23.2-65.el7.x86_64', 'fuse3-devel-3.6.1-4.el7.x86_64', 'python-2.7.5-89.el7.x86_64', 'openssl-1.0.2k-19.el7.x86_64', 'libgcc-4.8.5-44.el7.x86_64', 'libaio-devel-0.3.109-13.el7.x86_64', 'ghc-old-locale-1.0.0.5-26.4.el7.x86_64', 'libcom_err-1.42.9-19.el7.x86_64', 'git-1.8.3.1-23.el7_8.x86_64', 'krb5-libs-1.15.1-50.el7.x86_64']
np2 = np.array(result)
# ['epel-release-7-12.noarch' 'rdma-core-22.4-5.el7.x86_64' ... 'krb5-libs-1.15.1-50.el7.x86_64']
expectation = ['cpp-4.8.5-39.el7.x86_64', 'CUnit-2.1.3-8.el7.x86_64', 'CUnit-devel-2.1.3-8.el7.x86_64', 'doxygen-1.8.5-4.el7.x86_64', 'e2fsprogs-1.42.9-17.el7.x86_64', 'e2fsprogs-libs-1.42.9-17.el7.x86_64', 'epel-release-latest-7.noarch', 'fuse3-devel-3.6.1-4.el7.x86_64', 'fuse3-libs-3.6.1-4.el7.x86_64', 'gcc-4.8.5-39.el7.x86_64', 'gcc-c++-4.8.5-39.el7.x86_64', 'gcc-gfortran-4.8.5-39.el7.x86_64', 'ghc-array-0.4.0.1-26.4.el7.x86_64', 'ghc-base-4.6.0.1-26.4.el7.x86_64', 'ghc-bytestring-0.10.0.2-26.4.el7.x86_64', 'ghc-containers-0.5.0.0-26.4.el7.x86_64', 'ghc-deepseq-1.3.0.1-26.4.el7.x86_64', 'ghc-directory-1.2.0.1-26.4.el7.x86_64', 'ghc-filepath-1.3.0.1-26.4.el7.x86_64', 'ghc-json-0.7-4.el7.x86_64', 'ghc-mtl-2.1.2-27.el7.x86_64', 'ghc-old-locale-1.0.0.5-26.4.el7.x86_64', 'ghc-parsec-3.1.3-31.el7.x86_64', 'ghc-pretty-1.1.1.0-26.4.el7.x86_64', 'ghc-regex-base-0.93.2-29.el7.x86_64', 'ghc-regex-tdfa-1.1.8-11.el7.x86_64', 'ghc-ShellCheck-0.3.8-1.el7.x86_64', 'ghc-syb-0.4.0-35.el7.x86_64', 'ghc-text-0.11.3.1-2.el7.x86_64', 'ghc-time-1.4.0.1-26.4.el7.x86_64', 'ghc-transformers-0.3.0.0-34.el7.x86_64', 'ghc-unix-2.6.0.1-26.4.el7.x86_64', 'git-1.8.3.1-23.el7_8.x86_64', 'graphviz-2.30.1-21.el7.x86_64', 'help2man-1.41.1-3.el7.noarch', 'ibacm-22.4-2.el7_8.x86_64', 'keyutils-libs-devel-1.5.8-3.el7.x86_64', 'krb5-devel-1.15.1-46.el7.x86_64', 'krb5-libs-1.15.1-46.el7.x86_64', 'krb5-workstation-1.15.1-46.el7.x86_64', 'lcov-1.13-1.el7.noarch', 'libaio-devel-0.3.109-13.el7.x86_64', 'libblkid-2.23.2-63.el7.x86_64', 'libcom_err-1.42.9-17.el7.x86_64', 'libcom_err-devel-1.42.9-17.el7.x86_64', 'libgcc-4.8.5-39.el7.x86_64', 'libgfortran-4.8.5-39.el7.x86_64', 'libgomp-4.8.5-39.el7.x86_64', 'libibumad-22.4-2.el7_8.x86_64', 'libibverbs-22.4-2.el7_8.x86_64', 'libiscsi-devel-1.9.0-7.el7.x86_64', 'libkadm5-1.15.1-46.el7.x86_64', 'libmount-2.23.2-63.el7.x86_64', 'libpmem-1.5.1-2.1.el7.x86_64', 'libpmemblk-1.5.1-2.1.el7.x86_64', 'libpmemblk-devel-1.5.1-2.1.el7.x86_64', 'libpmem-devel-1.5.1-2.1.el7.x86_64', 'libquadmath-4.8.5-39.el7.x86_64', 'libquadmath-devel-4.8.5-39.el7.x86_64', 'librdmacm-22.4-2.el7_8.x86_64', 'libselinux-2.5-15.el7.x86_64', 'libselinux-devel-2.5-15.el7.x86_64', 'libselinux-python-2.5-15.el7.x86_64', 'libselinux-utils-2.5-15.el7.x86_64', 'libsepol-devel-2.5-10.el7.x86_64', 'libsmartcols-2.23.2-63.el7.x86_64', 'libss-1.42.9-17.el7.x86_64', 'libstdc++-4.8.5-39.el7.x86_64', 'libstdc++-devel-4.8.5-39.el7.x86_64', 'libunwind-1.2-2.el7.x86_64', 'libunwind-devel-1.2-2.el7.x86_64', 'libuuid-2.23.2-63.el7.x86_64', 'libuuid-devel-2.23.2-63.el7.x86_64', 'libverto-devel-0.2.5-4.el7.x86_64', 'libXaw-1.0.13-4.el7.x86_64', 'libXScrnSaver-1.2.2-6.1.el7.x86_64', 'make-3.82-24.el7.x86_64', 'nasm-2.10.07-7.el7.x86_64', 'numactl-devel-2.0.12-5.el7.x86_64', 'numactl-libs-2.0.12-5.el7.x86_64', 'openssl-1.0.2k-19.el7.x86_64', 'openssl-devel-1.0.2k-19.el7.x86_64', 'openssl-libs-1.0.2k-19.el7.x86_64', 'pcre-devel-8.32-17.el7.x86_64', 'perl-Digest-1.17-245.el7.noarch', 'perl-Digest-MD5-2.52-3.el7.x86_64', 'perl-GD-2.49-3.el7.x86_64', 'perl-Git-1.8.3.1-23.el7_8.noarch', 'python-2.7.5-88.el7.x86_64', 'python2-pycodestyle-2.5.0-1.el7.noarch', 'python-libs-2.7.5-88.el7.x86_64', 'rdma-core-22.4-2.el7_8.x86_64', 'rdma-core-devel-22.4-2.el7_8.x86_64', 'sg3_utils-1.37-19.el7.x86_64', 'sg3_utils-libs-1.37-19.el7.x86_64', 'ShellCheck-0.3.8-1.el7.x86_64', 'util-linux-2.23.2-63.el7.x86_64', 'zlib-devel-1.2.7-18.el7.x86_64']
np3 = np.array(expectation)
# ['cpp-4.8.5-39.el7.x86_64' 'CUnit-2.1.3-8.el7.x86_64' ... 'util-linux-2.23.2-63.el7.x86_64' 'zlib-devel-1.2.7-18.el7.x86_64']
ready = []
for i in keywordlist:
    for j in result:
        x = np.char.startswith(j, i)
        if x:
            ready.append(np3[np.where(np.char.startswith(np3, i))])
np4 = np.array(ready)
# [array(['cpp-4.8.5-39.el7.x86_64'], dtype='<U39') array(['CUnit-2.1.3-8.el7.x86_64'], dtype='<U39') ... array(['util-linux-2.23.2-63.el7.x86_64'], dtype='<U39')]
notready = [i for i in np3 if i not in np4]
print(f"not ready: {notready}")
The purpose is to use each keyword in keywordlist to check for its presence in the elements of np2.
If any element in np2 starts with a keyword (i.e. the keyword is a prefix of that element), get the element of expectation (np3) that also starts with that keyword and collect it into np4.
Finally, notready is made up of the elements that are in np3 but not in np4.
To make my explanation more vivid: I have a bunch of RPM files to be installed, the list expectation.
The keyword list captures the first two fields of each RPM file name.
result is the standard output listing the RPM files that are already installed.
Taking cpp-4.8.5 as an example: I can see cpp-4.8.5-44.el7.x86_64 in result, which means cpp-4.8.5-44.el7.x86_64 is currently installed. So cpp-4.8.5-39.el7.x86_64 in expectation can be removed, since a cpp-4.8.5-*.rpm has been successfully installed. The next step is to deal with the items left in expectation.
My question is: is there any easier or more efficient way to get a result equivalent to notready, perhaps with other numpy built-in methods, but without for loops?
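For reference, one way to drop the inner Python loop is to let np.char.startswith scan a whole array per keyword and combine the matches with boolean masks. This is a sketch, assuming the np1/np2/np3 arrays defined above; it still loops over the ~100 keywords, but each scan is a single vectorized call:
# keywords that prefix at least one installed package in np2
matched = np.array([np.char.startswith(np2, kw).any() for kw in np1])

# mark every expectation entry whose keyword found a match
hit = np.zeros(len(np3), dtype=bool)
for kw in np1[matched]:
    hit |= np.char.startswith(np3, kw)

notready = np3[~hit].tolist()
print(f"not ready: {notready}")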

Can BeautifulSoup be used to find elements hidden by other wrapped elements?

I would like to extract the text of the author affiliations on this page using BeautifulSoup.
I know of a workaround using Selenium: simply click on the 'show more' link and scan the page again. I'm not sure what kind of elements these are; hidden, perhaps, as they only appear in the inspector after clicking the button.
Is there a way to extract this info using just BeautifulSoup, or do I need Selenium or something equivalent to reveal the elements in the HTML?
from bs4 import BeautifulSoup
import requests

url = 'https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596'
r = requests.get(url)
sp = BeautifulSoup(r.content, 'html.parser')
author_data = sp.find('div', id='author-group')
affiliations = author_data.find('dl', class_='affiliation').text
print(affiliations)
That info is within a script tag, though you need to map the affiliation letters to the actual affiliations. The code below extracts the JavaScript object housing the info you want and parses it with the json library.
A series of steps then dynamically determines which indices hold the info of interest, and a constructed mapping of letters to affiliations assigns the correct affiliation to each author.
The author first and last names are also dynamically ascertained and joined together with a space.
The intention is to avoid hardcoding indices, which might change over time.
import re
import json
import requests

r = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"abstracts".*})', r.text).group(1))
base = [i for i in data['authors']['content']
        if i.get('#name') == 'author-group'][0]['$$']
affiliation_data = [i for i in base if i['#name'] == 'affiliation']
author_data = [i for i in base if i['#name'] == 'author']
name_info = [i['_'] for author in author_data for i in author['$$']
             if i['#name'] in ['given-name', 'surname']]
affiliations = dict(zip(
    [j['_'] for i in affiliation_data for j in i['$$'] if j['#name'] == 'label'],
    [j['_'] for i in affiliation_data for j in i['$$']
     if isinstance(j, dict) and '_' in j and j['_'][0].isupper()]))
# print(affiliations)
author_affiliations = dict(zip(
    [' '.join([i[0], i[1]]) for i in zip(name_info[0::2], name_info[1::2])],
    [affiliations[j['_']] for author in author_data for i in author['$$']
     if i['#name'] == 'cross-ref' for j in i['$$'] if j['_'] != '⁎']))
print(author_affiliations)
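Since the regex capture depends on the page embedding that JavaScript object verbatim, a quick sanity check on the parsed blob can catch layout changes early; a usage sketch with the names defined above:
print(type(data), list(data)[:5])  # top-level keys of the parsed object
print(len(author_data), 'authors;', len(affiliation_data), 'affiliations')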

Adding a Retokenize pipe while training NER model

I am currently attempting to train an NER model centered around property descriptions. I managed to get a fully trained model to function to my liking; however, I now want to add a retokenize pipe to the model so that I can set it up to train other things.
From here, I am having issues getting the retokenize pipe to actually work. Here is the definition:
from spacy.attrs import intify_attrs  # needed for the attrs conversion

def retok(doc):
    ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
    with doc.retokenize() as retok:
        string_store = doc.vocab.strings
        for start, end, label in ents:
            retok.merge(
                doc[start:end],
                attrs=intify_attrs({'ent_type': label}, string_store))
    return doc
I am adding it to my training like this:
nlp.add_pipe(retok, after="ner")
and I am adding it into the Language Factories like this:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
The issue I keep getting is "AttributeError: 'English' object has no attribute 'ents'". I am assuming I get this error because the parameter being passed to the function is not a Doc but the NLP model itself. I am not really sure how to get a Doc to flow into this pipe during training. At this point I don't really know where to go from here to get the pipe to function the way I want.
Any help is appreciated, thanks.
You can potentially use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities
The example copied from the docs:
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:
def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc
P.S. You are passing nlp to retok() below, which is where the error is coming from:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
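A sketch of the likely fix for the spaCy v2 API: the factory should return the component callable, so spaCy can call it on each Doc later, rather than invoking it on nlp at registration time:
# return the function itself instead of calling it (sketch, spaCy v2 API)
Language.factories['retok'] = lambda nlp, **cfg: retok
nlp.add_pipe(nlp.create_pipe('retok'), after='ner')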
See a related question: Spacy - Save custom pipeline

Custom SPARQL functions in rdflib

What is a good way to hook a custom SPARQL function into rdflib?
I have been looking around in rdflib for an entry point for custom functions. I found no dedicated entry point, but found that rdflib.plugins.sparql.CUSTOM_EVALS might be a place to add one.
So far I have made the attempt below. It seems "dirty" to me: I am calling a "hidden" function (_eval), and I am not sure I got all the argument updating correct. Beyond the custom_eval.py example code (which forms the basis for my code), I found little other code or documentation about CUSTOM_EVALS.
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal

NAMESPACE = Namespace('//custom/')
LENGTH = rdflib.term.URIRef(NAMESPACE + 'length')

def customEval(ctx, part):
    """Evaluate custom function."""
    if part.name == 'Extend':
        cs = []
        for c in evalPart(ctx, part.p):
            if hasattr(part.expr, 'iri'):
                # A function
                argument = _eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars))
                if part.expr.iri == LENGTH:
                    e = Literal(len(argument))
                else:
                    raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
            else:
                e = _eval(part.expr, c.forget(ctx, _except=part._vars))
                if isinstance(e, SPARQLError):
                    raise e
            cs.append(c.merge({part.var: e}))
        return cs
    raise NotImplementedError()

QUERY = """
PREFIX custom: <%s>
SELECT ?s ?length WHERE {
    BIND("Hello, World" AS ?s)
    BIND(custom:length(?s) AS ?length)
}
""" % (NAMESPACE,)

rdflib.plugins.sparql.CUSTOM_EVALS['exampleEval'] = customEval

for row in rdflib.Graph().query(QUERY):
    print(row)
So first off, I want to thank you for showing how you implemented a new SPARQL function.
Secondly, using your code I was able to create a SPARQL function that compares two strings using the Levenshtein distance. It has been really insightful, and I wish to share it, since it contains additional documentation that could help other developers create their own custom SPARQL functions.
# Imports needed to introduce the new SPARQL function
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal

# Import for the custom function calculation
from Levenshtein import distance as levenshtein_distance  # python-Levenshtein==0.12.2

def SPARQL_levenshtein(ctx: object, part: object) -> object:
    """
    The first two variables retrieved from a SPARQL query are compared using the Levenshtein distance.
    The distance value is then stored in a Literal object and added to the query results.

    Example:
        Query:
            PREFIX custom: //custom/  # Note: this part references the custom function
            SELECT ?label1 ?label2 ?levenshtein WHERE {
                BIND("Hello" AS ?label1)
                BIND("World" AS ?label2)
                BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
            }
        Retrieve: ?label1 ?label2
        Calculation: levenshtein_distance(?label1, ?label2) = distance
        Output: save distance in a Literal object.

    :param ctx: <class 'rdflib.plugins.sparql.sparql.QueryContext'>
    :param part: <class 'rdflib.plugins.sparql.parserutils.CompValue'>
    :return: <class 'rdflib.plugins.sparql.processor.SPARQLResult'>
    """
    # This part holds the basic implementation for adding new functions
    if part.name == 'Extend':
        cs = []
        # Information is retrieved, stored and passed through a generator
        for c in evalPart(ctx, part.p):
            # Checks if the function holds an internationalized resource identifier.
            # This will check if any custom functions are added.
            if hasattr(part.expr, 'iri'):
                # From here the real calculations begin.
                # First we get the variable arguments, for example ?label1 and ?label2
                argument1 = str(_eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars)))
                argument2 = str(_eval(part.expr.expr[1], c.forget(ctx, _except=part.expr._vars)))
                # Here it checks if it can find our Levenshtein IRI (example: //custom/levenshtein).
                # Please note that IRI and URI are almost the same.
                # Earlier this was defined with the following:
                #   namespace = Namespace('//custom/')
                #   levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
                if part.expr.iri == levenshtein:
                    # After finding the correct path for the custom SPARQL function, the evaluation can begin.
                    # Here the Levenshtein distance is calculated using ?label1 and ?label2 and stored as a Literal object.
                    # This object is then stored as an output value of the SPARQL query (example: ?levenshtein)
                    evaluation = Literal(levenshtein_distance(argument1, argument2))
                # Standard error handling and return statements
                else:
                    raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
            else:
                evaluation = _eval(part.expr, c.forget(ctx, _except=part._vars))
                if isinstance(evaluation, SPARQLError):
                    raise evaluation
            cs.append(c.merge({part.var: evaluation}))
        return cs
    raise NotImplementedError()

namespace = Namespace('//custom/')
levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')

query = """
PREFIX custom: <%s>
SELECT ?label1 ?label2 ?levenshtein WHERE {
    BIND("Hello" AS ?label1)
    BIND("World" AS ?label2)
    BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
""" % (namespace,)

# Save the custom function in the custom evaluation dictionary.
rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein

for row in rdflib.Graph().query(query):
    print(row)
To answer your question, "What is a good way to hook a custom SPARQL function into rdflib?":
Currently I'm developing a class that handles RDF data, and I believe it might be best to put the following code in its __init__ function.
For example:
class ClassName():
    """DOCSTRING"""

    def __init__(self):
        """DOCSTRING"""
        # Save the custom function in the custom evaluation dictionary.
        rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
Please note that this SPARQL function will only work on the endpoint where it is implemented. Even though the SPARQL syntax in the query is correct, it is not possible to apply the function in SPARQL queries against databases like DBpedia. The DBpedia endpoint does not support this custom function (yet).
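As a final note, recent rdflib releases (5.0 and later) also ship a dedicated entry point, register_custom_function, which avoids touching CUSTOM_EVALS for simple scalar functions. A minimal sketch, assuming a recent rdflib and that the default (non-raw) registration passes the already-evaluated argument terms:
from rdflib import Graph, Literal, Namespace
from rdflib.plugins.sparql.operators import register_custom_function

CUSTOM = Namespace('//custom/')

def length(arg):
    # arg should arrive as an already-evaluated rdflib term, e.g. Literal("Hello, World")
    return Literal(len(arg))

register_custom_function(CUSTOM.length, length)

query = """
PREFIX custom: <//custom/>
SELECT ?s ?length WHERE {
    BIND("Hello, World" AS ?s)
    BIND(custom:length(?s) AS ?length)
}
"""
for row in Graph().query(query):
    print(row)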

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries for Python, Java, and PHP, but none of them does the job.
To extract highlighted parts, you can use PyMuPDF. Here is an example which works with this PDF file (direct download):
# Based on https://stackoverflow.com/a/62859169/562769
from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'

def _parse_highlight(annot: fitz.Annot,
                     wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence

def handle_page(page):
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights

def main(filepath: str) -> List:
    doc = fitz.open(filepath)
    highlights = []
    for page in doc:
        highlights += handle_page(page)
    return highlights

if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
OK, after looking around I found a solution for exporting highlighted text from a PDF to a text file. It is not very hard:
First, highlight your text with the tool you like to use (in my case, I highlight while reading on an iPad using the GoodReader app).
Transfer your PDF to a computer and open it using Skim (a PDF reader, free and easy to find on the web).
In FILE, choose CONVERT NOTES and convert all the notes of your document to Skim notes.
That's all: simply go to EXPORT and choose EXPORT SKIM NOTES. It will export a list of your highlighted text. Once opened, this list can be exported again to a txt file.
Not much work to do, and the result is fantastic.