Context parameters in Blazegraph Nano SPARQL - semantic-web

I am doing some experiments with the Blazegraph Nano SPARQL Server. I started the server with the following command:
$ java -server -Xmx4g -jar bigdata-bundled.jar
However, I need to set a timeout for queries. There is a context parameter named queryTimeout for that, but I do not know how to use it. Can I add a command-line option to set this parameter? If this parameter can only be set in a web.xml file, where can I find a minimal web.xml file that I can use to set queryTimeout?

If you're using the REST API, you don't need to recompile with an updated web.xml. You can use the timeout query parameter to set the value for an individual query in seconds, or the X-BIGDATA-MAX-QUERY-MILLIS HTTP header to set the query timeout in milliseconds. See the REST Query API documentation.
Example setting the timeout to 30 seconds:
curl -X POST http://localhost:8080/bigdata/sparql --data-urlencode \
'query=SELECT * { ?s ?p ?o } LIMIT 1' --data-urlencode 'timeout=30'
Example setting the timeout to 100 milliseconds:
curl -X POST http://localhost:8080/bigdata/sparql --data-urlencode \
'query=SELECT * { ?s ?p ?o } LIMIT 1' -H 'X-BIGDATA-MAX-QUERY-MILLIS:100'
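The same two requests can also be issued from Python; here is a minimal sketch using the requests library against the endpoint from the examples above:
import requests
endpoint = 'http://localhost:8080/bigdata/sparql'
query = 'SELECT * { ?s ?p ?o } LIMIT 1'
# Timeout of 30 seconds via the 'timeout' form parameter
r = requests.post(endpoint, data={'query': query, 'timeout': '30'})
print(r.status_code)
# Timeout of 100 milliseconds via the X-BIGDATA-MAX-QUERY-MILLIS header
r = requests.post(endpoint, data={'query': query}, headers={'X-BIGDATA-MAX-QUERY-MILLIS': '100'})
print(r.status_code)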
If you have an embedded application, such as one using Blueprints, you can set the maxQueryTime property when you create the knowledge base. It sets the timeout in seconds via the Query object of the OpenRDF (RDF4J) library. Here's an example for the Sesame embedded mode:
com.bigdata.blueprints.BigdataGraph.maxQueryTime=30

It is possible to compile Blazegraph again after updating the web.xml file. The steps are:
Clone the git repository.
git clone git://git.code.sf.net/p/bigdata/git Blazegraph
Check out the release.
git checkout -b BLAZEGRAPH_RELEASE_1_5_1
Edit bigdata-war/src/WEB-INF/web.xml to set the queryTimeout context parameter:
<context-param>
  <description>When non-zero, the timeout for queries (milliseconds).</description>
  <param-name>queryTimeout</param-name>
  <param-value>60000</param-value>
</context-param>
Recompile Blazegraph.
ant clean executable-jar

Related

AWS Neptune /system Bad Route Connection Error

I am trying to reset my Neptune instance following the documentation provided,
https://aws.amazon.com/blogs/database/resetting-your-graph-data-in-amazon-neptune-in-seconds/
https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-fast-reset.html
When I try this approach with awscurl, I get a BadRequestException error:
{"requestId":"21cd6d80-some-more-code-25566575a4ba","code":"BadRequestException","detailedMessage":"Bad route: /system"}
This was my awscurl command:
awscurl -X POST --access_key ACCESS_KEY --secret_key SECRET_KEY --service neptune-db "https://neptunedbinstance-somecode.somemorecode.us-east-2.neptune.amazonaws.com:8182/system" -H 'Content-Type: application/json' --region us-east-2 -d '{ "action" : "initiateDatabaseReset" }'
Adding an answer in case others find this question.
The fast reset feature requires Neptune engine version 1.0.4.0 or later. There is a related blog post that describes the feature in detail.
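If you are unsure which engine version an instance is running, the instance status endpoint reports it. Here is a minimal Python sketch with requests, assuming IAM database authentication is disabled on the cluster (otherwise the request has to be SigV4-signed, e.g. with awscurl):
import requests
# Substitute your own Neptune endpoint; this one mirrors the redacted endpoint from the question
status_url = 'https://neptunedbinstance-somecode.somemorecode.us-east-2.neptune.amazonaws.com:8182/status'
info = requests.get(status_url).json()
# Fast reset (/system) requires dbEngineVersion 1.0.4.0 or later
print(info.get('dbEngineVersion'))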

import status lingers after GraphDB repository deleted

GraphDB Free/9.4.1, RDF4J/3.3.1
I'm working on using the /rest/data/import/server/{repo-id} endpoint to initiate the importing of an RDF/XML file.
Steps:
put SysML.owl in the ${graphdb.workbench.importDirectory} directory.
chmod a+r SysML.owl
create repository test1 (in Workbench - using all defaults except RepositoryID := "test1")
curl http://127.0.0.1:7200/rest/data/import/server/test1 => as expected:
[{"name":"SysML.owl","status":"NONE"..."timestamp":1606848520821,...]
curl -XPOST --header 'Content-Type: application/json' --header 'Accept: application/json' -d ' { "fileNames":[ "SysML.owl" ] }' http://127.0.0.1:7200/rest/data/import/server/test1 => SC==202
after 60 seconds, curl http://127.0.0.1:7200/rest/data/import/server/test1 =>
[{"name":"SysML.owl","status":"DONE","message":"Imported successfully in 7s.","context":null,"replaceGraphs":[],"baseURI":
"file:/home/steve/graphdb-import/SysML.owl", "forceSerial":false,"type":"file","format":null,"data":null,"timestamp":
1606848637716, [...other json content deleted]
Repository test1 now has the 263,119 (824 inferred) statements from SysML.owl loaded
BUT if I then
delete the repository using the Workbench page at http://localhost:7200/repository, wait 180 seconds
curl http://127.0.0.1:7200/rest/data/import/server/test1 => same as in step 5 above, despite the repository having been deleted.
curl -X GET --header 'Accept: application/json' 'http://localhost:7200/rest/repositories' => test1 not shown.
create the repository again using the Workbench - same settings as previously. Wait 60 seconds. The initial 70 statements are present.
curl http://127.0.0.1:7200/rest/data/import/server/test1 =>
The same output as from the earlier usage with the prior repository instance: "status":"DONE" and the same timestamp, which is prior to the time at which I deleted and recreated the test1 repository.
The main-2020-12-01.log shows the INFO messages pertaining to the repository test1, plugin registrations, etc. Nothing indicating why the prior repository instance's import status is lingering.
And this is of concern because I was expecting to use some polling of the status to determine when the data is loaded so my processing can proceed. Some good news - I can issue the import server file request again and after waiting 60 seconds, the 263,119 statements are present. But the timestamp on the import is the earlier repo instance's timestamp. It was not reset via the latest import request.
I'm probably missing some cleanup step(s), am hoping someone knows which.
Thanks,
-Steve
The status is simply for your reference and doesn't represent the actual presence of data in the repository. You could achieve a similar thing simply by clearing all data in the repository without recreating it.
If you really need to rely on those status records, you can clear the status for a given file once you've polled it and determined it's done (or prior to starting an import) with this curl:
curl -X DELETE http://127.0.0.1:7200/rest/data/import/server/test1/status \
-H 'content-type: application/json' -d '["SysML.owl"]'
Note that this is an undocumented API and it may change without notice.
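Putting the pieces together, a polling loop that waits for the import to finish and then clears the lingering status record could look like this (a sketch with Python requests against the local Workbench and the test1 repository from the question; error handling omitted):
import time
import requests
base = 'http://127.0.0.1:7200/rest/data/import/server/test1'
filename = 'SysML.owl'
# Trigger the server-side import
requests.post(base, json={'fileNames': [filename]})
# Poll the status list until our file reports DONE
while True:
    statuses = requests.get(base, headers={'Accept': 'application/json'}).json()
    entry = next(s for s in statuses if s['name'] == filename)
    if entry['status'] == 'DONE':
        break
    time.sleep(5)
# Clear the status record so a later poll is not misled by a stale DONE
requests.delete(base + '/status', json=[filename])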

Using gcloud, how can I find the Virtual Machine Name?

I am trying to find the Google Cloud Compute Virtual Machine Name, using the gcloud command while logged in to the VM.
Searching the documentation didn't yield a result...
Thanks!
See Metadata service.
Specifically:
curl \
--header "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/name
The Metadata service is a well-implemented API: you can navigate up and down the tree of resources. For example, dropping the final name from the above URL enumerates all the resources under instance:
curl \
--header "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/
returns:
attributes/
cpu-platform
description
disks/
guest-attributes/
hostname
id
image
legacy-endpoint-access/
licenses/
machine-type
maintenance-event
name
network-interfaces/
preempted
remaining-cpu-time
scheduling/
service-accounts/
tags
virtual-clock/
zone
You can then pick any of the above, append it to the URL to continue browsing.
The documentation in the link is similarly comprehensive and, in the case of instance metadata, as expected, reflects the response from GET'ing
https://compute.googleapis.com/compute/v1/.../instances/...
i.e.:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/get#response-body
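The same lookups can be done from Python inside the VM; a minimal sketch with requests, using only the metadata host and Metadata-Flavor header shown above:
import requests
metadata_base = 'http://metadata.google.internal/computeMetadata/v1/instance/'
headers = {'Metadata-Flavor': 'Google'}
# The VM's name
print(requests.get(metadata_base + 'name', headers=headers).text)
# Everything available under instance/ (same listing as above)
print(requests.get(metadata_base, headers=headers).text)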

Programmatically upload a dataset to Fuseki

I use the Jena Fuseki 2 Docker image to create a Fuseki server.
I want to know if there is a way to upload my dataset to Fuseki not from the web interface but programmatically, from SPARQL or Python or whatever else.
Also, is there a way to work with an ontology from WebProtégé directly in Fuseki?
Thanks for your answers
Fuseki comes with an HTTP API that can be used to upload data. You could use CURL or a Python HTTP library to communicate with this API. Fuseki also includes command-line helper scripts that can be used for calling the HTTP API. See https://jena.apache.org/documentation/fuseki2/soh.html#soh-sparql-http for more details.
If your RDF data is in Turtle format, you can use the following Python code (with the requests library):
import requests
data = open('test.ttl').read()
headers = {'Content-Type': 'text/turtle;charset=utf-8'}
r = requests.post('http://localhost:3030/mydataset/data?default', data=data, headers=headers)
If your RDF data is in another format, change the Content-Type header accordingly; here is a list:
n3: text/n3; charset=utf-8
nt: text/plain
rdf: application/rdf+xml
owl: application/rdf+xml
nq: application/n-quads
trig: application/trig
jsonld: application/ld+json
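For example, a small helper that picks the header from this list based on the file extension (a sketch that reuses the /mydataset/data endpoint from the requests example above; extend the mapping as needed):
import os
import requests
content_types = {
    '.ttl': 'text/turtle;charset=utf-8',
    '.n3': 'text/n3; charset=utf-8',
    '.nt': 'text/plain',
    '.rdf': 'application/rdf+xml',
    '.owl': 'application/rdf+xml',
    '.nq': 'application/n-quads',
    '.trig': 'application/trig',
    '.jsonld': 'application/ld+json',
}
def upload(path, endpoint='http://localhost:3030/mydataset/data?default'):
    ext = os.path.splitext(path)[1].lower()
    headers = {'Content-Type': content_types[ext]}
    with open(path, 'rb') as f:
        return requests.post(endpoint, data=f, headers=headers)
r = upload('test.ttl')
print(r.status_code)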
I tried to upload a file to Fuseki using curl, wget, ./s-post and ./s-put with no effect. I generated the request with Postman's help. If someone, like me, is looking for the correct curl request, this is it:
curl --location --request POST 'http://{FUSEKIADDRESS}/{YOURDATASET}/data' --header 'Content-Type: multipart/form-data' --form 'file.ttl=@PATHtoFILE/file.ttl'

Indexer IOException job fail while Indexing nutch crawled data in “Bluemix” solr

I'm trying to index Nutch-crawled data with Bluemix Solr. I used the following command in my command prompt:
bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*
But it fails to finish the indexing. The result is as follows:
Indexer: starting at 2016-06-16 16:31:50
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 153 documents
Indexing 153 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I guess it has something to do with the solr.server.url address, maybe the end of it. I changed it in different ways,
e.g.
"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update"
(since that is what is used for indexing JSON/CSV/... files with Bluemix Solr), but no luck so far.
Does anyone know how I can fix it? And if the problem is as I guessed, does anyone know what exactly the solr.server.url should be?
By the way, "example_collection" is my collection's name, and I'm working with Nutch 1.11.
As far as I know, indexing Nutch-crawled data in Bluemix R&R with the index command provided in Nutch itself (bin/nutch index ...) is not possible.
I realized that to index Nutch-crawled data in the Bluemix Retrieve and Rank service one should:
Crawl seeds with Nutch, e.g.
$:bin/crawl -w 5 urls crawl 25
You can check the status of the crawl with:
bin/nutch readdb crawl/crawldb/ -stats
Dump the crawled data as files:
$:bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
Post those that can be posted directly, e.g. XML files, to the Solr collection on Retrieve and Rank:
import subprocess
Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' %(solr_cluster_id, solr_collection_name)
cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_xml, solr_credentials, Post_url, myfilename)
subprocess.call(cmd, shell=True)
Convert the rest to json with Bluemix Doc-Conv service:
doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
cmd = '''curl -X POST -u %s -F config="{\\"conversion_target\\":\\"answer_units\\"}" -F file=@%s %s''' %(doc_conv_credentials, myfilename, doc_conv_url)
process = subprocess.Popen(cmd, shell= True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
Then save these JSON results in a JSON file.
Post this JSON file to the collection:
Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
cmd = '''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
subprocess.call(cmd,shell=True)
Send Queries:
pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
results = pysolr_client.search(Query_term)
print(results.docs)
The code is in Python.
For beginners: you can use the curl commands directly in your CMD. I hope it helps others.
like this:
nutch index -Dsolr.server.url=http://username:password@localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20170816191100/ -filter -normalize -deleteGone
it works.