I would like to generate inferences over an ontology with the OWL API, or directly with a reasoner such as HermiT. That part is easy (code shown below). However, I would like to cache/store/save those results so that I don't have to rerun the reasoner each time. Here is what happens in the VM. The first time I call this:
Set<OWLClass> classes = reasoner.getSubClasses(parent, false).getFlattened();
it takes about 20 seconds (in this example). The second time it takes no time at all to produce the same results, because they're cached. That is great. However, if I bounce my VM, I have to re-run the reasoner (and I have some fairly long queries) or save the output to a database so I can display the results right away.
A Reasoner is not serializable (neither HermiT's, nor one obtained via the OWL API), and there seems to be no obvious way to recreate a Reasoner and load any previous results into it. They have to be recalculated each time.
Serializing the ontology is straightforward, but I don't see any way of reloading the results into the reasoner to save you from having to make the same calls again.
Is there something I'm missing? Would Jena be a better choice for this?
OWLOntologyManager owlOntologyManager = OWLManager.createOWLOntologyManager();
File file = new File("resource/RDFData/release", "NEMOv2.85_GAFLP1_diffwave_data.rdf");
IRI iri = IRI.create(file);
OWLDataFactory factory = owlOntologyManager.getOWLDataFactory();
OWLOntology owlOntology = owlOntologyManager.loadOntologyFromOntologyDocument(iri);
Reasoner reasoner = new Reasoner(owlOntology); // HermiT
OWLClass parent = factory.getOWLClass(IRI.create(DataSet.NS + "#NEMO_0877000"));
// First call takes ~20 seconds; repeat calls are answered from the reasoner's in-memory cache.
Set<OWLClass> classes = reasoner.getSubClasses(parent, false).getFlattened();
// This attempt fails: neither HermiT's Reasoner nor the OWL API reasoner
// interfaces implement Serializable, so writeObject(..) throws a
// NotSerializableException (a subclass of IOException).
FileOutputStream fos = null;
ObjectOutputStream out = null;
try {
    fos = new FileOutputStream("reasoner.ser");
    out = new ObjectOutputStream(fos);
    out.writeObject(reasoner);
    out.close();
} catch (IOException ex) {
    ex.printStackTrace();
}
Saving the inferred ontology (which you can obtain from the InferredOntologyGenerator) seems to be the best method.
From what I understand, though, the reasoner will still check consistency and classify the ontology when you create a new reasoner; it is just faster because you have made all the inferences explicit.
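For example, here is a minimal sketch using the OWL API's InferredOntologyGenerator, reusing owlOntologyManager and reasoner from the question's code. The generator list and the "inferred.owl" file name are illustrative, and the fillOntology signature shown is the OWL API 3.x style (4.x/5.x differ slightly):
List<InferredAxiomGenerator<? extends OWLAxiom>> generators =
        new ArrayList<InferredAxiomGenerator<? extends OWLAxiom>>();
generators.add(new InferredSubClassAxiomGenerator());
generators.add(new InferredClassAssertionAxiomGenerator());

// Materialize the inferences into a fresh ontology.
InferredOntologyGenerator iog = new InferredOntologyGenerator(reasoner, generators);
OWLOntology inferredOntology = owlOntologyManager.createOntology();
iog.fillOntology(owlOntologyManager, inferredOntology); // OWL API 4.x takes an OWLDataFactory here

// Save the materialized ontology; a later session can load this file and answer
// getSubClasses() etc. without re-running HermiT's full classification.
owlOntologyManager.saveOntology(inferredOntology,
        IRI.create(new File("inferred.owl").toURI()));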
I followed the instructions at https://graphdb.ontotext.com/documentation/9.4/free/shacl-validation.html, and it worked as documented. However, once that was done, I found no way to inspect the shape graph configured for my repository.
The special graph <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph> is nowhere to be found; it does not appear on the 'Graphs overview' screen, and it is not accessible via SPARQL queries.
Shape graphs currently cannot be queried with SPARQL inside GraphDB, as they are not part of the data. One way to inspect the shape graph is to use an RDF4J client to connect to the GraphDB repository. You can fetch all the statements inside the shape graph with the following code snippet:
HTTPRepository repository = new HTTPRepository("http://address:port/", "repositoryname");
try (RepositoryConnection connection = repository.getConnection()) {
    Model statementsCollector = new LinkedHashModel(
        connection.getStatements(null, null, null, RDF4J.SHACL_SHAPE_GRAPH)
            .stream()
            .collect(Collectors.toList()));
}
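To actually read the shapes, you can then serialize the collected model inside the same try block, for example with RDF4J's Rio writer (a minimal sketch):
// Print the shape graph as Turtle for inspection.
Rio.write(statementsCollector, System.out, RDFFormat.TURTLE);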
For more information on accessing and updating SHACL shape graphs, see https://rdf4j.org/documentation/programming/shacl/.
I am experiencing a performance issue related to the default batch size of the query ResultSender in a client/server configuration. I believe the default value is 100.
If I run a simple query to get keys (with some ORDER BY columns due to the PARTITION Region type), this default batch size causes too many chunks to be sent back, even for 1000 records. In my tests, even though the total query time is less than 100 ms, the app takes more than 10 seconds to process those chunks.
Reading between the lines of your problem statement, it seems you are:
1. Executing an OQL query on a PARTITION Region (PR).
2. Running the query inside a Function, as recommended when executing queries on a PR.
3. Sending batch results (as opposed to streaming the results).
I also assume, since you posted exclusively in the #spring-data-gemfire channel, that you are using Spring Data GemFire (SDG) to:
1. Execute the query (e.g. by using the SDG GemfireTemplate; of course, you could also be using the GemFire Query API inside your Function directly).
2. Implement the server-side Function using SDG's Function annotation support.
3. Possibly (indirectly) use SDG's BatchingResultSender, as described in the documentation.
NOTE: The default batch size in SDG is 0, NOT 100. Zero means stream the results individually.
Regarding #2 & #3, your implementation might look something like the following:
@Component
class MyApplicationFunctions {

    @GemfireFunction(id = "MyFunction", batchSize = 1000)
    public List<SomeApplicationType> myFunction(FunctionContext functionContext) {

        RegionFunctionContext regionFunctionContext =
            (RegionFunctionContext) functionContext;

        Region<?, ?> region = regionFunctionContext.getDataSet();

        if (PartitionRegionHelper.isPartitionedRegion(region)) {
            region = PartitionRegionHelper.getLocalDataForContext(regionFunctionContext);
        }

        GemfireTemplate template = new GemfireTemplate(region);

        String OQL = "...";

        SelectResults<?> results = template.query(OQL); // or `template.find(OQL, args);`

        List<SomeApplicationType> list = ...;

        // process results, convert to SomeApplicationType, add to list

        return list;
    }
}
NOTE: Since you are most likely executing this Function "on Region", the FunctionContext will actually be a RegionFunctionContext in this case.
The batchSize attribute on SDG's @GemfireFunction annotation (used for Function "implementations") allows you to control the batch size.
Instead of using SDG's GemfireTemplate to execute queries, you can, of course, use the GemFire Query API directly, as mentioned above.
If you need even more fine-grained control over "result sending", then you can simply "inject" the ResultSender provided by GemFire into the Function, even when the Function is implemented with SDG, as shown above. For example, you can do:
@Component
class MyApplicationFunctions {

    @GemfireFunction(id = "MyFunction")
    public void myFunction(FunctionContext functionContext, ResultSender resultSender) {

        ...

        SelectResults<?> results = ...;

        // now process the results and use the `resultSender` directly
    }
}
This allows you to "send" the results however you see fit, as required by your application: you can batch/chunk the results, stream them, whatever.
Although, you should be mindful of the "receiving" side in this case!
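For instance, here is a hedged sketch of manual chunking inside that Function body, using the injected resultSender. The batch size and the choice of List as the chunk type are illustrative, not prescriptive:
int batchSize = 1000;
List<Object> batch = new ArrayList<>(batchSize);

for (Object result : results) {
    batch.add(result);
    if (batch.size() == batchSize) {
        resultSender.sendResult(new ArrayList<>(batch)); // send one full chunk
        batch.clear();
    }
}

// Send the final (possibly partial) chunk; lastResult(..) also signals completion.
resultSender.lastResult(batch);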
The one thing that might not be apparent to the average GemFire user is that GemFire's default ResultCollector implementation collects all the results before returning them to the application. That means the receiving side, by default, does not stream or batch/chunk the results in a way that would let them be processed as soon as the server sends them (streamed, batched/chunked, or otherwise).
Once again, SDG helps you out here since you can provide a custom ResultCollector on the Function "execution" (client-side), for example:
@OnRegion("SomePartitionRegion", resultCollector = "myResultCollector")
interface MyApplicationFunctionExecution {
    void myFunction();
}
In your Spring configuration, you would then have:
@Configuration
class ApplicationGemFireConfiguration {

    @Bean
    ResultCollector myResultCollector() {
        return ...;
    }
}
Your "custom" ResultCollector could return results as a stream, a batch/chunk at a time, etc.
In fact, I have prototyped a "streaming" ResultCollector implementation that will eventually be added to SDG, here.
Anyway, this should give you some ideas on how to handle the performance problem you seem to be experiencing. 1000 results is not a lot of data, so I suspect your problem is mostly self-inflicted.
Hope this helps!
John,
Just to clarify: I use a client/server topology (actually WAN, but that is not important here). My client is a Spring Boot web app with a Kendo grid as its UI. Users can filter/sort on any combination of the columns, which is passed to the Spring Boot app to generate dynamic OQL and create the pagination. So far, except for being dynamic, my OQL queries are quite straightforward. I do not want to introduce server-side Functions due to the complexity of our global deployment process, but I can if you think that is something I have to do.
Again, thanks for your answers.
I have stitched a lot of small XML files into one file, and then made a custom extractor that returns one row per stitched file, each with a byte array holding that file's XML.
Run on remote/master:
- For one file (gzipped, 11 MB), it works fine.
- For more than one file, I get a System.OutOfMemoryException.
Run on local/master:
- For one or more files (gzipped, 500+ MB), it works fine.
The extractor looks like this:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();

        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);

        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;

        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml));
            yield return output.AsReadOnly();
        }

        yield break;
    }
}
and the error message looks like this:
==== Caught exception System.OutOfMemoryException
at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924
So what am I doing wrong? And how do I debug this on remote?
Thanks!
Unfortunately, the local run does not enforce memory limits, so you would have to check memory usage during local vertex debugging yourself.
Looking at your code above, I see that you are loading XML documents into a DOM. Note that an XML DOM can explode the data size from the string representation by a factor of up to 10 or more (I have seen 2 to 12 in my time as the resident SQL XML guru).
Each UDO today gets only 1/2 GB of RAM to play with, so I assume your XML DOM documents start going beyond that.
The normal recommendation is to use the XmlReader interface (there is a reader extractor in the samples at http://usql.io as well) and scan through the document(s) to find the information you are looking for.
If your documents are always small enough (e.g., < 20 MB), you may want to make sure that you release the memory of previous documents and operate on one document at a time.
We do have plans to allow you to annotate your UDO with memory needs, but that is still a bit out.
I have written a custom record reader and am looking for sample test code to test it using MRUnit or any other testing framework. It's working fine functionally, but I would like to add test cases before I make an install. Any help would be appreciated.
In my opinion, a custom record reader is like any iterator. For testing my record readers I have been able to work without MRUnit or any other Hadoop JUnit framework. The tests execute quickly and the footprint is small too. Initialize the record reader in your test case and keep iterating on it. Here is pseudocode from one of my tests. I can provide more details if you want to proceed in this direction.
Configuration conf = new Configuration();

// Configure the job and provide the input format configuration.
Job job = Job.getInstance(conf, "test");
conf = job.getConfiguration();

MyInputFormat myInputFormat = new MyInputFormat();

// Verify split type and count if you want to test the input format as well.
List<InputSplit> splits = myInputFormat.getSplits(job);

TaskAttemptContext context = new TaskAttemptContextImpl(conf, new TaskAttemptID());
RecordReader<LongWritable, Text> reader =
        myInputFormat.createRecordReader(splits.get(1), context);
reader.initialize(splits.get(1), context);

for (int i = 0; i < expectedRecordCount; i++) {
    assertTrue(reader.nextKeyValue());
    // Verify key and value.
    assertEquals(new LongWritable(expectedKey), reader.getCurrentKey());
}
I'm working with a complicated xml schema, for which I have created a class structure using xsd.exe (with some effort). I can now reliably deserialize the xml into the generated class structure. For example, consider the following xml from the web service:
<ODM FileType="Snapshot" CreationDateTime="2009-10-09T19:58:46.5967434Z" ODMVersion="1.3.0" SourceSystem="XXX" SourceSystemVersion="999">
<Study OID="2">
<GlobalVariables>
<StudyName>Test1</StudyName>
<StudyDescription/>
<ProtocolName>Test0001</ProtocolName>
</GlobalVariables>
<MetaDataVersion OID="1" Name="Base Version" Description=""/>
<MetaDataVersion OID="2" Name="Test0001" Description=""/>
<MetaDataVersion OID="3" Name="Test0002" Description=""/>
</Study>
</ODM>
I can deserialize the xml as follows:
public ODMcomplexTypeDefinitionStudy GetStudy(string studyId)
{
    ODMcomplexTypeDefinitionStudy study = null;

    ODM odm = Deserialize<ODM>(Service.GetStudy(studyId));

    if (odm.Study.Length > 0)
        study = odm.Study[0];

    return study;
}
Service.GetStudy() returns an HTTPResponse stream from the web service. And Deserialize() is a helper method that deserializes the stream into the object type T.
My question is this: is it more efficient to let the deserialization process create the entire class structure and deserialize the xml, or is it more efficient to grab only the xml of interest and deserialize that xml. For example, I could replace the above code with:
public ODMcomplexTypeDefinitionStudy GetStudy(string studyId)
{
    ODMcomplexTypeDefinitionStudy study = null;

    using (XmlReader reader = XmlReader.Create(Service.GetStudy(studyId)))
    {
        XDocument xdoc = XDocument.Load(reader);
        XNamespace odmns = xdoc.Root.Name.Namespace;
        XElement elStudy = xdoc.Root.Element(odmns + "Study");
        study = Deserialize<ODMcomplexTypeDefinitionStudy>(elStudy.ToString());
    }

    return study;
}
I suspect that the first approach is preferred: there is a lot of DOM manipulation going on in the second example, and the deserialization process must have optimizations. However, what happens when the xml grows dramatically? Say the source returns 1 MB of xml and I'm really only interested in a very small component of it. Should I let the deserialization process fill up the containing ODM class with all its arrays and properties of child nodes, or just go get the child node, as in the second example?
Brett,
Later versions of .NET will build custom serializer assemblies. Go to project properties -> Build, look for "Generate serialization assemblies", and change it to On. The XML deserializer will use these assemblies, which are customized to the classes in your project. They are much faster and less resource-intensive, since reflection is not involved.
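If you prefer to set this outside the IDE, to my knowledge the equivalent MSBuild property in the project file is:
<PropertyGroup>
  <!-- Pre-generate XmlSerializer assemblies at build time (Off / On / Auto). -->
  <GenerateSerializationAssemblies>On</GenerateSerializationAssemblies>
</PropertyGroup>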
I would go this route so that if your classes change you will not have to worry about serialization issues. Performance should not be a problem.
I recommend that you not pre-optimize. If you have your code working, then use it as it is; go on to work on some code that is not finished, or which does not work.
Later, if you find you have a performance problem in that area, you can explore it then.