I am using Data Hub Quick Start V5.4.0 and MarkLogic Version 10.0-6.0.
I have a few tdex files in data-hub-FINAL-SCHEMAS.
In the MarkLogic Query Console, when I select data-hub-FINAL and execute the SPARQL query:
SELECT * WHERE { ?s ?p ?o }
I am able to get the triples in the output.
When I manually update one of the tdex files (in the MarkLogic Query Console) and run the SPARQL query again, I am not able to see the modification reflected in the triples.
I want to know how to edit the tdex file so that I get the updated triples when I run the SPARQL query.
This may be related to security. Are you sure that the user has the correct read permissions on the TDE template once you re-insert it into the database manually from QueryConsole?
I would compare the TDE template permissions before and after the modification and see if perhaps you dropped some permissions on the replaced TDE without realizing it.
We are using Airflow 2.1.4 via Google Cloud Composer and are referencing our queries via the BigQueryInsertJobOperator, and for the query we reference a path on the Composer GCS bucket (i.e. "query" : "{% include ' ...). This works fine except that we have some DAGs where the first step compiles new queries that are then referenced by subsequent stages. In those cases, the DAG does not pick up the newly generated queries but always takes the ones that were present before.
Is there a parameter we can set so that the operator refreshes at a certain interval and takes the latest query file available rather than a cached copy of a previous file?
Thank you for your help.
I am interested in performance testing my query in Redshift.
I would like to disable the query from using any cached results from prior queries. In other words, I would like the query to run from scratch. Is it possible to disable cached results only for the execution of my query?
I do not want to disable cached results for the entire database or for all queries.
SET enable_result_cache_for_session TO OFF;
From enable_result_cache_for_session - Amazon Redshift:
Specifies whether to use query results caching. If enable_result_cache_for_session is on, Amazon Redshift checks for a valid, cached copy of the query results when a query is submitted. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn’t execute the query. If enable_result_cache_for_session is off, Amazon Redshift ignores the results cache and executes all queries when they are submitted.
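For example, to benchmark a single query without touching cluster-wide settings, you could turn the setting off just in your own session, run the query, and turn it back on afterwards (the table name below is hypothetical):
SET enable_result_cache_for_session TO OFF;
SELECT COUNT(*) FROM my_large_table; -- hypothetical test query, executed from scratch
SET enable_result_cache_for_session TO ON;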
I ran across this during a benchmark today and wanted to add an alternative. The benchmark tool I was using has a setup and teardown, but they don't run in the same session/transaction, so the enable_result_cache_for_session setting was having no effect. So I had to get a little clever.
From the Redshift documentation:
Amazon Redshift uses cached results for a new query when all of the following are true:
The user submitting the query has access permission to the objects used in the query.
The table or views in the query haven't been modified.
The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
The query doesn't reference Amazon Redshift Spectrum external tables.
Configuration parameters that might affect query results are unchanged.
The query syntactically matches the cached query.
In my case, I just added a GETDATE() column to the query to force it to not use the result cache on each run.
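As a rough sketch of that workaround: because GETDATE() must be evaluated on every run, Redshift will not serve the result from the cache (the query below is just a placeholder, not the actual benchmark query):
SELECT col1, COUNT(*), GETDATE() AS cache_buster -- GETDATE() defeats the result cache
FROM my_table
GROUP BY col1;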
Is there a way to use SPARQL to dump all RDF graphs from a triplestore (Virtuoso) to a .sparql file containing all INSERT queries to rebuild the graphs?
Like the mysqldump command?
RDF databases are essentially schemaless, which means a solution like mysqldump is not really necessary: you don't need any queries to re-create the database schema (table structures, constraints, etc.); a simple data dump contains all the necessary info to re-create the database.
So you can simply export your entire database to an RDF file, in N-Quads or TriG format (you need to use one of these formats because other formats, like RDF/XML or Turtle, do not preserve named graph information).
I'm not sure about the native Virtuoso approach to do this (perhaps it has an export/data dump option in the client UI), but since Virtuoso is Sesame/RDF4J-compatible you could use the following bit of code to do this programmatically:
Repository rep = ... ; // your Virtuoso repository
File dump = new File("/path/to/file.nq");
try (RepositoryConnection conn = rep.getConnection()) {
    // N-Quads preserves named graph (context) information in the dump
    conn.export(Rio.createWriter(RDFFormat.NQUADS, new FileOutputStream(dump)));
}
Surprisingly enough, the Virtuoso website and documentation include this information.
You don't get a .sparql file as output, because RDF always uses the same triple (or quad) "schema", so there's no schema definition in such a dump; just data.
The dump procedures are run through the iSQL interface.
To dump a single graph — just a lot of triples — you can use the dump_one_graph stored procedure.
SQL> dump_one_graph ('http://daas.openlinksw.com/data#', './data_', 1000000000);
To dump the whole quad store (all graphs except the Virtuoso-internal virtrdf:), you can use the dump_nquads stored procedure.
SQL> dump_nquads ('dumps', 1, 10000000, 1);
There are many load options; we'd generally recommend the Bulk Load Functions for such a full dump and reload.
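As a rough sketch of a reload with the bulk loader (run from iSQL; the directory, file mask, and NULL graph argument are assumptions based on the dump_nquads call above, and the directory must be permitted by the server's DirsAllowed setting):
SQL> ld_dir ('dumps', '*.nq.gz', NULL); -- mask and NULL graph are assumptions; for N-Quads the graph IRIs come from the data
SQL> rdf_loader_run ();
SQL> checkpoint;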
(ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)
I'm trying to copy the data from a Sesame repository to another triplestore. I tried the following query:
ADD <http://my.ip.ad.here:8080/openrdf-workbench/repositories/rep_name> TO <http://dydra.com/username/rep_name>
The query gets executed with output as 'true' but no triples are added.
So, I tried a similar query to see if I can move data from one Sesame repository to another using SPARQL Update:
ADD <http://my.ip.ad.here:8080/openrdf-workbench/repositories/source_rep> TO <http://my.ip.ad.here:8080/openrdf-workbench/repositories/destination_rep>
Again, the query gets executed but no triples are added.
What am I doing incorrectly here? Is the URL I am using for repositories OK or does something else need to be changed?
The SPARQL ADD operation copies named graphs (or 'contexts', as they are known in Sesame). The update operates on a single repository (the one on which you execute it) - it doesn't copy data from one repository to the other.
To copy data from one repository to the other via a SPARQL update, you need to use an INSERT operation with a SERVICE clause:
INSERT { ?s ?p ?o }
WHERE {
  SERVICE <http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name> { ?s ?p ?o }
}
(note that the above will not preserve context / named graph information from your source repo)
Alternatively, you can just copy over via the API, or via the Workbench by using http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements as the URL of the RDF file you wish to upload. More details on this in my answer to this related question.
I am running into a serious "Resources Exceeded During Query Execution" error when querying a large Google BigQuery table (105M records) with GROUP EACH BY and ORDER BY clauses.
Here is a sample query (which uses the public Wikipedia data set):
SELECT Id, Title, COUNT(*) FROM [publicdata:samples.wikipedia] GROUP EACH BY Id, Title ORDER BY Id, Title DESC
How can I solve this without adding a LIMIT clause?
Using ORDER BY on big-data databases is not an ordinary operation, and at some point it exceeds the limits of the available resources. You should consider sharding your query or running the ORDER BY on your exported data.
As I explained to you today in your other question, adding allowLargeResults will allow you to return a large response, but you can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
One option here that you may try is sharding your query.
where ABS(HASH(Id) % 4) = 0
You can play with the above parameters to produce smaller result sets and then combine them.
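Putting the pieces together, a sharded version of the sample query might look like this (the modulus of 4 is only an example; run one such query per remainder 0 through 3 and combine the outputs afterwards):
SELECT Id, Title, COUNT(*)
FROM [publicdata:samples.wikipedia]
WHERE ABS(HASH(Id) % 4) = 0
GROUP EACH BY Id, Title
ORDER BY Id, Title DESC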
Also read Chapter 9 - Understanding Query Execution; it explains how sharding works internally.
You should also read Launch Checklist for BigQuery
I've run into the same problem and fixed it by following these steps:
Run the query without ORDER BY and save in a dataset table.
Export the content from that table to a bucket in GCS using wildcard (BUCKETNAME/FILENAME*.csv)
Download the files to a folder in your machine.
Install XAMPP (if you get a UAC warning, you may need to change some settings afterwards).
Start Apache and MySQL in your XAMPP control panel.
Install HeidiSQL and establish the connection to your MySQL server (installed with XAMPP).
Create a database and a table with its fields.
Go to Tools > Import CSV file, configure accordingly and import.
Once all data is imported, do the ORDER BY and export the table.
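As a rough sketch of that final step, assuming the imported table and columns were named along the lines of the original query (both names below are hypothetical):
SELECT Id, Title, Cnt -- hypothetical column names from the CSV import
FROM wikipedia_counts -- hypothetical table created via HeidiSQL
ORDER BY Id, Title DESC;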