Bulk loading data into Virtuoso database - sparql

I am new to Virtuoso and I want to bulk load data (some .ttl files placed in a directory) into Virtuoso, so I can perform SPARQL queries on the graph. So far, I am confused with the process. Here's what I found:
define the data directory in virtuoso.ini
run this script in iSQL:
ld_dir('path_name', '*.*', 'graph_name');
rdf_loader_run();
However, I don't know how to define my directory in virtuoso.ini, or whether I need to do anything else, since the documentation is very confusing. Assuming my files are in this directory: E:\git\Virtuoso Data\ttls, how do I bulk load them (how do I define the path in virtuoso.ini, and so on)?
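For reference, here is a minimal sketch of what I believe the setup looks like (the graph IRI is a placeholder, and corrections are welcome): the directory has to be listed in DirsAllowed under [Parameters] in virtuoso.ini, the server restarted, and the loader then driven from iSQL.
; virtuoso.ini
[Parameters]
DirsAllowed = ., ../vad, E:/git/Virtuoso Data/ttls
-- in iSQL, after restarting the server
SQL> ld_dir('E:/git/Virtuoso Data/ttls', '*.ttl', 'http://example.com/mygraph');
SQL> rdf_loader_run();
SQL> checkpoint;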

Related

Azure Data Factory creates .CSV that's incompatible with Power Query

I have a pipeline that creates a dataset from a stored procedure on Azure SQL Server.
I then want to manipulate it in a Power Query step within the factory, but it fails to load in the Power Query editor with this error.
It opens up the JSON file (to correct it, I assume) but I can't see anything wrong with it.
If I download the extract from blob and upload it again as a .csv then it works fine.
The only difference I can find is that if I upload a blob direct to storage then the file information for the blob looks like this:
If I just let ADF create the .csv in blob storage the file info looks like this:
So my assumption is that somewhere in the process in ADF that creates the .csv file it's getting something wrong, and the Power Query module can't recognise it as a valid file.
All the other parts of the pipeline (Data Flows, other datasets) recognise it fine, and the 'preview data' brings it up correctly. It's just PQ that won't read it.
Any thoughts? TIA
I reproduced the same behaviour. When data is copied from a SQL database to blob storage as a .csv file, Power Query is unable to read it (Power Query also doesn't support JSON files). But when I downloaded the .csv file and re-uploaded it, it worked.
Below are the steps to overcome this issue.
When I uploaded the file to Blob storage manually and created a dataset for it, the schema was imported from the connection/store. The dataset forces you to import a schema either from the connection/store or from a sample file; there is no 'none' option here.
When data is copied from the SQL database to Azure Blob storage by the pipeline, the dataset that points at the blob does not have a schema imported by default.
Once I imported the schema, the Power Query activity ran successfully.
Output before importing schema in dataset
After importing schema in dataset
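For illustration, this is roughly what the DelimitedText dataset JSON looks like once a schema has been set; the dataset name, linked service, container, file, and columns below are placeholders rather than the ones from the original pipeline.
{
    "name": "OutputCsvDataset",
    "properties": {
        "linkedServiceName": { "referenceName": "AzureBlobStorage1", "type": "LinkedServiceReference" },
        "type": "DelimitedText",
        "typeProperties": {
            "location": { "type": "AzureBlobStorageLocation", "container": "output", "fileName": "extract.csv" },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        },
        "schema": [
            { "name": "Id", "type": "String" },
            { "name": "Name", "type": "String" }
        ]
    }
}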

Pig variable storage

Pig uses variables to store the data.
When I load data from HDFS into a variable in Pig, where is the data temporarily stored?
What exactly happens in the background when we load the data into the variable?
Kindly help.
Pig lazily evaluates most expressions; in most cases it only checks for syntax errors and the like. For example,
a = load 'hdfs://I/Dont/Exist';
won't throw an error unless you use STORE or DUMP or something along those lines that results in the evaluation of a.
Similarly, if a file exists and you load it into a relation and perform transformations on it, the file is usually spooled to the /tmp folder and the transformations are then performed. If you look at the messages that appear when you run commands in grunt, you'll notice file paths starting with file:///tmp/xxxxxx_201706171047235; these are the files that store intermediate data.
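As a small illustration (the HDFS path and schema here are made up), nothing is read at the LOAD step; data only moves once a DUMP or STORE forces evaluation:
-- a hypothetical CSV on HDFS; LOAD alone reads nothing, only the syntax is checked
a = LOAD 'hdfs:///data/sales.csv' USING PigStorage(',') AS (id:int, amount:double);
b = FILTER a BY amount > 100.0;
-- only DUMP (or STORE) triggers evaluation; intermediate data is spooled under /tmp
DUMP b;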

Error while loading AVRO files to BigQuery

I have successfully loaded a large number of AVRO files (of the same schema type, into the same table), stored on Google Cloud Storage, using the bq CLI utility.
However, for some of the AVRO files I am getting a very cryptic error while loading into BigQuery. The error says:
The Apache Avro library failed to read data with the following error: EOF reached (error code: invalid)
With avro-tools I validated that the AVRO file is not corrupted; the repair report output:
java -jar avro-tools-1.8.1.jar repair -o report 2017-05-15-07-15-01_48a99.avro
Recovering file: 2017-05-15-07-15-01_48a99.avro
File Summary:
Number of blocks: 51
Number of corrupt blocks: 0
Number of records: 58598
Number of corrupt records: 0
I tried creating a brand new table with one of the failing files, in case it was due to a schema mismatch, but that didn't help: the error was exactly the same.
I need help figuring out what could be causing the error here.
No way to pinpoint the issue without more information, but I ran into this error message and filed a ticket here.
In my case, a number of files in a single load job were missing columns, which was causing the error.
Explanation from the ticket:
BigQuery uses the alphabetically last file from the directory as the avro schema to read the other Avro files. I suspect the issue is with schema incompatibility between the last file and the "problematic" file. Do you know if all the files have the exact same schema or differ? One thing you could try to help verify this is to copy the alphabetically last file of the directory and the "problematic" file to a different folder and try to load those two files in one BigQuery load job and see if the error reproduces.
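A hedged sketch of that two-file test (the bucket, dataset, and the "last" file name are placeholders): bq load accepts a comma-separated list of URIs, so the alphabetically last file and the problematic file can be loaded together into a scratch table.
bq load --source_format=AVRO mydataset.avro_schema_check \
    gs://my-bucket/avro/2017-05-15-23-59-59_zzzzz.avro,gs://my-bucket/avro/2017-05-15-07-15-01_48a99.avro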

Generate a .sparql file in order to backup rdf graphs

Is there a way to use SPARQL to dump all RDF graphs from a triplestore (Virtuoso) to a .sparql file containing all INSERT queries to rebuild the graphs?
Like the mysqldump command?
RDF databases are essentially schemaless, which means a solution like mysqldump is not really necessary: you don't need any queries to re-create the database schema (table structures, constraints, etc.); a simple data dump contains all the information needed to re-create the database.
So you can simply export your entire database to an RDF file, in N-Quads or TriG format (you need to use one of these formats because other formats, like RDF/XML or Turtle, do not preserve named graph information).
I'm not sure about the native Virtuoso approach to do this (perhaps it has an export/data dump option in the client UI), but since Virtuoso is Sesame/RDF4J-compatible you could use the following bit of code to do this programmatically:
Repository rep = ... ; // your Virtuoso repository
File dump = new File("/path/to/file.nq");
try (RepositoryConnection conn = rep.getConnection()) {
    // export every statement, including its named-graph context, as N-Quads
    conn.export(Rio.createWriter(RDFFormat.NQUADS, new FileOutputStream(dump)));
}
Surprisingly enough, the Virtuoso website and documentation include this information.
You don't get a .sparql file as output, because RDF always uses the same triple (or quad) "schema", so there's no schema definition in such a dump; just data.
The dump procedures are run through the iSQL interface.
To dump a single graph — just a lot of triples — you can use the dump_one_graph stored procedure.
SQL> dump_one_graph ('http://daas.openlinksw.com/data#', './data_', 1000000000);
To dump the whole quad store (all graphs except the Virtuoso-internal virtrdf:), you can use the dump_nquads stored procedure.
SQL> dump_nquads ('dumps', 1, 10000000, 1);
There are many load options; we'd generally recommend the Bulk Load Functions for such a full dump and reload.
(ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)

Load several files into spark-sql from SQL shell

I am trying to load a directory's worth of files (no recursion) into Spark-SQL, from the spark-sql> shell.
My query is roughly:
CREATE TABLE table
USING org.apache.spark.sql.parquet
OPTIONS (
path "s3://bucket/folder/*"
);
Cheers.
edit:
"s3:/bucket/" seems to recursively load everything. I now have the issue whereby I cannot use wildcards in the url. So s3://bucket/*/month=06" gives a 'No such file or directory' FileNotFoundException.