Do semantic tools like Anzo create a copy of the data?

I'm new to semantic technologies. I understand what RDF, OWL, ontologies and the other basic terms are, and how semantic search uses them. When we create a semantic search module using Anzo with enterprise search capabilities, it connects to various data sources and creates relationships between them. Now I'm interested in knowing what a semantic tool like Anzo does internally.
1. Does it create a copy of the data on the local machine, or does it hit the data sources every time we execute a SPARQL query?
2. If it stores data, is the data stored in its raw format, or is it stored after cleaning and creating semantic relationships?
3. What happens to the data after a query is executed? How does it get current data every time?
Any thoughts on this would be valuable to me.
Thanks a lot in advance!

Based on your comments, it appears you're using the Anzo Graph Query Engine. If so, the answers to your questions are:
1. A copy of the data is held in memory.
2. This is not clear from any of the published information.
3. It doesn't. You need to load the data using the 'LOAD' command.
A bit more on 3: you would be responsible for implementing a mechanism to keep the data here up to date with the underlying data source. That might be as simple as rebuilding the graph from a nightly dump, or it might mean implementing change data capture against the underlying store that replicates CRUD operations onto the graph.
My answers are based on the marketing and support information available on the CambridgeSemantics site.
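As a rough illustration of the change-data-capture idea in point 3, here is a minimal Python sketch. It is not Anzo's API: the graph is a plain in-memory set of triples, and the change-event shape (`op`, `table`, `key`, `row`) and both function names are assumptions made up for this example.

```python
def row_to_triples(table, key, row):
    """Turn one source row into (subject, predicate, object) triples."""
    subject = f"{table}/{key}"
    return {(subject, column, str(value)) for column, value in row.items()}


def apply_change(graph, change):
    """Replay one change event from the source system onto the graph copy.

    `graph` is a set of triples standing in for the real graph store.
    `change` is a hypothetical event: {"op", "table", "key", "row"}.
    """
    subject = f"{change['table']}/{change['key']}"
    if change["op"] in ("update", "delete"):
        # Drop every triple previously derived from this row.
        stale = {triple for triple in graph if triple[0] == subject}
        graph.difference_update(stale)
    if change["op"] in ("insert", "update"):
        graph.update(row_to_triples(change["table"], change["key"], change["row"]))


graph = set()
apply_change(graph, {"op": "insert", "table": "customer", "key": 1,
                     "row": {"name": "Ann"}})
apply_change(graph, {"op": "update", "table": "customer", "key": 1,
                     "row": {"name": "Bob"}})
print(sorted(graph))
```

An update is handled as delete-then-insert for the affected subject, which keeps the copy consistent without tracking which individual columns changed.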


One time migration of VSAM files from Mainframe to Cloud Azure

I want to migrate bulk files (e.g. VSAM) from the mainframe to Azure at the beginning of the project. How can that be achieved?
Is there a utility, or do we need to write our own scripts?
I suspect there are some utilities out there, but most if not all of them are probably priced products. Since VSAM datasets are not defined using a language construct like DDL, you will likely have to do most of the heavy lifting yourself, either by writing your own programs or with custom scripts. You didn’t mention the operating system, but I assume you’re working on z/OS.
Here are some things to consider:
The structure of the VSAM dataset is basically record oriented. There are three basic types you’ll run into that host application data:
Key Sequenced Datasets (KSDS)
Entry Sequenced Datasets (ESDS)
Relative Record datasets (RRDS)
Familiarize yourself with the means of defining the datasets, as it will give you some insight into the dataset specifics. DFSMS Access Method Services Commands describes the utilities used to create them and to get information like the key length and the offset of the key. DEFINE CLUSTER is the command that creates the dataset. You mentioned you are moving the data to Azure, but this will help you understand the characteristics of the data you are moving.
Since there is no DDL for VSAM datasets you will generally find the structure in the programs that manipulate them like COBOL Copybooks, HLASM DSECTs and similar constructs. This is the long pole in the tent for you.
Consider the semantics of accessing the data. VSAM as an access method has some ability to control read/write access at a macro level through a DEFINE CLUSTER option called SHAREOPTIONS. SHAREOPTIONS tells the operating system how to handle the VSAM buffers for reading and writing so that multiple processes can access the same data. It's primitive compared to shared file systems like NFS. VSAM also lets the application control access (serialization) using ENQ / DEQ functions. These enable applications to express intent in the cluster about a VSAM file and coordinate their own activities.
You might find that converting a VSAM file to a relational form like Db2 is better for you. Again, you’ll have to create the DDL to describe the tables, data formats and the like.
Another consideration is data conversion. You’ll find there is character data that is most likely in EBCDIC and needs to be converted to a new code page. Numeric data can be in Packed Decimal, Binary, or even text and will need to be converted.
The short answer is there isn’t an “Easy Button” to do what you want. The data itself is only one of the questions that needs to be answered. Others include serialization and access to the data, code page conversion, and, if you are moving some data but not all of it, whether you will need to map the converted data back to data on the mainframe.
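To make the data conversion point concrete, here is a small Python sketch that decodes one fixed-length record. Everything about the layout is an assumption for illustration: a 10-byte EBCDIC name followed by a 3-byte packed decimal amount with two decimal places, and code page `cp037`. The real layout has to come from the copybook or DSECT, and the real code page depends on the system.

```python
from decimal import Decimal


def decode_text(raw, codepage="cp037"):
    """Decode an EBCDIC character field and strip the trailing padding."""
    return raw.decode(codepage).rstrip()


def unpack_comp3(raw, scale=0):
    """Decode a packed decimal (COMP-3) field.

    Every byte holds two decimal digits, except the last byte, which holds
    one digit plus a sign nibble. 0xD is treated as negative here.
    Assumes the field contains valid packed data.
    """
    digits = "".join(f"{byte >> 4}{byte & 0x0F}" for byte in raw[:-1])
    digits += str(raw[-1] >> 4)
    sign = -1 if (raw[-1] & 0x0F) == 0x0D else 1
    return Decimal(sign * int(digits)).scaleb(-scale)


def parse_record(record):
    """Split one record using the hypothetical layout described above."""
    return {
        "name": decode_text(record[0:10]),
        "amount": unpack_comp3(record[10:13], scale=2),
    }


# Build a sample record: "HELLO" padded to 10 bytes, then packed +123.45.
record = "HELLO".ljust(10).encode("cp037") + bytes([0x12, 0x34, 0x5C])
print(parse_record(record))
```

The important detail is that character and numeric fields must be converted separately, field by field. Running a whole record through a code page translation would corrupt the packed and binary fields.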
Consider exploring IBM CDC classic replication. You can set it up with a few clicks.
I have not done this for Azure, so I'm not sure about support there.

How to populate RDF data from an RDBMS-based system

I am new to the semantic web and ontologies. A few weeks ago I started reading papers and taking an online course about them. I have an idea to use an ontology rule-based system to extend the features of my existing reminder system, as can be seen in the attached picture. I've read about ontologies, rules (e.g. SPIN, SPARQL), inference engines (e.g. Jena), RDF, RDFS, OWL etc. I think I've got the general idea.
System Architecture:
However, one thing I am still missing is how to integrate this rule-based system into my current system. The current system's data is stored in an RDBMS (MySQL) database. Any transaction record in the system may be modified some time after it is created. Meanwhile, an ontology-based system, AFAIK, relies on the RDF data format. My thinking is that there should be a way to convert the transaction data from the RDBMS to RDF so that it is ready for use by the ontology system.
My questions are:
1. Is my thinking correct?
2. What is the best practice for this process?
3. When an existing record in the RDBMS is modified, how do I reflect the change in the RDF?
4. In relation to #3, if an RDBMS is not used, how does the ontology system manage its RDF data when an individual property is updated? Does that depend on the underlying triple store? I read that TDB is only able to insert or delete.

How to turn MongoDB collection into a 'Table'

I've been given access to a cloud MongoDB (MongoLab) and need to extract some data into Excel so I can analyse it. The data isn't particularly complicated or large and is well suited to a 'normal' relational structure.
My research suggests things are trickier because the data has 'nested' aspects, although conceptually it's pretty clear how this would become a table. Here is what a document in the collection looks like; essentially, the stuff highlighted blue would be columns in the table, while the yellow would create a row for each "marketing_event", with the specifics of each event also being in a column:
Ideally I would use Power Query to get the data into Power Pivot but at this point anything will do!
I've tried a bunch of things, none of which has got me much closer to the end result I'm looking for:
I downloaded MongoVue which I used to successfully connect to the database and while it enabled me to see the data in a basic table form, it does nothing with the nested stuff and the documentation is minimal in terms of how it could be of more use.
I also tried Pentaho PDI based on this article: but the steps aren't detailed and, although I can see the collection, my attempts to replicate some sample queries I found on the web were totally unsuccessful.
I've tried to get a trial of Simba's ODBC connector but as yet the download doesn't seem to be working. I have contacted them but without response just yet.
I've even installed Mongo locally and tried to use the command prompt to connect which I was unable to do. Even if I pursued this I wouldn't be confident about knowing where to start in terms of creating the end product.
Happy to hear any suggestions or recommendations.
Here's a solid ODBC driver that helps maintain the fidelity of your MongoDB data by exposing the nested MongoDB data model as a set of relational tables to Excel and other ODBC apps. In the sample document above, this driver will do exactly what you're looking for: the embedded documents and arrays can be extracted as separate, related tables from the fields at the root level of the document.
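If no driver is available, the same flattening can be done by hand. The Python sketch below is only an illustration of the idea, not what any ODBC driver does internally: it repeats the root-level fields on every row and emits one row per element of the embedded array. Only the `marketing_event` field name comes from the question; the other field names, the sample document and both helper functions are made up.

```python
import csv
import io


def flatten(document, array_field="marketing_event"):
    """Return one flat row per element of the embedded array."""
    # Root-level fields are repeated on every row (the "blue" columns).
    parent = {k: v for k, v in document.items() if k != array_field}
    # Each array element becomes its own row (the "yellow" part).
    events = document.get(array_field) or [{}]
    return [
        {**parent, **{f"{array_field}.{k}": v for k, v in event.items()}}
        for event in events
    ]


def to_csv(rows):
    """Serialize rows to CSV text that Excel or Power Query can open."""
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical document; only "marketing_event" comes from the question.
doc = {
    "_id": "abc123",
    "customer": "Acme",
    "marketing_event": [
        {"channel": "email", "date": "2014-01-05"},
        {"channel": "webinar", "date": "2014-02-11"},
    ],
}
rows = flatten(doc)
print(to_csv(rows))
```

In practice the documents would come from a MongoDB export or a driver cursor rather than a literal, and deeper nesting would need the same step applied recursively.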
I don't know if you already found the solution, but Simba ODBC provides support for nested arrays.
Have a look here: This is an example of how to connect Tableau BI to MongoDB. You might find it helpful.
And some more information on handling no-sql data in BI tools is provided in this whitepaper:

What database for crawler/scraper?

I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.
The project is an automated web crawler that checks websites as per a user's request, scrapes data under certain circumstances, and creates log files of what was done.
Only few tables with few columns; predefining columns is no problem
No overly complex associations between models
Huge amount of date & time based queries
Due to logging, database will grow rapidly and use up a lot of space
Should be able to scale over multiple servers
Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps
Two different types of servers will simultaneously read/write data directly to/from it:
One(/later more) rails app that takes user input and displays results upon request
One(/later more) Node.js server that functions as the executing crawler/scraper. It will have enough load to run continuously and make dozens of database queries every second.
I assume it will be neither a graph database (no complex associations) nor a memory-based key/value store (too much data to hold in cache). I'm still on the fence about every other type of database I could find; each seems to have its merits.
So, any advice from the pros on how I should decide?
I would agree with Vladimir that you would want to consider a document-based database for this scenario. I am most familiar with MongoDB. My reasons for using it here are as follows:
Your 'schema requirements' of "only a few tables with few columns" fits well with the NoSQL nature of MongoDB.
Same as above for "no overly complex associations between models": you will want to decide whether you'd prefer nested documents or DBRefs (I prefer the former).
Huge amount of time-based data (and other scaling requirements) - MongoDB scales well via sharding or partitioning
Read/write access - this is why I am recommending MongoDB over something like Hadoop. The interactive query requirement is best met by something other than a Hadoop-style store, as this type of storage is designed for batch (rather than interactive query) requirements.
Google built a database called "BigTable" for crawling, indexing and its search-related business. They released a paper about it (search for "BigTable" if you're interested). There are several open source implementations of BigTable-like designs; one of them is Hypertable. We have a blog posting describing a crawler/indexer implementation. Looking at your requirements, all of them are supported and are common use cases.
(Disclaimer: I work for Hypertable.)
Take a look at a document-oriented database like CouchDB or MongoDB.

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and the Google Code wiki page with similar info. There is a lot of data, and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80 to 100 MB.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help finding the best possible one for my situation. I know I will possibly have to work around the limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
It has also been suggested that I split the data across multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, on splitting: if the volume of data or queries will be high, I strongly recommend that you spread the load and design the model with that in mind. It is certainly doable.
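Reproducing the geospatial part on your own, as mentioned above, could start with something like the following Python sketch. It uses the haversine formula and a linear scan, which is fine for a prototype but is not how neo4j-spatial or any indexed store works. The field names `stop_id`, `stop_lat` and `stop_lon` follow the GTFS `stops.txt` file; the function names and sample stops are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stops_within(stops, lat, lon, radius_km):
    """Return the stops inside the radius, nearest first (linear scan)."""
    measured = [
        (haversine_km(lat, lon, stop["stop_lat"], stop["stop_lon"]), stop)
        for stop in stops
    ]
    measured.sort(key=lambda pair: pair[0])
    return [stop for distance, stop in measured if distance <= radius_km]


# Made-up stops for illustration.
stops = [
    {"stop_id": "A", "stop_lat": 40.7128, "stop_lon": -74.0060},
    {"stop_id": "B", "stop_lat": 40.7138, "stop_lon": -74.0050},
    {"stop_id": "C", "stop_lat": 41.5000, "stop_lon": -74.0000},
]
nearby = stops_within(stops, 40.7128, -74.0060, radius_km=1.0)
print([stop["stop_id"] for stop in nearby])
```

For hundreds of agencies a linear scan will be too slow, which is where a spatial index (in Neo4j, MongoDB or elsewhere) earns its keep.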
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or JavaScript implementation. I would recommend starting with it because it's super easy to use. I'm learning all about Neo4j right now, and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross-reference the documents in Mongo with the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
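The cross-referencing idea can be sketched in a few lines of Python. This uses plain dictionaries as stand-ins for both stores, so it shows only the shape of the approach, not real MongoDB or Neo4j calls. The ids, names, movie lists and function names are all invented for the example.

```python
# In-memory stand-ins; a real system would query MongoDB and Neo4j instead.
profiles = {  # document store: object id -> profile document
    "id-ann": {"name": "Ann", "movies": ["Alien", "Brazil", "Heat"]},
    "id-bob": {"name": "Bob", "movies": ["Brazil", "Heat", "Up"]},
    "id-cy": {"name": "Cy", "movies": ["Up"]},
}
friends = {  # graph: object id -> ids of direct friends
    "id-ann": {"id-bob"},
    "id-bob": {"id-ann", "id-cy"},
    "id-cy": {"id-bob"},
}


def friends_of_friends(person_id):
    """Walk two hops in the graph, excluding the person and direct friends."""
    direct = friends.get(person_id, set())
    reachable = set()
    for friend_id in direct:
        reachable |= friends.get(friend_id, set())
    return reachable - direct - {person_id}


def shared_movies(id_a, id_b):
    """Use the shared object id to pull both profiles and intersect them."""
    movies_a = set(profiles[id_a]["movies"])
    movies_b = set(profiles[id_b]["movies"])
    return sorted(movies_a & movies_b)


print(friends_of_friends("id-ann"))
print(shared_movies("id-ann", "id-bob"))
```

The graph side only ever stores ids and relationships, while the bulky profile data stays in the document store. The shared id is the only coupling between the two.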
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if pulling an entity from the database (that is, querying the data needed to build an entity or list of entities matching the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.