Loading big RDF file into Sesame

I'm trying to create a SPARQL endpoint based on Sesame. I installed Tomcat and PostgreSQL and deployed Sesame's web applications. I created a repository backed by the PostgreSQL RDF store. Now I need to load a big Turtle (.ttl) file (540M triples; the file is several GB) into the repository. Loading a big file through the Workbench is not a good solution - it would take several days. What is the best non-programming solution for loading the data? Are there tools like the "console" for loading data? For example, Virtuoso has the isql tool for bulk loading...

There is no ready-made bulk-loading tool for Sesame that I am aware of - though vendors of Sesame-compatible triplestores do ship such tooling with their specific databases. Programming a bulk-upload solution is not particularly hard, but we somehow never got around to including such a tool in the Sesame core distribution.
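To illustrate that point, here is a minimal sketch of such a bulk loader against the Sesame 2.7+ API (the server URL, repository ID, chunk size, and base URI are assumptions to adapt): a streaming Rio parser feeds statements to the connection and commits every 500,000 triples, so memory use and transaction size stay bounded.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.openrdf.model.Statement;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.RepositoryException;
    import org.openrdf.repository.http.HTTPRepository;
    import org.openrdf.rio.RDFFormat;
    import org.openrdf.rio.RDFHandlerException;
    import org.openrdf.rio.RDFParser;
    import org.openrdf.rio.Rio;
    import org.openrdf.rio.helpers.RDFHandlerBase;

    public class BulkLoader {
        public static void main(String[] args) throws Exception {
            // Hypothetical server URL and repository ID -- adjust to your setup.
            Repository repo = new HTTPRepository(
                    "http://localhost:8080/openrdf-sesame", "myRepo");
            repo.initialize();
            final RepositoryConnection con = repo.getConnection();
            try {
                con.begin();
                RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
                parser.setRDFHandler(new RDFHandlerBase() {
                    private long count = 0;
                    @Override
                    public void handleStatement(Statement st) throws RDFHandlerException {
                        try {
                            con.add(st);
                            // Commit in chunks so no single transaction has to
                            // hold millions of uncommitted statements.
                            if (++count % 500000 == 0) {
                                con.commit();
                                con.begin();
                                System.out.println(count + " triples loaded...");
                            }
                        } catch (RepositoryException e) {
                            throw new RDFHandlerException(e);
                        }
                    }
                });
                InputStream in = new FileInputStream(args[0]);
                try {
                    parser.parse(in, "http://example.org/");
                } finally {
                    in.close();
                }
                con.commit();
            } finally {
                con.close();
                repo.shutDown();
            }
        }
    }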
540M triples, by the way, is probably too large for any of Sesame's default stores - the Native Store only scales to about 150M, and loading such a large dataset into the memory store is just too unwieldy (even if you had the available RAM). So you probably need to look into using a Sesame-compatible database provided by a third party. There are many choices available, both commercial and free/open-source, see this overview on the Sesame website for a list of some suggestions.

Related

Lucene index replication

In a load-balanced environment I have a standalone Java process (essentially a Spring Boot jar; for simplicity let's call it Project 1) which reads some metadata and updates Lucene indexes at a certain location.
Then there is an actual web application (Project 2) through which I want to query the indexes that Project 1 has created. Given that the index files live with Project 1, what are the available options?
Copy the index files periodically to the web application's Lucene directory - which I believe would not be workable, as we would have to restart the application.
Maintain both projects as one package in a WAR, so a single Lucene instance is available to both.
Any other replication strategy?
Any help on the above would be highly appreciated.
Best,
- Vaibhav
This really depends on your application's non-functional requirements and the architectural decisions driven by them.
But here are some thoughts:
Copying an index from folderA to folderB sounds like a pretty bad idea, especially if both applications have to run all the time.
You don't want a direct dependency between these two applications, so you will have to build your own Lucene component that serves the API functionality you need.
I would recommend building a component with a proper API. This component uses Lucene as a library, and in cases where multiple systems or instances want to use the component, I would suggest a near-real-time (NRT) implementation of Lucene, as sketched below.
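Here is a minimal sketch of such a component, assuming a recent Lucene version (6+); the index path and field names are illustrative. Lucene's SearcherManager provides the NRT behaviour: searchers are refreshed directly from the open IndexWriter, so the indexing side and the query side both see changes without full commits or file copying.

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    public class NrtSearchComponent {
        private final IndexWriter writer;
        private final SearcherManager searcherManager;

        public NrtSearchComponent(String indexPath) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get(indexPath));
            writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
            // SearcherManager hands out near-real-time searchers over the writer.
            searcherManager = new SearcherManager(writer, null);
        }

        // The "Project 1" side: add or update a document.
        public void index(String id, String body) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new TextField("body", body, Field.Store.YES));
            writer.updateDocument(new Term("id", id), doc);
            // Make the change visible to searchers without a full commit.
            searcherManager.maybeRefresh();
        }

        // The "Project 2" side: query the same index through the component.
        public int countHits(String term) throws Exception {
            IndexSearcher searcher = searcherManager.acquire();
            try {
                return searcher.search(new TermQuery(new Term("body", term)), 10)
                        .scoreDocs.length;
            } finally {
                searcherManager.release(searcher);
            }
        }
    }

Note that this only works within one JVM; for two separate deployments, both projects would call the component through its API (for example over HTTP) rather than touching the index files directly, which is exactly the point about avoiding a direct dependency.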

J2SE desktop applications - JPA database vs Collections?

I come from a web development background and haven't done anything significant in Java in quite some time.
I'm doing a small project, most of which involves some models with relationships and straightforward CRUD operations with those objects.
JPA/EclipseLink seems to suit the problem, but this is the kind of app that has File->Open and File->Save features, i.e. the data will be stored in files by the user, rather than persisting in the database between sessions.
The last time I worked on a project like this, I stored the objects in ArrayList objects, but having worked with MVC frameworks since, that seems a bit primitive. On the other hand, using JPA, opening a file would require loading a whole bunch of objects into the database, just for the convenience of not having to write code to manage the objects.
What's the typical approach for managing model data with Java SE desktop applications?
JPA was specifically built with databases in mind. This means it typically operates on a big datastore containing objects that belong to many different users.
In a file-based scenario the files are quite often not that big, and all objects in a file belong to the same user and the same document. In that case, for a binary format, old-fashioned Java serialization still works for temporary files.
For longer-term or interchangeable formats, XML is better suited. Using JAXB (included in the standard Java library) you can marshal and unmarshal Java objects to and from XML using an annotation-based approach that on the surface resembles JPA. In fact, I've worked with model objects that carry both JPA and JAXB annotations, so they can be stored in a database as well as in an XML file.
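For instance, a minimal sketch of the JAXB round trip (the Customer class and file name are invented for illustration):

    import java.io.File;

    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Marshaller;
    import javax.xml.bind.annotation.XmlRootElement;

    public class JaxbDemo {
        @XmlRootElement
        public static class Customer {
            public String name;
            public String email;
        }

        public static void main(String[] args) throws Exception {
            Customer c = new Customer();
            c.name = "Alice";
            c.email = "alice@example.org";

            JAXBContext ctx = JAXBContext.newInstance(Customer.class);

            // File -> Save: marshal the object graph to an XML file.
            Marshaller m = ctx.createMarshaller();
            m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
            m.marshal(c, new File("customer.xml"));

            // File -> Open: unmarshal it back into objects.
            Customer loaded = (Customer) ctx.createUnmarshaller()
                    .unmarshal(new File("customer.xml"));
            System.out.println(loaded.name);
        }
    }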
If your desktop app uses files that represent potentially huge datasets for which you need paging and querying, however, then JPA might still be the better option. There are various small embedded DBs available for Java, although I don't know how simple it is to point a data source at a user-selected file. Normally a persistence unit in Java is mapped to a fixed data source, and you can't yet create persistence units on the fly.
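One possible workaround, sketched below under the assumption of JPA 2 and an embedded H2 database (the persistence unit name "app" and the URL format are assumptions): the persistence unit itself stays fixed, but the JDBC URL can be overridden per opened file when the EntityManagerFactory is created.

    import java.util.HashMap;
    import java.util.Map;

    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;

    public class FileBackedJpa {
        // Open an EntityManagerFactory whose embedded database lives in the
        // file the user picked; "app" must be declared in persistence.xml.
        public static EntityManagerFactory openFile(String path) {
            Map<String, String> props = new HashMap<String, String>();
            props.put("javax.persistence.jdbc.url", "jdbc:h2:file:" + path);
            return Persistence.createEntityManagerFactory("app", props);
        }
    }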
Yet another option would be to use JDO, which is a mapping technology like JPA, but not an ORM. It's much more independent of the backend persistence technology that's being used and indeed maps to files as well.
Sorry that this is not a real answer, but more like some things to take into account, but hope it's helpful in some way.

Where do I begin learning Lucene.NET Solr Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large-scale search service that removes entries the end user doesn't have access to (i.e. a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are Java, you can easily port them to .NET. Of course, there also are tonnes of resources on the web, such as SO and the Lucene mailing lists which should help you along.
For the project you describe, you should look at Solr, since it abstracts away many of the scalability issues and, via SolrNet, can easily integrate into your .NET app. To restrict access by level, your index documents should contain a field called "Level" (say), and behind the user's query you append a clause on that field restricting results to the user's clearance level and above, using a boolean query construct.
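A sketch of that idea using SolrJ (the SolrNet equivalent is analogous; the core name, field name, and clearance value are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class LevelFilteredSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build();

            SolrQuery q = new SolrQuery("user search terms");
            // Append the clearance restriction as a filter query, so documents
            // below the user's level never leave the index.
            q.addFilterQuery("Level:[3 TO *]");

            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getResults().getNumFound() + " hits");
            solr.close();
        }
    }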
At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are keen to learn about it anyway, it too has a very useful book - you guessed it, "Hadoop in Action" (also from Manning Publications).
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc.) does. So the first thing to do would be to understand the purpose of each project. Read the docs and blogs about them. If possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).

What Backup Library/Code Do You Use For Your Dataset (Files On Disk)?

I'm implementing backup functionality in my new (small) app, Oldaer. I've got standalone desktop files (rather than data sitting in a SQL db).
Looking around, I decided on a Clarion third-party template that packages them into one file and then compresses that one file (Huffman coding). Restoring is just the reverse: uncompress, unpack.
However, I'm not convinced this is ideal.
What Backup functionality do you implement for your dataset?
Of course, there's a lot more to "Backup/Restore" functionality - location, tracking/archiving, out-of-the-box information (like better ways of letting the user know what is in the archive file). But that's another question.
SQL Replication, clustering, RAID 5
I've just been playing with uploading datasets to Amazon S3 using the NetTalk third-party libraries in Clarion. Seems to work a treat. I am working on keeping multiple 'versions' of the datasets using the MetaTags functionality.
Happy to dig out my code and discuss further if you need.
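For comparison, here is roughly the same idea outside Clarion, sketched in Java with the AWS SDK (the bucket, key, and metadata names are invented; NetTalk's MetaTags correspond to S3 user metadata):

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class BackupToS3 {
        public static void main(String[] args) {
            // Credentials come from the default provider chain.
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Tag the backup with a version marker so multiple versions of
            // the dataset can be told apart later.
            ObjectMetadata meta = new ObjectMetadata();
            meta.addUserMetadata("dataset-version", "v1");

            s3.putObject(new PutObjectRequest(
                    "my-backup-bucket", "backups/oldaer.zip",
                    new File("oldaer-backup.zip")).withMetadata(meta));
        }
    }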

Amazon SimpleDB

Has anyone considered using something along the lines of the Amazon SimpleDB data store as their backend database?
SQL Server hosting (at least in the UK) is expensive, so could something like this, along with cloud file storage (S3), be used for building apps that could grow with demand?
Great in theory, but would anyone actually consider using it? In fact, is anyone using it now for real production software? I would love to read your comments.
This is a good analysis of Amazon services from Dare.
S3 handles what I've typically heard described as "blob storage". A typical web application has media files and other resources (images, CSS stylesheets, scripts, video files, etc.) that are simply accessed by name/path. However, a lot of these resources also have metadata (e.g. a video file on YouTube has metadata about its rating, who uploaded it, number of views, etc.) which needs to be stored as well. This need for queryable, schematized storage is where SimpleDB comes in. EC2 provides a virtual server that can be used for computation, complete with a local file system instance which isn't persistent if the virtual server goes down for any reason. With SimpleDB and S3 you have the building blocks to build a large class of "Web 2.0"-style applications once you throw in the computational capabilities provided by EC2.
However neither S3 nor SimpleDB provides a solution for a developer who simply wants the typical LAMP or WISC developer experience of building a database driven Web application or for applications that may have custom storage needs that don't fit neatly into the buckets of blob storage or schematized storage. Without access to a persistent filesystem, developers on Amazon's cloud computing platform have had to come up with sophisticated solutions involving backing data up manually from EC2 to S3 to get the desired experience.
I just finished writing a library to make porting an app to SimpleDB in Perl easy, Net::Amazon::SimpleDB::Simple, because I found the Amazon client libraries painful. The library isn't on CPAN yet, but it is at http://rjurneyopen.s3.amazonaws.com/SimpleDB/Simple.pm The idea was to make it trivial to stuff hashes into and out of SimpleDB.
I just ported an app to use it. Overall I am impressed with SimpleDB... even inefficient queries take only 2-3 seconds to return. SimpleDB doesn't seem to care about the size of your table, owing to its Erlang/parallel nature. Table scans are easy for it.
The pain comes from the fact that you can't count, sum, or group by. If you plan on doing any of those things, then SimpleDB probably isn't for you. At the moment, in terms of functionality, it sits somewhere between memcached and MySQL. You can SELECT ... ORDER BY ... LIMIT, which is nice. It's also nice that you don't have to scale it yourself, and that it doesn't care how much you stuff into it. But more advanced operations like analytics are painful at best: you'll have to do your own calculations server-side. It's also a big plus that on any computer I can use the SimpleDB CLI ( http://code.google.com/p/amazon-simpledb-cli/ ) to query my data.
There are some confusing gotchas. For instance, attributes can have more than one value, and you have to explicitly set 'replace' when storing items. Also, storing undef or a null string results in a library error, instead of deleting that attribute name/value pair or setting it to a null/empty string.
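To make the multi-value/'replace' gotcha and the query limits concrete, here is a sketch using the AWS SDK for Java rather than the Perl library above (the domain, item, and attribute names are invented):

    import java.util.Arrays;

    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
    import com.amazonaws.services.simpledb.model.Item;
    import com.amazonaws.services.simpledb.model.PutAttributesRequest;
    import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
    import com.amazonaws.services.simpledb.model.SelectRequest;

    public class SimpleDbDemo {
        public static void main(String[] args) {
            AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

            // Attributes are multi-valued: without replace=true this put would
            // add a second "rating" value next to the old one instead of
            // overwriting it.
            sdb.putAttributes(new PutAttributesRequest("videos", "item123",
                    Arrays.asList(new ReplaceableAttribute("rating", "5", true))));

            // SELECT ... ORDER BY ... LIMIT works (the sorted attribute must
            // appear in the predicate); sums and group-bys must be computed
            // client-side.
            SelectRequest select = new SelectRequest(
                    "select * from videos where rating is not null "
                    + "order by rating desc limit 10");
            for (Item item : sdb.select(select).getItems()) {
                System.out.println(item.getName());
            }
        }
    }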
Learning to think in largely un-normalized terms is a little strange too, which is why I would second the suggestion above that it is best for new applications. Porting a SQL app to SimpleDB would be painful, because your application logic would have to change - the way you do things is a bit different. The Amazon docs are pretty good at explaining this.
All of this can be abstracted away by a library that sits atop SimpleDB, so for your use of SimpleDB you will want to pick a good library... you probably don't want to deal with it directly. There is some work on the PHP side to make things easy, and there is my library. There is a Rails ActiveResource implementation, but it doesn't seem to do much for you.
All in all, it's still early in the game, but compared to other APIs (Twitter comes to mind), I have to say that the SimpleDB REST API is pretty simple (especially considering that it is XML) and polite to work with. I would recommend it... depending on the requirements of your application and the economics of your use of it. If you're looking to rapidly scale a service that doesn't put a great load on the DB and don't want to bother with a scalable MySQL/memcached combo, then SimpleDB can offer a 'simple' solution for you.
I expect that its features will continue to grow and it will be a good choice for more and more applications that do more complex and interesting things. But right now it is targeted at and appropriate for your typical Web 2.0 service.
We are using SimpleDB almost exclusively for our new projects. The zero-maintenance, high-availability, no-install aspects are just too good. And for your Ruby developers, check out SimpleRecord, an ActiveRecord-like interface for SimpleDB which makes it super easy to use.
But do you really need SQL Server? Can't you live with PostgreSQL or MySQL? Both have proven to be ok for most tasks.
Now if you need SQL Server features then you're out of luck.
Another option is to rent a server. How expensive is expensive?
(I've used Amazon S3 to store images for an application, it's ok and works fine, at least for that)
I haven't used SimpleDB, but I have been using a combination of S3, EC2, and MySQL for our application.
If you are willing to consider SimpleDB, then you might as well also consider MySQL, which is very scalable and not that expensive.
On the S3 and EC2 side, things are great in practice as well.
SimpleDB works great for many applications... but if your project will require a lot of analytic reporting, joining, etc., you may want to consider MySQL or a hybrid model.
If you go with SimpleDB: we've developed Radquery.com for our internal use and have opened it up to the public.