enCapsa -what is it and what is used for? - data-storage

It may not be a pure programming question but I'm looking for information about enCapsa. Do you know what it is, have you ever used it? I'm reading some papers about it but I can't really see how it works and what it can be used for in an IT company (and this is what i am supposed to find out).

Basically enCapsa is a shared data storage system, focused on providing a way for storing any kind of data (even from etherogeneous data sources such as different designed db tables) and consequently obtain it through a sort of human friendly queries, just like on a search engine. They offer the possibility to upload data from everywhere (it's CSV based) and later download to use wherever you need it.
Usages are many, consider it's a centralized DB accessible through web and they say it meets high security standard.
A usefull way to employ this service is to have data stored in there without the need to keep them synchronized across company computers.

Related

How does the semantic web actually work?

I know that in the web you have lots an lots of pages linked to one another and you can go from page to page and so on.
How does the semantic web work? I understand that it uses the concept of Linked Data, where data is identified and linked by URIs or IRIs and not the web pages them self. But I don't understand how the data is linked across the web when all of the data is stored in local triplestores and are linked internally in the triplestores. Are browsers capable to go from triplestore to triplestore behind the scenes and get back all kinds of data? Or how is the data actually linked? Is there a mechanism to go from data to data all across the web and use the meaning of data in real life situations, or a tool that does something like this?
Also anybody can create ontologies and define and describe anything in all kinds of different ways. Won't this lead to a big mess of data?
So, main question:
How does the semantic web and liked data actually work?
It's a tricky and multifaceted question.
First I'll answer some of you questions.
But I don't understand how the data is linked across the web when all of the data is stored in local triplestores and are linked internally in the triplestores
First of all, it is important to realize that triplestores are not a necessity. You could have SQL servers and D2RQ/R2RML mapping on top to translate queries dynamically. Or plain RDF files. Or simple JSON documents in MongoDB, etc, which you extend by adding a JSON-LD #context.
What is important, is that you serve data in one of the RDF formats such as turtle or JSON-LD
Are browsers capable to go from triplestore to triplestore behind the scenes and get back all kinds of data?
See, they don't have to because, as you mention, URIs are used so that a browser (not necessarily a web browser) can download the data. And of course these URI are URLs and are dereferenceable. Otherwise they are just identifiers.
Or how is the data actually linked?
It is linked simply by reusing identifiers for objects and properties. That's why URI (IRI) is used, so that the identifiers are globally unique and created privately within a domain. Of course there is a risk of being mischievous by creating URIs is someone else's domain. It's a separate topic though.
Is there a mechanism to go from data to data all across the web and use the meaning of data in real life situations, or a tool that does something like this?
One simple mechanism is to simply crawl RDF data and download to a local store. Simple occurrence of matching identifiers will combine the data into larger dataset with less mapping effort required. That is of course the theory because you risk data can be corrupt, incorrect or duplicated so you need some curation. Technology exists to help you do that and it's not something you wouldn't experience is traditional data warehousing. Search engines harvest semantic markup from HTML pages (RFDa/Microdata) is similar manner.
Another option is to use federated queries. SPARQL has the ability to automatically download RDF data and perform queries over it in memory.
Last but not least, there are federated queries using Triple Pattern Fragments
Now about Semantic Web
As I wrote, the question is not that simple. You mostly ask about Linked Data. There is more to Semantic Web than that:
ontologies/taxonomies
inferencing
rules
semantic/faceted search
I hope I answered your question to some extent.

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferable in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat file versions. But an OSS or other existing CMS is not an option. They need a simple propriety system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JOSN formatted files and some arrays in a simple project like this -- beyond my own pre-conceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate you insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard, if you're perception is timely (that is, if you aren;t in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat-files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions) and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gains in feature-set and performance.
Then again, if the CSM is designed right, then switching between a flat file to RDMS should be as simple as using a different data access file.
Will this be installed on any shared hosting sites. For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly as the web server will need write permissions to various directories.
What would be cool with a simple site that was feed via JSON and jQuery is that the site wouldn't need to load on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (ex. http://localhost/#about)
The problem being if they are editing the raw JSON file they can mess it up pretty quick. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more entailed then the site (though isn't that always the case with dynamic sites)
What is the predicted data sizes for the CMS?
A large reason for the use of a RDMS is quick,specific access to large amounts of data. The data format might not be large, but if there is a lot of the data, then it might be better in the long run for a RDMS.
Then again, if the CSM is designed right, then switching between a flat file to RDMS should be as simple as using a different data access file.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up to the answers so far in an answer (sorry if this is bad form).
There may instances where this is on
a shared host.
Though the JSON files can technically
be edited, this won't be the case.
The admin interface will be robust
enough to do all of the creating/editing of pages
The size for each install will be
relatively small -- 1 - 2 admins,
10-100 pages. A few lists of common
items may run longer (snippets of
copy for example).
Security will be a big issue -- any
other options suggestions on this
specifically?
Well, isn't there a problem with they being distrustful to any database system? Isn't the problem more in their thinking than in technology? Maybe they are afraid of database because it sounds complex to them. In that case, if you just present them some very simple CMS (like CMS made simple, which I've heard is really simple and the learning process is very fast), if they see everything is easy then may be they just don't care what's behind, if it's a database or whatever!
They could hear to arguments like better maintenance, lower cost of maintenance, much better handover to another webmaster than proprietary solutions (they are not dependent on you) etc.

What database for crawler/scraper?

I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.
The project is an automated web crawler that checks websites as per a user's request, scrapes data under certain circumstances, and creates log files of what was done.
Requirements:
Only few tables with few columns; predefining columns is no problem
No overly complex associations between models
Huge amount of date & time based queries
Due to logging, database will grow rapidly and use up a lot of space
Should be able to scale over multiple servers
Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps
Two different types of servers will simultaneously read/write data directly to/from it:
One(/later more) rails app that takes user input and displays results upon request
One(/later more) Node.js server that functions as the executing crawler/scraper. It will have enough load to run continuously and make dozens of database queries every second.
I assume it will neither be a graph database (no complex associations), nor a memory based key/value store (too much data to hold in cached). I'm still on the fence for every other type of database I could find, each seems to have it's merits.
So, any advice from the pros how I should decide?
Thanks.
I would agree with Vladimir that you would want to consider a document-based database for this scenario. I am most familiar with MongoDB. My reasons for using it here are as follows:
Your 'schema requirements' of "only a few tables with few columns" fits well with the NoSQL nature of MongoDB.
Same as above for "no overly complex associations between nodes" -- you will want to decide whether you'd prefer nested documents or using dbref (I prefer the former)
Huge amount of time-based data (and other scaling requirements) - MongoDB scales well via sharding or partitioning
Read/write access - this is why I am recommending MongoDB over something like Hadoop. The interactive query requirement is best met by something other than a Hadoop-style store, as this type of storage is designed for batch (rather than interactive query) requirements.
Google built a database called "BigTable" for crawling, indexing and the search related business. They released a paper about it (google for "BigTable" if you're interested). There are several open source implementations for bigtable-like designs, one of them is Hypertable. We have a blog posting describing a crawler/indexer implementation (http://hypertable.com/blog/sehrchcom_a_structured_search_engine_powered_by_hypertable/) written by the guys from sehrch.com. And looking at your requirements: all of them are supported and are common use cases.
(disclaimer: i work for hypertable.)
Take a look at document-oriented database like a CouchDB or MongoDB.

I need advise choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about 100s of Public Transit agencies, their routers, stations, times, and other related information. I will be getting my information from here and the google code wiki page with similar info. There is a lot of data and its partitioned into multiple CSV formatted text files. These can be huge, some ranging in 80-100mb of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help on finding the best possible for my situation. I know I will possible have to work around limitations of whatever I choose, but I want to at least have done my research and know that its the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits 100% your problem. My advice here is to go for some geo spatial module over neo4j or orientdb, althought you have some others free and open source implementation.
I think the best one right now, with all the geo spatial thing implemented is neo4j-spatial package. But as far as I know, you can also reproduce most of the geo spatial thing on your own if necessary.
BTW talking about splitting, if the amount of data/queries will be high, I strongly recommend you to share the load and think the model in this terms. Sure you can do something.
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database [in other words(in the context of a relational database) - querying data from the database that provides the data required to generate an entity or list of entities that fulfills the requested parameters] does not require significant processing (multiple joins, for instance)
Do you require ACID compliance(aside:if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/

Good reasons NOT to use a relational database?

Can you please point to alternative data storage tools and give good reasons to use them instead of good-old relational databases? In my opinion, most applications rarely use the full power of SQL--it would be interesting to see how to build an SQL-free application.
Plain text files in a filesystem
Very simple to create and edit
Easy for users to manipulate with simple tools (i.e. text editors, grep etc)
Efficient storage of binary documents
XML or JSON files on disk
As above, but with a bit more ability to validate the structure.
Spreadsheet / CSV file
Very easy model for business users to understand
Subversion (or similar disk based version control system)
Very good support for versioning of data
Berkeley DB (Basically, a disk based hashtable)
Very simple conceptually (just un-typed key/value)
Quite fast
No administration overhead
Supports transactions I believe
Amazon's Simple DB
Much like Berkeley DB I believe, but hosted
Google's App Engine Datastore
Hosted and highly scalable
Per document key-value storage (i.e. flexible data model)
CouchDB
Document focus
Simple storage of semi-structured / document based data
Native language collections (stored in memory or serialised on disk)
Very tight language integration
Custom (hand-written) storage engine
Potentially very high performance in required uses cases
I can't claim to know anything much about them, but you might also like to look into object database systems.
Matt Sheppard's answer is great (mod up), but I would take account these factors when thinking about a spindle:
Structure : does it obviously break into pieces, or are you making tradeoffs?
Usage : how will the data be analyzed/retrieved/grokked?
Lifetime : how long is the data useful?
Size : how much data is there?
One particular advantage of CSV files over RDBMSes is that they can be easy to condense and move around to practically any other machine. We do large data transfers, and everything's simple enough we just use one big CSV file, and easy to script using tools like rsync. To reduce repetition on big CSV files, you could use something like YAML. I'm not sure I'd store anything like JSON or XML, unless you had significant relationship requirements.
As far as not-mentioned alternatives, don't discount Hadoop, which is an open source implementation of MapReduce. This should work well if you have a TON of loosely structured data that needs to be analyzed, and you want to be in a scenario where you can just add 10 more machines to handle data processing.
For example, I started trying to analyze performance that was essentially all timing numbers of different functions logged across around 20 machines. After trying to stick everything in a RDBMS, I realized that I really don't need to query the data again once I've aggregated it. And, it's only useful in it's aggregated format to me. So, I keep the log files around, compressed, and then leave the aggregated data in a DB.
Note I'm more used to thinking with "big" sizes.
The filesystem's prety handy for storing binary data, which never works amazingly well in relational databases.
Try Prevayler:
http://www.prevayler.org/wiki/
Prevayler is alternative to RDBMS. In the site have more info.
If you don't need ACID, you probably don't need the overhead of an RDBMS. So, determine whether you need that first. Most of the non-RDBMS answers provided here do not provide ACID.
Custom (hand-written) storage engine / Potentially very high performance in required uses cases
http://www.hdfgroup.org/
If you have enormous data sets, instead of rolling your own, you might use HDF, the Hierarchical Data Format.
http://en.wikipedia.org/wiki/Hierarchical_Data_Format:
HDF supports several different data models, including multidimensional arrays, raster images, and tables.
It's also hierarchical like a file system, but the data is stored in one magic binary file.
HDF5 is a suite that makes possible the management of extremely large and complex data collections.
Think petabytes of NASA/JPL remote sensing data.
G'day,
One case that I can think of is when the data you are modelling cannot be easily represented in a relational database.
Once such example is the database used by mobile phone operators to monitor and control base stations for mobile telephone networks.
I almost all of these cases, an OO DB is used, either a commercial product or a self-rolled system that allows heirarchies of objects.
I've worked on a 3G monitoring application for a large company who will remain nameless, but whose logo is a red wine stain (-: , and they used such an OO DB to keep track of all the various attributes for individual cells within the network.
Interrogation of such DBs is done using proprietary techniques that are, usually, completely free from SQL.
HTH.
cheers,
Rob
Object databases are not relational databases. They can be really handy if you just want to stuff some objects in a database. They also support versioning and modify classes for objects that already exist in the database. db4o is the first one that comes to mind.
In some cases (financial market data and process control for example) you might need to use a real-time database rather than a RDBMS. See wiki link
There was a RAD tool called JADE written a few years ago that has a built-in OODBMS. Earlier incarnations of the DB engine also supported Digitalk Smalltalk. If you want to sample application building using a non-RDBMS paradigm this might be a start.
Other OODBMS products include Objectivity, GemStone (You will need to get VisualWorks Smalltalk to run the Smalltalk version but there is also a java version). There were also some open-source research projects in this space - EXODUS and its descendent SHORE come to mind.
Sadly, the concept seemed to die a death, probably due to the lack of a clearly visible standard and relatively poor ad-hoc query capability relative to SQL-based RDMBS systems.
An OODBMS is most suitable for applications with core data structures that are best represented as a graph of interconnected nodes. I used to say that the quintessential OODBMS application was a Multi-User Dungeon (MUD) where rooms would contain players' avatars and other objects.
You can go a long way just using files stored in the file system. RDBMSs are getting better at handling blobs, but this can be a natural way to handle image data and the like, particularly if the queries are simple (enumerating and selecting individual items.)
Other things that don't fit very well in a RDBMS are hierarchical data structures and I'm guessing geospatial data and 3D models aren't that easy to work with either.
Services like Amazon S3 provide simpler storage models (key->value) that don't support SQL. Scalability is the key there.
Excel files can be useful too, particularly if users need to be able to manipulate the data in a familiar environment and building a full application to do that isn't feasible.
There are a large number of ways to store data - even "relational databse" covers a range of alternatives from a simple library of code that manipulates a local file (or files) as if it were a relational database on a single user basis, through file based systems than can handle multiple-users to a generous selection of serious "server" based systems.
We use XML files a lot - you get well structured data, nice tools for querying same the ability to do edits if appropriate, something that's human readable and you don't then have to worry about the db engine working (or the workings of the db engine). This works well for stuff that's essentially read only (in our case more often than not generated from a db elsewhere) and also for single user systems where you can just load the data in and save it out as required - but you're creating opportunities for problems if you want multi-user editing - at least of a single file.
For us that's about it - we're either going to use something that will do SQL (MS offer a set of tools that run from a .DLL to do single user stuff all the way through to enterprise server and they all speak the same SQL (with limitations at the lower end)) or we're going to use XML as a format because (for us) the verbosity is seldom an issue.
We don't currently have to manipulate binary data in our apps so that question doesn't arise.
Murph
One might want to consider the use of an LDAP server in the place of a traditional SQL database if the application data is heavily key/value oriented and hierarchical in nature.
BTree files are often much faster than relational databases. SQLite contains within it a BTree library which is in the public domain (as in genuinely 'public domain', not using the term loosely).
Frankly though, if I wanted a multi-user system I would need a lot of persuading not to use a decent server relational database.
Full-text databases, which can be queried with proximity operators such as "within 10 words of," etc.
Relational databases are an ideal business tool for many purposes - easy enough to understand and design, fast enough, adequate even when they aren't designed and optimized by a genius who could "use the full power," etc.
But some business purposes require full-text indexing, which relational engines either don't provide or tack on as an afterthought. In particular, the legal and medical fields have large swaths of unstructured text to store and wade through.
Also:
* Embedded scenarios - Where usually it is required to use something smaller then a full fledged RDBMS. Db4o is an ODB that can be easily used in such case.
* Rapid or proof-of-concept development - where you wish to focus on the business and not worry about persistence layer
CAP theorem explains it succinctly. SQL mainly provides "Strong Consistency: all clients see the same view, even in presence of updates".
K.I.S.S: Keep It Small and Simple
I would offer RDBMS :)
If you do not wont to have troubles with set up/administration go for SQLite.
Built in RDBMS with full SQL support. It even allows you to store any type of data in any column.
Main advantage against for example log file: If you have huge one, how are you going to search in it? With SQL engine you just create index and speed up operation dramatically.
About full text search: SQLite has modules for full text search too..
Just enjoy nice standard interface to your data :)
One good reason not to use a relational database would be when you have a massive data set and want to do massively parallel and distributed processing on the data. The Google web index would be a perfect example of such a case.
Hadoop also has an implementation of the Google File System called the Hadoop Distributed File System.
I would strongly recommend Lua as an alternative to SQLite-kind of data storage.
Because:
The language was designed as a data description language to begin with
The syntax is human readable (XML is not)
One can compile Lua chunks to binary, for added performance
This is the "native language collection" option of the accepted answer. If you're using C/C++ as the application level, it is perfectly reasonable to throw in the Lua engine (100kB of binary) just for the sake of reading configs/data or writing them out.