How should I document an HBase schema?

This might be a case of me overthinking the problem, but how should I document an HBase schema? In relational database land, it is common to use UML or similar diagramming techniques to document schemas. Those approaches don't seem to fit HBase very well. To me the simplest approach is to use a spreadsheet (or any other table) to document the columns and column families. Is there a better way to do this?

There's no common method at this point. You can do it in whatever way seems clearest to you. I've used entity diagrams (generally more trouble than they're worth), XML/JSON, and pseudo CREATE TABLE statements. A couple of things that might help:
A presentation I did at HBaseCon 2012 about understanding HBase schema design, which proposes a few simple guidelines.
A simple open-source tool called scoot that takes in XML files and outputs JRuby scripts for creating said schema.
If you're getting into advanced stuff like "entity nesting" (i.e. using variable columns at runtime) then you're kind of on your own as far as modeling goes, for now.
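For what it's worth, one lightweight option is to keep the schema description in a machine-readable form and generate the shell commands from it, so the documentation can never drift from the deployed tables. Here's a minimal sketch of the idea in Python (the table and column-family names are invented for illustration):

# Each table is described as data; this dict *is* the documentation.
SCHEMA = {
    "users": [
        {"NAME": "info", "VERSIONS": 1},      # profile fields, latest value only
        {"NAME": "activity", "VERSIONS": 10}, # recent events, keep some history
    ],
}

def to_hbase_shell(schema):
    """Render the schema description as 'create' statements for the HBase shell."""
    lines = []
    for table, families in schema.items():
        specs = ", ".join(
            "{" + ", ".join(
                "%s => '%s'" % (k, v) if isinstance(v, str) else "%s => %s" % (k, v)
                for k, v in fam.items()
            ) + "}"
            for fam in families
        )
        lines.append("create '%s', %s" % (table, specs))
    return "\n".join(lines)

print(to_hbase_shell(SCHEMA))
# create 'users', {NAME => 'info', VERSIONS => 1}, {NAME => 'activity', VERSIONS => 10}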

Related

Should I use EAV database design model or a lot of tables

I started a new application and now I am looking at two paths and don't know which is the better way to continue.
I am building something like an eCommerce site. I have categories and subcategories.
The problem is that there are different types of products on the site and each has different properties. And the site must be filterable by those product properties.
This is my initial database design:
Products{ProductId, Name, ProductCategoryId}
ProductCategories{ProductCategoryId, Name, ParentId}
CategoryProperties{CategoryPropertyId, ProductCategoryId, Name}
ProductPropertyValues{ProductId, CategoryPropertyId, Value}
Now after some analysis I see that this design is actually EAV model and I read that people usually don't recommend this design.
It seems that dynamic SQL queries are required for everything.
That's one way and I am looking at it right now.
Another way that I see could be called the A LOT OF WORK way, but if it's better I want to go there.
To make a table
Product{ProductId, CategoryId, Name, ManufacturerId}
and to make table inheritance in the database, which means making tables like
Cpus{ProductId ....}
HardDisks{ProductId ....}
MotherBoards{ProductId ....}
etc. for each product (1 to 1 relation).
I understand that this will be a very large database and a very large application domain, but is it better, easier, and more performant than option one with the EAV design?
EAV is rarely a win. In your case I can see the appeal of EAV given that different categories will have different attributes and this will be hard to manage otherwise. However, suppose someone wants to search for "all hard drives with more than 3 platters, using a SATA interface, spinning at 10k rpm?" Your query in EAV will be painful. If you ever want to support a query like that, EAV is out.
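To make the pain concrete, here is roughly what that search costs against your EAV tables: one self-join on ProductPropertyValues per attribute in the filter. A sketch using Python's built-in sqlite3 (the property ids and the CASTs on the untyped Value column are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products(ProductId INTEGER PRIMARY KEY, Name TEXT, ProductCategoryId INTEGER);
CREATE TABLE ProductPropertyValues(ProductId INTEGER, CategoryPropertyId INTEGER, Value TEXT);
""")

# One self-join per attribute: platters (property 1), interface (2), rpm (3).
rows = conn.execute("""
SELECT p.Name
FROM Products p
JOIN ProductPropertyValues v1 ON v1.ProductId = p.ProductId AND v1.CategoryPropertyId = 1
JOIN ProductPropertyValues v2 ON v2.ProductId = p.ProductId AND v2.CategoryPropertyId = 2
JOIN ProductPropertyValues v3 ON v3.ProductId = p.ProductId AND v3.CategoryPropertyId = 3
WHERE CAST(v1.Value AS INTEGER) > 3      -- platters
  AND v2.Value = 'SATA'                  -- interface
  AND CAST(v3.Value AS INTEGER) = 10000  -- rpm
""").fetchall()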
There are other approaches however. You could consider an XML field with extended data or, if you are on PostgreSQL 9.2, a JSON field (XML is easier to search though). This would give you a significantly larger range of possible searches without the headaches of EAV. The tradeoff would be that schema enforcement would be harder.
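By contrast, with an extended-data field the same filter collapses to one predicate per attribute on a single row. A sketch of the JSON-field idea, again with sqlite3 (its JSON1 json_extract function stands in here for PostgreSQL's JSON operators; the key names are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products(ProductId INTEGER PRIMARY KEY, Name TEXT, Extra TEXT)")
conn.execute("""INSERT INTO Products VALUES
  (1, 'FastDisk 10K', '{"platters": 4, "interface": "SATA", "rpm": 10000}')""")

rows = conn.execute("""
SELECT Name FROM Products
WHERE json_extract(Extra, '$.platters') > 3
  AND json_extract(Extra, '$.interface') = 'SATA'
  AND json_extract(Extra, '$.rpm') = 10000
""").fetchall()
print(rows)  # [('FastDisk 10K',)]

The flip side is visible in the sketch too: nothing stops a row from carrying misspelled or missing keys, so schema enforcement moves into application code.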
This question seems to discuss the issue in greater detail.
Apart from performance, extensibility and complexity discussed there, also take into account:
SQL databases such as SQL Server have full-text search features; so if you have a single field describing the product, full-text search will index it and will be able to provide advanced semantic searches
take a look at the NoSQL systems that are all the rage right now; scalability should be quite good with them and they provide support for non-structured data such as the data you have. Hadoop and Cassandra are good starting points.
You could very well work with the EAV model.
We do something similar with a logistics application. It is built on .NET though.
Apart from the tables, your application code has to handle the objects correctly.
See if you can add a generic table for each object. It works for us.

How should I design the database for Django?

I like to use a GUI application to design databases using ERD. Currently I am using the EER Diagram of the free MySQLWorkbench.
Once I like the way the ERD looks, I Forward Engineer the ERD in MySQLWorkbench to create the actual database. Then I introspect the MySQL database with django-admin.py inspectdb to Reverse Engineer into an output of a Python snippet code for Django's models.py.
But then I have to take the inspectdb output and manually edit it to my liking. One particular part I really don't like to do is manually eliminating each join table from a many-to-many relationship.
Is there a good (and preferably free) GUI ERD design program out there specifically designed for Django?
If you wish to design your models at the database level, in the way that you're describing, you are going to need to do exactly that: design your SQL and then convert that into Django models.
This is not the normal way of designing a Django application: typically you would design the models as you wanted them to be, and only put a lot of effort into schema design if you need to resolve some performance problems. Django models aren't really meant to be an abstraction of a relational database: they're meant to be an abstraction of your application's persisted objects, which happens to be implemented on top of a relational database.
There is nothing wrong with wanting/needing to do an explicit schema design, but it makes you a bit of an outlier (most web developers don't), hence the difficulty you're having finding tools suited to your needs.
The closest thing is the graph_models command which is part of Django command extensions. This lets you visualize your models (you still write them in python code, but the visual representation will help you iterate faster).
It is not usual to design a database layout for Django apps as such, and doing so is likely to lead to a sub-optimal design.
Instead, just design your model classes, taking account of how you will query them, and the way that ForeignKey (and the other relations types) work. If you don't do this, you are likely to find that your app suffers from conceptual mismatch.
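To illustrate the model-first workflow both answers describe, here is a minimal sketch (the model and field names are invented): you write only the two classes, and Django creates and hides the many-to-many join table for you, which is exactly the table you are currently deleting by hand from the inspectdb output.

from django.db import models

class Tag(models.Model):
    name = models.CharField(max_length=50)

class Article(models.Model):
    title = models.CharField(max_length=200)
    # Django generates the hidden join table (appname_article_tags);
    # it never appears in models.py.
    tags = models.ManyToManyField(Tag)

# usage: article.tags.add(tag); article.tags.filter(name='django')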

Schema-agnostic OLAP-like tool?

Do there exist any (ideally free or open-source) tools for performing OLAP analyses on arbitrary tables in a relational database, without requiring any advance specification of dimensional hierarchies, cardinalities, or any other meta-information about the table beyond what can be extracted automatically from the table itself?
My inability to Google for anything like what I'm describing makes me suspect I'm using incorrect terminology and what I'm searching for isn't properly considered to be OLAP. If this is the case, what I specifically want is anything that would let technically unsophisticated users create cross-tab or contingency table aggregations using tables in a relational DB without needing to write elaborate SQL queries.
Or, in other words, I'd like something that mimics Excel's PivotTables on a larger scale. I appreciate that Excel does indeed generate extensive caches behind the scenes when you make a PivotTable, but it does this without the user having to explain to it which caches need creating. This is the functionality I'm trying to find elsewhere, if it exists.
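For concreteness, this is the kind of SQL a single cross-tab requires when written by hand (exactly what I'd like to spare non-technical users), sketched with Python's sqlite3 and an invented sales table: one conditional aggregate per column value, all enumerated in advance.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales(region TEXT, year INTEGER, amount REAL);
INSERT INTO sales VALUES ('East', 2010, 100), ('East', 2011, 150), ('West', 2010, 80);
""")

# A PivotTable with region rows and year columns, done the hard way.
for row in conn.execute("""
SELECT region,
       SUM(CASE WHEN year = 2010 THEN amount ELSE 0 END) AS y2010,
       SUM(CASE WHEN year = 2011 THEN amount ELSE 0 END) AS y2011
FROM sales
GROUP BY region
ORDER BY region
"""):
    print(row)
# ('East', 100.0, 150.0)
# ('West', 80.0, 0.0)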
The best options I know of are Excel and Access, but of course they are not open source. This space kinda got trampled in the explosion of interest in what is now called Business Intelligence and a lot of companies got bought by MS and others. It's pretty thin now as far as I can tell. I'll watch this thread though.
The most useful paradigm to attach to is I think spreadsheets and there's not much competition there any more. Google Docs spreadsheets can import csv etc. exported from databases, and there's a pivot chart available, but not much more.
The other place I've seen OLAP capabilities is in the Adobe Flex libraries to build on with ActionScript if you have any inclination in that direction. As usual, Adobe manages to get it about 90% right but doesn't quite provide a whole product.
icCube aims to set up an OLAP cube as simply as possible. It is not schema-agnostic, but I guess it is quite simple to define dimensions and facts from existing DB tables. Nevertheless, this could be not so "simple" depending on your tables - difficult to say without knowing them. I guess there's no generic easy solution ;-)
Then you can use an Excel pivot table (amongst others) to access the cubes. Note that, as far as I know, Excel does no caching or aggregation when connecting to a cube; it generates all the required MDX requests against the cube.
Hope that helps.

Migrating procedural, antique CRUD code and proprietary DBMS to OO ORM on SQL

Please excuse my long-winded explanation, but I wanted to be as explicit as possible in the hopes of getting as much useful feedback on my situation as possible. You can skip to the questions at the bottom if you are impatient.
Explanation
At my current job, development is done in an antiquated language that is hard-wired to a proprietary DBMS that comes with the language. The language is CRUD-focused, and is essentially a glorified database querying/reporting/updating language with some programming features bolted on as an afterthought. Most programs are top-down procedures and there is very little code reuse; updating a record often requires updating many entangled, related records at the same time that you just need to "know about" as the proprietary database has no inherent foreign key relationships. If a table needs to be updated, we generally must grep our source code and update every procedure that creates/updates records for that table and recompile. I could go on with other annoyances, but needless to say, I am looking for a way to abstract away as much of this behavior as possible into reusable code segments.
The language has semi-recently added some support for object-oriented development, and I have been able to demonstrate the benefits of reusable code to my coworkers with a recent project written using OO constructs. However, my project was only possible because it was a rare task that did not require interacting with our database.
I have really been trying hard to find a way to create re-usable code using OO techniques with this language, but since everything is so database-focused, what I really need is a way to create container classes around our table designs, putting most of our data processing logic into class methods and merging N related tables into 1 singular class. This has brought me to the idea of ORM frameworks, which of course is non-existent on the language I am using at work.
What I have found, is that the DBMS for this language can run a SQL99 engine concurrently with the proprietary language engine, and it includes JDBC and ODBC drivers. This has opened the door for me to explore migration strategies, which is where I think we eventually need to go. Since the SQL engine runs concurrently with the old engine, it is possible for us to do an incremental migration, running new code alongside old code with an eventual goal of migrating our data to a "pure" SQL DBMS when all the old code is replaced.
I initially did quite a bit of reading and proposed Java (using JPA2 for ORM) to my manager, but I think I scared him as he views Java as being a bit heavyweight for our needs. I then did a little more digging and re-proposed Ruby using the JRuby interpreter (using either ActiveRecord or DataMapper for ORM), which was much better received as Rails seems to fit in well with the re-shifting of our development to Web-based front-ends that we are attempting to move to with our old cludgy code, and of course because the ability to interact with Java if the need arises is a great capability.
The Questions
1. Nearly all of the reading I have been doing about ORM is focused on starting with a class structure, and creating the mapped database structure as a secondary process. Is going the other way around (starting with an existing database and mapping classes to it) a very odd thing to do?
2. Assuming question #1 == true, how flexible are existing ORM frameworks such as JPA2, ActiveRecord, DataMapper etc. to "imperfect" table design? I am sure we will have to do some refactoring of existing table design, but would like to know if I am undertaking a Herculean task before I waste too much time on the effort.
3. If anyone has a better idea for language+ORM, I would love to hear it. It must be SQL-ready using JDBC or ODBC to fit into our incremental migration plan.
If anyone has any experience on a similar effort and could point out any helpful resources (especially books), I would be very grateful!
Nearly all of the reading I have been doing about ORM is focused on starting with a class structure, and creating the mapped database structure as a secondary process. Is going the other way around (starting with an existing database and mapping classes to it) a very odd thing to do?
Not really. There are several approaches when dealing with the persistence layer of an application:
Top-down: You start with the object model and the mappings and you derive the database schema from that data.
Bottom-up: You start with your data model i.e. the database schema and you derive the object model and the mappings from the tables.
Middle-out: You start with the mapping and you generate the object model and the tables.
Meet-in-the-middle: You start with an existing database schema and an existing object model, and you develop a mapping to map between the two (you can even introduce an additional object layer and bridge the existing one).
The top-down approach is the most object-oriented but the meet-in-the-middle approach is probably the most common.
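As a concrete illustration of the bottom-up approach, some ORMs can derive the object model from an existing schema at runtime. The sketch below uses Python's SQLAlchemy purely as an illustration of the idea (JPA2 and ActiveRecord have analogous reverse-engineering facilities); the database file and table name are invented, and automap requires the reflected tables to have primary keys:

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///legacy.db")  # an existing, pre-ORM database

# Reflect the existing tables and generate mapped classes from them.
Base = automap_base()
Base.prepare(autoload_with=engine)

Order = Base.classes.orders  # class derived from the existing 'orders' table
with Session(engine) as session:
    for order in session.query(Order).limit(5):
        print(order.id)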
Assuming question #1 == true, how flexible are existing ORM frameworks such as JPA2, ActiveRecord, DataMapper etc. to "imperfect" table design? I am sure we will have to do some refactoring of existing table design, but would like to know if I am undertaking a Herculean task before I waste too much time on the effort.
I would say that JPA is not the most flexible, it will not deal very well with exotic or heavily denormalized schemas (the result might be ugly from an OO point of view). Accesses that don't go through JPA might also be a problem. A data mapper tool like iBatis (now mybatis) will give you more flexibility.
If anyone has a better idea for language+ORM, I would love to hear it. It must be SQL-ready using JDBC or ODBC to fit into our incremental migration plan.
I know that RoR can deal with existing databases, I'm just not sure what the result will look like. But I don't really have enough experience with RoR so I'll let experts elaborate on this.
If anyone has any experience on a similar effort and could point out any helpful resources (especially books), I would be very grateful!
I suggest browsing Scott Ambler's website and his book(s):
The Process of Database Refactoring: Strategies for Improving Database Quality
More food for thought:
Working Effectively with Legacy Code by Michael Feathers
Clean Code by Robert Martin

Is SQL the ''assembler'' of the NoSQL database world?

I recently came across http://www.fossil-scm.org/index.html/doc/tip/www/theory1.wiki by D. Richard Hipp, the developer responsible for SQLite.
It got me thinking: is Fossil the only NoSQL database that uses SQL?
Do others use SQL as a 'High Level Scripting Language'?
From the article, it sounds like Fossil isn't a database any more than git is a database. Yes, it's a thing that contains data, and yes, it's backed by a database, but it seems pretty far from a database itself. So the first part of your question basically relies on a faulty assumption. There is a database called Friendly which uses MySQL to store schema-less models, but it seems like an awkward bandaid sort of solution at best.
I'm certainly not familiar with all of the NoSQL options out there, but, to my knowledge, none of the well-thought-of ones use SQL for anything. MongoDB and CouchDB, the two I'm most familiar with, both use Javascript as part of their query interface, though in very different ways. MongoDB has queries more like what you'd expect from a relational database: you can write an arbitrary query for all documents that match a certain set of attributes. However, unlike a relational database, there's no such thing as a join (you'll only ever get a list of distinct documents back, not compound documents) and you can write arbitrary Javascript code to select documents. CouchDB, on the other hand, does not allow arbitrary queries. Instead, you create views (which are essentially simpler key-value stores) using map/reduce functions written in Javascript and then query those views from a start key to an end key.
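To make that contrast concrete, here is what the MongoDB side looks like from Python via pymongo: the query is a nested data structure of match criteria rather than an SQL string (the collection and field names are invented, and a local mongod is assumed):

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB server on localhost
db = client.shop

# All documents matching a set of attributes; no joins, no SQL --
# the filter itself is just a dict.
for doc in db.products.find({"type": "hard drive",
                             "platters": {"$gt": 3},
                             "interface": "SATA"}):
    print(doc["_id"])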
In both cases, the type of information being transmitted to the server to perform the query isn't well-suited for the type of problem that SQL is good at solving. The trade-off to SQL being so high-level (to use the logic of the author of the paper) is that it's only suitable for a very narrow set of problems.
The creator of Fossil / SQLite is working on and pushing UnQL as the NoSQL standard:
UnQL means Unstructured Query Language. It's an open query language for JSON, semi-structured and document databases.
It looks like a stripped-down version of SQL.