How to Design HBase Schema

We currently have one running project that uses an RDBMS database (with lots of tables and stored procedures for manipulating data). The current flow is: the data access layer calls stored procedures, which insert, delete, update, or fetch data from the RDBMS (please note that these stored procedures do not do any bulk processing). The current data structure contains lots of primary key and foreign key relationships, and there are lots of updates to existing database tables. I just want to know whether we can use HBase for our purpose. If so, how can we use Hadoop with HBase to replace the RDBMS?

You need to ask yourself, what is the RDBMS not doing for you, and what is it that you hope to achieve by moving to Hadoop/HBase?
This article may help. There are a lot more.
http://it.toolbox.com/blogs/madgreek/nosql-vs-rdbms-apples-and-oranges-37713
If the purpose is trying new technology, I suggest trying their tutorial/getting started.
If it's a clear problem you're trying to solve, then you may want to articulate the problem.
Good Luck!

I hesitate to suggest replacing your current RDBMS, simply because of the large developer effort that you've already spent. Consider that your organization probably has no employees with the needed experience for HBase. Moving to HBase, with the attendant data conversion and application rewriting, will be very expensive and risky.

Related

Sql Database structure for housing historical data and display changes

Good morning,
This is more of a concept question than anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date Category Activity
8/8/14 change position position 1 -> position 2
8/9/14 change department department a -> department b
...
...
The visual experience seems like it would benefit from an E-A-V design; however, I am designing the database to be easy to data mine, and from my reading I think that E-A-V is not the right way to go.
Does it make sense to duplicate data just to display it?
If not, does anyone have a suggestion for how to query the history table and display it? (Currently using jQuery and PHP to leverage the DB... I suppose I could do something interesting from a coding perspective to get it done.)
thank you for your help,
Travis
Creating an efficient operational database environment and creating an 'easy-to-data mine' environment are two separate (and often opposing) goals.
Others might disagree with me, but in my opinion it is best to create your database based on operational readiness (this means using your E-A-V design as mentioned above) and then worry about data transformation later. This may make it inconvenient later to transform the data to allow for easy mining, but it will accomplish an incredibly important goal, which is to eliminate the possibility of data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work but from a data integrity perspective, it is much safer than trying to create some system that is designed entirely for reporting. That's my personal opinion at least.
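For the display the question describes, one possible shape is a plain change-log table with one row per change, queried per person and ordered by date. This is only a minimal sketch: the table name `person_history`, its columns, and the helper functions are hypothetical, and SQLite stands in for whatever database sits behind the PHP layer.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical change-log table."""
    conn = sqlite3.connect(":memory:")
    # Dates are stored as ISO text so that sorting the text sorts chronologically.
    conn.execute(
        """CREATE TABLE person_history (
               person_id  TEXT NOT NULL,
               changed_on TEXT NOT NULL,
               category   TEXT NOT NULL,
               old_value  TEXT,
               new_value  TEXT
           )"""
    )
    conn.executemany(
        "INSERT INTO person_history VALUES (?, ?, ?, ?, ?)",
        [
            ("A", "2014-08-08", "change position", "position 1", "position 2"),
            ("A", "2014-08-09", "change department", "department a", "department b"),
            ("B", "2014-08-09", "change position", "position 3", "position 4"),
        ],
    )
    conn.commit()
    return conn


def history_for(conn, person_id):
    """Return one display line per recorded change for a person, oldest first."""
    rows = conn.execute(
        "SELECT changed_on, category, old_value, new_value "
        "FROM person_history WHERE person_id = ? ORDER BY changed_on",
        (person_id,),
    )
    return [f"{d}  {c}  {o} -> {n}" for d, c, o, n in rows]
```

Calling `history_for(build_demo_db(), "A")` yields the two lines for user A in date order; the same rows remain available for reporting queries without duplicating data for display.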
(sorry cannot comment yet)
You have to analyse the data you need to persist.
If you have only a couple of tables with no relationships, you probably don't need a database.
In this case the database solution will probably be slower (connection/transmission/security overhead...).
If it's a few MB of data, I would keep everything in one table.
You can easily load the whole data set in memory and do what you need to do.

Moving data from production db to datawarehouse (SQL Server)

We are developing a reporting module for our software, and because of this we need to move some data from the system's production db into a datawarehouse db which will be used as the datasource for the reports (SQL Server reporting).
The schema in the production DB is quite old, so once we have data in the DW DB, we will need some additional fields (for example, a correct datetime column calculated from the prod DB's 'date' and 'time' integer columns). (Don't ask, it's old.)
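As an illustration of that kind of derived field, here is a minimal sketch of combining two integer columns into one datetime. The post does not say how the integers are encoded, so this assumes `YYYYMMDD` for the date and `HHMMSS` for the time; the function name is hypothetical and the arithmetic would need adjusting for any other encoding.

```python
from datetime import datetime


def combine_int_date_time(date_int: int, time_int: int) -> datetime:
    """Build a datetime from two integers.

    ASSUMPTION: date_int is encoded as YYYYMMDD and time_int as HHMMSS.
    The real legacy encoding is not given in the post.
    """
    year, rest = divmod(date_int, 10000)
    month, day = divmod(rest, 100)
    hour, rest = divmod(time_int, 10000)
    minute, second = divmod(rest, 100)
    # datetime() raises ValueError for impossible values such as month 13.
    return datetime(year, month, day, hour, minute, second)
```

Under that assumed encoding, `combine_int_date_time(20140808, 93005)` gives 8 August 2014 at 09:30:05.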
We are discussing internally how to do this in an efficient manner. Right now, it is implemented in a fugly SSIS job that basically tears down the entire DW DB every night and builds it up again from the prod db, doing data transformations as it goes. This doesn't scale very well.
I've been looking into using "newer" technologies, like for example SQL Server replication to move data in a more granular fashion.
My question about this is:
-With replication the "move data" part is obviously solved, but not the data transform part. I know I can create update triggers on the DW DB, but all table-related triggers seem to be wiped whenever I do a reinitialize on the subscription, which makes it hard to set up.
I'm not looking for an exact answer here, more a hint on which direction to take this. Sorry if the question is a bit blurry.
update:
thanks for the good points below. This is software we're selling to customers, so I'm a big fan of having as few as possible "config items" for the customer to set up and maintain. The SSIS package as it stands today is one more "item" for the customer to keep tabs on, along with its schedules.
Replication intrigued me because it completely abstracts the whole CRUD "dilemma" when moving data, but you may be right: SSIS would still be better, as long as the SSIS logic is made a bit smarter than it is today.
The data might be quite large, though, so wiping and reimporting everything like we do today is definitely a problem that needs addressing.
I don't think replication is a good idea. It would be if the source and destination schemas were exactly the same, but as you pointed out, they are not. And also, all the calculations you mention the SSIS is doing, you still would have to do it because replication wouldn't.
I think SSIS is the way to go, I mean, this is exactly why it exists.
Since you are recreating the DB on each load, and if the amount of calculations and changes is not big and you don't need to do things like lookups to get surrogate keys from natural keys, you could create views on your main database that mimic the structure of the destination database, so you can do direct inserts (pretty much a source component mapped to a destination component).
Maybe if you specify the real issue with SSIS that you want to solve, it would be easier to help.
Just a quick update on this: The CDC functionality of SQL Server seems to be what we need to look into, this functionality integrates nicely with SSIS. Thanks for the hint on Slowly Changing Dimensions, and SSIS!
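To show the general idea of loading only what changed instead of rebuilding everything nightly, here is a minimal sketch of a high-water-mark load. It is not SQL Server's CDC feature and not an SSIS package: it is a simpler stand-in pattern, SQLite replaces both databases, and the `orders` table, its `row_version` column, and the function names are all hypothetical.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, row_version INTEGER)"


def setup():
    """Create two in-memory databases standing in for production and the warehouse."""
    prod = sqlite3.connect(":memory:")
    dw = sqlite3.connect(":memory:")
    prod.execute(SCHEMA)
    dw.execute(SCHEMA)
    return prod, dw


def incremental_load(prod, dw):
    """Copy only rows whose row_version is newer than anything already loaded."""
    # The highest version already in the warehouse is the high-water mark.
    (watermark,) = dw.execute(
        "SELECT COALESCE(MAX(row_version), 0) FROM orders"
    ).fetchone()
    changed = prod.execute(
        "SELECT id, amount, row_version FROM orders WHERE row_version > ?",
        (watermark,),
    ).fetchall()
    with dw:  # commit the batch as one transaction
        dw.executemany(
            "INSERT OR REPLACE INTO orders (id, amount, row_version) VALUES (?, ?, ?)",
            changed,
        )
    return len(changed)
```

This sketch assumes every insert or update in production bumps `row_version`, and it does not propagate deletes, which is one of the things a real change-capture mechanism has to handle.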

Database model refactoring for combining multiple tables of heterogeneous data in SQL Server?

I took over the task of re-developing a database of scientific data which is used by a web interface, where the original author had taken a 'table-per-dataset' approach which didn't scale well and is now fairly difficult to manage with more than 200 tables that have been created. I've spent quite a bit of time trying to figure out how to wrangle the thing, but the datasets contain heterogeneous values, so it is not reasonably possible to combine them into one table with a set schema for column definitions.
I've explored the possibility of EAV, XML columns, and ended up attempting to go with a table with many sparse columns since the database is running on SQL Server 2008. The DBAs are having some issues with my recently created sparse columns causing some havoc with their backup scripts, so I'm left wondering again if there isn't a better way to do this. I know that EAV does not lead to decent performance, and my experiments with XML data types also demonstrated poor performance, probably thanks to the large number of records in some of the tables.
Here's the summary:
Around 200 tables, most of which have a few columns containing floats and small strings
Some tables have as many as 15,000 records
Table schemas are not consistent, as the columns depended on the number of samples in the original experimental data.
SQL Server 2008
I'll be treating most of this data as legacy in the new version I'm developing, but I still need to be able to display it and query it- and I'd rather not have to do so by dynamically specifying the table name in my stored procedures as it would be with the current multi-table approach. Any suggestions?
I would suggest that the first step is looking to rationalise the data through views; attempt to consolidate similar data sets into logical pools through views.
You could then look at refactoring the code to look at the views, and see if the web platform operates effectively. From there you could decide whether or not the view structure is beneficial and, if so, look to physically rationalising the data into a new table.
The benefit of using views in this manner is you should be able to squeak a little performance out of indexes on the views, and it should also give you a better handle on the data (that said, since you are dev'ing the new version, it would suggest you are perfectly capable of understanding the problem domain).
With 200 tables as simple raw data sets, and considering you believe your version will be taking over, I would probably go through the prototype exercise of seeing if you can't write the views to be identically named to what your final table names will be in V2, that way you can also backtest if your new database structure is in fact going to work.
Finally, a word to the wise, when someone has built a database in the way you've described, without looking at the data, and really knowing the problem set; they did it for a reason. Either it was bad design, or there was a cause for what now appears on the surface to be bad design; you raise consistency as an issue - look to wrap the data and see how consistent you can make it.
Good luck!
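The view-based consolidation suggested above could be sketched roughly as follows. The table names `dataset_001` and `dataset_002`, their columns, and the view name `measurements` are hypothetical, and SQLite stands in for SQL Server 2008, so indexed views and other SQL Server specifics are not shown.

```python
import sqlite3


def build_db():
    """Create two per-dataset tables with different columns and one view over both."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dataset_001 (sample_id TEXT, value_a REAL, value_b REAL);
        CREATE TABLE dataset_002 (sample_id TEXT, value_a REAL);
        INSERT INTO dataset_001 VALUES ('s1', 1.5, 2.5);
        INSERT INTO dataset_002 VALUES ('s2', 3.0);

        -- The view tags each row with its source table and pads missing columns with NULL.
        CREATE VIEW measurements AS
            SELECT 'dataset_001' AS dataset, sample_id, value_a, value_b FROM dataset_001
            UNION ALL
            SELECT 'dataset_002' AS dataset, sample_id, value_a, NULL AS value_b FROM dataset_002;
        """
    )
    return conn


def rows_for(conn, dataset):
    """Query one dataset through the view; the dataset name is a parameter, not a table name."""
    return conn.execute(
        "SELECT sample_id, value_a, value_b FROM measurements WHERE dataset = ?",
        (dataset,),
    ).fetchall()
```

With a view like this, calling code passes the dataset name as an ordinary query parameter instead of building the table name dynamically, and the view can later be swapped for a physical table of the same name.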

SQL and Flat Files... In harmony?

I was just thinking, how quick it would be to store the actual data of an application in a flat file.
Now, you can't just go storing everything in a flat file... sometimes sorts and searches are required, and to go through directories and files recursively could be a pain.
Now, imagine, you stored all your search-able data in a database, and had a pointer field, that pointed to a data file?
This would be very specific per app, however- so long as all my search-able data is stored in the database, why should I store the actual data in a database?
(Locking, Data integrity aside) it would be faster, I am sure... but how much, and is it worth doing it?
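The layout being asked about, searchable fields in the database plus a pointer to a data file, might look something like this minimal sketch. The `documents` table, its columns, the file naming, and the function names are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import os
import sqlite3
import tempfile


def open_store():
    """Create an in-memory index table holding searchable fields and a file path."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def store(conn, data_dir, title, body):
    """Write the body to a flat file and record its path next to the searchable title."""
    cur = conn.execute("INSERT INTO documents (title, path) VALUES (?, '')", (title,))
    doc_id = cur.lastrowid
    path = os.path.join(data_dir, f"{doc_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    conn.execute("UPDATE documents SET path = ? WHERE id = ?", (path, doc_id))
    conn.commit()
    return doc_id


def load_by_title(conn, title):
    """Search in the database, then follow the pointer to read the actual data."""
    row = conn.execute("SELECT path FROM documents WHERE title = ?", (title,)).fetchone()
    if row is None:
        return None
    with open(row[0], encoding="utf-8") as f:
        return f.read()
```

Note that the file write and the database commit here are two separate steps, not one atomic unit, so a crash between them can leave a row pointing at a missing or stale file.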
Well, you often want to do things in queries beyond searching on the data. For instance, you might not search on a field called cost_center, but you might have a CASE statement that processes things differently depending on the information in the field. Or you might need to concatenate information together. You might update one field based on the information in another field. You might not search on a field today and need to search on it tomorrow.
A properly designed relational database can easily perform well with terabytes of data.
And frankly you should never even consider "data integrity aside". If you don't have data integrity you don't have data.
As to whether what you want is a good idea, it depends on the type of data you are storing and the types of things you intend to do with it. There isn't enough information to say for sure.
Well "Locking, Data integrity aside" should mean a faster system. If you drop constraints you should improve performance.
But in practical terms, I don't think it's going to be faster. There's a lot of development time behind RDBMSs, and that's why they are quick. Sure, non-relational databases perform better than them in highly parallel situations and in scenarios which take advantage of their qualities, for instance. However, your idea does not offer an improvement such as exploiting parallelism... any performance advantage would come from dropping the qualities of RDBMSs...
As well as other answers...
Sharing of data: how are multiple clients going to access data on a share?
Backup/Restore: synching of text and "searchable"
Security/permissions on text data
Change anomalies
There is no need to implement a SQL database just to perform searches. Lots of applications store their data in XML, and you can search in many ways, e.g., using Lucene. How fast it is entirely depends on the quantity of data and how you structure it - just like a database.
It can perform very fast, but can complicate things when you want to run more than one app server.
BTrieve was essentially what you describe. Back in the DOS days it was a very fast database.

How to efficiently archive older parts of a big (multi-GB) SQL Server database?

Right now I am working on a solution to archive older data from a big working database to a separate archive database with the same schema. I move the data using SQL scripts and SQL Server Management Objects (SMO) from a .Net executable written in C#.
The archived data should still be accessible and even (occasionally) changeable; we just want it out of the way to keep the working database lean and fast.
Hurling large portions of data around and managing the relations between tables has proven to be quite a challenge.
I wonder if there is a better way to archive data with SQL Server.
Any ideas?
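For reference, the basic move described in the question (copy old rows to an archive database with the same schema, then remove them from the working database) can be sketched as a single transaction. This is only an illustration under assumptions: SQLite's `ATTACH` stands in for a second SQL Server database, and the `events` table, its `created_on` column, and the function names are hypothetical. It also handles a single table, not the related tables that make the real job hard.

```python
import sqlite3


def open_working_db():
    """Open a working database with an attached archive database of the same schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS archive")
    conn.execute("CREATE TABLE main.events (id INTEGER PRIMARY KEY, created_on TEXT, payload TEXT)")
    conn.execute("CREATE TABLE archive.events (id INTEGER PRIMARY KEY, created_on TEXT, payload TEXT)")
    return conn


def archive_older_than(conn, cutoff):
    """Move rows older than the cutoff (ISO date text) into the archive database."""
    with conn:  # one transaction: the copy and the delete commit together or roll back together
        conn.execute(
            "INSERT INTO archive.events SELECT * FROM main.events WHERE created_on < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM main.events WHERE created_on < ?", (cutoff,))
    return cur.rowcount
```

Keeping the copy and the delete in one transaction means a failure partway through leaves the rows in the working database and nowhere else.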
I think if you still want/need the data to be accessible, then partitioning some of your biggest or most-used tables could be an option.
Yep, use table and index partitioning with filegroups.
You don't even have to change the SELECT statements, unless you want to get the last bit of speed out of them.
Another option is workload balancing across two servers, with two-way replication between them.
We are in a similar situation. For regulatory reasons we cannot delete data for a set period of time, but many of our tables grow very large and unwieldy and realistically much of the data that is older than a month can be removed with few day-to-day problems.
We currently prune the tables programmatically, using a custom .NET/shell combination app that uses BCP to back up to files, which can be zipped and left on an out-of-the-way network share. This isn't particularly accessible, but it is more space efficient. (It is complicated by our needing to keep certain historic dates, rather than being able to truncate at a certain size or with key fields in certain ranges.)
We are looking into alternatives but, I think surprisingly, there's not much in the way of best-practice on this discussion!