Schema & data separation - data-lakehouse

I was going through an AWS webinar and found a slide where they recommend separating your data and your schema. They mentioned that if you separate them, each can evolve independently, but what is the point of a data change without a schema change, or vice versa? What use cases involve data evolution but not schema evolution, or the other way around? I'd appreciate your valuable input. I'm working on a lakehouse architecture and this question keeps haunting me!
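To make the distinction concrete, here is a minimal sketch of the two kinds of evolution, using pandas/PyArrow Parquet files plus a hand-rolled JSON "catalog" as stand-ins for whatever table format and catalog the webinar had in mind; every path and column name below is made up.

```python
# Sketch: schema kept in a tiny "catalog" file, data kept as Parquet files.
# Assumes pandas + pyarrow are installed; paths and column names are invented.
import json, pathlib
import pandas as pd

LAKE = pathlib.Path("lake/events")
LAKE.mkdir(parents=True, exist_ok=True)
CATALOG = pathlib.Path("catalog.json")

# v1 schema lives in the catalog, separate from the data files.
CATALOG.write_text(json.dumps({"version": 1, "columns": ["event_id", "ts", "amount"]}))

# Data evolution, no schema evolution: new files keep arriving under the same schema.
for day in ("2024-01-01", "2024-01-02"):
    df = pd.DataFrame({"event_id": [1, 2], "ts": [day] * 2, "amount": [9.5, 3.2]})
    df.to_parquet(LAKE / f"dt={day}.parquet")

# Schema evolution, no data rewrite: bump the catalog with a new nullable column.
CATALOG.write_text(json.dumps({"version": 2,
                               "columns": ["event_id", "ts", "amount", "currency"]}))

# Readers apply the catalog schema; old files simply surface nulls for the new column.
cols = json.loads(CATALOG.read_text())["columns"]
frames = [pd.read_parquet(p).reindex(columns=cols) for p in sorted(LAKE.glob("*.parquet"))]
print(pd.concat(frames, ignore_index=True))
```

In this reading, data evolution is just new files landing under an unchanged schema (daily loads, backfills, compaction), while schema evolution is a catalog change (here a new nullable column) that old data files never have to be rewritten for.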

Related

I need advice choosing a NoSQL database for a project with a lot of minute, related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and from the Google Code wiki page with similar info. There is a lot of data, and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80 to 100 MB.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their geospatial support, which can really optimize getting small datasets. But I also need to be sure to link all the stops on a route, because I will be propagating information along a transit route for that line. In this case I have found that I could benefit from a graph DB like Neo4j or OrientDB, but from what I know, neither has geospatial support, nor am I 100% sure that a graph DB is what I need.
The perfect solution might not exist, but I come here asking for help finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are some other free and open-source implementations as well.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend that you share the load and design the model with that in mind. You can surely do something along those lines.
I've used Mongo's geospatial features and can offer some guidance if you need help with a C# or JavaScript implementation - I would recommend it to start with because it's super easy to use. I'm learning all about Neo4j right now and am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross-reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze the movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node to go out and grab a list of movies from the related profile.
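A rough sketch of that cross-referencing idea in Python (pymongo plus the neo4j driver); the connection URIs, credentials, labels and property names are just placeholders I made up:

```python
# Sketch: keep the big document in Mongo, keep only structure plus the Mongo id in Neo4j.
# Assumes pymongo and the neo4j driver; URIs, credentials and names are placeholders.
from pymongo import MongoClient
from neo4j import GraphDatabase
from bson import ObjectId

mongo = MongoClient("mongodb://localhost:27017")
profiles = mongo["appdb"]["profiles"]

# Large, static data goes into a Mongo document; remember its id.
profile_id = profiles.insert_one({"name": "Alice", "movies": ["Alien", "Arrival"]}).inserted_id

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # The graph side only stores relationships plus the cross-reference id.
    session.run("MERGE (p:Profile {mongo_id: $mid, name: $name})",
                mid=str(profile_id), name="Alice")

# Later: a graph query returns mongo_id values, which are hydrated from Mongo.
with driver.session() as session:
    ids = [rec["p.mongo_id"] for rec in session.run("MATCH (p:Profile) RETURN p.mongo_id")]
docs = list(profiles.find({"_id": {"$in": [ObjectId(i) for i in ids]}}))
```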
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea, as I have recently prototyped and implemented a few more solutions where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important. For instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo, which may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point). On the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL Server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database - in other words, querying the data required to build an entity, or a list of entities, that satisfies the requested parameters - does not require significant processing (multiple joins, for instance).
Do you require ACID compliance? (Aside: if you have a graph problem, you can leverage Neo4j for this.) There are document databases with ACID compliance, but there's a reason Mongo is not: see "What does MongoDB not being ACID compliant really mean?"
One use of Mongo I saw in the wild that I thought was worthy of mention: Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user-based customization in terms of targeting offers, advertising, etc.
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/
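To give a feel for it, here is a sketch of the "stops near a GPS fix" query against PostGIS from Python via psycopg2; the stops table, its geography column and the connection string are assumptions on my part:

```python
# Sketch: nearest stops within 500 m of a GPS position, using PostGIS.
# Assumes psycopg2 and a stops table with a geography column called geom;
# all names and the connection string are made up for illustration.
import psycopg2

lon, lat = -73.9857, 40.7484  # example GPS fix

conn = psycopg2.connect("dbname=transit user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT stop_id, stop_name,
               ST_Distance(geom, ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS meters
        FROM stops
        WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, 500)
        ORDER BY meters
        LIMIT 20;
        """,
        (lon, lat, lon, lat),
    )
    for stop_id, stop_name, meters in cur.fetchall():
        print(stop_id, stop_name, round(meters))
```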

Schema-agnostic OLAP-like tool?

Do there exist any (ideally free or open-source) tools for performing OLAP analyses on arbitrary tables in a relational database, without requiring any advance specification of dimensional hierarchies, cardinalities, or any other meta-information about the table beyond what can be extracted automatically from the table itself?
My inability to Google for anything like what I'm describing makes me suspect I'm using incorrect terminology and what I'm searching for isn't properly considered to be OLAP. If this is the case, what I specifically want is anything that would let technically unsophisticated users create cross-tab or contingency table aggregations using tables in a relational DB without needing to write elaborate SQL queries.
Or, in other words, I'd like something that mimics Excel's PivotTables on a larger scale. I appreciate that Excel does indeed generate extensive caches behind the scenes when you make a PivotTable, but it does this without the user having to explain to it which caches need creating. This is the functionality I'm trying to find elsewhere, if it exists.
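(Not OLAP proper, but for the "PivotTable over an arbitrary table, without pre-declared hierarchies" itch described above, one low-tech sketch, with the database, table and column names made up:)

```python
# Sketch: an Excel-PivotTable-style cross-tab over an arbitrary relational table.
# Assumes pandas and a SQLite file; the table and column names are made up.
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
df = pd.read_sql_query("SELECT region, product, year, revenue FROM sales", conn)

# Rows = region, columns = year, cells = summed revenue; no cube definition needed.
pivot = pd.pivot_table(df, index="region", columns="year",
                       values="revenue", aggfunc="sum", fill_value=0)
print(pivot)
```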
The best options I know of are Excel and Access, but of course they are not open source. This space kinda got trampled in the explosion of interest in what is now called Business Intelligence and a lot of companies got bought by MS and others. It's pretty thin now as far as I can tell. I'll watch this thread though.
The most useful paradigm to attach to is, I think, spreadsheets, and there's not much competition there any more. Google Docs spreadsheets can import CSVs and the like exported from databases, and there's a pivot chart available, but not much more.
The other place I've seen OLAP capabilities is in the Adobe Flex libraries to build on with ActionScript if you have any inclination in that direction. As usual, Adobe manages to get it about 90% right but doesn't quite provide a whole product.
icCube aims to set up an OLAP cube as simply as possible. It is not schema-agnostic, but I think it is quite simple to define dimensions and facts from existing DB tables. Nevertheless, it might not be so "simple" depending on your tables - it's difficult to say without knowing them. I guess there's no generic, easy solution ;-)
Then you can use an Excel pivot table (amongst others) to access the cubes. Note that, as far as I know, Excel does not do any caching or aggregation when connecting to a cube; it simply generates all the required MDX requests against the cube.
Hope that helps.

Database Modification or start over?

So I'm currently working on rebuilding an existing website that is used internally at my company for project management. At heart it is a bug-tracking utility that has some customer support and accounting operations linked into it.
Currently the database model is very repetitive. A good example of this: a UserId is linked into a record (an FK relationship to a user table that contains all the information about the user), and then all the information about the user also exists in that record's table.
I've been tasked with improving the website and the functionality of the model; however, I want to reduce the repetition of data (is this normalization, or is that the breaking apart of unlinked items into separate tables?). I'm not sure what the best method of doing this would be. I'm thinking of generating the creation scripts for the database, creating a new database project in VS to modify the database, and then generating some scripts to populate the new database model from the old database.
I plan on using the Entity Framework and ASP.NET MVC 2 to build the website, as I think it provides the most flexible model moving forward for the modification and maintenance of the website.
The reason I ask all of this is because I'm very familiar with using databases and modifying existing ones to be used in applications and websites but I'm trying to discover the best way to build one.
I'm curious if there is any material on the best way to do this or if I should be using a different tool to do this with?
Edit: Providing more information on the model
There are 4 major areas that we have that are used:
1. Cases (Bugs, Features, Working Tasks, etc.)
2. Tickets (Tech Support Events)
3. Errors (errors generated from our logging library; basically a stack trace with customer information)
4. Licenses (keeps track of each customer's license and allows modification of those licenses)
These are the Objects that are intermixed and used throughout the above 4 major areas.
Users (People who use the system)
Customers (People who use our software)
Stores (Places where our customers use our software)
Products (Our Software)
Relationships
Cases:
A Case has to have a User, and can have a Customer, Store, Error, Ticket, and/or Product.
Tickets:
A Ticket has to have a User and a Customer, and can have a Store, Error, and/or Product.
Errors:
An Error has to have a Product, and can have a Case, Ticket, Store, and/or Product.
Licenses:
A License has to have a Product and a Customer, and can have a Store.
Like I said, it's a very basic website with a not overly complex database, if done correctly.
Currently the database has no FK constraints, lots of information replicated across tables, and lots of extra tables that are duplicates with different names.
E.g.
Each Case type has a separate table, so there are FeatureRequest, Bug, Tasks, Completed, etc. tables that all contain the same information.
Normalization is about storing data without redundancy or anomalies.
One example of an anomaly could be when attributes about a user in your main table are not in sync with the users table. Someone changes information about that user in one table without reflecting the changes in the redundant copy. The problem is that it's hard to know which change is the correct one.
Some people think that normalization is just about breaking apart tables into littler tables, because that's what they see as the most common type of change. But that's not the goal of normalization. It's just by coincidence that most mistakes of non-normalization involve stuffing too much data into one table where multiple tables would be correct.
It's hard to answer your question about whether to modify your database in-place or whether to create a whole new database and migrate to it.
What I would do in your case is to design a properly normalized database, and then examine the differences between that and your existing database. Imagine what you would have to do for each difference, to change your old database to the new one, versus a data migration. It could be that only a few changes are needed, only dropping the redundant columns. Or it could be that some major rework is needed. It's impossible to tell until you do the work of creating a normalized data model so you can compare.
The bigger task might be to adapt your application code that uses the database. One way to ease this transition is to create database views on top of the normalized database, which mimic your old non-normalized database. That way hopefully you don't have to rewrite every bit of code in your app all at once, you can keep some of it the same at least until you can refactor the code.
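A toy illustration of that view trick, using sqlite3 just to keep it self-contained; the "old" wide-table layout and all names here are invented:

```python
# Sketch: a compatibility view that makes the normalized tables look like the old
# denormalized one, so legacy app queries keep working during the refactor.
# sqlite3 is used only to keep the example runnable; all names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized layout: user attributes live only in users.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT, email TEXT);
    CREATE TABLE cases (case_id INTEGER PRIMARY KEY, title TEXT,
                        user_id INTEGER REFERENCES users(user_id));

    INSERT INTO users VALUES (1, 'alice', 'alice@example.com');
    INSERT INTO cases VALUES (10, 'Crash on save', 1);

    -- View shaped like the old wide table the application still expects.
    CREATE VIEW legacy_cases AS
    SELECT c.case_id, c.title, c.user_id, u.user_name, u.email
    FROM cases AS c JOIN users AS u ON u.user_id = c.user_id;
""")

# Old code that selected from the wide table keeps working unchanged:
for row in conn.execute("SELECT case_id, title, user_name, email FROM legacy_cases"):
    print(row)
```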
Also having a good set of regression tests in place is ideal, so you can be sure your app still does all the tasks it is supposed to do, as you refactor the database and the code that uses the database.
Re your comment: You mention that you're adding new functionality to the user model at the same time. I would find it too confusing to try to do this simultaneously with refactoring. Refactoring typically does not change functionality, it only changes implementation. But refactoring adds value because it makes the code easier to maintain or debug, improves efficiency, or prepares you to make future functionality changes more easily.
I would recommend that you bite the bullet and add your new user model features to the old non-normalized database. It's good to get the benefit of new features in the short term, and you also need to develop those features first in order to understand them well enough to account for them in your big refactoring project.
Here are some suggestions for resources to help you truly understand what normalization means:
SQL and Relational Theory by C. J. Date
A Simple Guide to Five Normal Forms in Relational Database Theory by William Kent
Database Normalization at Wikipedia and its sub-pages for each respective normal form
SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming by me, Bill Karwin. I wrote a chapter about database normalization that I hope explains it in plain English and with good examples.
Here are a couple of resources for managing changes to a database:
Refactoring Databases by Scott W. Ambler and Pramodkumar J. Sadalage
Agile Database Techniques: Effective Strategies for the Agile Software Developer by Scott W. Ambler
How long do you have, and how big is the database?
It's very difficult to answer this question in black and white without being immersed in your environment and business case. It really doesn't seem like technology is your limitation; you just have to choose between solutions.
Re-creating is what programmers instinctively go for. However, in the "real world", sometimes we spend a lot of effort on something that isn't used that much or won't last that long.
So, food for thought: how long will it take you to redo the database, and how much will it cost? Will working with what already exists be sufficient for the functionality being asked for?

Creating a simple database schema

I'm new to SQL and could use some help in creating a database schema for my program, which manages and installs programs for my home network. Are there any guidelines/tutorials for creating database schemas?
Probably the most important concept to understand before you design your schema (you'll thank yourself for it later, trust me! :-) is that of Normalisation. The tutorial at db.grussell.org doesn't look too shabby and will give you a good grounding. In fact, if you click the "Up One Level" link and take a look around, some of the other information might be quite useful as well.
My "top tip" is: Write it down on paper or in notepad, or anything other than a database, before you start writing code. Get a good idea of what you need your schema to be able to do before you set it in stone (And by "set it in stone" I mean, realise that you've written a load of code against the schema that would have to be re-written if you change it to do what you've just realised you now need).
Designing Databases is a separate field of study and expertise. It cannot be condensed into one answer. Since you are interested in tutorials, look at the section on Database Design in any text book on Database Management Systems. I would recommend
Database System Concepts, 5th edition, by Abraham Silberschatz, Henry F. Korth, and S. Sudarshan
In database design, remember the following:
1) You are identifying the important objects of interest in your home network. Try not to indulge excessively in the processes themselves, though they are important for identifying the key data units that you need to capture.
2) Use ER/UML modelling techniques to come up with a data model diagram/design. There are many CASE tools that can help you in drawing this.
3) Use the principles of database normalization to fine-tune your schema and avoid data redundancies. Redundant data leads to the following side effects: inability to maintain consistency among the redundant copies, and inability to store some data in an elegant manner.
4) Forward-engineer your design into DDL statements for the DB of your choice. Most CASE tools support this; a rough hand-written sketch of this step follows the tool list below.
CASE tools:
Microsoft Visio
ER/Studio (very expensive)
Toad Data Modeler
There are many open-source tools too. You can try Dia, though it does not support forward engineering.
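As a very rough illustration of step 4, here is a hand-written DDL sketch for a small "programs installed on machines" model (SQLite flavour; every table and column name is an assumption about the asker's domain):

```python
# Sketch: forward-engineering a tiny normalized schema for "manage and install
# programs on my home network". SQLite for convenience; all names are assumptions.
import sqlite3

conn = sqlite3.connect("home_network.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS machines (
        machine_id INTEGER PRIMARY KEY,
        hostname   TEXT NOT NULL UNIQUE
    );

    CREATE TABLE IF NOT EXISTS programs (
        program_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        version    TEXT NOT NULL,
        UNIQUE (name, version)
    );

    -- One row per (machine, program) pair, so program details are never repeated
    -- per machine: that is the kind of redundancy normalization avoids.
    CREATE TABLE IF NOT EXISTS installations (
        machine_id   INTEGER NOT NULL REFERENCES machines(machine_id),
        program_id   INTEGER NOT NULL REFERENCES programs(program_id),
        installed_on TEXT,
        PRIMARY KEY (machine_id, program_id)
    );
""")
conn.commit()
```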

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways (that we see) of implementing this: one, create a single database and a backend in Python that appends tables from each database to the master database; two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures - say, for example, renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
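(For what it's worth, here is a sketch of the first option - a Python backend that appends a known table from each black-box file into a master database - with a small rename map standing in for the translation-table idea; the file paths, table name, columns and mapping are all placeholders:)

```python
# Sketch: append one known table from each black-box SQLite file into a master
# database, translating renamed columns via a per-version rename map.
# Paths, table/column names and the map itself are placeholders; brand-new
# columns the map doesn't know about are not handled here.
import glob
import sqlite3

RENAMES = {"samp_id": "sample_id", "val": "value"}  # old black-box name -> ours

master = sqlite3.connect("master.db")
master.execute(
    "CREATE TABLE IF NOT EXISTS measurements (sample_id TEXT, value REAL, source TEXT)"
)

for path in glob.glob("blackbox_output/*.sqlite"):
    src = sqlite3.connect(path)
    # Discover whatever columns this version of the black box produced.
    cols = [row[1] for row in src.execute("PRAGMA table_info(measurements)")]
    canonical = [RENAMES.get(c, c) for c in cols]
    rows = src.execute(f"SELECT {', '.join(cols)} FROM measurements").fetchall()
    placeholders = ", ".join("?" for _ in canonical) + ", ?"
    master.executemany(
        f"INSERT INTO measurements ({', '.join(canonical)}, source) VALUES ({placeholders})",
        [row + (path,) for row in rows],
    )
    src.close()

master.commit()
```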
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables; in the worst case, your interpretation of the data will break down entirely.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a common key/code/value schema).
Even in a data warehouse ETL situation, a new column will have to be classified as either a fact or a dimensional attribute, and then, if it is a dimensional attribute, assigned to the dimension table it best belongs to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick-and-dirty solution, not a "generic" solution, because generic solutions like the entity-attribute-value model often have bad performance. Don't do client-side joining (unifying the results) inside your Python code, because that is very slow. Use SQL for joining; it is designed for that purpose. Users can also build their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app; start by solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of that same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.