SQL Azure Federations and the Atomic Unit Identity - sql

I've started work on my first Azure application, and I'm learning a lot as I go. One of the features I discovered recently was Federations in SQL Azure, essentially the SQL Azure sharding implementation so we can scale horizontally.
My project started out on SQL Server and was already largely grouped by user Profile, so I decided that Profile makes the most sense to federate on. I've created the federation, including all of the child tables, with one snag: Identity is not supported. I get why it's not supported; what I'm not sure about is what best to replace it with. This seems like a huge problem that someone else must have solved, but I haven't been able to find much.
I could just use UniqueIdentifier, but I read that can be a pain to split on. I'm also not too sure of what other performance issues I could run into using a GUID as my Primary Key for federated tables.
I'm using this with Entity Framework, but haven't got to the point of making that federation friendly yet. From what I can tell, it's not much more complicated than executing some code to select your federation before writing your LINQ query, but I'll cross that bridge when I get to it.
For the moment, I have no idea how best to actually add items to my federation, because there is no good solution to generating an identity.
Any advice would be greatly appreciated.

I'm using GUIDs with SQL Azure Federations; it's almost always the best choice when sharding data. If you use Identity across many federation members, you will end up with duplicate primary key values. When you need to merge the data back together, or archive it, how would you deal with those records?
People think GUIDs give poor insert performance, especially when used as a clustered index. But I have never run into this problem. Or I should say, there are many other places to tune before this one.

I can't speak to the EF question, but I can comment on the idea of using UniqueIdentifier as your key type. This, in my mind, is the best choice. UniqueIdentifier is actually very easy to split on; the reason people think it's hard is that they forget what a UniqueIdentifier is. The GUID that we all know and love is a hex representation of a 128-bit integer. This means that we can use standard integer operations on it, and thus it's as easy to work with as the Int (auto number) you know and love.
While it's not specifically about SQL Azure federations (it's about Windows Azure Storage) this blog post of mine on using the GUID type for sharding should give you all you need to know.
http://www.syringe.net.nz/CommentView,guid,cebe3e19-85e6-4d5b-bc24-afb6f66aaeb1.aspx
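To make the "a GUID is just a 128-bit integer" point concrete, here is a minimal Python sketch. The helper names `shard_for` and `split_point` are hypothetical and not taken from the linked post; the sketch assumes the key space is divided into equal-width ranges, which may not match how a particular federation is actually split.

```python
import uuid

# Size of the 128-bit key space that every GUID lives in.
KEY_SPACE = 1 << 128

def shard_for(key: uuid.UUID, shard_count: int) -> int:
    # Hypothetical helper: UUID.int exposes the GUID as a plain integer,
    # so picking one of `shard_count` equal-width ranges is simple arithmetic.
    return key.int * shard_count // KEY_SPACE

def split_point(low: int, high: int) -> uuid.UUID:
    # Hypothetical helper: the midpoint of an integer range, as a GUID again.
    return uuid.UUID(int=(low + high) // 2)

lowest = uuid.UUID(int=0)
highest = uuid.UUID(int=KEY_SPACE - 1)
middle = split_point(lowest.int, highest.int)

print(middle)
print(shard_for(lowest, 4), shard_for(middle, 4), shard_for(highest, 4))
```

One caveat: a database may order GUID values differently from plain integer order, so check how your engine compares them before relying on range boundaries computed this way.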

Related

Can I take advantage of Yugabyte's compatibility?

Yugabyte seems to support Redis, Cassandra and SQL queries. Do they work with each other? For example, can I write data with Cassandra API and later perform SQL queries against them?
These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because the data types of one API are not always present in the others, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.

What do I need to know about databases in order to create a quality Django app?

I'm trying to optimize my site and found this nice little Django doc:
Database Access Optimization, which suggests profiling followed by indexing and the selection of proper fields as the starting point for database optimization.
Normally, the django docs explain things pretty well, even things that more experienced programmers might consider "obvious". Not so in this case. After no explanation of indexing, the doc goes on to say:
We will assume you have done the obvious things above.
Uhhh. Wait! What the heck is indexing?
Obviously I can figure out what indexing is via google, my question is: what is it that I need to know as far as database stuff goes in order to create a scalable website? What should I be aware of about the Django framework specifically? What other "obvious" things ought I know? Where can I learn them?
I'm looking to get pointed in a direction here. I don't need to learn anything and everything about SQL, I just want to be informed enough to build my app the right way.
Thanks in advance!
I encourage you to read all that the other answers suggest and whatever else you can find on the subject, because it's all good information to know and will make you a better programmer.
That said, one of the nice things about Django and other similar frameworks is that for the most part you don't have to know what's going on behind the scenes in the DB. Django adds indexes automatically for fields that need them. The encouragement to add more is based on the use cases of your app. If you continually query based on one particular field, you should ensure that that field is indexed. It might be already (if it's a foreign key, primary key, etc.), but other random fields typically aren't.
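As a rough illustration of what an index on a frequently queried field buys you, here is a small sketch using Python's built-in sqlite3 module rather than Django itself. The table and index names are made up for the example; in Django the equivalent step would typically be declaring the field with `db_index=True`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, slug TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO article (slug, body) VALUES (?, ?)",
    [(f"post-{i}", "text") for i in range(1000)],
)

def plan(sql: str) -> str:
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail column.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)

query = "SELECT body FROM article WHERE slug = 'post-500'"
before = plan(query)   # no index on slug yet, so SQLite scans the table
conn.execute("CREATE INDEX idx_article_slug ON article (slug)")
after = plan(query)    # the planner can now look the slug up via the index

print(before)
print(after)
```

The exact wording of the plan differs between SQLite versions, but the second plan should mention `idx_article_slug` while the first does not.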
There are also various optimizations that are database client-specific. Django can't do much here because its goal is to remain database independent. So, if you're using PostgreSQL, MySQL, whatever, read about optimizations and best practices concerning those particular clients.
Database design and database normalization, both covered on Wikipedia (http://en.wikipedia.org/wiki/Database_design and http://en.wikipedia.org/wiki/Database_normalization), are two very important concepts, in addition to indexing.
In addition to these, having a basic understanding of your database of choice is necessary. Being able to add users, set permissions, and create a database are key things that you should know.
Learning how to back up your data is also crucial.
The list keeps getting longer: you should also be aware of the DB relationships that Django handles for you (OneToOne, ManyToMany, ManyToOne). https://docs.djangoproject.com/en/dev/topics/db/models/
The performance impact of JOINs shouldn't be ignored. Accessing model properties in Django is so easy, but some foreign key relationships can have a huge performance impact, and that is something to consider too.
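The cost of following foreign keys one object at a time can be sketched without Django, using plain sqlite3. The schema below is invented for illustration, and the `run` helper exists only to count statements. In Django itself, `select_related` is the usual way to ask for the JOIN up front.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    INSERT INTO author (id, name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book (title, author_id) VALUES
        ('A1', 1), ('A2', 1), ('B1', 2);
""")

counter = {"queries": 0}

def run(sql, params=()):
    # Count every statement we send so the two approaches can be compared.
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()

# Approach 1: one query for the books, then one more per book for its author.
counter["queries"] = 0
lazy = []
for title, author_id in run("SELECT title, author_id FROM book ORDER BY id"):
    (name,) = run("SELECT name FROM author WHERE id = ?", (author_id,))[0]
    lazy.append((title, name))
lazy_queries = counter["queries"]

# Approach 2: a single JOIN fetches the same pairs in one statement.
counter["queries"] = 0
joined = run(
    "SELECT book.title, author.name FROM book "
    "JOIN author ON author.id = book.author_id ORDER BY book.id"
)
joined_queries = counter["queries"]

print(lazy_queries, joined_queries)
```

Both approaches return the same data; the difference is that the first one issues a query per row, which grows with the size of the result set.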
Once you have a basic understanding of these things you should be at a pretty good starting point for creating a non-trivial django app!
Wikipedia has a nice article about database indexes. They are similar(ish) to an index in a book, i.e. they let you (the computer) find things faster because you just look at the index (probably a very bad example :-).
As for performance, there are many things you can do, but it is a very detailed subject in itself and particular to each RDBMS, so it would be distracting or irrelevant for the Django docs to go into great detail. The best thing is really to google performance tips for your particular RDBMS. There are some general tips, such as indexing and limiting queries to return only the required data.
I think one of the main things is a good design: sticking as much as possible to Normal Form and, in general, actually taking your database into consideration before programming your models etc. (which you clearly seem to be doing). Naming conventions are also a big plus, remembering that explicit is better than implicit :-)
To summarise:
Learn/understand the fundamentals such as the relational model
Decide on a naming convention
Design your database perhaps using an ERM tool
Prefer surrogate IDs
Use the correct data type of minimum possible size
Use indexes appropriately and don't over index
Avoid unnecessary/over querying
Prioritise security and stability over raw performance
Once you have an up and running database, 'tune' it by analysing/profiling settings, queries, design etc
Backup and archive regularly - cron
Hang out here :-)
If required advance into replication (master/slave - django supports this quite well too)
Consider upgrading your hardware
Don't get too hung up about it

SQLce DAL, Linq-to-Sql or EntityFramework

I'm learning databases, using SqlCe, and need business object to database mapping.
I'm currently trying to decide whether to use Linq to Sql or EntityFramework. (I understand L2S a bit, but haven't familiarized myself with EF yet.)
The program will only be developed and used by myself, so I have good control of the priorities:
I don't need to consider potential change of database type or data storage type, as I'm quite certain SQLce will stay sufficient.
I DO expect continued development and changes to the data schema while the program is in active use: changes to business object properties (hence database columns), and possibly to the overall table schema. So old data must be migrated to the new schema.
I also want to keep a decent degree of layer separation DAL/BLL, although this may not be necessary, it is good for me to learn these principles.
My question is: with these priorities, would I gain any benefit by choosing Linq2Sql over EntityFramework, or vice versa? (And please explain why.)
Btw, the project involves a very simple table schema and relations, with only ~4 tables in total.
Thanks!
You can use Linq to Sql for this; Linq to Sql covers a subset of what the ADO.NET Entity Framework offers.
For your needs it's better to use Linq to Sql, because your database is not complicated and has only a few tables. Linq to Sql is also easier to use than the ADO.NET Entity Framework.
Keep in mind that Linq2Sql only works with MS SQL Server out of the box, not with SqlCe.
As it seems, there are some tricks to get it to work, but I never tried it myself...no idea if it works as well as with the "real" SQL Server.
So I guess Entity Framework would be the safer choice.

Which Database can I Safely use a GUID as Primary Key besides SQL Server?

The reason I want to use a Guid is that, in the event that I have to split the database in two, I won't have primary keys that overlap across both databases. So if I use a Guid there won't be any overlapping. I also want to use the GUID in the URL, so the Guid will need to be indexed.
I will be using ASP.NET C# as my web server.
Postgres has a UUID type. MySQL has a UUID function. Oracle has a SYS_GUID function.
As others have said, you can use GUIDs/UUIDs in pretty much any modern DB. The algorithm for generating a GUID is pretty straightforward and you can be reasonably sure that you won't get dupes; however, there are some considerations.
+) Although GUIDs are generally representations of 128-bit values, the actual format used differs from implementation to implementation - you may want to consider normalizing them by removing non-significant characters (usually dashes or spaces).
+) To absolutely ensure uniqueness you can also append a value to the guid. For example, if you're worried about MS and Oracle guids colliding, add "MS" to the former and "Or" to the latter - now even if the guids themselves do collide, the keys won't.
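Both considerations above can be sketched in a few lines of Python. The function names are hypothetical; the example uses a source prefix plus a separator (the answer suggests appending the tag, which works equally well), and it relies on `uuid.UUID` accepting braced, hyphenated and upper-case input.

```python
import uuid

def normalize_guid(text: str) -> str:
    # uuid.UUID accepts hyphenated, braced and upper-case forms;
    # .hex returns 32 lower-case hex digits with no separators.
    return uuid.UUID(text.strip()).hex

def tagged_key(source: str, text: str) -> str:
    # Hypothetical scheme: tag the normalized GUID with the system it came from.
    return f"{source}:{normalize_guid(text)}"

a = normalize_guid("{CEBE3E19-85E6-4D5B-BC24-AFB6F66AAEB1}")
b = normalize_guid("cebe3e19-85e6-4d5b-bc24-afb6f66aaeb1")

print(a, a == b)
print(tagged_key("MS", b), tagged_key("Or", b))
```

Two differently formatted spellings of the same GUID normalize to one value, while the tagged keys stay distinct even when the underlying GUIDs are identical.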
As others have mentioned, however, there is a potentially severe price to pay here: your keys will be large (128 bits) and won't index very well (although this is somewhat dependent on the implementation).
The technique works very well for small databases (especially those where the entire dataset can fit in memory), but as DBs grow you'll definitely have to accept a performance trade-off.
One thing you might consider is a hybrid approach. Without more information it's hard to really know what you're trying to do so these might not help:
1) Remember that primary keys don't have to be a single column - you can have a simple numeric key to identify your rows and another column, containing a single value, that identifies the database that hosts the data or created the key. Creating the primary key as an aggregate of both columns lets the index work with simpler values and should be significantly faster.
2) You can "fake it" by constructing the key as a concatenated field (as in the above idea to append a DB identifier to the key). So your key would be a simple number followed by some DB identifier (perhaps a guid for each DB).
Indexing such a value (since the values would still be sequential) should be much faster.
In both cases you'll have some manual work to do if you ever do split the DB(s) - you'll have to update some keys with a new DB ID, but this would be a one-time, infrequent event. In exchange you can tune your DB much better.
There are definitely other ways to ensure data integrity across multiple databases. Many enterprise DBMSs have tools built in for clustering data across multiple servers or databases, some have special tools or design patterns that make it easier, etc.
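Option 1 above, a composite primary key made of a database identifier plus a simple numeric key, might look like the following sketch. It uses sqlite3 purely so the example is runnable; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        db_id    INTEGER NOT NULL,   -- which database created the row
        local_id INTEGER NOT NULL,   -- simple per-database counter
        name     TEXT,
        PRIMARY KEY (db_id, local_id)
    )
""")

# Rows created independently in two databases can reuse the same local_id.
conn.execute("INSERT INTO customer VALUES (1, 1, 'Ann')")
conn.execute("INSERT INTO customer VALUES (2, 1, 'Bob')")

# The same (db_id, local_id) pair is rejected by the composite key.
try:
    conn.execute("INSERT INTO customer VALUES (1, 1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute(
    "SELECT db_id, local_id, name FROM customer ORDER BY db_id, local_id"
).fetchall()
print(rows, duplicate_rejected)
```

The same `local_id` can exist once per originating database, so merging rows from several databases does not produce key collisions as long as each database has its own `db_id`.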
In short I would say that guids are nice and simple and do what you want, but that you should only consider them if either a) the dataset is small or b) the DBMS has specific features to optimize their use as keys (for example sequential guids). If the datasets are going to be very large or if you're trying to limit DBMS-specific dependencies I would play around more with optimizing a "key + identifier" strategy.
Most any RDBMS you will use can take any number and type of columns as a PK. So, if you're storing the GUID as a CHAR(n) for some length n, you should be fine. Now, I'm not sure if this is advisable, as I'm guessing indexing on CHARs is not as efficient as on integers.
Hope that helps.
I suppose you could store a GUID as an int128 as well.
Both MySQL and PostgreSQL are known to support GUIDs (I believe they're called UUIDs there, but it's the same thing): PostgreSQL has a UUID data type, and MySQL can generate them with its UUID() function.
Unless I have completely lost my memory, a properly designed 3rd+ normal form database schema does not rely on unique ints, or by extension GUIDs or UUIDs for primary keys. Nor does it use intermediate lookup tables of ints/GUIDS/UUIDS to relate the tables containing the data.
You should grind your schema until it expresses the relations amongst tables of data in terms of the data in the tables, not auto-generated identifiers that have no intrinsic relationship to the data.
I freely grant that you may just possibly be doing something that really really requires GUIDs (or auto-increment integers) for primary keys. But I seriously doubt that is the case - it almost never is.
You can implement your own membership provider based on whatever database schema you choose to design. It's nowhere near as tricky as it may look at first.
google "roll your own membership provider" for plenty of pointers.
In my theoretical little world, you'd be able to do this with SQLite. You'd generate the Guid from .Net and write it to the SQLite database as a string. You could also index that field.
You do lose some of the index benefits because it'd be stored as a string, but it should be fully backwards compatible, so you could import/export to/from SQL Server.
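The SQLite idea can be tried out quickly. The sketch below uses Python instead of .NET, but follows the same steps: the application generates the GUID, stores it as text, and the primary key gives it an index. The table name is illustrative only.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# The GUID is stored as 36-character text and used as the primary key,
# which SQLite backs with an automatic unique index.
conn.execute("CREATE TABLE profile (id TEXT PRIMARY KEY, name TEXT)")

new_id = str(uuid.uuid4())          # generated by the application, not the DB
conn.execute("INSERT INTO profile (id, name) VALUES (?, ?)", (new_id, "Ann"))

row = conn.execute(
    "SELECT id, name FROM profile WHERE id = ?", (new_id,)
).fetchone()
round_tripped = uuid.UUID(row[0])   # the text parses back into a GUID
print(row[1], round_tripped == uuid.UUID(new_id))
```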
From looking through the comments it looks like you are trying to use a different database to MS SQL with the ASP.net membership provider - as others have mentioned you could roll your own provider to use a different DB however a quick Google search turned up a few ready made options:
MySQL Provider
MySQL Provider 2
SqlLite Provider
Hope these help
If you are using other MS technologies already you should consider Sql Server Express.
http://www.microsoft.com/express/sql/default.aspx
It is a real implementation of MS Sql Server and it is free. It does have significant limitations as you might imagine, but if your product can fit inside those you get the support, developer community and stability of Sql Server and a clear upgrade path if you need to grow.

Is using MS SQL Identity good practice?

Is using MS SQL Identity good practice in enterprise applications? Doesn't it create difficulties when writing business logic, or when migrating from one database to another?
Personally I couldn't live without identity columns and use them everywhere however there are some reasons to think about not using them.
Originally, the main reason not to use identity columns, AFAIK, was distributed multi-database (disconnected) schemas that use replication and/or various middleware components to move data. There was simply no distributed synchronization machinery available, and therefore no reliable means to prevent collisions. This has changed significantly, as SQL Server does support distributing IDs. However, their use still may not map onto more complex application-controlled replication schemes.
They can leak information. Account ID's, Invoice numbers, etc. If I get an invoice from you every month I can ballpark the number of invoices you send or customers you have.
I run into issues all the time with merging customer databases and all sides still wanting to keep their old account numbers. This sometimes makes me question my addiction to identity fields :)
Like most things, the ultimate answer is "it depends": the specifics of a given situation should hold a lot of weight in your decision.
Yes, they work very well, are reliable, and perform the best. One big benefit of identity fields over the alternatives is that they handle all of the complex concurrency issues of multiple callers attempting to reserve new IDs. This may seem like something trivial to code, but it's not.
These links below offer some interesting information about identity fields and why you should use them whenever possible.
DB: To use identity column or not?
http://www.codeproject.com/KB/database/AgileWareNewGuid.aspx?display=Print
http://www.sqlmag.com/Article/ArticleID/48165/sql_server_48165.html
The question is always:
What are the chances that you're realistically going to migrate from one database to another? If you're building a multi-db app it's a different story, but most apps don't ever get ported over to a new db midstream - especially when they start out with something as robust as SQL Server.
The identity construct is excellent, and there are really very few reasons why you shouldn't use it. If you're interested, I wrote a blog article on some of the common myths surrounding identity values.
The IDENTITY Property: A Much-Maligned Construct in SQL Server
Yes.
They generally work as intended, and you can use the DBCC CHECKIDENT command to manipulate and work with them.
The most common idea of an identity is to provide an ordered list of numbers on which to base a primary key.
Edit: I was wrong about the fill factor, I didn't take into account that all of the inserts would happen on one side of the B-tree.
Also, in your revised question, you asked about migrating from one DB to another:
Identities are perfectly fine as long as the migrating is a one-way replication. If you have two databases that need to replicate to each other, a UniqueIdentifier column may be your best bet.
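The two-way replication problem is easy to reproduce in miniature. In this sketch, two in-memory SQLite databases stand in for two SQL Server instances, and SQLite's `AUTOINCREMENT` stands in for `IDENTITY`; both substitutions are assumptions made only to keep the example runnable.

```python
import sqlite3
import uuid

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " guid TEXT NOT NULL, name TEXT)"
    )
    return conn

def add(conn, name):
    # The integer id is assigned by the database; the GUID by the caller.
    conn.execute(
        "INSERT INTO account (guid, name) VALUES (?, ?)",
        (str(uuid.uuid4()), name),
    )

site_a, site_b = make_db(), make_db()
for name in ("Ann", "Bob"):
    add(site_a, name)
for name in ("Cid", "Dee"):
    add(site_b, name)

def column(conn, col):
    # Collect all values of one column as a set for easy comparison.
    return {r[0] for r in conn.execute(f"SELECT {col} FROM account")}

id_clashes = column(site_a, "id") & column(site_b, "id")
guid_clashes = column(site_a, "guid") & column(site_b, "guid")
print(sorted(id_clashes), len(guid_clashes))
```

Each database hands out 1 and 2 independently, so the integer keys overlap as soon as rows flow in both directions, whereas the GUID column has no overlap to resolve.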
See: When are you truly forced to use UUID as part of the design? for a discussion on when to use a UUID in a database.
Good article on identities, http://www.simple-talk.com/sql/t-sql-programming/identity-columns/
IMO, migrating to another RDBMS is rarely needed these days. Even if it is needed, the best way to develop portable applications is to develop a layer of stored procedures isolating your application from proprietary features:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/02/24/writing-ansi-standard-sql-is-not-practical.aspx