Moving from SQL Server to Cassandra

I have a data-intensive project for which I recently wrote the code; the data and stored procedures live in an MS SQL database. My initial estimate is that the database will grow to 50TB, after which growth will become fairly static. The final application will perform lots of row-level lookups and reads, with only a very small percentage of writes back to the database.
With the above scenario in mind, it's being suggested that I should look at a NoSQL option in order to scale to the large load of data and transactions, and after a bit of research the road leads to Cassandra (with MongoDB as a second alternative).
I would appreciate your guidance with the following set of initial questions:
-Does Cassandra support the concept of stored procedures?
-Would I be able to install and run the 50TB db on a single node (single Windows Server)?
-Does Cassandra support/leverage multiple CPUs in a single server (e.g. 4 CPUs)?
-Would the open source version be able to support the 50TB db, or would I need to purchase the Enterprise version?
Regards,
-r

Does Cassandra support the concept of stored procedures?
Cassandra does not support stored procedures. However, there is a feature called "prepared statements" which lets you submit a CQL query once and then execute it multiple times with different parameters. The set of things you can do with prepared statements is limited to regular CQL; in particular, you cannot use loops, conditional statements or other procedural constructs. You do get some measure of protection against injection attacks, and you avoid re-parsing the query on every execution.
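To illustrate the idea of one statement executed many times with different bound parameters, here is a minimal sketch. It deliberately uses Python's built-in sqlite3 module as a stand-in, not a Cassandra driver, so it shows the general parameter-binding technique rather than Cassandra's own API. The `users` table and the `lookup_users` helper are hypothetical names made up for the example.

```python
import sqlite3

def lookup_users(conn, names):
    # One statement text with a "?" placeholder, executed once per value.
    # The value is bound as data, never spliced into the SQL text.
    query = "SELECT id, name FROM users WHERE name = ?"
    results = []
    for name in names:
        results.extend(conn.execute(query, (name,)).fetchall())
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# The third "name" is a classic injection attempt; because it is bound as a
# parameter it is treated as a literal string and simply matches no rows.
rows = lookup_users(conn, ["alice", "bob", "x' OR '1'='1"])
print(rows)
```

Note that there is no place in this pattern for loops or conditionals on the server side; any such logic has to live in the client code, which is the limitation described above.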
Would I be able to install and run the 50TB db on a single node (single Windows Server)?
I am not aware of anything that would prevent you from running a 50TB database on one node, but you may require lots of memory to keep things relatively smooth, as your RAM/storage ratio is likely to be very low and will thus limit your ability to cache disk data meaningfully. What is not recommended, however, is running a production setup on Windows. Cassandra uses some Linux-specific IO optimizations, and is tested much more thoroughly on Linux. Far-out setups like the one you're suggesting are especially likely to be untested on Windows.
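To put a rough number on the RAM/storage ratio, here is a back-of-the-envelope sketch. The 256 GB of RAM is purely an assumption for illustration (the question does not say how much memory the server has), and the calculation ignores the fact that part of the RAM goes to the JVM heap and the OS rather than to caching disk data.

```python
def ram_to_storage_ratio(ram_gb: float, data_tb: float) -> float:
    """Fraction of the on-disk data that could be held in RAM at once."""
    data_gb = data_tb * 1024  # 1 TB = 1024 GB
    return ram_gb / data_gb

# Assumed figures: 256 GB RAM (hypothetical), 50 TB of data (from the question).
ratio = ram_to_storage_ratio(ram_gb=256, data_tb=50)
print(f"{ratio:.2%} of the data fits in memory")  # prints "0.50% of the data fits in memory"
```

With well under 1% of the data cacheable, most row lookups on a single node would have to go to disk unless the access pattern is heavily skewed towards a small hot set.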
Does Cassandra support/leverage multiple CPUs in a single server (e.g. 4 CPUs)?
Yes. Cassandra is multithreaded and will make use of multiple CPUs and cores.
Would the open source version be able to support the 50TB db, or would I need to purchase the Enterprise version?
The Apache distro does not have any usage limits baked into it (it makes little sense in an open source project, if you think about it). Neither does the free version from DataStax, the Community Edition.

Related

Can Microsoft Azure be used to run an Access VBA simulation model?

I have some simulation models that I routinely use that were built in Microsoft Access VBA. I have just become aware of Microsoft Azure (I know I am late to the show), and was curious to know if there was any way to run my models via Azure's distributed computing services to make them faster.
I saw something called SQL Azure on the website but I didn't entirely understand the product. 95% of the computation in the VBA model is SQL commands.
If you have any knowledge or experience I would love to hear from you.
SQL Azure is most like a remote SQL Server which - as you know - knows nothing about VBA.
But you can create one or more virtual machines hosted at Azure and install your application on them. Then, as needed, you can assign expanded CPU resources to these machines for as little as one hour.
Azure has a free tier. Create an account and you have access to a fair amount of resources for evaluation.
While Azure is distributed, it is also utility computing (that means slow in terms of processing ability, with a considerable amount of “governing” of the CPU available).
The other issue of course is that running SQL on Azure means that any data you pull into Access/VBA has to travel over a VERY slow network connection. This connection is thousands of times slower than reading a local Access table.
So the real issue then becomes transfer of data. You could, I suppose, rewrite the VBA code as T-SQL and pretty much dump the use of Access. However, T-SQL is even less suited to simulation-type software than VBA (and procedural T-SQL code is not that fast in terms of execution speed either).
So between the bandwidth issues and T-SQL on SQL Server being a rather limited language when it comes to writing and running lots of procedural code (which most simulation software entails), this is likely the wrong approach and the wrong technology here.

What database strategy to choose for a large web application

I have to rewrite a large database application, running on 32 servers. The hardware is up to date; each machine has two quad-core Xeons and 32 GByte RAM.
The database is multi-tenant; each customer has his own file, around 5 to 10 GByte each. I run around 50 databases on this hardware. The app is open to the web, so I have no control over the load. There are no really complex queries, so SQL is not required if there is a better solution.
The databases get updated via FTP every day at midnight. The database is read-only.
C# is my favourite language and I want to use ASP.NET MVC.
I thought about the following options:
Use two big SQL servers running SQL Server 2012 to serve the 32 servers with data, with the 32 servers running IIS and providing REST services.
Denormalize the database and use Redis on each web server, with BookSleeve as the Redis client.
Use a combination of SQL Server and Redis
Use SQL Server 2012 together with Hadoop
Use Hadoop without SQL Server
What is the best way, for a read-only database, to get the best performance without losing maintainability? Does MapReduce make sense at all in such a scenario?
The reason for the rewrite is that the old app, written in C++ with ISAM technology, is too slow, and its interfaces are old-fashioned and not nice to use from a website, especially when using AJAX.
The app uses a relational data model with many tables, but it is possible to write one accelerator table against which all queries can be performed, with all other information from the other tables available via a simple key lookup.
A few questions. What problems have come up that are making you rewrite this? What do the query patterns look like? It sounds like you would be most comfortable with SQL Server plus caching (memcached) to address whatever issues are causing the rewrite. Redis is good, but you won't need its data structure features with the database handling queries, and you don't need persistence if it's only being used as a cache. Without knowing more about the problem, I guess I'd look at MongoDB to handle data sharding, redundant storage, and caching all in one solution. There are no special machines in such a setup, redundancy can be configured, and the load should balance well.
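The "database plus cache" idea can be sketched as a read-through cache: look in the cache first and only go to the database on a miss. This is a minimal illustration, assuming an in-process dictionary in place of memcached and a stand-in function in place of a real SQL Server query; `ReadThroughCache` and `fake_db_lookup` are hypothetical names.

```python
class ReadThroughCache:
    """Serve reads from memory, falling back to a loader on a miss."""

    def __init__(self, loader):
        self._loader = loader   # function that fetches a row from the database
        self._store = {}        # stands in for memcached in this sketch
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._loader(key)
        self._store[key] = value
        return value

    def clear(self):
        # For data that only changes in a nightly load, flushing the cache
        # once after that load is enough to keep it consistent.
        self._store.clear()


def fake_db_lookup(key):
    # Placeholder for a real key lookup against the database.
    return {"id": key, "name": f"row-{key}"}


cache = ReadThroughCache(fake_db_lookup)
first = cache.get(42)    # miss: goes to the "database"
second = cache.get(42)   # hit: served from memory
print(cache.hits, cache.misses)
```

Because the data in the question is read-only between nightly updates, no invalidation logic is needed beyond a single flush after each load, which is what keeps this approach simple to maintain.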
This question is almost an opinion piece. I'd personally prefer an Oracle RAC with TimesTen for caching if performance is of the utmost importance, and if volume of concurrent reads is high during the day.
There's a white paper here...
http://www.oracle.com/us/products/middleware/timesten-in-memory-db-504865.pdf
The specs of the disk subsystem and the organization of indexes and data files across physical disks are probably the most important factors though.

Single logical SQL Server possible from multiple physical servers?

With Microsoft SQL Server 2005, is it possible to combine the processing power of multiple physical servers into a single logical sql server? Is it possible on SQL Server 2008?
I'm thinking, if the database files were located on a SAN and somehow one of the sql servers acted as a kind of master, then processing could be spread out over multiple physical servers, for instance even allowing simultaneous updates where there was no overlap, and in the case of read-only queries on unlocked tables no limit.
We have an application that is limited by the speed of our sql server, and probably stuck with server 2005 for now. Is the only option to get a single more powerful physical server?
Sorry I'm not an expert, I'm not sure if the question is a stupid one.
TIA
Before rushing out and buying new hardware, find out where your bottlenecks really are. Many locking problems can be solved with the appropriate indexes for your workload.
For example, I've seen instances where placing tempDB on SSD solved performance issues and saved the client buying an expensive new server.
Analyse your workload: How Can I Log and Find the Most Expensive Queries?
With SQL Server 2008 you can utilise the Management Data Warehouse (MDW) to capture your workload.
White Paper: SQL Server 2008 Performance and Scale
Also: please be aware that a SAN solution is not necessarily a faster I/O solution than directly attached storage. It depends on the SAN, the number of physical disks in a LUN, LUN subscription and usage, the speed of the HBAs and several other hardware factors...
Optimizing the app may be a big job of going through all the business logic and lines of code. But looking for the most expensive queries can easily locate the bottleneck area. Maybe it only involves a couple of the biggest tables, views or stored procedures. Adding or fine-tuning an index may help right away. If bumping up the RAM is possible, try that option as well. That is cheap and easy to configure.
Good luck.
You might want to google for "sql server scalable shared database". Yes you can store your db files on a SAN and use multiple servers, but you're going to have to meet some pretty rigid criteria for it to be a performance boost or even useful (high ratio of reads to writes, small enough dataset to fit in memory or a fast enough SAN, multiple concurrent accessors, etc, etc).
Clustering is complicated and probably much more expensive in the long run than a bigger server, and far less effective than properly optimized application code. You should definitely make sure your app is well optimized.

How to configure a Firebird Database to run in memory

I'm running software called Fishbowl Inventory, which runs on a Firebird database (Windows Server 2003). At this time the Fishbowl software runs extremely slowly when more than one user accesses it. I'm thinking I may be able to speed up the application by forcing the database to run "in memory". However, I cannot find documentation on how to do this. Any help would be greatly appreciated.
Thank you in advance.
Robert
Firebird does not have memory tables - they may or may not be added in future versions (>3) but certainly not in the upcoming 2.5. There can be any other number of reasons why your software is slow with multiple users; however, Firebird itself has pretty good concurrency, so make sure you find the actual bottleneck first.
+1 to Holger. Find the bottleneck first.
Sinática Monitor may help you.
In-memory tables are nice either for OLAP (when data is not changing) or for temporary internal data storage.
In both cases data loss is not a danger.
It's a pity that FB has no in-memory mode. I'm thinking about using SQLite as a result.
As for caching, I think a simple parallel thread that reads all the blocks of the database file would effectively make it in-memory (in the OS cache), provided the OS has enough memory.
But I also think that the OS has already cached as much of the DB file as it could, and aggressively forcing it to cache more would make overall performance even worse.
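The "parallel thread that reads all the blocks" idea could look roughly like the sketch below. It is only an illustration: the function names are hypothetical, and whether the data actually stays in the OS file cache afterwards depends entirely on the operating system and on how much free memory it has.

```python
import threading

def warm_file_cache(path, chunk_size=1024 * 1024):
    """Read the whole file sequentially so the OS may keep it in its cache.

    Returns the number of bytes read. The data itself is discarded.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

def warm_in_background(path):
    """Run the warm-up on a separate thread so the caller is not blocked."""
    result = {}

    def worker():
        result["bytes_read"] = warm_file_cache(path)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, result
```

A caller would pass the path of the database file and, if needed, `join()` the returned thread. As noted above, this competes with whatever the OS has already chosen to cache, so it is worth measuring before and after rather than assuming it helps.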
I read an article some time ago by someone who set up a RAM drive (like in old DOS) and ran a database on it. The problem is that if anything fails, you lose everything. You would have to take backups very often to ensure a minimum of safety.
Not a good idea at all, I think.

Scaling cheaply: MySQL and MS SQL

How cheap can MySQL be compared to MS SQL when you have tons of data (and joins/search)? Consider a site like Stack Overflow, already full of Q&As, after getting dugg.
My ASP.NET sites are currently on SQL Server Express so I don't have any idea how the cost compares in the long run. Although after some quick research, I'm starting to envy the savings MySQL folks get.
MSSQL Standard Edition (32 or 64 bit) will cost around $5K per CPU socket. 64 bit will allow you to use as much RAM as you need. Enterprise Edition is not really necessary for most deployments, so don't worry about the $20K you would need for that license.
MySQL is only free if you forgo a lot of the useful tools offered with the licenses, and it's probably (at least as of 2008) going to be a little more work to get it to scale like SQL Server.
In the long run I think you will spend much more on hardware and people than you will on just the licenses. If you need to scale, then you will probably have the cash flow to handle $5K here and there.
The performance benefits of MS SQL over MySQL are fairly negligible, especially if you mitigate them with server- and client-side optimizations like server caching (in RAM), client caching (cache and expires headers) and gzip compression.
I know that Stack Overflow has had problems with deadlocks from reads/writes coming at odd intervals, but they're claiming their architecture (MSSQL) is holding up fine. This was before the public beta of course, and according to Jeff's Twitter earlier today:
"the range of top 32 newest/modified questions was about 20 minutes in the private beta; now it's about 2 minutes."
That the site hasn't crashed yet is a testament to the database (as well as good coding and testing).
But why not post some specific numbers about your site?
MySQL is extremely cheap when you have the distro (or staff to build) that carries MySQL Enterprise edition. This is a High Availability version which offers multi-master replication over many servers.
The pros are low (license) costs after the initial purchase of hardware (gigs of RAM needed!) and the time to set up.
The drawbacks are suboptimal performance with many joins, no full-text indexing, no stored procedures (I think), and the need to replicate grants to every master node.
Yet it's easier to run than the replication/proxy balancing setup that's available for PostgreSQL.