How to store 15 x 100 million 32-byte records for sequential access? - sql

I have 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple - (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 separate databases (sql/nosql) or files, one for each set of 100 million records. I only have an i7 CPU, 8 GB RAM and a 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle up to 15-18 million records, beyond which appends take forever.
I am so fed up that I am thinking of falling back on awk + CSV, which I remember used to work for this type of data.

If your scenario means always going through all records in sequence, then a database may be overkill. If you start to need random lookups, replacing/deleting records or checking whether a new record duplicates an older one, a database engine would make more sense.
For sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer - I would probably go for a custom binary format and access it through memory-mapped files to improve the sequential read/append speed. No caching, just a sliding window to read the data. I think it would perform better than any DB would, even on ordinary hardware; I did such data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, or whether the gain would justify the effort of developing the binary storage in the first place.
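A minimal sketch of such a binary format in Python, under assumptions that are not in the question: the Date is stored as a 64-bit epoch-seconds integer, the layout is little-endian, and the file name is made up. It only illustrates fixed-size records read through a memory map; it is not a tuned implementation.

```python
import mmap
import os
import struct
import tempfile

# Assumed 32-byte layout: int64 key, int64 date (epoch seconds), two float64 values.
RECORD = struct.Struct("<qqdd")
assert RECORD.size == 32


def append_records(path, records):
    """Append (key, epoch_seconds, a, b) tuples to the end of the file."""
    with open(path, "ab") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def scan_records(path):
    """Yield every record in file order through a read-only memory map."""
    if os.path.getsize(path) == 0:
        return  # an empty file cannot be memory-mapped
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The OS pages data in and out as the scan advances (the "sliding window").
        for offset in range(0, len(mm) - RECORD.size + 1, RECORD.size):
            yield RECORD.unpack_from(mm, offset)


def demo():
    # Hypothetical file name; one file per 100-million-record series.
    path = os.path.join(tempfile.mkdtemp(), "series_00.bin")
    append_records(path, [(1, 1300000000, 1.5, 2.5), (2, 1300000060, 1.6, 2.4)])
    append_records(path, [(3, 1300000120, 1.7, 2.3)])
    return list(scan_records(path))
```

Because every record has the same size, record number n starts at byte offset n * 32, so an occasional positional lookup stays cheap even without an index.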
Once a database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check how those two would do with your data.
--- Ferda

Ferdinand Prantl's answer is very good. Two points:
Given your requirements, I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well you might be able to compress it. For example, if your key is an increasing long value you don't need to store it entirely. Instead, store the difference to the previous value (which is almost always going to be one). Then, use a standard compression algorithm/library to save on data size big time.
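A small sketch of that idea in Python, assuming the keys are 64-bit integers in increasing order and using zlib as the "standard compression library" (any general-purpose compressor would do). The function names are made up for illustration.

```python
import struct
import zlib


def encode_keys(keys):
    """Delta-encode increasing integer keys, then compress the deltas."""
    if not keys:
        return zlib.compress(b"")
    # First entry is the starting key; the rest are differences (mostly 1).
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return zlib.compress(raw)


def decode_keys(blob):
    """Invert encode_keys: decompress, then rebuild keys with a running sum."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    keys, total = [], 0
    for delta in deltas:
        total += delta
        keys.append(total)
    return keys
```

The long runs of identical deltas are what make the compressor effective; the raw keys themselves would compress far less well.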

For sequential reads and writes, leveldb will handle your dataset pretty well.

I think that's about 48 gigs of data in one table.
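The 48 gig figure follows directly from the numbers in the question. A quick check in Python; this counts raw record bytes only, and per-row overhead and indexes in a real DBMS would add to it:

```python
RECORD_BYTES = 32
RECORDS_PER_SERIES = 100_000_000
SERIES = 15

total_bytes = SERIES * RECORDS_PER_SERIES * RECORD_BYTES
total_gb = total_bytes / 1e9       # decimal gigabytes
total_gib = total_bytes / 2**30    # binary gibibytes

print(f"{total_bytes:,} bytes = {total_gb:.0f} GB = {total_gib:.1f} GiB")
```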
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (Short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning.
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.

Related

What is the expected performance gap switching from SQL to TSDB for handling time series?

We are using a SQL database for single-node storage of roughly 1 hour of high-frequency metrics (several thousand inserts a second). We quickly ran into I/O issues that buffering alone could not solve, and we are willing to put time into solving the performance issue.
I suggested switching to a specialised database for handling time series, but my colleague remained skeptical. His argument is that an "out of the box" gain is not guaranteed: he knows SQL well and has already spent time optimizing the storage, whereas we have no TSDB experience with which to tune one properly.
My intuition is that a TSDB would be much more efficient even with an out-of-the-box configuration, but I don't have any data to back this up, and internet benchmarks such as InfluxDB's are nowhere near trustworthy. We should run our own, except we can't afford to lose time on a dead end or a mediocre improvement.
Very roughly, what would the performance gap between relational storage and a TSDB be in my use case when it comes to single-node throughput?
This question may be bordering on a software recommendation. I just want to point one important thing out: You have an existing code base so switching to another data store is expensive in terms of development costs and time. If you have someone experienced with the current technology, you are probably better off with a good-faith effort to make that technology work.
Whether you switch or not depends on the actual requirements of your application. For instance, if you don't need the data immediately, perhaps writing batches to a file is the most efficient mechanism.
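As a rough illustration of the batch-to-file idea, here is a Python sketch. The class name, the CSV line format and the batch size are all assumptions made for the example, and it deliberately ignores durability (buffered rows are lost on a crash).

```python
import os
import tempfile


class BatchFileWriter:
    """Buffer metric rows in memory and append them to a file in batches."""

    def __init__(self, path, batch_size=1000):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []

    def add(self, timestamp, name, value):
        self.buffer.append(f"{timestamp},{name},{value}\n")
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all buffered rows with a single append, then clear the buffer."""
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.writelines(self.buffer)
        self.buffer.clear()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "metrics.csv")  # hypothetical file
    writer = BatchFileWriter(path, batch_size=1000)
    for i in range(2500):
        writer.add(1_600_000_000 + i, "cpu_load", 0.5)
    writer.flush()  # write the final partial batch
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)
```

Each flush turns many small writes into one sequential append, which is the access pattern that disks handle best.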
Your infrastructure has ample opportunity for in-place growth -- more memory, more processors, solid-state disk (for example). These might meet your performance needs with a minimal amount of effort.
If you cannot make the solution work (and 10k inserts per second should be quite feasible), then there are numerous solutions. Some NOSQL databases relax some of the strict ACID requirements of traditional RDBMSs, providing faster throughput.

Why Spark SQL considers the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
If a data source supports indexing, Spark can use it indirectly through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems.
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.
Of course this raises a question: if you require efficient random access, why not use a system which is designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriously. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
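The arithmetic behind these figures can be checked with a few lines of Python. The two constants are the answer's own round numbers (350 head movements a second, 100 MB per second); the simple "seek time plus transfer time" model is an assumption made for the sketch.

```python
SEEKS_PER_SECOND = 350
SEQ_BYTES_PER_SECOND = 100 * 10**6

# Bytes the drive could have streamed in the time one head movement takes.
break_even = SEQ_BYTES_PER_SECOND / SEEKS_PER_SECOND   # about 286 KB

# Sequential data readable in the time of the 6 head movements above.
six_seek_equivalent = 6 * break_even                   # about 1.7 MB


def effective_throughput(bytes_per_seek):
    """Bytes/second when every read of `bytes_per_seek` costs one head movement."""
    seek_time = 1 / SEEKS_PER_SECOND
    transfer_time = bytes_per_seek / SEQ_BYTES_PER_SECOND
    return bytes_per_seek / (seek_time + transfer_time)
```

With the unrounded break-even value the six-seek equivalent comes out near 1.7 MB; the 1.8 MB quoted above uses the rounded 300 KB figure. Under this model, reading a single 32-byte record per seek yields only about 11 KB per second.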
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
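A toy sketch of that kind of partitioning in Python. Everything here is illustrative: the partition count, the record shape (dicts with a "customer_id" field) and the in-memory lists standing in for on-disk partitions are assumptions, not anything Spark-specific.

```python
import zlib
from collections import defaultdict

N_PARTITIONS = 8  # hypothetical partition count


def partition_for(customer_id, n_partitions=N_PARTITIONS):
    """Map a customer id to a partition; crc32 is stable across runs."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % n_partitions


def partition_records(records, n_partitions=N_PARTITIONS):
    """Group records so each customer's rows all land in one partition."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[partition_for(rec["customer_id"], n_partitions)].append(rec)
    return dict(partitions)


def customer_history(partitions, customer_id, n_partitions=N_PARTITIONS):
    """Read one partition sequentially and keep only this customer's rows."""
    chunk = partitions.get(partition_for(customer_id, n_partitions), [])
    return [rec for rec in chunk if rec["customer_id"] == customer_id]
```

A lookup touches exactly one partition and scans it linearly, which is the "one disk head move" trade the answer is describing.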
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records, each stored in a way that optimizes one of those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact, don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

Scalability of Using MySQL as a Key/Value Database

I am interested to know the performance impacts of using MySQL as a key-value database vs. say Redis/MongoDB/CouchDB. I have used both Redis and CouchDB in the past so I'm very familiar with their use cases, and know that it's better to store key/value pairs in say NoSQL vs. MySQL.
But here's the situation:
the bulk of our applications already have lots of MySQL tables
We host everything on Heroku (which only has MongoDB and MySQL, and is basically 1-db-type per app)
we don't want to be using multiple different databases in this case.
So basically, I'm looking for some info on the scalability of having a key/value table in MySQL. Maybe at three different arbitrary tiers:
1000 writes per day
1000 writes per hour
1000 writes per second
1000 reads per hour
1000 reads per second
A practical example is in building something like MixPanel's Real-time Web Analytics Tracker, which would require writing very often depending on traffic.
Wordpress and other popular software use this all the time: Post has "Meta" model which is just key/value, so you can add arbitrary properties to an object which can be searched over.
Another option is to store a serializable hash in a blob but that seems worse.
What is your take?
I'd say that you'll have to run your own benchmark because it is only you that knows the following important aspects:
the size of the data to be stored in this KV table
the level of parallelism you want to achieve
the number of existing queries reaching your MySQL instance
I'd also say that depending on the durability requirements for this data, you'll also want to test multiple engines: InnoDB, MyISAM.
While I do expect some NoSQL solutions to be faster, based on your constraints you may find out that MySQL will perform well enough for your requirements.
SQL databases are more and more used as a persistence layer, with computations and delivery cached in Key-Value repositories.
With this in mind, those guys have done quite a test here:
InnoDB inserts 43,000 records per second AT ITS PEAK*;
TokuDB inserts 34,000 records per second AT ITS PEAK*;
This KV inserts 100 million records per second (2,000+ times more).
To answer your question, a Key-Value repository is more than likely to outdo MySQL by several orders of magnitude:
Processing 100,000,000 items:
kv_add()....time:....978.32 ms
kv_get().....time:....297.07 ms
kv_free()....time:........0.00 ms
OK, your test was 1,000 ops per second, but it can't hurt to be able to do 1,000 times more!
See this for further details (they also compare it with Tokyo Cabinet).
There is no doubt that using a NOSQL solution is going to be faster, since it is simpler.
NOSQL and Relational do not compete with each other, they are different tools that can solve different problems.
That being said for 1000 writes/day or per hour, MySQL will have no problem.
For 1000 per second you will need some fancy hardware to get there. For the NOSQL solution you will probably still need some distributed file system.
It also depends on what you are storing.
Check out the series of blog posts here where the author runs tests comparing MongoDB and MySQL performance, and fights through the MySQL performance tuning mess. MongoDB was doing ~100K row reads per second, MySQL in c/s mode was doing 43K max, but with the embedded library he managed to get it up to 172K row reads per second.
It sounds a little complicated to get that high on a single node, so ymmv.
The writes/second question is a little harder, but this still might give you some ideas on configs to try.
You should first implement it in the simplest way then compare that. Always test things. This means:
Create a schema that's representative of your use case.
Create queries representative of your use case.
Create significant amounts of dummy data representative of your use case.
In a variety of loops, including both random access and sequential, benchmark it.
Ensure you use concurrency (run many processes randomly hammering the server with all kinds of queries representative of your use cases).
Once you have that, measure, test. There are different ways you can go about it. Some tests can be simple but might be less realistic. Measure throughput and latency.
Then try to optimise it.
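A bare-bones sketch of such a benchmark in Python. It uses the standard-library sqlite3 module with an in-memory database purely as a stand-in so the example is self-contained; for a real test you would point the same loop at your MySQL instance through a MySQL driver. The table name, value size and counts are assumptions, and it is single-threaded, so it does not cover the concurrency step.

```python
import random
import sqlite3
import statistics
import time


def bench_kv(n_writes=5000, n_reads=5000, seed=0):
    """Time bulk inserts and random point reads against a key/value table."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")  # stand-in for the real server
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")
    keys = [f"key-{i}" for i in range(n_writes)]

    start = time.perf_counter()
    with conn:  # one transaction around all inserts
        conn.executemany(
            "INSERT INTO kv (k, v) VALUES (?, ?)", ((k, "x" * 100) for k in keys)
        )
    write_seconds = time.perf_counter() - start

    latencies = []
    for _ in range(n_reads):
        key = rng.choice(keys)  # random access pattern
        t0 = time.perf_counter()
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        latencies.append(time.perf_counter() - t0)
        assert row is not None
    conn.close()

    return {
        "writes_per_second": n_writes / write_seconds,
        "reads_per_second": n_reads / sum(latencies),
        "read_latency_p50_ms": statistics.median(latencies) * 1000,
        "read_latency_max_ms": max(latencies) * 1000,
    }
```

Reporting the median and the maximum separately matters: an average throughput number can hide the occasional very slow request.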
MySQL has one particular limitation for KV: the standard persistent engines use indexes optimised for range lookups rather than for key/value access, which might introduce some overhead. It is also difficult to make structures such as hash indexes work with persistent storage because of rehashing. Memory tables do support a hash index.
Many people associate certain things with being slow such as SQL, RELATIONAL, JOINS, ACID, etc.
When using an ACID capable relational database, you don't have to necessarily use ACID or relations.
While joins have a bad reputation for being slow, this is usually down to misconceptions about joins. Often people simply write bad queries. This is made more difficult because SQL is declarative: the planner can get things wrong, especially with JOINs, where there are often multiple ways to perform the join. What people are actually getting out of NoSQL in this case is imperative control. "NoDeclarative" would be a more accurate name, as that is the problem a lot of people are having with SQL. Quite often people simply lack indexes. That's not an argument in favour of joins but rather an illustration of where people can get it wrong on speed.
Traditional databases can be extremely fast if you do certain special things for that such as ignoring data integrity or handling it elsewhere. You don't have to wait for the harddrive to flush writes, you don't have to enforce relations, you don't have to enforce unique constraints, you don't have to use transactions but if you do replace safety with speed then you need to know what you're doing.
NoSQL solutions by comparison first and foremost tend to be designed to support various modes of scaling out of the box. The performance of an individual node might not be quite what you expect. NoSQL solutions also struggle for general use with many having quite unusual performance characteristics or limited feature sets.

Database Optimization: Need a really big database to test some of the features of sql server

I have done database optimization for dbs up to 3GB in size. I need a really large database to test optimization.
Simply generating a lot of data and throwing it into a table proves nothing about the DBMS, the database itself, the queries being issued against it, or the applications interacting with them, all of which factor into the performance of a database-dependent system.
The phrase "I have done database optimization for [databases] up to 3 GB" is highly suspect. What databases? On what platform? Using what hardware? For what purposes? For what scale? What was the model? What were you optimizing? What was your budget?
These same questions apply to any database, regardless of size. I can tell you first-hand that "optimizing" a 250 GB database is not the same as optimizing a 25 GB database, which is certainly not the same as optimizing a 3 GB database. But that is not merely on account of the database size, it is because databases that contain 250 GB of data invariably deal with requirements that are vastly different from those addressed by a 3 GB database.
There is no magic size barrier at which you need to change your optimization strategy; every optimization requires in-depth knowledge of the specific data model and its usage requirements. Maybe you just need to add a few indexes. Maybe you need to remove a few indexes. Maybe you need to normalize, denormalize, rewrite a couple of bad queries, change locking semantics, create a data warehouse, implement caching at the application layer, or look into the various kinds of vertical scaling available for your particular database platform.
I submit that you are wasting your time attempting to create a "really big" database for the purposes of trying to "optimize" it with no specific requirements in mind. Various data-generation tools are available for when you need to generate data fitting specific patterns for testing against a specific set of scenarios, but until you have that information on hand, you won't accomplish very much with a database full of unorganized test data.
The best way to do this is to create your schema and write a script to populate it with lots of random(ish) dummy data. Random, meaning that your text-fields don't necessarily have to make sense. 'ish', meaning that the data distribution and patterns should generally reflect your real-world DB usage.
Edit: a quick Google search reveals a number of commercial tools that will do this for you if you don't want to write your own populate scripts: DB Data Generator, DTM Data Generator. Disclaimer: I've never used either of these and can't really speak to their quality or usefulness.
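For those who do want to write their own, a populate script of the kind described above might start from something like this Python sketch. The (order_id, customer_id, note, amount) row shape and the 1/rank skew are invented for the example; substitute whatever matches your real schema and distribution.

```python
import random
import string


def make_rows(n, n_customers=1000, seed=42):
    """Yield n fake (order_id, customer_id, note, amount) rows.

    'Random': the note text is meaningless. 'ish': customer ids are skewed
    (weight proportional to 1/rank) so a few customers own most of the rows.
    """
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    weights = [1 / rank for rank in range(1, n_customers + 1)]
    customer_ids = rng.choices(range(1, n_customers + 1), weights=weights, k=n)
    alphabet = string.ascii_lowercase + " "
    for order_id, customer_id in enumerate(customer_ids, start=1):
        note = "".join(rng.choices(alphabet, k=rng.randint(5, 40)))
        amount = round(rng.lognormvariate(3, 1), 2)
        yield (order_id, customer_id, note, amount)
```

The rows can be fed straight into a driver's bulk-insert call or written out as CSV for the database's bulk loader.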
Here is a free procedure I wrote to generate Random person names. Quick and dirty, but it works and might help.
http://www.joebooth-consulting.com/products/genRandNames.sql
I use Red-Gate's Data Generator regularly to test out problems as well as loads on real systems and it works quite well. That said, I would agree with Aaronnaught's sentiment in that the overall size of the database isn't nearly as important as the usage patterns and the business model. For example, generating 10 GB of data on a table that will eventually get no traffic will not provide any insight into optimization. The goal is to replicate the expected transaction and storage loads you anticipate to occur in order to identify bottlenecks before they occur.

Is there any performance reason to use powers of two for field sizes in my database?

A long time ago when I was a young lad I used to do a lot of assembler and optimization programming. Today I mainly find myself building web apps (it's alright too...). However, whenever I create fields for database tables I find myself using values like 16, 32 & 128 for text fields and I try to combine boolean values into SET data fields.
Is giving a text field a length of 9 going to make my database slower in the long run and do I actually help it by specifying a field length that is more easy memory aligned?
Database optimization is quite unlike machine code optimization. With databases, most of the time you want to reduce disk I/O, and wastefully trying to align fields will only make fewer records fit in a disk block/page. Also, if any alignment is beneficial, the database engine will do it for you automatically.
What will matter most is indexes and how well you use them. Trying tricks to pack more information in less space can easily end up making it harder to have good indexes. (Do not overdo it, however; not only do indexes slow down INSERTs and UPDATEs to indexed columns, they also mean more work for the planner, which has to consider all the possibilities.)
Most databases have an EXPLAIN command; try using it on your selects (in particular, the ones with more than one table) to get a feel for how the database engine will do its work.
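To get a feel for what this looks like, here is a sketch using Python's built-in sqlite3 module, chosen only because it needs no server. SQLite's variant of the command is EXPLAIN QUERY PLAN; the syntax and output format differ in MySQL, PostgreSQL and others. The table and index names are made up, and a single-table query is used to keep the example short.

```python
import sqlite3


def explain(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
query = "SELECT total FROM orders WHERE customer_id = 7"

plan_before = explain(conn, query)   # no usable index: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = explain(conn, query)    # the planner can now use the new index

print(plan_before)
print(plan_after)
```

Comparing the plan before and after a schema change is usually more informative than guessing about field sizes.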
The size of the field itself may be important, but for text it is usually not a big deal if you use nvarchar or varchar, since the DB will only take the space you actually use. The following will have a greater impact on your SQL speed:
Don't have more columns than you need. A bigger table in terms of columns means the database is less likely to find the results for your queries on the same disk page. Notice that this is true even if you only ask for 2 out of 10 columns in your select... (there is one way to battle this, with clustered indexes, but that only addresses one limited scenario).
You should give more details on the type of design issues/alternatives you are considering to get additional tips.
Something that is implied above, but which can stand being made explicit. You don't have any way of knowing what the computer is actually doing. It's not like the old days when you could look at the assembler and know pretty well what steps the program is going to take. A value that "looks" like it's in a CPU register may actually have to be fetched from a cache on the chip or even from the disk. If you are not writing assembler but using an optimizing compiler, or even more surely, bytecode on a runtime engine (Java, C#), abandon hope. Or abandon worry, which is the better idea.
It's probably going to take thousands, maybe tens of thousands of machine cycles to write or retrieve that DB value. Don't worry about the 10 additional cycles due to full word alignments.