Improving our algorithm with SQLite vs storing everything in memory - sql

The problem...I’m trying to figure out a way to make our algorithm faster.
Our algorithm...is written in C and runs on an embedded Linux system with little memory and a lackluster CPU. The entire algorithm makes heavy use of 2d arrays and stores them all in memory. At a high level, the algorithm’s input data, which is a single array of 250 doubles (0.01234, 0.02532….0.1286), is compared to a larger 2d array, which is 20k+ rows x 250 doubles. The input data is compared against the 20k+ rows using a for loop. For each iteration, the algorithm performs computations and stores those results in memory.
I’m not an embedded software developer, I am a cloud developer that uses databases (Postgres, mainly). Our embedded software doesn’t make use of any databases and, since that is what I know, I thought I’d look into SQLite.
My approach...applying what I know about databases, I'd go about it this way: I would have a single table with 6 columns: id, array, computation_1, computation_2, computation_3, and computation_4. I’d store all 20k+ rows in this table with the computation_* columns initially defaulted to null. Then I’d have the algorithm loop through each entry and update the values for each computation_* column accordingly. For graphical purposes, the table would look like this:
Storing arrays in a database doesn't seem like a good fit so I don't immediately understand if there is a benefit to doing this. But, it seems like it would replace the extensive use of malloc()/calloc() we have baked into the algorithm.
My question is...can SQLite help speed up our algorithm if I use it in the way I've described? Since I don’t know how much benefit this would provide, if any, I thought I’d ask the experts here on SO before going down this path. If it will (or won't) provide an improvement, I'd like to know why from a technical standpoint so that I can learn.
Thanks in advance.

As you have described it so far, SQLite won't help you.
A relational database stores data into tables with various indexes and so on. When it receives SQL, it compiles it into a bytecode program, and then it runs that bytecode program in an interpreter against those tables. You can learn more about SQLite's bytecode from https://www.sqlite.org/opcode.html.
This has a lot of overhead compared to native data structures in a low-level language. In my experience the difference is up to several orders of magnitude.
Why, then, would anyone use a database? It is because you'd have to write a lot of potentially buggy code to match it. Doubly so if you've got multiple users at the same time. Furthermore the database query optimizer is able to find efficient plans for computing complex joins that are orders of magnitude more efficient than what most programmers produce on their own.
So a database is not a recipe for doing arbitrary calculations more efficiently. But if you can describe what you are doing in SQL (particularly if it involves joins), the database may be able to find a much more efficient calculation than the one you're currently performing.
Even in that case, squeezing performance out of a low-end embedded system is a case where it may be worth figuring out what a database would do, and then writing code to do that directly.

Related

When to use a query or code [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I am asking for a concrete case for Java + JPA / Hibernate + Mysql, but I think you can apply this question to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their firstname), would you rather do a query returning this exact set of employees, or would you prefer to search for all the employees and then use a programming language to retrieve the ones that you are interested with? why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not you have to copy over more data to the client and also databases are written to efficiently filter data almost certainly being more efficient than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases I have had a database the server has had more CPU power than the clients so unless overloaded will just run the query more quickly for the same amount of code.
Also you have to write less code to do the query on the database using Hibernates query language rather than you having to write code to manipulate the data on the client. Hibernate queries will also make use of any client caching in the configiration without you having to write more code.
There is a general trick often used in programming - paying with memory for operation speedup. If you have lots of employees, and you are going to query a significant portion of them, one by one (say, 75% will be queried at one time or the other), then query everything, cache it (very important!), and complete the lookup in memory. The next time you query, skip the trip to RDBMS, go straight to the cache, and do a fast look-up: a roundtrip to a database is very expensive, compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
It's situational. I think in general, it's better to use sql to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you ahve to load all the entitites, which could take a lot of memory. Additionally, you have to then search all the entities. Why do that when you can leverage your RDBMS and get the exact results you want. In other words, why load a large dataset that could use too much memory, then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your dataset is not too, you can load it into memory and then query it -- this has the advantage that you don't need to go to the RDBMS, which might or might not require going over your network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember, that your approach should scale over time. What may be a small data set could later turn into a huge data set over time. We had an issue with a programmer that coded the application to query the entire table then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years, the performance issues became apparent. Inserting even a date filter to query only the last 365 days, could help your application scale better.
-- if you are looking for an answer specific to hibernate, check #Mark's answer
Given the Employee example -assuming the number of employees can scale over time, it is better to use an approach to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly is less, it is useful to query all of them and have in memory - this way you don't have to reach to the external resource (database) every time, which could be costly.
So the general parameters are these,
scaling of data
criticality to bussiness
volume of data
frequency of usage
to put some sense, when the data is not going to scale frequently and the data is not mission critical and volume of data is manageable in memory on the application server and is used frequently - Bring it all and filter them programatically, if needed.
if otherwise get only specific data.
What is better: to store a lot of food at home or buy it little by little? When you travel a lot? Just when hosting a party? It depends, isn't? Similarly, the best approach is a matter of performance optimization. That involves a lot of variables. The art is to both prevent painting yourself into a corner when designing your solution and optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One think could be more or less universally helpful: encapsulate your data access well.

How to store 15 x 100 million 32-byte records for sequential access?

Me got 15 x 100 million 32-byte records. Only sequential access and appends needed. The key is a Long. The value is a tuple - (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 seperate databases (sql/nosql) or files for each of those 100 million records. I only have a i7 core and 8 GB RAM and 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle upto 15-18 million records beyond which appends take forever.
I am fed up so much that I am thinking of falling back on awk + CSV which I remember used to work for this type of data.
If you scenario means always going through all records in sequence then it may be an overkill to use a database. If you start to need random lookups, replacing/deleting records or checking if a new record is not a duplicate of an older one, a database engine would make more sense.
For the sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer - I would probably go for an own binary format and access it with help of memory-mapped files to improve the sequential read/append speed. No caching, just a sliding window to read the data. I think that it would perform better and even on usual hardware than any DB would; I did such data analysis once. It would also be faster than awking CSV files; however, I am not sure how much and if it satisfied the effort to develop the binary storage, first of all.
As soon as the database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.). Databases usually need a reasonable hardware power to perform better; maybe you could check out how those two would do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
By your requirements I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well you might be able to compress it. For example, if your key is an increasing log value you don't need to store it entirely. Instead, store the difference to the previous value (which is almost always going to be one). Then, use a standard compression algorithm/library to save on data size big time.
For sequential reads and writes, leveldb will handle your dataset pretty well.
I think that's about 48 gigs of data in one table.
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (Short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning.
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.

What database is most efficient for simple group by queries on tons of data?

For each account, I have millions of data items (rows in analytics logs), each with 20-50 numeric properties (they can be null too). I need to show them stats which mostly involve queries like SELECT SUM(f1), f2, f3 WHERE f4>f5 GROUP BY f2, f3. The aggregation functions are sometimes more complex than SUM(), and GROUP BY sometimes involves simple functions like ROUND(). The problem is that such queries are built in the user interface and can be run on any combination of those properties (though there are some popular combinations of course).
Once in the database, the data would most likely not be modified, only read. It should be possible to easily add/remove properties – not necessarily realtime in database terms, but it should not require complete table blocks like in MySQL.
What SQL or NoSQL databases would be best to handle these kinds of queries? I was thinking of PostgreSQL or MongoDB, even though in the latter I will most likely have to use MapReduce rather than its Group feature because of its limitations.
Any other advice on performance of such queries? Does this sound possible to do at all, or do I absolutely have to ask users to pre-define which exact queries they want to run?
Any ideas would be much appreciated.
What query performance are you looking for? How often will it be queried?
If you're OK with query performance in the low minutes and have a similarly low query rate, then you can use a relational table with a main table for the data items, and a join table for the properties. Be sure to put a combined index on the second table on the combination (property_type, data_item_id, property_value) to guarantee good query performance. You don't actually need property_value in there, but if you have it then queries can pull their data from the index in a highly efficient manner, which will make joins much, much easier. You can do this with any relational database. I happen to like PostgreSQL, but MySQL can also work. (But less efficiently on complex queries.)
If you follow this strategy then each property you want will require you to add yet another join. But the joins will be fairly efficient.
You can build this kind of application in an RDBMS or in a NoSQL database (Berkeley DB for example, has both a key-value pair API and a SQL API). The key-value pair API is a nice option, since it supports some pretty low level optimizations that may help when looking at how to tune the performance to meet your application needs.
Another option is to look into a columnar data store, but even that kind of product is going to have to retrieve data from multiple columns (which is slow in these kinds of databases) in order to resolve the kinds of queries that you list.
Ultimately the issue here boils down to disk I/O VS cache and data organization. The more data that you can fit into memory, the less I/O you have to perform and I/O is going to be the performance killer. The more compact you can make the data, the more rows will fit in the memory that you have. I would suggest looking into Berkeley DB, especially the key-value pair API. You can then choose to create one or more tables with the properties organized in an manner that optimizes the most frequent kinds of access. Additionally, if you're using the key-value pair API, take a look at the Bulk Get functions -- this allows you to fetch and process whole groups of records at a time.
You may also want to create and maintain some "well known" statistical results (in memory and/or persisted on disk) that allow you to take "shortcuts" when the user is asking for a value that has already been computed.
Good luck in your research.
What you've described - essentially ad hoc aggregate queries on data that does not need to be realtime - is what OLAP solutions are very good at. In addition to other suggestions you've seen, you should look into whether an OLAP solution makes sense for you.

gDatabase Optimization: Need a really big database to test some of the features of sql server

I have done database optimization for dbs upto 3GB size. Need a really large database to test optimization.
Simply generating a lot of data and throwing it into a table proves nothing about the DBMS, the database itself, the queries being issued against it, or the applications interacting with them, all of which factor into the performance of a database-dependent system.
The phrase "I have done database optimization for [databases] up to 3 GB" is highly suspect. What databases? On what platform? Using what hardware? For what purposes? For what scale? What was the model? What were you optimizing? What was your budget?
These same questions apply to any database, regardless of size. I can tell you first-hand that "optimizing" a 250 GB database is not the same as optimizing a 25 GB database, which is certainly not the same as optimizing a 3 GB database. But that is not merely on account of the database size, it is because databases that contain 250 GB of data invariably deal with requirements that are vastly different from those addressed by a 3 GB database.
There is no magic size barrier at which you need to change your optimization strategy; every optimization requires in-depth knowledge of the specific data model and its usage requirements. Maybe you just need to add a few indexes. Maybe you need to remove a few indexes. Maybe you need to normalize, denormalize, rewrite a couple of bad queries, change locking semantics, create a data warehouse, implement caching at the application layer, or look into the various kinds of vertical scaling available for your particular database platform.
I submit that you are wasting your time attempting to create a "really big" database for the purposes of trying to "optimize" it with no specific requirements in mind. Various data-generation tools are available for when you need to generate data fitting specific patterns for testing against a specific set of scenarios, but until you have that information on hand, you won't accomplish very much with a database full of unorganized test data.
The best way to do this is to create your schema and write a script to populate it with lots of random(ish) dummy data. Random, meaning that your text-fields don't necessarily have to make sense. 'ish', meaning that the data distribution and patterns should generally reflect your real-world DB usage.
Edit: a quick Google search reveals a number of commercial tools that will do this for you if you don't want to write your own populate scripts: DB Data Generator, DTM Data Generator. Disclaimer: I've never used either of these and can't really speak to their quality or usefulness.
Here is a free procedure I wrote to generate Random person names. Quick and dirty, but it works and might help.
http://www.joebooth-consulting.com/products/genRandNames.sql
I use Red-Gate's Data Generator regularly to test out problems as well as loads on real systems and it works quite well. That said, I would agree with Aaronnaught's sentiment in that the overall size of the database isn't nearly as important as the usage patterns and the business model. For example, generating 10 GB of data on a table that will eventually get no traffic will not provide any insight into optimization. The goal is to replicate the expected transaction and storage loads you anticipate to occur in order to identify bottlenecks before they occur.

Is there any performance reason to use powers of two for field sizes in my database?

A long time ago when I was a young lat I used to do a lot of assembler and optimization programming. Today I mainly find myself building web apps (it's alright too...). However, whenever I create fields for database tables I find myself using values like 16, 32 & 128 for text fields and I try to combine boolean values into SET data fields.
Is giving a text field a length of 9 going to make my database slower in the long run and do I actually help it by specifying a field length that is more easy memory aligned?
Database optimization is quite unlike machine code optimization. With databases, most of the time you want to reduce disk I/O, and wastefully trying to align fields will only make less records fit in a disk block/page. Also, if any alignment is beneficial, the database engine will do it for you automatically.
What will matter most is indexes and how well you use them. Trying tricks to pack more information in less space can easily end up making it harder to have good indexes. (Do not overdo it, however; not only do indexes slow down INSERTs and UPDATEs to indexed columns, they also mean more work for the planner, which has to consider all the possibilities.)
Most databases have an EXPLAIN command; try using it on your selects (in particular, the ones with more than one table) to get a feel for how the database engine will do its work.
The size of the field itself may be important, but usually for text if you use nvarchar or varchar it is not a big deal. Since the DB will take what you use. the follwoing will have a greater impact on your SQL speed:
don't have more columns then you need. bigger table in terms of columns means the database will be less likely to find the results for your queries on the same disk page. Notice that this is true even if you only ask for 2 out of 10 columns in your select... (there is one way to battle this, with clustered indexes but that can only address one limited scenario).
you should give more details on the type of design issues/alternatives you are considering to get additional tips.
Something that is implied above, but which can stand being made explicit. You don't have any way of knowing what the computer is actually doing. It's not like the old days when you could look at the assembler and know pretty well what steps the program is going to take. A value that "looks" like it's in a CPU register may actually have to be fetched from a cache on the chip or even from the disk. If you are not writing assembler but using an optimizing compiler, or even more surely, bytecode on a runtime engine (Java, C#), abandon hope. Or abandon worry, which is the better idea.
It's probably going to take thousands, maybe tens of thousands of machine cycles to write or retrieve that DB value. Don't worry about the 10 additional cycles due to full word alignments.