Sorting in application vs sorting in DB - sql

When querying for the top N results, I can ask the DB to sort the results OR I can sort them myself.
I have read a lot about the performance and memory advantages that the DB has over in-app sorting. However, assuming I write optimal sorting code, isn't the performance equal in both options?
Both are using the same CPU, both can allocate threads and both can allocate more space in memory to perform the sort.
All the answers I found on the subject are more or less the same, saying
"just let the DB do it, it will do it better than you", or
"the rule of thumb is do anything in the DB unless a specific need arises such as complex sorts..."
So, why choose DB sorting over in-app sorting (besides saving network bandwidth by not requesting millions of table rows to sort)?

With an app sort you need to transfer all the data; with a database sort you only need to transfer N rows.
Databases already implement highly efficient sort algorithms.
If a suitable index already exists, the DBMS can return the top N without sorting the data.
Edit:
If your dataset is very small, it can be stored in memory client side and then ordered by the app. This can be a good solution if you need to reorder the data without refreshing it from your DB.
In other cases, use the DB sort.
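To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the database. The `scores` table and its columns are hypothetical, invented for illustration. Both approaches produce the same top N, but the in-app version has to fetch every row first.

```python
import heapq
import random
import sqlite3

# Hypothetical table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, score INTEGER)")
random.seed(0)
conn.executemany(
    "INSERT INTO scores (score) VALUES (?)",
    [(random.randrange(1_000_000),) for _ in range(10_000)],
)

N = 5

# DB sort: the database orders the rows and only N of them are returned.
db_top = conn.execute(
    "SELECT id, score FROM scores ORDER BY score DESC, id ASC LIMIT ?",
    (N,),
).fetchall()

# App sort: every row is transferred to the client before the top N is picked.
all_rows = conn.execute("SELECT id, score FROM scores").fetchall()
app_top = heapq.nsmallest(N, all_rows, key=lambda row: (-row[1], row[0]))

print(len(all_rows), "rows fetched for the app sort,", len(db_top), "for the DB sort")
```

The results match; the difference is how many rows had to leave the database to get them.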

Related

Neo4j import tool and querying

I have some very basic conceptual questions related to functioning of neo4j.
1. The first question is about the import tool. I am importing around 150 million nodes and a similar number of relationships. When I do an upload, the output on the command terminal prints the number of nodes uploaded and then prepare node index. What is this node index? Where is it actually used? I see that the created index information is present in the graph_db=>schema=>label. What is this index and where is it actually used? Running a cypher query with does not show that index is being used anywhere.
2. The second question is about the heap memory size of neo4j. What I understood is that while running cypher queries, results are stored in the heap. Once the heap is full, a garbage collection happens. What if I run a cypher statement that produces results that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size? Would neo4j switch to disk, or would it produce an error?
Thanks for clearing these questions in advance.
Best,
What is this node index? Where is it actually used?
The index is just that - a database index. A database index is what's used to help you look up nodes really quickly. Say you put 1 million :Person nodes into a database, then 1 million :Location nodes into the same database. When you MATCH (p:Person { last_name: "Smith" }) you want the database to search through only the :Person nodes, and not all 2 million. The index is what makes that happen.
Read up on indexes in neo4j
What is this index and where is it actually used?
The index by label is basically a searchable collection of nodes categorized by label (in this case :Person and :Location) that the database engine uses to speed lookups. This is a greatly simplified answer, but basically accurate. This is a very good thing, you definitely want it. Performance of getting data out of the database would be quite bad without it.
Indexes are all about trading computation time and storage for better performance. Basically, the database pre-orders all of the nodes in a certain way (which costs you up-front computation time, and also a small amount of storage on disk) in exchange for having a nice data structure in place that makes queries very fast. Generally in database terms, you'll find that if you do a lot of read-only queries (fetching data) you really, really want indexes. If your workload is mostly just adding stuff (not lookups), they're not as good.
Running a cypher query with does not show that index is being used anywhere.
Yes, it's invisible, but when you search for something in Cypher using a label, neo4j is exploiting that index. It may be invisible but it's being used to optimize your query.
What I understood that while running cypher queries, results are stored in heap
Well that's only partially true; in some senses everything in java is stored in the heap. But results stream back from the database. If you issue a query that results in 1 million results, it is not the case that all 1 million go into the heap immediately. They get pulled in blocks at a time (I don't know how many at a time, the db engine handles that). At any given time, what's in heap is the set you need right now, not everything.
What if I run a cypher statement that produces results that can not be kept in heap i.e. the result of query is bigger than the heap size
See earlier answer. You can do this without problem, because the entire set generally isn't in the heap. In database terms, we'd say you get a "cursor" back, that lets you iterate through results. You do not get a huge result set back. The gotcha here is that if you have 1million results, you can iterate through them once. Need to run through them a second time? Avoid doing that, or issue the query again.
Would neo4j switch to disk?
No - if/when any swapping to disk happened, in any case that would be an operating system decision dealing with your main memory. It's possible it would happen, but that wouldn't have much to do with neo4j.
or would it produce an error
Nope, neo4j doesn't care how big your result set is. With the "cursor" concept, you can get 1 result or 10 billion results, both will work.
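The "cursor" idea described above is not specific to neo4j. The following sketch uses Python's sqlite3 module purely as a stand-in (it is not the neo4j driver, and the `items` table is hypothetical). It pulls results in blocks and shows that a consumed cursor yields nothing on a second pass.

```python
import sqlite3

# Hypothetical table; sqlite3 stands in for any database that hands back a cursor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (n INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", ((i,) for i in range(1000)))

cursor = conn.execute("SELECT n FROM items ORDER BY n")

batch_sizes = []
total = 0
while True:
    # Only one block of rows is held in client memory at a time.
    batch = cursor.fetchmany(100)
    if not batch:
        break
    batch_sizes.append(len(batch))
    total += sum(row[0] for row in batch)

# The cursor is now exhausted: a second pass returns no rows.
second_pass = cursor.fetchall()

print(batch_sizes, total, second_pass)
```

To walk the results again, either keep what you need during the first pass or issue the query a second time.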

Rails / SQL ... better to access database once and store data in array?

The heart of my app is a multi-conditional comparison using an input array and parameters stored in a few database tables.
I'm trying to make the process most efficient ... and I think this could lead to some good conversation about using memory vs. accessing a database.
Here's one example:
I have a Merchant, MerchantUserRelation, and User table.
Most of the data I need to store in a temporary array is from MerchantUserRelation, BUT at one point, I need to check if it's the user's birthday (user.birthdate.today?).
To me, it seems there are two options:
Create a temporary array with only data from MerchantUserRelation and then access the database separately for the user.birthdate.today? method (2 hits to the database), --or--
Create a slightly larger temporary array with both the data from MerchantUserRelation AND User (and thus hit the database only once)
For this example I recognize the differences are EXTREMELY small (read: negligible), but what if the array sizes and # of database accesses required were much larger?
Thank you for any references and/or insights!
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil - Donald Knuth
As you said, the difference is negligible. And there is no absolute answer to memory vs. DB hits. So why worry about it now? If your app grows and really hits a bottleneck, you can always profile and refactor to overcome it, by whatever method works.
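For reference, the "hit the database only once" option from the question amounts to a join. Below is a minimal sketch in Python with sqlite3 rather than Rails; the table and column names (`users`, `merchant_user_relations`, `visits`) and the `is_birthday` helper are assumptions made up for illustration, not the asker's actual schema.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
# Hypothetical schema standing in for the User and MerchantUserRelation tables.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, birthdate TEXT);
CREATE TABLE merchant_user_relations (
    id INTEGER PRIMARY KEY,
    merchant_id INTEGER,
    user_id INTEGER REFERENCES users(id),
    visits INTEGER
);
INSERT INTO users VALUES (1, '1990-03-15'), (2, '1985-11-02');
INSERT INTO merchant_user_relations VALUES (1, 10, 1, 4), (2, 10, 2, 7);
""")


def is_birthday(birthdate_iso, today):
    """Return True if the ISO birthdate falls on today's month and day."""
    born = date.fromisoformat(birthdate_iso)
    return (born.month, born.day) == (today.month, today.day)


# One round trip: relation data and the user's birthdate come back together.
rows = conn.execute(
    """
    SELECT r.user_id, r.visits, u.birthdate
    FROM merchant_user_relations AS r
    JOIN users AS u ON u.id = r.user_id
    WHERE r.merchant_id = ?
    ORDER BY r.user_id
    """,
    (10,),
).fetchall()

today = date(2024, 3, 15)  # fixed date so the example is reproducible
birthday_users = [
    user_id for user_id, _visits, birthdate in rows if is_birthday(birthdate, today)
]
print(birthday_users)
```

Whether this beats two separate queries is exactly the kind of question that profiling should answer once it matters.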

How Expensive is SQL ORDER BY?

I don't quite understand how a SQL command would sort a large resultset. Is it done in memory on the fly (i.e. when a query is performed)?
Is it going to be faster to sort using ORDER BY in SQL rather than sort, say, a linked list of objects containing the results in a language like Java (assuming a fast built-in sort, probably using quicksort)?
It will almost certainly be more efficient to sort the data in the database. Databases are designed to deal with large data volumes. And there are various optimizations available to the database that would not be available to the middle tier. If you plan on writing a hyper-efficient sort routine in the middle tier that takes advantage of information that you have about your data that the database doesn't (i.e. farming the data out to a cluster of dozens of middle tier machines so that the sort never spills to disk, taking advantage of the fact that your data is mostly ordered to choose an algorithm that wouldn't normally be particularly efficient), you can probably beat the database's sort speed. But that tends to be rare.
Depending on the query, for example, the database optimizer may choose a query plan that returns the data in order without performing a sort. For example, the database knows that the data in an index is sorted so it may choose to do an index scan to return the data in order without ever having to materialize and sort the entire result set. If it does have to materialize the entire result, it only needs the columns you are sorting by and some sort of row identifier (e.g. a ROWID in Oracle) rather than sorting an entire row of data like a naive middle tier implementation is likely to do. For example, if you have a composite index on (col1, col2) and you decide to sort on UPPER(col2), LOWER(col1), the database could read the col1 & col2 values from the index, sort the row identifiers, and then go fetch the data from the table. Of course, the database doesn't have to do this-- the optimizer will take into account the cost of doing a sort against the cost of fetching the data from the table or from the various indexes. The database may well conclude that the most efficient approach is to do a table scan, read the entire row into memory, and sort it. It may conclude that leveraging an index results in more I/O to fetch the data but makes up for it by reducing or eliminating the sort costs.
The answer is... it depends. If the ORDER BY part can be done by using an index in the database, then the execution plan for the query will use that index and the results will come back in the right order straight from the DB. If not, then the database will perform the sorting, but it's likely better at it than you reading all the results into memory (and certainly better than reading the results into a linked list).
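You can watch the optimizer make this choice. The sketch below uses SQLite through Python's sqlite3 module as an example engine (other databases expose their plans differently); the `results` table and index name are hypothetical. SQLite's `EXPLAIN QUERY PLAN` reports a temporary B-tree when it has to sort, and no such step once an index can supply the order.

```python
import sqlite3

# Hypothetical table and index names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany(
    "INSERT INTO results (score) VALUES (?)",
    ((i * 37 % 1000,) for i in range(1000)),
)


def plan(sql):
    """Return SQLite's query plan for sql as one string (detail is column 4)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT score FROM results ORDER BY score LIMIT 10"

# No index yet: SQLite has to sort, which shows up as a temp B-tree step.
plan_without_index = plan(query)

# With an index on the sort column, rows can be read already in order.
conn.execute("CREATE INDEX idx_results_score ON results (score)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, so treat the printed output as indicative rather than fixed.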
The exact method depends on the product you are using, but normally a fully-featured DBMS has multiple sort algorithms at its disposal. Some work on disk, optimizing for space over time, some work in memory, optimizing for speed. Check the source code of the available open source ones, if you are interested in the gory details.
It's unlikely that you are going to get better results by doing the sorting yourself or using some other library, although there can be pathological cases such as some operating system's qsort() having problems with certain data distributions. Try it out if you must, but prefer using a DBMS to manage your data, because that's what they are good at.
Unless the sort is index based, using a database sort guarantees that you will wait for the entire result set to be resolved and sorted in the database before you see even a single row of it.
If you sort it yourself, the data may be streamed incrementally (better for network-constrained environments) and may be incrementally useful to the application, reducing execution delay even if the sorting operation consumes the same total time.
Depending on the deployment scenario, it might make a big difference where the extra costs associated with sorting are paid. In the scenarios I work with, the middle tier is disposable and scalable, while the data tier is more expensive to scale out. If it costs the same CPU but database CPU costs 5x or 10x as much in operational terms, it becomes cheaper in real terms to do it outside the database.

Is O(1) access to a database row possible?

I have a table which uses an auto-increment field (ID) as primary key. The table is append-only and no row will be deleted. The table has been designed to have a constant row size.
Hence, I expected to have O(1) access time using any value as ID since it is easy to compute exact position to seek in file (ID*row_size), unfortunately that is not the case.
I'm using SQL Server.
Is it even possible ?
Thanks
Hence, I expected to have O(1) access time using any value as ID since it is easy to compute exact position to seek in file (ID*row_size),
Ah. No. Autoincrement does not guarantee no holes, even without deletions. Holes = seek via index. Ergo: your assumption is wrong.
I guess the thing that matters to you is the performance.
Databases use indexes to access records which are written on the disk.
Usually this is done with B+ tree indexes, where a lookup costs O(log_b(n)) and b, the branching factor of internal nodes, is typically between 100 and 200 (optimized to block size, see ref)
This is still, strictly speaking, logarithmic performance, but given a decent number of records, let's say a few million, the leaf nodes can be reached in 3 to 4 steps. That, together with all the overhead for query planning, session initiation, locking, etc. (which you would have anyway if you need a multiuser, ACID-compliant data management system), is for all practical purposes comparable to constant time.
The good news is that an indexed read is O(log(n)), which in practice stays pretty close to O(1) even for large values of n. That said, in this context O notation is not very useful, and actual timings are far more meaningful.
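The "3 to 4 steps" figure follows from simple arithmetic. Here is a small sketch that counts how many levels a tree needs for a given fanout; it models an idealized, completely full tree, so real indexes will differ somewhat. The function name `btree_levels` is made up for this example.

```python
def btree_levels(row_count, fanout):
    """Levels needed by an idealized, full tree with the given fanout
    to address row_count entries (smallest L with fanout**L >= row_count)."""
    levels = 1
    capacity = fanout
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


for fanout in (100, 200):
    for rows in (5_000_000, 1_000_000_000):
        print(f"fanout={fanout:>3} rows={rows:>13,} levels={btree_levels(rows, fanout)}")
```

Going from five million rows to a billion adds only one or two levels, which is why the lookup cost feels constant in practice.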
Even if it were possible to address rows directly, your query would still have to go through the client and server protocol stacks and carry out various lookups and memory allocations before it could give the result you want. It seems like you are expecting something that isn't even practical. What is the real problem here? Is SQL Server not fast enough for you? If so there are many options you can use to improve performance but directly seeking an address in a file is not one of them.
Not possible. SQL Server organizes data into a tree-like structure based on key and index values; an "index" in the DB sense is more like a reference book's index and not like an indexed data structure like an array or list. At best, you can get logarithmic performance when searching on an indexed value (PKs are generally treated as an index). Worst-case is a table scan for a non-indexed column, which is linear. Until the database gets very large, the seek time of a well-designed query against a well-designed table will pale in comparison to the time required to send it over the network or even a named pipe.

Performance benefit when SQL query is limited vs calling entire row?

How much of a performance benefit is there in selecting only the required fields in a query instead of querying the entire row? For example, if I have a row of 10 fields but only need 5 fields in the display, is it worth querying only those 5? What is the performance benefit of this limitation vs. the risk of having to go back and add fields to the SQL query later if needed?
It's not just the extra data aspect that you need to consider. Selecting all columns will negate the usefulness of covering indexes, since a bookmark lookup into the clustered index (or table) will be required.
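The covering-index effect can be observed in a query plan. The sketch below uses SQLite via Python's sqlite3 module as a stand-in (SQL Server reports the same idea as a key or bookmark lookup in its own plan format); the `orders` table and index name are hypothetical.

```python
import sqlite3

# Hypothetical table and index, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER, notes TEXT)"
)
conn.execute("CREATE INDEX idx_orders_customer_total ON orders (customer_id, total)")
conn.executemany(
    "INSERT INTO orders (customer_id, total, notes) VALUES (?, ?, ?)",
    ((i % 50, i * 3, "note") for i in range(500)),
)


def plan(sql):
    """Return SQLite's query plan for sql as one string (detail is column 4)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Every requested column lives in the index, so the table is never touched.
narrow_plan = plan("SELECT customer_id, total FROM orders WHERE customer_id = 42")

# SELECT * also needs notes, so each index hit requires a lookup into the table.
wide_plan = plan("SELECT * FROM orders WHERE customer_id = 42")

print(narrow_plan)
print(wide_plan)
```

In the narrow query the plan mentions a covering index; in the wide one the same index is used only to find rows, which then have to be fetched from the table.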
It depends on how many rows are selected and how much memory those extra fields consume. It can run much slower if several text/blob fields are present, for example, or if many rows are selected.
How is adding fields later a risk? Modifying queries to fit changing requirements is a natural part of the development process.
The only benefit I know of explicitly naming your columns in your select statement is that if a column your code is using gets renamed your select statement will fail before your code. Even better if your select statement is within a proc, your proc and the DB script would not compile. This is very handy if you are using tools like VS DB edition to compile/verify DB scripts.
Otherwise the performance difference would be negligible.
The number of fields retrieved is a second order effect on performance relative to the large overhead of the SQL request itself -- going out of process, across the network to another host, and possibly to disk on that host takes many more cycles than shoveling a few extra bytes of data.
Obviously, if the extra fields include a megabyte blob, the equation is skewed. But my experience is that the transaction overhead is of the same order as, or larger than, the actual data retrieved. I remember vaguely from many years ago that an "empty" NOP TNS request is about 100 bytes on the wire.
If the SQL server is not the same machine from which you're querying, then selecting the extra columns transfers more data over the network (which can be a bottleneck), not forgetting that the server has to read more data from the disk and allocate more memory to hold the results.
There's not one thing that would cause a problem by itself, but add things up and they all together cause performance issues. Every little bit helps when you have lots of either queries or data.
The risk I guess would be that you have to add the fields to the query later which possibly means changing code, but then you generally have to add more code to handle extra fields anyway.