Java 8 Stream vs JPA PostgreSQL ORDER BY. What is better for performance? - sql

I'm using the JPA EntityManager with PostgreSQL and Java 8.
I need to show some data ordered by name.
Which is faster and has better performance:
making a query to the database like
@Query("select t from Table t order by t.someField")
or just getting all the records from the database and sorting them with the Java 8 Stream API like
someCollection.stream()
        .sorted((e1, e2) -> e1.getSomeField().compareTo(e2.getSomeField()))
        .collect(Collectors.toList());

In general, if you can sort with SQL, just go ahead. If your sorting column is indexed, the sort is trivial: PostgreSQL will just read the index, which already contains the resulting order. Even if your sorting column is not indexed, the DBMS may do it more efficiently. For example, it's not necessary to hold whole rows in memory while sorting inside the DBMS; you only need the values from the sorted column and the row IDs. Once you have the properly ordered list of row IDs, you can send the rows to the client in a streaming fashion. Also, when sorting really big tables, the DBMS may spill some data to disk to reduce memory usage.
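As a minimal illustration (table and column names here are hypothetical), an index on the sorting column lets PostgreSQL read rows already in order:

-- Index on the ORDER BY column; the index already stores rows in order.
CREATE INDEX idx_some_table_some_field ON some_table (some_field);

-- With a LIMIT, the planner will almost certainly walk the index
-- (the plan shows an Index Scan) instead of sorting the whole table.
EXPLAIN SELECT * FROM some_table ORDER BY some_field LIMIT 50;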
Note that the DBMS sort is performed on the DBMS side, which can be a completely different server, so the resulting speed also depends on whether the DBMS server or the application server is more powerful or has more free resources right now.
If you want to sort the results in Java, it would probably be better to do an in-place sort using someCollection.sort(Comparator.comparing(e -> e.getSomeField())) (assuming that your someCollection is a List). This reduces the consumed memory and the number of times your data is copied. In-place sorting is most effective for array-based lists like ArrayList.
Also note that the sorted results may differ, as they may depend on the current DBMS collation (in Java you sort strings by UTF-16 code point values unless a custom Collator is used).

Related

How to generate a numeric identifier for entries based on a string

I'm working in Redshift SQL syntax, and want to know a way to convert a string id for each entry in a table to a numeric id (since numeric joins between tables are supposedly much quicker and more efficient than string joins).
Currently the ids look like this - a bunch of strings with both numbers and letters
01r00001ABCDeAAF
01r00001IJKLmAAN
...
01r00001OPQRtAAN
What I would like is to turn this into a purely numeric identifier, using the string id as an input and ensuring that each output is unique and corresponds only to a single input with no collisions (which can be replicated across tables so that accurate joins are possible).
I've tried using some hash functions within SQL like CHECKSUM() and BINARY_CHECKSUM() over the columns, but I'm a little unclear which would be the most applicable here - I understand some are case-sensitive and others aren't, while some generate collisions and others don't.
First, your reference for strings versus integers is based on an entirely different database. I would not generalize from SQL Server performance to other databases, particularly a massively parallel columnar database. There is also a lot of information that is taken out of context and generalized to wrong situations.
Second, you can test on tables in Amazon Redshift. Generating the data and doing the tests should be faster than modifying existing data. You will probably find no need to change anything.
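If you want to try such a test, here is a minimal sketch (table and column names are hypothetical; FNV_HASH is a Redshift built-in, but as a 64-bit hash it cannot strictly guarantee zero collisions, so verify uniqueness before trusting it for joins):

-- Derive a numeric surrogate key from the string id on both tables.
CREATE TABLE t1_numeric AS SELECT FNV_HASH(string_id) AS num_id, t1.* FROM t1;
CREATE TABLE t2_numeric AS SELECT FNV_HASH(string_id) AS num_id, t2.* FROM t2;

-- Time this join against the original string-keyed join.
SELECT COUNT(*)
  FROM t1_numeric a
  JOIN t2_numeric b ON a.num_id = b.num_id;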
You need to understand what is happening "under the hood" before making a change like this, particularly if you think it is for performance reasons.
Strings can be troublesome for a variety of reasons. First, they can have different collations or character sets -- information that is hidden. Such differences would preclude the use of indexes -- a major hit in a database such as SQL Server. Not using indexes is generally not an issue in Redshift.
Strings can also have variable lengths. This makes indexes slightly less efficient. They also require a wee bit more overhead to compare than numbers, because those collations and character sets need to be taken into account, and they need to be compared character by character, whereas most hardware has built-in comparisons for numbers. The extra cycles here are usually minimal compared to the cost of moving data.
When you do a join in Amazon Redshift, the first thing it is going to do is collocate the data, probably by hashing the values and sending the data to the same nodes in the parallel environment. Moving the data is expensive. Hashing the values, much less so.
In Redshift, you should be more concerned about how your data is distributed. Although I haven't tested it, adding a new column that is a number might make the query more expensive, because in a columnar database, the number of columns referenced has an impact on performance.
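Since distribution is the bigger lever here, a hedged sketch (all names hypothetical) of collocating two tables on their join key so Redshift doesn't have to move rows between nodes at join time:

-- Both tables distributed on the join column; matching values hash to
-- the same node, so the join needs no cross-node data movement.
CREATE TABLE t1 (
    string_id VARCHAR(18) NOT NULL,
    attr      VARCHAR(50)
)
DISTSTYLE KEY
DISTKEY (string_id);

CREATE TABLE t2 (
    string_id VARCHAR(18) NOT NULL,
    amount    DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (string_id);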

ORDER BY performance when executing a query in Oracle

I have been working on a Spring application that connects to an Oracle database.
After three years, the number of records in our tables has grown so much that query response times are bad and our customer is dissatisfied.
So I searched and found this URL about Oracle performance tuning.
Factor 22 at that URL says NOT to use ORDER BY in the query when response time is important; when I omit ORDER BY from my query, the response time is cut by more than half.
But I cannot omit ORDER BY from my query because the customer needs the sorting.
How do I fix my problem so that I keep the ordering and still get a good response time?
One of the best solutions, which Markus Winand mentions on his blog, is to use a pipelined ORDER BY; the details are in this link.
Factor 22 at that URL says NOT to use ORDER BY in the query when
response time is important; when I omit ORDER BY from my query,
the response time is cut by more than half.
On the Internet, you should always question every piece of advice you get.
In order for the ORDER BY clause to be fast, you need to use the right index. Make sure the sorting is done using a database index, therefore avoiding a full-table scan or an explicit sort operation. When in doubt, just search for SQL performance issues on Markus Winand's Use the Index Luke site or, even better, read his SQL Performance Explained book.
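A hedged sketch of the "pipelined" idea (table, column, and bind names are hypothetical): an index whose columns match both the filter and the sort lets Oracle read rows already in order and stop after the first page, instead of sorting the whole result:

-- Leading column serves the WHERE, trailing column serves the ORDER BY.
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Oracle can walk the index in order and stop after 20 rows
-- (FETCH FIRST requires Oracle 12c or later).
SELECT *
  FROM orders
 WHERE customer_id = :cust
 ORDER BY order_date
 FETCH FIRST 20 ROWS ONLY;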
So, you should make sure that the Buffer Pool is properly configured and you have enough RAM to hold the data working set and indexes as well.
If you really have huge data (e.g. billions of records), then you should use partitioning. Otherwise, for tens or hundreds of millions of records, you could just scale vertically using more RAM.
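For illustration (hypothetical table; Oracle 11g+ interval partitioning), a range-partitioned layout that lets queries filtering on the date column touch only the relevant partitions:

CREATE TABLE orders (
    order_id   NUMBER,
    order_date DATE NOT NULL,
    total      NUMBER(10, 2)
)
PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    -- Oracle creates one new partition per month automatically.
    PARTITION p_initial VALUES LESS THAN (DATE '2020-01-01')
);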
Also, make sure you use compact data types. For example, don't store an Enum ordinal value in a 32-bit integer column, since a single byte would probably be more than enough to store all the Enum values you might use.
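As a tiny example (hypothetical column), in Oracle a narrow NUMBER is plenty for an Enum ordinal:

-- NUMBER(3) covers 0-999, far more than typical Enum cardinality,
-- instead of a full 32-bit integer.
ALTER TABLE orders ADD (status_code NUMBER(3));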

Indexes in SQL vs ORDER BY clause

What is the difference between using indexes in SQL and using the ORDER BY clause?
From what I understand, indexes arrange the specified column(s) in an ordered manner, which helps the query engine look through the tables quickly (and hence prevents table scans).
My question: why can't the query engine simply use ORDER BY for improving performance?
Thanks!
You tagged this sql-server-2008, but the question has nothing to do with SQL Server specifically; it applies to all databases.
From Wikipedia:
Indexing is a technique some storage engines use for improving database performance. The many types of indexes share the common property that they reduce the need to examine every entry when running a query. In large databases, this can reduce query time/cost by orders of magnitude. The simplest form of index is a sorted list of values that can be searched using a binary search with an adjacent reference to the location of the entry, analogous to the index in the back of a book. The same data can have multiple indexes (an employee database could be indexed by last name and hire date).
From a related thread on StackExchange:
In the SQL world, order is not an inherent property of a set of data. Thus, you get no guarantees from your RDBMS that your data will come back in a certain order -- or even in a consistent order -- unless you query your data with an ORDER BY clause.
To answer why indexes are necessary:
Note the bolded text above about indexing reducing the need to examine every entry. In the absence of an index, when an ORDER BY is issued in SQL every entry needs to be examined, and that cost grows with the number of entries.
ORDER BY is applied only when reading. A single column may be used in several indexes, and different queries may request several different orderings; it is not possible to define the right indexes unless we understand how the query requests are made.
A lot of the time, indexes are added once new patterns of querying emerge, so as to keep those queries performant, which means index creation is driven by how you define your ORDER BY in SQL.
The query engine, which processes your SQL with or without ORDER BY, defines your execution plan and does not deal with the storage of the data. The data retrieved from a query may come partly from memory, if the data was in cache, and partly or fully from disk. When reading from disk, the storage engine uses the indexes to figure out how to read the data quickly.
ORDER BY affects the performance of a query when reading. An index affects the performance of a query during all of the Create, Read, Update, and Delete operations.
A query engine may choose to use an index or totally ignore the index based on the data characteristics.
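A small sketch of that choice in practice (hypothetical names, PostgreSQL-style EXPLAIN): the same ORDER BY may be served by an index scan or by an explicit sort, whichever the planner estimates is cheaper:

CREATE INDEX idx_employees_last_name ON employees (last_name);

-- Likely an index scan: only a few rows are needed, already in order.
EXPLAIN SELECT * FROM employees ORDER BY last_name LIMIT 10;

-- May ignore the index: for a full read, a sequential scan plus an
-- explicit sort can beat random I/O through the index.
EXPLAIN SELECT * FROM employees ORDER BY last_name;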

How do DBMSs process different types of joins?

How do database engines process SQL joins? Do they apply different techniques to process different types of joins? An explanation with examples would be appreciated.
Query evaluation is very complex. I recommend that you pick up a database textbook and read the query-evaluation portion of your favourite DBMS's documentation.
In a nutshell, there are three main types of algorithms: single-pass, loop-based, and sort/merge-based. Which one is used depends on the number of tuples in the tables to join, the expected number of joined tuples, the size of memory and the disk speed (if properly tuned), the existence of indexes, and how good the DBMS's planner is.
Single-pass joins happen when the table to be joined fits in memory.
Loop-based joins are usually done when one table fits completely in memory (they can be index-based or hash-based).
Multiple passes are required for sort/merge-based joins.
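A quick way to see which of these algorithms the planner picked, in PostgreSQL (table names hypothetical):

-- The plan shows "Nested Loop", "Hash Join", or "Merge Join" depending
-- on table sizes, available memory, and indexes.
EXPLAIN ANALYZE
SELECT o.order_id, c.name
  FROM orders o
  JOIN customers c ON c.customer_id = o.customer_id;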
This URL has some good examples:
http://etutorials.org/SQL/Postgresql/Part+I+General+PostgreSQL+Use/Chapter+4.+Performance/Understanding+How+PostgreSQL+Executes+a+Query/
--dmg

Is there a SQL ANSI way of starting a search at the end of a table?

In a certain app I must constantly query data that is likely to be among the last inserted rows. Since this table is going to grow a lot, I wonder if there's a standard way of optimizing the queries by making them start the lookup at the table's end. I think I would get the same optimization if the database stored the table's data in a stack-like structure, so the last inserted rows would be searched first.
The SQL spec doesn't mention anything about maintaining insertion order, and in practice most decent DBs don't maintain it either. So it stops there: sorting the table first isn't going to make it faster. Just index the column(s) of interest (at least the ones you use in the WHERE clause).
One of the "tenets" of a proper RDBMS is that this kind of matters shouldn't concern you or anyone else using the DB.
The DB engine is "free" to use whatever method it wants to store/retrieve records, so if you want to enforce a "top" behaviour, do what others suggested: add a timestamp field to the table (or tables), add an index on it, and use it as a sort and/or query criterion (e.g. you poll the table each minute and ask for records with timestamp >= systime - 1 minute).
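A sketch of that approach in PostgreSQL-style SQL (table and column names are hypothetical):

-- Timestamp column plus an index so the range scan is cheap.
ALTER TABLE events ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
CREATE INDEX idx_events_created_at ON events (created_at);

-- Poll once a minute for the newest records.
SELECT * FROM events
 WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 minute';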
There is no standard way.
In some databases you can specify the sort order on an index.
SQL Server allows you to write ASC or DESC on an index:
[ ASC | DESC ]
Determines the ascending or descending sort direction for the particular index column. The default is ASC.
In MySQL you can also write ASC or DESC when you create the index but currently this is ignored. It might be implemented in a future version.
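For example, in SQL Server syntax (hypothetical names), a descending index keeps the newest rows at the front of the index order:

-- Scanning this index in its natural order yields the newest rows first.
CREATE INDEX idx_events_created_desc ON events (created_at DESC);

SELECT TOP (10) * FROM events ORDER BY created_at DESC;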
Add a counter or a time field in your table, sort on it and get top rows.
In other words: you should forget the idea that SQL tables are accessed in any particular order by default. A seqscan does not mean the oldest rows will be searched first, only that all rows will be checked. If you want to optimize a search, you add indexes on the relevant fields. What you are looking for is probably indexes.
If your data is indexed, it won't matter. The index is doing a binary search, not a sequential scan.
Unless you're doing TOP 1 (or something like it), the SELECT will have to scan the whole table or index anyway.
According to the principle of Data Independence, you shouldn't care. That said, a clustered index would probably suit your needs if you typically look for a date range. (Sorting asc/desc shouldn't matter, but you should try it out.)
If you find that you really need it you can also shard your database to increase perf on the most recently added data.
If you have enough rows that it's actually becoming a problem, and you know how many "the most recently inserted rows" should be, you could try a round-about method.
Note: even for pretty big tables this is less efficient overall, but once your main table gets big enough, I've seen this work wonders for user-facing performance.
Create a "staging" table that exactly mimics your table's structure. Whenever you insert into your main table, also insert into your "staging" area. Limit your "staging" area to n rows by using a trigger to delete the lowest id row in the table when a new row over your arbitrary maximum is reached (say, 10,000 or whatever your limit is).
Then, queries can hit that smaller table first looking for the information. Since the table is arbitrarilly limited to the last n rows, it's only looking in the most recent data. Only if that fails to find a match would your query (actually, at this point a stored procedure because of the decision making) hit your main table.
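A hedged sketch of that setup in SQL Server syntax (all names hypothetical; the tables are simplified to three columns):

-- Staging table mirrors the main table's structure.
CREATE TABLE orders_recent (
    id         INT PRIMARY KEY,
    created_at DATETIME NOT NULL,
    payload    VARCHAR(200)
);

-- Copy each new row in, then trim back to the newest 10,000 rows.
CREATE TRIGGER trg_orders_recent ON orders AFTER INSERT AS
BEGIN
    INSERT INTO orders_recent (id, created_at, payload)
    SELECT id, created_at, payload FROM inserted;

    DELETE FROM orders_recent
     WHERE id NOT IN (SELECT TOP (10000) id
                        FROM orders_recent
                       ORDER BY id DESC);
END;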
Some Gotchas:
1) Make sure your trigger(s) is (are) set up properly to maintain the correct concurrency between your "main" and "staging" tables.
2) This can quickly become a maintenance nightmare if not handled properly, and depending on your scenario it can be a little finicky.
3) I cannot stress enough that this is only efficient/useful in very specific scenarios. If yours doesn't match it, use one of the other answers.
ISO/ANSI Standard SQL does not consider optimization at all. For example the widely recognized CREATE INDEX SQL DDL does not appear in the Standard. This is because the Standard makes no assumptions about the underlying storage medium and nor should it. I regularly use SQL to query data in text files and Excel spreadsheets, neither of which have any concept of database indexes.
You can't do this.
However, there is a way to do something that might be even better. Depending on the design of your table, you should be able to create an index that keeps things in almost the order of entry. For example, if you adopt the common practice of creating an id field that autoincrements, then that index is just about in chronological order.
Some RDBMSes permit you to declare a backwards index, that is, one that descends instead of ascending. If you create a backwards index on the ID field, and if the optimizer uses that index, it will look at the most recent entries first. This will give you a rapid response for the first row.
The next step is to get the optimizer to use the index. You need to use explain plan to see if the index is being used. If you ask for the rows in order of id descending, the optimizer will almost certainly use the backwards index. If not you may be able to use hints to guide the optimizer.
If you still need to avoid reading all the rows in order to avoid wasting time, you may be able to use the LIMIT feature to declare that you only want, say, 10 rows and no more, or 1 row and no more. That should do it.
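Combining those two ideas in MySQL/PostgreSQL-style syntax (hypothetical names):

-- A descending index on the autoincrement id (where supported).
CREATE INDEX idx_log_entries_id_desc ON log_entries (id DESC);

-- Walks the index from the newest entry and stops after 10 rows.
EXPLAIN SELECT * FROM log_entries ORDER BY id DESC LIMIT 10;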
Good luck.
If your table has a create date, then I'd reverse sort by that and take the top 1.