Reverse query order in ScyllaDB version 5

I am having trouble finding a good, up-to-date resource on whether reverse query orders are still as bad as they were when Scylla started. Previously, all rows had to be read into memory and sorted before the result set was returned when using a reverse query order.
However, I was wondering whether I should still worry about reverse query order when using the latest Scylla version and running reverse-order queries on a single partition. I see that this issue was closed: https://github.com/scylladb/scylladb/issues/1413, and I read about improvements in https://www.slideshare.net/ScyllaDB/scylla-summit-2022-scylla-50-new-features-part-1.
Should I still worry? Or should I keep creating materialized views to avoid this problem (or use other solutions)?

Reverse queries are now only slightly slower than forward queries.
See https://www.scylladb.com/product/release-notes/scylladb-open-source-5-0/ for full details.
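For reference, here is a minimal CQL sketch (table and column names are hypothetical) of a single-partition reverse query, i.e. one whose ORDER BY direction is the opposite of the table's clustering order:

-- Table clustered ascending by ts:
CREATE TABLE events (
    device_id uuid,
    ts        timestamp,
    payload   text,
    PRIMARY KEY (device_id, ts)
) WITH CLUSTERING ORDER BY (ts ASC);

-- Forward query (matches the native clustering order):
SELECT * FROM events WHERE device_id = ? ORDER BY ts ASC;

-- Reverse query: since ScyllaDB 5.0 this no longer requires reading the
-- whole partition into memory, so it is only slightly slower than forward.
SELECT * FROM events WHERE device_id = ? ORDER BY ts DESC;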

Related

Limit query time on the client side of PostgreSQL

I'm trying to query a PostgreSQL database, but it is a public server and I really don't want to waste a lot of its CPU for a long time.
So I wonder if there is some way to limit my query time to a specific duration, for example, 3/5/10 minutes.
I assume there is syntax similar to LIMIT, but for query duration rather than for the number of results.
Thanks for any kind of help.
Set statement_timeout, then your query will be terminated with an error if it exceeds that limit.
According to this documentation:
statement_timeout (integer)
Abort any statement that takes more than the specified number of milliseconds, starting from the time the command arrives at the server from the client. If log_min_error_statement is set to ERROR or lower, the statement that timed out will also be logged. A value of zero (the default) turns this off.
Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions.
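For example, to cap every statement in your current session at five minutes (a sketch; adjust the duration to taste):

-- Applies only to the current session:
SET statement_timeout = '5min';   -- or: SET statement_timeout = 300000; (milliseconds)

-- Any statement that runs longer now fails with an error.
-- Restore the server default when you are done:
RESET statement_timeout;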
Here is a potential catch with using LIMIT to control how long a query might run. Ideally, you ought to also be using ORDER BY with your query, to tell Postgres exactly how it should limit the size of the result set. But the caveat here is that Postgres would typically have to materialize the entire result set when using LIMIT with ORDER BY, and then sort it before returning the top rows, which might take longer than just reading in the entire result set.
One workaround to this might be to just use LIMIT without ORDER BY. If the execution plan does not include reading the entire table and sorting, it might be one way to do what you want. However, keep in mind that if you go this route, Postgres would have free license to return any records from the table it wishes, in any order. This is probably not what you want from a business or reporting point of view.
But a much better approach here would be to tune your query using things like indexes, making it fast enough that you don't need to resort to a LIMIT trick.
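To illustrate the index point (a hypothetical sketch; orders and created_at are made-up names): with an index on the ORDER BY column, Postgres can walk the index and stop after the first n rows instead of sorting the whole table.

CREATE INDEX orders_created_at_idx ON orders (created_at);

-- Can be served by an index scan plus a Limit node, with no full sort:
SELECT * FROM orders ORDER BY created_at LIMIT 100;

-- Verify the plan:
EXPLAIN SELECT * FROM orders ORDER BY created_at LIMIT 100;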

Cassandra query execution sequencing vs. eventual consistency issue

I am confused about Cassandra's eventual consistency vs. query sequencing. I have the following questions:
1) If I send 2 queries in sequence (without turning on the isIdempotent property), where the first query deletes a record and the second creates a record, is it possible that query 2 executes before query 1?
My Java code will look like this:
public void foo() {
    delete(entity);  // first, delete a record
    create(entity);  // second, create a record
}
Another thing: I am not specifying any timestamp in my queries.
2) My second question: Cassandra is eventually consistent. If I send both of the above queries in sequential order and they have not yet been replicated to all nodes, will the order be maintained when they are actually replicated to all nodes?
I tried looking at the Cassandra documentation; it talks about query sequencing in batch operations, but it doesn't talk about sequencing of non-batch operations.
I am using Cassandra 2.1.
By default, modern driver versions use client-side timestamps. Check the driver documentation here:
https://datastax.github.io/java-driver/manual/query_timestamps/
Based on the timestamp, C* resolves conflicts with LWW semantics (last write wins): if the create has an earlier timestamp than the delete, a query won't return the data; if the create has a newer timestamp, it will.
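For example (a hypothetical CQL sketch; keyspace, table, and timestamp values are made up), you can see LWW in action by setting timestamps explicitly:

-- The delete carries timestamp 1000, the insert 2000:
DELETE FROM ks.entity USING TIMESTAMP 1000 WHERE id = 42;
INSERT INTO ks.entity (id, name) VALUES (42, 'foo') USING TIMESTAMP 2000;

-- The insert has the newer timestamp, so it wins: this returns the row
-- even if a replica happened to apply the delete after the insert.
SELECT * FROM ks.entity WHERE id = 42;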
If you need linearizability, i.e. the guarantee that certain operations will be executed in sequence, you can use lightweight transactions based on Paxos:
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
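A minimal CQL sketch of lightweight transactions (same hypothetical table as above); the IF clause makes each operation conditional and serializes it through Paxos:

-- Only inserts if no row with this key exists yet:
INSERT INTO ks.entity (id, name) VALUES (42, 'foo') IF NOT EXISTS;

-- Only applies if the current value matches:
UPDATE ks.entity SET name = 'bar' WHERE id = 42 IF name = 'foo';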

What is the best way to ensure consistent ordering in an Oracle query?

I have a program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).
It would be nice if, in the event that the task stopped halfway through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a performance penalty for this (an index access) versus a non-sorted solution. Given that a restart may never happen, this is not desirable.
Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?
EDIT: I have been looking around and have seen "order by rowid" mentioned. Is this useful or even possible?
EDIT2: I am adding some benchmarks:
With no order by: 17 seconds.
With order by PK: 46 seconds.
With order by rowid: 43 seconds.
So any ORDER BY has a severe effect on performance, and using rowid makes little difference. The accepted answer is: there is no easy way to do it.
The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.
Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.
If you want some order and the queried data is unsorted, then it needs to be sorted somewhere, and some resources must be spent on that sorting.
So, there are at least two variants for optimization:
Minimize resources spent on sorting;
Query already sorted data.
For the first variant, Oracle on its own calculates the best plan to minimize data access and overall query time. It may be possible to choose a sort order matching a unique index the optimizer already uses, but that is a very questionable tactic.
The second variant is about index-organized tables, or about forcing Oracle with hints to use a specific index. This seems OK if you need to process nearly all records in a table, but if the query selects only a small portion of the table, it significantly slows the process, even on a single table.
Think about a table with a surrogate primary key that holds a 10-year transaction history. If you need data only for the previous year and you force ordering by the primary key, then Oracle has to process the records of all 10 years one by one to find the records that belong to that single year.
But if you need 9 years of data from this table, then a full table scan may be faster than the index-based approach.
So the selectivity of your query is the key to choosing between a full table scan and index-ordered access.
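For concreteness, here is a hypothetical sketch of the forced-index approach described above (transactions, t_pk, tx_date, and id are all made-up names; INDEX_ASC is a standard Oracle hint):

-- Force index-ordered access on the primary key so rows come back
-- sorted by id without a separate sort step. If tx_date is selective,
-- this walks far more of the index than a full scan plus sort would.
SELECT /*+ INDEX_ASC(t t_pk) */ *
FROM   transactions t
WHERE  tx_date >= DATE '2013-01-01'
ORDER  BY t.id;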
For storing results and restarting the query, a good solution is to use Oracle Streams Advanced Queuing to feed the other process.
All unprocessed messages in the queue can be redirected to an exception queue, where they may be processed separately.
Because you don't specify an exact ordering for the selected records, I suppose you need ordering only to keep track of the unprocessed portion of the records. If that's true, then with AQ you don't need ordering at all and may even process records in parallel.
So, finally, from my point of view a buffered queue is what you really need.
You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.
For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.
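A hypothetical sketch of that restartable approach (big_table, id, payload, and is_processed are made-up names):

-- Pick up wherever the last run left off; no ORDER BY required:
SELECT id, payload
FROM   big_table
WHERE  is_processed = 'N';

-- After the downstream process has handled a row (or a batch):
UPDATE big_table
SET    is_processed = 'Y'
WHERE  id = :processed_id;

COMMIT;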

Apache Solr - indexing a DB table appears to retrieve more records than contained in the table

I'm very new to Solr, so if I am saying something that doesn't make sense, please let me know.
I've recently set up Solr 4.0 beta and it is working quite well. It is set up with the DataImportHandler (DIH) to read a view from a MySQL DB. The view contains about 20 million rows and 16 columns. A number of the columns have a lot of NULL values. The performance of the DB is quite good - I get sub-second query times against the view when I run a query manually.
I pointed Solr at the view and it began the index process. I came back four hours later to check on it and discovered that not only was it still indexing, but that it reported having fetched 200+ million.
Am I misunderstanding how Solr works? I was under the assumption that it would fetch the same number of rows as what is in the DB - which is about 20 million. Or is it actually counting each field as an item fetched? Or, even worse, is it in some kind of loop?
I did some prior testing with a small sub-set of the data from the very same view by limiting the query to 100,000 records. On completion, it reported as having fetched exactly 100,000. I am not getting any warnings/errors in the logs either.
Any ideas on what's happening?
That number represents rows fetched from the DB. Could you post your db-data-config.xml file? I think you should check your SQL again.
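One hedged guess about where the extra fetches can come from: DIH's "fetched" counter includes rows returned by every entity query, including nested child entities, which run once per parent row. A hypothetical db-data-config.xml with such a nested entity (all names invented) would look like:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="solr" password="..."/>
  <document>
    <entity name="item" query="SELECT id, title FROM my_view">
      <!-- Runs once per parent row; every row it returns is also
           counted as "fetched", inflating the total well past 20M. -->
      <entity name="tags"
              query="SELECT tag FROM item_tags WHERE item_id = '${item.id}'"/>
    </entity>
  </document>
</dataConfig>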

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns to a PostgreSQL server using the libpq library. Right now, my server is running on the same machine as my Visual C++ application.
I have added around 15-20 million records. A simple query for the total count, select count(*) from <tableName>;, is taking 4-5 minutes.
I have indexed my table on the time at which I enter the data (timecode). Most of the time I need the count with different WHERE / AND clauses added.
Is there any way to make things faster? I need it as fast as possible, because once the server moves onto the network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*), which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link. One is to count on an indexed column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
The other suggestion, if you can live with an approximate value, is to use a Postgres-specific catalog query:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
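In psql, for example (mytable is a hypothetical name), that throughput estimate looks like:

-- Size of the table on disk, in bytes (and human-readable):
SELECT pg_relation_size('mytable');
SELECT pg_size_pretty(pg_relation_size('mytable'));

\timing
SELECT count(*) FROM mytable;  -- divide the size by the elapsed seconds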
The default Postgres configuration is meant not to consume too many resources, so as to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. The work happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.
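A minimal sketch of the inheritance-based partitioning that PostgreSQL 9.1 offers (measurements and timecode are hypothetical names):

-- Parent table holds no data itself:
CREATE TABLE measurements (
    id       bigserial,
    timecode timestamp NOT NULL,
    value    numeric
);

-- One child per year, with a CHECK constraint describing its slice:
CREATE TABLE measurements_2012 (
    CHECK (timecode >= DATE '2012-01-01' AND timecode < DATE '2013-01-01')
) INHERITS (measurements);

CREATE INDEX measurements_2012_timecode_idx ON measurements_2012 (timecode);

-- With constraint exclusion on, a count restricted to one year only
-- scans the matching child table:
SET constraint_exclusion = partition;
SELECT count(*) FROM measurements
WHERE timecode >= DATE '2012-01-01' AND timecode < DATE '2013-01-01';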