SQLite sqlite3_step() hangs with big database

I'm writing a small Objective-C library that works with an embedded SQLite database.
The SQLite version I'm using is 3.7.13 (checked with SELECT sqlite_version()).
My query is:
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
    SELECT (
        SELECT A.timestamp - B.timestamp
        FROM ExampleTable AS B
        WHERE B.timestamp = (
            SELECT MAX(timestamp)
            FROM ExampleTable AS C
            WHERE C.timestamp < A.timestamp
        )
    ) AS difference
    FROM ExampleTable AS A
    ORDER BY timestamp
)
Basically it outputs the average timestamp difference between consecutive rows, ordered by timestamp.
I tried the query on a sample database with 35k rows and it runs in around 100ms. So far so good.
I then tried the query on another sample database with 100k rows, and it hangs inside sqlite3_step(), taking up 100% of the CPU.
Since I cannot step into sqlite3_step() with the debugger, is there another way I can get a grasp of where the function is hanging, or a debug log of what the issue is?
I also tried running other queries from my library on the 100k-row database and there is no issue, but those are simple queries with no subqueries. Maybe this is the issue?
Thanks
UPDATE
This is the output of EXPLAIN QUERY PLAN as requested:
"1","0","0","SCAN TABLE ExampleTable AS A"
"1","0","0","EXECUTE CORRELATED SCALAR SUBQUERY 2"
"2","0","0","SCAN TABLE ExampleTable AS B"
"2","0","0","EXECUTE CORRELATED SCALAR SUBQUERY 3"
"3","0","0","SEARCH TABLE ExampleTable AS C"
"1","0","0","USE TEMP B-TREE FOR ORDER BY"
"0","0","0","SCAN SUBQUERY 1"

Looking up rows by their timestamp value can be optimized with an index on this column:
CREATE INDEX whatever ON ExampleTable(timestamp);
And this query is inefficient: ORDER BY does not affect values that are averaged, and the timestamp values in B and C are always identical, so you can drop one of them:
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
    SELECT timestamp -
           (SELECT MAX(timestamp)
            FROM ExampleTable AS B
            WHERE timestamp < A.timestamp) AS difference
    FROM ExampleTable AS A
)
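With the index in place, you can verify that it is actually being used by running EXPLAIN QUERY PLAN on the rewritten query (the exact wording of the output varies between SQLite versions):
EXPLAIN QUERY PLAN
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
    SELECT timestamp -
           (SELECT MAX(timestamp)
            FROM ExampleTable AS B
            WHERE timestamp < A.timestamp) AS difference
    FROM ExampleTable AS A
);
-- The inner lookup should now show up as something like
-- "SEARCH TABLE ExampleTable AS B USING COVERING INDEX whatever (timestamp<?)"
-- instead of the "SCAN TABLE ExampleTable AS B" seen in the original plan.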

I eventually went with this solution:
CREATE TABLE tmp AS SELECT timestamp FROM ExampleTable ORDER BY timestamp;

SELECT ROUND(AVG(difference), 5)
FROM (
    SELECT (
        SELECT A.timestamp - B.timestamp
        FROM tmp AS B
        WHERE B.rowid = A.rowid - 1
    ) AS difference
    FROM tmp AS A
    ORDER BY timestamp
);

DROP TABLE tmp;
Actually I went further, and I am only using this strategy for a high number of rows (> 40k), since the other strategy (a single query) works better for "small" tables.
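For what it's worth, newer SQLite versions (3.25 and later, so not the 3.7.13 build used above) support window functions, and the same average gap can then be computed in a single pass with LAG(); a minimal sketch:
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
    SELECT timestamp - LAG(timestamp) OVER (ORDER BY timestamp) AS difference
    FROM ExampleTable
);
-- The first row has no predecessor, so its difference is NULL and AVG() ignores it.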

Related

Efficiently selecting distinct (a, b) from big table

I have a table with around 54 million rows in a Postgres 9.6 DB and would like to find all distinct pairs of two columns (there are around 4 million such values). I have an index over the two columns of interest:
create index ab_index on tbl (a, b)
What is the most efficient way to get such pairs? I have tried:
select a,b
from tbl
where a>$previouslargesta
group by a,b
order by a,b
limit 1000
And also:
select distinct(a,b)
from tbl
where a>previouslargesta
order by a,b
limit 1000
Also this recursive query:
with recursive t AS (
select min(a) AS a from tbl
union all
select (select min(a) from tbl where a > t.a)
FROM t)
select a FROM t
But all are slooooooow.
Is there a faster way to get this information?
Your table has 54 million rows and ...
there are around 4 million such values
7.4% of all rows is a high percentage; an index can mostly only help by providing pre-sorted data, ideally in an index-only scan. There are more sophisticated techniques for smaller result sets (see below), and there are much faster ways for paging, which returns far fewer rows at a time (see below), but for the general case a plain DISTINCT may be among the fastest:
SELECT DISTINCT a, b -- *no* parentheses
FROM tbl;
-- ORDER BY a, b -- ORDER BY wasn't mentioned as a requirement ...
Don't confuse it with DISTINCT ON, which would require parentheses. See:
Select first row in each GROUP BY group?
The B-tree index ab_index you have on (a, b) is already the best index for this. It has to be scanned in its entirety, though. The challenge is to have enough work_mem to process it all in RAM. With standard settings it occupies at least 1831 MB on disk, typically more with some bloat. If you can afford it, run the query with a work_mem setting of 2 GB (or more) in your session. See:
Configuration parameter work_mem in PostgreSQL on Linux
SET work_mem = '2 GB';
SELECT DISTINCT a, b ...
RESET work_mem;
A read-only table helps. Otherwise you need aggressive enough VACUUM settings to allow an index-only scan. And some more RAM would also help (with appropriate settings) to keep the index cached.
Also upgrade to the latest version of Postgres (11.3 as of writing). There have been many improvements for big data.
Paging
If you want to add paging as indicated by your sample query, urgently consider ROW value comparison. See:
Optimize query with OFFSET on large table
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
SELECT DISTINCT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1000;
Recursive CTE
This may or may not be faster for the general big query as well. For the small subset, it becomes much more attractive:
WITH RECURSIVE cte AS (
    (   -- parentheses required due to LIMIT 1
    SELECT a, b
    FROM   tbl
    WHERE  (a, b) > ($previous_a, $previous_b) -- !!!
    ORDER  BY a, b
    LIMIT  1
    )
    UNION ALL
    SELECT x.a, x.b
    FROM   cte c
    CROSS  JOIN LATERAL (
        SELECT t.a, t.b
        FROM   tbl t
        WHERE  (t.a, t.b) > (c.a, c.b) -- lateral reference
        ORDER  BY t.a, t.b
        LIMIT  1
        ) x
    )
TABLE  cte
LIMIT  1000;
This can make perfect use of your index and should be as fast as it gets.
Further reading:
Optimize GROUP BY query to retrieve latest row per user
For repeated use and no or little write load on the table, consider a MATERIALIZED VIEW, based on one of the above queries - for much faster read performance.
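A minimal sketch of such a materialized view, based on the plain DISTINCT variant above (names are illustrative; REFRESH ... CONCURRENTLY needs the unique index):
CREATE MATERIALIZED VIEW tbl_distinct_ab AS
SELECT DISTINCT a, b
FROM   tbl
ORDER  BY a, b;

CREATE UNIQUE INDEX tbl_distinct_ab_uni ON tbl_distinct_ab (a, b);

-- Re-run after (batches of) writes to tbl:
REFRESH MATERIALIZED VIEW CONCURRENTLY tbl_distinct_ab;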
I cannot guarantee performance on Postgres, but this is a technique I used on SQL Server in a similar case, and it proved faster than the alternatives:
get distinct A into a temp table a
get distinct B into a temp table b
cross join the a and b temps into a Cartesian temp table abALL
rank abALL (optionally)
create a view myview as select top 1 a, b from tbl (your main table)
join the temp abALL with myview into a temp table abCLEAN
rank abCLEAN here if you haven't ranked above
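A rough Postgres sketch of one plausible reading of those steps (temp table names are made up, and the optional ranking steps are left out); the Cartesian candidate set only stays manageable when a and b each have few distinct values, in which case the final existence check is roughly one index probe per candidate pair:
CREATE TEMP TABLE ta AS SELECT DISTINCT a FROM tbl;
CREATE TEMP TABLE tb AS SELECT DISTINCT b FROM tbl;

-- Cartesian candidates
CREATE TEMP TABLE aball AS
SELECT ta.a, tb.b FROM ta CROSS JOIN tb;

-- keep only the pairs that actually occur in the main table
CREATE TEMP TABLE abclean AS
SELECT c.a, c.b
FROM   aball c
WHERE  EXISTS (SELECT 1 FROM tbl t WHERE t.a = c.a AND t.b = c.b);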

How to performance test Redshift Views?

I have a question about testing query performance for views in redshift.
I have two tables: table_a and table_b:
- table a and table b have different sort keys defined.
- table a has 6 fields in its sort key.
- table b has 4 fields in its sort key.
- both tables share some column names/types, but table a is a superset of table b.
I created a view v_combined. The view combines data from table a and table b based on the dates queried. For example, if the query is for dates before XYZ, the view sources table a; otherwise it sources table b.
create view v_combined as
select a as x, b as y, c as z, to_timestamp(time_field::TEXT, 'YYYYMMDD')::timestamp as date
from table_a
where date < "MY_DATE"
union all
select * from table_b
where date > "MY_DATE"
I did a comparison between the view and the corresponding tables:
(1) select count(*) from v_combined where date < "MY_DATE"
(2) select count(*) from table_a where date < "MY_DATE"
(3) select count(*) from v_combined where date > "MY_DATE"
(4) select count(*) from table_b where date > "MY_DATE"
(5) select * from v_combined where date < "MY_DATE" limit 10000
(6) select * from table_a where date < "MY_DATE" limit 10000
(7) select * from v_combined where date > "MY_DATE" limit 10000
(8) select * from table_b where date > "MY_DATE" limit 10000
(1) and (2) have similar execution time as expected.
(3) and (4) have similar execution time as expected.
(5) seems to have longer execution time than (6).
(7) seems to have longer execution time than (8).
What is the best way to test the performance of a view in Redshift?
I'd say that the best way to test the performance of a view is to run test queries exactly like you did!
The performance of this particular View will always be poor because it is doing a UNION ALL.
In (5), it needs to get ALL rows from both tables before applying the LIMIT, whereas (6) only needs to access table_a and can stop as soon as it hits the limit.
If you need good performance from queries like this, you could consider creating a combined table (rather than a view). Run a daily (or hourly?) script to recreate the table from the combined data. That way, queries will run much faster.
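A minimal sketch of that combined-table approach, assuming the same column layout and cutoff placeholder as the view definition above (the rebuild would run from whatever scheduler you already use; dist/sort keys can be added to the CREATE TABLE if needed):
DROP TABLE IF EXISTS combined_table;

CREATE TABLE combined_table AS
SELECT a AS x, b AS y, c AS z,
       to_timestamp(time_field::TEXT, 'YYYYMMDD')::timestamp AS date
FROM   table_a
WHERE  date < "MY_DATE"
UNION ALL
SELECT *
FROM   table_b
WHERE  date > "MY_DATE";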

Getting peak value of a column in table till this date

I have an Oracle table having columns {date, id, profit, max_profit}.
I have data in date and profit, and I want the highest value of profit till that date in max_profit. I am using the query below:
UPDATE MY_TABLE a SET a.MAX_PROFIT = (SELECT MAX(b.PROFIT)
FROM MY_TABLE b WHERE b.DATE <= a.DATE
AND a.id = b.id)
This is giving me the correct result, but I have millions of rows, for which the query is taking considerable time. Is there a faster way of doing it?
You can use a MERGE statement with an analytic function:
MERGE INTO my_table dst
USING (
SELECT ROWID rid,
MAX( profit ) OVER ( PARTITION BY id ORDER BY "DATE" ) AS max_profit
FROM my_table
) src
ON ( src.rid = dst.ROWID )
WHEN MATCHED THEN
UPDATE SET max_profit = src.max_profit;
When you do something like SELECT MAX(...), you're going to scan all the records implicated in the WHERE part of the query, so you want to make getting all those records as easy on the database as possible.
Do you have an index on the table that includes the id and date columns?
Depending on the behavior of this application, if you're doing a lot fewer updates/inserts (as opposed to doing a ton of reads during reporting or some other process), a possible performance enhancement might be to keep the value you're storing in the max_profit column up to date somewhere while you're changing the data. Have you considered a separate table that just stores the profit calculation for each possible date?
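As a concrete version of the index suggestion above, a composite index along these lines (name is illustrative; "DATE" is quoted to match the MERGE answer) would let each correlated MAX(profit) lookup be answered from the index alone:
CREATE INDEX my_table_id_date_profit_ix ON my_table (id, "DATE", profit);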

Query using Rownum and order by clause does not use the index

I am using Oracle (Enterprise Edition 10g) and I have a query like this:
SELECT * FROM (
SELECT * FROM MyTable
ORDER BY MyColumn
) WHERE rownum <= 10;
MyColumn is indexed; however, Oracle is for some reason doing a full table scan before it cuts off the first 10 rows. So for a table with 4 million records, the above takes around 15 seconds.
Now consider this equivalent query:
SELECT MyTable.*
FROM
(SELECT rid
FROM
(SELECT rowid as rid
FROM MyTable
ORDER BY MyColumn
)
WHERE rownum <= 10
)
INNER JOIN MyTable
ON MyTable.rowid = rid
ORDER BY MyColumn;
Here Oracle scans the index, finds the top 10 rowids, and then uses nested loops to fetch the 10 records by rowid. This takes less than a second for a 4-million-row table.
My first question is: why is the optimizer making such an apparently bad decision for the first query above?
And my second and most important question is: is it possible to make the first query perform better? I have a specific need to use the first query as unmodified as possible, and I am looking for something simpler than my second query above. Thank you!
Please note that for particular reasons I am unable to use the /*+ FIRST_ROWS(n) */ hint, or the ROW_NUMBER() OVER (ORDER BY column) construct.
If this is acceptable in your case, adding a WHERE ... IS NOT NULL clause will help the optimizer to use the index instead of doing a full table scan when using an ORDER BY clause:
SELECT * FROM (
SELECT * FROM MyTable
WHERE MyColumn IS NOT NULL
-- ^^^^^^^^^^^^^^^^^^^^
ORDER BY MyColumn
) WHERE rownum <= 10;
The rationale is that Oracle does not store NULL values in the index. As your query was originally written, the optimizer decided on a full table scan because, if there were fewer than 10 non-NULL values, it would have to retrieve some "NULL rows" to "fill in" the remaining rows. Apparently it is not smart enough to check first whether the index contains enough rows...
With the added WHERE MyColumn IS NOT NULL, you inform the optimizer that you don't want, under any circumstances, any row having NULL in MyColumn. So it can blindly use the index without worrying about hypothetical rows having NULL in MyColumn.
For the same reason, declaring the ORDER BY column as NOT NULL should prevent the optimizer from doing a full table scan. So, if you can change the schema, a cleaner option would be:
ALTER TABLE MyTable MODIFY (MyColumn NOT NULL);
See http://sqlfiddle.com/#!4/e3616/1 for various comparisons (click on view execution plan)

Aggregate functions in WHERE clause in SQLite

Simply put, I have a table with, among other things, a column for timestamps. I want to get the row with the most recent (i.e. greatest value) timestamp. Currently I'm doing this:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 1
But I'd much rather do something like this:
SELECT * FROM table WHERE timestamp=max(timestamp)
However, SQLite rejects this query:
SQL error: misuse of aggregate function max()
The documentation confirms this behavior (bottom of page):
Aggregate functions may only be used in a SELECT statement.
My question is: is it possible to write a query to get the row with the greatest timestamp without ordering the select and limiting the number of returned rows to 1? This seems like it should be possible, but I guess my SQL-fu isn't up to snuff.
SELECT * from foo where timestamp = (select max(timestamp) from foo)
or, if SQLite insists on treating subselects as sets,
SELECT * from foo where timestamp in (select max(timestamp) from foo)
There are many ways to skin a cat.
If you have an identity column with auto-increment functionality, a faster query would be to return the last record by ID, due to the indexing of that column, unless of course you wish to put an index on the timestamp column instead.
SELECT * FROM TABLE ORDER BY ID DESC LIMIT 1
I think I've answered this question 5 times in the past week now, but I'm too tired to find a link to one of those right now, so here it is again...
SELECT *
FROM table T1
LEFT OUTER JOIN table T2
    ON T2.timestamp > T1.timestamp
WHERE T2.timestamp IS NULL
You're basically looking for the row for which no later row exists.
NOTE: As pointed out in the comments, this method will not perform as well in this kind of situation. It will usually work better (for SQL Server at least) in situations where you want the last row for each customer (as an example).
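A sketch of that per-group case, using a hypothetical customer_id column (and, as elsewhere in this thread, "table" stands in for your real table name): the same anti-join keeps only each customer's latest row.
SELECT T1.*
FROM table T1
LEFT OUTER JOIN table T2
    ON T2.customer_id = T1.customer_id
    AND T2.timestamp > T1.timestamp
WHERE T2.timestamp IS NULL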
you can simply do
SELECT *, max(timestamp) FROM table
Edit:
As an aggregate function can't be used like this, it gives an error. I guess what SquareCog had suggested was the best thing to do:
SELECT * FROM table WHERE timestamp = (select max(timestamp) from table)