Optimize GROUP BY after ranged index query - sql

I have a content application that needs to count responses in a time slice, then order them by number of responses. It currently works great with a small data set, but it needs to scale to millions of rows, and my current query won't scale with it.
mysql> describe Responses;
+-------------+---------------------+------+-----+---------+-------+
| Field       | Type                | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+-------+
| site_id     | int(10) unsigned    | NO   | MUL | NULL    |       |
| content_id  | bigint(20) unsigned | NO   | PRI | NULL    |       |
| response_id | bigint(20) unsigned | NO   | PRI | NULL    |       |
| date        | int(10) unsigned    | NO   |     | NULL    |       |
+-------------+---------------------+------+-----+---------+-------+
The table type is InnoDB, and the primary key is on (content_id, response_id). There is an additional index on (content_id, date) used to find responses to a piece of content, and another index on (site_id, date) used in the query I am having trouble with:
mysql> explain SELECT content_id id, COUNT(response_id) num_responses
FROM Responses
WHERE site_id = 1
AND date > 1234567890
AND date < 1293579867
GROUP BY content_id
ORDER BY num_responses DESC
LIMIT 0, 10;
+----+-------------+-----------+-------+---------------+------+---------+------+------+------------------------------------------------------------+
| id | select_type | table     | type  | possible_keys | key  | key_len | ref  | rows | Extra                                                      |
+----+-------------+-----------+-------+---------------+------+---------+------+------+------------------------------------------------------------+
|  1 | SIMPLE      | Responses | range | date          | date | 8       | NULL |  102 | Using where; Using index; Using temporary; Using filesort  |
+----+-------------+-----------+-------+---------------+------+---------+------+------+------------------------------------------------------------+
That's the best I've been able to come up with, but it will end up counting millions of rows and sorting tens of thousands, just to pull in a handful of results.
I can't think of a way to precalculate the count either, as the date range is arbitrary. I have some liberty with changing the primary key: it can be composed of content_id, response_id, and site_id in any order, but cannot contain date.
The application is developed mostly in PHP, so if there is a quicker way to accomplish the same results by splitting the query into subqueries, using temporary tables, or doing things on the application side, I'm open to suggestions.

(Reposted from comments by request)
Set up a table that has three columns: id, date, and num_responses, where num_responses holds the number of responses for the given id on the given date. Backfill the table appropriately, and then at around midnight (or later) each night, run a script that adds a new row for the previous day.
Then, to get the rows you want, you can merely query the table mentioned above.
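A minimal sketch of that idea, with hypothetical table and column names (site_id is carried along so the original filter still works, and the date range becomes day-granular):

CREATE TABLE daily_response_counts (
    site_id       INT UNSIGNED    NOT NULL,
    content_id    BIGINT UNSIGNED NOT NULL,
    day           DATE            NOT NULL,
    num_responses INT UNSIGNED    NOT NULL,
    PRIMARY KEY (site_id, day, content_id)
) ENGINE=InnoDB;

-- Nightly job: aggregate yesterday's responses, one row per content item.
INSERT INTO daily_response_counts (site_id, content_id, day, num_responses)
SELECT site_id, content_id, CURDATE() - INTERVAL 1 DAY, COUNT(*)
FROM Responses
WHERE date >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
  AND date <  UNIX_TIMESTAMP(CURDATE())
GROUP BY site_id, content_id;

-- The original query then reads at most one small row per content item per day:
SELECT content_id id, SUM(num_responses) num_responses
FROM daily_response_counts
WHERE site_id = 1
  AND day BETWEEN '2009-02-13' AND '2010-12-28'
GROUP BY content_id
ORDER BY num_responses DESC
LIMIT 0, 10;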

Rather than recalculating each time, how about caching the calculated count after the first query, and then updating the cache incrementally by restricting the date condition in the WHERE clause to rows newer than the cache?
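A rough sketch of that suggestion, assuming a hypothetical cache table with a unique key on (site_id, content_id) and a last-refreshed watermark tracked by the application:

-- Fold in only the responses that arrived since the last refresh.
INSERT INTO response_count_cache (site_id, content_id, num_responses)
SELECT site_id, content_id, COUNT(*)
FROM Responses
WHERE site_id = 1
  AND date > @last_refreshed  -- watermark kept by the application
GROUP BY content_id
ON DUPLICATE KEY UPDATE num_responses = num_responses + VALUES(num_responses);

Note this maintains a running total from a fixed starting point; on its own it does not answer arbitrary date ranges, which is the constraint raised above.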

Have you considered partitioning the table by date? Are there any indices on the table?
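For illustration, a hypothetical range-partitioning setup. Note the catch: MySQL requires every unique key, including the primary key, to contain the partitioning column, which collides with the constraint above that the primary key cannot contain date.

ALTER TABLE Responses
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (content_id, response_id, date);  -- required by the partitioning rules

ALTER TABLE Responses
    PARTITION BY RANGE (date) (
        PARTITION p2009 VALUES LESS THAN (1262304000),  -- before 2010-01-01 UTC
        PARTITION p2010 VALUES LESS THAN (1293840000),  -- before 2011-01-01 UTC
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );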

Related

How to make postgres search fast

I have a PostgreSQL database with more than 100 billion rows in one table.
The table schema is as follows:
   Column   |            Type             | Collation | Nullable
------------+-----------------------------+-----------+----------
 id_1       | integer                     |           | not null
 id_2       | bigint                      |           | not null
 created_at | timestamp without time zone |           | not null
 id_3       | bigint                      |           |
 char1      | character varying(20)       |           | not null
 lang       | character(6)                |           | not null
 gps        | point                       |           |
 some_dat   | character varying(140)[]    |           |
 JSON       | jsonb                       |           | not null
I'm trying to search inside the JSON object and sort the data by values inside it, but the problem is that the sort takes too much time before the data is returned.
Sorting the data by created_at, for example, also takes a long time.
I'm trying to make my application as close to real-time as I can.
I have two indexes, on id_1 and id_2.
I also tried a materialized view for each id, but the problem is that updating the materialized view takes a long time as well.
Any suggestions, please?
I'm running PostgreSQL 10.3 on a Linux server with an SSD and 128 GB of RAM.
Thanks,
If you want to sort a query result with an expression like this:
ORDER BY expr1, expr2, ...
You need the following index to speed up the sorting:
CREATE INDEX ON atable ((expr1), (expr2), ...);
If that does not work because the expressions contain functions that are not IMMUTABLE, you cannot speed up the sort with an index. In that case, consider rewriting your query with IMMUTABLE expressions.
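Applied to the jsonb case here, that might look like the following sketch, assuming a hypothetical table name (the question doesn't give one) and a numeric score field inside the JSON column:

-- Index the extracted value; the query must repeat the expression exactly.
CREATE INDEX bigtable_score_idx
    ON bigtable ((("JSON" ->> 'score')::numeric));

SELECT id_1, id_2
FROM bigtable
ORDER BY ("JSON" ->> 'score')::numeric DESC
LIMIT 50;

PostgreSQL can scan a B-tree index backwards, so the same index serves both ASC and DESC ordering.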

postgresql update column with least value of column from another table based on condition

I'm trying to run an update query on the column answer_date of a table P. I want to fill each row of P.answer_date with the unique date from the create_date column of H where P.ID1 matches H.ID1 and where P.acceptance_date is not empty.
The query takes a long while to run, so I checked the interim changes in answer_date, but the entire column is still empty, as if it had just been created.
B-tree indexes exist on all the mentioned columns.
Is there something wrong with the query?
UPDATE P
SET answer_date = subquery.date
FROM (SELECT DISTINCT H.create_date AS date
      FROM H, P
      WHERE H.postid = P.acceptance_id
     ) AS subquery
WHERE P.acceptance_id IS NOT NULL;
Table schema is as follows:
Table "public.P"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
acceptance_id | integer | | plain | |
answer_date | timestamp without time zone | | plain | |
Indexes:
"posts_pkey" PRIMARY KEY, btree (id)
"posts_accepted_answer_id_idx" btree (acceptance_id) WITH (fillfactor='100')
and
Table "public.H"
Column | Type | Modifiers | Storage | Stats target | Description
-------------------+-----------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
postid | integer | | plain | |
create_date | timestamp without time zone | not null | plain | |
Indexes:
"H_pkey" PRIMARY KEY, btree (id)
"ph_creation_date_idx" btree (create_date) WITH (fillfactor='100')
Table P has 70 million rows and H has 220 million rows.
The Postgres version is 9.6.
The hardware is a Windows laptop with 8 GB of RAM.
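Two things stand out. First, Postgres never exposes the interim changes of an uncommitted UPDATE to other sessions, so the column looking untouched mid-run is expected. Second, the subquery is not correlated with the row being updated: it cross-joins H and P and can hand every row an arbitrary date. A sketch of the presumably intended form, joining inside the UPDATE itself:

UPDATE P
SET answer_date = H.create_date
FROM H
WHERE H.postid = P.acceptance_id
  AND P.acceptance_id IS NOT NULL;

-- For this join, the index that matters is one on H.postid,
-- which is not among the indexes shown above.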

group by date of timestamp

How can you group by the date portion of a timestamp column in MySQL and still take advantage of an index on that column?
Of course you can use solutions like
select date(thetime) from mytable group by date(thetime)
however, this query will not be able to use an index on thetime for the grouping; it requires a temporary table instead, because the column is transformed by a function before being grouped.
+----+-------------+---------+-------+---------------+---------+---------+------+-------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                        |
+----+-------------+---------+-------+---------------+---------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | mytable | index | NULL          | thetime | 4       | NULL | 48183 | Using index; Using temporary; Using filesort |
+----+-------------+---------+-------+---------------+---------+---------+------+-------+----------------------------------------------+
Theoretically there's no reason why it shouldn't be able to use a range scan on an index on that column without needing a temporary table. Is there any syntax that can persuade the query optimizer to execute the query that way?
From what I can understand, the database already does what you want: the index is being scanned (rather than the table itself), as is evident in the explain plan.
But you cannot get around the fact that an intermediate table is needed to hold the dates during the grouping (distinct) operation.
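On MySQL 5.7 and later (newer than this question), one workaround is to materialize the date into a generated column and index that, so the GROUP BY sees a plain indexed column. A sketch, with hypothetical column and index names:

ALTER TABLE mytable
    ADD COLUMN theday DATE AS (DATE(thetime)) STORED,
    ADD INDEX idx_theday (theday);

-- Groups on an indexed column; no function is applied at query time.
SELECT theday, COUNT(*)
FROM mytable
GROUP BY theday;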

MySQL not using indexes

I just enabled the slow query log (plus log_queries_not_using_indexes) and I'm getting hundreds of entries for the same kind of query (only the user changes):
SELECT id, name
FROM `all`
WHERE id NOT IN (SELECT id
                 FROM `picks`
                 WHERE user = 999)
ORDER BY name ASC;
EXPLAIN gives:
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
| 1 | PRIMARY | all | index | NULL | name | 156 | NULL | 209 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | picks | ref | user,user_2,pick | user_2 | 8 | const,func | 1 | Using where; Using index |
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
Any ideas about how to optimize this query? I've tried a bunch of different indexes on different fields, but nothing helps.
I don't necessarily agree that 'not in' and 'exists' are ALWAYS bad performance choices; however, they could be in this situation.
You might be able to get your results using a much simpler query:
SELECT `all`.id, `all`.name
FROM `all`, `picks`
WHERE `all`.id = picks.id
  AND picks.user <> 999
ORDER BY name ASC;
"not in" and "exists" always bad choices for performance. May be left join with cheking "NULL" will be better try it.
This is probably the best way to write the query: select everything from `all` and try to find a matching row in `picks` that shares the same id and has user = 999. If no such row exists, picks.id will be NULL, because it's a left outer join; the WHERE clause then keeps only those rows.
SELECT `all`.id, `all`.name
FROM `all`
LEFT JOIN picks ON picks.id = `all`.id AND picks.user = 999
WHERE picks.id IS NULL
ORDER BY `all`.name ASC;
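For comparison, the same anti-join can also be written with NOT EXISTS, which, unlike NOT IN, behaves predictably even when picks.id can be NULL:

SELECT a.id, a.name
FROM `all` a
WHERE NOT EXISTS (SELECT 1
                  FROM picks p
                  WHERE p.id = a.id
                    AND p.user = 999)
ORDER BY a.name ASC;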

MySQL Index Being Ignored

EXPLAIN SELECT *
FROM content_link link
STRAIGHT_JOIN content ON link.content_id = content.id
WHERE link.content_id = 1
LIMIT 10;
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
| id | select_type | table   | type  | possible_keys | key        | key_len | ref   | rows | Extra |
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
|  1 | SIMPLE      | link    | ref   | content_id    | content_id | 4       | const |    1 |       |
|  1 | SIMPLE      | content | const | PRIMARY       | PRIMARY    | 4       | const |    1 |       |
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
However, when I remove the WHERE clause, the query stops using the key (even when I explicitly force it to):
EXPLAIN SELECT *
FROM content_link link FORCE KEY (content_id)
STRAIGHT_JOIN content ON link.content_id = content.id
LIMIT 10;
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
| id | select_type | table   | type   | possible_keys | key     | key_len | ref                    | rows    | Extra       |
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
|  1 | SIMPLE      | link    | index  | content_id    | PRIMARY | 7       | NULL                   | 4555299 | Using index |
|  1 | SIMPLE      | content | eq_ref | PRIMARY       | PRIMARY | 4       | ft_dir.link.content_id |       1 |             |
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
Are there any work-arounds for this?
I realize I'm selecting the entire table in the second example, but why does MySQL suddenly decide that it's going to ignore my FORCE and not use the key? Without the key the query takes something like 10 minutes... ugh.
FORCE is a bit of a misnomer. Here's what the MySQL docs say (emphasis mine):
You can also use FORCE INDEX, which acts like USE INDEX (index_list) but with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the given indexes to find rows in the table.
Since you aren't actually "finding" any rows (you are selecting them all), a table scan is always going to be fastest, and the optimizer is smart enough to know that in spite of what you are telling it.
ETA:
Try adding an ORDER BY on the primary key once and I bet it'll use the index.
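A sketch of that suggestion, reading "the primary key" loosely as the indexed column, so that scanning the content_id index in order becomes the cheap way to satisfy the LIMIT:

SELECT *
FROM content_link link FORCE KEY (content_id)
STRAIGHT_JOIN content ON link.content_id = content.id
ORDER BY link.content_id
LIMIT 10;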
An index helps search quickly inside a table, but it just slows things down if you select the entire table. So MySQL is correct in ignoring the index.
In your case, maybe the index has a hidden benefit that's not known to MySQL. For example, if the join matches only a few rows, an index would speed things up. But MySQL can't know that without an explicit hint.
There is an exception: when every column you select is inside the index, the index is still useful even if you select every row. For example, if you have an index on LastName, the following query still benefits from the index:
select LastName from Orders
But this one won't:
select * from Orders
Your content_id column seems to accept NULL values.
The MySQL optimizer thinks there is no guarantee that the query will return all values by using only the index (though actually there is a guarantee, since you use the column in a JOIN).
That's why it reverts to a full table scan.
Either add a NOT NULL condition:
SELECT *
FROM content_link link FORCE KEY (content_id)
STRAIGHT_JOIN content ON content.id = link.content_id
WHERE link.content_id IS NOT NULL
LIMIT 10;
or mark your column as NOT NULL (MODIFY requires restating the full column definition):
ALTER TABLE content_link MODIFY content_id INT UNSIGNED NOT NULL;  -- restate the column's actual type here
Update:
This is verified bug 45314 in MySQL.