I have a PostgreSQL database with more than 100 billion rows in one table.
The table schema is as follows:
 id_1       | integer                     |  | not null |
 id_2       | bigint                      |  | not null |
 created_at | timestamp without time zone |  | not null |
 id_3       | bigint                      |  |          |
 char1      | character varying(20)       |  | not null |
 lang       | character(6)                |  | not null |
 gps        | point                       |  |          |
 some_dat   | character varying(140)[]    |  |          |
 JSON       | jsonb                       |  | not null |
I'm trying to search inside the JSON object and sort the data by values inside it, but the problem is that sorting and returning the data takes too long.
Sorting by created_at, for example, also takes a long time.
I'm trying to make my application as close to real-time as I can.
I have two indexes, on id_1 and id_2.
I also tried a materialized view for each id, but the problem is that refreshing the materialized view also takes a long time.
Any suggestions please?
I'm running PostgreSQL 10.3, on a Linux server with SSD and 128 GB of ram.
Thanks,
If you want to sort a query result with an expression like this:
ORDER BY expr1, expr2, ...
You need the following index to speed up the sorting:
CREATE INDEX ON atable ((expr1), (expr2), ...);
If that does not work because the expressions contain functions that are not IMMUTABLE, you cannot speed up the sort with an index. In that case, consider rewriting your query with IMMUTABLE expressions.
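For example, here is a minimal sketch for the jsonb case above. The table name big_table and the 'score' key are assumptions about your data; the index expression must match the ORDER BY expression exactly:

-- Hypothetical: assumes the jsonb column "JSON" contains a numeric "score" key.
CREATE INDEX ON big_table ((("JSON" ->> 'score')::numeric), created_at);

-- PostgreSQL can now read rows in sorted order from the index instead of
-- sorting them, which matters most when combined with a LIMIT:
SELECT *
FROM big_table
ORDER BY ("JSON" ->> 'score')::numeric, created_at
LIMIT 100;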
Possible Duplicate:
Why is SELECT * considered harmful?
Probably a database noob question.
Our application has a table like the following
TABLE WF
+--------------------+-------------+------+-----+---------+----------------+
| Field              | Type        | Null | Key | Default | Extra          |
+--------------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| children | text | YES | | NULL | |
| w_id | int(11) | YES | | NULL | |
| f_id | int(11) | YES | | NULL | |
| filterable | tinyint(1) | YES | | 1 | |
| created_at | datetime | YES | | NULL | |
| updated_at | datetime | YES | | NULL | |
| status | smallint(6) | YES | | 1 | |
| visible | tinyint(1) | YES | | 1 | |
| weight | int(11) | YES | | NULL | |
| root | tinyint(1) | YES | | 0 | |
| mfr | tinyint(1) | YES | | 0 | |
+--------------------+-------------+------+-----+---------+----------------+
This table is expected to be upwards of ten million records. The schema is not expected to change much. I need to retrieve the columns f_id, children, status, visible, weight, root, mfr.
Which approach is faster for data retrieval?
1) Select * from WF where w_id = 1 AND status = 1;
I will strip the unnecessary columns in the application layer.
2) Select children,f_id,status,visible,weight,root,mfr from WF where w_id = 1 AND status = 1;
There is no need to strip the unnecessary columns as its pre-selected in the query.
Does anyone have a real-life benchmark as to which is faster? I know some say Select * is evil, but will MySQL respond faster when grabbing the whole chunk, as opposed to retrieving selective columns?
I am using MySQL version: 5.1.37-1ubuntu5 (Ubuntu) and the application is Rails3 app.
One reason a select statement that includes only a subset of columns can be significantly faster: the query may be answerable from a covering index that includes just those columns, potentially resulting in much better performance.
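For instance, a minimal sketch against the WF table above (the index name is made up, and the TEXT column children is left out because MySQL cannot fully include a TEXT column in an index):

-- Hypothetical covering index: every column the query touches is in the
-- index, so EXPLAIN shows "Using index" and the table rows are never read.
CREATE INDEX ix_wf_cover ON WF (w_id, status, f_id, visible, weight, root, mfr);

SELECT f_id, status, visible, weight, root, mfr
FROM WF
WHERE w_id = 1 AND status = 1;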
If you return fewer columns, there is less data to go across the network and less data for the database to process, so it will almost always return faster. Databases also tend to be slower with select * because they first have to figure out what the columns are, which is more work than when you specify them. Furthermore, select * will often return bad results if the structure changes significantly: it may end up showing the user fields you don't want them to see, and if someone is silly enough to rearrange the columns, the application may show things in the wrong order, or an insert driven by that data may put values in the wrong columns. It is almost always poor practice to use select * in production code.
I have a content application that needs to count responses in a time slice, then order them by number of responses. It currently works great with a small data set, but needs to scale to millions of rows. My current query won't work.
mysql> describe Responses;
+---------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| site_id | int(10) unsigned | NO | MUL | NULL | |
| content_id | bigint(20) unsigned | NO | PRI | NULL | |
| response_id | bigint(20) unsigned | NO | PRI | NULL | |
| date | int(10) unsigned | NO | | NULL | |
+---------------+---------------------+------+-----+---------+-------+
The table type is InnoDB, and the primary key is on (content_id, response_id). There is an additional index on (content_id, date) used to find responses to a piece of content, and another index on (site_id, date) used in the query I am having trouble with:
mysql> explain SELECT content_id id, COUNT(response_id) num_responses
FROM Responses
WHERE site_id = 1
AND date > 1234567890
AND date < 1293579867
GROUP BY content_id
ORDER BY num_responses DESC
LIMIT 0, 10;
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
| 1 | SIMPLE | Responses | range | date | date | 8 | NULL | 102 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
That's the best I've been able to come up with, but it will end up needing to count millions of rows, then sort tens of thousands, just to pull in a handful of results.
I can't think of a way to precalculate the count either, as the date range is arbitrary. I have some liberty with changing the primary key: it can be composed of content_id, response_id, and site_id in any order, but cannot contain date.
The application is developed mostly in PHP, so if there is a quicker way to accomplish the same results by splitting the query into subqueries, using temporary tables, or doing things on the application side, I'm open to suggestions.
(Reposted from comments by request)
Set up a table that has three columns: id, date, and num_responses. The column num_responses consists of the number of responses for the given id on the given date. Backfill the table appropriately, and then at around midnight (or later) each night, run a script that adds a new row for the previous day.
Then, to get the rows you want, you can merely query the table mentioned above.
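A sketch of that idea (the table and column names are made up, and date in Responses is a Unix timestamp, per the schema above):

-- Hypothetical summary table: one row per site/content/day.
CREATE TABLE DailyResponseCounts (
    site_id       INT UNSIGNED    NOT NULL,
    content_id    BIGINT UNSIGNED NOT NULL,
    day           DATE            NOT NULL,
    num_responses INT UNSIGNED    NOT NULL,
    PRIMARY KEY (site_id, content_id, day)
);

-- Nightly job: aggregate yesterday's responses into the summary table.
INSERT INTO DailyResponseCounts (site_id, content_id, day, num_responses)
SELECT site_id, content_id, DATE(FROM_UNIXTIME(date)), COUNT(response_id)
FROM Responses
WHERE date >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
  AND date <  UNIX_TIMESTAMP(CURDATE())
GROUP BY site_id, content_id, DATE(FROM_UNIXTIME(date));

The arbitrary date range then becomes a sum over whole days in the much smaller table, which is cheap to group and sort.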
Rather than recalculating on every request, how about caching the count from the last query, and then adding only the increment since then by putting a date condition in the WHERE clause that covers just the new rows?
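A sketch of that incremental approach (:last_seen is a placeholder for the timestamp of the previous refresh, which the application would store alongside the cached counts):

-- Count only the responses that arrived since the cached snapshot,
-- then add these deltas to the cached per-content totals in PHP.
SELECT content_id, COUNT(response_id) AS new_responses
FROM Responses
WHERE site_id = 1
  AND date > :last_seen
GROUP BY content_id;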
Have you considered partitioning the table by date? Are there any indices on the table?
I'm encountering a strange behavior of MySQL.
Query execution (i.e. the usage of indexes, as shown by EXPLAIN) and execution time depend on the elements of the select clause.
Here is a query where the problem occurs:
select distinct
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1, ent_leng el1, rel_c r1, _tax_c t1, rel_c r2, _tax_c t2
where el1.fk_ent=e1.idx
and r1.fk_ent=e1.idx and ((r1.fk_cat=43) or (r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and r2.fk_ent=e1.idx and ((r2.fk_cat=10) or (r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
The corresponding explain output is:
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+-------------------------------------------------------+
| id | select_type | table | type   | possible_keys           | key     | key_len | ref           | rows  | Extra                                                 |
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+-------------------------------------------------------+
|  1 | SIMPLE      | el1   | index  | fk_ent                  | fk_ent  | 4       | NULL          | 15002 | Using index; Using temporary                          |
|  1 | SIMPLE      | e1    | eq_ref | PRIMARY                 | PRIMARY | 4       | DB.el1.fk_ent |     1 | Using index                                           |
|  1 | SIMPLE      | r1    | ref    | fk_ent,fk_cat,fks       | fks     | 4       | DB.e1.idx     |     1 | Using where; Using index                              |
|  1 | SIMPLE      | r2    | ref    | fk_ent,fk_cat,fks       | fks     | 4       | DB.el1.fk_ent |     1 | Using index                                           |
|  1 | SIMPLE      | t1    | index  | fk_cat1,fk_cat2,fk_cats | fk_cats | 8       | NULL          |    69 | Using where; Using index; Distinct; Using join buffer |
|  1 | SIMPLE      | t2    | index  | fk_cat1,fk_cat2,fk_cats | fk_cats | 8       | NULL          |    69 | Using where; Using index; Distinct; Using join buffer |
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+-------------------------------------------------------+
As you can see, each one-column index has the same name as the column it belongs to. I also added some useless indexes along with the used ones, just to see if they change the execution (which they don't).
The execution takes ~4.5 seconds.
When I add the column el1.name to the select part (nothing else changed), the index fk_ent on el1 can no longer be used:
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------+
|  1 | SIMPLE      | el1   | ALL  | fk_ent        | NULL | NULL    | NULL | 15002 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------+
The execution now takes ~8.5 seconds.
I always thought that the select part of a query does not influence the usage of indexes by the engine and doesn't affect performance in such a way.
Leaving out the attribute isn't a solution, and there are even more attributes that I have to select.
Even worse, the actual query is a bit more complex still, which makes the performance issue a big problem.
So my questions are:
1) What is the reason for this strange behavior?
2) How can I solve the performance problem?
Thanks for your help!
Gred
It's the DISTINCT restriction. You can think of that as another WHERE restriction. When you change the select list, you are really changing the WHERE clause for the DISTINCT restriction, and now the optimizer decides that it has to do a table scan anyway, so it might as well not use your index.
EDIT:
Not sure if this helps, but if I am understanding your data correctly, I think you can get rid of the DISTINCT restriction like this:
SELECT e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
FROM ent e1
INNER JOIN ent_leng el1 ON el1.fk_ent = e1.idx
INNER JOIN rel_c r1 ON r1.fk_ent = e1.idx
INNER JOIN rel_c r2 ON r2.fk_ent = e1.idx
WHERE (r1.fk_cat = 43
       OR EXISTS (SELECT 1 FROM _tax_c t1
                  WHERE r1.fk_cat = t1.fk_cat1 AND t1.fk_cat2 = 43))
  AND (r2.fk_cat = 10
       OR EXISTS (SELECT 1 FROM _tax_c t2
                  WHERE r2.fk_cat = t2.fk_cat1 AND t2.fk_cat2 = 10))
MySQL will return data directly from an index when possible, saving the entire row from being loaded. In this way, the selected columns can influence index selection.
With this in mind, it can be much more efficient to add all required columns to an index, especially when only a small subset of columns is selected.
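Applied to the question above, one option (a sketch only; if name is a TEXT column it could only be indexed with a prefix length, which prevents full covering) is to extend the index so it also covers the newly selected column:

-- Hypothetical: widen the index so selecting el1.name no longer
-- forces MySQL to read the full rows of ent_leng.
CREATE INDEX fk_ent_name ON ent_leng (fk_ent, name);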
I'm building a MySQL database which contains entries about special substrings of DNA in species of yeast. My table looks like this:
+--------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------+------+-----+---------+-------+
| species | text | YES | MUL | NULL | |
| region | text | YES | MUL | NULL | |
| gene | text | YES | MUL | NULL | |
| startPos | int(11) | YES | | NULL | |
| repeatLength | int(11) | YES | | NULL | |
| coreLength | int(11) | YES | | NULL | |
| sequence | text | YES | MUL | NULL | |
+--------------+---------+------+-----+---------+-------+
There are approximately 1.8 million records. In one type of query I want to see how many DNA substrings are associated with each type of species and region, so I issue this query:
select species, region, count(*) from mytablename group by species, region;
The species and region columns have only two possible values each (conserved/scer for species, and promoter/coding for region), yet this query takes about 30 seconds.
Is this a normal amount of time to expect for this type of query given the size of the table? Is it slow because I'm using text fields instead of simple integer or boolean values? (I prefer text fields as several non-CS researchers will be using the DB.) Any other ideas and suggestions would be welcome.
Please excuse if this is a boneheaded question, I am an SQL neophyte.
P.S. I've also seen this question but the proposed solution doesn't seem relevant for what I'm doing.
EDIT: Converting those fields to VARCHARs reduced the runtime to ~2.5 seconds. Note I also timed it against ENUMs which had a similar timing.
Why are all your string-based columns defined as TEXT? If you read the performance comparison, you'll see that TEXT was ~3x slower than a VARCHAR column using identical indexing: http://forums.mysql.com/read.php?24,105964,105964
If your fields are only ever going to have 2 values, you're much better off making them booleans. You should also make everything NOT NULL unless there's a real reason you'll need it to be NULL.
Also take a look at the ENUM type for a better way to use a finite number of human-readable values for a column.
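A sketch of both suggestions combined, using the two values from the question (mytablename is a placeholder, as the table name was not given):

-- Hypothetical: ENUMs keep the values human-readable for the researchers
-- while storing them as small integers internally; NOT NULL per the above.
ALTER TABLE mytablename
    MODIFY species ENUM('conserved', 'scer')  NOT NULL,
    MODIFY region  ENUM('promoter', 'coding') NOT NULL;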
As for the slowness, the first thing to try is to create indexes on your columns. For the particular query you're showing here, an index on (species, region) should make a huge difference:
create index species_region on mytablename (species, region);
should do it. (Note that MySQL requires the index to be named, and TEXT columns can only be indexed with a prefix length, which is one more reason to convert them to VARCHAR or ENUM.)
Explain
SELECT `feed_objects`.*
FROM `feed_objects`
WHERE (`feed_objects`.feed_id IN
( 165,160,159,158,157,153,152,151,150,149,148,147,129,128,127,126,125,124,122,
121, 120,119,118,117,116,115,114,113,111,110)) ;
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | feed_objects | ALL | by_feed_id | NULL | NULL | NULL | 188 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
The index by_feed_id is not used.
But when I specify fewer values in the WHERE clause, everything works as expected:
Explain
SELECT `feed_objects`.*
FROM `feed_objects`
WHERE (`feed_objects`.feed_id IN
(165,160,159,158,157,153,152,151,150,149,148,147,129,128,127,125,124)) ;
+----+-------------+--------------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+------------+---------+------+------+-------------+
| 1 | SIMPLE | feed_objects | range | by_feed_id | by_feed_id | 9 | NULL | 18 | Using where |
+----+-------------+--------------+-------+---------------+------------+---------+------+------+-------------+
The index by_feed_id is used.
What is the problem?
The MySQL optimizer makes a lot of decisions that sometimes look strange. In this particular case, I believe that you have a very small table (188 rows total from the looks of the first EXPLAIN), and that is affecting the optimizer's decision.
The "How MySQL Uses Indexes" manual pages offers this snippet of info:
Sometimes MySQL does not use an index,
even if one is available. One
circumstance under which this occurs
is when the optimizer estimates that
using the index would require MySQL to
access a very large percentage of the
rows in the table. (In this case, a
table scan is likely to be much faster
because it requires fewer seeks.)
Because the number of ids in your first WHERE clause is relatively large compared to the table size, MySQL has determined that it is faster to scan the data than to consult the index, as the data scan is likely to result in less disk access time.
You could test this by adding rows to the table and re-running the first EXPLAIN query. At a certain point, MySQL will start using the index for it as well as the second query.
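If you would rather not add rows just to test, MySQL's index hints let you compare both plans directly; this is a diagnostic sketch, not a production recommendation:

-- Force the index and compare timings and EXPLAIN output
-- against the original, hint-free query.
SELECT `feed_objects`.*
FROM `feed_objects` FORCE INDEX (by_feed_id)
WHERE `feed_objects`.feed_id IN
    (165,160,159,158,157,153,152,151,150,149,148,147,129,128,127,126,
     125,124,122,121,120,119,118,117,116,115,114,113,111,110);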