Large MyISAM table slow even for non-concurrent inserts/updates

I have a MyISAM table with ~50'000'000 records (tasks for web crawler):
CREATE TABLE `tasks2` (
`id` int(11) NOT NULL auto_increment,
`url` varchar(760) character set latin1 NOT NULL,
`state` varchar(10) collate utf8_bin default NULL,
`links_depth` int(11) NOT NULL,
`sites_depth` int(11) NOT NULL,
`error_text` text character set latin1,
`parent` int(11) default NULL,
`seed` int(11) NOT NULL,
`random` int(11) NOT NULL default '0',
PRIMARY KEY (`id`),
UNIQUE KEY `URL_UNIQUE` (`url`),
KEY `next_random_task` (`state`,`random`)
) ENGINE=MyISAM AUTO_INCREMENT=61211954 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
Once every few seconds, one of the following operations occurs (but never more than one simultaneously):
INSERT ... VALUES (500 rows) - inserts new tasks
UPDATE ... WHERE id IN (up to 10 ids) - updates state for batch of tasks
SELECT ... WHERE (by next_random_task index) - loads batch of tasks for processing
My problem is that the inserts and updates are very slow, taking on the order of tens of seconds and sometimes over a minute. Selects are fast, though. Why might this happen, and how can I improve performance?

~50M rows on regular hardware is a sizable number.
Please go through this question on Server Fault (even though it is written for InnoDB, there are similar parameters for MyISAM).
After that you should start the cycle of
identifying (logging) slow queries to understand your patterns (or confirm your assumptions) - a sketch of enabling the slow query log follows this list
tweaking my.cnf or adding/removing indexes (depending on the patterns)
measuring improvements
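A minimal sketch of the logging step, assuming MySQL/MariaDB 5.1 or later; the 2-second threshold is only an arbitrary starting point:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;   -- log any statement slower than 2 seconds
-- Add the equivalent settings under [mysqld] in my.cnf so they survive a restart.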

EXPLAIN a sample UPDATE against the full table to ensure the primary key index is being used.
Consider changing state to a TINYINT or ENUM to make its index smaller. (ENUM might not actually do this).
Do you need the unique key on url? This will slow down inserts.
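A sketch of those checks against the tasks2 table above; the id values and the state-to-code mapping are made up, and on MySQL versions before 5.6 EXPLAIN only accepts SELECT, so the UPDATE's WHERE clause is expressed as a SELECT:
EXPLAIN SELECT id FROM tasks2 WHERE id IN (101, 102, 103);   -- should report the PRIMARY key
-- Shrink the (state, random) index by storing state as a 1-byte code
-- (e.g. 0 = new, 1 = in progress, 2 = done); the existing string values must be
-- mapped to codes first, and the ALTER rebuilds this 50M-row table:
ALTER TABLE tasks2 MODIFY `state` tinyint unsigned DEFAULT NULL;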

Related

Very slow delete/update in MariaDb ColumnStore Server using primary key

We have a table using the InnoDB engine (on a MariaDB ColumnStore server) that has 5 million rows and a primary key of 2 columns. DDL:
CREATE TABLE `interest` (
`src_id` TINYINT NOT NULL,
`id` binary(36) NOT NULL,
`account_id` bigint(20) NOT NULL,
`instrument_id` int(11) NOT NULL,
`quantity` decimal(15,4) NOT NULL,
`time` datetime NOT NULL,
`interest` decimal(16,2) NOT NULL,
`auto_incr` bigint(20) NOT NULL ,
PRIMARY KEY (`src_id`, `id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and both of these queries take 4-5 seconds to execute:
DELETE FROM interest WHERE `id`='52f35ddb-94d0-4744-bb3c-f3ae8100dc8e' AND `src_id`='1';
UPDATE interest SET interest = 1 WHERE `id`='52f35ddb-94d0-4744-bb3c-f3ae8100dc8e' AND `src_id`='1';
The result of the EXPLAIN is:
EXPLAIN EXTENDED DELETE FROM interest WHERE `id`='52f35ddb-94d0-4744-bb3c-f3ae8100dc8e' AND `src_id`='1';
It looks like the primary key doesn't take effect, because the query scans more than 50% of the rows.
I exported the data from this table (on the MariaDB ColumnStore server) and imported it into a table on a normal MariaDB server, and there is no such problem: both queries execute in under 10 ms.
We are using the latest version of MariaDB ColumnStore: 1.2.5.
ColumnStore uses a slightly customized version of MariaDB, and there is a place where custom code alters optimizer settings, but that only applies to queries that involve ColumnStore tables. FYI, Enterprise 10.4 and community 10.5 will ship ColumnStore as a normal engine on top of the non-customized MariaDB version.

Postgresql Query is very Slow

I have a table with 300000 rows, and when I run a simple query like
select * from diario_det;
it takes 41041 ms to return the rows. Is that normal? How can I optimize the query?
I use PostgreSQL 9.3 on CentOS 7.
Here is my table:
CREATE TABLE diario_det
(
cod_empresa numeric(2,0) NOT NULL,
nro_asiento numeric(8,0) NOT NULL,
nro_secue_pase numeric(4,0) NOT NULL,
descripcion_pase character varying(150) NOT NULL,
monto_debe numeric(16,3),
monto_haber numeric(16,3),
estado character varying(1) NOT NULL,
cod_pcuenta character varying(15) NOT NULL,
cod_local numeric(2,0) NOT NULL,
cod_centrocosto numeric(4,0) NOT NULL,
cod_ejercicio numeric(4,0) NOT NULL,
nro_comprob character varying(15),
conciliado_por character varying(10),
CONSTRAINT fk_diario_det_cab FOREIGN KEY (cod_empresa, cod_local, cod_ejercicio, nro_asiento)
REFERENCES diario_cab (cod_empresa, cod_local, cod_ejercicio, nro_asiento) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_diario_det_pc FOREIGN KEY (cod_empresa, cod_pcuenta)
REFERENCES plan_cuenta (cod_empresa, cod_pcuenta) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=TRUE
);
ALTER TABLE diario_det
OWNER TO postgres;
-- Index: pk_diario_det_ax
-- DROP INDEX pk_diario_det_ax;
CREATE INDEX pk_diario_det_ax
ON diario_det
USING btree
(cod_pcuenta COLLATE pg_catalog."default", cod_local, estado COLLATE pg_catalog."default");
Very roughly, one row is about 231 bytes; times 300000 rows, that is 69,300,000 bytes (~69 MB) that has to be transferred from server to client.
I think 41 seconds is a bit long, but the query is still bound to be slow because of the amount of data that has to be read from disk and transferred.
You can optimise the query by:
selecting just the columns you are going to use, not all of them (if you need just cod_empresa, that would reduce the total amount of transferred data to ~1.2 MB, but the server would still have to iterate through all records - slow)
filtering to only the rows you are going to use - a WHERE clause on indexed columns can really speed the query up (see the sketch at the end of this answer)
If you want to know what is happening in your query, play around with EXPLAIN and EXPLAIN ANALYZE.
Also, if you're running a dedicated database server, be sure to configure it properly so it can use a generous share of the system's resources.
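A small sketch of both suggestions combined, using the columns from the DDL above and the existing pk_diario_det_ax index; the literal filter values are made up:
EXPLAIN ANALYZE
SELECT cod_empresa, nro_asiento, monto_debe, monto_haber
FROM diario_det
WHERE cod_pcuenta = '1.01.001'   -- leading column of the index
  AND cod_local = 1
  AND estado = 'A';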

Doctrine: Models with column_aggregation inheritance appear twice in SQL

Has anyone noticed this?
Whenever a model uses column_aggregation (inheritance), the generated schema.sql contains 2 CREATE TABLE commands for it: one creates the basic table, and the other (apart from the fields) adds an index on the inheritance column.
CREATE TABLE Prop (id INT AUTO_INCREMENT, opt_property_type SMALLINT UNSIGNED DEFAULT 251 NOT NULL, property_nature VARCHAR(255), INDEX opt_property_type_idx (opt_property_type), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = InnoDB;
CREATE TABLE Prop (id INT AUTO_INCREMENT, opt_property_type SMALLINT UNSIGNED DEFAULT 251 NOT NULL, property_nature VARCHAR(255), INDEX Prop_property_nature_idx (property_nature), INDEX opt_property_type_idx (opt_property_type), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = InnoDB;
Note the inclusion of INDEX Prop_property_nature_idx (property_nature) in the second statement
If anyone else is facing this, I will log a bug. Thanks
I just came across this myself. It seems like doctrine:build-sql is buggy.
One of the crazier things I discovered while investigating this is that doctrine:insert-sql doesn't even use schema.sql. It dynamically generates and runs the SQL based on the model definitions.
Looks like this is a known bug and won't be fixed in Doctrine 1:
http://www.doctrine-project.org/jira/browse/DC-123
http://www.doctrine-project.org/jira/browse/DC-536

MySQL query slow when selecting VARCHAR

I have this table:
CREATE TABLE `search_engine_rankings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`keyword_id` int(11) DEFAULT NULL,
`search_engine_id` int(11) DEFAULT NULL,
`total_results` int(11) DEFAULT NULL,
`rank` int(11) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`indexed_at` date DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
KEY `search_engine_rankings_search_engine_id_fk` (`search_engine_id`),
CONSTRAINT `search_engine_rankings_keyword_id_fk` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`id`) ON DELETE CASCADE,
CONSTRAINT `search_engine_rankings_search_engine_id_fk` FOREIGN KEY (`search_engine_id`) REFERENCES `search_engines` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=244454637 DEFAULT CHARSET=utf8
It has about 250M rows in production.
When I do:
select id,
rank
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very quickly.
When I add the url column (VARCHAR):
select id,
rank,
url
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very slowly.
Any ideas?
The first query can be satisfied by the index alone -- no need to read the base table to obtain the values in the Select clause. The second statement requires reads of the base table because the URL column is not part of the index.
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
The rows in the base table are not in the same physical order as the rows in the index, and so the read of the base table can involve considerable disk-thrashing.
You can think of it as a kind of proof of optimization -- on the first query the disk-thrashing is avoided because the engine is smart enough to consult the index for the values requested in the select clause; it will already have read that index into RAM for the where clause, so it takes advantage of that fact.
In addition to Tim's answer: an index in MySQL can only be used left-to-right, which means it can use the columns of your index in your WHERE clause only up to the point where you use them.
Currently, your UNIQUE index is keyword_id, search_engine_id, rank, indexed_at. This can filter on keyword_id and search_engine_id, but it still needs to scan the remaining rows to filter on indexed_at.
But if you change it to keyword_id, search_engine_id, indexed_at, rank (just the order), it will be able to filter on keyword_id, search_engine_id and indexed_at.
I believe it will then be able to fully use that index to read the appropriate part of your table.
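A sketch of that reordering against the DDL above (rebuilding a unique index on ~250M rows is slow, so test it on a copy first):
ALTER TABLE search_engine_rankings
  DROP KEY `unique_ranking`,
  ADD UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`indexed_at`,`rank`);
-- Optionally, a separate non-unique key that also ends with `url` would let the second
-- query be answered from the index alone, at the cost of a much larger index.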
I know it's an old post, but I was experiencing the same situation and couldn't find an answer.
This really happens in MySQL: when you select VARCHAR columns, processing can take a lot of time. My query took about 20 seconds to process 1.7M rows and now takes about 1.9 seconds.
OK, first of all, create a view from this query:
CREATE VIEW view_one AS
select id,rank
from search_engine_rankings
where keyword_id = 19000
and search_engine_id = 11
and indexed_at = "2010-12-03";
Second, same query but with an inner join:
select v.*, s.url
from view_one AS v
inner join search_engine_rankings s ON s.id=v.id;
TL;DR: I solved this by running OPTIMIZE TABLE on the table.
I experienced the same just now. Even lookups on the primary key selecting just a few rows were slow. Testing a bit, I found it was not limited to the varchar column; selecting an int also took a considerable amount of time.
A query roughly looking like this took around 3s:
select someint from mytable where id in (1234, 12345, 123456).
While a query roughly looking like this took <10ms:
select count(*) from mytable where id in (1234, 12345, 123456).
The approved answer here is to just make an index spanning someint as well, and it will be fast, as MySQL can fetch all the information it needs from the index and won't have to touch the table. That probably works in some settings, but I think it's a silly workaround - something is clearly wrong; it should not take three seconds to fetch three rows from a table! Besides, most applications just do a "select * from mytable", and making changes on the application side is not always trivial.
After OPTIMIZE TABLE, both queries take <10 ms.
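For reference, that is simply the following statement, with mytable being the placeholder table name used in this answer (on InnoDB, MySQL implements it as a table recreate plus analyze):
OPTIMIZE TABLE mytable;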

How do you setup Post Revisions/History Tracking with ORM?

I am trying to figure out how to set up a revisions system for posts and other content. I figured that means it needs to work with a basic belongs_to/has_one/has_many/has_many_through ORM (any good ORM should support this).
I was thinking a that I could have some tables like (with matching models)
[[POST]] (has_many text through revisions)
id
title
[[Revisions]] (belongs_to posts/text)
id
post_id
text_id
date
[[TEXT]]
id
body
user_id
Where I could join THROUGH the revisions table to get the latest TEXT body. But I'm kind of foggy on how it will all work. Has anyone set up something like this?
Basically, I need to be able to load an article and request the latest content entry.
// Get the post row
$post = new Model_Post($id);
// Get the latest revision (JOIN through revisions to TEXT) and print that body.
$post->text->body;
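In raw SQL, the join-through described here would look roughly like this; it is only a sketch against the hypothetical post/revisions/text tables above, with a made-up post id:
SELECT t.body
FROM post p
JOIN revisions r ON r.post_id = p.id
JOIN `text` t ON t.id = r.text_id
WHERE p.id = 42
ORDER BY r.date DESC   -- latest revision first
LIMIT 1;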
Having the ability to shuffle back in time to previous revisions and removing revisions would also be a big help.
At any rate, these are just ideas of how I think some kind of history tracking could work. I'm open to any form of tracking; I just want to know what the best practice is.
:EDIT:
It seems that, moving forward, two tables make the most sense. Since I plan to store two copies of the text, this will also help save space. The first table, posts, will store the data of the current revision for fast reads without any joins. The post's body will be the value of the matching revision's text field, but processed through markdown/bbcode/tidy/etc. This allows me to retain the original text (for the next edit) without having to store that text twice in one revision row (or having to re-parse it each time I display it).
So fetching will be ORM-friendly. Then, for creates/updates, I will have to handle revisions separately and just update the post object with the new current-revision values.
CREATE TABLE IF NOT EXISTS `posts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`published` tinyint(1) unsigned DEFAULT NULL,
`allow_comments` tinyint(1) unsigned DEFAULT NULL,
`user_id` int(11) NOT NULL,
`title` varchar(100) NOT NULL,
`body` text NOT NULL,
`created` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `published` (`published`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
CREATE TABLE IF NOT EXISTS `postsrevisions` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`is_current` tinyint(1) unsigned DEFAULT NULL,
`date` datetime NOT NULL,
`title` varchar(100) NOT NULL,
`text` text NOT NULL,
`image` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
KEY `post_id` (`post_id`),
KEY `user_id` (`user_id`),
KEY `is_current` (`is_current`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
Your Revisions table as you have shown it models a many-to-many relationship between Posts and Text. This is probably not what you want, unless a given row in Text may provide the content for multiple rows in Posts. This is not how most CMS architectures work.
You certainly don't need three tables. I have no idea why you think this is needed for 3NF. The point of 3NF is that an attribute should not depend on a non-key attribute, it doesn't say you should split into multiple tables needlessly.
So you might only need a one-to-many relationship between two tables: Posts and Revisions. That is, for each post, there can be multiple revisions, but a given revision applies to only one post. Others have suggested two alternatives for finding the current post:
A flag column in Revisions to mark the current revision. Changing the current revision is as simple as setting the flag to true on the desired revision and to false on the formerly current revision.
A foreign key in Posts to the revision that is current for the given post. This is even simpler, because you can change the current revision in one update instead of two. But circular foreign key references can cause problems vis-a-vis backup & restore, cascading updates, etc.
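A sketch of that second alternative against the posts/postsrevisions tables from the question; the column and constraint names and the id values here are made up:
ALTER TABLE posts
  ADD COLUMN current_revision_id bigint(20) unsigned NULL,
  ADD CONSTRAINT fk_posts_current_revision
    FOREIGN KEY (current_revision_id) REFERENCES postsrevisions (id);
-- Switching the current revision is then a single-row update:
UPDATE posts SET current_revision_id = 1001 WHERE id = 42;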
You could even implement the revision system using a single table:
CREATE TABLE PostRevisions (
post_revision_id SERIAL PRIMARY KEY,
post_id INT NOT NULL,
is_current TINYINT NULL,
date DATE,
title VARCHAR(80) NOT NULL,
text TEXT NOT NULL,
UNIQUE KEY (post_id, is_current)
);
I'm not sure it's duplication to store the title with each revision, because the title could be revised as much as the text, couldn't it?
The column is_current should be either 1 or NULL. A unique constraint doesn't count NULLs, so you can have only one row where is_current is 1 and an unlimited number of rows where it's NULL.
This does require updating two rows to make a revision current, but you gain some simplicity by reducing the model to a single table. This is a great advantage when you're using an ORM.
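For example, switching the current revision of a post is roughly (the id values are made up):
UPDATE PostRevisions SET is_current = NULL WHERE post_id = 42 AND is_current = 1;
UPDATE PostRevisions SET is_current = 1 WHERE post_revision_id = 1001;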
You can create a view to simplify the common case of querying current posts:
CREATE VIEW Posts AS SELECT * FROM PostRevisions WHERE is_current = 1;
Update, re your updated question: I agree that proper relational design would encourage two tables, so that you could make a few attributes of a Post invariant across all of that post's revisions. But most ORM tools assume an entity lives in a single table, and ORMs are clumsy at joining rows from multiple tables to constitute a given entity. So I would say that if using an ORM is a priority, you should store the posts and revisions in a single table, sacrificing a little bit of relational correctness to support the assumptions of the ORM paradigm.
Another suggestion is to consider Dimensional Modeling. This is a school of database design to support OLAP and data warehousing. It uses denormalization judiciously, so you can usually organize data in a Star Schema. The main entity (the "Fact Table") is represented by a single table, so this would be a win for an ORM-centric application design.
You'd probably be better off in this case putting a CurrentTextID on your Post table, to avoid having to figure out which revision is current (an alternative would be a flag on Revision, but I think a CurrentTextID on the Post will give you easier queries).
With the CurrentTextID on the Post, your ORM should place a single property (CurrentText) on your Post class, which would allow you to access the current text with essentially the statement you provided.
Your ORM should also give you some way to load the Revisions based on the Post; if you want more details about that, you should include information about which ORM you are using and how you have it configured.
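A CurrentTextID column, against the original three-table idea from the question, could look roughly like this (the column and constraint names are assumptions, and the foreign key is only enforced on InnoDB):
ALTER TABLE post
  ADD COLUMN current_text_id int NULL,
  ADD CONSTRAINT fk_post_current_text FOREIGN KEY (current_text_id) REFERENCES `text` (id);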
I think two tables would suffice here: a post table and its revisions. If you're not worried about duplicating data, a single (de-normalized) table could also work.
For anyone interested, here is how WordPress handles revisions using a single MySQL posts table.
CREATE TABLE IF NOT EXISTS `wp_posts` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_author` bigint(20) unsigned NOT NULL DEFAULT '0',
`post_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_date_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_content` longtext NOT NULL,
`post_title` text NOT NULL,
`post_excerpt` text NOT NULL,
`post_status` varchar(20) NOT NULL DEFAULT 'publish',
`comment_status` varchar(20) NOT NULL DEFAULT 'open',
`ping_status` varchar(20) NOT NULL DEFAULT 'open',
`post_password` varchar(20) NOT NULL DEFAULT '',
`post_name` varchar(200) NOT NULL DEFAULT '',
`to_ping` text NOT NULL,
`pinged` text NOT NULL,
`post_modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_modified_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`post_content_filtered` text NOT NULL,
`post_parent` bigint(20) unsigned NOT NULL DEFAULT '0',
`guid` varchar(255) NOT NULL DEFAULT '',
`menu_order` int(11) NOT NULL DEFAULT '0',
`post_type` varchar(20) NOT NULL DEFAULT 'post',
`post_mime_type` varchar(100) NOT NULL DEFAULT '',
`comment_count` bigint(20) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`),
KEY `post_name` (`post_name`),
KEY `type_status_date` (`post_type`,`post_status`,`post_date`,`ID`),
KEY `post_parent` (`post_parent`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ;