MySQL query optimisation

I have a database table that stores imported information. For simplicity, it's something like:
CREATE TABLE `import_data` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`amount` DECIMAL(12,2) NULL DEFAULT NULL,
`payee` VARCHAR(50) NULL DEFAULT NULL,
`posted` TINYINT(1) NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
INDEX `payee` (`payee`)
)
I also have a table that stores import rules:
CREATE TABLE `import_rules` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`search` VARCHAR(50) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
INDEX `search` (`search`)
)
The idea is that for each imported transaction, the query needs to try to find a single matching rule - this match is done on the import_data.payee and import_rules.search fields. Because these are both varchar fields, I have indexed them in the hope of making the query faster.
This is what I have come up with so far, which seems to work fine, albeit slower than I hoped:
SELECT id.id, id.payee, id.amount, id.posted, ir.id, ir.search
FROM import_data id
LEFT JOIN import_rules ir ON REPLACE(id.payee, ' ', '') = REPLACE(ir.search, ' ', '')
One thing that the above query does not cater for is that if import_data.posted = 1, then I don't need to find a rule for that line - is it possible to stop the query joining on that particular row? Similarly, if the payee is null, then it shouldn't try to join either.
Are there any other ways that I can optimise this? I realise that doing text joins is not ideal...not sure if there are any better methods.

I highly recommend doing anything you can to get rid of the REPLACEs in that JOIN. Using REPLACE on both sides of the join totally eliminates the ability to use an index on either table.
Assuming you can get rid of the REPLACEs (by cleansing the existing data and/or new data):
If you need to join on text columns, use a single-byte-per-character charset if your application allows for it (for a smaller/faster index).
Make the N in VARCHAR(N) as small as you can, as it will affect the size of the index (or, arguably, use index prefixes).
I imagine you want to make the search index on import_rules UNIQUE -- then you're sure to get only 1 row returned per row of import_data.
You can throw an extra AND into your join's ON clause if you'd like to enforce your 'don't join in this case' rule.
LEFT JOIN import_rules ir ON id.payee=ir.search AND id.posted != 1
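Putting that together with the payee check from the question, a sketch of the full query (assuming the data has been cleansed so the REPLACEs are no longer needed):
SELECT id.id, id.payee, id.amount, id.posted, ir.id AS rule_id, ir.search
FROM import_data id
LEFT JOIN import_rules ir
       ON id.payee = ir.search
      AND id.posted != 1
      AND id.payee IS NOT NULL;
Strictly speaking the IS NOT NULL test is redundant (a NULL payee can never equal ir.search), but it documents the intent.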

The use of REPLACE() on the join is probably breaking the indexing, as the index stores the values in the field, not the amended values after REPLACE().
As for not joining, you are already using a LEFT JOIN, so non-matching joins will result in NULLs for the import_rules fields; you should be able to add WHERE clauses to force that.

Related

MS Access last() results are sometimes wrong

I have an MS Access query that uses last() but sometimes it doesn't work as expected--which I know is what's expected lol. But I need to find a solution, either in Access or by converting the below query to MySQL. Any suggestions?
SELECT maindata.TrendShort, Last(maindata.Resistance) AS LastOfResistance, Last(maindata.Support) AS LastOfSupport, Count(maindata.ID) AS Days, Max(maindata.Datestamp) AS Datestamp, maindata.ProductID
FROM market_opinion AS maindata
WHERE (((Exists (select * from market_opinion action_count where maindata.ProductID = action_count.ProductID and maindata.Datestamp < action_count.Datestamp and maindata.TrendShort<> action_count.TrendShort))=False))
GROUP BY maindata.TrendShort, maindata.ProductID
ORDER BY Count(maindata.ID) DESC;
Only the LastOfResistance and LastOfSupport are occasionally wrong; the other fields are always correct.
CREATE TABLE `market_opinion` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`ProductID` int(11) DEFAULT NULL,
`Trend` varchar(11) DEFAULT NULL,
`TrendShort` varchar(7) DEFAULT NULL,
`Resistance` decimal(9,2) unsigned DEFAULT NULL,
`Support` decimal(9,2) unsigned DEFAULT NULL,
`Username` varchar(12) DEFAULT NULL,
`Datestamp` date DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `ProductID` (`ProductID`),
KEY `Datestamp` (`Datestamp`),
KEY `TrendShort` (`TrendShort`)
) ENGINE=InnoDB AUTO_INCREMENT=9536 DEFAULT CHARSET=utf8;
Without some feel for the data, this becomes something of a guess, but what I'm thinking is that the Last(Resistance) and the Last(Support) aren't necessarily pulling from the same record as Max(DateStamp). You might try breaking your query out into a 2 part query, such as:
SELECT maindata.TrendShort, maindata.Resistance, maindata.Support, COUNT(maindata.ID) AS Days, maindata.ProductID
FROM market_opinion AS maindata
INNER JOIN (SELECT mo.TrendShort, mo.ProductID, MAX(mo.Datestamp) AS MaxDate
            FROM market_opinion mo
            WHERE (((EXISTS(SELECT ...))=FALSE))
            GROUP BY mo.TrendShort, mo.ProductID) latest
        ON maindata.TrendShort = latest.TrendShort
       AND maindata.ProductID = latest.ProductID
       AND maindata.Datestamp = latest.MaxDate
GROUP BY maindata.TrendShort, maindata.Resistance, maindata.Support, maindata.ProductID
ORDER BY Days DESC;
I've left out the bulk of your query where the ellipsis (...) is; I wouldn't expect any changes there. You might consider looking at http://www.access-programmers.co.uk/forums/showthread.php?t=42291 for a discussion of first/last vs min/max. Let me know if this gets you any closer. If not, maybe post some sample data where it's not working out; it might give some insight into what's going on.
First and Last just return some (arbitrary) record, not necessarily the first or last respectively.
In most cases, simply use Min for the first and Max for the last.
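Where the column itself is ordered, such as Datestamp, the translation is direct; a minimal MySQL sketch:
SELECT ProductID, TrendShort, MAX(Datestamp) AS LastDatestamp
FROM market_opinion
GROUP BY ProductID, TrendShort;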

How to make the Primary Key have X digits in PostgreSQL?

I am fairly new to SQL but have been working hard to learn. I am currently stuck on an issue with setting a primary key to have 8 digits no matter what.
I tried using INT(8) but that didn't work. Also, AUTO_INCREMENT doesn't work in PostgreSQL; I saw there were a couple of data types that auto-increment, but I still have the issue of the keys not being long enough.
Basically I want to have numbers represent User IDs, starting at 10000000 and moving up. 00000001 and up would work too, it doesn't matter to me.
I saw an answer that was close to this, but it didn't apply to PostgreSQL unfortunately.
Hopefully my question makes sense, if not I'll try to clarify.
My code (which I am using from a website to try and make my own forum for a practice project) is:
CREATE Table users (
user_id INT(8) NOT NULL AUTO_INCREMENT,
user_name VARCHAR(30) NOT NULL,
user_pass VARCHAR(255) NOT NULL,
user_email VARCHAR(255) NOT NULL,
user_date DATETIME NOT NULL,
user_level INT(8) NOT NULL,
UNIQUE INDEX user_name_unique (user_name),
PRIMARY KEY (user_id)
) TYPE=INNODB;
It doesn't work in PostgreSQL (9.4 Windows x64 version). What do I do?
You are mixing two aspects:
the data type allowing certain values for your PK column
the format you choose for display
AUTO_INCREMENT is a non-standard concept of MySQL, SQL Server uses IDENTITY(1,1), etc.
Use a serial column in Postgres:
CREATE TABLE users (
user_id serial PRIMARY KEY
, ...
)
That's a pseudo-type implemented as integer data type with a column default drawing from an attached SEQUENCE. integer is easily big enough for your case (-2147483648 to +2147483647).
If you really need to enforce numbers with a maximum of 8 decimal digits, add a CHECK constraint:
CONSTRAINT id_max_8_digits CHECK (user_id BETWEEN 0 AND 99999999)
To display the number in any fashion you desire - 0-padded to 8 digits in your case - use to_char():
SELECT to_char(user_id, '00000000') AS user_id_8digit
FROM users;
That's very fast. Note that the output is text now, not integer.
A couple of other things are MySQL-specific in your code:
int(8): use int.
datetime: use timestamp.
TYPE=INNODB: just drop that.
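Putting the pieces together, the OP's table might look like this in Postgres (a sketch):
CREATE TABLE users (
  user_id    serial PRIMARY KEY,
  user_name  varchar(30) NOT NULL UNIQUE,
  user_pass  varchar(255) NOT NULL,
  user_email varchar(255) NOT NULL,
  user_date  timestamp NOT NULL,
  user_level integer NOT NULL
);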
You could make user_id a serial type column and set the seed of this sequence to 10000000.
Why?
int(8) in MySQL doesn't actually store only 8 digits; the 8 is just a display width
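A minimal sketch of that (the sequence name users_user_id_seq is the default Postgres generates for a serial column on this table):
CREATE TABLE users (
  user_id serial PRIMARY KEY,
  user_name varchar(30) NOT NULL
);
ALTER SEQUENCE users_user_id_seq RESTART WITH 10000000;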
Postgres supports check constraints. You could use something like this:
create table foo (
bar_id int primary key check ( 9999999 < bar_id and bar_id < 100000000 )
);
If this is for numbering important documents like invoices that shouldn't have gaps, then you shouldn't be using sequences / auto_increment

MySQL query slow when selecting VARCHAR

I have this table:
CREATE TABLE `search_engine_rankings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`keyword_id` int(11) DEFAULT NULL,
`search_engine_id` int(11) DEFAULT NULL,
`total_results` int(11) DEFAULT NULL,
`rank` int(11) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`indexed_at` date DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
KEY `search_engine_rankings_search_engine_id_fk` (`search_engine_id`),
CONSTRAINT `search_engine_rankings_keyword_id_fk` FOREIGN KEY (`keyword_id`) REFERENCES `keywords` (`id`) ON DELETE CASCADE,
CONSTRAINT `search_engine_rankings_search_engine_id_fk` FOREIGN KEY (`search_engine_id`) REFERENCES `search_engines` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=244454637 DEFAULT CHARSET=utf8
It has about 250M rows in production.
When I do:
select id,
rank
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very quickly.
When I add the url column (VARCHAR):
select id,
rank,
url
from search_engine_rankings
where keyword_id = 19
and search_engine_id = 11
and indexed_at = "2010-12-03";
...it runs very slowly.
Any ideas?
The first query can be satisfied by the index alone -- no need to read the base table to obtain the values in the Select clause. The second statement requires reads of the base table because the URL column is not part of the index.
UNIQUE KEY `unique_ranking` (`keyword_id`,`search_engine_id`,`rank`,`indexed_at`),
The rows in the base table are not in the same physical order as the rows in the index, and so the read of the base table can involve considerable disk-thrashing.
You can think of it as a kind of proof of optimization -- on the first query the disk-thrashing is avoided because the engine is smart enough to consult the index for the values requested in the select clause; it will already have read that index into RAM for the where clause, so it takes advantage of that fact.
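Purely to illustrate the covering-index point, you could extend an index to include url so the second query can also be satisfied from the index alone; a sketch (on a 250M-row table this makes the index much larger, so it may not be worth it):
ALTER TABLE search_engine_rankings
  ADD KEY covering_ranking (keyword_id, search_engine_id, indexed_at, `rank`, url);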
In addition to Tim's answer: an index in MySQL can only be used left-to-right, which means it can use the columns of your index for the WHERE clause only up to the first indexed column you don't filter on.
Currently, your UNIQUE index is keyword_id, search_engine_id, rank, indexed_at. It can filter on keyword_id and search_engine_id, but because rank comes before indexed_at and is not in your WHERE clause, the remaining rows still have to be scanned to filter on indexed_at.
But if you change it to keyword_id, search_engine_id, indexed_at, rank (just the order), it will be able to filter on keyword_id, search_engine_id and indexed_at.
I believe it will then be able to fully use that index to read the appropriate part of your table.
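A sketch of the reordered key (assuming you can afford to rebuild an index on a table this size):
ALTER TABLE search_engine_rankings
  DROP KEY unique_ranking,
  ADD UNIQUE KEY unique_ranking (keyword_id, search_engine_id, indexed_at, `rank`);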
I know it's an old post, but I was experiencing the same situation and I didn't find an answer.
This really happens in MySQL: when you select varchar columns, the query can take a lot longer. My query took about 20 sec to process 1.7M rows and now takes about 1.9 sec.
OK, first of all, create a view from this query:
CREATE VIEW view_one AS
select id,rank
from search_engine_rankings
where keyword_id = 19000
and search_engine_id = 11
and indexed_at = "2010-12-03";
Second, same query but with an inner join:
select v.*, s.url
from view_one AS v
inner join search_engine_rankings s ON s.id=v.id;
TLDR: I solved this by running OPTIMIZE TABLE on the table.
I experienced the same just now. Even lookups on the primary key selecting just a few rows were slow. Testing a bit, I found it was not limited to the varchar column; selecting an int also took a considerable amount of time.
A query roughly looking like this took around 3s:
select someint from mytable where id in (1234, 12345, 123456).
While a query roughly looking like this took <10ms:
select count(*) from mytable where id in (1234, 12345, 123456).
The accepted answer here is to just make an index spanning someint as well, and it will be fast, as MySQL can fetch all the information it needs from the index and won't have to touch the table. That probably works in some settings, but I think it's a silly workaround - something is clearly wrong; it should not take three seconds to fetch three rows from a table! Besides, most applications just do a "select * from mytable", and making changes on the application side is not always trivial.
After OPTIMIZE TABLE, both queries take <10ms.
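For reference, that is simply (mytable being the poster's placeholder name):
OPTIMIZE TABLE mytable;
On InnoDB this maps to a table rebuild plus ANALYZE, which refreshes the table layout and index statistics.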

Constraints in SQL Database

I need to have a table in T-SQL which will have the following structure
KEY   Various_Columns   Flag
1     row_1             F
2     row_2             F
3     row_3             T
4     row_4             F
Either no rows, or at most one row can have the Flag column with the value T. My developer claims that this can be achieved with a check constraint placed on the table.
Questions:
Can such a constraint be placed on the table itself (i.e. an inter-row constraint enforced at the database level), rather than in business rules for updating or inserting rows?
Is such a table in normal form?
Or would normal form require removing the Flag column and instead (say) having another simple table or variable containing the key of the row which has Flag=T, i.e. in the above case row 3?
1 No. A check constraint is per row. No other constraint will do this either.
You need one of:
a trigger (all versions)
indexed view with filter Flag = T, and unique index on Flag (SQL Server 2000+)
filtered index (SQL Server 2008)
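For example, the filtered-index option might look like this (a sketch, assuming the YourStuff table and CHAR(1) Flag column used in the examples below):
CREATE UNIQUE NONCLUSTERED INDEX UX_YourStuff_SingleTrueFlag
ON dbo.YourStuff (Flag)
WHERE Flag = 'T';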
2 Good enough
3 Overkill really. You're splitting the same data up to avoid one of the solutions above. But if you do go that way, use a one-row table, an FK for the ID column, and a unique constraint on Flag.
My developer claims that this can be
achieved with a check constraint
placed on the table.
SQL Server does not directly** support subqueries in CHECK constraints (a requirement for Full SQL-92; SQL Server is only compliant with Entry Level SQL-92, broadly speaking).
While there are almost certainly better ways of enforcing this constraint in SQL Server, purely out of interest it can indeed be achieved using a row-level CHECK constraint and a UNIQUE constraint e.g. here's one way:
CREATE TABLE YourStuff
(
key_col INTEGER NOT NULL UNIQUE,
Various_Columns VARCHAR(8) NOT NULL,
Flag CHAR(1) DEFAULT 'F' NOT NULL
CHECK (Flag IN ('F', 'T')),
Flag_key INTEGER UNIQUE,
CHECK (
(Flag = 'F' AND Flag_key = key_col)
OR
(Flag = 'T' AND Flag_key IS NULL)
)
);
The issue here is that you will need to maintain the Flag_key column's values 'manually'. Replacing the column + CHECK with a calculated column would mean the values are maintained automatically:
CREATE TABLE YourStuff
(
key_col INTEGER NOT NULL UNIQUE,
Various_Columns VARCHAR(8) NOT NULL,
Flag CHAR(1) DEFAULT 'F' NOT NULL
CHECK (Flag IN ('F', 'T')),
Flag_key AS (
CASE WHEN Flag = 'F' THEN key_col
ELSE NULL END
),
UNIQUE (Flag_key)
);
** While SQL Server does not directly support subqueries in CHECK constraints, there is a workaround in some cases using a user defined function (UDF) e.g.
CREATE FUNCTION dbo.CountTFlags ()
RETURNS INTEGER
AS
BEGIN
DECLARE @return INTEGER;
SET @return = (
SELECT COUNT(*)
FROM YourStuff
WHERE Flag = 'T'
);
RETURN @return;
END;
CREATE TABLE YourStuff
(
key_col INTEGER NOT NULL UNIQUE,
Various_Columns VARCHAR(8) NOT NULL,
Flag CHAR(1) DEFAULT 'F' NOT NULL
CHECK (Flag IN ('F', 'T')),
CHECK (1 >= dbo.CountTFlags())
);
Note that the UDF approach won't work in every case and that caution is required. The important point is that the UDF will be evaluated for each row affected (rather than at the SQL statement or transaction level, as you might expect). In this case, the constraint needs to be true for every row affected and therefore -- I think! -- it is safe. For more details, see Trouble with CHECK Constraints by David Portas.
Personally, I would simply use a second table to model Flag, which would only involve keys and a foreign key e.g.
CREATE TABLE YourStuff
(
key_col INTEGER NOT NULL UNIQUE,
Various_Columns VARCHAR(8) NOT NULL
);
CREATE TABLE YourStuffFlag
(
key_col INTEGER NOT NULL UNIQUE
REFERENCES YourStuff (key_col)
);
Is [my] table in normal form?
You should be aiming for Fifth normal form (5NF). Whether you have achieved this depends upon the design of Various_Columns. I do not believe that your Flag falls foul of the requirements for 5NF, and I do not see any update, delete or insert anomalies (which is the point of normalization, though a 5NF design can still exhibit anomalies). That said, to switch the row that gets the flag attribute, my two-table design requires a single UPDATE statement while your single-table design requires two ;)
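For instance (assuming exactly one row is currently flagged):
-- two-table design: move the flag to the row with key 4
UPDATE YourStuffFlag SET key_col = 4;
-- single-table design: clear the old flag, then set the new one
UPDATE YourStuff SET Flag = 'F' WHERE Flag = 'T';
UPDATE YourStuff SET Flag = 'T' WHERE key_col = 4;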

Using MySQL's "IN" function where the target is a column?

In a certain TABLE, I have a VARTEXT field which includes comma-separated values of country codes. The field is named cc_list. Typical entries look like the following:
'DE,US,IE,GB'
'IT,CA,US,FR,BE'
Now given a country code, I want to be able to efficiently find which records include that country. Obviously there's no point in indexing this field.
I can do the following
SELECT * from TABLE where cc_list LIKE '%US%';
But this is inefficient.
Since the "IN" function is supposed to be efficient (it bin-sorts the values), I was thinking along the lines of
SELECT * from TABLE where 'US' IN cc_list
But this doesn't work - I think the 2nd operand of IN needs to be a list of values, not a string. Is there a way to convert a CSV string to a list of values?
Any other suggestions? Thanks!
SELECT *
FROM MYTABLE
WHERE FIND_IN_SET('US', cc_list)
In a certain TABLE, I have a VARTEXT field which includes comma-separated values of country codes.
If you want your queries to be efficient, you should create a many-to-many link table:
CREATE TABLE table_country (cc CHAR(2) NOT NULL, tableid INT NOT NULL, PRIMARY KEY (cc, tableid))
SELECT *
FROM table_country tc
JOIN mytable t
ON t.id = tc.tableid
WHERE tc.cc = 'US'
Alternatively, you can set ft_min_word_len to 2, create a FULLTEXT index on your column and query like this:
CREATE FULLTEXT INDEX fx_mytable_cclist ON mytable (cc_list);
SELECT *
FROM MYTABLE
WHERE MATCH(cc_list) AGAINST('+US' IN BOOLEAN MODE)
This only works for MyISAM tables and the argument should be a literal string (you won't be able to join on this condition).
The first rule of normalization (first normal form) says you should not cram multiple values into a single column such as cc_list, for this very reason.
Preferably move the country codes into their own table, with IDs for each country code and a pivot table to support a many-to-many relationship.
CREATE TABLE my_table (
my_id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
mystuff VARCHAR(255) NOT NULL,
PRIMARY KEY(my_id)
);
# this is the pivot table
CREATE TABLE my_table_countries (
my_id INT(11) UNSIGNED NOT NULL,
country_id SMALLINT(5) UNSIGNED NOT NULL,
PRIMARY KEY(my_id, country_id)
);
CREATE TABLE countries (
country_id SMALLINT(5) UNSIGNED NOT NULL AUTO_INCREMENT,
country_code CHAR(2) NOT NULL,
country_name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_id)
);
Then you can query it making use of indexes:
SELECT * FROM my_table JOIN my_table_countries USING (my_id) JOIN countries USING (country_id) WHERE country_code = 'DE'
SELECT * FROM my_table JOIN my_table_countries USING (my_id) JOIN countries USING (country_id) WHERE country_code IN('DE','US')
You may have to group the results by my_id.
FIND_IN_SET seems to be the MySQL function you want. If you could actually store those comma-separated strings as MySQL SET columns (no more than 64 possible countries, or splitting countries into two groups of no more than 64 each), you could keep using FIND_IN_SET and go a bit faster.
There's no efficient way to find what you want. A table scan will be necessary. Putting multiple values into a single text field is a terrible misuse of relational database technology. If you refactor (if you have access to the database structure) so that the country codes are properly stored in a separate table you will be able to easily and quickly retrieve the data you want.
One approach that I've used successfully before (not on mysql, though) is to place a trigger on the table that splits the values (based on a specific delimiter) into discrete values, inserting them into a sub-table. Your select can then look like this:
SELECT * from TABLE where 'US' IN
(
select cc_list_name from cc_list_subtable
where cc_list_subtable.table_id = TABLE.id
)
where the trigger parses cc_list in TABLE into separate entries in column cc_list_name in table cc_list_subtable. It involves a bit of work in the trigger, too, as every change to TABLE means that associated rows in cc_list_subtable have to be deleted/updated/inserted as appropriate, but it is an approach that works in situations where the original table TABLE has to retain its original structure, yet you are free to adapt the query as you see fit.
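A rough sketch of what the insert trigger could look like in MySQL (hypothetical names: the main table is called main_table here, with an id primary key and the cc_list column; matching delete/update triggers would be needed as well):
DELIMITER $$
CREATE TRIGGER trg_main_table_ai
AFTER INSERT ON main_table
FOR EACH ROW
BEGIN
  -- walk the comma-separated list and insert one row per country code
  DECLARE remaining VARCHAR(255);
  DECLARE code VARCHAR(10);
  SET remaining = NEW.cc_list;
  WHILE remaining IS NOT NULL AND remaining <> '' DO
    SET code = TRIM(SUBSTRING_INDEX(remaining, ',', 1));
    INSERT INTO cc_list_subtable (table_id, cc_list_name) VALUES (NEW.id, code);
    IF LOCATE(',', remaining) > 0 THEN
      SET remaining = SUBSTRING(remaining, LOCATE(',', remaining) + 1);
    ELSE
      SET remaining = '';
    END IF;
  END WHILE;
END$$
DELIMITER ;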