Optimizing a strange MySQL Query - sql
Hoping someone can help with this. I have a query that pulls data from a PHP application and turns it into a view for use in a Ruby on Rails application. The PHP app's table is an E-A-V style table, with the following business rules:
Given fields: First Name, Last Name, Email Address, Phone Number and Mobile Phone Carrier:
Each property has two custom fields defined: one required and one optional. Clients can use either one, and different clients use different ones based on their own rules (e.g. Client A may not care about First and Last Name, but Client B might).
The RoR app must treat each "pair" of properties as only a single property.
Now, here is the query. The problem is it runs beautifully with around 11,000 records. However, the real database has over 40,000, and the query is extremely slow, taking roughly 125 seconds to run, which is totally unacceptable from a business perspective. It's absolutely required that we pull this data, and we need to interface with the existing system.
The UserID part is to fake out a Rails-esque foreign key which relates to a Rails table. I'm a SQL Server guy, not a MySQL guy, so maybe someone can point out how to improve this query? They (the business) demand that it be sped up, but I'm not sure how, since the various group_concat and ifnull calls are required: I need every field for every client and then have to combine the data.
select `ls`.`subscriberid` AS `id`,left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) AS `user_id`,
ifnull(min((case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end)),_utf8'') AS `first_name`,
ifnull(min((case when (`s`.`fieldid` in (3,36)) then `s`.`data` else NULL end)),_utf8'') AS `last_name`,
ifnull(`ls`.`emailaddress`,_utf8'') AS `email_address`,
ifnull(group_concat((case when (`s`.`fieldid` = 81) then `s`.`data` when (`s`.`fieldid` = 154) then `s`.`data` else NULL end) separator ''),_utf8'') AS `mobile_phone`,
ifnull(group_concat((case when (`s`.`fieldid` = 100) then `s`.`data` else NULL end) separator ','),_utf8'') AS `sms_only`,
ifnull(group_concat((case when (`s`.`fieldid` = 34) then `s`.`data` else NULL end) separator ','),_utf8'') AS `mobile_carrier`
from ((`list_subscribers` `ls`
join `lists` `l` on((`ls`.`listid` = `l`.`listid`)))
left join `subscribers_data` `s` on((`ls`.`subscriberid` = `s`.`subscriberid`)))
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
group by `ls`.`subscriberid`,`l`.`name`,`ls`.`emailaddress`
EDIT
I removed the regexp and that sped the query up to about 20 seconds, instead of nearly 120 seconds. If I could remove the group by then it would be faster, but I cannot, as removing it causes duplicate rows with blank data for each field instead of aggregating them. For instance:
With group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John Doe jdoe@example.com 5551234567 0 Sprint
Without group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John jdoe@example.com
1 1 Doe jdoe@example.com
1 1 jdoe@example.com
1 1 jdoe@example.com 5551234567
And so on. What we need is the first result.
EDIT #2
The query still seems to take a long time, but earlier today it was running in only about 20 seconds on the production database. Without changing a thing, the same query is now once again taking over 60 seconds. This is still unacceptable. Any other ideas on how to improve this?
That is, without a doubt, the second most hideous SQL query I have ever laid my eyes on :-)
My advice is to trade storage requirements for speed. This is a common trick used when you find your queries have a lot of per-row functions (ifnull, case and so forth). These per-row functions never scale very well as the table becomes larger.
Create new fields in the table which will hold the values you want to extract, and then calculate those values on insert/update (with a trigger) rather than on select. This doesn't technically break 3NF since the triggers guarantee data consistency between the columns.
The vast majority of database tables are read far more often than they're written so this will amortise the cost of the calculation across many selects. In addition, just about every reported problem with databases is one of speed, not storage.
An example of what I mean. You can replace:
case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end
with:
`s`.`data_2_35`
in your query if your insert/update trigger simply sets the data_2_35 column to data or NULL depending on the value of fieldid. Then you index data_2_35 and, voila, instant speed improvement at the cost of a little storage.
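As a sketch only (assuming MySQL 5.0+ trigger support; the column name data_2_35, its VARCHAR(255) type, and the trigger name are all hypothetical), the insert side might look like this, with a matching BEFORE UPDATE trigger alongside it:

ALTER TABLE subscribers_data ADD COLUMN data_2_35 VARCHAR(255) NULL;
CREATE INDEX idx_subscribers_data_2_35 ON subscribers_data (data_2_35);

-- Populate the column as rows arrive; data_2_35 mirrors `data` only for
-- fieldids 2 and 35, and stays NULL otherwise. A BEFORE UPDATE twin of this
-- trigger keeps it in sync on updates.
CREATE TRIGGER subscribers_data_bi BEFORE INSERT ON subscribers_data
FOR EACH ROW
SET NEW.data_2_35 = IF(NEW.fieldid IN (2, 35), NEW.data, NULL);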
This trick can be done to the five case clauses, the left/regexp bit and the "naked" ifnull function as well (the ifnull functions containing min and group_concat may be harder to do).
The problem is most likely the WHERE condition:
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
This looks like a complex string comparison, so no index can be used, which results in a full table scan, possibly for every row in the result set. I am not a MySQL expert, but if you can simplify this into simpler column comparisons it will probably run much faster.
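For example (a sketch only; name_prefix_is_digits is a hypothetical column), you could precompute the test once, index it, and compare against a constant instead:

ALTER TABLE lists ADD COLUMN name_prefix_is_digits TINYINT(1) NOT NULL DEFAULT 0;

-- One-off backfill using the same expression the WHERE clause uses today.
UPDATE lists
SET name_prefix_is_digits =
    (LEFT(name, LOCATE('_', name) - 1) REGEXP '[[:digit:]]+');

CREATE INDEX idx_lists_name_prefix ON lists (name_prefix_is_digits);

-- The view's WHERE clause then becomes: WHERE l.name_prefix_is_digits = 1

New and changed rows would need the same maintenance, e.g. via the trigger approach described in the answer above.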
The first thing that jumps out at me as the source of all the trouble:
The PHP app's table is an E-A-V style table...
Trying to convert data in EAV format into conventional relational format on the fly using SQL is bound to be awkward and inefficient. So don't try to smash it into a conventional column-per-attribute format. The following query returns multiple rows per subscriber, one row per EAV attribute:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154))
WHERE SUBSTRING_INDEX(l.name, _utf8'_', 1) REGEXP _utf8'[[:digit:]]+'
This eliminates the GROUP BY, which is not optimized well in MySQL -- it usually incurs a temporary table, which kills performance. The result looks like this:
id user_id email_address fieldid data
1 1 jdoe@example.com 2 John
1 1 jdoe@example.com 3 Doe
1 1 jdoe@example.com 81 5551234567
But you'll have to sort out the EAV attributes in application code. That is, you can't seamlessly use ActiveRecord in this case. Sorry about that, but that's one of the disadvantages of using a non-relational design like EAV.
The next thing that I notice is the killer string manipulation (even after I've simplified it with SUBSTRING_INDEX()). When you're picking substrings out of a column, this tells me that you've overloaded one column with two distinct pieces of information. One is the name and the other is some kind of list-type attribute that you would use to filter the query. Store one piece of information in each column.
You should add a column for this attribute, and index it. Then the WHERE clause can utilize the index:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154))
WHERE l.list_name_contains_digits = 1;
Also, you should always analyze an SQL query with EXPLAIN if it's important for it to have good performance. There's an analogous feature in MS SQL Server, so you should be accustomed to the concept, but the MySQL terminology may be different.
You'll have to read the documentation to learn how to interpret the EXPLAIN report in MySQL; there's too much info to describe here.
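For instance, you simply prefix the query (or any candidate rewrite) with the keyword and look at the type, key and rows columns of the output:

EXPLAIN
SELECT ls.subscriberid AS id, s.fieldid, s.data
FROM list_subscribers ls
JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid);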
Re your additional info: Yes, I understand you can't do away with the EAV table structure. Can you create an additional table? Then you can load the EAV data into it:
CREATE TABLE subscriber_mirror (
subscriberid INT PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
first_name2 VARCHAR(100),
last_name2 VARCHAR(100),
mobile_phone VARCHAR(100),
sms_only VARCHAR(100),
mobile_carrier VARCHAR(100)
);
INSERT INTO subscriber_mirror (subscriberid)
SELECT DISTINCT subscriberid FROM list_subscribers;
UPDATE subscribers_data s JOIN subscriber_mirror m USING (subscriberid)
SET m.first_name = IF(s.fieldid = 2, s.data, m.first_name),
m.last_name = IF(s.fieldid = 3, s.data, m.last_name),
m.first_name2 = IF(s.fieldid = 35, s.data, m.first_name2),
m.last_name2 = IF(s.fieldid = 36, s.data, m.last_name2),
m.mobile_phone = IF(s.fieldid = 81, s.data, m.mobile_phone),
m.sms_only = IF(s.fieldid = 100, s.data, m.sms_only),
m.mobile_carrier = IF(s.fieldid = 34, s.data, m.mobile_carrier);
This will take a while, but you only need to do it when you get a new data update from the vendor. Subsequently you can query subscriber_mirror in a much more conventional SQL query:
SELECT ls.subscriberid AS id, l.name+0 AS user_id,
COALESCE(s.first_name, s.first_name2) AS first_name,
COALESCE(s.last_name, s.last_name2) AS last_name,
COALESCE(ls.emailaddress, '') AS email_address,
COALESCE(s.mobile_phone, '') AS mobile_phone,
COALESCE(s.sms_only, '') AS sms_only,
COALESCE(s.mobile_carrier, '') AS mobile_carrier
FROM lists l JOIN list_subscribers ls USING (listid)
JOIN subscriber_mirror s USING (subscriberid)
WHERE l.name+0 > 0
As for the userid that's embedded in the l.name column, if the digits are the leading characters in the column value, MySQL allows you to convert to an integer value much more easily:
An expression like '123_bill'+0 yields an integer value of 123. An expression like 'bill_123'+0 has no digits at the beginning, so it yields an integer value of 0.
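You can confirm the behavior directly from the mysql client:

SELECT '123_bill'+0;  -- returns 123
SELECT 'bill_123'+0;  -- returns 0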
Related
Sorting with many to many relationship
I have 3 tables: person, person_speaks_language and language. person has 80 records and language has 2 records. In the data, the first 10 persons speak one language, the first 70 persons (including the first group) speak 2 languages, and the last 10 persons don't speak any language.

Following the example, I want to sort the persons by language. How can I do it correctly? I'm trying to use the following SQL, but the results seem quite strange:

SELECT "person".*
FROM "person"
LEFT JOIN "person_speaks_language" ON "person"."id" = "person_speaks_language"."person_id"
LEFT JOIN "language" ON "person_speaks_language"."language_id" = "language"."id"
ORDER BY "language"."name" ASC

dataset:

71,Catherine,Porter,male,NULL
72,Isabelle,Sharp,male,NULL
73,Scott,Chandler,male,NULL
74,Jean,Graham,male,NULL
75,Marc,Kennedy,male,NULL
76,Marion,Weaver,male,NULL
77,Melvin,Fitzgerald,male,NULL
78,Catherine,Guerrero,male,NULL
79,Linnie,Strickland,male,NULL
80,Ann,Henderson,male,NULL
11,Daniel,Boyd,female,English
12,Ora,Beck,female,English
13,Hulda,Lloyd,female,English
14,Jessie,McBride,female,English
15,Marguerite,Andrews,female,English
16,Maurice,Hamilton,female,English
17,Cecilia,Rhodes,female,English
18,Owen,Powers,female,English
19,Ivan,Butler,female,English
20,Rose,Bishop,female,English
21,Franklin,Mann,female,English
22,Martha,Hogan,female,English
23,Francis,Oliver,female,English
24,Catherine,Carlson,female,English
25,Rose,Sanchez,female,English
26,Danny,Bryant,female,English
27,Jim,Christensen,female,English
28,Eric,Banks,female,English
29,Tony,Dennis,female,English
30,Roy,Hoffman,female,English
31,Edgar,Hunter,female,English
32,Matilda,Gordon,female,English
33,Randall,Cruz,female,English
34,Allen,Brewer,female,English
35,Iva,Pittman,female,English
36,Garrett,Holland,female,English
37,Johnny,Russell,female,English
38,Nina,Richards,female,English
39,Mary,Ballard,female,English
40,Adrian,Sparks,female,English
41,Evelyn,Santos,female,English
42,Bess,Jackson,female,English
43,Nicholas,Love,female,English
44,Fred,Perkins,female,English
45,Cynthia,Dunn,female,English
46,Alan,Lamb,female,English
47,Ricardo,Sims,female,English
48,Rosie,Rogers,female,English
49,Susan,Sutton,female,English
50,Mary,Boone,female,English
51,Francis,Marshall,male,English
52,Carl,Olson,male,English
53,Mario,Becker,male,English
54,May,Hunt,male,English
55,Sophie,Neal,male,English
56,Frederick,Houston,male,English
57,Edwin,Allison,male,English
58,Florence,Wheeler,male,English
59,Julia,Rogers,male,English
60,Janie,Morgan,male,English
61,Louis,Hubbard,male,English
62,Lida,Wolfe,male,English
63,Alfred,Summers,male,English
64,Lina,Shaw,male,English
65,Landon,Carroll,male,English
66,Lilly,Harper,male,English
67,Lela,Gordon,male,English
68,Nina,Perry,male,English
69,Dean,Perez,male,English
70,Bertie,Hill,male,English
1,Nelle,Gill,female,Spanish
2,Lula,Wright,female,Spanish
3,Anthony,Jensen,female,Spanish
4,Rodney,Alvarez,female,Spanish
5,Scott,Holmes,female,Spanish
6,Daisy,Aguilar,female,Spanish
7,Elijah,Olson,female,Spanish
8,Alma,Henderson,female,Spanish
9,Willie,Barrett,female,Spanish
10,Ada,Huff,female,Spanish
11,Daniel,Boyd,female,Spanish
12,Ora,Beck,female,Spanish
13,Hulda,Lloyd,female,Spanish
14,Jessie,McBride,female,Spanish
15,Marguerite,Andrews,female,Spanish
16,Maurice,Hamilton,female,Spanish
17,Cecilia,Rhodes,female,Spanish
18,Owen,Powers,female,Spanish
19,Ivan,Butler,female,Spanish
20,Rose,Bishop,female,Spanish
21,Franklin,Mann,female,Spanish
22,Martha,Hogan,female,Spanish
23,Francis,Oliver,female,Spanish
24,Catherine,Carlson,female,Spanish
25,Rose,Sanchez,female,Spanish
26,Danny,Bryant,female,Spanish
27,Jim,Christensen,female,Spanish
28,Eric,Banks,female,Spanish
29,Tony,Dennis,female,Spanish
30,Roy,Hoffman,female,Spanish
31,Edgar,Hunter,female,Spanish
32,Matilda,Gordon,female,Spanish
33,Randall,Cruz,female,Spanish
34,Allen,Brewer,female,Spanish
35,Iva,Pittman,female,Spanish
36,Garrett,Holland,female,Spanish
37,Johnny,Russell,female,Spanish
38,Nina,Richards,female,Spanish
39,Mary,Ballard,female,Spanish
40,Adrian,Sparks,female,Spanish
41,Evelyn,Santos,female,Spanish
42,Bess,Jackson,female,Spanish
43,Nicholas,Love,female,Spanish
44,Fred,Perkins,female,Spanish
45,Cynthia,Dunn,female,Spanish
46,Alan,Lamb,female,Spanish
47,Ricardo,Sims,female,Spanish
48,Rosie,Rogers,female,Spanish
49,Susan,Sutton,female,Spanish
50,Mary,Boone,female,Spanish
51,Francis,Marshall,male,Spanish
52,Carl,Olson,male,Spanish
53,Mario,Becker,male,Spanish
54,May,Hunt,male,Spanish
55,Sophie,Neal,male,Spanish
56,Frederick,Houston,male,Spanish
57,Edwin,Allison,male,Spanish
58,Florence,Wheeler,male,Spanish
59,Julia,Rogers,male,Spanish
60,Janie,Morgan,male,Spanish
61,Louis,Hubbard,male,Spanish
62,Lida,Wolfe,male,Spanish
63,Alfred,Summers,male,Spanish
64,Lina,Shaw,male,Spanish
65,Landon,Carroll,male,Spanish
66,Lilly,Harper,male,Spanish
67,Lela,Gordon,male,Spanish
68,Nina,Perry,male,Spanish
69,Dean,Perez,male,Spanish
70,Bertie,Hill,male,Spanish

Update: the expected results are that each person must appear only one time, using the language order. To explain the case further, I'll take a new and small dataset, using only the person id and the language name:

1,English
2,English
3,English
4,English
19,English
1,Spanish
2,Spanish
3,Spanish
4,Spanish
5,Spanish
14,Spanish
15,Spanish
16,Spanish
19,Spanish
21,Spanish
25,Spanish

I'm using the same order, but if I use a limit, for example LIMIT 8, the results will be:

1,English
2,English
3,English
4,English
19,English
1,Spanish
2,Spanish
3,Spanish

And the expected result is:

1,English
2,English
3,English
4,English
19,English
5,Spanish
14,Spanish
15,Spanish

What I'm trying to do is sorting, paginating and filtering a list of X that may have a many-to-many relationship with Y; in this case X is a person and Y is the language. I need to do it in a general way. I run into trouble if I want to order the list by some of Y's properties. The list will be shown this way:

firstname, lastname, gender, languages
Daniel, Boyd, female, English Spanish
Ora, Beck, female, English
Anthony, Jensen, female, Spanish
....

I only need to return an array with the IDs in the correct order. The main reason I need each person to appear only once in the results is that the ORM I'm using tries to hydrate each result, and if I paginate the results using offset and limit, the results may not be the expected ones. I'm working from assumptions about the many-to-many relationships: I can't use string_agg or group_concat because I don't know the real data; I don't know if the values are integers or strings.
If you want each person to appear only once, then you need to aggregate by that person. If you then want the list of languages, you need to combine them in some way; concatenation comes to mind. The use of double quotes suggests Postgres or Oracle to me. Here is Postgres syntax for this:

SELECT p.id, string_agg(l.name, ',') AS languages
FROM person p
LEFT JOIN person_speaks_language psl ON p.id = psl.person_id
LEFT JOIN language l ON psl.language_id = l.id
GROUP BY p.id
ORDER BY COUNT(l.name) DESC, languages;

Similar functionality to string_agg() exists in most databases.
There is nothing wrong with Bertie Hill appearing in two rows, with one language each; that is the Tabular View of Data per the Relational Model. There are no dependencies on data values or number of data values. It is completely correct and un-confused.

But here, the requirement is confused, because you really want three separate lists:
- speaks one language
- speaks two languages [or the number of languages currently in the language file]
- speaks no language [on file]
...but you want those three lists in one list.

Concatenating data values is never, ever a good idea. It is a breach of rudimentary standards, specifically 1NF. It may be common, but it is a gross error. It may be taught by the so-called "theoreticians", but it remains a gross error. Even in a result set, yes. It creates confusion, such as I have detailed at the top. With concatenated strings, as the number of languages changes, the width of that concatenated field will grow, and eventually exceed space wherever it appears (eg. the width of the field on the screen). Those are just two of the many reasons why it is incorrect, not expandable, sub-standard. By the way, in your "dataset" (it isn't the result set produced by your code), the sexes appear to be nicely mixed up.

Therefore the answer, and the only correct one, even if it isn't popular, is that your code is correct (it can be cleaned up, sure), and you have to educate the user re the dangers of sub-standard code or reports. You can sort by person.name (rather than by language.name) and then write smarter SQL such that (eg) the person.name is not repeated on the second and subsequent rows for persons who speak more than one language, etc. That is just pretty printing. The non-answer, for those who insist on sub-standard code that will break one day, is Gordon's response.

Response to Comments

In the Relational Model:

There is no order to the rows; that is deemed a physical or implementation aspect, which we have no control over, which changes anyway, and which we are warned not to rely upon. If order is sought in the output result set, then we must use ORDER BY; that is its purpose in life.

The data has meaning, and that meaning is carried in Relational Keys. Meaning cannot be carried in surrogates (ie. ID columns).

Limiting myself to the files (they are not tables) that you have given, there is no such thing in the data as "the first 10 persons who speak one language". Obtaining persons who speak one language is simple, I believe you already understand that:

SELECT person.first_name, person.last_name
FROM person P,
     (SELECT person_id
      FROM person_speaks_language
      GROUP BY person_id
      HAVING COUNT(*) = 1  -- change this for 2 languages, etc
     ) AS PL
WHERE P.person_id = PL.person_id

But "first"? "First" by what criteria? Record creation date?

ORDER BY date_created  -- if it exists in the data

Record ID does not give first anything: as records are added and deleted, any "order" that may exist initially is completely lost. You cannot extract meaning out of, or assign meaning to, something that by definition has no meaning. If the Record ID is relevant, ie. you are going to use it for some purpose, then it is not a Record ID; name the field for what it actually is.

I fail to see, I do not understand, the relevance of the difference between the "dataset" and the updated "small dataset". The "dataset" size is irrelevant, the field headings are irrelevant; what the result set means is relevant.
The problem is not some "limitation" in the Relational Model; the problem is (a) your fixed view of data values, and (b) your lack of understanding about what the Relational Model is and what it does, understanding of which makes this whole question disappear, and we are left with a simple SQL (as tagged) "how to" question. Eg. if I had a Relational Database, with persons and languages, with no ID columns, there is nothing that I cannot do with it, no report that I cannot produce from it, from the data. Please try to use an example that conveys the meaning in the data, in what you are trying to do.

"the expect results are: each person must be appear only one time"

They already appear only once (for each language).

"using the language order"

Well, there is no order in the language file. We can give it some order, whatever order is meaningful to you, in the result set, based on the data. Eg. language.name. Of course, many persons speak each language, so what order would you like within language.name? How about last_name, first_name? The Record IDs are meaningless to the user, so I won't display them in the result set. NULL is also meaningless, and ambiguous, so I will make the meaning here explicit. This is pretty much what you have, tidied up:

SELECT [language] = CASE name
           WHEN NULL THEN "[None]"
           ELSE name
           END,
       last_name,
       first_name
FROM person P
LEFT JOIN person_speaks_language PL ON P.id = PL.person_id
LEFT JOIN language L ON PL.language_id = L.id
ORDER BY name, last_name, first_name

But then you have "And the expected result is", the example data of which contradicts your textual description "the expect results are: each person must be appear only one time using the language order". So now, if I ignore the text and examine the example data re what you want (which is a horrible thing to do, because I am joining you in the incorrect activity of focussing on the data values, rather than understanding the meaning), it appears you want the person to appear only once, full stop, regardless of how many languages they speak. Your example data is meaningless, so I cannot be asked to reproduce it. See if this has some meaning:

SELECT last_name,
       first_name,
       [language] = (              -- correlated subquery
           SELECT TOP 1            -- get the "first" language
                  CASE name        -- make meaning of null explicit
                      WHEN NULL THEN "[None]"
                      ELSE name
                      END
           FROM person_speaks_language PL
           JOIN language L ON PL.language_id = L.id
           WHERE P.id = PL.person_id  -- the subject person
           ORDER BY name              -- id would be meaningless
           )
FROM person P                      -- vector for person, once
ORDER BY last_name, first_name

Now if you wanted only persons who speak a language (on file):

SELECT last_name,
       first_name,
       [language] = (              -- correlated subquery
           SELECT TOP 1            -- get the "first" language
                  name
           FROM person_speaks_language PL
           JOIN language L ON PL.language_id = L.id
           WHERE P.id = PL.person_id  -- the subject person
           ORDER BY name              -- id would be meaningless
           )
FROM person P,
     (SELECT DISTINCT person_id       -- just one occ, thanks
      FROM person_speaks_language PL  -- vector for speakers
     ) AS PL_1
WHERE P.id = PL_1.person_id           -- join them to person fields

There, not an outer join anywhere to be seen, in either solution. LEFT or RIGHT will confuse you. Do not attempt to "get everything" so that you can "see" the data values, and then mangle, hack and chop away at the result set in order to get what you want from that. No, forget about the data values and get only what you want from the record filing system.
Response to Update

"I was trying to explain the case with a data set, I think I made things tougher than they actually were"

Yes, you did. Reviewing the update then...

The short answer is: get rid of the ORM. There is nothing in it of value; you can access the RDB from the queries that populate your objects directly, the way we did for decades before the flatulent beast came along. Especially if you understand and implement Open Architecture Standards. Further, as evidenced, it creates masses of problems. Here, you are trying to work around the insane restrictions of the ORM. Pagination is a straightforward issue, if you have your data Normalised, and Relational Keys.

The long answer is... please read this Answer. I trust you will understand that the approach you take to designing your app components, your design of windows, will change. All your queries will be simplified; you get only what you require for the specific window or object. The problem may well disappear entirely (except possibly for the pagination; you might need a method). Then please think about those architectural issues carefully, and make specific comments or questions.
Google Bigquery use of substr, never returns results
I have a table which has two sets of data; one set of data has information like:

Type | Name | Id
PackagedDrug | Pseudoephedrine HCl Oral Tablet 120 MG | 110
PackagedDrug | Pseudoephedrine HCl Oral Tablet 60 MG | 111
DrugName | Pseudoephedrine HCl | 112

What I want to do is join PackagedDrug with DrugName concepts, i.e. get all Ids of Type PackagedDrug whose Name matches the Name of a Type DrugName row. If I hardcode the Name for DrugName in the following query, it runs instantaneously, but if I take out the hardcoding then it just keeps on running. Could you please suggest suitable ways to speed up the big query?

SELECT a.MSC_ID MSC_id, a.MSC_CONcept_type, a.concept_id, a.concept_name, b.concept_name
FROM (SELECT MSC_id, MSC_CONcept_type, concept_id, concept_name
      FROM [ClientAlerts.MSC_Concepts]
      WHERE MSC_CONcept_type IN ('MediSpan.Concepts.PackagedDrug')
     ) a
CROSS JOIN (SELECT MSC_CONcept_type, concept_id, concept_name, LENGTH(concept_name) len
            FROM [ClientAlerts.MSC_Concepts]
            WHERE MSC_CONcept_type IN ('MediSpan.Concepts.NamebasedClassification.DrugName')
            -- AND concept_name IN ('Pseudoephedrine HCl')
           ) b
WHERE SUBSTR(a.concept_name, 1, b.len) + ' ' = b.concept_name

Thanks, Savita
This has nothing to do with BigQuery itself. When you hardcode your values, they are "filtered" way faster, because the query doesn't have to check every row; it looks only for the hardcoded value. If you don't use the hardcoded value, it will look at WAY more rows: it compares ALL the rows from your first query with your second. Honestly, if you describe your use case properly here, I can't think of any way to do this faster. But one question does come to mind: why do you have a "type"? It seems like it should be two different tables instead.
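A rough sketch of that split (the new table names are made up here, and this assumes Standard SQL support for CREATE TABLE ... AS SELECT), so that each side of the join scans only rows of its own type:

CREATE TABLE ClientAlerts.packaged_drug AS
SELECT MSC_id, concept_id, concept_name
FROM ClientAlerts.MSC_Concepts
WHERE MSC_CONcept_type = 'MediSpan.Concepts.PackagedDrug';

CREATE TABLE ClientAlerts.drug_name AS
SELECT MSC_id, concept_id, concept_name
FROM ClientAlerts.MSC_Concepts
WHERE MSC_CONcept_type = 'MediSpan.Concepts.NamebasedClassification.DrugName';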
Why does changing the where clause on this criteria reduce the execution time so drastically?
I ran across a problem with a SQL statement today that I was able to fix by adding additional criteria; however, I really want to know why my change fixed the problem.

The problem query:

SELECT *
FROM (SELECT ah.*, com.location, ha.customer_number, d.name applicance_NAME,
             house.name house_NAME, tr.name RULE_NAME
      FROM actionhistory ah
      INNER JOIN community com ON (ah.city_id = com.city_id)
      INNER JOIN house_address ha ON (ah.applicance_id = ha.applicance_id AND ha.status_cd = 'ACTIVE')
      INNER JOIN applicance d ON (ah.applicance_id = d.applicance_id)
      INNER JOIN house house ON (house.house_id = ah.house_id)
      LEFT JOIN the_rule tr ON (tr.the_rule_id = ah.the_rule_id)
      WHERE actionhistory_id >= 'ACT100010000'
      ORDER BY actionhistory_id
     )
WHERE rownum <= 30000;

The "fix":

SELECT *
FROM (SELECT ah.*, com.location, ha.customer_number, d.name applicance_NAME,
             house.name house_NAME, tr.name RULE_NAME
      FROM actionhistory ah
      INNER JOIN community com ON (ah.city_id = com.city_id)
      INNER JOIN house_address ha ON (ah.applicance_id = ha.applicance_id AND ha.status_cd = 'ACTIVE')
      INNER JOIN applicance d ON (ah.applicance_id = d.applicance_id)
      INNER JOIN house house ON (house.house_id = ah.house_id)
      LEFT JOIN the_rule tr ON (tr.the_rule_id = ah.the_rule_id)
      WHERE actionhistory_id >= 'ACT100010000' AND actionhistory_id <= 'ACT100030000'
      ORDER BY actionhistory_id
     )

All of the _id columns are indexed sequences. The first query's explain plan had a cost of 372 and the second was 14. This is running on an Oracle 11g database. Additionally, if actionhistory_id in the where clause is anything less than ACT100000000, the original query returns instantly.
This is because of the index on the actionhistory_id column. During the first query Oracle has to return all the index blocks containing index entries for records that come after 'ACT100010000', then it has to match the index to the table to get all the records, and then it pulls 29999 records from the result set. During the second query Oracle only has to return the index blocks containing records between 'ACT100010000' and 'ACT100030000'. Then it grabs from the table those records that are represented in the index blocks; a lot less work in that step of grabbing the record after having found the index than in the first query. Noticing your last line about the id being less than ACT100000000 - it sounds to me like those records may all be in the same memory block (or in a contiguous set of blocks).

EDIT: Please also consider what is said by Justin - I was talking about actual performance, but he is pointing out that the id being a varchar greatly increases the potential values (as opposed to a number) and that the estimated plan may reflect a greater time than reality because the optimizer doesn't know the full range until execution. To further optimize, taking his point into consideration, you could put a function-based index on the id column, or you could make it a combination key, with the varchar portion in one column and the numeric portion in another.
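A sketch of the function-based index idea (assuming the ids really are 'ACT' followed by digits, as shown in the question; the index name is made up):

CREATE INDEX actionhistory_id_num_ix
  ON actionhistory (TO_NUMBER(SUBSTR(actionhistory_id, 4)));

-- Range predicates can then be numeric and use this index, e.g.:
-- WHERE TO_NUMBER(SUBSTR(actionhistory_id, 4)) BETWEEN 100010000 AND 100030000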
What are the plans for both queries? Are the statistics on your tables up to date? Do the two queries return the same set of rows? It's not obvious that they do but perhaps ACT100030000 is the largest actionhistory_id in the system. It's also a bit confusing because the first query has a predicate on actionhistory_id with a value of TRA100010000 which is very different than the ACT value in the second query. I'm guessing that is a typo? Are you measuring the time required to fetch the first row? Or the time required to fetch the last row? What are those elapsed times? My guess without that information is that the fact that you appear to be using the wrong data type for your actionhistory_id column is affecting the Oracle optimizer's ability to generate appropriate cardinality estimates which is likely causing the optimizer to underestimate the selectivity of your predicates and to generate poorly performing plans. A human may be able to guess that actionhistory_id is a string that starts with ACT10000 and then has 30,000 sequential numeric values from 00001 to 30000 but the optimizer is not that smart. It sees a 13 character string and isn't able to figure out that the last 10 characters are always going to be numbers so there are only 10 possible values rather than 256 (assuming 8-bit characters) and that the first 8 characters are always going to be the same constant value. If, on the other hand, actionhistory_id was defined as a NUMBER and had values between 1 and 30000, it would be dramatically easier for the optimizer to make reasonable estimates about the selectivity of various predicates.
How to optimize group by in table with huge number of records
I have a Person table with a huge number of records (about 16 million), and have a requirement to find all persons with the same lastname, the same first letter of firstname and the same birthyear; in other words, I want to show assumed duplicate persons in the UI for users to analyze and decide whether they are the same person or not.

Here is the query I wrote:

SELECT *
FROM Person
INNER JOIN (
    SELECT SUBSTRING(firstName, 1, 1) firstNameF, lastName, YEAR(birthDate) birthYear
    FROM Person
    GROUP BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)
    HAVING count(*) > 1
) as dupPersons
ON SUBSTRING(Person.firstName, 1, 1) = dupPersons.firstNameF
   AND Person.lastName = dupPersons.lastName
   AND YEAR(Person.birthDate) = dupPersons.birthYear
ORDER BY Person.lastName, Person.firstName

But as I am not a SQL expert, I want to know: is this a good way to do that? Is there a more optimized way?

EDIT

Note that I can cut the data, which can contribute to optimization. For example, if I cut the data in two, it could return these two groups:

Johan Smith |
Jane Smith  | have same last name and first-name initial
Jack Smith  |

Mark Tween | have same last name and first-name initial
Mac Tween  |
If the performance using a GROUP BY is not adequate, you could try using an INNER JOIN:

SELECT *
FROM Person p1
INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1)
  AND p2.LastName = p1.LastName
  AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY p1.LastName, p1.FirstName
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that:

1. The query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc.
2. The context that the query operates within - DB specifics, task specifics, etc.

Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.

As for #2 - you should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc. Things I'd want to know:

How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?

Example scenario: if this was a run-on-command feature that will be in my app indefinitely, it will get run weekly, with 10% or fewer records expected to be duplicates, with the ability to change the DB how I'd like, and the duplicate matching criteria is firm (not fluctuating), and I want to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate), and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might go in a different direction entirely.
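That last scenario might look like this in T-SQL (a sketch; the index name is made up):

-- Persisted computed column so YEAR() isn't evaluated per row at query time
ALTER TABLE Person ADD BirthYear AS YEAR(BirthDate) PERSISTED;

-- Index matching the duplicate-detection grouping
CREATE INDEX IX_Person_DupCheck ON Person (LastName ASC, BirthYear ASC, FirstName ASC);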
You can try something like this and see the difference in the execution plans, or benchmark the results on performance:

;WITH DupPersons AS
(
    SELECT *, COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
    FROM Person
)
SELECT *
FROM DupPersons
WHERE Quant > 1

Of course, it would also help to know your table definition and the indexes you created. I think it might help to add a computed column with the year of birthdate and create an index on it; the same with the first letter of firstname.
Sorting SQL by first two characters of fields
I'm trying to sort some data by sales person initials, and the sales rep field is 3 chars long: first-name initial, last-name initial and account type. So, Bob Smith would be BS*, and I just need to sort by the first two characters. How can I pull all data for a certain rep, where the first two characters of the field equal BS?
In some databases you can actually do:

select * from SalesRep order by substring(SalesRepID, 1, 2)

Others require you to do:

select *, Substring(SalesRepID, 1, 2) as foo from SalesRep order by foo

And in still others, you can't do it at all (you'll have to sort your output in program code after you get it from the database).

Addition: If you actually want just the data for one sales rep, do as the others suggest. Otherwise, you either want to sort by the thing, or maybe group by the thing.
What about this:

SELECT * FROM SalesTable WHERE SalesRepField LIKE 'BS_'
I hope that you never end up with two sales reps who happen to have the same initials. Also, sorting and filtering are two completely different things. You talk about sorting in the question title and first paragraph, but your question is about filtering. Since you can just ORDER BY on the field and it will use the first two characters anyway, I'll give you an answer for the filtering part.

You don't mention your RDBMS, but this will work in any product:

SELECT my_columns FROM My_Table WHERE sales_rep LIKE 'BS%'

If you're using a variable/parameter then:

SELECT my_columns FROM My_Table WHERE sales_rep LIKE @my_param + '%'

You can also use:

LEFT(sales_rep, 2) = 'BS'

I would stay away from:

SUBSTRING(sales_rep, 1, 2) = 'BS'

Depending on your SQL engine, it might not be smart enough to realize that it can use an index with the last one.
You haven't said which DBMS you are using. The following would work in Oracle, and something like them in most other DBMSs:

1) where sales_rep like 'BS%'
2) where substr(sales_rep, 1, 2) = 'BS'
SELECT * FROM SalesRep WHERE SUBSTRING(SalesRepID, 1, 2) = 'BS'

You didn't say what database you were using; this works in MS SQL Server.