Why does added RAND() cause MySQL to overload? - sql

OK I have this query which gives me DISTINCT product_series, plus all the other fields in the table:
SELECT pi.*
FROM (
SELECT DISTINCT product_series
FROM cart_product
) pd
JOIN cart_product pi
ON pi.product_id =
(
SELECT product_id
FROM cart_product po
WHERE product_brand = "everlon"
AND product_type = "'.$type.'"
AND product_available = "yes"
AND product_price_contact = "no"
AND product_series != ""
AND po.product_series = pd.product_series
ORDER BY product_price
LIMIT 1
) ORDER BY product_price
This works fine. I am also ordering by price so I can get the starting price for each series. Nice.
However today my boss told me that all the products thats are showing up from this query are of metal_type white gold And he wants to show random metal types. so I added RAND() to the order by after the ORDER BY price so that I will still get the lowest price, but a random metal in the lowest price.. here is the new query:
SELECT pi.*
FROM (
SELECT DISTINCT product_series
FROM cart_product
) pd
JOIN cart_product pi
ON pi.product_id =
(
SELECT product_id
FROM cart_product po
WHERE product_brand = "everlon"
AND product_type = "'.$type.'"
AND product_available = "yes"
AND product_price_contact = "no"
AND product_series != ""
AND po.product_series = pd.product_series
ORDER BY product_price, RAND()
LIMIT 1
) ORDER BY product_price, RAND()
When I run this query, MySQL completely shuts down and tells me that there are too many connections And I get a phone call from the host admin asking me what the hell I did.
I didn't believe that could be just from added RAND() to the query and I thought it had to be a coincidence. I waited a few hours after everything was fixed and ran the query again. Immediately... same issue.
So what is going on? Because I have no clue. Is there something wrong with my query?
Thanks!!!!

Using RAND() for ORDER BY is not a good idea, because it does not scale as the data increases. You can see more information on it, including two alternatives you can adapt, in my answer to this question.

Here's a blog post that explains the issue quite well, and workarounds:
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
And here's a similar warning against ORDER BY RAND() for MySQL, I think the cause is basically the same there:
http://www.webtrenches.com/post.cfm/avoid-rand-in-mysql

Depending on the number of products in your site, that function call is going to execute once per record, potentially slowing the query down.. considerably.
The Too Many Connections error is probably due to this query blocking others while it tries to compute those numbers.
Find another way. ;)

Instead, you can generate random numbers on the programming language you're using, instead of the MySQL side, as rand() is being called for each row

If you know how many records you have you can select a random record like this (this is Perl):
$handle->Sql("SELECT COUNT(0) AS nor FROM table");
$handle->FetchRow();
$nor = $handle->Data('nor');
$rand = int(rand()*$nor)+1;
$handle->Sql("SELECT * FROM table LIMIT $rand,1");
$handle->FetchRow();
.
.
.

Related

sql is duplicating my results

I believe the problem is within my joins but i am unable to correct it. The SQL should return 3 rows however it is duplicating and returning 12 rows instead. Any help would be much appreciated!
SELECT J.JOURNEY_NUMBER,
L.DESCRIPTION,
L.USE_CODE,
J.REAL_START_DATE,
J.REAL_END_DATE,
S.STOP_ID,
SN.WRIN_ID,
J.JOURNEY_ID
FROM PDA_STG.JOURNEY J,
PDA_STG.RESTAURANT R,
PDA_STG.LOCATION L,
PDA_STG.SERIAL_NUMBER SN,
PDA_STG.STOP S
WHERE J.JOURNEY_ID = R.JOURNEY_ID
AND l.loc_id = r.rest_loc_id
AND J.JOURNEY_ID = S.JOURNEY_ID
AND S.STOP_ID = SN.STOP_ID
AND SN.WRIN_ID = '00768669'
AND j.dc_loc_id = '994'
AND J.JOURNEY_ID = '357020'
AND J.PLANNED_START_DATE < '20-APR-17'
ORDER BY J.JOURNEY_ID DESC
You are probably joining records that you don't want to join for which you'd have to add some join criteria. (For instance if the serial number could change for a stop, i.e. you keep old serial numbers with a date, you'd only want the latest serial number, not all.)
In order to find the flaw in your query you can select * and see what records you are actually selecting.
Thanks for the feedback, i just done as India.Rocket said and it worked perfectly.
without sample it's difficult to tell what's wrong with the query. But
if rows are exact duplicate then just put a distinct after select.
That should do the job – India.Rocket 46 mins ago

Negative comparison with ActiveRecord

I have two tables:
sales
period_id
customer_id
product_id
value
total_sales
period_id
customer_id
product_id
value
I want to select every combination of period_id and customer_id that exists on sales but don't exists on total_sales. I believe there's a short way to do this. But every approach I thought involves N+1 queries.
How can I do this?
In my opinion this is precisely the case where it's reasonable to diverge from the purist ActiveRecord approach and bring some SQL to the table. This should do the trick perfectly well:
Sale.find_by_sql("
SELECT * FROM sales WHERE NOT EXISTS (
SELECT 1 FROM total_sales
WHERE total_sales.period_id = sales.period_id AND
total_sales.customer_id = sales.customer_id)
")
I suppose it also can be done in a more ActiveRecordish style, but it will most certainly lose clarity of the code.
Of course if you ever drop this to you codebase, you really should wrap it in a method and leave in the model, so that SQL code at least doesn't bleed into the controller.
Hope this helps!
P.S. Oh, here's the SQL fiddle I used to test this out: http://sqlfiddle.com/#!15/eaf1d/1
How about you first get the list that matches, then use that to get the rows that you actually want:
unwanted_ids = Sale.joins('inner join total_sales on sales.period_id = total_sales.period_id and sales.customer_id = total_sales.customer_id').pluck('distinct sales.id')
wanted_rows = unwanted_ids.any? Sale.where('id not in (?)', unwanted_ids) : Sale.all
wanted_rows.select(:period_id, :customer_id) #if you only want these two
This way you only need two queries.
EDIT:
You can also do this:
Sale.joins('left outer join total_sales on sales.period_id = total_sales.period_id and sales.customer_id = total_sales.customer_id').where('total_sales.period_id is null')
Does it in one query

BigQuery - Shuffle By error

I have a table of about 5M rows. Note this is just a poc. Ultimately we will need to be in the TB range. I am doing a self join to find permutations of products for a market basket analysis.
I need to find the number of times the combination occurs in a basket, the ratio of occurrences to total baskets, and the number of times the item occurs in all baskets. This is pretty standard. BigQuery does not support selects in the predicate of another select so I needed to create another join I suppose. Here's what I came up with -
select twoItem.upc1,twoItem.upc2,twoItem.twoItemOccurrences, totalUpc.totalUpcCount
from
(
select purchase1.upc as upc1,purchase2.upc as upc2,count(upc1) as twoItemOccurrences
from
conagra.purchase as purchase1
join each conagra.purchase as purchase2
on purchase1.upc = purchase2.upc
group by upc1,upc2
) as twoItem
JOIN EACH
(
select purchase3.upc as upc3, count(*) as totalUpcCount
from conagra.purchase as purchase3
group by upc3
) as totalUpc
on totalUpc.upc3 = twoItem.upc1
LIMIT 50;
I get the following error:
SHUFFLE BY may only be applied to parallelizable queries, but query is not parallelizable: (SELECT * FROM (SELECT [purchase3.upc] AS [upc3], COUNT(*) AS [totalUpcCount]...
Maybe an unpublished limitation?
Any help would be appreciated.
Try running these with GROUP EACH BY on your inner queries. We'll improve the response message for queries like this.

Select first or random row in group by

I have this query using PostgreSQL 9.1 (9.2 as soon as our hosting platform upgrades):
SELECT
media_files.album,
media_files.artist,
ARRAY_AGG (media_files. ID) AS media_file_ids
FROM
media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE
playlist_media_files.playlist_id = 1
GROUP BY
media_files.album,
media_files.artist
ORDER BY
media_files.album ASC
and it's working fine, the goal was to extract album/artist combinations and in the result set have an array of media files ids for that particular combo.
The problem is that I have another column in media files, which is artwork.
artwork is unique for each media file (even in the same album) but in the result set I need to return just the first of the set.
So, for an album that has 10 media files, I also have 10 corresponding artworks, but I would like just to return the first (or a random picked one for that collection).
Is that possible to do with only SQL/Window Functions (first_value over..)?
Yes, it's possible. First, let's tweak your query by adding alias and explicit column qualifiers so it's clear what comes from where - assuming I've guessed correctly, since I can't be sure without table definitions:
SELECT
mf.album,
mf.artist,
ARRAY_AGG (mf.id) AS media_file_ids
FROM
"media_files" mf
INNER JOIN "playlist_media_files" pmf ON mf.id = pmf.media_file_id
WHERE
pmf.playlist_id = 1
GROUP BY
mf.album,
mf.artist
ORDER BY
mf.album ASC
Now you can either use a subquery in the SELECT list or maybe use DISTINCT ON, though it looks like any solution based on DISTINCT ON will be so convoluted as not to be worth it.
What you really want is something like an pick_arbitrary_value_agg aggregate that just picks the first value it sees and throws the rest away. There is no such aggregate and it isn't really worth implementing it for the job. You could use min(artwork) or max(artwork) and you may find that this actually performs better than the later solutions.
To use a subquery, leave the ORDER BY as it is and add the following as an extra column in your SELECT list:
(SELECT mf2.artwork
FROM media_files mf2
WHERE mf2.artist = mf.artist
AND mf2.album = mf.album
LIMIT 1) AS picked_artwork
You can at a performance cost randomize the selected artwork by adding ORDER BY random() before the LIMIT 1 above.
Alternately, here's a quick and dirty way to implement selection of a random row in-line:
(array_agg(artwork))[width_bucket(random(),0,1,count(artwork)::integer)]
Since there's no sample data I can't test these modifications. Let me know if there's an issue.
"First" pick
Wouldn't it be simpler / cheaper to just use min():
SELECT m.album
,m.artist
,array_agg(m.id) AS media_file_ids
,min(m.artwork) AS artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
WHERE p.playlist_id = 1
GROUP BY m.album, m.artist
ORDER BY m.album, m.artist;
Abitrary / random pick
If you are looking for a random selection, #Craig already provided a solution with truly random picks.
You could also use a CTE to avoid additional scans on the (possibly big) base table and then run two separate (cheap) subqueries on the small result set.
For arbitrary selection - not truly random, the result will depend on the physical order of rows in the table and implementation-specifics:
WITH x AS (
SELECT m.album, m.artist, m.id, m.artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
)
SELECT a.album, a.artist, a.media_file_ids, b.artwork
FROM (
SELECT album, artist, array_agg(id) AS media_file_ids
FROM x
) a
JOIN (
SELECT DISTINCT ON (1,2) album, artist, artwork
FROM x
) b USING (album, artist);
For truly random results, you can add an ORDER BY .. random() like this to subquery b:
JOIN (
SELECT DISTINCT ON (1, 2) album, artist, artwork
FROM x
ORDER BY 1, 2, random()
) b USING (album, artist);

How to optimize this low-performance MySQL query?

I’m currently using the following query for jsPerf. In the likely case you don’t know jsPerf — there are two tables: pages containing the test cases / revisions, and tests containing the code snippets for the tests inside the test cases.
There are currently 937 records in pages and 3817 records in tests.
As you can see, it takes quite a while to load the “Browse jsPerf” page where this query is used.
The query takes about 7 seconds to execute:
SELECT
id AS pID,
slug AS url,
revision,
title,
published,
updated,
(
SELECT COUNT(*)
FROM pages
WHERE slug = url
AND visible = "y"
) AS revisionCount,
(
SELECT COUNT(*)
FROM tests
WHERE pageID = pID
) AS testCount
FROM pages
WHERE updated IN (
SELECT MAX(updated)
FROM pages
WHERE visible = "y"
GROUP BY slug
)
AND visible = "y"
ORDER BY updated DESC
I’ve added indexes on all fields that appear in WHERE clauses. Should I add more?
How can this query be optimized?
P.S. I know I could implement a caching system in PHP — I probably will, so please don’t tell me :) I’d just really like to find out how this query could be improved, too.
Use:
SELECT x.id AS pID,
x.slug AS url,
x.revision,
x.title,
x.published,
x.updated,
y.revisionCount,
COALESCE(z.testCount, 0) AS testCount
FROM pages x
JOIN (SELECT p.slug,
MAX(p.updated) AS max_updated,
COUNT(*) AS revisionCount
FROM pages p
WHERE p.visible = 'y'
GROUP BY p.slug) y ON y.slug = x.slug
AND y.max_updated = x.updated
LEFT JOIN (SELECT t.pageid,
COUNT(*) AS testCount
FROM tests t
GROUP BY t.pageid) z ON z.pageid = x.id
ORDER BY updated DESC
You want to learn how to use EXPLAIN. This will execute the sql statement, and show you which indexes are being used, and what row scans are being performed. The goal is to reduce the number of row scans (ie, the database searching row by row for values).
You may want to try the subqueries one at a time to see which one is slowest.
This query:
SELECT MAX(updated)
FROM pages
WHERE visible = "y"
GROUP BY slug
Makes it sort the result by slug. This is probably slow.