Simple join between 3 tables takes lot of time in memsql


I ran the following query in memsql and mysql but the time taken by it is quite different.
Memsql
select count(*) from po A , cu B , tsk C where A.customer_id = B.customer_id and B.taskid = C.id and A.domain = 5 and week(B.post_date) = 22;
+----------+
| count(*) |
+----------+
| 98952 |
+----------+
1 row in set (19.89 sec)
Mysql
select count(*) from po A , cu B , tsk C where A.customer_id = B.customer_id and B.taskid = C.id and A.domain = 5 and week(B.post_date) = 22;
+----------+
| count(*) |
+----------+
| 98952 |
+----------+
1 row in set (0.50 sec)
Why Does memsql perform so badly while mysql is so fast?
Both mysql and memsql are on the same 8GB , quad core machine. memsql has 1 master Aggregator node and 3 leaf nodes.
Does memsql perform badly if there are joins?
UPDATE
From the docs it is clear that a table should have a shard key on the columns it is expected to be joined on often. This allows the optimizer to minimize network traffic during the execution of the query.
So I think this is where I went wrong: instead of a shard key, I had only added a simple primary key on the tables.

Have you tried running the query in MemSQL a second time?
MemSQL compiles and caches the query execution code the first time it sees a query - MemSQL calls it code generation.
http://docs.memsql.com/latest/concepts/codegen/
When you run the query again, you should see a considerable performance speedup.

Related

Why are query results across linked servers of different versions completely wrong?

I have a SQL Server 2005 running a stored procedure which hits other servers running 2008.
A very straightforward query is returning utterly incorrect results.
SELECT
c.acctno, c.provcode, p.provcode, p.provname, c.poscode AS ChargePOS,
pos.poscode, pos.posdesc
FROM
Server2008.charge_t as c
inner join
Server2008.provcode_t as p on c.provcode = p.provcode
inner join
Server2008.poscode_t as pos on c.poscode = pos.poscode
inner join
Server2008.patdemo_t as pat on c.acctno = pat.acctno
left join
Server2008.billareacode_t as b on c.billingarea = b.billareacode
Where
c.proccode in ('G0438', 'G0439', '99420')
and c.correction = 'N'
and (c.priinscode in ('0001', '001A', '001B')
or c.secinscode in ('0001', '001A', '001B'))
and year(c.dateofservice) = year(getdate())
Note the INNER JOIN from poscode_t to charge_t table (second inner join) where c.poscode = pos.poscode. This is very simple, standard stuff here.
When this is executed on the 2005 server, the results are just wrong. I get the following:
acctno | patlname | patfname | ChargeProv | ProvProv | provname | ChargePOS | poscode | posdesc
---------------------------------------------------------------------------------------------------------------------------------------------------------
1 | person1 | Person1 | 28 | 28 | Doctor28 | 07 | 323 | Site323
2 | person2 | person2 | 24 | 24 | Doctor24 | 07 | 323 | Site323
In both examples, the ChargePOS (07) and the poscode (323) are clearly not the same, even though the join should guarantee that they match.
When I run this query on Server2008 itself, the results are correct. When I run it on a 2012 server, the results are correct. They are wrong only when I run it on the 2005 server. It makes no difference what version of SSMS I use.
I've broken the query down to run piece by piece adding in the joins one at a time. If I specify an acctno in the WHERE, the results are correct.
Has anyone seen anything like this? It's like the link itself is bad or there's some sort of junk in a hung transaction out there that's messing with things only on this server. Any ideas where to look are helpful.
Thanks for your time.

This is not a solution to the overall issue, but I've discovered a couple of things that affect the results.
Changing the joins to LEFT corrects the problem when running it on the 2005 server.
Adding the hint OPTION ( MERGE JOIN ) corrects it as well.
None of this explains why it runs properly on all other servers but horribly wrong on this one server. Changing the joins to LEFT didn't alter the execution plan's structure but adding the hint did.
We're to a point where we need to bring in an expert because working around the problem isn't acceptable in this case. I still welcome any ideas for what might be happening here.

Access join on first record

I have two tables in an Access database, tblProducts and tblProductGroups.
I am trying to run a query that joins both of these tables and brings back a single record for each product. The problem is that the current design allows a product to be listed in the tblProductGroups table more than once, i.e. a product can be a member of more than one group (I didn't design this!)
The query is this:
select tblProducts.intID, tblProducts.strTitle, tblProductGroups.intGroup
from tblProducts
inner join tblProductGroups on tblProducts.intID = tblProductGroups.intProduct
where tblProductGroups.intGroup = 56
and tblProducts.blnActive
order by tblProducts.intSort asc, tblProducts.curPrice asc
At the moment this returns results such as:
intID | strTitle | intGroup
1 | Product 1 | 1
1 | Product 1 | 2
2 | Product 2 | 1
2 | Product 2 | 2
Whereas I only want the join to be based on the first matching record, so that would return:
intID | strTitle | intGroup
1 | Product 1 | 1
2 | Product 2 | 1
Is this possible in Access?
Thanks in advance
Al

This option runs a subquery to find the minimum intGroup for each tblProducts.intID.
SELECT tblProducts.intID
, tblProducts.strTitle
, (SELECT TOP 1 intGroup
FROM tblProductGroups
WHERE intProduct=tblProducts.intID
ORDER BY intGroup ASC) AS intGroup
FROM tblProducts
WHERE tblProducts.blnActive
ORDER BY tblProducts.intSort ASC, tblProducts.curPrice ASC

This works for me. Maybe this helps someone:
SELECT
a.Lagerort_ID,
FIRST(a.Regal) AS frstRegal,
FIRST(a.Fachboden) AS frstFachboden,
FIRST(a.xOffset) AS frstxOffset,
FIRST(a.yOffset) AS frstyOffset,
FIRST(a.xSize) AS frstxSize,
FIRST(a.ySize) AS frstySize,
FIRST(a.Platzgr) AS frstyPlatzgr,
FIRST(b.Artikel_ID) AS frstArtikel_ID,
FIRST(b.Menge) AS frstMenge,
FIRST(c.Breite) AS frstBreite,
FIRST(c.Tiefe) AS frstTiefe,
FIRST(a.Fachboden_ID) AS frstFachboden_ID,
FIRST(b.BewegungsDatum) AS frstBewegungsDatum,
FIRST(b.ErzeugungsDatum) AS frstErzeugungsDatum
FROM ((Lagerort AS a)
LEFT JOIN LO_zu_ART AS b ON a.Lagerort_ID = b.Lagerort_ID)
LEFT JOIN Regal AS c ON a.Regal = c.Regal
GROUP BY a.Lagerort_ID
ORDER BY FIRST(a.Regal), FIRST(a.Fachboden), FIRST(a.xOffset), FIRST(a.yOffset);
I have non unique entries for Lagerort_ID on the table LO_zu_ART. My goal was to only use the first found entry from LO_zu_ART to match into Lagerort.
The trick is to use FIRST() on every column except the grouped one. This may also work with MIN() or MAX(), but I have not tested it.
Also make sure the alias given with "AS" differs from the original field name. I used frstFIELDNAME. This is important; otherwise I got errors.

Create a new query, qryFirstGroupPerProduct:
SELECT intProduct, Min(intGroup) AS lowest_group
FROM tblProductGroups
GROUP BY intProduct;
Then JOIN qryFirstGroupPerProduct (instead of tblProductGroups) to tblProducts.
Or you could do it as a subquery instead of a separate saved query, if you prefer.
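For illustration, here is a minimal sketch of the subquery variant. It runs against an in-memory SQLite database from Python rather than Access, so the sample rows and the aliases p and g are assumptions; only the table and column names come from the question.

```python
import sqlite3

# In-memory SQLite stands in for Access here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblProducts (intID INTEGER PRIMARY KEY, strTitle TEXT);
    CREATE TABLE tblProductGroups (intProduct INTEGER, intGroup INTEGER);
    INSERT INTO tblProducts VALUES (1, 'Product 1'), (2, 'Product 2');
    -- Each product belongs to two groups, as in the question.
    INSERT INTO tblProductGroups VALUES (1, 1), (1, 2), (2, 1), (2, 2);
""")

# Join against a derived table holding one row (the lowest group) per product.
rows = conn.execute("""
    SELECT p.intID, p.strTitle, g.lowest_group
    FROM tblProducts AS p
    INNER JOIN (
        SELECT intProduct, MIN(intGroup) AS lowest_group
        FROM tblProductGroups
        GROUP BY intProduct
    ) AS g ON g.intProduct = p.intID
    ORDER BY p.intID
""").fetchall()

for row in rows:
    print(row)
```

Because the derived table has exactly one row per intProduct, the join cannot multiply product rows.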

It's not very optimal, but if you're bringing in a few thousand records this will work:
Create a query that gets the max of tblProducts.intID from one table and call it qry_Temp.
Create another query and join qry_temp to the table you are trying to join against, and you should get your results.

Database size reports differently than the sum of all the tables in SQL Server

I am trying to determine if my SQL Server database is healthy.
I ran a couple of commands to check for the size and I was shocked at the differences reported between the sum of the table sizes and the database size.
I am wondering why there is this large size difference.
EXEC sp_spaceused @updateusage = N'TRUE';
database_name | database_size | unallocated space
FleetEquip |1357.00 MB |0.20 MB
and
EXEC sp_MSforeachtable @command1="EXEC sp_spaceused '?'"
(way too much formatting to include all the tables - an HTML Table would be nice)
name | rows | reserved(KB) | data(KB) | index_size(KB) | unused(KB)
EquipmentState | 131921 | 40648 | 40608 | 8 | 32
the sum of all the tables comes to 45768 KB

You can look at the definition of sp_spaceused with EXEC sp_helptext 'sp_spaceused'
Though I prefer the result format returned by the following actually:
select object_definition(object_id('sp_spaceused')) as [processing-instruction(x)] FOR XML PATH
Can you try the below (based on the aggregate query it contains) and see where the discrepancy lies?
select OBJECT_NAME(p.object_id),
reservedpages = sum(a.total_pages),
usedpages = sum(a.used_pages),
pages = sum(
CASE
-- XML-Index and FT-Index internal tables are not considered "data", but is part of "index_size"
When it.internal_type IN (202,204,211,212,213,214,215,216) Then 0
When a.type <> 1 Then a.used_pages
When p.index_id < 2 Then a.data_pages
Else 0
END
)
from sys.partitions p join sys.allocation_units a on p.partition_id = a.container_id
left join sys.internal_tables it on p.object_id = it.object_id
GROUP BY p.object_id
with rollup

MSSQL allocates space as needed for its tables. However, when rows are removed the DB doesn't "shrink". It's similar to DOS, where you occasionally need to "defrag" the drive. There are tools that allow you to defrag/shrink the DB if needed.

MYSQL - Combining Two Results in One Query

I have a query I need to perform to show search results for a project. I need to sort the results by "horsesActiveDate", and this applies to all of them except any ad with adtypesID=7. Those results are also sorted by date, but they must always appear after all other ads.
So I will have all my ads in the result set be ordered by the Active Date AND adtypesID != 7. After that, I need all adtypesID=7 to be sorted by Active Date and appended at the bottom of all the results.
I'm hoping to put this in one query instead of two and appending them together in PHP. The way the code is written, I have to find a way to get it all in one query.
So here is my original query, which worked great until I had to add adtypesID=7, which has different sorting requirements.
This is the query that exists now that doesn't take into account the adtypesID for sorting.
SELECT
horses.horsesID,
horsesDescription,
horsesActiveDate,
adtypesID,
states.statesName,
horses_images.himagesPath
FROM horses
LEFT JOIN states ON horses.statesID = states.statesID
LEFT JOIN horses_images ON horses_images.himagesDefault = 1 AND horses_images.horsesID = horses.horsesID AND horses_images.himagesPath != ''
WHERE
horses.horsesStud = 0
AND horses.horsesSold = 0
AND horses.horsesID IN
(
SELECT DISTINCT horses.horsesID
FROM horses
LEFT JOIN horses_featured ON horses_featured.horsesID = horses.horsesID
WHERE horses.horsesActive = 1
)
ORDER BY adtypesID, horses.horsesActiveDate DESC
My first thought was to do two queries where one looked for all the ads that did not contain adtypesID=7 and sort those as the query does, then run a second query to find only those ads with adtypesID=7 and sort those by date. Then take those two results and append them to each other. Since I need to get this all into one query, I can't use a php function to do that.
Is there a way to merge the two query results one after the other in mysql? Is there a better way to run this query that will accomplish this sorting?
The Ideal Results would be as below (I modified the column names so they would be shorter):
ID | Description | ActiveDate | adtypesID | statesName | himagesPath
___________________________________________________________________________
3 | Ad Text | 06-01-2010 | 3 | OK | image.jpg
2 | Ad Text | 05-31-2010 | 2 | LA | image1.jpg
9 | Ad Text | 03-01-2010 | 4 | OK | image3.jpg
6 | Ad Text | 06-01-2010 | 7 | OK | image5.jpg
6 | Ad Text | 05-01-2010 | 7 | OK | image5.jpg
6 | Ad Text | 04-01-2010 | 7 | OK | image5.jpg
Any help that can be provided will be greatly appreciated!

I am not sure about the exact syntax in MySQL, but something like
ORDER BY case when adtypesID = 7 then 2 else 1 end ASC, horses.horsesActiveDate DESC
would work in many other SQL dialects.
Note that most SQL dialects allow the order by to not only be a column, but an expression.
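A quick way to check the idea is a small sketch like the one below. It uses an in-memory SQLite database from Python, not MySQL, and the sample rows are made up; the column names are taken from the question.

```python
import sqlite3

# SQLite is used only to demonstrate the ORDER BY expression (an assumption;
# the question targets MySQL). Dates are stored as ISO strings so they sort.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE horses (horsesID INTEGER, adtypesID INTEGER, horsesActiveDate TEXT);
    INSERT INTO horses VALUES
        (6, 7, '2010-06-01'),
        (3, 3, '2010-06-01'),
        (9, 4, '2010-03-01'),
        (2, 2, '2010-05-31'),
        (7, 7, '2010-04-01');
""")

# First sort key: 1 for ordinary ads, 2 for adtypesID = 7, so type 7 goes last.
# Second sort key: newest active date first within each block.
ordered = conn.execute("""
    SELECT horsesID, adtypesID, horsesActiveDate
    FROM horses
    ORDER BY CASE WHEN adtypesID = 7 THEN 2 ELSE 1 END ASC,
             horsesActiveDate DESC
""").fetchall()

for row in ordered:
    print(row)
```

The CASE expression only decides which block a row lands in; the date still orders rows inside each block.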

This should work:
ORDER BY (adtypesID = 7) ASC, horses.horsesActiveDate DESC

Use a Union to append two queries together, like this:
(SELECT whatever FROM wherever WHERE adtypesID != 7 ORDER BY something)
UNION
(SELECT another FROM somewhere WHERE adtypesID = 7 ORDER BY whocares)
http://dev.mysql.com/doc/refman/5.0/en/union.html

I re-wrote your query as:
SELECT h.horsesID,
h.horsesDescription,
h.horsesActiveDate,
adtypesID,
s.statesName,
hi.himagesPath
FROM HORSES h
LEFT JOIN STATES s ON s.statesID = h.statesID
LEFT JOIN HORSES_IMAGES hi ON hi.horsesID = h.horsesID
AND hi.himagesDefault = 1
AND hi.himagesPath != ''
LEFT JOIN HORSES_FEATURED hf ON hf.horsesID = h.horsesID
WHERE h.horsesStud = 0
AND h.horsesSold = 0
AND h.horsesActive = 1
ORDER BY (adtypesID = 7) ASC, h.horsesActiveDate DESC
The IN subquery, using a LEFT JOIN and such, will mean that any horse record whose horsesActive value is 1 will be returned - regardless if they have an associated HORSES_FEATURED record. I leave it to you for checking your data to decide if it should really be an INNER JOIN. Likewise for the STATES table relationship...
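To see the difference between the two join types, here is a rough sketch run against an in-memory SQLite database from Python. The three horses and the single featured row are invented test data; only the table and column names follow the question.

```python
import sqlite3

# Hypothetical sample data: horses 1 and 2 are active, only horse 1 is featured.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE horses (horsesID INTEGER PRIMARY KEY, horsesActive INTEGER);
    CREATE TABLE horses_featured (horsesID INTEGER);
    INSERT INTO horses VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO horses_featured VALUES (1);
""")

def active_ids(join_type):
    """Return the active horse IDs produced by the given join type."""
    sql = f"""
        SELECT DISTINCT h.horsesID
        FROM horses AS h
        {join_type} JOIN horses_featured AS hf ON hf.horsesID = h.horsesID
        WHERE h.horsesActive = 1
        ORDER BY h.horsesID
    """
    return [row[0] for row in conn.execute(sql)]

left_ids = active_ids("LEFT")    # featured table does not filter anything
inner_ids = active_ids("INNER")  # only horses with a featured row survive

print("LEFT JOIN :", left_ids)
print("INNER JOIN:", inner_ids)
```

With the LEFT JOIN, the horses_featured table has no effect on which IDs come back, which is why it may be either unnecessary or meant to be an INNER JOIN.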

SQL magic - query shouldn't take 15 hours, but it does

Ok, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to extract from each subgroup the records with the highest date_updated and calculate a weighted average in each group. The calculation runs for ~15 hours, and I have a strong feeling I'm doing it wrong.
First, monstrous table layout:
category
element_id
date_updated
value
weight
source_prefix
source_name
Only key here is on element_id (BTREE, ~8k unique elements).
And calculation process:
Make hash for each group and subgroup.
CREATE TEMPORARY TABLE `temp1` (INDEX ( `subcat_hash` ))
SELECT `category`,
`element_id`,
`source_prefix`,
`source_name`,
`date_updated`,
`value`,
`weight`,
MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'
I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, i presume.
Find maximum date for each subgroup
CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
Join temp1 with temp2 to find weighted average values for categories
CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
`temp1`.`category`,
`temp1`.`source_prefix`,
`temp1`.`source_name`,
`temp1`.`date_updated`,
AVG(`temp1`.`value`) AS `avg_value`,
SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
(now that i looked through it and wrote it all down, it seems to me that i should use INNER JOIN in that last query (to avoid 900k*900k temp table)).
Still, is there a normal way to do so?
UPD: EXPLAIN for proposed solution:
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cur | ALL | NULL | NULL | NULL | NULL | 893085 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | next | ref | prefix | prefix | 1074 | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id | 1 | 100.00 | Using where |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+

Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join; a 900k-row table with some aggregates certainly doesn't look like one of those cases.
Based on your comment, this query might do what you are looking for:
SELECT cur.source_prefix,
cur.source_name,
cur.category,
cur.element_id,
MAX(cur.date_updated) AS DateUpdated,
AVG(cur.value) AS AvgValue,
SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
ON next.date_updated < '2009-05-01'
AND next.source_prefix = cur.source_prefix
AND next.source_name = cur.source_name
AND next.element_id = cur.element_id
AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name,
cur.category, cur.element_id
The GROUP BY performs the calculations per source+category+element.
The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE statement filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
There are many ways of filtering out old entries, but this one tends to perform reasonably well.
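As a sanity check of the "no later entry exists" filter, here is a small sketch against an in-memory SQLite database from Python. The four sample rows, the alias nxt, and the index name idx_latest are assumptions made for the demo. Unlike the query above, it groups only by (category, element_id), so that the weights of different sources actually enter the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigbigtable (
        category TEXT, element_id INTEGER, date_updated TEXT,
        value REAL, weight REAL, source_prefix TEXT, source_name TEXT
    );
    -- Index suggested in the answer; the name idx_latest is made up.
    CREATE INDEX idx_latest
        ON bigbigtable (source_prefix, source_name, element_id, date_updated);
    INSERT INTO bigbigtable VALUES
        ('c1', 1, '2009-04-01', 10.0, 1.0, 'p', 'a'),  -- superseded by next row
        ('c1', 1, '2009-04-20', 20.0, 1.0, 'p', 'a'),  -- latest for source a
        ('c1', 1, '2009-04-10', 40.0, 3.0, 'p', 'b'),  -- latest for source b
        ('c1', 1, '2009-05-05', 99.0, 1.0, 'p', 'b');  -- after the cutoff
""")

# The self LEFT JOIN looks for a later row from the same source; rows for
# which none exists (nxt side is NULL) are the latest ones before the cutoff.
result = conn.execute("""
    SELECT cur.category,
           cur.element_id,
           AVG(cur.value) AS avg_value,
           SUM(cur.value * cur.weight) / SUM(cur.weight) AS rating
    FROM bigbigtable AS cur
    LEFT JOIN bigbigtable AS nxt
           ON nxt.date_updated < :cutoff
          AND nxt.source_prefix = cur.source_prefix
          AND nxt.source_name = cur.source_name
          AND nxt.element_id = cur.element_id
          AND nxt.date_updated > cur.date_updated
    WHERE cur.date_updated < :cutoff
      AND nxt.element_id IS NULL
    GROUP BY cur.category, cur.element_id
""", {"cutoff": "2009-05-01"}).fetchall()

print(result)
```

Only the 2009-04-20 row (value 20, weight 1) and the 2009-04-10 row (value 40, weight 3) survive the filter, so the rating is (20*1 + 40*3) / 4 = 35.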

Ok, so 900K rows isn't a massive table. It's reasonably big, but your queries really shouldn't be taking that long.
First things first, which of the 3 statements above is taking the most time?
The first problem I see is with your first query. Your WHERE clause doesn't include an indexed column. So this means that it has to do a full table scan on the entire table.
Create an index on the "date_updated" column, then run the query again and see what that does for you.
If you don't need the hashes and are only using them to avail of the dark magic, then remove them completely.
Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
or
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
