MySQL statement taking abnormally long

I have two tables in one database, each with about 50,000 to 70,000 rows. Both are MyISAM. The first, yahooprices, contains SKU codes (column code) and pricing (column price) for items. The second, combined_stock, contains partnumber (the same information as code, but sorted differently), price, quantity, and description. Its price column is defined as FLOAT(10,2) and currently set to 0.00. I am attempting to pull the pricing over from yahooprices (also FLOAT(10,2)) to combined_stock using this statement:
UPDATE combined_stock dest
LEFT JOIN (
    SELECT price, code FROM yahooprices
) src ON dest.partnumber = src.code
SET dest.price = src.price;
I know this statement works because I tried it on a smaller test set. Both partnumber and code have non-unique indexes. I also tried indexing price on both tables to see if that would speed the query up. It should finish within seconds, but the last time I ran it, it sat there overnight, and even then I'm fairly certain it never completed. Does anyone have any troubleshooting recommendations?

I would suggest some relatively small changes. First, get rid of the subquery. Second, switch to an inner join:
UPDATE combined_stock dest
JOIN yahooprices src
    ON dest.partnumber = src.code
SET dest.price = src.price;
Finally, create an index on yahooprices(code, price).
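For reference, the index can be created like this (the index name is just an illustration):
CREATE INDEX idx_code_price ON yahooprices (code, price);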
You can keep the LEFT JOIN if you really want the price to be set to NULL when there is no match.

Related

Query optimisation

I have two tables. One holds all the trips that the buses make:
dbo.Courses_Bus
|ID|ID_Bus|ID_Line|DateHour_Start_Course|DateHour_End_Course|
The other holds all the payments made on these buses:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|
The goal is to add the notion of a line to the payments table, to get something like this:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|Line|
So I tried this:
/* I first added a Line column to the dbo.Payments table */
UPDATE Table_A
SET Table_A.Line = Table_B.ID_Line
FROM [dbo].[Payments] AS Table_A
INNER JOIN [dbo].[Courses_Bus] AS Table_B
    ON Table_A.ID_Bus = Table_B.ID_Bus
    AND Table_A.DateHour_Payment BETWEEN Table_B.DateHour_Start_Course AND Table_B.DateHour_End_Course;
And this:
UPDATE Table_A
SET Table_A.Line = Table_B.ID_Line
FROM [dbo].[Payments] AS Table_A
INNER JOIN (
    SELECT P.*, CP.ID_Line AS ID_Line
    FROM [dbo].[Payments] AS P
    INNER JOIN [dbo].[Courses_Bus] AS CP
        ON CP.ID_Bus = P.ID_Bus
        AND CP.DateHour_Start_Course <= P.DateHour_Payment
        AND CP.DateHour_End_Course >= P.DateHour_Payment
) AS Table_B ON Table_A.ID_Bus = Table_B.ID_Bus;
The main problem, apart from the fact that these queries do not seem to work properly, is that each table has several million rows and grows every day, and because of the date-hour filter (mandatory, since a single bus can run on several lines in a day) SQL Server must compare each row of one table against every row of the other.
So it takes a practically infinite amount of time, and that will only get worse every day.
How can I make it work and optimise it?
Assuming that this is the logic you want:
UPDATE p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p
JOIN [dbo].[Courses_Bus] cb
    ON p.ID_Bus = cb.ID_Bus
    AND p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course;
To optimize this query, you want an index on Courses_Bus(ID_Bus, DateHour_Start_Course, DateHour_End_Course).
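For reference, that index could be created like this (the index name is just an illustration):
CREATE INDEX IX_CoursesBus_Bus_Dates
    ON [dbo].[Courses_Bus] (ID_Bus, DateHour_Start_Course, DateHour_End_Course);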
There might be slightly more efficient ways to optimize the query, but your question doesn't have enough information -- is there always exactly one match, for instance?
Another big issue is that updating all the rows at once is quite expensive. You might find that it is better to do this in a loop, one chunk at a time:
UPDATE TOP (10000) p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p
JOIN [dbo].[Courses_Bus] cb
    ON p.ID_Bus = cb.ID_Bus
    AND p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
WHERE p.Line IS NULL;
Once again, though, this structure depends on all the initial values being NULL and an exact match for all rows.
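A minimal driver loop for that chunked approach might look like this (the 10,000 batch size is arbitrary):
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) p
    SET p.Line = cb.ID_Line
    FROM [dbo].[Payments] p
    JOIN [dbo].[Courses_Bus] cb
        ON p.ID_Bus = cb.ID_Bus
        AND p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    WHERE p.Line IS NULL;

    IF @@ROWCOUNT = 0 BREAK; -- stop once no unmatched rows remain
END;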
Thank you Gordon for your answer.
I have investigated and came up with this query:
MERGE [dbo].[Payments] AS p
USING [dbo].[Courses_Bus] AS cb
    ON p.ID_Bus = cb.ID_Bus
    AND p.DateHour_Payment >= cb.DateHour_Start_Course
    AND p.DateHour_Payment <= cb.DateHour_End_Course
WHEN MATCHED THEN
    UPDATE SET p.Line = cb.ID_Line;
It seemed to be the most suitable approach in an MS SQL environment.
It came back with this error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I understand this to mean that it finds several rows that satisfy the same
[p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment >= cb.DateHour_Start_Course AND
p.DateHour_Payment <= cb.DateHour_End_Course]
condition.
Yes, this is a possible case; however, the ID is different each time. For example, two cards might be tapped at the same moment, or a loss of network followed by the equipment syncing later can record several payments with the same timestamp. These are distinct rows that must be treated separately, so you can get, for example:
|ID|ID_Bus|DateHour_Payment|Line|
----------------------------------
|56|204|2021-01-01 10:00:00|15|
|82|204|2021-01-01 10:00:00|15|
How can I improve this query so that it takes the different payment IDs into account?
I can't figure out how to do this with the help I've found online. Maybe this method is not the right one in this context.
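One way to satisfy MERGE's one-source-row rule, following the error message's own suggestion, is to deduplicate the source side so that each payment matches at most one course. This is only a sketch; keeping the earliest matching course per payment is an assumption:
MERGE [dbo].[Payments] AS p
USING (
    SELECT ID, ID_Line
    FROM (
        SELECT p2.ID, cb.ID_Line,
               ROW_NUMBER() OVER (PARTITION BY p2.ID
                                  ORDER BY cb.DateHour_Start_Course) AS rn
        FROM [dbo].[Payments] AS p2
        JOIN [dbo].[Courses_Bus] AS cb
            ON p2.ID_Bus = cb.ID_Bus
            AND p2.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    ) AS x
    WHERE x.rn = 1 -- keep a single candidate course per payment
) AS src
    ON p.ID = src.ID -- payment ID is unique, so each target row matches at most once
WHEN MATCHED THEN
    UPDATE SET p.Line = src.ID_Line;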

Improving Mass Update Query Performance

I'm looking to run some remediation on my database which requires adding some new fields to a table and backfilling them based on specific criteria. There are a LOT of records and I'd prefer the backfill to run relatively fast, as my current query is taking forever to run.
I attempted updating via subqueries, but this doesn't seem to be very performant.
The query below is edited to show the general process (so the syntax might be a little off); hopefully you can understand what I'm trying to do.
I'd like to update every single record in the accounts table.
I'd like to do this by going through each record and running a number of checks against the ID of that record prior to updating. For any records that don't match up in the join, I just want to set the value to 0.
Doing this over the course of a few hundred thousand records seems to take forever. I'm sure there is a faster, more efficient way to go about it. Using Postgres.
UPDATE accounts
SET account_age = a2.account_age
FROM (
    WITH dataPoints AS (
        SELECT account__c.account_id AS account_id,
               COALESCE(account__c.account_age, 0) AS account_age
        FROM account__c
        LEFT OUTER JOIN points ON points.account_id = account__c.id
        LEFT OUTER JOIN goal__c ON goal__c.id = points.goal_id
        GROUP BY account__c.account_id, account__c.account_age
    )
    SELECT account_id,
           MAX(dataPoints.account_age) AS account_age
    FROM dataPoints
    GROUP BY account_id
) AS a2
WHERE accounts.id = a2.account_id;
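One note on the "set non-matching records to 0" requirement: UPDATE ... FROM only touches rows that match the join, so non-matching accounts keep their old value. A simple second pass, assuming unmatched rows are still NULL, would be:
UPDATE accounts
SET account_age = 0
WHERE account_age IS NULL;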

Hive: Can't select one random match on right table in left outer join

EDIT - I don't care about the skew or things being slow. I found out that the slowness was mostly caused by a many-to-many join producing many matches in my left outer join... Please skip down to the bottom.
I have an issue with a skewed table: some join keys appear in far more rows than others. My problem is that more than one key appears in a huge number of rows.
Stats on this table and table I am joining with:
Larger table: totalSize=47431500000, numRows=509500000, rawDataSize=47022050000, 21052 distinct keys
Smaller table: totalSize=1154984612, numRows=13780692, rawDataSize=1141203920, 39313 distinct keys
The smaller table also has repeated rows of keys. The other challenge is that I need to randomly select a matching key from the smaller table.
What I have tried so far:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=1155mb;
and
CREATE TABLE joined_table AS
SELECT * FROM (
    SELECT *
    FROM larger_table AS a
    LEFT OUTER JOIN smaller_table AS b
        ON a.key = b.key
    ORDER BY rand()
) b;
It has been running for a day now.
I thought about manually doing something like this, but I have more than one heavily repeated key, so I would have to make a bunch of tables and merge them. Which I can do if that is my only option :O
But I wanted to reach out to you all on SO first.
Thanks in advance for the help.
EDIT June 20th
I found these settings to try:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 200000;
I had already created a few separate tables to split out and join the most frequent keys, so that the most frequent key in the remainder appeared 200k times. Running the query to join the rest then took 25 minutes and finished all tasks successfully according to the job tracker on the web interface. On the command line in the Hive shell, however, it just sits there, and when I go to check, the table does not exist.
EDIT #2: After a lot of reading and trying out a lot of Hive SQL... the one solution that should have worked in theory did not work; specifically, the ORDER BY rand() never even happened...
CREATE TABLE joined_table AS
SELECT * FROM (
    SELECT *
    FROM larger_table AS a
    JOIN (
        SELECT *, row_number() OVER (PARTITION BY key ORDER BY rand()) AS row_num
        FROM smaller_table
    ) AS b
        ON a.key = b.key
        AND b.row_num = 1
) b;
In the results it is always matched with the first row, not a random row at all...
Any other options or anything I did incorrectly here?
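One variant that may be worth trying (a sketch only): materialise the random number in an inner query so the window function orders by a stable column instead of calling rand() inside the OVER clause. Whether this changes the behaviour depends on how your Hive version evaluates rand() there:
CREATE TABLE joined_table AS
SELECT * FROM (
    SELECT *
    FROM larger_table a
    JOIN (
        SELECT *, row_number() OVER (PARTITION BY key ORDER BY r) AS row_num
        FROM (SELECT *, rand() AS r FROM smaller_table) s
    ) b
        ON a.key = b.key
        AND b.row_num = 1
) t;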

SQL Inner Join respect order of primary table (DBF files DBASE IV)

I'm integrating a solution with a point-of-sale software package. I'm almost done; I just need to get the ordering/sequence correct.
I'm working in VB.net.
This is my query:
SELECT B.DESCRIPT AS description,
       B.REF_NO AS upc,
       A.QUANTY AS quantity,
       ROUND((A.PRICE_PAID * (1 + (C.TAX_PCT / 100))), 2) AS unit_price,
       A.DEL_CODE AS discount_percent
FROM (TABLE.DBF A INNER JOIN MENU.DBF B ON B.REF_NO = A.REF_NO),
     TAXTBL.DBF C
WHERE C.TAX_DESC = 'TAX'
And it yields a result something like this (sequence is an auto-incrementing column added to the DataTable):
The results from the query are being ordered by the upc/REF_NO, due to the inner join between TABLE.DBF and MENU.DBF. When I take out the MENU.DBF components, I get the correct order from TABLE.DBF.
I need to make this query respect the order from TABLE.DBF. The items should be ordered as such:
(time_sent doesn't help because (1) it's a batch of items and (2) multiple items can be added even within the same second)
Thanks for the help.
I am not sure about DBF specifically, but in almost every SQL engine I have run into, row order is an implementation detail, so you should not rely on it and should provide your own ordering.
That means adding an ORDER BY xyz at the end of your query.
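For example, if TABLE.DBF had a column that reflects insertion order (LINE_NO here is purely hypothetical; substitute whatever column your POS software actually maintains), you would append:
ORDER BY A.LINE_NO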

How to perform multiple SQL tasks when using SQL within code (in this case VBScript)

I am hitting a brick wall with something I'm trying to do.
I'm trying to perform a complex query and return the results to a VBScript (.vbs) recordset.
To speed up the query I create temporary tables and then use those tables in the main query (a speed boost of around 1200% over using only subqueries).
The problem is that the calling code seems to ignore the main query, only 'seeing' the result of the very first command (i.e. it returns a 'records affected' figure).
For example, given a query like this..
delete from temp;
select * into temp from sometable where somefield = somefilter;
select sum(someotherfield) from yetanothertable where account in (select * from temp);
The calling code only seems to 'see' the returned result of the delete from temp; I can't access the data that the third command returns.
(Obviously the SQL above is pseudo/fake; the real query is large and its content isn't relevant to the question. I need to solve this problem because, without the temporary table, the query goes from taking 3 seconds to 6 minutes!)
edit: I know I could get around this by making multiple calls to ADODB.Connection's Execute (one call to empty the temp tables, one to populate them again, and finally one to get the data), but I'd rather find a more elegant way of doing it.
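For reference, the quick fix mentioned in the comments (and in the edit below) is to suppress the 'rows affected' messages so the only result the ADO layer sees is the final SELECT. A sketch using the pseudo-query above:
SET NOCOUNT ON;
delete from temp;
select * into temp from sometable where somefield = somefilter;
select sum(someotherfield) from yetanothertable where account in (select * from temp);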
edit 2: Below is the actual SQL code I've ended up with, just for the curiosity of those who replied. It doesn't use NOCOUNT, as I'd already settled on a solution that works for me. It is also probably badly written; it evolved over time from something more basic. I could probably improve it myself, but as it works and returns data extremely quickly I have stuck with it (for now).
Here's the SQL.
Here's the code where it's called. My chosen solution is to run the first query into a third temp table, then run a SELECT * on that table from the code, then a DELETE FROM from the code...
I make no claims about being a 'good' SQL scripter (self-taught out of necessity, mostly), and the database is not very well designed (a mix of old and new tables; the old tables are not relational and store numerical and date values as strings).
Here is the original (slow) query...
select
    name,
    program_name,
    sum(handle) + sum(refund) as [Total Sales],
    sum(refund) as Refunds,
    sum(handle) as [Net Sales],
    sum(credit - refund) as Payout,
    cast(sum(comm) as money) as commission
from
    (select accountnumber, program_name,
            cast(credit_amount as money) as credit,
            cast(refund_amt as money) as refund,
            handle,
            handle * (
                (select commission from amtotecommissions
                 where _date = a._date
                   and pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
                   and program_name = a.program_name) / 100) as comm
     from amtoteaccountactivity a
     where _date = '#yy/#mm/#dd'
       and transaction_type = 'Bet'
       and accountnumber not in ('5067788','5096272') /* just to speed the query up a bit; I know these accounts aren't included */
    ) a,
    ews_db.dbo.amtotetrack t
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers where country = 'US')
       or a.accountnumber in ('5122483','5092147'))
  and t.our_code = a.program_name collate database_default
  and t.tracktype = 2
group by name, program_name
I suspect that with the right SQL and indexes you should be able to get equal performance from a single SELECT; however, there isn't enough information in the original question to give guidance on that.
I think you'll be best off doing this as a stored procedure and calling that.
CREATE PROCEDURE get_Count
    @somefilter int
AS
delete from temp;
select * into temp from sometable where somefield = @somefilter;
select sum(someotherfield) from yetanothertable
where account in (select * from temp);
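Calling it then becomes a single statement (the filter value is just a placeholder):
EXEC get_Count @somefilter = 42;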
However, rewriting to avoid the IN the way you're using it, via a JOIN, will probably fix the performance issue. Use EXPLAIN SELECT to see what's going on and optimise from there. For example, the following
select sum(transactions.value)
from transactions
inner join user on transactions.user = user.id
where user.name = 'Some User'
is much quicker than
select sum(transactions.value)
from transactions
where user in (select id from user where user.name = 'Some User')
because the second example scans every row of the table, whereas in the first the indexes can be used.
Rev1
Taking the slow SQL posted, it appears that there are full table scans going on where the SQL states WHERE .. IN, e.g.
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers))
The above will pull in lots of records which may not be required. This, together with the other nested table selects, prevents the optimiser from pulling in only the records that match, as it could when using a JOIN at the outer level.
When building this type of complex query I generally start with the inner detail, because we need the inner detail before we can perform joins and aggregate operations.
What I mean by this is if you have a typical DB with customers that have orders that create transactions that contain items then I would start with the items and pull in the rest of the detail with joins.
By way of example only I suggest building the query more like the following:
select name,
       program_name,
       SUM(handle) + SUM(refund) AS [Total Sales],
       SUM(refund) AS Refunds,
       SUM(handle) AS [Net Sales],
       SUM(credit - refund) AS Payout,
       CAST(SUM(comm) AS money) AS commission
FROM ews_db.dbo.get_all_customers AS cu
INNER JOIN amtoteaccountactivity AS a ON a.accountnumber = cu.accountnumber
INNER JOIN ews_db.dbo.amtotetrack AS track ON track.our_code = a.program_name
INNER JOIN amtotecommissions AS co ON co.program_name = a.program_name
WHERE cu.country = 'US'
  AND track.tracktype = 2
  AND a.transaction_type = 'Bet'
  AND a._date = '#yy/#mm/#dd'
  AND co.pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
GROUP BY name, program_name, co.commission
NOTE: The above is not functional and is for illustration purposes. I'd need to have the database online to build the real query. I'm hoping you'll get the general idea and build from there.
My top tip for complex queries that don't work is simply to completely start again throwing away what you've already got. Sometimes I will do this three or four times when building a really tricky query.
Always build these queries gradually starting from the most detail and working outwards. Inspect the results at each stage because it helps visualise what the data are.
If you could come to a common data structure for all the selects, you could UNION ALL them together, perhaps selecting a constant in each union so you know where the data came from - kinda like:
select '1', col1, col2, '' from table1
UNION ALL
select '2', col1, col2, col3 from table2
I just solved my original problem (which I ran into again today on a different query) in a slightly hacky way...
Conn.Execute(split(query,";")(0))          ' run the setup statement on its own
set rs = Conn.Execute(split(query,";")(1)) ' run the data query and keep its recordset
Works perfectly!
Edit: I just noticed that the first comment on my original question also provided a quick fix (SET NOCOUNT ON). I forgot about that. So there's that option as well as this one. I had tried to get the query working without the temporary table, but I couldn't get anywhere near the same performance as with it.