I have a query that is filtered on a list of order numbers. The actual field for the order number is 9 characters long (char). However, occasionally the system that the end users get their order numbers from will prepend an extra 0 or a single alpha character to the beginning of the order number. I am trying to account for that in the existing SQL, and although it runs, it takes far longer than it should (and sometimes won't finish at all).
Is the approach I am taking below the best way to account for these differences?
order number field example:
066005485,
066005612
example of what may be entered and I need to account for:
0066005485,
A066005612
Here is what I have tried that seems to not work or at least be EXTREMELY slow:
SELECT S.order_no AS 'contract_no',
S.SIZE_INDEX AS 'technical_index',
S.open_qty AS 'contract_open_qty',
S.order_qty AS 'contract_order_qty',
E.excess,
(S.order_qty - E.excess) AS 'new_contract_size_qty'
FROM EXCESS E
JOIN SIM S ON RIGHT(E.GPS_CONTRACT_NUMBER,9) = S.order_no AND E.[AFS TECH INDEX] = S.size_index
WHERE S.order_no IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697')
OR CONCAT('0',S.order_no) IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697')
ORDER BY S.order_no,
S.size_index
Any thoughts on something that may work better, or something I am missing?
I can't do anything about the nasty join that requires the RIGHT function. If you have any influence over the database designers, it could be fruitful to either have that key (E.GPS_CONTRACT_NUMBER) cleaned up before it is put into the table, or get them to add another field where the RIGHT(E.GPS_CONTRACT_NUMBER,9) has already been performed and an index can be created.
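If they are open to the second option, a computed column is one way to do it. The sketch below is T-SQL and the new names in it are made up; adjust for your actual dialect and permissions:

-- Sketch only: persist the trailing nine characters once and index them,
-- so the join can seek on an index instead of computing RIGHT() per row.
ALTER TABLE EXCESS
    ADD GPS_CONTRACT_NO_9 AS RIGHT(GPS_CONTRACT_NUMBER, 9) PERSISTED;

CREATE INDEX IX_EXCESS_GPS_CONTRACT_NO_9
    ON EXCESS (GPS_CONTRACT_NO_9);

The join condition could then become E.GPS_CONTRACT_NO_9 = S.order_no.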
But there is definitely something you can do to remove the CONCAT function calculation and take advantage of any index on S.order_no. I noticed your WHERE clause looks like order_no IN listofvals OR CONCAT('0', order_no) IN samelistofvals. So instead of adding a zero onto order_no, remove a zero from everything in the IN list.
Where order_no IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697',
'066003816','066003817','066005485','066005612','066005390','066005616','066005617','066005952','066005798','066006673','066005802','066006196','066006197','066006199','066006205','066006697')
Notice that the IN-list is on two lines and the second line is just the first repeated with the leading 0 removed and any entry beginning with "A" removed entirely. This simplifies the Where clause and allows use of indexes, if any exist.
If the efficiency problem is in the WHERE clause (leaving the JOIN aside), you can try to improve the situation with the limited pattern matching that LIKE offers:
WHERE
S.order_no LIKE '[A0]06600%'
OR
S.order_no LIKE '06600%'
Warning: this pattern will also match strings that end in other digits (e.g. 8648).
Does it work for you?
I have a Google Spreadsheet and I want to run a QUERY function. But I want the WHERE statement to check a series of values. I'm basically looking for what I would use an IN statement for in SQL - what is the IN equivalent in Google Spreadsheets? So right now I have:
=QUERY(Sheet1!A3:AB50,"Select A,B, AB WHERE B='"& G4 &"'")
And that works. But what I really need is the equivalent of:
=QUERY(Sheet1!A3:AB50,"Select A,B, AB WHERE B='"& G4:G7 &"'")
And of course, that statement fails. How can I get the where against a range of values? These are text values, if that makes a difference.
Great trick Zolley! Here's a small improvement.
Instead of:
=CONCATENATE(G3,"|",G4,"|",G5,"|",G6,"|",G7)
we can use
=TEXTJOIN("|",1,G3:G7)
That also allows us to work with bigger arrays when adding every cell into the formula one by one just doesn't make sense.
UPD:
Going further, I composed the two formulas together to eliminate the helper cell, and here we go:
=QUERY(Sheet1!A3:AB50,"Select A,B, AB WHERE B matches '^.*(" & TEXTJOIN("|",1,G3:G7) & ").*$'")
Used this in my own project and it worked perfectly!
Although I don't know the perfect answer for that, I could find a workaround for the problem, which can be used on a small amount of data (hope that's the case here :) )
First step: You should create a "helper cell" in which you concatenate the G4:G7 cells with a "|" character:
=CONCATENATE(G3,"|",G4,"|",G5,"|",G6,"|",G7) - Let's say it's the content of the cell H2.
Now, you should change your above query as follows:
=QUERY(Sheet1!A3:AB50,"Select A,B, AB WHERE B matches '^.*(" & H2 & ").*$'")
This should do the trick. Basically, the "matches" operator allows the use of regular expressions, which allow alternatives separated by the | symbol. For example (with made-up values), if G4:G7 held apple, pear and fig, then H2 would contain apple|pear|fig and the condition would effectively become WHERE B matches '^.*(apple|pear|fig).*$'.
As I said, it's a workaround, it has drawbacks, but I hope it'll help.
I had the same question and came across your question in my research. Your CONCATENATE idea got me thinking. The problem is that the range I'm using in my where clause is dynamic; the number of rows changes. That got me onto the JOIN function, which led me to this:
=QUERY(Sheet1!A3:AB50,"select A where B = "&JOIN(" OR B = ",ARRAY_CONSTRAIN($A$5:A,COUNTIF($G$4:$G,">0"),1)))
COUNTIF counts the number of rows with data. I'm using numbers in this range; if your values are text, "<>''" may be more appropriate than ">0". ARRAY_CONSTRAIN creates an array containing only the cells that have data. JOIN turns that range into query-language text for the where clause.
The where clause needs to start with essentially the same text that is used as the delimiter in the JOIN function. Note that I am using numbers, so I don't need '' around the values. This is what works in my spreadsheet and I hope it helps.
We have a massive, multi-table Sybase query we call the get_safari_exploration_data query, which fetches all sorts of info related to explorers going on a safari, and all the animals they encounter.
This query is slow, and I've been asked to speed it up. The first thing that jumps out at me is that there doesn't seem to be a pressing need for the nested SELECT statement inside the outer FROM clause. In that nested SELECT, there also seem to be several fields that aren't necessary (vegetable, broomhilda, devoured, etc.). I'm also skeptical about the use of the old-style joins ("*=" instead of "INNER JOIN ... ON").
SELECT
dog_id,
cat_metadata,
rhino_id,
buffalo_id,
animal_metadata,
has_4_Legs,
is_mammal,
is_carnivore,
num_teeth,
does_hibernate,
last_spotted_by,
last_spotted_date,
purchased_by,
purchased_date,
allegator_id,
cow_id,
cow_name,
cow_alias,
can_be_ridden
FROM
(
SELECT
mp.dog_id as dog_id,
ts.cat_metadata + '-yoyo' as cat_metadata,
mp.rhino_id as rhino_id,
mp.buffalo_id as buffalo_id,
mp.animal_metadata as animal_metadata,
isnull(mp.has_4_Legs, 0) as has_4_Legs,
isnull(mp.is_mammal, 0) as is_mammal,
isnull(mp.is_carnivore, 0) as is_carnivore,
isnull(mp.num_teeth, 0) as num_teeth,
isnull(mp.does_hibernate, 0) as does_hibernate,
jungle_info.explorer as last_spotted_by,
exploring_journal.spotted_datetime as last_spotted_date,
early_jungle_info.explorer as purchased_by,
early_exploreration_journal.spotted_datetime as purchased_date,
alleg_id as allegator_id,
ho.cow_id,
ho.cow_name,
ho.cow_alias,
isnull(mp.is_ridable,0) as can_be_ridden,
ts.cat_metadata as broomhilda,
ts.squirrel as vegetable,
convert (varchar(15), mp.rhino_id) as tms_id,
0 as devoured
FROM
mammal_pickles mp,
very_tricky_animals vt,
possibly_venomous pv,
possibly_carniv_and_tall pct,
tall_and_skinny ts,
tall_and_skinny_type ptt,
exploration_history last_exploration_history,
master_exploration_journal exploring_journal,
adventurer jungle_info,
exploration_history first_exploration_history,
master_exploration_journal early_exploreration_journal,
adventurer early_jungle_info,
hunting_orders ho
WHERE
mp.exploring_strategy_id = 47
and mp.cow_id = ho.cow_id
and ho.cow_id IN (20, 30, 50)
and mp.rhino_id = vt.rhino_id
and vt.version_id = pv.version_id
and pv.possibly_carniv_and_tall_id = pct.possibly_carniv_and_tall_id
and vt.tall_and_skinny_id = ts.tall_and_skinny_id
and ts.tall_and_skinny_type_id = ptt.tall_and_skinny_type_id
and mp.alleg_id *= last_exploration_history.exploration_history_id
and last_exploration_history.master_exploration_journal_id *= exploring_journal.master_exploration_journal_id
and exploring_journal.person_id *= jungle_info.person_id
and mp.first_exploration_history_id *= first_exploration_history.exploration_history_id
and first_exploration_history.master_exploration_journal_id *= early_exploreration_journal.master_exploration_journal_id
and early_exploreration_journal.person_id *= early_jungle_info.person_id
) TEMP_TBL
So I ask:
Am I correct about the nested SELECT?
Am I correct about the unnecessary fields inside the nested SELECT?
Am I correct about the structure/syntax/usage of the joins?
Is there anything else about the structure/nature of this query that jumps out at you as being terribly inefficient/slow?
Unfortunately, unless there is irrefutable, matter-of-fact proof that decomposing this large query into smaller queries is beneficial in the long run, management will simply not approve refactoring it out into multiple, smaller queries, as this will take considerable time to refactor and test. Thanks in advance for any help/insight here!
Am I correct about the nested SELECT?
You would be in some cases, but a competent planner would collapse it and ignore it here.
Am I correct about the unnecessary fields inside the nested SELECT?
Yes, especially considering that some of them don't show up at all in the final list of fields.
Am I correct about the structure/syntax/usage of the joins?
Insofar as I'm aware, *= and =* are merely syntactic sugar for a left and right join, but I might be wrong in stating that. If not, then they merely force the way joins occur, but they may be necessary for your query to work at all.
Is there anything else about the structure/nature of this query that jumps out at you as being terribly inefficient/slow?
Yes.
Firstly, you've some calculations that aren't needed, e.g. convert (varchar(15), mp.rhino_id) as tms_id. Perhaps a join or two as well, but I admittedly haven't looked at the gory details of the query.
Next, you might have a problem with the db design itself, e.g. a cow_id field. (Seriously? :-) )
Last, there occasionally is something to be said about doing multiple queries instead of a single one, to avoid doing tons of joins.
In a blog, for instance, it's usually a good idea to grab the top 10 posts, and then to use a separate query to fetch their tags (where id in (id1, id2, etc.)). In your case, the selective part seems to be around here:
mp.exploring_strategy_id = 47
and mp.cow_id = ho.cow_id
and ho.cow_id IN (20, 30, 50)
so maybe isolate that part in one query, and then build an in () clause using the resulting IDs, and fetch the cosmetic bits and pieces in one or more separate queries.
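A sketch of that split, reusing the names from the query above (the ... and the ID list are placeholders the application would fill in):

-- Step 1: run just the selective part and collect the keys.
SELECT mp.rhino_id, mp.cow_id
FROM mammal_pickles mp, hunting_orders ho
WHERE mp.exploring_strategy_id = 47
  and mp.cow_id = ho.cow_id
  and ho.cow_id IN (20, 30, 50)

-- Step 2: fetch the cosmetic bits for only those keys.
SELECT ...
FROM very_tricky_animals vt, ...
WHERE vt.rhino_id IN (<ids from step 1>)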
Oh, and as pointed out by Gordon, check your indexes as well. But then, note that the indexes may end up being of little use without splitting the query into more manageable parts.
I would suggest the following approach.
First, rewrite the query using ANSI standard joins with the on clause. This will make the conditions and filtering much easier to understand. Also, this is "safe" -- you should get exactly the same results as the current version. Be careful, because the *= is an outer join, so not everything is an inner join.
I doubt this step will improve performance.
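To illustrate the shape of the rewrite (a partial sketch covering only the first few tables, not the full query):

FROM mammal_pickles mp
INNER JOIN hunting_orders ho
    ON mp.cow_id = ho.cow_id
INNER JOIN very_tricky_animals vt
    ON mp.rhino_id = vt.rhino_id
LEFT OUTER JOIN exploration_history last_exploration_history
    ON mp.alleg_id = last_exploration_history.exploration_history_id
LEFT OUTER JOIN master_exploration_journal exploring_journal
    ON last_exploration_history.master_exploration_journal_id = exploring_journal.master_exploration_journal_id
WHERE mp.exploring_strategy_id = 47
  AND ho.cow_id IN (20, 30, 50)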
Then, check each of the reference tables and be sure that the join keys have indexes on them in the reference table. If keys are missing, then add them in.
Then, check whether the left outer joins are necessary. There are filters on tables that are left outer joined in; these filters convert the outer joins to inner joins. Probably not a performance hit, but you never know.
Then, consider indexing the fields used for filtering (in the where clause).
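For illustration only (the index names here are invented; check what already exists before adding anything):

CREATE INDEX ix_vt_rhino_id ON very_tricky_animals (rhino_id)
CREATE INDEX ix_pv_version_id ON possibly_venomous (version_id)
CREATE INDEX ix_mp_strategy ON mammal_pickles (exploring_strategy_id)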
And, learn how to use the explain capabilities. Any nested-loop joins without an index are likely culprits for performance problems.
As for the nested select, I think Sybase is smart enough to "do the right thing". Even if it wrote out and re-read the result set, that probably would have a marginal effect on the query compared to getting the joins right.
If this is your real data structure, by the way, it sounds like a very interesting domain. It is not often that I see a field called allegator_id in a table.
I will answer some of your questions.
You think that the fields (vegetable, broomhilda, devoured) in the nested SELECT could be causing the performance issue. Not necessarily. The two unused fields (vegetable, broomhilda) in the nested SELECT come from the ts table, but the cat_metadata field, which is used, also comes from ts. So unless cat_metadata is covered by the index used on the ts table, there won't be any performance impact: to extract cat_metadata, the data page from the table needs to be fetched anyway, and extracting the other two fields from it takes only a little CPU. So don't worry about that. The 'devoured' field is a constant, so it will not affect performance either.
Dennis pointed out the usage of the convert function, convert(varchar(15), mp.rhino_id). I disagree that it will affect performance, as it only consumes CPU.
Lastly I would say, try using set table count 13, as there are 13 tables in there. By default Sybase considers only four tables at a time when optimising the join order.
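That is, issued in the session before the query runs (Sybase ASE syntax, to the best of my knowledge):

set table count 13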
Apologies for the somewhat confusing Title, I've been struggling to find an answer to my question, partly because it's hard to concisely describe it in the title line or come up with a good search string for it. Anyhoooo, here's the problem I'm facing:
Short version of the question is:
How can I write the following (invalid but understandable SQL) in valid SQL understood by Oracle:
select B.REPLACER as COL, A.* except A.COL from A join B on a.COL = B.COL;
Here's the long version (if you already know what I want from reading the short version, you don't need to read this :P ):
My (simplified) task is to come up with a service that massages a table's data and provides it as a sub-query. The table has a lot of columns (a few dozen or more), and I am stuck with using "select *" rather than explicitly listing out all columns one by one, because new columns may be added to or removed from the table without me knowing, although my downstream systems will know and adjust accordingly.
Say, this table (let's call it Table A from now on) has a column called "COL", and we need to replace the values in that COL with the value in the REPLACER column of table B where the two COL value matches.
How do I do this? I cannot rename the column because the downstream systems expect "COL"; I cannot do without the "except A.COL" part because that would make the SQL ambiguous.
Appreciate your help, almighty StackOverflow
Ray
You can either use table.* or table.fieldName.
There is no syntax available for table.* (except field X).
This means that you can only get what you want by explicitly listing all of the fields...
select
A.field1,
A.field2,
B.field3,
A.field4,
etc
from
A join B on a.COL = B.COL;
This means that you may need to re-model your data so as to ensure you don't keep getting new fields. OR write dynamic SQL: interrogate the database to find out the column names, use code to write a query like the one above, and then run that dynamically generated query.
Try this: (not tested)
select case
when B.COL is null then A.COL
else B.REPLACER
end as COLAB, A.*
from A left join B on A.COL = B.COL;
This gets B.REPLACER when a matching B.COL = A.COL exists, and A.COL otherwise. You can add more columns to the select (e.g. col1, col2) or use A.* (the alias COLAB distinguishes the result from the A.COL inside A.*).
As said before, you cannot specify in regular SQL which column not to select. You could write a procedure for that, but it would be quite complex, because you would need to return a variable table type; probably something with ref cursor magic.
The closest I could come up with is joining with USING. This gives you the column col once, as the first field, followed by all the other columns in a and b. So basically not what you want. :)
select *
from a
join b using (col)
Let's start from first principles: select * from .... is a bug waiting to happen and has no place in production code. Of course everybody uses it because it entails less typing, but that doesn't make it a good practice.
Beyond that, the ANSI SQL standard doesn't support select * except col1 from .... syntax. I know a lot of people wish it would but it doesn't.
There are a couple of ways to avoid excessive typing. One is to generate the query by selecting from the data dictionary, using one of the views like USER_TAB_COLUMNS. It is worth writing the PL/SQL block to do this if you need lots of queries like this.
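Even without PL/SQL, a one-off query can generate the projection for you; paste its output into your statement (a sketch; the trailing comma on the last line needs trimming by hand):

SELECT 'A.' || column_name || ','
FROM   user_tab_columns
WHERE  table_name = 'A'
AND    column_name <> 'COL'
ORDER  BY column_id;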
A simpler hack is to use the SQL*Plus DESCRIBE command to list out the structure of table A, then cut'n'paste it into an IDE which supports regular expressions and edit the columns into the query's projection.
Both these options might strike you as laborious, but frankly either workaround (and especially the second) would have taken less effort than asking StackOverflow. You'll know better next time.
I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing substring matching is Aho-Corasick.
Although you will have to read the data out of the database to use it, it is tremendously fast when compared to more naive methods.
See here for a related question on substring matching:
And here for an Aho-Corasick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others. For example, it seems strange that the DB would be built in such a way in the first place: you're having to do this work now because the DB is not in an appropriate format, so it may make more sense to 'fix' the DB or the way in which it is built instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within), you would break down the other phrase in the same way and count the number of entries common between them.
It has the nice side effect of still matching if you were to use (for example) "see phrases to testing": the individual words would match, but because the order is different the pairs wouldn't. Since consecutive-word pairs are taken into account at the same time, the number of matches would be lower, which makes the count useful as a matching 'score'.
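If those entries were materialised into a table, the scoring could look something like this sketch (the schema phrase_terms(phrase_id, term), one row per entry above, is assumed):

-- Count how many single-word/word-pair entries each pair of phrases shares.
SELECT a.phrase_id, b.phrase_id, COUNT(*) AS score
FROM phrase_terms a
JOIN phrase_terms b
  ON b.term = a.term
 AND b.phrase_id <> a.phrase_id
GROUP BY a.phrase_id, b.phrase_id
ORDER BY score DESC;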
As I say, that kind of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on how exactly "phrase match" works (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What did I mean by the WHERE statement above? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter one.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you can remove them from the comparison table, as no later test will ever match them.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
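As a set-based illustration of the sorting idea (the schema phrases(id, phrase) is assumed, and real phrase matching may need word-boundary logic rather than this plain substring test):

-- Each phrase is only ever tested against strictly longer candidates.
SELECT s.id AS contained_id, l.id AS container_id
FROM phrases s
JOIN phrases l
  ON length(l.phrase) > length(s.phrase)
 AND l.phrase LIKE '%' || s.phrase || '%';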
DTS or SSIS may be your friend here as well.