Is there a way to search within multiple tables in SQL?

Right now I have 100 tables in SQL, and I am looking for a specific string value in all of them. I do not know which column it is in.
select * from table1, table2 where column1 = 'MyLostString' will not work, because I do not know which column the value is in.
Is there a SQL query for that, or must I brute-force search every column of every table for 'MyLostString'?
If I were to brute-force search across all tables, is there an efficient query for that?
For instance:
select * from table3 where allcolumns = MyLostString

It is a defining feature of an RDBMS (or at least one of them) that the meaning of a value depends on the column it is in. For example, the value 17 means something quite different in the customer_id column than in the product_id column of a fictional orders table.
As a consequence, RDBMSs are not well equipped to search for a value regardless of which column of which table it might be in.
My recommendation is to first study the data model and try to work out which column of which table should hold the value. If this really fails, you have a problem much worse than a "lost string".
The last resort is to transform the DB into something better suited to full-text search, such as a flat file. You might want to try mydbexportcommand --options | grep -C10 'My lost string' or friends.
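If a brute-force search really is needed, it can be scripted by reading the catalog and probing each column in turn. The following is only a minimal sketch, assuming SQLite and Python's standard sqlite3 module rather than whatever server the question is about; the find_value helper and the sample tables are made up for illustration. Other systems expose the table and column list differently (for example through INFORMATION_SCHEMA.COLUMNS).

```python
import sqlite3

def find_value(conn, needle):
    """Return (table, column, match_count) for every column that holds needle."""
    hits = []
    # List user tables from the SQLite catalog (assumption: SQLite, not another RDBMS).
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            count = conn.execute(
                f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" = ?', (needle,)
            ).fetchone()[0]
            if count:
                hits.append((table, column, count))
    return hits

# Hypothetical sample data, only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, note TEXT);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (17, 'MyLostString');
""")
print(find_value(conn, "MyLostString"))
```

This issues one query per column, so on 100 large tables it will be slow; it is a one-off diagnostic, not something to run routinely.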

Related

Select query to find all unique keys from a table where any word from a list appears in a free text field?

Looking for a little bit of SQL-foo to help find the most efficient way to do this query.
I have a table with two columns, ID and a small character field (<300 chars). The ID field is not unique, and I would like the result to be a distinct list of ID numbers. I also have an input list of words that I want to query on, say 'foo', 'bar' as the base case. For a result to be valid, it also must have at least one matching row for each word that is input.
What is a clean and efficient way to write this as one query? I am also open to multiple queries if there is no single-query way to execute it efficiently.
Please note that in the specific environment I am working with I cannot use more than 10 subqueries, and I may have 10 or more words provided as input (although I may be able to limit the input to 10 as long as the user is aware of this). Also note that I cannot use the 'IN' clause if it is possible that the list of values in it grows to be larger than a few thousand. I am querying a table with potentially millions of ID-text pairs.
Thanks for any and all advice!
Use a UDF that returns a table:
Consider writing a user-defined function (UDF) that takes a string containing all values that you wish to search for, separated by a delimiter. The UDF would split the data in the string and return it as a table. Then, include the table that the UDF returns as a join on the table in question.
Here's an example: http://everysolution.wordpress.com/2011/07/28/udf-to-split-a-delimited-string-and-return-it-as-a-table/
If that small character field is always one word and you're looking for an exact match with a word in your list, I don't see why the below would not work. That is, if you're looking for IDs with 'foo', do you want only IDs that are 'foo', or might there be 'fooish', which should also be a match? In the latter case this won't work, in the former it should.
The query below assumes:
That your 2 column table is called "tbl"
That you can put the list of these 'input' words into a table; in my example below this other table is called "othertbl". It should contain however many words you're searching on, and it can be over 1,000 (the exists subquery doesn't have that limitation)
As stated before, I am assuming you are looking for exact matches on the 2nd column of "tbl", not partial or fuzzy matches
For performance reasons, you'll want to ensure that tbl.wordfield and othertbl.word are indexed (whatever the column names actually are)
select distinct id
from tbl
where exists
(select 'x' from othertbl where othertbl.word = tbl.wordfield)
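The EXISTS query above returns an id as soon as any one of the input words matches. The question also asks for at least one matching row for each input word, which is usually handled by grouping and counting the distinct matched words. Here is a rough sketch of that idea, assuming the same exact-match interpretation and the same "tbl" and "othertbl" names; it runs on SQLite through Python's sqlite3 module purely for demonstration, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (id INTEGER, wordfield TEXT);
    CREATE TABLE othertbl (word TEXT PRIMARY KEY);
    INSERT INTO tbl VALUES
        (1, 'foo'), (1, 'bar'),
        (2, 'foo'),
        (3, 'bar'), (3, 'foo'), (3, 'baz');
    INSERT INTO othertbl VALUES ('foo'), ('bar');
""")

# Keep only ids whose rows cover every word in othertbl:
# the number of distinct matched words must equal the number of input words.
ids_with_every_word = [row[0] for row in conn.execute("""
    SELECT t.id
    FROM tbl AS t
    JOIN othertbl AS o ON o.word = t.wordfield
    GROUP BY t.id
    HAVING COUNT(DISTINCT o.word) = (SELECT COUNT(*) FROM othertbl)
    ORDER BY t.id
""")]
print(ids_with_every_word)
```

This uses a single subquery no matter how many words are supplied, and no IN list, so it should stay within the limits mentioned in the question.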

Index on VARCHAR column

I have a table of 32,589 rows, and one of the columns is called 'Location' and is a Varchar(40) column type. The column holds a location, which is actually a suburb, all uppercase text.
A function that uses this table does a:
IF EXISTS(SELECT * FROM MyTable WHERE Location = 'A Suburb')
...
Would it be beneficial to add an index to this column, for efficiency? This is more of a read-only table, so there are not many edits or inserts except for maintenance.
Without an index, SQL Server will have to perform a table scan to find the first instance of the location you're looking for. You might get lucky and have the value be in one of the first few rows, but it could be at row 32,000, which would be a waste of time. Adding an index only takes a few seconds and you'll probably see a big performance gain.
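The scan-versus-index difference can be observed by asking the engine for its query plan before and after creating the index. The sketch below is an illustration only: it uses SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module, not SQL Server, and the table contents and the index name idx_mytable_location are made up. On SQL Server the equivalent check would be the execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, Location VARCHAR(40))")
# Fill the table with as many rows as in the question (synthetic values).
conn.executemany(
    "INSERT INTO MyTable (Location) VALUES (?)",
    [(f"SUBURB {i}",) for i in range(32589)],
)

def plan(conn):
    """Return SQLite's plan description for the lookup used by the function."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT 1 FROM MyTable WHERE Location = ?",
        ("SUBURB 32000",),
    ).fetchall()
    # Column 3 of each plan row is the human-readable detail text.
    return " | ".join(row[3] for row in rows)

plan_before = plan(conn)   # expected to describe a full table scan
conn.execute("CREATE INDEX idx_mytable_location ON MyTable (Location)")
plan_after = plan(conn)    # expected to describe a search using the new index

print(plan_before)
print(plan_after)
```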
I concur with @Brian Shamblen's answer.
Also, try using TOP 1 in the inner select
IF EXISTS(SELECT TOP 1 * FROM MyTable WHERE Location = 'A Suburb')
You don't have to select all the records matching your criteria for EXISTS, one is enough.
An opportunistic approach to performance tuning is usually a bad idea.
To answer the specific question - if your function is using location in a where clause, and the table has more than a few hundred rows, and the values in the location column are not all identical, creating an index will speed up your function.
Whether you notice any difference is hard to say - there may be much bigger performance problems lurking in the database, and you might be fixing the wrong problem.

Find string similarities between two dimensions in SQL

I have two tables, and I want to find values in one table that also appear, possibly with small differences, in the second.
In table A I have a list of search queries by users, and in table B I have a list of selected search queries I want to find. To make this work I want to use a method similar to:
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY('shackleford', 'shackelford') FROM DUAL
I have tried the following query, but it does not work because there can be a difference between the query and the name in the selection:
SELECT query FROM search_log WHERE query IN (SELECT navn FROM selection_table);
Are there any best practice methods for finding similarities through a query?
One approach might be something like:
SELECT
    SEARCH_LOG.QUERY
FROM
    SEARCH_LOG
WHERE
    EXISTS
    (
        SELECT
            NULL
        FROM
            SELECTION_TABLE
        WHERE
            UTL_MATCH.JARO_WINKLER_SIMILARITY(SEARCH_LOG.QUERY, SELECTION_TABLE.NAVN) >= 98
    );
This will return rows in SEARCH_LOG that have a row in SELECTION_TABLE where NAVN matches QUERY with a score of at least 98 (out of 100). You could change the 98 to whatever threshold you prefer.
This is a "brute force" approach because it potentially looks at all combinations of rows. So, it might not be "best practice", but it might still be practical. If performance is important, you might consider a more sophisticated solution like Oracle Text.
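To experiment with the same EXISTS pattern without an Oracle instance, a similarity function can be registered in SQLite from Python. This is only a sketch: it swaps in difflib.SequenceMatcher from the Python standard library in place of Jaro-Winkler, so the scores will not match UTL_MATCH, and the similarity function name, the threshold of 85 and the sample rows are assumptions chosen for illustration.

```python
import sqlite3
from difflib import SequenceMatcher

def similarity(a, b):
    """Score two strings from 0 to 100 (difflib ratio, NOT Jaro-Winkler)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name "similarity".
conn.create_function("similarity", 2, similarity)
conn.executescript("""
    CREATE TABLE search_log (query TEXT);
    CREATE TABLE selection_table (navn TEXT);
    INSERT INTO search_log VALUES ('shackleford'), ('zebra');
    INSERT INTO selection_table VALUES ('shackelford');
""")

# Same brute-force shape as the Oracle query: every log row is compared
# against selection rows until one scores at or above the threshold.
matches = [row[0] for row in conn.execute("""
    SELECT query
    FROM search_log
    WHERE EXISTS (
        SELECT NULL
        FROM selection_table
        WHERE similarity(search_log.query, selection_table.navn) >= 85
    )
""")]
print(matches)
```

As in the Oracle version, the cost grows with the product of the two table sizes, so this is practical only for modest row counts.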

MySQL - Selecting data from multiple tables all with same structure but different data

OK, here is my dilemma: I have a database set up with about 5 tables, all with the exact same data structure. The data is separated in this manner for localization purposes and to split up a total of about 4.5 million records.
A majority of the time only one table is needed and all is well. However, sometimes data is needed from 2 or more of the tables and it needs to be sorted by a user defined column. This is where I am having problems.
data columns:
id, band_name, song_name, album_name, genre
MySQL statement:
SELECT * from us_music, de_music where `genre` = 'punk'
MySQL spits out this error:
#1052 - Column 'genre' in where clause is ambiguous
Obviously, I am doing this wrong. Anyone care to shed some light on this for me?
I think you're looking for the UNION clause, a la
(SELECT * from us_music where `genre` = 'punk')
UNION
(SELECT * from de_music where `genre` = 'punk')
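Since the question also needs the combined rows sorted by a user-chosen column, an ORDER BY can be applied to the whole UNION. The following is a small sketch run on SQLite through Python's sqlite3 module rather than MySQL; the sample rows and the extra "source" column are invented for illustration. It uses UNION ALL, which keeps duplicate rows, whereas plain UNION removes them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE us_music (id INTEGER, band_name TEXT, song_name TEXT,
                           album_name TEXT, genre TEXT);
    CREATE TABLE de_music (id INTEGER, band_name TEXT, song_name TEXT,
                           album_name TEXT, genre TEXT);
    INSERT INTO us_music VALUES (1, 'Band B', 'Song 1', 'Album 1', 'punk');
    INSERT INTO us_music VALUES (2, 'Band C', 'Song 2', 'Album 2', 'jazz');
    INSERT INTO de_music VALUES (1, 'Band A', 'Song 3', 'Album 3', 'punk');
""")

# Each branch filters its own table; the final ORDER BY sorts the combined result.
# The literal 'us' / 'de' column records which table each row came from.
rows = conn.execute("""
    SELECT 'us' AS source, band_name, song_name FROM us_music WHERE genre = 'punk'
    UNION ALL
    SELECT 'de' AS source, band_name, song_name FROM de_music WHERE genre = 'punk'
    ORDER BY band_name
""").fetchall()
print(rows)
```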
It sounds like you'd be happier with a single table. The five tables having the same schema, and sometimes needing to be presented as if they came from one table, point to putting it all in one table.
Add a new column which can be used to distinguish among the five languages (I'm assuming it's language that is different among the tables since you said it was for localization). Don't worry about having 4.5 million records. Any real database can handle that size no problem. Add the correct indexes, and you'll have no trouble dealing with them as a single table.
Any of the above answers are valid, or an alternative way is to qualify the column name with the table name, e.g.:
SELECT * from us_music, de_music where `us_music`.`genre` = 'punk' AND `de_music`.`genre` = 'punk'
Note that listing both tables this way pairs every matching us_music row with every matching de_music row, so a UNION is usually what you want here.
The column is ambiguous because it appears in both tables. You would need to qualify the WHERE (or sort) field fully, such as us_music.genre or de_music.genre, but you would usually only specify two tables if you were then going to join them together in some fashion. The structure you're dealing with is occasionally referred to as a partitioned table, although partitioning is usually done to separate the data set into distinct files as well, rather than just to split it arbitrarily. If you're in charge of the database structure and there's no good reason to partition the data, then I'd build one big table with an extra "origin" field that contains a country code, but you're probably doing it for a legitimate performance reason.
Either use a UNION to combine the tables you're interested in (http://dev.mysql.com/doc/refman/5.0/en/union.html) or use the MERGE storage engine (http://dev.mysql.com/doc/refman/5.1/en/merge-storage-engine.html).
Your original attempt to span both tables creates an implicit JOIN. This is frowned upon by most experienced SQL programmers because it separates the list of tables to be combined from the conditions that say how to combine them.
The UNION is a good solution for the tables as they are, but there should be no reason they can't be put into the one table with decent indexing. I've seen adding the correct index to a large table increase query speed by three orders of magnitude.
The UNION statement can take a long time on huge data sets. It is better to perform the select in 2 steps:
first select the ids
then select from the main table using those ids

Is there a difference between Select * and Select [list each col] [duplicate]

I'm using MS SQL Server 2005. Is there a difference, to the SQL engine, between
SELECT * FROM MyTable;
and
SELECT ColA, ColB, ColC FROM MyTable;
When ColA, ColB, and ColC represent every column in the table?
If they are the same, is there a reason why you should use the 2nd one anyway? I have a project that's heavy on LINQ, and I'm not sure if the standard SELECT * it generates is a bad practice, or if I should always add a .Select() to it to specify which cols I want.
EDIT: Changed "When ColA, ColB, and ColC are all the columns to the table?" to "When ColA, ColB, and ColC represent every column in the table?" for clarity.
Generally, it's better to be explicit, so Select col1, col2 from Table is better. The reason being that at some point, an extra column may be added to that table, and would cause unneeded data to be brought back from the query.
This isn't a hard and fast rule though.
1) The second one is more explicit about which columns are returned. The value of the 2nd one then is how much you value explicitly knowing which columns come back.
2) This involves potentially less data being returned when there are more columns than the ones explicitly used as well.
3) If you change the table by adding a new column, the first query changes and the second does not. If you have code like "for all columns returned do ..." then the results change if you use the first, but not the 2nd.
I'm going to get a lot of people upset with me, but especially if I'm adding columns later on, I usually like to use SELECT * FROM table. I've been called lazy for this, because if I make any modifications to my tables, I'd rather not track down all the stored procs that use that table, and just change it in the data access layer classes in my application. There are cases in which I will specify the columns, but in the case where I'm trying to get a complete "object" from the database, I'd rather just use the "*". And, yes, I know people will be hating me for this, but it has allowed me to be quicker and more bug-free while adding fields to my applications.
The two sides of the issue are this: Explicit column specification gives better performance as new columns are added, but * specification requires no maintenance as new columns are added.
Which to use depends on what kind of columns you expect to add to the table, and what the point of the query is.
If you are using your table as a backing store for an object (which seems likely in the LINQ-to-SQL case), you probably want any new columns added to this table to be included in your object, and vice-versa. You're maintaining them in parallel. For this reason, for this case, * specification in the SELECT clause is right. Explicit specification would give you an extra bit of maintenance every time something changed, and a bug if you didn't update the field list correctly.
If the query is going to return a lot of records, you are probably better off with explicit specification for performance reasons.
If both things are true, consider having two different queries.
You should specify an explicit column list. SELECT * will bring back more columns than you need creating more IO and network traffic, but more importantly it might require extra lookups even though a non-clustered covering index exists (On SQL Server).
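The covering-index point can be illustrated with a query plan. This is a sketch on SQLite via Python's sqlite3 module, not SQL Server, so the plan wording differs; the index name idx_cola is made up, and the explicit query here deliberately asks only for the indexed column. When every selected column is in the index, the engine can answer from the index alone, while SELECT * has to visit the table rows as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ColA TEXT, ColB TEXT, ColC TEXT)")
conn.execute("CREATE INDEX idx_cola ON MyTable (ColA)")

def plan(sql):
    """Return SQLite's plan description for a one-parameter query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("x",)).fetchall()
    # Column 3 of each plan row is the human-readable detail text.
    return " | ".join(row[3] for row in rows)

# SELECT * needs ColB and ColC, which are not stored in the index.
plan_star = plan("SELECT * FROM MyTable WHERE ColA = ?")
# Selecting only ColA can be satisfied entirely from the index.
plan_explicit = plan("SELECT ColA FROM MyTable WHERE ColA = ?")

print(plan_star)
print(plan_explicit)
```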
Some reasons not to use the first statement (select *) are:
If you add some large fields (a BLOB column would be very bad) later to that table, you could suffer performance problems in the application
If the query was a JOIN query with two or more tables, some of the fields could have the same name. It would be better to assure that your field names are different.
The purpose of the query is clearer with the second statement from a programming aesthetics viewpoint
When you select each field individually, it is more clear which fields are actually being selected.
SELECT * is a bad practice in most places.
What if someone adds a 2gb BLOB column to that table?
What if someone adds really any column to that table?
It's a bug waiting to happen.
A couple things:
A good number of people have posted here recommending against using *, and given several good reasons for those answers. Out of 10 other responses so far only one doesn't recommend listing columns.
People often make exceptions to that rule when posting to help sites like StackOverflow, because they often don't know what columns are in your table or are important to your query. For that reason, you'll see a lot of code here and elsewhere on the web that uses the * syntax, even though the poster would tend to avoid it in his own code.
It's good for forward compatibility.
When you use
SELECT * FROM myTable
and "myTable" has 3 columns, you get the same results as
SELECT Column1, Column2, Column3 FROM myTable
But if you add a new column in the future, you get different results.
Of course, if you rename one of the existing columns, in the first case you still get results and in the second case you get an error (which, I think, is the correct behaviour for an application).
If your code relies on certain columns being in a certain order, you need to list the columns. If not, it doesn't really make a difference if you use "*" or write the column names out in the select statement.
An example is if you insert a column into a table.
Take this table:
ColA ColB ColC
You might have a query:
SELECT *
FROM myTable
Then the code might be:
rs = executeSql("SELECT * FROM myTable")
while (rs.read())
    Print "Col A" + rs[0]
    Print "Col B" + rs[1]
    Print "Col C" + rs[2]
If you add a column between ColB and ColC, the query wouldn't return what you're looking for.
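The positional problem described above can be reproduced in a few lines. This is a sketch using Python's sqlite3 module; because SQLite only appends new columns at the end, the "before" and "after" layouts are modelled as two separate, hypothetical tables (before_change and after_change) instead of altering one table in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be read by position or by column name
conn.executescript("""
    CREATE TABLE before_change (ColA TEXT, ColB TEXT, ColC TEXT);
    CREATE TABLE after_change  (ColA TEXT, ColB TEXT, ColNew TEXT, ColC TEXT);
    INSERT INTO before_change VALUES ('a', 'b', 'c');
    INSERT INTO after_change  VALUES ('a', 'b', 'new', 'c');
""")

old_row = conn.execute("SELECT * FROM before_change").fetchone()
new_row = conn.execute("SELECT * FROM after_change").fetchone()

# Position 2 silently changes meaning once a column is inserted before ColC.
by_position = (old_row[2], new_row[2])
# Reading by name still finds ColC in both layouts.
by_name = (old_row["ColC"], new_row["ColC"])
# An explicit column list keeps the positions stable as well.
explicit = tuple(conn.execute("SELECT ColA, ColB, ColC FROM after_change").fetchone())

print(by_position, by_name, explicit)
```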
For LinqToSql, if you plan to modify those records later, you should pull the whole record into memory.
It depends on what you mean by "difference". There is the obvious syntax difference, but the real difference is one of performance.
When you say SELECT * FROM MyTable, you are telling the SQL query engine to return a data set with all of the columns from that table, while SELECT ColA, ColB, ColC FROM MyTable tells the query engine to return a data set with only ColA, ColB, and ColC from the table.
Say you have a table with 100 columns defined as CHAR[10]. SELECT * will return 100 columns * 10 bytes worth of data while SELECT ColA, ColB, ColC will return 3 columns * 10 bytes worth of data. This is a huge size difference in the amount of data that is being passed back across the wire.
Specifying the column list also makes it much clearer what columns you are interested in. The drawback is that if you add/remove a column from the table you need to ensure that the column list is updated as well, but I think that's a small price compared to the performance gain.
SELECT * FROM MyTable
SELECT * is dependent on the column order in the schema, so if you refer to the result set by the index # of the collection, you may be looking at the wrong column.
SELECT Col1,Col2,Col3 FROM MyTable
This query will give you a collection that stays the same over time, but how often are you changing the column order anyway?
A quick look at the query execution plan shows that the queries are the same.
The general rule of thumb is that you will want to limit your queries to only the fields that you need returned.
Selecting each column is better than just * because, if you add or delete a column, you HAVE to look at the code and check what you were doing with the retrieved data.
Also, it helps you understand your code better and allows you to use aliases as column names (in case you're performing a join of tables with a column sharing the name)
An example as to why you never (imho) should use SELECT *. This does not relate to MSSQL, but rather MySQL. Versions prior to 5.0.12 returned columns from certain types of joins in a non-standard manner. Of course, if your queries define which columns you want and in which order, you have no problem. Imagine the fun if they don't.
(One possible exception: Your query SELECTs from just one table and you identify columns in your programming language of choice by name rather than position.)
Using "SELECT *" optimizes for programmer typing. That's it. That's the only advantage.