Find string similarities between two dimensions in SQL

I have two tables and I want to find matches where a value in one table also appears in the other.
In table A I have a list of search queries made by users, and in table B I have a selection of search queries I want to find. To make this work I want to use a method similar to:
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY('shackleford', 'shackelford') FROM DUAL
I have tried the following, but it does not work, since there can be differences between the query and the name in the selection:
SELECT query FROM search_log WHERE query IN (SELECT navn FROM selection_table);
Are there any best practice methods for finding similarities through a query?

One approach might be something like:
SELECT SEARCH_LOG.QUERY
FROM   SEARCH_LOG
WHERE  EXISTS (
         SELECT NULL
         FROM   SELECTION_TABLE
         WHERE  UTL_MATCH.JARO_WINKLER_SIMILARITY(SEARCH_LOG.QUERY, SELECTION_TABLE.NAVN) >= 98
       );
This will return rows in SEARCH_LOG that have a row in SELECTION_TABLE where NAVN matches QUERY with a score of at least 98 (out of 100). You could change the 98 to whatever threshold you prefer.
This is a "brute force" approach because it potentially looks at all combinations of rows. So, it might not be "best practice", but it might still be practical. If performance is important, you might consider a more sophisticated solution like Oracle Text.


Any resources for this SQL filtering?

I have 100 tables, each on the order of a few tenths of a GB in size. The schema of each table is the following:
A: string | B: string | C: string
In each table I would like to retain only the rows for which the pair (B, C) appears at least 10 times in the concatenation of all 100 tables. Is there any efficient way to achieve this?
This is a very vague question, and leaving out your DBMS doesn't help either, as SQL comes in different dialects.
But first, you would have to combine all of the tables; there may be a faster way of doing this, but without knowing which flavor of SQL you are using it is hard to tell.
Something like this will work (use UNION ALL rather than UNION, so duplicate rows are not collapsed before you count them):
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
...
UNION ALL
SELECT * FROM table_100
Once you have all of the data in one place (say, in aggregated_tables) you do something like this:
WITH tables_with_counts AS (
  SELECT A, B, C,
         COUNT(1) OVER (PARTITION BY B, C) AS bc_count
  FROM   aggregated_tables
)
SELECT A, B, C
FROM   tables_with_counts
WHERE  bc_count >= 10
Here is my take:
Step 1: Aggregate all tables into one. It will be bulky, but if you are using an Oracle database, I think that shouldn't be an issue.
Step 2: Create MD5 checksum hash values for the B, C columns, like below:
SELECT APEX_ITEM.MD5_CHECKSUM(B,C) md5_cks,
A,B,C
FROM aggregated_tables
Step 3: Take a count based on the checksum values and retain the rows where the count is at least 10.
Step 4: Get rid of duplicate data using RANK() or DENSE_RANK() in a DELETE statement. (A combined sketch of steps 2-4 follows.)
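A minimal sketch putting steps 2-4 together might look like this. It assumes APEX is installed (APEX_ITEM.MD5_CHECKSUM is part of the APEX API) and that the combined data already sits in aggregated_tables; the CTE names and everything else are illustrative only.
WITH hashed AS (
  SELECT A, B, C,
         APEX_ITEM.MD5_CHECKSUM(B, C) AS md5_cks   -- step 2
  FROM   aggregated_tables
),
counted AS (
  SELECT A, B, C, md5_cks,
         COUNT(*)     OVER (PARTITION BY md5_cks)                  AS bc_count,
         ROW_NUMBER() OVER (PARTITION BY A, B, C ORDER BY md5_cks) AS rn
  FROM   hashed
)
SELECT A, B, C
FROM   counted
WHERE  bc_count >= 10   -- step 3: keep (B, C) pairs seen at least 10 times
  AND  rn = 1;          -- step 4: drop exact duplicate rows, if that is desired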
The short answer, which I'm sure that you don't want to hear, is "no." In the context of relational databases there is no efficient query to merge 100 tables.
It is not all bad news though. If it were just one table (let's say it was named "combined", just to have a concrete example) you could use an elegant piece of SQL with window functions:
select A, B, C
from (
  select A, B, C, count(1) over (partition by B, C) as counts
  from combined
) counted
where counts >= 10
Option 1. So the question is how to get a "combined" table so that the snippet above works. If we stick with ANSI (standard) SQL, you could use UNION ALL and collect it into a WITH clause to keep things neat.
Here is an example:
with
combined as (
select * from table_1
union all
select * from table_2),
counted as (
select
A,B,C,
count(1) over (partition by B,C) as counts
from
combined)
select A,B,C from counted where counts>=10;
I only included 2 tables, but the real query would extend that up to table_100. That's a lot of typing and not very efficient with the programmer's time. Also, UNIONs and UNION ALLs are notoriously poor performers for databases, so this is not efficient in terms of system resources or time either. Personally I would not do it this way, but it is an answer.
Option 2. There are other options which do not exactly match your question, but may be helpful to know. Any time you are tempted to create multiple tables with exactly the same schema, you will be better off creating a single table with multiple partitions (see the documentation for MySQL, Postgres, SQL Server, Oracle, or Hive). Every database platform has its own syntax for partitioning tables, but they are all similar. Here, each of the original tables becomes a single partition, and the original table name is a natural candidate for the string value in the partition identifier (partition column).
If you are able to stuff all of your 100 tables into 100 partitions of one table then you can run the first query after all. The advantage is that the database can optimize that query because all modern databases are optimized to manage partitioned queries.
In addition, adding a partition to a table is really no more trouble than creating a new table instead, but supporting and maintaining one table is a lot less trouble than 100 tables.
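As a rough illustration of option 2 (Oracle-style LIST partitioning; the table, column, and partition names are all made up), the combined table could be keyed by a column holding the original table name:
CREATE TABLE combined (
  source_table VARCHAR2(30) NOT NULL,   -- which original table the row came from
  A            VARCHAR2(4000),
  B            VARCHAR2(4000),
  C            VARCHAR2(4000)
)
PARTITION BY LIST (source_table) (
  PARTITION p_table_1 VALUES ('table_1'),
  PARTITION p_table_2 VALUES ('table_2')
  -- ... one partition per original table, up to table_100
);

-- load each original table into its own partition
INSERT INTO combined (source_table, A, B, C)
SELECT 'table_1', A, B, C FROM table_1;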
A third option, since you tagged "big data", is to use a big data engine like Spark with Spark SQL. This is arguably the best fit, because you can load a dataframe with all 100 tables combined very efficiently in Spark, and the SQL after that is not much different from the relational database SQL we have been considering. That's somewhat out of scope here, but worth considering; if you submit a more specific question about Spark, we could go into more detail.

MS Access SQL - Removing Duplicates From Query

MS Access SQL - This is a generic performance-related duplicates question. So, I don't have a specific example query, but I believe I have explained the situation below clearly and simply in 3 statements.
I have a standard/complex SQL query that selects many columns; some computed, some with an asterisk, and some by name - e.g. (tab1.*, (tab2.col1 & tab2.col2) AS computedFld1, tab3.col4, etc.).
This query Joins about 10 tables. And the Where clause is based on user specified filters that could be based on any of the fields present in all 10 tables.
Based on these filters, I can sometimes get records with the same tab4.ID value.
Question: What is the best way to eliminate duplicate result rows with the same tab4.ID value? I don't care which rows get eliminated; they will differ in unimportant ways.
Or, if important, they will differ in that they will have different tab5.ID values; and I want to keep the result rows with the LARGEST tab5.ID values.
But if the first query performs better than the second, then I really don't care which rows get eliminated. The performance is more important.
I have worked on this most of the morning and I am afraid that the answer is above my pay grade. I have tried GROUP BY tab4.ID, but then I can't use "*" in the SELECT clause; I have tried many other things and just keep bumping my head against a wall.
Access does not support CTEs but you can do something similar with saved queries.
So first, alias the columns that have the same names in your query, something like:
SELECT tab4.ID AS tab4_id, tab5.ID AS tab5_id, ........
and then save your query for example as myquery.
Then you can use this saved query like this:
SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2 WHERE q2.tab4_id = q1.tab4_id)
This will return 1 row for each tab4_id if there are no duplicate tab5_ids for each tab4_id.
If there are duplicates, then you must provide an additional tie-breaking condition.
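As a rough sketch of such an additional condition (Access SQL; some_unique_id is a hypothetical column that is unique per row and aliased in the saved query like the IDs above):
SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2
                    WHERE q2.tab4_id = q1.tab4_id)
  AND q1.some_unique_id = (SELECT MAX(q3.some_unique_id) FROM myquery AS q3
                           WHERE q3.tab4_id = q1.tab4_id
                             AND q3.tab5_id = q1.tab5_id);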

Is there a way to search within multiple tables in SQL?

Right now I have 100 tables in SQL and I am looking for a specific string value in all of them, and I do not know which column it is in.
select * from table1, table2 where column1 = 'MyLostString' will not work because I do not know which column it has to be in.
Is there a SQL query for that, or must I brute-force search every table, column by column, for 'MyLostString'?
If I were to brute-force search across all tables, is there an efficient query for that?
For instance:
select * from table3 where allcolumns = MyLostString
It is a defining feature of an RDBMS (or at least one of them) that the meaning of a value depends on the column it is in. E.g., the value 17 will have quite different meanings if it stands in a customer_id column than in the product_id column of a fictional orders table.
This means that an RDBMS is not well equipped to search for a value no matter which column of which table it might be in.
My recommendation is to first study the data model and try to find out which column of which table should be holding the value. If this really fails, you have a problem much worse than a "lost string".
The last resort is to transform the DB into something better suited for full-text search ... such as a flat file. You might want to try mydbexportcommand --options | grep -C10 'My lost string' or friends.

Optimized way to get x Random rows satisfying given criteria in MySQL

I need to get x rows from a database table which satisfy some given criteria.
I know that we can get random rows from MySQL using ORDER BY RAND():
SELECT * FROM `vids` WHERE `cat` = n ORDER BY RAND() LIMIT x
I am looking for the most optimized way to do the same. (Low usage of system resources is the main priority; the next priority is the speed of the query.) Also, in the table design, should I make 'cat' an INDEX?
I'm trying to think of how to do this too. My thinking at the moment is the following three alternatives:
1) Select random rows ignoring the criteria, then throw out the ones that do not match at the application level, and select more random rows if needed. This method will be effective if your criteria match lots of rows in your table, perhaps 20% or more (you would need to benchmark).
2) Select rows following the criteria, and choose a row based on a random number between 1 and count(*) (with the random number determined in the application). This will be effective if the data matching the criteria is evenly distributed, but it will fail terribly if, for example, you are selecting a date range and the majority of random numbers fall upon records outside this range.
3) My current favourite, but also the most work. For every combination of criteria you intend to use to select a random record, you insert a record into a special table for those criteria. You then select random records from the special table and follow them back to your data. For example, you might have a table like this:
Table cat: name, age, eye_colour, fur_type
If you want to be able to select random cats with brown fur, then you need a table like this:
Table cats_with_brown_fur: id (autonumber), cat_fk
You can then select a random record from this table based on the autonumber id; it will be fast and will produce evenly distributed random results. Of course, if you select from many sets of criteria, you will have some overhead in maintaining these tables (see the sketch below).
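A minimal sketch of option 3, assuming MySQL and a cats table with an integer primary key id (the cats_with_brown_fur name comes from above; everything else is illustrative):
CREATE TABLE cats_with_brown_fur (
  id     INT AUTO_INCREMENT PRIMARY KEY,
  cat_fk INT NOT NULL              -- points at cats.id
);

-- keep the helper table in sync whenever brown-furred cats are added
INSERT INTO cats_with_brown_fur (cat_fk)
SELECT c.id FROM cats AS c WHERE c.fur_type = 'brown';

-- pick a random number between 1 and MAX(id) in the application, then:
SELECT c.*
FROM   cats_with_brown_fur AS b
JOIN   cats AS c ON c.id = b.cat_fk
WHERE  b.id >= 42                  -- 42 = the application-chosen random number
ORDER  BY b.id
LIMIT  1;
Using >= with ORDER BY and LIMIT 1 copes with gaps left by deletes, at the cost of a slight bias towards ids that follow a gap.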
That's my current take on it, anyway. Good luck
Order by Rand() is a bad idea.
Here's a better solution:
How can i optimize MySQL's ORDER BY RAND() function?
Google is your friend; a lot of people have explained it better than I ever could.
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
http://www.phpbuilder.com/board/showthread.php?t=10338930
http://www.paperplanes.de/2008/4/24/mysql_nonos_order_by_rand.html

Multiple SQL searches vs searching through one returned array

Is it faster to do multiple SQL finds on one table with different conditions, or to get all the items from the table in an array and then separate them out that way? I realize I'm not explaining my question that well, so here is an example:
I want to pull records on posts and display them in categories based on when they were posted, say within one year, within one month, one week, etc. The nature of the categories means that lower-level categories are entirely contained within upper-level ones.
Should I do a SQL find with different conditions for each category, resulting in multiple calls to the database, or should I do one search, returning all of the items, and then sort them out from the array? Thanks for your responses; sorry, I'm new at this.
Typically I would say that you are going to get better performance by letting your database engine do the sorting work. Each database engine has this functionality and typically it can do it faster than you can.
So I would vote to use the database to get your multiple groups rather than trying to do it yourself in memory.
I typically perform one large SQL query and then break the array up in Ruby to minimize the number or duration of database connections.
This isn't necessarily any faster, and I have never benchmarked it, but fewer reads from the db hopefully means it will scale further.
Edit: Never mind, I didn't quite understand the question. Just let SQL perform the ordering for you in a convenient fashion and then process the array yourself.
You can probably make it even easier if you let your SELECT statement generate helper columns to say which categories (e.g., based on the date) a record belongs to.
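For example (a sketch only, assuming MySQL and a posts table with a date column, as in the UNION answer below), a CASE expression can label each row with its narrowest category in a single pass:
SELECT p.*,
       CASE
         WHEN p.date > NOW() - INTERVAL 7 DAY   THEN 'week'
         WHEN p.date > NOW() - INTERVAL 1 MONTH THEN 'month'
         WHEN p.date > NOW() - INTERVAL 1 YEAR  THEN 'year'
         ELSE 'older'
       END AS category
FROM   posts AS p
ORDER  BY p.date DESC;
The application then groups on category, remembering that a post in the 'week' bucket also belongs to the month and year categories if the display needs that nesting.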
The simplest approach, and the easiest to understand, would be to perform multiple queries, one for each criterion, and then form each of those result sets into a group. I don't think you want to start traversing result sets and duplicating rows.
If you really want to do it in one query, you could try a UNION query.
SELECT *, 1 AS grp FROM posts WHERE date > '2009-07-24 00:00:00'
UNION ALL
SELECT *, 2 AS grp FROM posts WHERE date > '2009-07-17 00:00:00'
UNION ALL
SELECT *, 3 AS grp FROM posts WHERE date > '2009-06-24 00:00:00'
UNION ALL
SELECT *, 4 AS grp FROM posts WHERE date > '2008-07-24 00:00:00'
ORDER BY grp, date DESC
After that you just need to traverse the list once and filter into new lists by the "grp" column. (The alias avoids the reserved word GROUP, and the single ORDER BY at the end replaces per-branch ones, which a UNION would ignore or reject.)
It depends. If you're using OR operators in your procedures, then things could get kind of slow. It would be better at that point to use multiple SQL statements.
But really, you need to analyze the query plans and decide for yourself if it is efficient enough or not. Run real world examples and TEST TEST TEST.
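For instance (a MySQL-style sketch; the query is the one-week bucket from the UNION example above), checking a plan is as simple as:
-- look for whether an index on date is used, or whether a full scan / filesort shows up
EXPLAIN
SELECT * FROM posts WHERE date > '2009-07-17 00:00:00' ORDER BY date DESC;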