Database design - efficient text searching

I have a table that contains URL strings, i.e.
/A/B/C
/C/E
/C/B/A/R
Each string is split into tokens, where the separator in my case is '/'. I then assign an integer value to each token and put them into a dictionary (a different database table), i.e.
A : 1
B : 2
C : 3
E : 4
D : 5
G : 6
R : 7
My problem is to find the rows in the first table that contain a given sequence of tokens. An additional problem is that my input is a sequence of ints, i.e. I have
3, 2
and I'd like to find the following rows
/A/B/C
/C/B/A/R
How can I do this efficiently? By this I mean: how should I design a proper database structure?
I use PostgreSQL, and the solution should work well for 2 million rows in the first table.
To clarify my example: I need both 'B' AND 'C' to be in the URL. Also, 'B' and 'C' can occur in any order in the URL.
I need an efficient SELECT. INSERT does not have to be efficient. I do not have to do all the work in SQL, if this changes anything.
Thanks in advance

I'm not sure how to do this, but I'm just giving you an idea that might be useful. You already have your initial table. You process it and create the token table:
+------------+---------+
| TokenValue | TokenId |
+------------+---------+
| A | 1 |
| B | 2 |
| C | 3 |
| E | 4 |
| D | 5 |
| G | 6 |
| R | 7 |
+------------+---------+
That's ok for me. Now, what I would do is to create a new table in which I would match the original table with the tokens of the token table (OrderedTokens). Something like:
+-------+---------+---------+
| UrlID | TokenId | AnOrder |
+-------+---------+---------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 3 |
| 2 | 5 | 1 |
| 2 | 2 | 2 |
| 2 | 1 | 3 |
| 2 | 7 | 4 |
| 3 | 3 | 1 |
| 3 | 4 | 2 |
+-------+---------+---------+
This way you can even recreate your original table as long as you use the order field. For example:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join tokens t on t.tokenId = ot.tokenId
group by ot.urlId
The previous query would result in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C |
| D/B/A/R |
| C/E |
+-------------+
So, you don't even need your original table anymore. If you want to get URLs that have any of the provided token ids (in this case B OR C), you should use this:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(case when ot.tokenId in (2, 3) then 1 end) > 0
This results in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C | => It has both B and C
| D/B/A/R | => It has only B
| C/E | => It has only C
+-------------+
Now, if you want to get all Urls that have BOTH ids, then try this:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(distinct case when ot.tokenId in (2, 3) then ot.tokenId end) = 2
Put all the ids you want to filter on inside the count, and then compare that count to the number of ids you listed. The previous query will result in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C | => It has both B and C
+-------------+
The funny thing is that none of the solutions I provided results in your expected result. So, have I misunderstood your requirements or is the expected result you provided wrong?
Let me know if this is correct.
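The "BOTH ids" filter above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module as an in-memory stand-in for PostgreSQL, and the function name urls_with_all_tokens and the index idx_token_url are hypothetical. Unlike the query above, it filters in WHERE before grouping, on the assumption that an index leading with TokenId helps on a large table; that has not been measured here.

```python
import sqlite3

# Hypothetical in-memory stand-in for the OrderedTokens table shown above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderedTokens (UrlId INTEGER, TokenId INTEGER, AnOrder INTEGER);
    INSERT INTO OrderedTokens VALUES
        (1, 1, 1), (1, 2, 2), (1, 3, 3),
        (2, 5, 1), (2, 2, 2), (2, 1, 3), (2, 7, 4),
        (3, 3, 1), (3, 4, 2);
    -- Assumed index: lets the engine look rows up by token id first.
    CREATE INDEX idx_token_url ON OrderedTokens (TokenId, UrlId);
""")

def urls_with_all_tokens(token_ids):
    """Return the UrlIds whose tokens include every id in token_ids."""
    wanted = sorted(set(token_ids))
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT UrlId
        FROM OrderedTokens
        WHERE TokenId IN ({placeholders})
        GROUP BY UrlId
        HAVING COUNT(DISTINCT TokenId) = ?
        ORDER BY UrlId
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

print(urls_with_all_tokens([3, 2]))
```

With the sample rows from this answer, only UrlId 1 (A/B/C) contains both token 2 and token 3.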

It really depends on what you mean by efficient. It will be a trade-off between query performance and storage.
If you want to efficiently store this information, then your current approach is appropriate. You can query the data by doing something like this:
SELECT DISTINCT
u.url
FROM
urls u
INNER JOIN
dictionary d
ON
d.id IN (3, 2)
AND u.url ~ (E'\\m' || d.url_component || E'\\M')
This query will take some time, as it requires a full table scan and a regex match against each URL. It is, however, very easy to insert and store data.
If you want to optimize for query performance, though, you can create a reference table of the URL components; it would look something like this:
/A/B/C A
/A/B/C B
/A/B/C C
/C/E C
/C/E E
/D/B/A/R D
/D/B/A/R B
/D/B/A/R A
/D/B/A/R R
You can then create a clustered index on this table, on the URL component. This query would retrieve your results very quickly:
SELECT DISTINCT
u.full_url
FROM
url_components u
INNER JOIN
dictionary d
ON
d.id IN (3, 2)
AND u.url_component = d.url_component
Basically, this approach moves the complexity of the query up front. If you are doing few inserts, but lots of queries against this data, then that is appropriate.
Creating this URL component table is trivial, depending on what tools you have at your disposal. A simple awk script could work through your 2M records in a minute or two, and the subsequent copy back into the database would be quick as well. If you need to support real-time updates to this table, I would recommend a non-SQL solution: whatever your app is coded in could use regular expressions to parse the URL and insert the components into the component table. If you are limited to using the database, then an insert trigger could fulfill the same role, but it will be a more brittle approach.
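As a rough sketch of the application-side approach described above, the following uses Python's built-in sqlite3 module as an in-memory stand-in for the real database. The helper names split_components and insert_url and the index name are hypothetical, and a plain string split is used in place of a regular expression.

```python
import sqlite3

def split_components(url):
    """Split '/A/B/C' into ['A', 'B', 'C'], ignoring empty segments."""
    return [part for part in url.split("/") if part]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dictionary (id INTEGER PRIMARY KEY, url_component TEXT);
    CREATE TABLE url_components (full_url TEXT, url_component TEXT);
    CREATE INDEX idx_component ON url_components (url_component);
    INSERT INTO dictionary VALUES
        (1, 'A'), (2, 'B'), (3, 'C'), (4, 'E'), (5, 'D'), (6, 'G'), (7, 'R');
""")

def insert_url(url):
    """Store one (full_url, component) row per distinct component of the URL."""
    rows = [(url, c) for c in dict.fromkeys(split_components(url))]
    conn.executemany("INSERT INTO url_components VALUES (?, ?)", rows)

for url in ["/A/B/C", "/C/E", "/D/B/A/R"]:
    insert_url(url)

# Same shape as the lookup above: URLs containing ANY of the ids 3 or 2.
matches = [row[0] for row in conn.execute("""
    SELECT DISTINCT u.full_url
    FROM url_components u
    JOIN dictionary d ON d.id IN (3, 2) AND u.url_component = d.url_component
    ORDER BY u.full_url
""")]
print(matches)
```

Note that this lookup matches URLs containing either component. Requiring both would need an extra GROUP BY full_url with a HAVING COUNT(DISTINCT url_component) check.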

Related

Union two query result column-wise

Say if I have two queries returning two tables with the same number of rows.
For example, if query 1 returns
| a | b | c |
| 1 | 2 | 3 |
| 4 | 5 | 6 |
and query 2 returns
| d | e | f |
| 7 | 8 | 9 |
| 10 | 11 | 12 |
How to obtain the following, assuming both queries are opaque
| a | b | c | d | e | f |
| 1 | 2 | 3 | 7 | 8 | 9 |
| 4 | 5 | 6 | 10 | 11 | 12 |
My current solution is to add to each query a row number column and inner join them
on this column.
SELECT
q1_with_rownum.*,
q2_with_rownum.*
FROM (
SELECT ROW_NUMBER() OVER () AS q1_rownum, q1.*
FROM (.......) q1
) q1_with_rownum
INNER JOIN (
SELECT ROW_NUMBER() OVER () AS q2_rownum, q2.*
FROM (.......) q2
) q2_with_rownum
ON q1_rownum = q2_rownum
However, if there is a column named q1_rownum in either of the query,
the above will break. It is not possible for me to look into q1 or q2;
the only information available is that they are both valid SQL queries
and do not contain columns with same names. Are there any SQL construct
similar to UNION but for columns instead of rows?

There is no such function. A row in a table is an entity.
If you are constructing generic code to run on any tables, you can try using less common names, such as "an unusual query rownum", or something more esoteric than that. I would suggest using the same name in both subqueries and then using a USING clause for the join.
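A minimal sketch of that suggestion, using Python's built-in sqlite3 module as a stand-in (window functions need SQLite 3.25 or newer). The helper join_columnwise, the tables t1 and t2, and the alias _query_join_row_number are all hypothetical. Without an ORDER BY inside the window, the pairing of rows is not guaranteed by SQL; it simply follows whatever order each query happens to produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER);
    CREATE TABLE t2 (d INTEGER, e INTEGER, f INTEGER);
    INSERT INTO t1 VALUES (1, 2, 3), (4, 5, 6);
    INSERT INTO t2 VALUES (7, 8, 9), (10, 11, 12);
""")

def join_columnwise(query1, query2, rn="_query_join_row_number"):
    """Pair the rows of two opaque queries by position.

    rn is a deliberately unusual alias; the sketch still breaks if either
    query already exposes a column with that name.
    """
    sql = f"""
        SELECT *
        FROM (SELECT ROW_NUMBER() OVER () AS {rn}, q1.* FROM ({query1}) AS q1) AS l
        JOIN (SELECT ROW_NUMBER() OVER () AS {rn}, q2.* FROM ({query2}) AS q2) AS r
        USING ({rn})
        ORDER BY {rn}
    """
    # USING merges the two row-number columns into one leading column; drop it.
    return [tuple(row[1:]) for row in conn.execute(sql)]

print(join_columnwise("SELECT a, b, c FROM t1", "SELECT d, e, f FROM t2"))
```

Because both subqueries use the same alias, USING leaves a single row-number column in the result, which is easy to strip off afterwards.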

Not sure if I understood your exact problem, but I think you mean both q1 and q2 are joined on a column with the same name?
You should add each table name before the column to distinguish which column is referenced:
"table1"."similarColumnName" = "table2"."similarColumnName"
EDIT:
So, the problem is that if there is already a column with the same alias as your ROW_NUMBER(), the JOIN cannot be made because you have an ambiguous column name.
The easiest solution, if you cannot know your incoming table's columns, is to pick a distinctive alias, for example _query_join_row_number
EDIT2:
You could look into prefixing all columns with their original table's name, thus removing any conflict (you get q1_with_rows.rows, and the conflicting column is q1_with_rows.q1.rows).
A related question on this: In a join, how to prefix all column names with the table it came from

Oracle SQL query comparing multiple rows with same identifier

I'm honestly not sure how to title this - so apologies if it is unclear.
I have two tables I need to compare. One table contains tree names and nodes that belong to that tree. Each Tree_name/Tree_node combo will have its own line. For example:
Table: treenode
| TREE_NAME | TREE_NODE |
|-----------|-----------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | C |
| 3 | D |
| 3 | E |
| 3 | F |
I have another table that contains names of queries and what tree_nodes they use. Example:
Table: queryrecord
| QUERY | TREE_NODE |
|---------|-----------|
| Alpha | A |
| Alpha | B |
| Alpha | D |
| BRAVO | A |
| BRAVO | B |
| BRAVO | D |
| CHARLIE | A |
| CHARLIE | B |
| CHARLIE | F |
I need to create a SQL query where I input the QUERY name, and it returns any ‘TREE_NAME’ that includes all the nodes associated with the query. So if I input ‘ALPHA’, it would return TREE_NAME 1 & 2. If I ask it for CHARLIE, it would return nothing.
I only have read access, and don’t believe I can create temp tables, so I’m not sure if this is possible. Any advice would be amazing. Thank you!

You can use group by and having as follows:
Select t.tree_name
From tree_node t
join query_record q
on t.tree_node = q.tree_node
WHERE q.query = 'ALPHA'
Group by t.tree_name
Having count(distinct t.tree_node)
= (Select count(distinct q.tree_node) From query_record q WHERE q.query = 'ALPHA');

Using an IN condition (a semi-join, which saves time over a join):
with prep (tree_node) as (select tree_node from queryrecord where query = :q)
select tree_name
from treenode
where tree_node in (select tree_node from prep)
group by tree_name
having count(*) = (select count(*) from prep)
;
:q in the prep subquery (in the with clause) is the bind variable to which you will assign the various QUERY values at runtime.
EDIT
I don't generally set up the test case on online engines; but in a comment below this answer, the OP said the query didn't work for him. So, I set up the example on SQLFiddle, here:
http://sqlfiddle.com/#!4/b575e/2
A couple of notes: for some reason, SQLFiddle thinks table names should be at most eight characters, so I had to change the second table name to queryrec (instead of queryrecord). I changed the name in the query, too, of course. And, second, I don't know how I can give bind values on SQLFiddle; I hard-coded the name 'Alpha'. (Note also that in the OP's sample data, this query value is not capitalized, while the other two are; of course, text values in SQL are case sensitive, so one should pay attention when testing.)
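For anyone without access to SQLFiddle, here is a minimal sketch of the same semi-join run against the question's sample rows. It assumes Python's built-in sqlite3 module as a stand-in for Oracle, and the function name trees_covering is hypothetical. The named parameter :q plays the role of the bind variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE treenode (tree_name INTEGER, tree_node TEXT);
    CREATE TABLE queryrecord (query TEXT, tree_node TEXT);
    INSERT INTO treenode VALUES
        (1,'A'),(1,'B'),(1,'C'),(1,'D'),(1,'E'),
        (2,'A'),(2,'B'),(2,'D'),
        (3,'C'),(3,'D'),(3,'E'),(3,'F');
    INSERT INTO queryrecord VALUES
        ('Alpha','A'),('Alpha','B'),('Alpha','D'),
        ('BRAVO','A'),('BRAVO','B'),('BRAVO','D'),
        ('CHARLIE','A'),('CHARLIE','B'),('CHARLIE','F');
""")

def trees_covering(query_name):
    """Return the tree names that contain every node used by query_name."""
    sql = """
        WITH prep (tree_node) AS (
            SELECT tree_node FROM queryrecord WHERE query = :q
        )
        SELECT tree_name
        FROM treenode
        WHERE tree_node IN (SELECT tree_node FROM prep)
        GROUP BY tree_name
        HAVING COUNT(*) = (SELECT COUNT(*) FROM prep)
        ORDER BY tree_name
    """
    return [row[0] for row in conn.execute(sql, {"q": query_name})]

print(trees_covering("Alpha"))    # trees containing A, B and D
print(trees_covering("CHARLIE"))  # no tree contains A, B and F
```

The COUNT(*) comparison relies on neither table holding duplicate (name, node) pairs, as in the sample data. Passing 'ALPHA' instead of 'Alpha' returns nothing here, which illustrates the case-sensitivity point above.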

You can do this with a join and aggregation. The trick is to count the number of nodes in query_record before joining:
select qr.query, t.tree_name
from (select qr.*,
count(*) over (partition by query) as num_tree_node
from query_record qr
) qr join
tree_node t
on t.tree_node = qr.tree_node
where qr.query = 'ALPHA'
group by qr.query, t.tree_name, qr.num_tree_node
having count(*) = qr.num_tree_node;
Here is a db<>fiddle.

Need alternate SQL

I am currently working with an H2 database and I have written the following SQL, however the H2 database engine does not support the NOT IN being performed on a multiple column sub-query.
DELETE FROM AllowedParam_map
WHERE (AllowedParam_map.famid,AllowedParam_map.paramid) NOT IN (
SELECT famid,paramid
FROM macros
LEFT JOIN macrodata
ON macros.id != macrodata.macroid
ORDER BY famid)
Essentially I want to remove rows from allowedparam_map wherever it has the same combination of famid and paramid as the sub-query
Edit: To clarify, the sub-query is specifically trying to find famid/paramid combinations that are NOT present in macrodata, in an effort to weed out the allowedparam_map, hence the ON macros.id != macrodata.macroid. I'm also terrible at SQL so this might be completely the wrong way to do it.
Edit 2: Here is some more info about the pertinent schema:
Macros
| ID | NAME | FAMID |
| 0 | foo | 1 |
| 1 | bar | 1 |
| 2 | baz | 1 |
MacroData
| ID | MACROID | PARAMID | VALUE |
| 0 | 0 | 1 | 1024 |
| 1 | 0 | 2 | 200 |
| 2 | 0 | 3 | 89.85 |
AllowedParam_Map
| ID | FAMID | PARAMID |
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
The parameters are allowed on a per-family basis. Notice how the allowedParam_map table contains an entry for famid=1 and paramid=4, even though macro 0, aka "foo", does not have an entry for paramid=4. If we expand this, there might be another famid=1 macro that has paramid=4, but we can't be sure. I want to cull from the allowedParam_map table any unused parameters, based on the data in the macrodata table.

IN and NOT IN can always be replaced with EXISTS and NOT EXISTS.
Some points first:
You are using an ORDER BY in your subquery, which is of course superfluous.
You are outer-joining a table, which should have no effect when asking for existence. So either you need to look up a field in the outer-joined table, in which case inner-join it, or you don't, in which case remove it from the query. (It's odd to join every non-related record (macros.id != macrodata.macroid) anyway.)
You first said in the comments section that both famid and paramid reside in table macros, in which case you could simply remove the outer join to macrodata from your query.
As you now say that famid is in table macros and paramid is in table macrodata, and you want to look up pairs that exist in AllowedParam_map but not in the aforementioned tables, you seem to be looking for a simple inner join:
DELETE FROM AllowedParam_map
WHERE NOT EXISTS
(
SELECT *
FROM macros m
JOIN macrodata md ON md.macroid = m.id
WHERE m.famid = AllowedParam_map.famid
AND md.paramid = AllowedParam_map.paramid
);
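The delete above can be checked against the sample rows from the question with a small sketch. It assumes Python's built-in sqlite3 module as an in-memory stand-in for H2, so it shows the logic of the statement rather than H2's behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE macros (id INTEGER, name TEXT, famid INTEGER);
    CREATE TABLE macrodata (id INTEGER, macroid INTEGER, paramid INTEGER, value REAL);
    CREATE TABLE AllowedParam_map (id INTEGER, famid INTEGER, paramid INTEGER);
    INSERT INTO macros VALUES (0, 'foo', 1), (1, 'bar', 1), (2, 'baz', 1);
    INSERT INTO macrodata VALUES (0, 0, 1, 1024), (1, 0, 2, 200), (2, 0, 3, 89.85);
    INSERT INTO AllowedParam_map VALUES (0, 1, 1), (1, 1, 2), (2, 1, 3), (3, 1, 4);
""")

# Remove every (famid, paramid) pair that no macro of that family actually uses.
deleted = conn.execute("""
    DELETE FROM AllowedParam_map
    WHERE NOT EXISTS (
        SELECT 1
        FROM macros m
        JOIN macrodata md ON md.macroid = m.id
        WHERE m.famid = AllowedParam_map.famid
          AND md.paramid = AllowedParam_map.paramid
    )
""").rowcount

remaining = conn.execute(
    "SELECT famid, paramid FROM AllowedParam_map ORDER BY paramid"
).fetchall()
print(deleted, remaining)
```

With this data only the famid=1, paramid=4 row is removed, since no macro in family 1 has a macrodata entry for paramid 4.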

You can use not exists instead:
DELETE FROM AllowedParam_map m
WHERE NOT EXISTS (SELECT 1
FROM macros LEFT JOIN
macrodata
ON macros.id <> macrodata.macroid -- I strongly suspect this should be =
WHERE m.famid = ?.famid and m.paramid = ?.paramid -- add the appropriate table aliases
);
Notes:
I strongly suspect the <> should be =. <> does not make sense in this context.
Replace the ? with the appropriate table alias.
NOT EXISTS is better than NOT IN anyway. It does what you expect if one of the values is NULL.
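The NULL behavior mentioned in the last note can be seen in a small sketch. It assumes Python's built-in sqlite3 module as a stand-in database, and the tables allowed and used are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE allowed (paramid INTEGER);
    CREATE TABLE used (paramid INTEGER);
    INSERT INTO allowed VALUES (1), (2), (4);
    INSERT INTO used VALUES (1), (2), (NULL);
""")

# 4 NOT IN (1, 2, NULL) evaluates to NULL rather than true, so nothing matches.
not_in = conn.execute("""
    SELECT paramid FROM allowed
    WHERE paramid NOT IN (SELECT paramid FROM used)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL is harmless.
not_exists = conn.execute("""
    SELECT paramid FROM allowed a
    WHERE NOT EXISTS (SELECT 1 FROM used u WHERE u.paramid = a.paramid)
""").fetchall()

print(not_in, not_exists)
```

A single NULL in the subquery result is enough to make the NOT IN version return no rows at all, while the NOT EXISTS version still finds the unused value.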

1 to Many Query: Help Filtering Results

Problem: SQL Query that looks at the values in the "Many" relationship, and doesn't return values from the "1" relationship.
Tables Example: (this shows two different tables).
+---------------+----------------------------+-------+
| Unique Number | <-- Table 1 -- Table 2 --> | Roles |
+---------------+----------------------------+-------+
| 1 | | A |
| 2 | | B |
| 3 | | C |
| 4 | | D |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |
| 10 | | |
+---------------+----------------------------+-------+
When I run my query, I get multiple, unique numbers that show all of the roles associated to each number like so.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 4 | C |
| 4 | A |
| 5 | B |
| 5 | C |
| 5 | D |
| 6 | D |
| 6 | A |
+---------------+-------+
I would like to be able to run my query and be able to say, "When the role of A is present, don't even show me the unique numbers that have the role of A".
Maybe if SQL could look at the roles and say, WHEN role A comes up, grab unique number and remove it from column 1.
Based on what I would "like" to happen (I put that in quotations as this might not even be possible) the following is what I would expect my query to return.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 5 | B |
| 5 | C |
| 5 | D |
+---------------+-------+
UPDATE:
Query Example: I am querying 8 tables, but I condensed it to 4 for simplicity.
SELECT
c.UniqueNumber,
cp.pType,
p.pRole,
a.aRole
FROM c
JOIN cp ON cp.uniqueVal = c.uniqueVal
JOIN p ON p.uniqueVal = cp.uniqueVal
LEFT OUTER JOIN a ON a.uniqueVal = p.uniqueVal
WHERE
--I do some basic filtering to get to the relevant clients data but nothing more than that.
ORDER BY
c.uniqueNumber
Table sizes: these tables can have anywhere from 50,000 rows to 500,000+

Pretending the table name is t and the column names are alpha and numb:
SELECT t.numb, t.alpha
FROM t
LEFT JOIN t AS s ON t.numb = s.numb
AND s.alpha = 'A'
WHERE s.numb IS NULL;
You can also do a subselect:
SELECT numb, alpha
FROM t
WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A');
Or one of the following if the subselect is materializing more than once (pick the one that is faster, ie, the one with the smaller subtable size):
SELECT t.numb, t.alpha
FROM t
JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') = 0) AS s USING (numb);
SELECT t.numb, t.alpha
FROM t
LEFT JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') > 0) AS s USING (numb)
WHERE s.numb IS NULL;
But the first one is probably faster and better[1]. Any of these methods can be folded into a larger query with multiple additional tables being joined in.
[1] Straight joins tend to be easier to read and faster to execute than queries involving subselects and the common exceptions are exceptionally rare for self-referential joins as they require a large mismatch in the size of the tables. You might hit those exceptions though, if the number of rows that reference the 'A' alpha value is exceptionally small and it is indexed properly.
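To make the first form concrete, here is a minimal sketch that runs the self anti-join and the NOT IN variant against the question's sample rows. It assumes Python's built-in sqlite3 module as a stand-in database and the pretend names t, numb and alpha from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (numb INTEGER, alpha TEXT);
    INSERT INTO t VALUES
        (1,'C'),(1,'D'),(2,'A'),(2,'B'),(3,'A'),(3,'B'),
        (4,'C'),(4,'A'),(5,'B'),(5,'C'),(5,'D'),(6,'D'),(6,'A');
""")

# Anti-join: keep rows whose numb has no partner row with alpha = 'A'.
anti_join = conn.execute("""
    SELECT t.numb, t.alpha
    FROM t
    LEFT JOIN t AS s ON t.numb = s.numb AND s.alpha = 'A'
    WHERE s.numb IS NULL
    ORDER BY t.numb, t.alpha
""").fetchall()

# Subselect variant of the same filter.
not_in = conn.execute("""
    SELECT numb, alpha
    FROM t
    WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A')
    ORDER BY numb, alpha
""").fetchall()

print(anti_join)
```

Both forms return the rows for numbers 1 and 5 only, which matches the expected output in the question.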

There are many ways to do it, and the trade-offs depend on factors such as the size of the tables involved and what indexes are available. On general principles, my first instinct is to avoid a correlated subquery such as another, now-deleted answer proposed, but if the relationship table is small then it probably doesn't matter.
This version instead uses an uncorrelated subquery in the where clause, in conjunction with the not in operator:
select num, role
from one_to_many
where num not in (select otm2.num from one_to_many otm2 where otm2.role = 'A')
That form might be particularly effective if there are many rows in one_to_many, but only a small proportion have role A. Of course you can add an order by clause if the order in which result rows are returned is important.
There are also alternatives involving joining inline views or CTEs, and some of those might have advantages under particular circumstances.

LINQ OrderBy. Does it always return the same ordered list?

I was trying out a simple OrderBy statement.
The target data to order is something like below:
[
{"id":40, "description":"aaa", "rate":1},
{"id":1, "description":"bbb", "rate":1},
{"id":4, "description":"ccc", "rate":2},
{"id":19, "description":"aaa", "rate":1}
]
Then I order items by the rate property.
The odd thing happens when I order them, then skip some items by a given offset and take only a portion of the data.
For example,
var ordered = items.OrderBy(i => i.rate);
var skipped = ordered.Skip(2);
var result = skipped.Take(2);
The result looks fine for the most of it, but the 'edge case' item is not returned at all.
For example,
if the first result came back as
[{"id":40, "description":"aaa", "rate":1}, {"id":1, "description":"bbb", "rate":1}]
the second result comes back like
[{"id":1, "description":"bbb", "rate":1}, {"id":4, "description":"ccc", "rate":2}]
Item "id: 19" has not been returned with the second query call. Instead item "id: 1" has returned twice.
My guess is that the SQL ORDER BY doesn't produce the same ordered list every single time: it orders by the given property, but the exact order within a group that shares the same property value can change.
What is the exact mechanism under the hood?

Short answer: LINQ to Objects uses a stable sort algorithm, so we can say that it is deterministic, while LINQ to SQL depends on the database's implementation of ORDER BY, which is usually nondeterministic for rows with equal keys.
A deterministic sort algorithm is one that always has the same behavior on different runs.
In your example, the OrderBy key has duplicate values. For a guaranteed and predictable order, one of the order clauses, or the combination of order clauses, must be unique.
In LINQ, you can achieve this by adding a ThenBy clause that refers to a unique property, as in
items.OrderBy(i => i.Rate).ThenBy(i => i.ID).
Long answer:
LINQ to Objects uses a stable sort, as documented in this link: MSDN.
In LINQ to SQL, it depends on the sort algorithm of the underlying database and it is usually an unstable sort, like in MS SQL Server (MSDN).
In a stable sort, if the keys of two elements are equal, the order of the elements is preserved. In contrast, an unstable sort does not preserve the order of elements that have the same key.
So, for LINQ to SQL, the sorting is usually nondeterministic because the RDBMS (Relational Database Management System, like MS SQL Server) may directly use an unstable sort algorithm with a random pivot selection, or the randomness can be related to which row the database happens to access first in the file system.
For example, imagine that the size of a page in the file system can hold up to 4 rows.
The page will be full if you insert the following data:
Page 1
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 3 |
| D | 4 |
If you need to insert a new row, the RDBMS has two options:
Create a new page to allocate the new row.
Split the current page into two pages. So the first page will hold the Names A and B and the second page will hold C and D.
Suppose that the RDBMS chooses option 1 (to reduce index fragmentation). If you insert a new row with Name C and Value 9, you will get:
Page 1 Page 2
| Name | Value | | Name | Value |
|------|-------| |------|-------|
| A | 1 | | C | 9 |
| B | 2 | | | |
| C | 3 | | | |
| D | 4 | | | |
Probably, the OrderBy clause in column Name will return the following:
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 3 |
| C | 9 | -- Value 9 appears after because it was at another page
| D | 4 |
Now, suppose that the RDBMS chooses option 2 (to increase the insert performance in a storage system with many spindles). If you insert a new row with Name C and Value 9, you will get:
Page 1 Page 2
| Name | Value | | Name | Value |
|------|-------| |------|-------|
| A | 1 | | C | 3 |
| B | 2 | | D | 4 |
| C | 9 | | | |
| | | | | |
Probably, the OrderBy clause in column Name will return the following:
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 9 | -- Value 9 appears before because it was at the first page
| C | 3 |
| D | 4 |
Regarding your example:
I believe that you have mistyped something in your question, because you have used items.OrderBy(i => i.rate).Skip(2).Take(2); and the first result does not show a row with Rate = 2. This is not possible, since Skip will ignore the first two rows and they have Rate = 1, so your output must show the row with Rate = 2.
You've tagged your question with database, so I believe that you are using LINQ to SQL. In this case, results can be nondeterministic and you could get the following:
Result 1:
[{"id":40, "description":"aaa", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
Result 2:
[{"id":1, "description":"bbb", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
If you had used items.OrderBy(i => i.rate).ThenBy(i => i.ID).Skip(2).Take(2); then the only possible result would be:
[{"id":40, "description":"aaa", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
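The effect of the tie-breaker can be sketched outside LINQ. The following uses Python rather than C#, with the hypothetical helper page_ids standing in for OrderBy/ThenBy/Skip/Take, and uses every possible arrival order of the rows to stand in for a database that may return equal-rate rows in any order. Python's sorted is a stable sort, like LINQ to Objects.

```python
import itertools

items = [
    {"id": 40, "description": "aaa", "rate": 1},
    {"id": 1, "description": "bbb", "rate": 1},
    {"id": 4, "description": "ccc", "rate": 2},
    {"id": 19, "description": "aaa", "rate": 1},
]

def page_ids(rows, skip, take, tie_break=True):
    """Order by rate (optionally then by id) and return the ids of one page."""
    if tie_break:
        key = lambda r: (r["rate"], r["id"])
    else:
        key = lambda r: r["rate"]
    ordered = sorted(rows, key=key)
    return [r["id"] for r in ordered[skip:skip + take]]

# Second page (skip 2, take 2) for every possible arrival order of the rows.
with_tie_break = {
    tuple(page_ids(p, 2, 2)) for p in itertools.permutations(items)
}
rate_only = {
    tuple(page_ids(p, 2, 2, tie_break=False)) for p in itertools.permutations(items)
}
print(with_tie_break)
print(rate_only)
```

With the id tie-breaker the second page is always ids 40 and 4, whatever order the rows arrive in. Sorting on rate alone, the page depends on the arrival order, which is the kind of variation the question ran into.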
