Searching a "vertical" table in SQLite - sql

Tables are usually laid out in a "horizontal" fashion:
+-----+----+----+--------+
|recID|FirstName|LastName|
+-----+----+----+--------+
| 1 | Jim | Jones |
+-----+----+----+--------+
| 2 | Adam | Smith |
+-----+----+----+--------+
Here, however, is a table with the same data in a "vertical" layout:
+-----+-----+----+-----+-------+
|rowID|recID| Property | Value |
+-----+-----+----+-----+-------+
| 1 | 1 |FirstName | Jim | \
+-----+-----+----+-----+-------+ These two rows constitute a single logical record
| 2 | 1 |LastName | Jones | /
+-----+-----+----+-----+-------+
| 3 | 2 |FirstName | Adam | \
+-----+-----+----+-----+-------+ These two rows are another single logical record
| 4 | 2 |LastName | Smith | /
+-----+-----+----+-----+-------+
Question: In SQLite, how can I search the vertical table efficiently and in such a way that recIDs are not duplicated in the result set? That is, if multiple matches are found with the same recID, only one (any one) is returned?
Example (incorrect):
SELECT rowID from items WHERE "Value" LIKE "J%"
returns of course two rows with the same recID:
1 (Jim)
2 (Jones)
What is the optimal solution here? I can imagine storing intermediate results in a temp table, but hoping for a more efficient way.
(I need to search through all properties, so the SELECT cannot be restricted with e.g. "Property" = "FirstName". The database is maintained by a third-party product; I suppose the design makes sense because the number of property fields is variable.)

To avoid duplicate rows in the result returned by a SELECT, use DISTINCT:
SELECT DISTINCT recID
FROM items
WHERE "Value" LIKE 'J%'
However, this works only for the values that are actually returned, and only for entire result rows.
In the general case, to return one result record for each group of table records, use GROUP BY to create such groups.
For any column that does not appear in the GROUP BY clause, you then have to choose which rowID in the group to return; here we use MIN:
SELECT MIN(rowID)
FROM items
WHERE "Value" LIKE 'J%'
GROUP BY recID
To make this query more efficient, create an index on the recID column.

Related

Distinct performance in Redshift

I am trying to populate a multiple dimension tables from single Base table.
Sample Base Table:
| id | empl_name | emp_surname | country | dept | university |
|----|-----------|-------------|---------|------|------------|
| 1 | AAA | ZZZ | USA | CE | U_01 |
| 2 | BBB | XXX | IND | CE | U_01 |
| 3 | CCC | XXX | CAN | IT | U_02 |
| 4 | CCC | ZZZ | USA | MECH | U_01 |
Required Dimension tables :
emp_name_dim with values - AAA,BBB,CCC
emp_surname_dim with values - ZZZ,XXX
country_dim with values - USA,IND,CAN
dept_dim with values - CE,IT,MECH
university_dim with values - U_01,U_02
Now to populate above dimension tables from base table, I am thinking of 2 approaches
Get distinct values from base table for all above columns combination, create single temp table out of that and use that temp table for subsequent individual dimension table creation. Here, I will be reading data from base table only once but with more column combination.
Create separate temp tables for distinct values specific to each dimension. This way we need to read base table for multiple times, but created temp table will be smaller(i.e. less number of rows and only single column's distinct values).
Which approach is better if we consider for performance?
Note :
Base table is huge containing millions of rows.
Above columns are just for sample. In actual table there are around 50 columns for
which I need to consider for distinct combination.
Scanning the large table only once is the way to go.
Also there is another way to get the distinct values which in some cases will be faster than distinct. As an alternative approach perform a "group by" on all the columns. Run this as a bake-off to see which is faster. In general if there will be a small number (fits in memory) number of resulting rows from distinct, then distinct will be faster. However, if the result will be large then group by will be faster. There are a lot of corner-cases and factors (distribution style) that can impact this rule-of-thumb so testing both for speed will give you which is faster in your case.
Given that you have 50 columns and you want all the unique combination I'd guess that the output set will be large and that group by will wind but this is just a guess.

How can I return the best matched row first in sort order from a set returned by querying a single search term against multiple columns in Postgres?

Background
I have a Postgres 11 table like so:
CREATE TABLE
some_schema.foo_table (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
bar_text TEXT,
foo_text TEXT,
foobar_text TEXT
);
It has some data like this:
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('eddie', '123456', 'something0987');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Snake', '12345-54321', 'that_##$%_snake');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Sally', '12345', '24-7avocado');
id | bar_text | foo_text | foobar_text
----+----------+-------------+-----------------
1 | eddie | 123456 | something0987
2 | Snake | 12345-54321 | that_##$%_snake
3 | Sally | 12345 | 24-7avocado
The problem
I need to query each one of these columns and compare the values to a given term (passed in as an argument from app logic), and make sure the best-matched row (considering comparison with all the columns, not just one) is returned first in the sort order.
There is no way to know in advance which of the columns is likely to be a better match for the given term.
If I compare the given term to each value using the similarity() function, I can see at a glance which row has the best match in any of the three columns and can see that's the one I would want ranked first in the sort order.
SELECT
f.id,
f.foo_text,
f.bar_text,
f.foobar_text,
similarity('12345', foo_text) AS foo_similarity,
similarity('12345', bar_text) AS bar_similarity,
similarity('12345', foobar_text) AS foobar_similarity
FROM some_schema.foo_table f
WHERE
(
f.foo_text ILIKE '%12345%'
OR
f.bar_text ILIKE '%12345%'
OR
f.foobar_text ILIKE '%12345%'
)
;
id | foo_text | bar_text | foobar_text | foo_similarity | bar_similarity | foobar_similarity
----+-------------+----------+-----------------+----------------+----------------+-------------------
2 | 12345-54321 | Snake | that_##$%_snake | 0.5 | 0 | 0
3 | 12345 | Sally | 24-7avocado | 1 | 0 | 0
1 | 123456 | eddie | something0987 | 0.625 | 0 | 0
(3 rows)
Clearly in this case, id #3 (Sally) is the best match (exact, as it happens); this is the row I'd like to return first.
However, since I don't know ahead of time that foo_text is going to be the column with the best match, I don't know how to define the ORDER BY clause.
I figured this would be a common enough problem, but I haven't found any hints in a fair bit of SO and DDG .
How can I always rank the best-matched row first in the returned set, without knowing which column will provide the best match to the search term?
Use greatest():
greatest(similarity('12345', foo_text), similarity('12345', bar_text), similarity('12345', foobar_text)) desc

SQL Server: Use a column to save order of the record

I'm facing a database that keeps the ORDERING in columns of the table.
It's like:
Id Name Description Category OrderByName OrderByDescription OrderByCategory
1 Aaaa bbbb cccc 1 2 3
2 BBbbb Aaaaa bbbb 2 1 2
3 cccc cccc aaaaa 3 3 1
So, when the user want's to order by name, the SQL goes with an ORDER BY OrderByName.
I think this doesn't make any sense, since that's why Index are for and i tried to find any explanation for that but haven't found. Is this faster than using indexes? Is there any scenario where this is really useful?
It can make sense for many reasons but mainly when you don't want to follow the "natural order" given by the ORDER BY clause.
This is a scenario where this can be useful :
SQL Fiddle
MS SQL Server 2008 Schema Setup:
CREATE TABLE Table1
([Id] int, [Name] varchar(15), [OrderByName] int)
;
INSERT INTO Table1
([Id], [Name], [OrderByName])
VALUES
(1, 'Del Torro', 2 ),
(2, 'Delson', 1),
(3, 'Delugi', 3)
;
Query 1:
SELECT *
FROM Table1
ORDER BY Name
Results:
| ID | NAME | ORDERBYNAME |
|----|-----------|-------------|
| 1 | Del Torro | 2 |
| 2 | Delson | 1 |
| 3 | Delugi | 3 |
Query 2:
SELECT *
FROM Table1
ORDER BY OrderByName
Results:
| ID | NAME | ORDERBYNAME |
|----|-----------|-------------|
| 2 | Delson | 1 |
| 1 | Del Torro | 2 |
| 3 | Delugi | 3 |
I think it makes little sense for two reasons:
Who is going to maintain this set of values in the table? You need to update them every time any row is added, updated, or deleted. You can do this with triggers, or horribly buggy and unreliable constraints using user-defined functions. But why? The information that seems to be in those columns is already there. It's redundant because you can get that order by ordering by the actual column.
You still have to use massive conditionals or dynamic SQL to tell the application how to order the results, since you can't say ORDER BY #column_name.
Now, I'm basing my assumptions on the fact that the ordering columns still reflect the alphabetical order in the relevant columns. It could be useful if there is some customization possible, e.g. if you wanted all Smiths listed first, and then all Morts, and then everyone else. But I don't see any evidence of this in the question or the data.
This could be useful if the ordering was customizable - that is, if users did not want to see the list in alphabetical order, but rather in some custom order.
An index on the int columns would be smaller than an index on the column that holds the actual text, but I don't see that there is any real benefit to this in most cases.

SELECT TOP 1 ...Some stuff... ORDER BY DES gives different result

SELECT TOP 1 Col1,col2
FROM table ... JOIN table2
...Some stuff...
ORDER BY DESC
gives different result. compared to
SELECT Col1,col2
FROM table ... JOIN table2
...Some stuff...
ORDER BY DESC
2nd query gives me some rows , When I want the Top 1 of this result I write the 1st query with TOP 1 clause. These both give different results.
why is this behavior different
This isn't very clear, but I guess you mean the row returned by the first query isn't the same as the first row returned by the second query. This could be because your order by has duplicate values in it.
Say, for example, you had a table called Test
+-----+------+
| Seq | Name |
+-----+------+
| 1 | A |
| 1 | B |
| 2 | C |
+-----+------+
If you did Select * From Test Order By Seq, either of these is valid
+-----+------+
| Seq | Name |
+-----+------+
| 1 | A |
| 1 | B |
| 2 | C |
+-----+------+
+-----+------+
| Seq | Name |
+-----+------+
| 1 | B |
| 1 | A |
| 2 | C |
+-----+------+
With the top, you could get either row.
Having the top 1 clause could mean the query optimizer uses a completely different approach to generate the results.
I'm going to assume that you're working in SQL Server, so Laurence's answer is probably accurate. But for completeness, this also depends on what database technology you are using.
Typically, index-based databases, like SQL Server, will return results that are sorted by the index, depending on how the execution plan is created. But not all databases utilize indices.
Netezza, for example, keeps track of where data lives in the system without the concept of an index (Netezza's system architecture is quite a bit different). As a result, selecting the 1st record of a query will result in a random record from the result set floating to the top. Executing the same query multiple times will likely result in a different order each time.
If you have a requirement to order data, then it is in your best interest to enforce the ordering yourself instead of relying on the arbitrary ordering that the database will use when creating its execution plan. This will make your results more predictable.
Your 1st query will get one table's top row and compare with another table with condition. So it will return different values compare to normal join.

Select any(array) in any(array)

I have a bigint[] colunm:
person
------
id | name | other_information
------------------------------
1 | Zé | {1,2,3}
2 | João | {1,3}
3 | Maria | {3,5}
I need select persons with 2 or 5 in other_information. How?
select *
from person
where 2 = ANY(other_information)
or 5 = ANY(other_information)
However, using arrays is not recommended, as it results in a denormalized schema.
From the PostgreSQL docs:
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.