I'm using Knex, a pretty nice SQL builder.
I've got a table called Foo which has 3 columns
+--------------+-----------------+
| id | PK |
+--------------+-----------------+
| idFoo | FK (not unique) |
+--------------+-----------------+
| serialNumber | Number |
+--------------+-----------------+
I'd like to select all rows with idFoo IN (1, 2, 3).
However, I'd like to avoid duplicate records based on the same idFoo.
Since that column is not unique, there could be many rows with the same idFoo.
A possible solution
My query below will of course return all rows with idFoo IN (1, 2, 3), including duplicates.
db.select(
    "id",
    "idFoo",
    "serialNumber"
  )
  .from("foo")
  .whereIn("idFoo", [1, 2, 3])
However, this will return results with duplicate idFoo values, like so:
+----+-------+--------------+
| id | idFoo | serialNumber |
+----+-------+--------------+
| 1 | 2 | 56454 |
+----+-------+--------------+
| 2 | 3 | 75757 |
+----+-------+--------------+
| 3 | 3 | 00909 |
+----+-------+--------------+
| 4 | 1 | 64421 |
+----+-------+--------------+
What I need is this:
+----+-------+--------------+
| id | idFoo | serialNumber |
+----+-------+--------------+
| 1 | 2 | 56454 |
+----+-------+--------------+
| 3 | 3 | 00909 |
+----+-------+--------------+
| 4 | 1 | 64421 |
+----+-------+--------------+
I can take the result and use JavaScript to filter out the duplicates, but I'd specifically like to avoid that and write this in Knex.
The question is: how can I do this with Knex code?
I know it can be done with plain SQL (perhaps something using GROUP BY), but I'd specifically like to achieve this in "pure" Knex without using raw SQL.
Knex.js supports groupBy natively. You can write:
knex('foo').whereIn('id',
knex('foo').max('id').groupBy('idFoo')
)
This compiles to the following SQL:
SELECT * FROM foo
WHERE id IN (
SELECT max(id) FROM foo
GROUP BY idFoo
)
Note that you need the subselect to make sure you don't mix values from different rows within the same group.
In plain SQL you would do it like this.
You perform a self join and look for a row with the same idFoo but a bigger id; if no such row exists, the joined columns are NULL, and you know the current row has the biggest id in its group.
SELECT t1.id, t1.idFoo, t1.serialNumber
FROM foo as t1
LEFT JOIN foo as t2
ON t1.id < t2.id
AND t1.idFoo = t2.idFoo
WHERE t2.idFoo IS NULL
So check how to express a left join in Knex.js.
EDIT:
Going by the documentation, something like this should work (not tested):
knex.select('t1.*')
  .from('foo as t1')
  .leftJoin('foo as t2', function() {
    this.on('t1.id', '<', 't2.id')
        .andOn('t1.idFoo', '=', 't2.idFoo')
  })
  .whereNull('t2.idFoo')
Related
I have a table "table1" like this:
+----+-------------+-----+
| id | barcode     | lot |
+----+-------------+-----+
| 0  | ABC-123-456 |     |
| 1  | ABC-123-654 |     |
| 2  | ABC-789-EFG |     |
| 3  | ABC-456-EFG |     |
+----+-------------+-----+
I have to extract the number in the middle of the "barcode" column, as with this query:
SELECT SUBSTR(barcode, 5, 3) AS ToExtract FROM table1;
The result:
+-----------+
| ToExtract |
+-----------+
| 123 |
| 123 |
| 789 |
| 456 |
+-----------+
and insert the result into the "lot" column.
Follow along these lines:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
i.e. in your case:
UPDATE table_name
SET lot = SUBSTR(barcode, 5, 3)
WHERE condition; -- (if any)
UPDATE table1 SET lot = SUBSTR(barcode, 5, 3)
-- WHERE ...;
Many databases support generated (aka "virtual" or "computed") columns. This allows you to define a column as an expression. The syntax is something like this:
alter table table1 add column lot varchar(3) generated always as (SUBSTR(barcode, 5, 3))
Using a generated column has several advantages:
It is always up-to-date.
It generally does not occupy any space.
There is no overhead when inserting or updating rows (although there is some overhead when querying the table).
I should note that the syntax varies a bit among databases. Some don't require the type specification. Some use just as instead of generated always as.
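For example, in MySQL or MariaDB the same idea might look roughly like this (an untested sketch; the VIRTUAL keyword and the exact syntax vary by database and version):
ALTER TABLE table1
    ADD COLUMN lot VARCHAR(3) AS (SUBSTR(barcode, 5, 3)) VIRTUAL;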
CREATE TABLE Table1(id INT,barcode varchar(255),lot varchar(255))
INSERT INTO Table1 VALUES (0,'ABC-123-456',NULL),(1,'ABC-123-654',NULL),(2,'ABC-789-EFG',NULL)
,(3,'ABC-456-EFG',NULL)
UPDATE a
SET a.lot = SUBSTRING(b.barcode, 5, 3)
FROM Table1 a
INNER JOIN Table1 b ON a.id=b.id
WHERE a.lot IS NULL
+----+-------------+-----+
| id | barcode     | lot |
+----+-------------+-----+
| 0  | ABC-123-456 | 123 |
| 1  | ABC-123-654 | 123 |
| 2  | ABC-789-EFG | 789 |
| 3  | ABC-456-EFG | 456 |
+----+-------------+-----+
db<>fiddle here
I was looking to provide an answer to this question in which the OP has two tables:
Table1
+--------+--------+
| testID | Status |
+--------+--------+
| 1 | |
| 2 | |
| 3 | |
+--------+--------+
Table2
+----+--------+--------+--------+
| ID | testID | stepID | status |
+----+--------+--------+--------+
| 1 | 1 | 1 | pass |
| 2 | 1 | 2 | fail |
| 3 | 1 | 3 | pass |
| 4 | 2 | 1 | pass |
| 5 | 2 | 2 | pass |
| 6 | 3 | 1 | fail |
+----+--------+--------+--------+
Here, the OP is looking to update the Status field for each testID in Table1 to pass if every stepID record associated with that testID in Table2 has a status of pass; otherwise Table1 should be updated to fail for that testID.
In this example, the result should be:
+--------+--------+
| testID | Status |
+--------+--------+
| 1 | fail |
| 2 | pass |
| 3 | fail |
+--------+--------+
I wrote the following SQL code in an effort to accomplish this:
update Table1 a inner join
(
select
b.testID,
iif(min(b.status)=max(b.status) and min(b.status)='pass','pass','fail') as v
from Table2 b
group by b.testID
) c on a.testID = c.testID
set a.Status = c.v
However, MS Access reports the all-too-familiar 'Operation must use an updateable query' error.
I know that a query is not updateable if there is a one-to-many relationship between the record being updated and the set of values, but in this case, the aggregated subquery would yield a one-to-one relationship between the two testID fields.
Which left me asking, why is this query not updateable?
You're joining in a query with an aggregate (Max).
Aggregates are not updateable. In Access, in an update query, every part of the query has to be updateable (with the exception of simple expressions, and of subqueries in the WHERE part of your query), which means your query is not updateable.
You can work around this by using domain aggregates (DMin and DMax) instead of real ones, but this query will take a large performance hit if you do.
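For this particular case, a domain-aggregate version might look roughly like this (an untested sketch; table and column names taken from the example above):
UPDATE Table1
SET Table1.Status = IIf(
        DMin("status", "Table2", "testID=" & [testID]) = "pass"
    AND DMax("status", "Table2", "testID=" & [testID]) = "pass",
    "pass", "fail");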
You can also work around it by rewriting your aggregates as an EXISTS or NOT EXISTS clause, since that is part of the WHERE clause and thus doesn't need to be updateable. That should barely affect performance, but it means you have to split this query in two: one query to set the fields to "pass" where your condition is met, and another to set them to "fail" where it isn't.
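A rough sketch of that two-query approach (untested; it reuses the Status column from Table1 above):
UPDATE Table1
SET Table1.Status = "pass"
WHERE NOT EXISTS
    (SELECT 1 FROM Table2
     WHERE Table2.testID = Table1.testID AND Table2.status <> "pass");

UPDATE Table1
SET Table1.Status = "fail"
WHERE EXISTS
    (SELECT 1 FROM Table2
     WHERE Table2.testID = Table1.testID AND Table2.status <> "pass");
Note that a testID with no rows at all in Table2 would end up as "pass" under this sketch.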
Problem: an SQL query that looks at the values on the "many" side of the relationship and, based on them, excludes values from the "one" side.
Tables example (this shows two different tables):
+---------------+----------------------------+-------+
| Unique Number | <-- Table 1 -- Table 2 --> | Roles |
+---------------+----------------------------+-------+
| 1 | | A |
| 2 | | B |
| 3 | | C |
| 4 | | D |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |
| 10 | | |
+---------------+----------------------------+-------+
When I run my query, I get multiple rows per unique number, showing all of the roles associated with each number, like so:
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 4 | C |
| 4 | A |
| 5 | B |
| 5 | C |
| 5 | D |
| 6 | D |
| 6 | A |
+---------------+-------+
I would like to run my query and be able to say, "When role A is present, don't even show me the unique numbers that have role A".
Maybe if SQL could look at the roles and say: when role A comes up, grab that unique number and remove it from column 1.
Based on what I would "like" to happen (I put that in quotation marks as this might not even be possible), the following is what I would expect my query to return:
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 5 | B |
| 5 | C |
| 5 | D |
+---------------+-------+
UPDATE:
Query Example: I am querying 8 tables, but I condensed it to 4 for simplicity.
SELECT
c.UniqueNumber,
cp.pType,
p.pRole,
a.aRole
FROM c
JOIN cp ON cp.uniqueVal = c.uniqueVal
JOIN p ON p.uniqueVal = cp.uniqueVal
LEFT OUTER JOIN a ON a.uniqueVal = p.uniqueVal
WHERE
--I do some basic filtering to get to the relevant clients data but nothing more than that.
ORDER BY
c.uniqueNumber
Table sizes: these tables can have anywhere from 50,000 rows to 500,000+
Pretending the table name is t and the column names are alpha and numb:
SELECT t.numb, t.alpha
FROM t
LEFT JOIN t AS s ON t.numb = s.numb
AND s.alpha = 'A'
WHERE s.numb IS NULL;
You can also do a subselect:
SELECT numb, alpha
FROM t
WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A');
Or one of the following if the subselect materializes more than once (pick whichever is faster, i.e. the one with the smaller subtable size):
SELECT t.numb, t.alpha
FROM t
JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') = 0) AS s USING (numb);
SELECT t.numb, t.alpha
FROM t
LEFT JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') > 0) AS s USING (numb)
WHERE s.numb IS NULL;
But the first one is probably faster and better[1]. Any of these methods can be folded into a larger query with multiple additional tables being joined in.
[1] Straight joins tend to be easier to read and faster to execute than queries involving subselects, and the common exceptions are rare for self-referential joins, since they require a large mismatch in the sizes of the tables. You might hit those exceptions, though, if the number of rows that reference the 'A' alpha value is exceptionally small and the column is indexed properly.
There are many ways to do it, and the trade-offs depend on factors such as the size of the tables involved and what indexes are available. On general principles, my first instinct is to avoid a correlated subquery such as the one another, now-deleted answer proposed, but if the relationship table is small then it probably doesn't matter.
This version instead uses an uncorrelated subquery in the where clause, in conjunction with the not in operator:
select num, role
from one_to_many
where num not in (select otm2.num from one_to_many otm2 where otm2.role = 'A')
That form might be particularly effective if there are many rows in one_to_many, but only a small proportion have role A. Of course you can add an order by clause if the order in which result rows are returned is important.
There are also alternatives involving joining inline views or CTEs, and some of those might have advantages under particular circumstances.
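One such inline-view variant, keeping the same assumed table name, could be:
select otm.num, otm.role
from one_to_many otm
left join (select distinct num
           from one_to_many
           where role = 'A') a_nums on a_nums.num = otm.num
where a_nums.num is null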
Assume I have the following table:
+--------+--------+--------+
| field1 | field2 | field3 |
+--------+--------+--------+
| a | a | 1 |
| a | b | 2 |
| a | c | 3 |
| b | a | 1 |
| b | b | 2 |
| c | b | 2 |
| c | b | 3 |
+--------+--------+--------+
I want to select only the rows where field3 is the minimum value, so only these rows:
+--------+--------+--------+
| field1 | field2 | field3 |
+--------+--------+--------+
| a | a | 1 |
| b | a | 1 |
| c | b | 2 |
+--------+--------+--------+
The most popular solution is to query the source twice: once directly, and once joined to a subquery where the source is queried again and aggregated. However, since my data source is actually a derived table/subquery itself, I'd have to duplicate the subquery in my SQL, which is ugly. The other option is to use a WITH CTE and reuse the subquery, which would be nice, but Teradata, the database I am using, doesn't support CTEs in views, though it does in macros, which is not an option for me now.
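For reference, that double-query pattern would look roughly like this (the_table stands in for my derived source):
select t.*
from the_table t
join (select field1, min(field3) as min_field3
      from the_table
      group by field1) m
  on m.field1 = t.field1
 and t.field3 = m.min_field3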
So is it possible in standard SQL to group multiple records into a single record by using only a single field in the aggregation without querying the source twice or using a CTE?
This is possible using a window function:
select *
from (
select column_1, column_2, column_3,
min(column_3) over (partition by column_1) as min_col_3
from the_table
) t
where column_3 = min_col_3;
The above is standard SQL and I believe Teradata also supports window functions.
The derived table is necessary because you can't refer to a column alias in the where clause - at least not in standard SQL.
I think Teradata actually allows that via the QUALIFY clause, but as I have never used it, I am not sure:
select *
from the_table
qualify min(column_3) over (partition by column_1) = column_3;
Use NOT EXISTS to return a row only if no other row has the same field1 value but a lower field3 value:
select *
from the_table t1
where not exists (select 1 from the_table t2
                  where t2.field1 = t1.field1
                    and t2.field3 < t1.field3)
I have a table that contains URL strings, i.e.
/A/B/C
/C/E
/C/B/A/R
Each string is split into tokens, where the separator in my case is '/'. Then I assign an integer value to each token and put them into a dictionary (a different database table), i.e.
A : 1
B : 2
C : 3
E : 4
D : 5
G : 6
R : 7
My problem is to find those rows in the first table which contain a given set of tokens. An additional problem is that my input is a sequence of ints, i.e. I have
3, 2
and I'd like to find the following rows:
/A/B/C
/C/B/A/R
How can I do this in an efficient way? By this I mean: how should I design a proper database structure?
I use PostgreSQL; the solution should work well for 2 million rows in the first table.
To clarify my example: I need both 'B' AND 'C' to be in the URL. Also, 'B' and 'C' can occur in any order in the URL.
I need an efficient SELECT; the INSERT does not have to be efficient. I don't have to do all the work in SQL, if that changes anything.
Thanks in advance
I'm not sure how to do this, but I'm just giving you some ideas that might be useful. You already have your initial table. You process it and create the token table:
+------------+---------+
| TokenValue | TokenId |
+------------+---------+
| A | 1 |
| B | 2 |
| C | 3 |
| E | 4 |
| D | 5 |
| G | 6 |
| R | 7 |
+------------+---------+
That's OK with me. Now, what I would do is create a new table (OrderedTokens) in which I match each URL of the original table with the tokens of the token table. Something like:
+-------+---------+---------+
| UrlID | TokenId | AnOrder |
+-------+---------+---------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 3 |
| 2 | 5 | 1 |
| 2 | 2 | 2 |
| 2 | 1 | 3 |
| 2 | 7 | 4 |
| 3 | 3 | 1 |
| 3 | 4 | 2 |
+-------+---------+---------+
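A possible definition for that mapping table (a sketch; names are taken from the example, and it assumes TokenId is the key of the token table):
CREATE TABLE OrderedTokens (
    UrlId   integer NOT NULL,
    TokenId integer NOT NULL REFERENCES Tokens (TokenId),
    AnOrder integer NOT NULL,
    PRIMARY KEY (UrlId, AnOrder)
);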
This way you can even recreate your original table as long as you use the order field. For example:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join tokens t on t.tokenId = ot.tokenId
group by ot.urlId
The previous query would result in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C |
| D/B/A/R |
| C/E |
+-------------+
So you don't even need your original table anymore. If you want to get URLs that have any of the provided token ids (in this case B OR C), you should use this:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(case when ot.tokenId in (2, 3) then 1 end) > 0
This results in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C | => It has both B and C
| D/B/A/R | => It has only B
| C/E | => It has only C
+-------------+
Now, if you want to get all URLs that have BOTH ids, then try this:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(distinct case when ot.tokenId in (2, 3) then ot.tokenId end) = 2
Include in the count all the ids you want to filter on, and then compare that count to the number of ids you added. The previous query will result in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C | => It has both B and C
+-------------+
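For instance, requiring three hypothetical token ids (1, 2 and 3) only changes the filter list and the expected count:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(distinct case when ot.tokenId in (1, 2, 3) then ot.tokenId end) = 3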
The funny thing is that none of the solutions I provided results in your expected result. So, have I misunderstood your requirements or is the expected result you provided wrong?
Let me know if this is correct.
It really depends on what you mean by efficient. It will be a trade-off between query performance and storage.
If you want to efficiently store this information, then your current approach is appropriate. You can query the data by doing something like this:
SELECT DISTINCT
u.url
FROM
urls u
INNER JOIN
dictionary d
ON
d.id IN (3, 2)
AND u.url ~ (E'\\m' || d.url_component || E'\\M')
This query will take some time, as it has to do a full table scan and perform regex logic on each URL. It is, however, very easy to insert and store data.
If you want to optimize for query performance, though, you can create a reference table of the URL components; it would look something like this:
/A/B/C A
/A/B/C B
/A/B/C C
/C/E C
/C/E E
/D/B/A/R D
/D/B/A/R B
/D/B/A/R A
/D/B/A/R R
You can then create a clustered index on this table, on the URL component. This query would retrieve your results very quickly:
SELECT DISTINCT
u.full_url
FROM
url_components u
INNER JOIN
dictionary d
ON
d.id IN (3, 2)
AND u.url_component = d.url_component
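PostgreSQL has no clustered indexes as such; for the clustered-index step mentioned above, an ordinary index plus a one-time CLUSTER is a close approximation (table and column names assumed from the example):
CREATE INDEX idx_url_components_component ON url_components (url_component);
CLUSTER url_components USING idx_url_components_component;  -- one-time physical reorder, not maintained automatically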
Basically, this approach moves the complexity of the query up front. If you are doing few inserts, but lots of queries against this data, then that is appropriate.
Creating this URL component table is trivial, depending on what tools you have at your disposal. A simple awk script could work through your 2M records in a minute or two, and the subsequent copy back into the database would be quick as well. If you need to support real-time updates to this table, I would recommend a non-SQL solution: whatever your app is coded in could use regular expressions to parse the URL and insert the components into the component table. If you are limited to using the database, then an insert trigger could fulfill the same role, but it will be a more brittle approach.