SQL DB2 : How to JOIN dynamicaly 2 tables using "running total calculations"

SQL DB2 : How to JOIN dynamicaly 2 tables using "running total calculations" - sql

I am learning SQL on a server: IBM V7R1M0, DB2.
I am trying to build a SQL report.
After seeking a similar example several days, I launch this bottle in the ocean of knowledge...
Context:
The stores request goods from the warehouse.
Those goods are pick on pallets.
Those pallets will be put on staging lane before to load them in a truck.
Rule1: We want only pallet(s) from one store on a staging lane (We don't want to mix the pallets from different stores)
Rule2: A store will occupy staging lanes which are nearby.
Rule3: Staging lanes are ordered by there ID (with gaps)
Table 1:
|-----|-----|-----------------|
| ID |store|pallet_estimation|
|-----|-----|-----------------|
| 1 | A | 35 |
| 2 | C | 2 |
| 3 | B | 30 |
|-----|-----|-----------------|
SELECT * FROM (
VALUES (1, 'A', 35), (2, 'C', 2), (3, 'B', 30)
) T1(ID, store, pallet_estimation)
Table 2 :
|---------------|---------------|
|ID_staging_lane|pallet_capacity|
|---------------|---------------|
| 201 | 10 |
| 202 | 10 |
| 204 | 30 |
| 205 | 40 |
| 208 | 30 |
| 210 | 30 |
|---------------|---------------|
SELECT * FROM(
VALUES (201, 10), (202, 10), (204, 30), (205, 40), (208, 30), (210, 30)
) T2(ID_staging_lane, pallet_capacity)
Expected result:
|-----------|--------|--------------------|---------------|------------------|
|T1_sequence|T1_store|T1_pallet_estimation|T2_staging_lane|T2_pallet_capacity|
|-----------|--------|--------------------|---------------|------------------|
| 1 | A | 35 | 201 | 10 |
| 1 | A | 35 | 202 | 10 |
| 1 | A | 35 | 204 | 30 |
| 2 | C | 2 | 205 | 40 |
| 3 | B | 30 | 208 | 30 |
|-----------|--------|--------------------|---------------|------------------|
Thanks you, Charles, for you time.
I'll try to improve my demand.
If needed, I want to split/divide the pallet_estimation on several staging lanes, following the sequence
Example:
For store A which has 35 pallets,
I want to use staging lane 201 then it remains 35 - 10 = 25 ,
then I want to use staging lane 202 then it remains 25 - 10 = 15,
then I want to use staging lane 204 then it remains 15 - 30 = -15
then I want to continue with the store C on the next staging lane 205 then it remains 2 - 40 = -38
then I want to continue with the store B on the next staging lane 208 then it remains 30 - 30 = 0
How would you start to build that ?
- with window function ? SUM() OVER()
- with recursive SQL ? DECLARE FETCH
- is it possible to build a dynamic JOIN in SQL ?
- other idea ?
Thanks in advance,
Renaud

First of all, v7r1 is very old...10 years to be exact...
Secondly, I don't understand what you're trying to join on...I see nothing that would explain why store A ended up with 3 rows in your results.
Thirdly, there's no such thing as a "dynamic join", in any RDBMS. You can have a dyanmic statement, which could include a join. Or you can have a static statement, which also could include a join. For Db2 on the IBM i, it only matters if your incorporating the statement in an RPG/COLBOL program or an SQL stored procedure/function.
Now having said all that, let me introduce you to Common Table Expressions (CTE). Basicaly the same as a Nested Table Expression (NTE) but IMO easier to follow and CTEs also can have a performance benefit over NTE on the i.
with T1 as (
SELECT * FROM (
VALUES (1, 'A', 35), (2, 'C', 2), (3, 'B', 30)
) T1(ID, store, pallet_estimation)
), T2 as (
SELECT * FROM(
VALUES (201, 10), (202, 10), (204, 30), (205, 40), (208, 30), (210, 30)
) T2(ID_staging_lane, pallet_capacity)
), fitment as (
select T1.*, T2.*, row_number() OVER(partition by ID_STAGING_LANE) as rowNbr
from T1 join T2 on pallet_estimation <= pallet_capacity
)
select * from fitment where rowNbr = 1;
The with T1 as (<select statement>) is the common table expression; as is T2 and fitment. The with keyword is only used for the First CTE.
The fitment CTE joins T1 and T2 based upon which estimate fits in the lane description, assigning a row_number to each possibility. The final select takes the first fit for each lane.
The nice thing about CTE's is you can easily build them and see the results as you go along. At any point you can add select * from MYCTE and see what you have so far.
Note that as shown, a CTE can reference another CTE. (fitment reference both T1 and T2)
EDIT
The functions your need to use, to look forward or backwards in the result set are named LAG() and LEAD(). They are part of the OLAP functionality built into Db2 for i. Unfortunately for you, they were added at 7.3.
You will need to roll your own version using a user defined function (UDF) that makes use what's known as the scatchpad to save data between the calls to the function for each row.
I found an very old article Scribble on SQL's Scratchpad showing how to the scratch pad in RPG. You can also use it inside an SQL defined UDF.
Do a bit a googling to see if you can get started. If you run into issues, create a new question here. (or check out the Midrange mailing lists

Related

How can I query a table for transitive matches?

Giving the following tables
Units:
| id | singular | plural |
|----|----------|--------|
| 3 | onion | onions |
| 4 | bag | bags |
| 5 | gram | grams |
| 6 | ml | ml |
| 7 | mm | mm |
and
Conversions:
| id | convert_from | convert_to | factor |
|----|--------------|------------|--------|
| 3 | 4 | 3 | 5 |
| 4 | 3 | 5 | 125 |
How could I obtain all possible conversion factors from (for example) bag (unit 4)?
I would expect the answers to resemble the form
| convert_from | convert_to | factor |
|--------------|------------|--------|
| 4 | 3 | 5 |
| 4 | 5 | 625 |
Caveats:
There is no guarantee about which column of the conversions table (convert_from, convert_to) a unit might appear in.
Conversions that transit through units 5, 6, or 7 should be ignored.
That is to say,
1->2->4->5 is valid, 1->2->4->5->7 is not.
A SQL solution (or re-architecting of the database to facilitate a SQL solution) would be ideal, but a code solution that makes multiple SQL queries would also be appreciated.
There will be other units in the units table that should be ignored if they do not form part of the conversion graph (or if they form part of the branch through an invalid transition (5, 6, or 7)). This is a simplified view.
Illustrative example
Ignoring SQL and retrieving the data for a moment, here's what I'm trying to achieve:
I want to build a system where users can store household products. A product has a unit associated with it. The unit might be an SI unit, such as mm, ml, g.. or it might be a discrete unit such as onion, or can.
Units can have relationships amongst themselves, so for example 1 can -> 330 ml.
The complexity of my question comes from the fact that the conversions for a single unit might be spread across many products.
Considering the can example again, we can have a product called pepsi (crate of 24) with the unit being crate, and another product called pepsi (can) with the unit of can.
When the user creates the pepsi (can) product, they provide the following conversion:
1 can -> 330 ml
Later, the user creates the pepsi (crate of 24) product, and provides the following conversion:
1 crate -> 24 can
Finally, the user asks the question "how much pepsi do I have?"
I'd like to be able to answer:
25 cans
1.0417 crates
8250 ml.
However, I don't know how to convert crates to ml.
Here's another example in illustrated form:
Edit:
Changed mms and mls to mm and ml. Not sure what I was thinking...
Added diagram to help clarify what i'm looking for rather than the solution.

You can use a recursive CTE, assuming there are no cycles in the data.
I added an extra is_terminal column to identify the terminal units where you don't want to convert from anymore (5, 6, and 7). The query is:
with recursive
e (convert_from, convert_to, factor, is_terminal) as (
select id, id, 1, is_terminal from units where id = 4 -- bag
union all
select e.convert_from, c.convert_to, e.factor * c.factor, u.is_terminal
from e
join conversions c on c.convert_from = e.convert_to
join units u on u.id = c.convert_to
where not e.is_terminal
)
select * from e where convert_from <> convert_to
Result:
convert_from convert_to factor is_terminal
------------ ---------- ------ -----------
4 3 5 false
4 5 625 true
See running example at DB Fiddle . Here's the data script I used to test:
create table units (
id int,
is_terminal boolean
);
insert into units (id, is_terminal) values
(3, false), (4, false),
(5, true), (6, true), (7, true);
create table conversions (
id int,
convert_from int,
convert_to int,
factor int
);
insert into conversions (id, convert_from, convert_to, factor) values
(3, 4, 3, 5),
(4, 3, 5, 125);

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez#gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille#gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the 'stuff 'function. The only drawback here is that, since your ids are ints, placing a comma between values will involve a work around (would need to be a string). Below is the method I can think of using a work around.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
#> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
#> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.
SQL Fiddle demonstrating all three.

It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.

So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

Access 2007 select first value of query results

I am running into a rather annoying thingy in Access (2007) and I am not sure if this is a feature or if I am asking for the impossible.
Although the actual database structure is more complex, my problem boils down to this:
I have a table with data about Units for specific years. This data comes from different sources and might overlap.
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 80 | 1 |
A | 2010 | 101 | 2 |
A | 2010 | 150 | 3 |
A | 2011 | 90 | 1 |
...
Now I would like the user to select certain sources, order them by priority and then extract one data value for each year.
For example, if the user selects source 1, 2 and 3 and orders them by (3, 1, 2), then I would like the following result:
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 150 | 3 |
A | 2011 | 90 | 1 |
I am able to order the initial table, based on a specific order. I do this with the following query
SELECT Unit, IYR, X1, Source
FROM TestTable
WHERE Source In (1,2,3)
ORDER BY Unit, IYR,
IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))
This gives me the following intermediate result:
Unit | IYR | X1 | Source |
-----------------------------
A | 2009 | 55 | 1 |
A | 2010 | 150 | 3 |
A | 2010 | 80 | 1 |
A | 2010 | 101 | 2 |
A | 2011 | 90 | 1 |
Next step is to only get the first value of each year. I was thinking to use the following query:
SELECT X.Unit, X.IYR, first(X.X1) as FirstX1
FROM (...) AS X
GROUP BY X.Unit, X.IYR
Where (…) is the above query.
Now Access goes bananas. Whatever order I give to the intermediate results, the result of this query is.
Unit | IYR | X1 |
--------------------
A | 2009 | 55 |
A | 2010 | 80 |
A | 2011 | 90 |
In other words, for year 2010 it shows the value of source 1 instead of 3. It seems that Access does not care about the ordering of the nested query when it applies the FIRST() function and sticks to the original ordering of the data.
Is this a feature of Access or is there a different way of achieving the desired results?
Ps: Next step would be to use a self join to add the source column to the results again, but I first need to resolve above problem.

Rather than use first it may be better to determine the MIN Priority and then join back e.g.
SELECT
t.UNIT,
t.IYR,
t.X1,
t.Source ,
t.PrioritySource
FROM
(SELECT
Unit,
IYR,
X1,
Source,
SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
) as t
INNER JOIN
(SELECT
Unit,
IYR,
MIN(SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3)) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
GROUP BY
Unit,
IYR ) as MinPriortiy
ON t.Unit = MinPriortiy.Unit and
t.IYR = MinPriortiy.IYR and
t.PrioritySource = MinPriortiy.PrioritySource
which will produce this result (Note I include Source and priority source for demonstration purposes only)
UNIT | IYR | X1 | Source | PrioritySource
----------------------------------------------
A | 2009 | 55 | 1 | 2
A | 2010 | 150 | 3 | 1
A | 2011 | 90 | 1 | 2
Note the first subquery is to handle the fact that Access won't let you join on a Switch

Yes, FIRST() does use an arbitrary ordering. From the Access Help:
These functions return the value of a specified field in the first or
last record, respectively, of the result set returned by a query. If
the query does not include an ORDER BY clause, the values returned by
these functions will be arbitrary because records are usually returned
in no particular order.
I don't know whether FROM (...) AS X means you are using an ORDER BY inline (assuming that is actually possible) or if you are using a VIEW ('stored Query object') here but either way I assume the ORDER BY is being disregarded (because an ORDER BY should only apply to the final result).
The alternative is to use MIN() (or possibly MAX()).

This is the most concise way I have found to write such queries in Access that require pulling back all columns that correspond to the first row in a group of records that are ordered in a particular way.
First, I added a UniqueID to your table. In this case, it's just an AutoNumber field. You may already have a unique value in your table, in which case you can use that.
This will choose the row with a Source 3 first, then Source 1, then Source 2. If there is a tie, it picks the one with the higher X1 value. If there is a further tie, it is broken by the UniqueID value:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=
(SELECT TOP 1 [UniqueID] FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, UniqueID)
This yields:
Unit IYR X1 Source UniqueID
A 2009 55 1 1
A 2010 150 3 4
A 2011 90 1 5
I recommend (1) you create an index on the IYR field -- this will dramatically increase your performance for this type of query, and (2) if you have a lot (>~100K) records, this isn't the best choice. I find it works quite well for tables in the 1-70K range. For larger datasets, I like to use my GroupIncrement function to partition each group (similar to SQL Server's ROW_NUMBER() OVER statement).
The Choose() function is a VBA function and may not be clear here. In your case, it sounds like there is some interactivity required. For that, you could create a second table called "Choices", like so:
Rank Choice
1 3
2 1
3 2
Then, you could substitute the following:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=(SELECT TOP 1 [UniqueID] FROM
[TestTable] t2 INNER JOIN [Choices] c
ON t2.Source=c.Choice
WHERE t.IYR=t2.IYR ORDER BY c.[Rank], t2.X1 DESC, t2.UniqueID);
Indexing Source on TestTable and Choice on the Choices table may be helpful here, too, depending on the number of choices required.
Q:
Can you get this to work without the need for surrogate key? For
example what if the unique key is the composite of
{Unit,IYR,X1,Source}
A:
If you have a compound key, you can do it like this-- however I think that if you have a large dataset, it will totally kill the performance of the query. It may help to index all four columns, but I can't say for sure because I don't regularly use this method.
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.Unit & t.IYR & t.X1 & t.Source =
(SELECT TOP 1 Unit & IYR & X1 & Source FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, Unit, IYR)
In certain cases, you may have to coalesce some of the individual parts of the key as follows (though Access generally will coalesce values automatically):
t.Unit & CStr(t.IYR) & CStr(t.X1) & CStr(t.Source)
You could also use a query in your FROM statements instead of the actual table. The query itself would build a composite of the four fields used in the key, and then you'd use the new key name in the WHERE clause of the top SELECT statement, and in the SELECT TOP 1 [key] of the subquery.
In general, though, I will either: (a) create a new table with an AutoNumber field, (b) add an AutoNumber field, (c) add an integer and populate it with a unique number using VBA - this is useful when you get a MaxLocks error when trying to add an AutoNumber, or (d) use an already indexed unique key.

Selecting a record based on integer being in an array field

I have a database of houses. Within the houses mssql database record is a field called areaID. A house could be in multiple areas so an entry could be as follows in the database:
+---------+----------------------+-----------+-------------+-------+
| HouseID | AreaID | HouseType | Description | Title |
+---------+----------------------+-----------+-------------+-------+
| 21 | 17, 32, 53 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
| 23 | 23, 73 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
| 24 | 53, 12, 153, 72, 153 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
| 23 | 23, 53 | B | data | data |
+---------+----------------------+-----------+-------------+-------+
If I open a page that called for houses only in area 53 how would I search for it. I know in MySQL you can use find_in_SET but I am using Microsoft SQL Server 2005.

If your formatting is EXACTLY
N1, N2 (e.g.) one comma and space between each N
Then use this WHERE clause
WHERE ', ' + AreaID + ',' LIKE '%, 53,%'
The addition of the prefix and suffix makes every number, anywhere in the list, consistently wrapped by comma-space and suffixed by comma. Otherwise, you may get false positives with 53 appearing in part of another number.
Note
A LIKE expression will be anything but fast, since it will always scan the entire table.
You should consider normalizing the data into two tables:
Tables become
House
+---------+----------------------+----------+
| HouseID | HouseType | Description | Title |
+---------+----------------------+----------+
| 21 | B | data | data |
| 23 | B | data | data |
| 24 | B | data | data |
| 23 | B | data | data |
+---------+----------------------+----------+
HouseArea
+---------+-------
| HouseID | AreaID
+---------+-------
| 21 | 17
| 21 | 32
| 21 | 53
| 23 | 23
| 23 | 73
..etc
Then you can use
select * from house h
where exists (
select *
from housearea a
where h.houseid=a.houseid and a.areaid=53)

2 options, change the id's of AreaId so that you can use the & operator OR create a table that links the House and Area's....

What datatype is AreaID?
If it's a text field you could something like
WHERE (
AreaID LIKE '53,%' -- Covers: multi number seq w/ 53 at beginning
OR AreaID LIKE '% 53,%' -- Covers: multi number seq w/ 53 in middle
OR AreaID LIKE '% 53' -- Covers: multi number seq w/ 53 at end
OR AreaID = '53' -- Covers: single number seq w/ only 53
)
Note: I haven't used SQL-Server in some time, so I'm not sure about the operators. PostgreSQL has a regex function, which would be better at condensing that WHERE statement. Also, I'm not sure if the above example would include numbers like 253 or 531; it shouldn't but you still need to verify.
Furthermore, there are a bunch of functions that iterate through arrays, so storing it as an array vs text might be better. Finally, this might be a good example to use a stored procedure, so you can call your homebrewed function instead of cluttering your SQL.

Use a Split function to convert comma-separated values into rows.
CREATE TABLE Areas (AreaID int PRIMARY KEY);
CREATE TABLE Houses (HouseID int PRIMARY KEY, AreaIDList varchar(max));
GO
INSERT INTO Areas VALUES (84);
INSERT INTO Areas VALUES (24);
INSERT INTO Areas VALUES (66);
INSERT INTO Houses VALUES (1, '84,24,66');
INSERT INTO Houses VALUES (2, '24');
GO
CREATE FUNCTION dbo.Split (#values varchar(512)) RETURNS table
AS
RETURN
WITH Items (Num, Start, [Stop]) AS (
SELECT 1, 1, CHARINDEX(',', #values)
UNION ALL
SELECT Num + 1, [Stop] + 1, CHARINDEX(',', #values, [Stop] + 1)
FROM Items
WHERE [Stop] > 0
)
SELECT Num, SUBSTRING(#values, Start,
CASE WHEN [Stop] > 0 THEN [Stop] - Start ELSE LEN(#values) END) Value
FROM Items;
GO
CREATE VIEW dbo.HouseAreas
AS
SELECT h.HouseID, s.Num HouseAreaNum,
CASE WHEN s.Value NOT LIKE '%[^0-9]%'
THEN CAST(s.Value AS int)
END AreaID
FROM Houses h
CROSS APPLY dbo.Split(h.AreaIDList) s
GO
SELECT DISTINCT h.HouseID, ha.AreaID
FROM Houses h
INNER JOIN HouseAreas ha ON ha.HouseID = h.HouseID
WHERE ha.AreaID = 24

Interview question: SQL recursion in one statement

Someone I know went to an interview and was given the following problem to solve. I've thought about it for a few hours and believe that it's not possible to do without using some database-specific extensions or features from recent standards that don't have wide support yet.
I don't remember the story behind what is being represented, but it's not relevant. In simple terms, you're trying to represent chains of unique numbers:
chain 1: 1 -> 2 -> 3
chain 2: 42 -> 78
chain 3: 4
chain 4: 7 -> 8 -> 9
...
This information is already stored for you in the following table structure:
id | parent
---+-------
1 | NULL
2 | 1
3 | 2
42 | NULL
78 | 42
4 | NULL
7 | NULL
8 | 7
9 | 8
There could be millions of such chains and each chain can have an unlimited number of entries. The goal is to create a second table that would contain the exact same information, but with a third column that contains the starting point of the chain:
id | parent | start
---+--------+------
1 | NULL | 1
2 | 1 | 1
3 | 2 | 1
42 | NULL | 42
78 | 42 | 42
4 | NULL | 4
7 | NULL | 7
8 | 7 | 7
9 | 8 | 7
The claim (made by the interviewers) is that this can be achieved with just 2 SQL queries. The hint they provide is to first populate the destination table (I'll call it dst) with the root elements, like so:
INSERT INTO dst SELECT id, parent, id FROM src WHERE parent IS NULL
This will give you the following content:
id | parent | start
---+--------+------
1 | NULL | 1
42 | NULL | 42
4 | NULL | 4
7 | NULL | 7
They say that you can now execute just one more query to get to the goal shown above.
In my opinion, you can do one of two things. Use recursion in the source table to get to the front of each chain, or continuously execute some version of SELECT start FROM dst WHERE dst.id = src.parent after each update to dst (i.e. can't cache the results).
I don't think either of these situations is supported by common databases like MySQL, PostgreSQL, SQLite, etc. I do know that in PostgreSQL 8.4 you can achieve recursion using WITH RECURSIVE query, and in Oracle you have START WITH and CONNECT BY clauses. The point is that these things are specific to database type and version.
Is there any way to achieve the desired result using regular SQL92 in just one query? The best I could do is fill-in the start column for the first child with the following (can also use a LEFT JOIN to achieve the same result):
INSERT INTO dst
SELECT s.id, s.parent,
(SELECT start FROM dst AS d WHERE d.id = s.parent) AS start
FROM src AS s
WHERE s.parent IS NOT NULL
If there was some way to re-execute the inner select statement after each insert into dst, then the problem would be solved.

It can not be implemented in any static SQL that follows ANSI SQL 92.
But as you said it can be easy implemented with oracle's CONNECT BY
SELECT id,
parent,
CONNECT_BY_ROOT id
FROM table
START WITH parent IS NULL
CONNECT BY PRIOR id = parent

In SQL Server you would use a Common Table Expression (CTE).
To replicate the stored data I've created a temporary table
-- Create a temporary table
CREATE TABLE #SourceData
(
ID INT
, Parent INT
)
-- Setup data (ID, Parent, KeyField)
INSERT INTO #SourceData VALUES (1, NULL);
INSERT INTO #SourceData VALUES (2, 1);
INSERT INTO #SourceData VALUES (3, 2);
INSERT INTO #SourceData VALUES (42, NULL);
INSERT INTO #SourceData VALUES (78, 42);
INSERT INTO #SourceData VALUES (4, NULL);
INSERT INTO #SourceData VALUES (7, NULL);
INSERT INTO #SourceData VALUES (8, 7);
INSERT INTO #SourceData VALUES (9, 8);
Then I create the CTE to compile the data result:
-- Perform CTE
WITH RecursiveData (ID, Parent, Start) AS
(
-- Base query
SELECT ID, Parent, ID AS Start
FROM #SourceData
WHERE Parent IS NULL
UNION ALL
-- Recursive query
SELECT s.ID, s.Parent, rd.Start
FROM #SourceData AS s
INNER JOIN RecursiveData AS rd ON s.Parent = rd.ID
)
SELECT * FROM RecursiveData WHERE Parent IS NULL
Which will output the following:
id | parent | start
---+--------+------
1 | NULL | 1
42 | NULL | 42
4 | NULL | 4
7 | NULL | 7
Then I clean up :)
-- Clean up
DROP TABLE #SourceData

There is no recursive query support in ANSI-92, because it was added in ANSI-99. Oracle has had it's own recursive query syntax (CONNECT BY) since v2. While Oracle supported the WITH clause since 9i, SQL Server is the first I knew of to support the recursive WITH/CTE syntax -- Oracle didn't start until 11gR2. PostgreSQL added support in 8.4+. MySQL has had a request in for WITH support since 2006, and I highly doubt you'll see it in SQLite.
The example you gave is only two levels deep, so you could use:
INSERT INTO dst
SELECT a.id,
a.parent,
COALESCE(c.id, b.id) AS start
FROM SRC a
LEFT JOIN SRC b ON b.id = a.parent
LEFT JOIN SRC c ON c.id = b.parent
WHERE a.parent IS NOT NULL
You'd have to add a LEFT JOIN for the number of levels deep, and add them in proper sequence to the COALESCE function.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SQL DB2 : How to JOIN dynamicaly 2 tables using "running total calculations" - sql

Related

How can I query a table for transitive matches?

Find spectators that have seen the same shows (match multiple rows for each)

Access 2007 select first value of query results

Selecting a record based on integer being in an array field

Interview question: SQL recursion in one statement

Categories

Resources