Extracting specific strings from text field - sql

I'm working in a SQL Server database. I have a table with an alphanumeric field that is always exactly 5 characters long and never contains special characters. This table has roughly 100K rows.
I have another table with a string field that may or may not contain these codes. This field currently has roughly 2500 possible formats, and formats can be both added and modified. Unfortunately, I don't have access to the data used to determine what should be in the field.
Table1.Model
A1234
B1234
A6485
16849
A4665
99999
Table2.StringField
I have purchased model number A1234 after returning B6485
I have purchased model number 16849 after we thought about 99999
I have purchased model number B1234 before also looking at A1234
I returned A4665 and never purchased anything else
I have no money and don’t buy anything
I am looking to scrape the model numbers from these. I am currently using a case statement which accounts for basically 20 of the possible formats. I add on to the case statement as I find new scenarios that might appear in my data.
pseudo code:
select
case
when stringfield like 'I have purchased model number%' then substring(stringfield, 31, 5)
when stringfield like 'I returned%' then substring(stringfield, 12, 5)
else 'N/A'
end as Model1,
case
when stringfield like 'I have purchased model number%return%' then substring(stringfield, 53, 5)
when stringfield like 'I have purchased model number%' then substring(stringfield, 60, 5)
else 'N/A'
end as Model2
from Table2
Expected results:
I have purchased model number A1234 after returning B6485
Model1 = A1234 Model2=B6485
I have purchased model number 16849 after we thought about 99999
Model1 = 16849 Model2=99999
I have purchased model number B1234 before also looking at A1234
Model1 = B1234 Model2=A1234
I returned A4665 and never purchased anything else
Model1 = A4665 Model2=N/A
I have no money and don’t buy anything
Model1 = N/A Model2=N/A
I am putting the various scenarios into a reference table so that I can just update that as needed.
Is there a better way to do this? It's not a huge deal to just keep an eye on things and make updates as necessary. But it's just one more item on my list of things that needs to be maintained.
Thanks in advance.
One thing that I forgot to mention is that there is sometimes another substring of the field that is like A14351835410571982 - and I don't want anything from that string.
The things that I've thought about trying are:
Crossjoin from Table1 to itself and then saying
If stringfield like '%value1%value2%' then value1 and value2.
But that is 100k x 100k combinations which seems prohibitively large.
Searching stringfield for anything that's 5 characters long followed by a space or a period or a comma that's either all numbers or a single letter and 4 numbers and then somehow getting the first string and the second string in that order.
A combination of the first two: Identify all 5 character strings in all records then crossjoin them and match with wildcards. This would probably be about 20k values instead of 100k
Continuing down the path that I'm currently on and just do it with brute force
** Note: I am a report analyst, not a developer, so I know enough SQL to be dangerous. I can typically follow along with up to mid-complexity SQL but might need help with anything above that.

Here is an example of how a combination of string_split and cross apply can get the models from the strings.
create table Models (
code char(5) primary key,
name varchar(30) not null
);
insert into Models (code, name) values
('A1234','Model A4')
, ('B6485','Model B5')
, ('16849','Model 49')
, ('99999','Model Five9')
, ('B1234','Model B4')
, ('A4665','Model A5')
;
create table Comments (
id int identity(1,1) primary key,
comment varchar(max)
);
insert into Comments (comment) values
('I have purchased model number A1234 after returning B6485')
, ('I have purchased model number 16849 after we thought about 99999')
, ('I have purchased model number B1234 before also looking at A1234')
, ('I returned A4665 and never purchased anything')
, ('I have no money and don’t buy anything')
;
create table #tmpCommentCodes (
id int identity(1,1) primary key,
comment_id int,
model_code varchar(max)
);
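-- Split each comment on spaces and keep only the tokens that are exactly
-- five uppercase letters or digits. The binary collation makes [A-Z]
-- case-sensitive, so ordinary five-letter words are not picked up.
insert into #tmpCommentCodes (comment_id, model_code)
select c.id, ca.code
from Comments c
cross apply (
select value as code
from string_split(c.comment, ' ') spl
where value COLLATE Latin1_General_BIN like '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
) ca;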
select tmp.*, model.name
from #tmpCommentCodes as tmp
left join Models as model
on model.code = tmp.model_code
id | comment_id | model_code | name
-: | ---------: | :--------- | :----------
1 | 1 | A1234 | Model A4
2 | 1 | B6485 | Model B5
3 | 2 | 16849 | Model 49
4 | 2 | 99999 | Model Five9
5 | 3 | B1234 | Model B4
6 | 3 | A1234 | Model A4
7 | 4 | A4665 | Model A5
Then the temporary table can be used to replace the codes.
For example:
WITH RCTE_COMMENTS AS
(
SELECT TOP 1 WITH TIES
c.id AS comment_id
, 0 AS lvl
, tmp.id AS tmpId
, REPLACE(c.comment, CONCAT(' ', tmp.model_code), CONCAT(' ', model.name)) AS comment
FROM Comments AS c
LEFT JOIN #tmpCommentCodes AS tmp
ON tmp.comment_id = c.id
LEFT JOIN Models AS model
ON model.code = tmp.model_code
ORDER BY ROW_NUMBER() OVER (PARTITION BY c.id ORDER BY tmp.id)
UNION ALL
SELECT c.comment_id
, c.lvl+1
, tmp.id
, REPLACE(c.comment, ' '+tmp.model_code, ' '+model.name)
FROM RCTE_COMMENTS AS c
JOIN #tmpCommentCodes AS tmp
ON tmp.comment_id = c.comment_id
AND tmp.id = c.tmpId + 1
JOIN Models AS model
ON model.code = tmp.model_code
)
SELECT TOP 1 WITH TIES comment_id
, REPLACE(comment, 'model number ', '') AS comment
FROM RCTE_COMMENTS
ORDER BY ROW_NUMBER() OVER (PARTITION BY comment_id ORDER BY tmpId DESC)
comment_id | comment
---------: | :-----------------------------------------------------------
1 | I have purchased Model A4 after returning Model B5
2 | I have purchased Model 49 after we thought about Model Five9
3 | I have purchased Model B4 before also looking at Model A4
4 | I returned Model A5 and never purchased anything
5 | I have no money and don’t buy anything
db<>fiddle here
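For comparison, the token-matching idea from the question (a 5-character token that is either all digits or one letter followed by four digits) can also be sketched outside the database. This is a minimal Python sketch, not the method used in the answer above. The function name `extract_models` and the optional `known_models` set (standing in for Table1) are assumptions made for illustration.

```python
import re

# Five characters: one uppercase letter plus four digits, or five digits.
# The lookarounds reject matches embedded in longer alphanumeric runs,
# such as A14351835410571982.
MODEL_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(?:[A-Z][0-9]{4}|[0-9]{5})(?![A-Za-z0-9])"
)

def extract_models(text, known_models=None, limit=2):
    """Return up to `limit` model codes in order of appearance, padded with 'N/A'."""
    found = MODEL_PATTERN.findall(text)
    if known_models is not None:
        # Hypothetical lookup set; keeps only codes that exist in the model table.
        found = [code for code in found if code in known_models]
    found = found[:limit]
    return found + ["N/A"] * (limit - len(found))
```

Because the pattern checks the characters on either side of a candidate, a code followed by a period or comma is still found, while the long reference strings are skipped.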

Related

Case statement logic and substring

Say I have the following data:
Passes
ID | Pass_code
-----------------
100 | 2xBronze
101 | 1xGold
102 | 1xSilver
103 | 2xSteel
Passengers
ID | Passengers
-----------------
100 | 2
101 | 5
102 | 1
103 | 3
I want to count the passes and then create a ticket, with output like this:
ID 100 | 2 pass (bronze)
ID 101 | 5 pass (because it is gold, we count all passengers)
ID 102 | 1 pass (silver)
ID 103 | 2 pass (steel)
I was thinking of something like the code below; however, I am unsure how to finish my case statement. I want to substring pass_code so that we get the number of passes, e.g. '2xBronze' should give me 2. Then for ID 103, we have 2 passes and 3 customers, so we should output 2.
Also, is there a way to first find '2xbronze' if pass_code contains lots of other things, such as '101001, 1xbronze, FirstClass'? This may change, so I don't want to rely on a fixed substring position. Could we search for '2xbronze' and then pull out the 2?
SELECT
CASE
WHEN Passes.pass_code like '%gold%' THEN Passengers.passengers
WHEN Passes.pass_code like '%steel%' THEN SUBSTRING(passes.pass_code, 1,1)
WHEN Passes.pass_code like '%bronze%' THEN SUBSTRING(passes.pass_code, 1,1)
WHEN Passes.pass_code like '%silver%' THEN SUBSTRING(passes.pass_code, 1,1)
else 0 end as no,
Passes.ID,
Passes.Pass_code,
Passengers.Passengers
FROM Passes
JOIN Passengers ON Passes.ID = Passengers.ID
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=db698e8562546ae7658270e0ec26ca54
Assuming you are indeed using Oracle (as your DB fiddle implies), you can do some string magic by finding the position of a splitter character (in your case the x) and then taking substrings based on that. Obviously this has its problems, and x is a bad separator character as well, but it works for your current data set.
WITH PASSCODESPLIT AS
(
SELECT PASSES.ID,
TO_NUMBER(SUBSTR(PASSES.PASS_CODE, 1, INSTR(PASSES.PASS_CODE, 'x') - 1)) AS NrOfPasses,
-- lower-case the type so the comparison with 'gold' below also matches 'Gold'
LOWER(SUBSTR(PASSES.PASS_CODE, INSTR(PASSES.PASS_CODE, 'x') + 1)) AS PassType
FROM Passes
)
SELECT
PASSCODESPLIT.ID,
CASE
WHEN PASSCODESPLIT.PassType = 'gold' THEN Passengers.Passengers
ELSE PASSCODESPLIT.NrOfPasses
END AS NrOfPasses,
PASSCODESPLIT.PassType,
Passengers.Passengers
FROM PASSCODESPLIT
INNER JOIN Passengers ON PASSCODESPLIT.ID = Passengers.ID
ORDER BY PASSCODESPLIT.ID ASC
Gives the result of:
ID NROFPASSES PASSTYPE PASSENGERS
100 2 bronze 2
101 5 gold 5
102 1 silver 1
103 2 steel 3
As can also be seen in this fiddle
But I would strongly advise you to fix your table design. Having multiple attributes in the same column leads to trouble like this, and the more variables and variations you start storing, the more 'magic' you need to keep doing.
In this particular example I see no reason why you don't simply have the 3 columns in Passes, which would also let you add new columns going forward, e.g. to keep track of first class.
You can extract the numbers using regexp_substr(). So I think this does what you want:
SELECT (CASE WHEN LOWER(p.pass_code) LIKE '%gold%'
THEN pp.passengers
ELSE TO_NUMBER(REGEXP_SUBSTR(p.pass_code, '^[0-9]+'))
END) as num,
p.ID, p.Pass_code, pp.Passengers
FROM Passes p JOIN
Passengers pp
ON p.ID = pp.ID;
Here is a db<>fiddle.
This converts the leading digits in the code to a number. Also note the use of table aliases to simplify the query.
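The follow-up question (finding something like '2xbronze' inside a longer string such as '101001, 1xbronze, FirstClass' and pulling out the number) can be illustrated with a regular expression. This is a minimal Python sketch of the logic only, not Oracle code; the name `ticket_count` and the fixed list of pass types are assumptions based on the sample data.

```python
import re

# Matches e.g. "2xBronze" anywhere in a longer string, ignoring case.
PASS_PATTERN = re.compile(r"(\d+)x(bronze|silver|steel|gold)", re.IGNORECASE)

def ticket_count(pass_code, passengers):
    """Gold counts every passenger; other types count the number before the 'x'."""
    match = PASS_PATTERN.search(pass_code)
    if match is None:
        return 0  # no recognisable pass in the string
    count = int(match.group(1))
    pass_type = match.group(2).lower()
    return passengers if pass_type == "gold" else count
```

In Oracle the same pattern could presumably be applied with REGEXP_SUBSTR, but that variant is not shown or tested here.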

How to create two JOIN-tables so that I can compare attributes within?

I am taking a database course in which we have listings of AirBnBs and need to write some SQL queries against the relationship model we built from the data, but I am struggling with one in particular:
I have two tables of interest, Billing and Amenities. The first one has the id and price of each listing; the second has id and wifi (let's say, to simplify, that wifi equals 1 if there is Wifi and 0 otherwise). Both have other attributes that we don't really care about here.
So the query is, "What is the difference in the average price of listings with and without Wifi?"
My idea was to build two JOIN-tables, one with listings that have wifi and one with listings that don't, and compare them:
SELECT avg(B.price - A.price) as averagePrice
FROM (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 0
) A, (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 1) B
WHERE A.id = B.id;
Obviously this doesn't work... I am pretty sure there is a far easier solution, though. What am I missing?
(And by the way, is there a way to compute the absolute value of the price difference?)
I hope that I was clear enough. Thank you for your time!
Edit: As mentioned in the comments, I forgot to say that both tables have id as their primary key, so there is one row per listing.
Just use conditional aggregation:
SELECT AVG(CASE WHEN a.wifi = 0 THEN b.price END) as avg_no_wifi,
AVG(CASE WHEN a.wifi = 1 THEN b.price END) as avg_wifi
FROM Billing b JOIN
Amenities a
ON b.id = a.id
WHERE a.wifi IN (0, 1);
You can subtract one expression from the other if you want the difference instead of the two separate values.
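As a rough illustration of that subtraction, here is a minimal sketch that runs the conditional aggregation against an in-memory SQLite database from Python. The three sample rows are made up for the example, and SQLite stands in for whatever database the course uses, so treat the setup as an assumption. ABS() also covers the side question about the absolute difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Billing (id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE Amenities (id INTEGER PRIMARY KEY, wifi INTEGER);
    -- made-up sample rows: two listings with wifi, one without
    INSERT INTO Billing VALUES (1, 1500), (2, 1700), (3, 1800);
    INSERT INTO Amenities VALUES (1, 1), (2, 1), (3, 0);
""")

# Each AVG ignores the NULLs produced by the CASE for non-matching rows.
query = """
    SELECT ABS(AVG(CASE WHEN a.wifi = 1 THEN b.price END)
             - AVG(CASE WHEN a.wifi = 0 THEN b.price END)) AS price_diff
    FROM Billing b
    JOIN Amenities a ON b.id = a.id
"""
price_diff = conn.execute(query).fetchone()[0]
print(price_diff)  # 200.0 for the sample rows above
```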
Let's assume we're working with data like the following (problems with your data model are noted below):
Billing
+------------+---------+
| listing_id | price |
+------------+---------+
| 1 | 1500.00 |
| 2 | 1700.00 |
| 3 | 1800.00 |
| 4 | 1900.00 |
+------------+---------+
Amenities
+------------+------+
| listing_id | wifi |
+------------+------+
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
+------------+------+
Notice that I changed "id" to "listing_id" to make it clear what it was (using "id" as an attribute name is problematic anyways). Also, note that one listing doesn't have an entry in the Amenities table. Depending on your data, that may or may not be a concern (again, refer to the bottom for a discussion of your data model).
Based on this data, your averages should be as follows:
Listings with wifi average $1600 (listings 1 and 2).
Listings without wifi (just listing 3) average $1800.
So the difference would be $200.
To achieve this result in SQL, it may be helpful to first get the average cost per amenity (whether wifi is offered). This would be obtained with the following query:
SELECT
Amenities.wifi AS has_wifi,
AVG(Billing.price) AS avg_cost
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
which gives you the following results:
+----------+-----------------------+
| has_wifi | avg_cost |
+----------+-----------------------+
| 0 | 1800.0000000000000000 |
| 1 | 1600.0000000000000000 |
+----------+-----------------------+
So far so good. So now we need to calculate the difference between these 2 rows. There are a number of different ways to do this, but one is to use a CASE expression to make one of the values negative, and then simply take the SUM of the result (note that I'm using a CTE, but you can also use a sub-query):
WITH
avg_by_wifi(has_wifi, avg_cost) AS
(
SELECT Amenities.wifi, AVG(Billing.price)
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
)
SELECT
ABS(SUM
(
CASE
WHEN has_wifi = 1 THEN avg_cost
ELSE -1 * avg_cost
END
))
FROM avg_by_wifi
which gives us the expected value of 200.
Now regarding your data model:
If both your Billing and Amenities table only have 1 row for each listing, it makes sense to combine them into 1 table. For example: Listings(listing_id, price, wifi)
However, this is still problematic, because you probably have a bunch of other amenities you want to model (pool, sauna, etc.) So you might want to model a many-to-many relationship between listings and amenities using an intermediate table:
Listings(listing_id, price)
Amenities(amenity_id, amenity_name)
ListingsAmenities(listing_id, amenity_id)
This way, you could list multiple amenities for a given listing without having to add additional columns. It also becomes easy to store additional information about an amenity: What's the wifi password? How deep is the pool? etc.
Of course, using this model makes your original query (difference in average cost of listings by wifi) a bit trickier, but definitely still doable.
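To give an idea of what "still doable" might look like, here is a minimal sketch of the wifi comparison against the three-table model, run on an in-memory SQLite database from Python. The sample rows and the amenity name 'wifi' are assumptions made for illustration. Note that in this model a listing with no rows in ListingsAmenities simply counts as having no wifi.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Listings (listing_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE Amenities (amenity_id INTEGER PRIMARY KEY, amenity_name TEXT);
    CREATE TABLE ListingsAmenities (listing_id INTEGER, amenity_id INTEGER);
    -- made-up sample rows
    INSERT INTO Listings VALUES (1, 1500), (2, 1700), (3, 1800), (4, 1900);
    INSERT INTO Amenities VALUES (1, 'wifi'), (2, 'pool');
    INSERT INTO ListingsAmenities VALUES (1, 1), (2, 1), (2, 2), (3, 2);
""")

# First flag each listing by whether a 'wifi' link row exists,
# then apply the same conditional aggregation as before.
query = """
    WITH flagged AS (
        SELECT l.price,
               EXISTS (SELECT 1
                       FROM ListingsAmenities la
                       JOIN Amenities a ON a.amenity_id = la.amenity_id
                       WHERE la.listing_id = l.listing_id
                         AND a.amenity_name = 'wifi') AS has_wifi
        FROM Listings l
    )
    SELECT ABS(AVG(CASE WHEN has_wifi = 1 THEN price END)
             - AVG(CASE WHEN has_wifi = 0 THEN price END))
    FROM flagged
"""
wifi_gap = conn.execute(query).fetchone()[0]
print(wifi_gap)  # 250.0: listings 1 and 2 average 1600, listings 3 and 4 average 1850
```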

Sql query for calculating room prices

Hi, I have a problem I have been working on for a while now. Let's say I have a view, call it room_price, that looks like this:
room | people | price | hotel
1 | 1 | 200 | A
2 | 2 | 99 | A
3 | 3 | 95 | A
4 | 1 | 90 | B
5 | 6 | 300 | B
I am looking for the lowest price in a given hotel for x people.
For 1 person I would expect:
hotel | price
A | 200
B | 90
For 2 people I would have:
hotel | price
A | 99
This is because hotel B has no rooms that fit exactly 2 people. A room for 6 cannot be used for fewer (or more) than 6 people.
For hotel A the price is 99 because I use room 2.
For 6 people the result should be:
hotel | price
A | 394
B | 300
So for hotel A I take rooms 1, 2 and 3, and for hotel B the lowest price is room 5 alone, for 300.
I did it with the restriction that people must fit into at most 3 rooms, which is acceptable, but my query is too slow. It looks something like this:
select a.hotel,a.price+a1.price+a2.price
from room_price a, room_price a1, room_price a2
where
a.room<> a1.room
and a1.room<> a2.room
and a.room<> a2.room
and a.hotel = a1.hotel
and a.hotel = a2.hotel
After that I grouped by hotel and took min(price), and it worked. But running the query that produces room_price three times and then taking the Cartesian product of the results took too much time. There are around 5000 rows in room_price, and the SQL that generates this data is rather complicated (it takes start and end dates, multiple prices, currency exchange...).
I can use SQL, custom functions, or anything else that will make this fast, but I would prefer to stay at the database level without processing this data in the application (I am using Java), as I will be extending the query later to add more data.
I would be grateful for any help.
Query itself:
WITH RECURSIVE
setup as (
SELECT 3::INT4 as people
),
room_sets AS (
SELECT
n.hotel,
array[ n.room ] as rooms,
n.price,
n.people
FROM
setup s,
room_price n
WHERE
n.people <= s.people
UNION ALL
SELECT
rs.hotel,
rs.rooms || n.room,
rs.price + n.price as price,
rs.people + n.people as people
FROM
setup s,
room_sets rs
join room_price n using (hotel)
WHERE
n.room > rs.rooms[ array_upper( rs.rooms, 1 )]
AND rs.people + n.people <= s.people
),
results AS (
SELECT
DISTINCT ON (rs.hotel)
rs.*
FROM
room_sets rs,
setup s
WHERE
rs.people = s.people
ORDER BY
rs.hotel, rs.price
)
SELECT * FROM results;
Tested it on this dataset:
CREATE TABLE room_price (
room INT4 NOT NULL,
people INT4 NOT NULL,
price INT4 NOT NULL,
hotel TEXT NOT NULL,
PRIMARY KEY (hotel, room)
);
copy room_price FROM stdin WITH DELIMITER ',';
1,1,200,A
2,2,99,A
3,3,95,A
4,1,90,B
5,6,300,B
\.
Please note that it will become much slower as you add more rooms to your database.
To change how many people you want results for, change the setup part.
I wrote a detailed explanation of how it works.
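The idea behind the recursive query (build every room set once per hotel, keep the sets that hold exactly the requested number of people, then take the cheapest) can be sketched in a few lines of Python for checking results on small data. This is only an illustration of the logic, not a replacement for the SQL; the function name `cheapest_per_hotel` and the `max_rooms` cap of 3 (taken from the question's stated restriction) are assumptions.

```python
from itertools import combinations

# (room, people, price, hotel) rows copied from the question's room_price view.
ROOM_PRICE = [
    (1, 1, 200, "A"),
    (2, 2, 99, "A"),
    (3, 3, 95, "A"),
    (4, 1, 90, "B"),
    (5, 6, 300, "B"),
]

def cheapest_per_hotel(rows, people, max_rooms=3):
    """Lowest total price per hotel over room sets that hold exactly `people`."""
    by_hotel = {}
    for room, capacity, price, hotel in rows:
        by_hotel.setdefault(hotel, []).append((capacity, price))
    best = {}
    for hotel, rooms in by_hotel.items():
        for size in range(1, max_rooms + 1):
            # combinations() yields each room set once, which plays the same
            # role as the "n.room > last room" filter in the recursive query.
            for combo in combinations(rooms, size):
                if sum(capacity for capacity, _ in combo) != people:
                    continue
                total = sum(price for _, price in combo)
                if hotel not in best or total < best[hotel]:
                    best[hotel] = total
    return best
```

For the sample rows, `cheapest_per_hotel(ROOM_PRICE, 6)` gives 394 for hotel A and 300 for hotel B, matching the expected results in the question.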
It looks like your query as typed is incorrect with the FROM clause... it looks like aliases are out of whack
from room_price a, room_price,a1 room_price,room_price a2
and should be
from room_price a, room_price a1, room_price a2
That MIGHT be giving the query a false alias / extra table giving some sort of Cartesian product making it hang....
--- ok on the FROM clause...
Additionally, and just a thought... Since the "Room" appears to be an internal auto-increment ID column, it will never be duplicated, such as Room 100 in hotel A and Room 100 in hotel B. Your query to do <> on the room make sense so you are never comparing across the board on all 3 tables...
Why not force the a1 and a2 joins to only qualify for room GREATER than "a" room. Otherwise you'll be re-testing the same conditions over and over. From your example data, just on hotel A, you have room IDs of 1, 2 and 3. You are thus comparing
a a1 a2
1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1
Would it help to only compare where "a1" is always greater than "a" and "a2" is always greater than "a1" thus doing tests of
a a1 a2
1 2 3
would give the same results as all the rest, and thus reduce your result to one record in this case. But then, how can you really compare against a hotel with only TWO rooms, such as hotel B? You would NEVER get an answer, since your qualification for rooms is
a <> a1 AND
a <> a2 AND
a1 <> a2
You may want to try cutting down to only a single self-join for a1, a2 and keep the compare only to the two, such as
select a1.hotel, a1.price + a2.price
from room_price a1, room_price a2
where a1.hotel = a2.hotel
and a2.room > a1.room
For hotel "A", you would thus have final result comparisons of
a1 a2
1 2
1 3
2 3
and for hotel "B"
a1 a2
4 5
The use of <> is going to have a rather large impact when you start to look at larger data sets, especially if the prior filtering doesn't drastically reduce their size. By using it you may prevent the direct query from being optimised and from using indexes, and the view may not use indexes either, because SQL will attempt to run the filters for the query and the view against the tables in as few statements as possible (pending optimisations done by the engine).
I would ideally start with the view and confirm it's properly optimised. Looking only at the query itself, this version has a better chance of being optimised:
SELECT
a.hotel, a.price + a1.price + a2.price
FROM
room_price a,
room_price a1,
room_price a2
WHERE
(a.room > a1.room OR a.room < a1.room) AND
(a1.room > a2.room OR a1.room < a2.room) AND
(a.room > a2.room OR a.room < a2.room) AND
a.hotel = a1.hotel AND
a.hotel = a2.hotel
It appears to return the same results, but I'm not sure how you implement this query in your overall solution. So consider just the nature of the changes to the existing query and what you have done already.
Hopefully that helps. If not, you might need to consider what the view is doing and how it works: a view that returns results from a temp table or table variable can't use indexing either. In that case, generating an indexed temp table might work better for you.

Optimal solution for interview question

Recently in a job interview, I was given the following problem.
Say I have the following table
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
a | 15.00 | 1
b | 30.00 | 1
c | 20.00 | 1
d | 25.00 | 1
where widget_name holds the name of the widget, widget_costs is the price of a widget, and in_stock is a constant of 1.
Now, for my business insurance I have a certain deductible. I am looking for a SQL statement that will return every widget (and its price) left over after the deductible has been met. So if my deductible is $50.00, the above would just return
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
a | 15.00 | 1
d | 25.00 | 1
since widgets b and c were used to meet the deductible.
The closest I could get is the following
SELECT
*
FROM (
SELECT
widget_name,
widget_price
FROM interview.tbl_widgets
minus
SELECT widget_name,widget_price
FROM (
SELECT
widget_name,
widget_price,
50 - sum(widget_price) over (ORDER BY widget_price ROWS between unbounded preceding and current row) as running_total
FROM interview.tbl_widgets
)
where running_total >= 0
)
;
Which gives me
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
c | 20.00 | 1
d | 25.00 | 1
because it uses a and b to meet the majority of the deductible.
I was hoping someone might be able to show me the correct answer.
EDIT: I understood the interview question to be asking this: given a table of widgets and their prices and given a dollar amount, subtract as many of the widgets as you can up to the dollar amount and return the widgets and prices that remain.
I'll put an answer up, just in case it's easier than it looks, but if the idea is just to return any widget that costs more than the deductible then you'd do something like this:
Select
Widget_Name, Widget_Cost, In_Stock
From
Widgets
Where
Widget_Cost > 50 -- SubSelect for variable deductibles?
For your sample data my query returns no rows.
I believe I understand your question, but I'm not 100%. Here is what I'm assuming you mean:
Your deductible is, say, $50. To meet the deductible you have to "use" two items. (Is this always two? How high can it go? Can it be just one? What if they don't total exactly $50? There is a lot of missing information.) You then want to return the widgets that aren't being used towards the deductible. I have the following.
CREATE TABLE #test
(
widget_name char(1),
widget_cost money
)
INSERT INTO #test (widget_name, widget_cost)
SELECT 'a', 15.00 UNION ALL
SELECT 'b', 30.00 UNION ALL
SELECT 'c', 20.00 UNION ALL
SELECT 'd', 25.00
SELECT * FROM #test t1
WHERE t1.widget_name NOT IN (
SELECT t1.widget_name FROM #test t1
CROSS JOIN #test t2
WHERE t1.widget_cost + t2.widget_cost = 50 AND t1.widget_name != t2.widget_name)
Which returns
widget_name widget_cost
----------- ---------------------
a 15.00
d 25.00
This looks like a bin packing problem. These are really hard to solve, especially with SQL.
If you search on SO for Bin Packing + SQL, you'll find how to find Sum(field) in condition ie “select * from table where sum(field) < 150”, which is basically the same problem except that you want to add a NOT IN to it.
I couldn't get the accepted answer by brianegge to work but what he wrote about it in general was interesting
..the problem you describe of wanting the selection of users which would most closely fit into a given size, is a bin packing problem. This is an NP-Hard problem, and won't be easily solved with ANSI SQL. However, the above seems to return the right result, but in fact it simply starts with the smallest item, and continues to add items until the bin is full.
A general, more effective bin packing algorithm would is to start with the largest item and continue to add smaller ones as they fit. This algorithm would select users 5 and 4.
So with this advice you could write a cursor to loop over the table to do just this (it just wouldn't be pretty).
Aaron Alton gives a nice link to a series of articles that attempts to solve the bin packing problem with SQL but basically concludes that it's probably best to use a cursor to do it.
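The largest-first idea from the quoted advice can be sketched procedurally. This is a minimal Python illustration of that greedy heuristic, not a SQL solution and not a guaranteed optimum for bin packing in general; the function name `remaining_after_deductible` is hypothetical.

```python
def remaining_after_deductible(widgets, deductible):
    """Greedy largest-first: use each widget that still fits, return the rest."""
    left = deductible
    remaining = []
    # Visit widgets from most to least expensive.
    for name, cost in sorted(widgets, key=lambda w: w[1], reverse=True):
        if cost <= left:
            left -= cost  # this widget goes toward the deductible
        else:
            remaining.append((name, cost))
    return sorted(remaining)

widgets = [("a", 15.00), ("b", 30.00), ("c", 20.00), ("d", 25.00)]
print(remaining_after_deductible(widgets, 50.00))  # [('a', 15.0), ('d', 25.0)]
```

On the question's sample data this uses b and c toward the $50 deductible and leaves a and d, which matches the expected output. A cursor in SQL would walk the rows in the same order.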

Sql Ordering Hierarchy

I am working on a SQL Statement that I can't seem to figure out. I need to order the results alphabetically, however, I need "children" to come right after their "parent" in the order. Below is a simple example of the table and data I'm working with. All non relevant columns have been removed. I'm using SQL Server 2005. Is there an easy way to do this?
tblCats
=======
idCat | fldCatName | idParent
--------------------------------------
1 | Some Category | null
2 | A Category | null
3 | Top Category | null
4 | A Sub Cat | 1
5 | Sub Cat1 | 1
6 | Another Cat | 2
7 | Last Cat | 3
8 | Sub Sub Cat | 5
Results of Sql Statement:
A Category
  Another Cat
Some Category
  A Sub Cat
  Sub Cat1
    Sub Sub Cat
Top Category
  Last Cat
(The prefixed spaces in the result are just to aid in understanding of the results; I don't want the prefixed spaces in my SQL result. The result only needs to be in this order.)
You can do it with a hierarchical query, as below.
It looks a lot more complicated than it is, due to the lack of a PAD function in T-SQL. The seeds of the hierarchy are the categories without parents. The fourth column we select is their alphabetical ranking (converted to a string and padded). Then we union this with their children. At each recursion, the children will all be at the same level, so we can get their alphabetical ranking without needing to partition. We can concatenate these rankings together down the tree, and order by that.
;WITH Hierarchy AS (
SELECT
idCat, fldCatName, idParent,
CAST(RIGHT('00000'+
CAST(ROW_NUMBER() OVER (ORDER BY fldCatName) AS varchar(8))
, 5)
AS varchar(256)) AS strPath
FROM Category
WHERE idParent IS NULL
UNION ALL
SELECT
c.idCat, c.fldCatName, c.idParent,
CAST(h.strPath +
CAST(RIGHT('00000'+
CAST(ROW_NUMBER() OVER (ORDER BY c.fldCatName) AS varchar(8))
, 5) AS varchar(16))
AS varchar(256))
FROM Hierarchy h
INNER JOIN Category c ON c.idParent = h.idCat
)
SELECT idCat, fldCatName, idParent, strPath
FROM Hierarchy
ORDER BY strPath
With your data:
idCat fldCatName idParent strPath
------------------------------------------------
2 A Category NULL 00001
6 Another Category 2 0000100001
1 Some Category NULL 00002
4 A Sub Category 1 0000200001
5 Sub Cat1 1 0000200002
8 Sub Sub Category 5 000020000200001
3 Top Category NULL 00003
7 Last Category 3 0000300001
It can be done in CTE... Is this what you're after ?
With MyCats (CatName, CatId, CatLevel, SortValue)
As
( Select fldCatName CatName, idCat CatId,
0 Level, Cast(fldCatName As varChar(200)) SortValue
From tblCats
Where idParent Is Null
Union All
Select c.fldCatName CatName, c.idCat CatID,
CatLevel + 1 CatLevel,
Cast(SortValue + '\' + fldCatName as varChar(200)) SortValue
From tblCats c Join MyCats p
On p.CatId = c.idParent)
Select CatName, CatId, CatLevel, SortValue
From MyCats
Order By SortValue
EDIT: (thx to Pauls' comment below)
If 200 characters is not enough to hold the longest concatenated string "path", then change the value to as high as is needed... you can make it as high as 8000
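Both CTE answers above rely on the same idea: give every row a sort key made of its ancestors' positions or names, then order by that key. A minimal Python sketch of the idea, using the question's sample rows, looks like this. It is an illustration of the logic only; the function name `ordered_names` is hypothetical.

```python
# (idCat, fldCatName, idParent) rows copied from the question's tblCats table.
CATS = [
    (1, "Some Category", None),
    (2, "A Category", None),
    (3, "Top Category", None),
    (4, "A Sub Cat", 1),
    (5, "Sub Cat1", 1),
    (6, "Another Cat", 2),
    (7, "Last Cat", 3),
    (8, "Sub Sub Cat", 5),
]

def ordered_names(rows):
    """Sort categories alphabetically with children directly after their parent."""
    by_id = {cat_id: (name, parent) for cat_id, name, parent in rows}

    def path(cat_id):
        # Walk up to the root collecting names; reversed, this is the sort key.
        names = []
        while cat_id is not None:
            name, cat_id = by_id[cat_id]
            names.append(name)
        return names[::-1]

    return [by_id[cat_id][0] for cat_id in sorted(by_id, key=path)]
```

Comparing the name lists element by element avoids the separator and length concerns of a concatenated string path, though in SQL a padded or delimited string is the practical equivalent.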
I'm not aware of any inherent SQL Server (or ANSI SQL) support for this.
I don't suppose you'd consider a temp table and a recursive stored procedure an "easy" way? :)
Paul's answer is excellent, but I thought I would throw in another idea for you. Joe Celko has a solution for this in his SQL for Smarties book (chapter 29). It involves maintaining a separate table containing the hierarchy info. Inserts, updates, and deletes are a little complicated, but selects are very fast.
Sorry I don't have a link or any code to post, but if you have access to this book, you may find this helpful.