I have a database full of messages from various chatbots. The chatbots all
follow a decision-tree format: ultimately they present questions with choices,
to which the user responds.
The bot may send a message (Hello would you like A or B?) which has options
attached, A and B for example. The user responds B. Both of these messages are
recorded, and each one stores the id of the previous message.
id | message                      | options | previous_id
1  | Hello would you like A or B? | A,B     |
2  | A                            |         | 1
The structure of these conversations is not fixed; there may be various forms
of message flow. The above is a simplistic example of how the messages are
chained together. For example:
// text question on same message as options, with preceding unrelated messages
Hello -> My name is Ben the bot. -> How are you today? (good, bad) -> [good]
// text question not on same message as options
Pick your favourite colour -> {picture of blue and red} (blue, red) -> [blue]
// no question just option prompt - here preceding text wasn't a question
[red] -> (ferrari, lamborghini) -> [ferrari]
-> denotes separation of messages
[] denotes reply to bot from user
() denotes options attached to messages
{} denotes attachments
What I am trying to get from this data is a row for every question with its
corresponding answer. The problem I'm facing is the (presumably necessary)
recursion: I'd have to retrieve the previous message repeatedly until one meets
the criteria indicating I've gone back far enough for that particular answer in
the chain of messages.
In theory what I am trying to achieve is:
1. Find all answers to questions.
2. From those results, look at the previous message.
   2a. If the previous message has text and is not an answer itself, use that text and stop recursing.
   2b. Otherwise move on to the next previous message until the criteria are met.
3. Return rows containing the answer/response, with the question and other columns from the question row (id, timestamp for example).
This would leave me with lots of rows, each containing a message and a response.
In the following dataset, for example:
id | message                      | other | previous_id
1  | Hello would you like A or B? |       |
2  | B                            |       | 1
3  | Hello would you like A or B? |       |
4  | A                            |       | 3
5  | Hello would you like A or B? |       |
6  | B                            |       | 5
7  | A is a great answer. C or D? |       | 4
8  | D                            |       | 7
9  | Green or red?                |       |
10 |                              | image | 9
11 | Red                          |       | 10
I'd hope to end up with
id | message                      | response
1  | Hello would you like A or B? | B
3  | Hello would you like A or B? | A
5  | Hello would you like A or B? | B
7  | A is a great answer. C or D? | D
9  | Green or red?                | Red
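To make the walk-back concrete, here is a minimal Python sketch of the steps listed above, run against the small example dataset. It is only an illustration of the logic, not a replacement for doing it in SQL. The `from_user` flag is an assumption of mine to mark replies (the real data marks them with a null node), and stopping when another reply is reached is also an assumption, since the steps leave that case open.

```python
# Hypothetical in-memory copy of the example rows:
# id -> (message, previous_id, from_user).
# `from_user` is an assumed flag marking user replies.
MESSAGES = {
    1: ("Hello would you like A or B?", None, False),
    2: ("B", 1, True),
    3: ("Hello would you like A or B?", None, False),
    4: ("A", 3, True),
    5: ("Hello would you like A or B?", None, False),
    6: ("B", 5, True),
    7: ("A is a great answer. C or D?", 4, False),
    8: ("D", 7, True),
    9: ("Green or red?", None, False),
    10: (None, 9, False),  # image-only message that carries the options
    11: ("Red", 10, True),
}


def find_question(answer_id):
    """Walk back from an answer until a bot message with text is found."""
    _, prev_id, _ = MESSAGES[answer_id]
    while prev_id is not None:
        message, next_prev, from_user = MESSAGES[prev_id]
        if from_user:
            return None      # assumption: stop when another reply is reached
        if message is not None:
            return prev_id   # step 2a: has text and is not an answer
        prev_id = next_prev  # step 2b: keep walking back
    return None


def question_answer_rows():
    """Step 3: one (question_id, question_text, response) row per answer."""
    rows = []
    for msg_id, (message, _, from_user) in sorted(MESSAGES.items()):
        if not from_user:
            continue
        question_id = find_question(msg_id)
        if question_id is not None:
            rows.append((question_id, MESSAGES[question_id][0], message))
    return rows
```

Calling `question_answer_rows()` returns one tuple per reply, with the image-only message 10 skipped in favour of the text of message 9.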
I have made a (somewhat) simplified version of some sample data which is at the bottom of this question for reference/use.
It uses the following structure
WITH data ( id, message, node, options, previous, attachment) AS ()
Answers can be found with select where node is null so I assumed that is the
best starting point and I can work backwards towards the question. previous
and options are json columns because that's how they are in the real data so
I left them as they were.
I have tried various ways to get the data I want, but I haven't managed the recursion (the unknown number of levels).
For example, this attempt can dig two levels deep, but I couldn't coalesce the
id of the message I found, because obviously both joined rows have non-null ids.
select COALESCE(d2.message, d3.message) as question, d.message as answer
-- select COALESCE(d2.message, d2.attachment, d3.message, d3.attachment) as question, d.message as answer
from data as d
left join data as d2 on (d.previous->>'id')::int = d2.id
left join data as d3 on (d2.previous->>'id')::int = d3.id
where d.previous->>'node' in (
SELECT node from data where options is not null group by node
)
I believe this answer https://dba.stackexchange.com/a/215125/4660 may be the
path to what I need, but I've thus far been unable to get it to run as I'd like.
I think it would allow me to replace the two left joins in my example above
with, say, a recursive union, where I can use conditions in the ON clause to stop
it at the right point. Hopefully this sounds like it might be along the right
lines and someone can point me in the right direction. Something like the below,
perhaps?
WITH data (
id,
message,
node,
options,
previous,
attachment
) AS (
VALUES ...
), RecursiveTable as (
select * from data d where node is null -- all answers?
union all
select * from RecursiveTable where ??
)
select * from RecursiveTable
--
Basic sample dataset
WITH data (
id,
message,
node,
options,
previous,
attachment
) AS (
VALUES
-- QUESTION TYPE 1
-- pineapple questions
(1, 'Pineapple on pizza?', 'pineapple', '["Yes","No"]'::json, null::json, null),
(2, 'Pineapple on pizza?', 'pineapple', '["Yes","No"]'::json, null::json, null),
(3, 'Pineapple on pizza?', 'pineapple', '["Yes","No"]'::json, null::json, null),
(4, 'Pineapple on pizza?', 'pineapple', '["Yes","No"]'::json, null::json, null),
(5, 'Pineapple on pizza?', 'pineapple', '["Yes","No"]'::json, null::json, null),
-- pineapple answers
(6, 'No', null, null, '{"id": 1, "node": "pineapple"}'::json, null),
(7, 'Yes', null, null, '{"id": 2, "node": "pineapple"}'::json, null),
(8, 'No', null, null, '{"id": 3, "node": "pineapple"}'::json, null),
(9, 'Yes', null, null, '{"id": 4, "node": "pineapple"}'::json, null),
(10, 'No', null, null, '{"id": 5, "node": "pineapple"}'::json, null),
-- ----------------------------
-- QUESTION TYPE 2 - Previous message, then question with text + options followed by answer
--- previous messages to stuffed crust questions (we don't care about
--these but they're here to ensure we aren't accidentally getting them
--as the question in results)
(11, 'Hello', 'hello_pre_stuffed_crust', null, null::json, null),
(12, 'Hello', 'hello_pre_stuffed_crust', null, null::json, null),
(13, 'Hello', 'hello_pre_stuffed_crust', null, null::json, null),
-- stuffed crust questions
(14, 'Stuffed crust?', 'stuffed_crust', '["Crunchy crust","More cheese!"]'::json, '{"id": 11, "node": "hello_pre_stuffed_crust"}'::json, null),
(15, 'Stuffed crust?', 'stuffed_crust', '["Crunchy crust","More cheese!"]'::json, '{"id": 12, "node": "hello_pre_stuffed_crust"}'::json, null),
(16, 'Stuffed crust?', 'stuffed_crust', '["Crunchy crust","More cheese!"]'::json, '{"id": 13, "node": "hello_pre_stuffed_crust"}'::json, null),
-- stuffed crust answers
(17, 'More cheese!', null, null, '{"id": 14, "node": "stuffed_crust"}'::json, null),
(18, 'Crunchy crust', null, null, '{"id": 15, "node": "stuffed_crust"}'::json, null),
(19, 'Crunchy crust', null, null, '{"id": 16, "node": "stuffed_crust"}'::json, null),
-- ----------------------------
-- QUESTION TYPE 3
-- two part question, no text with options only image, should get text from previous
-- part 1
(20, 'What do you think of this pizza?', 'check_this_image', null, null::json, null),
(21, 'What do you think of this pizza?', 'check_this_image', null, null::json, null),
(22, 'What do you think of this pizza?', 'check_this_image', null, null::json, null),
-- part two
(23, null, 'image', '["Looks amazing!","Not my cup of tea"]'::json, '{"id": 20, "node": "check_this_image"}'::json, 'https://images.unsplash.com/photo-1544982503-9f984c14501a'),
(24, null, 'image', '["Looks amazing!","Not my cup of tea"]'::json, '{"id": 21, "node": "check_this_image"}'::json, 'https://images.unsplash.com/photo-1544982503-9f984c14501a'),
(25, null, 'image', '["Looks amazing!","Not my cup of tea"]'::json, '{"id": 22, "node": "check_this_image"}'::json, 'https://images.unsplash.com/photo-1544982503-9f984c14501a'),
-- two part answers
(26, 'Looks amazing!', null, null, '{"id": 23, "node": "image"}'::json, null),
(27, 'Not my cup of tea', null, null, '{"id": 24, "node": "image"}'::json, null),
(28, 'Looks amazing!', null, null, '{"id": 25, "node": "image"}'::json, null),
-- ----------------------------
-- QUESTION TYPE 4
-- no text, just options straight after responding to something else - options for text value would be options, or image
-- directly after question 3 was answered, previous message was user message - but we don't have text here - just an image and options
(29, null, 'which_brand', '["Dominos","Papa Johns"]'::json, '{"id": 27}'::json, 'https://peakstudentmediadotcom.files.wordpress.com/2018/11/vs.jpg'),
(30, null, 'which_brand', '["Dominos","Papa Johns"]'::json, '{"id": 28}'::json, 'https://peakstudentmediadotcom.files.wordpress.com/2018/11/vs.jpg'),
(31, null, 'which_brand', '["Dominos","Papa Johns"]'::json, '{"id": 29}'::json, 'https://peakstudentmediadotcom.files.wordpress.com/2018/11/vs.jpg')
)
SELECT * from data
You can use WITH RECURSIVE to achieve your goal. You just need to specify when to stop the recursion, and find a way to select only those records for which the recursion did not produce any further rows.
Have a look here:
WITH RECURSIVE comp (
id, message, node, options, previous, attachment,
id2, message2, node2, options2, previous2, attachment2,
rec_depth
) AS (
SELECT
t.id, t.message, t.node, t.options, t.previous, t.attachment,
null::integer AS id2, null::text AS message2, null::text AS node2, null::json AS options2, null::json AS previous2, null::text AS attachment2,
0
FROM data t
WHERE t.node IS NULL
UNION ALL
SELECT
c.id, c.message, c.node, c.options, c.previous, c.attachment,
prev.id, prev.message, prev.node, prev.options, prev.previous, prev.attachment,
c.rec_depth + 1
FROM comp c
INNER JOIN data prev ON prev.id = ((COALESCE(c.previous2, c.previous))->>'id')::int
WHERE prev.node IS NOT NULL -- do not reach back to the next answer
AND c.message2 IS NULL -- do not reach back beyond a message with text (the question text)
), data (id, message, node, options, previous, attachment) AS (
VALUES [...]
) SELECT
c.id2 AS question_id, c.id AS answer_id
FROM comp c
WHERE
NOT EXISTS(
SELECT 1
FROM comp c2
WHERE c2.id = c.id
AND c2.rec_depth > c.rec_depth
)
Before the recursion, comp holds only the "answers" (this is the part above UNION ALL). In the first recursion step, they are joined with their predecessors. In the second step, another new record is created per answer-predecessor pair, in which the predecessor is replaced by its own predecessor. This continues until the "base condition" is reached (the joined partner is a record with a message, i.e. the question text, or the next partner is a record without a node, i.e. an answer), which means until no new records are created.
As we also compute the recursion depth (rec_depth) of each row, we can finally keep, per answer, only the record with the maximal recursion depth.
The second WITH query (data) can and should of course be removed, and you should reference your real table in the WITH RECURSIVE part.
I chose to select only the ids of the answer and the corresponding question, but the WITH RECURSIVE is already built so that you can use all of the columns.
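If you want to watch the recursion run without a PostgreSQL instance, the following is a rough sketch that replays the same idea in SQLite through Python's sqlite3 module. It makes two simplifications that are my assumptions, not part of the original data: the json `previous` column is flattened to a plain integer `previous_id`, and only one row per question type (types 1 to 3) is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (id INTEGER PRIMARY KEY, message TEXT, node TEXT, previous_id INTEGER);
INSERT INTO data VALUES
  (1,  'Pineapple on pizza?', 'pineapple', NULL),
  (6,  'No', NULL, 1),
  (11, 'Hello', 'hello_pre_stuffed_crust', NULL),
  (14, 'Stuffed crust?', 'stuffed_crust', 11),
  (17, 'More cheese!', NULL, 14),
  (20, 'What do you think of this pizza?', 'check_this_image', NULL),
  (23, NULL, 'image', 20),
  (26, 'Looks amazing!', NULL, 23);
""")

QUERY = """
WITH RECURSIVE comp (id, previous_id, id2, message2, previous2, rec_depth) AS (
    -- anchor: every answer (node IS NULL), nothing joined yet
    SELECT t.id, t.previous_id, NULL, NULL, NULL, 0
    FROM data t
    WHERE t.node IS NULL
    UNION ALL
    -- step: replace the joined predecessor with its own predecessor
    SELECT c.id, c.previous_id, prev.id, prev.message, prev.previous_id,
           c.rec_depth + 1
    FROM comp c
    JOIN data prev ON prev.id = COALESCE(c.previous2, c.previous_id)
    WHERE prev.node IS NOT NULL   -- do not reach back to the next answer
      AND c.message2 IS NULL      -- stop once a message with text was found
)
SELECT c.id2 AS question_id, c.id AS answer_id
FROM comp c
WHERE NOT EXISTS (
    SELECT 1 FROM comp c2
    WHERE c2.id = c.id AND c2.rec_depth > c.rec_depth
)
ORDER BY c.id
"""

# One (question_id, answer_id) pair per answer, deepest recursion step only.
pairs = conn.execute(QUERY).fetchall()
```

For answer 26 the recursion takes two steps: it first joins the image-only row 23, finds no text there, and then moves on to row 20, which carries the question text.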
Further reading in the docs:
https://www.postgresql.org/docs/13/sql-select.html#SQL-WITH
https://www.postgresql.org/docs/13/queries-with.html
I'm using SQL Server 2014 and I have this query, which needs to be rebuilt to be more efficient at what it is trying to accomplish.
As an example, I created this schema and added data to it so we could replicate the problem. You can try it at rextester (http://rextester.com/AIYG36293)
create table Dogs
(
Name nvarchar(20),
Owner_ID int,
Shelter_ID int
);
insert into Dogs values
('alpha', 1, 1),
('beta', 2, 1),
('charlie', 3, 1),
('beta', 1, 2),
('alpha', 2, 2),
('charlie', 3, 2),
('charlie', 1, 3),
('beta', 2, 3),
('alpha', 3, 3);
I want to find out which Shelter has this set of owner and dog name combinations, and the match must be exact. This is the query I'm using right now (it is more or less what Entity Framework generated, with some slight changes to make it simpler):
SELECT DISTINCT
Shelter_ID
FROM Dogs AS [Extent1]
WHERE ( EXISTS (SELECT
1 AS [C1]
FROM [Dogs] AS [Extent2]
WHERE [Extent1].[Shelter_ID] = [Extent2].[Shelter_ID] AND [Extent2].[Name] = 'charlie' AND [Extent2].[Owner_ID] = 1
)) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Dogs] AS [Extent3]
WHERE [Extent1].[Shelter_ID] = [Extent3].[Shelter_ID] AND [Extent3].[Name] = 'beta' AND [Extent3].[Owner_ID] = 2
)) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Dogs] AS [Extent4]
WHERE [Extent1].[Shelter_ID] = [Extent4].[Shelter_ID] AND [Extent4].[Name] = 'alpha' AND [Extent4].[Owner_ID] = 3
))
This query gets what I need, but I want to know if there is a simpler way of writing it. In my actual use case I have more than just 3 combinations to worry about; it could go up to 1000 or more, which would mean 1000 subqueries. When I try querying with that many, I get an error saying:
The query processor ran out of internal resources and could not
produce a query plan. This is a rare event and only expected for
extremely complex queries or queries that reference a very large
number of tables or partitions.
NOTE
One solution I tried was using a PIVOT to flatten the data. The query becomes simpler, since it is then just a WHERE clause with a number of AND conditions, but at a higher number of combinations I exceed the maximum allowable row size and get this error when creating my temporary table to store the flattened data:
Cannot create a row of size 10514 which is greater than the allowable
maximum row size of 8060.
I appreciate any help or thoughts on this matter.
Thanks!
Count them.
WITH dogSet AS (
SELECT *
FROM (
VALUES ('charlie',1),('beta',2),('alpha',3)
) ts(Name,Owner_ID)
)
SELECT Shelter_ID
FROM Dogs AS [Extent1]
JOIN dogSet ts ON ts.Name= [Extent1].name and ts.Owner_ID = [Extent1].Owner_ID
GROUP BY Shelter_ID
HAVING count(*) = (SELECT count(*) n FROM dogSet)
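The counting trick works because a shelter qualifies only if it matches every wanted pair, so its number of matching rows must equal the size of the wanted set. Below is a small sketch that runs the same query shape in SQLite via Python, purely to check the logic on the sample data. Loading the pairs into a table (here a temp table named dogSet) instead of a literal VALUES list is my assumption about how you would scale this to 1000 or more combinations. It also assumes a shelter never contains the same (Name, Owner_ID) pair twice, since duplicates would inflate the count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dogs (Name TEXT, Owner_ID INTEGER, Shelter_ID INTEGER);
INSERT INTO Dogs VALUES
  ('alpha', 1, 1), ('beta', 2, 1), ('charlie', 3, 1),
  ('beta', 1, 2), ('alpha', 2, 2), ('charlie', 3, 2),
  ('charlie', 1, 3), ('beta', 2, 3), ('alpha', 3, 3);
""")

# The wanted combinations, passed as parameters rather than as literals.
wanted = [("charlie", 1), ("beta", 2), ("alpha", 3)]
conn.execute("CREATE TEMP TABLE dogSet (Name TEXT, Owner_ID INTEGER)")
conn.executemany("INSERT INTO dogSet VALUES (?, ?)", wanted)

# Keep only shelters whose matching rows cover the whole wanted set.
shelters = [row[0] for row in conn.execute("""
    SELECT d.Shelter_ID
    FROM Dogs AS d
    JOIN dogSet AS ts ON ts.Name = d.Name AND ts.Owner_ID = d.Owner_ID
    GROUP BY d.Shelter_ID
    HAVING COUNT(*) = (SELECT COUNT(*) FROM dogSet)
""")]
```

On the sample data shelter 1 matches only one of the three pairs and shelter 2 matches none, so only shelter 3 survives the HAVING filter.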
I am running PostgreSQL 9.1.9 x64 with PostGIS 2.0.3 under Windows Server 2008 R2.
I have a table:
CREATE TABLE field_data.trench_samples (
pgid SERIAL NOT NULL,
trench_id TEXT,
sample_id TEXT,
from_m INTEGER
);
With some data in it:
INSERT INTO field_data.trench_samples (
trench_id, sample_id, from_m
)
VALUES
('TR01', '1000001', 0),
('TR01', '1000002', 5),
('TR01', '1000003', 10),
('TR01', '1000004', 15),
('TR02', '1000005', 0),
('TR02', '1000006', 3),
('TR02', '1000007', 9),
('TR02', '1000008', 14);
Now, what I am interested in is finding the difference (distance in metres in this example) between a record's "from_m" and the "next" "from_m" for that trench_id.
So, based on the data above, I'd like to end up with a query that produces the following table:
pgid, trench_id, sample_id, from_m, to_m, interval
1, 'TR01', '1000001', 0, 5, 5
2, 'TR01', '1000002', 5, 10, 5
3, 'TR01', '1000003', 10, 15, 5
4, 'TR01', '1000004', 15, 20, 5
5, 'TR02', '1000005', 0, 3, 3
6, 'TR02', '1000006', 3, 9, 6
7, 'TR02', '1000007', 9, 14, 5
8, 'TR02', '1000008', 14, 19, 5
Now, you are likely saying "wait, how do we infer an interval length for the last sample in each line, since there is no "next" from_m to compare to?"
For the "ends" of lines (sample_id 1000004 and 1000008) I would like to reuse the interval length between the previous two samples.
I have no idea how to tackle this in my current environment. Your help is very much appreciated.
Here is how you get the difference, reusing the previous interval for the last sample in each trench (as shown in the data, but not explained clearly in the text).
The logic here is repeated application of lead() and lag(). First apply lead() to calculate the interval. Then apply lag() to fill in the interval at the boundary, using the previous interval.
The rest is basically just arithmetic:
select trench_id, sample_id, from_m,
       -- last sample in a trench: no next from_m, so add the previous interval
       coalesce(to_m,
                from_m + lag(interval) over (partition by trench_id order by sample_id)
               ) as to_m,
       coalesce(interval,
                lag(interval) over (partition by trench_id order by sample_id)
               ) as interval
from (select t.*,
             -- next from_m within the same trench (null for the last sample)
             lead(from_m) over (partition by trench_id order by sample_id) as to_m,
             (lead(from_m) over (partition by trench_id order by sample_id) -
              from_m
             ) as interval
      from field_data.trench_samples t
     ) t
Here is the SQLFiddle showing it working.
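If the two window passes are hard to picture, the same lead-then-lag logic can be written out in plain Python. This is only a sketch to show what each pass contributes; the function name `with_intervals` is mine, and it orders by sample_id within each trench just as the query does.

```python
from itertools import groupby

# Sample rows from the question: (pgid, trench_id, sample_id, from_m)
SAMPLES = [
    (1, "TR01", "1000001", 0),
    (2, "TR01", "1000002", 5),
    (3, "TR01", "1000003", 10),
    (4, "TR01", "1000004", 15),
    (5, "TR02", "1000005", 0),
    (6, "TR02", "1000006", 3),
    (7, "TR02", "1000007", 9),
    (8, "TR02", "1000008", 14),
]


def with_intervals(samples):
    """Return (pgid, trench_id, sample_id, from_m, to_m, interval) rows."""
    out = []
    ordered = sorted(samples, key=lambda r: (r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):  # partition by trench_id
        rows = list(group)
        prev_interval = None
        for i, (pgid, trench, sample, from_m) in enumerate(rows):
            if i + 1 < len(rows):
                interval = rows[i + 1][3] - from_m  # lead(from_m) - from_m
            else:
                interval = prev_interval            # lag(interval) at the boundary
            to_m = None if interval is None else from_m + interval
            out.append((pgid, trench, sample, from_m, to_m, interval))
            prev_interval = interval
    return out
```

A trench with a single sample has neither a next nor a previous row, so both to_m and interval stay None for it, which mirrors the NULLs the SQL would return.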
I need multiple counts from a single table, based on different combinations of conditions.
The table has 2 flags: A & B.
I want counts for the following criteria on the same page:
A is true (Don't care about B)
A is false (Don't care about B)
A is true AND B is true
A is false AND B is true
A is true AND B is false
A is false AND B is false
B is true (Don't care about A)
B is false (Don't care about A)
I want all of the above counts on the same page. Which of the following would be a good approach for this?
1. Query for a count on that table for each condition. [That is firing 8 queries every time the user gives the command.]
2. Query for the list of data from the database and then count the values for the appropriate conditions in the UI.
Which option should I choose? Do you know of any other alternative?
Your table essentially looks like this (The ID column is redundant, but I expect you have other data in your actual table anyway.):
CREATE TABLE `stuff` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`a` TINYINT(3) UNSIGNED NOT NULL DEFAULT '0',
`b` TINYINT(3) UNSIGNED NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
)
Some sample data:
INSERT INTO `stuff` (`id`, `a`, `b`) VALUES (1, 0, 0);
INSERT INTO `stuff` (`id`, `a`, `b`) VALUES (2, 0, 1);
INSERT INTO `stuff` (`id`, `a`, `b`) VALUES (3, 1, 0);
INSERT INTO `stuff` (`id`, `a`, `b`) VALUES (4, 1, 1);
This query (in MySQL; I'm not sure about other DBMSs) should produce the results you want.
select
count(if (a = 1, 1, NULL)) as one,
count(if (a = 0, 1, NULL)) as two,
count(if (a = 1 && b = 1, 1, NULL)) as three,
count(if (a = 0 && b = 1, 1, NULL)) as four,
count(if (a = 1 && b = 0, 1, NULL)) as five,
count(if (a = 0 && b = 0, 1, NULL)) as six,
count(if (b = 1, 1, NULL)) as seven,
count(if (b = 0, 1, NULL)) as eight
from stuff
group by null
With the simple sample data above, the query generates:
one, two, three, four, five, six, seven, eight
2 , 2 , 1, 1, 1, 1, 2, 2
Notes:
group by null
This just causes every row to be in the same group.
count(...)
This function counts all the non-NULL values in the group, which is why we use if(...) to return NULL when the condition is not met.
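For DBMSs without MySQL's if(), the same single-pass counting can be written with a standard CASE expression, since a CASE without ELSE yields NULL and COUNT skips NULLs. The sketch below runs that variant in SQLite through Python on the four sample rows. Using SQLite here is just my choice for a self-contained check, not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stuff (id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL);
INSERT INTO stuff (id, a, b) VALUES (1, 0, 0), (2, 0, 1), (3, 1, 0), (4, 1, 1);
""")

# CASE without ELSE returns NULL when the condition fails; COUNT ignores NULLs.
counts = conn.execute("""
SELECT
  COUNT(CASE WHEN a = 1 THEN 1 END)           AS one,
  COUNT(CASE WHEN a = 0 THEN 1 END)           AS two,
  COUNT(CASE WHEN a = 1 AND b = 1 THEN 1 END) AS three,
  COUNT(CASE WHEN a = 0 AND b = 1 THEN 1 END) AS four,
  COUNT(CASE WHEN a = 1 AND b = 0 THEN 1 END) AS five,
  COUNT(CASE WHEN a = 0 AND b = 0 THEN 1 END) AS six,
  COUNT(CASE WHEN b = 1 THEN 1 END)           AS seven,
  COUNT(CASE WHEN b = 0 THEN 1 END)           AS eight
FROM stuff
""").fetchone()
```

Because every column is an aggregate, the query returns a single row even without a GROUP BY clause.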
Create a query that already does the counting. At least with SQL this is not hard.
In my opinion the 2nd option is better, as you are querying only once. Firing 8 queries at the DB might later impact performance.
Databases are designed to give you the data you want. In almost all cases, asking for what you want is quicker than asking for everything and calculating or filtering it yourself. I'd say you should blindly go for option 1 (ask for what you need) and, only if it really does not work, consider option 2 (or something else).
If every flag is either true or false (no null values), you don't need 8 queries; 4 would be enough:
Get the total
A true (don't care about B)
B true (don't care about A)
A and B true
'A true and B false' is the second minus the fourth: (A true) - (A and B true). And 'A and B false' = total - A true - B true + A and B true. Look up the inclusion-exclusion principle for more information.
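Spelled out, the arithmetic for all eight counts looks roughly like the Python below. The function and key names are mine, and it assumes, as stated above, that neither flag is ever null.

```python
def all_counts(total, a_true, b_true, both_true):
    """Derive the eight requested counts from four queried ones."""
    return {
        "a_true": a_true,
        "a_false": total - a_true,
        "a_true_b_true": both_true,
        "a_false_b_true": b_true - both_true,
        "a_true_b_false": a_true - both_true,
        # inclusion-exclusion: rows where neither flag is set
        "a_false_b_false": total - a_true - b_true + both_true,
        "b_true": b_true,
        "b_false": total - b_true,
    }
```

For a table with one row per flag combination, `all_counts(4, 2, 2, 1)` gives 2 for each single-flag count and 1 for each combined count.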
MySQL provides a string function named FIELD() which accepts a variable number of arguments. The return value is the location of the first argument in the list of the remaining ones. In other words:
FIELD('d', 'a', 'b', 'c', 'd', 'e', 'f')
would return 4 since 'd' is the fourth argument following the first.
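For readers who have not used MySQL, the behaviour can be approximated in a few lines of Python. This is only a rough model with a helper name of my own; it returns 0 when the value is not in the list, as FIELD() does, but it does not model MySQL's collation-dependent string comparison.

```python
def field(needle, *candidates):
    """Return the 1-based position of needle among candidates, or 0 if absent."""
    for position, candidate in enumerate(candidates, start=1):
        if candidate == needle:
            return position
    return 0
```

Sorting rows by `field(status, 'rejected', 'active', 'submitted', 'approved')` would then put any status missing from the list first, because it maps to 0.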
This function provides the capability to sort a query's results based on a very specific ordering. For my current application there are four statuses that I need to manage: active, approved, rejected, and submitted. However, if I simply order by the status column, I feel the usability of the resulting list is lessened, since rejected and active items are more important than submitted and approved ones.
In MySQL I could do this:
SELECT <stuff> FROM <table> WHERE <conditions> ORDER BY FIELD(status, 'rejected', 'active','submitted', 'approved')
and the results would be ordered such that rejected items were first, followed by active ones, and so on. Thus, the results were ordered in decreasing levels of importance to the visitor.
I could create a separate table which enumerates this importance level for the statuses and then order the query by that in descending order. But this has come up for me a few times since switching to MS SQL Server, so I thought I'd ask whether I could avoid the extra table and the somewhat more complex queries by using a built-in function similar to MySQL's FIELD().
Thank you,
David Kees
Use a CASE expression (SQL Server 2005+):
ORDER BY CASE status
WHEN 'active' THEN 1
WHEN 'approved' THEN 2
WHEN 'rejected' THEN 3
WHEN 'submitted' THEN 4
ELSE 5
END
You can use this syntax for more complex evaluation (including combinations, or if you need to use LIKE)
ORDER BY CASE
WHEN status LIKE 'active' THEN 1
WHEN status LIKE 'approved' THEN 2
WHEN status LIKE 'rejected' THEN 3
WHEN status LIKE 'submitted' THEN 4
ELSE 5
END
For your particular example you could use:
ORDER BY CHARINDEX(
',' + status + ',',
',rejected,active,submitted,approved,'
)
Note that FIELD is supposed to return 0, 1, 2, 3, 4, whereas the above will return 0, 1, 10, 17 and 27, so this trick is only useful inside the ORDER BY clause.
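Those positions are easy to check outside the database. The Python sketch below mimics CHARINDEX with str.find (which is 0-based and returns -1 on a miss, hence the + 1); the helper name is mine, and SQL Server's collation rules are not modelled.

```python
ORDER_LIST = ",rejected,active,submitted,approved,"


def charindex_rank(status):
    """1-based position of ',status,' in ORDER_LIST, or 0 if it is not present."""
    return ORDER_LIST.find("," + status + ",") + 1
```

The values only increase with list position; they are not consecutive, and a status missing from the list gets 0 and so sorts first.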
A set-based approach would be to outer join with a table value constructor:
LEFT JOIN (VALUES
('rejected', 1),
('active', 2),
('submitted', 3),
('approved', 4)
) AS lu(status, sort_order)
...
ORDER BY lu.sort_order
I recommend a CTE (SQL Server 2005+).
No need to repeat the status codes or create the separate table.
WITH cte(status, RN) AS ( -- CTE to create ordered list and define where clause
SELECT 'active', 1
UNION SELECT 'approved', 2
UNION SELECT 'rejected', 3
UNION SELECT 'submitted', 4
)
SELECT <field1>, <field2>
FROM <table> tbl
INNER JOIN cte ON cte.status = tbl.status -- do the join
ORDER BY cte.RN -- use the ordering defined in the cte
Good luck,
Jason
ORDER BY CHARINDEX(','+convert(varchar,status)+',' ,
',rejected,active,submitted,approved,')
Just put a comma before and after the list string in which you are searching (the second parameter of CHARINDEX).
The first parameter of CHARINDEX is surrounded by commas as well, so only whole status values match.