MS SQL Server: Logic to find groups of all interdependent records from a table with source and destination items - sql

I am using the MS SQL Server DB. I have a specific need to find groups of interdependent items. Visualize a scenario where each row holds two items: one is the source item and the other is the destination item. Any item can be the source of any other item, and likewise for destinations. The table has two columns, 'Source' and 'Destination'. Let's consider these 10 rows:
Source | Destination
A | B
B | C
C | D
E | A
D | E
X | Y
Y | Z
Z | X
P | Q
R | S
My requirement is to get the distinct groups of interdependent items. That is, my query should return the result below, with 4 rows (each group's items in comma-separated form):
RowNum| Result
1| A,B,C,D,E
2| X,Y,Z
3| P,Q
4| R,S
Here, the hierarchy can go up to n levels deep. In my example, the first group has 5 items (A to B, B to C, C to D, D to E and E to A, meaning 5 different items are involved in this group), but the data may have more items in a single group. Also, cyclical records are possible (X to Y, Y to Z and Z to X).
I can achieve this using nested WHILE loops, but as we have thousands of records, the nested-loop script takes too long to execute: the outer loop iterates over each record of the table, and an inner loop compares that record against all other records.
Can anybody suggest a better way or algorithm to achieve this?
Any help on this would be appreciated.
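One answer-style sketch (not from the original post): this is the classic connected-components problem, and it can be done set-based instead of with nested row-by-row loops, by giving every item a group label and repeatedly lowering each item's label to the smallest label among its neighbours until nothing changes. The table name dbo.ItemLinks and the temp table #Labels below are assumptions for illustration; STRING_AGG requires SQL Server 2017 or later.
-- Collect every distinct item and start it in its own group.
IF OBJECT_ID('tempdb..#Labels') IS NOT NULL DROP TABLE #Labels;
SELECT Item, Item AS GroupLabel
INTO #Labels
FROM (SELECT Source AS Item FROM dbo.ItemLinks
      UNION
      SELECT Destination FROM dbo.ItemLinks) AS AllItems;

-- Treat edges as undirected and pull each item's label down to the
-- smallest label among its neighbours; stop when a pass changes nothing.
WHILE 1 = 1
BEGIN
    UPDATE lab
    SET GroupLabel = agg.MinLabel
    FROM #Labels AS lab
    JOIN (SELECT e.a AS Item, MIN(nb.GroupLabel) AS MinLabel
          FROM (SELECT Source AS a, Destination AS b FROM dbo.ItemLinks
                UNION ALL
                SELECT Destination, Source FROM dbo.ItemLinks) AS e
          JOIN #Labels AS nb ON nb.Item = e.b
          GROUP BY e.a) AS agg
      ON agg.Item = lab.Item
    WHERE agg.MinLabel < lab.GroupLabel;

    IF @@ROWCOUNT = 0 BREAK;
END;

-- One row per group, items comma-separated.
SELECT ROW_NUMBER() OVER (ORDER BY GroupLabel) AS RowNum,
       STRING_AGG(Item, ',') WITHIN GROUP (ORDER BY Item) AS Result
FROM #Labels
GROUP BY GroupLabel;
The number of passes is bounded by the longest chain within a group rather than by the number of rows, so it avoids the record-by-record comparisons of the nested WHILE loops.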

Related

PostgreSQL data transformation - Turn rows into columns

I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique, there are no keys, nothing. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to take into account the case where a certain key has no attribute named attr_name1 and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables etc, but there are quite a lot of rows in the table and I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This presumes that (k, p) is unique. If there are duplicates, it will silently pick the single v with the highest value for each (k, p).
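As a side note (not part of the original answer), PostgreSQL 9.4 and later also support the FILTER clause, which expresses the same conditional aggregation a little more directly; this assumes the same hypothetical your_table:
SELECT
k,
MAX(v) FILTER (WHERE p = 'a') AS a,
MAX(v) FILTER (WHERE p = 'b') AS b,
MAX(v) FILTER (WHERE p = 'c') AS c
FROM
your_table
GROUP BY
k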
As a general rule this kind of pivoting makes the data harder to process in SQL. This is often done for display purposes because humans find this easier to read. However, from a software engineering perspective, such formatting should not be done in the data layer; be careful that by doing this you don't actually make your future life harder.

Pulling previous cell value using conditional lag function

I am trying to condense a data table which has separate rows for a particular ID: one row has an intent string and the following rows have one or more log strings. There can be more than one set of intents/logs for each ID. I want to pull the intent string down into a separate column so it is listed on the same row(s) as the associated log strings.
I've "tried" LAG(tobi_intent, 1,0) OVER (ORDER BY datevalue) as AssociatedIntent
but firstly, this isn't valid code, and secondly, wouldn't ensure that the associated intent and logs are for the same ID.
Can anyone advise on the correct sql code to get the output below?
expected table output:
ID | log | intent | associated_intent
1  |     | x      |
1  | b   |        | x
1  | a   |        | x
1  |     | u      |
1  | f   |        | u
2  |     | x      |
2  | f   |        | x
5  |     | e      |
5  | a   |        | e
5  | s   |        | e
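One common pattern for this (a rough sketch only, not from the original post) is to count the intent rows so that each intent starts a new group within its ID, then spread that intent across the group with a windowed MAX. The table name logs is an assumption; ID, datevalue, log and intent follow the question:
SELECT ID, log, intent,
       MAX(intent) OVER (PARTITION BY ID, grp) AS associated_intent
FROM (
    SELECT *,
           COUNT(CASE WHEN intent IS NOT NULL THEN 1 END)
               OVER (PARTITION BY ID ORDER BY datevalue
                     ROWS UNBOUNDED PRECEDING) AS grp
    FROM logs
) AS t
ORDER BY ID, datevalue;
Note that the intent rows themselves get their own intent echoed back in associated_intent; filter those rows out if the output should leave them blank as in the table above.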

Working of Merge in SAS (with IN=)

I have two dataset data1 and data2
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them with the code below:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result: (I was expecting an inner join result, which is not the case)
1 a 10 x
2 a 20 y
3 a 30 z
3 a 40 q
Result from proc sql inner join.
proc sql;
select data1.id, sn, sales, x from data2 inner join data1 on data1.id = data2.id;
quit;
Result: (As expected from an inner join)
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 q
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 q
a 3 10 x
a 3 20 y
a 3 30 z
a 3 40 q
I want to know the concept and the step-by-step working of the MERGE statement in SAS with IN=, and how it produces the above result.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge'
will occur, using if statements. For example, if
ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS
only include rows that match on the by variables from both input data
sets (like an inner join).
which, I guess, is not always the case; the result is not always like an inner join.
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
The SAS data step, however, processes merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS data step doesn't do normally.
What it does, for each unique BY value:
- take the next record from the left side dataset, if one exists with that BY value
- take the next record from the right side dataset, if one exists with that BY value
- output a row
- continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4

How do I appropriately use a wildcard to select columns and build a new field in Access 2010?

This post is related in several aspects to the following:
Selecting all columns that start with XXX using a wildcard?
I am currently using Access 2010. I would like to add new columns to my table, based off values of the other columns.
Current table (Table #1):
Row | PlaceID | FoodItem1_10 | FoodItem1_02 | FoodItem2_10 | FoodItem2_02
001 Park Y N Y N
002 Library Y N Y N
003 Museum Y N Y N
Where:
Item1_10....ItemN_10 is a field where a value of 'Y' (for Yes) is assigned if, at a particular location, they sell that food item only 10 months of the year. Otherwise, the value is 'N' for No.
Item1_02....ItemN_02 is a field where a value of 'Y' is assigned if, at a particular location, they sell that food item only 02 months of the year. Otherwise, the value is 'N' for No.
I want to add columns to Table #1, and have it look as follows:
Desired new table (Table #2):
Row | PlaceID | FoodItem1_10 | FoodItem1_02 | FoodItem2_10 | FoodItem2_02 | AnyItems_10months | AnyItems_02months
001 Park Y N Y N Y N
002 Library Y Y Y N Y Y
003 Museum Y N Y N Y N
Where:
AnyItems_10months is a field that captures whether or not a particular place sells any items for a 10 month period. This field takes the values 'Y' for when, in any column, the particular place has a value of 'Y' for columns Item1_10 ..... ItemN_10.
AnyItems_02months is a field that captures whether or not a particular place sells any items for a 02 month period. This field takes the values 'Y' for when, in any column, the particular place has a value of 'Y' for columns Item1_02 ..... ItemN_02.
What I have been trying:
Since my columns follow a particular naming pattern, I thought it would be best to use a wildcard to generate my two new columns as such:
Obstacle: Access does not accept my expression.
Why don't you just hard-code it into a query? You're not going to be able to make a field like that in a table without reading the .Fields property of the table. It would get really messy. If you're always going to do it the same way, doing it in a query is going to be the easiest way.
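For the two item columns shown in the question, the hard-coded query could look roughly like the sketch below (the table name Table1 is a placeholder, and the IIf() arguments would need to be extended for however many FoodItemN columns actually exist):
SELECT t.*,
IIf(t.FoodItem1_10 = 'Y' OR t.FoodItem2_10 = 'Y', 'Y', 'N') AS AnyItems_10months,
IIf(t.FoodItem1_02 = 'Y' OR t.FoodItem2_02 = 'Y', 'Y', 'N') AS AnyItems_02months
FROM Table1 AS t;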

SQL - postgres - shortest path in graph - recursion

I have a table which contains the edges from node x to node y in a graph.
n1 | n2
-------
a | a
a | b
a | c
b | b
b | d
b | c
d | e
I would like to create a (materialized) view that gives the smallest number of hops needed to reach node y from node x:
n1 | n2 | c
-----------
a | a | 0
a | b | 1
a | c | 1
a | d | 2
a | e | 3
b | b | 0
b | d | 1
b | c | 1
b | e | 2
d | e | 1
How should I model my tables and views to facilitate this? I guess I need some kind of recursion, but I believe that is pretty difficult to accomplish in SQL. I would like to avoid, for example, clients having to fire 10 queries if the path happens to contain 10 nodes/hops.
This works for me, but it's kinda ugly:
WITH RECURSIVE paths (n1, n2, distance) AS (
SELECT
nodes.n1,
nodes.n2,
1
FROM
nodes
WHERE
nodes.n1 <> nodes.n2
UNION ALL
SELECT
paths.n1,
nodes.n2,
paths.distance + 1
FROM
paths
JOIN nodes
ON
paths.n2 = nodes.n1
WHERE
nodes.n1 <> nodes.n2
)
SELECT
paths.n1,
paths.n2,
min(distance)
FROM
paths
GROUP BY
1, 2
UNION
SELECT
nodes.n1,
nodes.n2,
0
FROM
nodes
WHERE
nodes.n1 = nodes.n2
Also, I am not sure how good it will perform against larger datasets. As suggested by Mark Mann, you may want to use a graph library instead, e.g. pygraph.
EDIT: here's a sample with pygraph
from pygraph.algorithms.minmax import shortest_path
from pygraph.classes.digraph import digraph
g = digraph()
g.add_node('a')
g.add_node('b')
g.add_node('c')
g.add_node('d')
g.add_node('e')
g.add_edge(('a', 'a'))
g.add_edge(('a', 'b'))
g.add_edge(('a', 'c'))
g.add_edge(('b', 'b'))
g.add_edge(('b', 'd'))
g.add_edge(('b', 'c'))
g.add_edge(('d', 'e'))
for source in g.nodes():
    tree, distances = shortest_path(g, source)
    for target, distance in distances.iteritems():
        if distance == 0 and not g.has_edge((source, target)):
            continue
        print source, target, distance
Excluding the graph building time, this takes 0.3ms while the SQL version takes 0.5ms.
Expanding on Mark's answer, there are some very reasonable approaches to explore a graph in SQL as well. In fact, they can be faster than dedicated libraries in Perl or Python, in that DB indexes will spare you the need to explore the graph.
The most efficient index (if the graph is not constantly changing) is a nested-tree variation called the GRIPP index. (The linked paper mentions other approaches.)
If your graph is constantly changing, you might want to adapt the nested intervals approach to graphs, in a similar manner that GRIPP extends nested sets, or to simply use floats instead of integers (don't forget to normalize them by casting to numeric and back to float if you do).
Rather than computing these values on the fly, why not create a real table with all interesting pairs along with the shortest path value? Then whenever data is inserted, deleted or updated in your data table, you can recalculate all of the shortest path information. (Perl's Graph module is particularly well-suited to this task, and Perl's DBI interface makes the code straightforward.)
By using an external process, you can also limit the number of recalculations. Using PostgreSQL triggers would cause recalculations to occur on every insert, update and delete, but if you knew you were going to be adding twenty pairs of points, you could wait until your inserts were completed before doing the calculations.
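If you would rather stay inside PostgreSQL than run an external process, a minimal sketch of the same precompute-and-refresh idea (using the recursive query from the earlier answer as the view body; the names shortest_paths and nodes are placeholders) is:
CREATE MATERIALIZED VIEW shortest_paths AS
WITH RECURSIVE paths (n1, n2, distance) AS (
    SELECT n1, n2, 1 FROM nodes WHERE n1 <> n2
    UNION ALL
    SELECT paths.n1, nodes.n2, paths.distance + 1
    FROM paths JOIN nodes ON paths.n2 = nodes.n1
    WHERE nodes.n1 <> nodes.n2
)
SELECT n1, n2, min(distance) AS c FROM paths GROUP BY n1, n2
UNION
SELECT n1, n2, 0 FROM nodes WHERE n1 = n2;

-- After a batch of inserts/updates/deletes, rebuild it in one go:
REFRESH MATERIALIZED VIEW shortest_paths;
This keeps the same caveat as the recursive query itself: it only terminates on graphs without (non-self-loop) cycles, so for cyclic or very large graphs the precomputed-table-plus-external-tool approach above is still the safer bet.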