SQL: Finding a subgraph - sql

I have a graph network stored in an SQL server. The network (a collection of labeled, undirected, connected graphs) is stored in a vertex-edge mapping scheme, i.e. there are two tables, one for vertices and one for edges:
Vertices ( graphID , vertexID, vertexLabel )
Edges ( graphID , sourceVertex , destinationVertex ,edgeLabel )
I am looking for a simple way of counting the occurrences of a particular subgraph in this network. For example, I would like to find how many instances of "A-B-C" are present in the network "C-D-A-B-C-E-A-B-C-F". I have a few ideas on how this could be done in, say, Java or C++, but I have no clue how to approach the problem in SQL. Any ideas?
A little background: I'm not a student; this is a small project I would like to pursue. I do a lot of social media analysis (in memory) but have little experience mining graphs in an SQL database.

My idea is to create a stored procedure whose input is a string like 'A-B-C' or a pre-created table holding the vertex labels in the proper order ('A', 'B', 'C'). You then loop, walking through the path 'A-B-C' one step at a time. For this you need a temp table that holds the vertices reached at the current step:
1) Step 0
    set @currentLabel = getNextVertexLabel(...) -- need to decide how to do this

    -- vertices that carry the first label of the path
    select *
    into #v
    from Vertices
    where vertexLabel = @currentLabel

    -- empty table with the same structure as #v; we need it later
    select *
    into #tempV
    from #v
    where 0 <> 0
2) Step i
    set @currentLabel = getNextVertexLabel(...)

    -- neighbours of the current vertices that carry the next label
    -- (edges are followed from sourceVertex to destinationVertex only)
    insert #tempV
    select vs.*
    from #v v
    join Edges e
        on e.sourceVertex = v.vertexID
        and e.graphID = v.graphID
    join Vertices vs
        on e.destinationVertex = vs.vertexID
        and e.graphID = vs.graphID
    where vs.vertexLabel = @currentLabel

    -- the matches become the current vertices for the next step
    truncate table #v
    insert #v
    select * from #tempV
    truncate table #tempV
3) After the loop
The result is stored in #v, so the number of subgraphs is:
    select count(*) from #v
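The step-by-step walk above can be tried out with a minimal sketch in Python and an in-memory SQLite database. This is an illustration under assumptions, not the stored procedure itself: the helper names `count_label_paths` and `build_chain` are hypothetical, the table layout follows the question, and edges are followed from sourceVertex to destinationVertex only, as in the answer.

```python
import sqlite3

def count_label_paths(conn, labels):
    """Count paths whose vertex labels match `labels`, following edges source -> destination."""
    cur = conn.cursor()
    # Step 0: every vertex carrying the first label starts one partial path.
    frontier = cur.execute(
        "SELECT graphID, vertexID FROM Vertices WHERE vertexLabel = ?",
        (labels[0],),
    ).fetchall()
    # Step i: extend each partial path along edges that lead to the next label.
    for label in labels[1:]:
        extended = []
        for graph_id, vertex_id in frontier:
            extended.extend(cur.execute(
                """SELECT e.graphID, e.destinationVertex
                   FROM Edges e
                   JOIN Vertices v
                     ON v.graphID = e.graphID AND v.vertexID = e.destinationVertex
                   WHERE e.graphID = ? AND e.sourceVertex = ? AND v.vertexLabel = ?""",
                (graph_id, vertex_id, label),
            ).fetchall())
        frontier = extended
    # One row per complete path remains.
    return len(frontier)

def build_chain(labels):
    """Hypothetical test data: a single graph whose vertices form a chain in the given order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT)")
    conn.execute("CREATE TABLE Edges (graphID INT, sourceVertex INT, destinationVertex INT, edgeLabel TEXT)")
    for i, label in enumerate(labels, start=1):
        conn.execute("INSERT INTO Vertices VALUES (1, ?, ?)", (i, label))
    for i in range(1, len(labels)):
        conn.execute("INSERT INTO Edges VALUES (1, ?, ?, 'e')", (i, i + 1))
    return conn

conn = build_chain("C-D-A-B-C-E-A-B-C-F".split("-"))
print(count_label_paths(conn, ["A", "B", "C"]))  # prints 2
```

Because the graphs in the question are undirected, a real implementation would probably also have to match edges in the reverse direction, unless every edge is stored in both directions.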

Related

with-constrained consecutive updates

Assume I have built a query in MS SQL Server with the following structure:
WITH issues_a AS
(
    SELECT a_prop
    FROM ds_X x
)
, issues_b AS
(
    SELECT key
         , z.is_flagged as is_flagged
         , some_prop
    FROM ds_Z z
    JOIN issues_a i_a
        ON z.a_diff = i_a.a_prop
)
-- {{ run }}
UPDATE samples
SET error =
    CASE
        WHEN i_b.some_prop IS NULL THEN '#1 ...'
        WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
    END
FROM samples s
left join issues_b i_b ON s.key = i_b.key;
Now I want to enhance the whole thing by updating one more table, consecutively, by enclosing parts of the query in BEGIN TRANSACTION and COMMIT, but I can't get my head around how to do it. I tried enclosing the whole expression in the transaction statements, but that didn't get me any further.
Are there any other ways to achieve this, even without chaining the consecutive updates in a transaction (though that would be better)?
To summarize the task: WITH <...>(...), <...>(...) UPDATE <... using data from the latter WITH> UPDATE <... using data from the latter WITH>?
Hope you don't mind my poor grammar...

OrientDB select Vertex, Edge pairs from query

In an OrientDb graph database, I'm trying to get some information about Vertex, Edge pairs.
For example, consider the following case:
V1 ---E1---> V2
---E2---> V3 --E3--> V2
I would like to get the following 3 rows as the result:
V1, E1
V1, E2
V3, E3
I've tried the following:
select label, flatten(out.label) from V
select label from (select flatten(out) from V)
select label, flatten(out) from V
select flatten(out) from V
select $current, label from (traverse out from V while $depth <= 1) where $depth = 1
But none of these solutions seem to return what I want. How can I return Vertex, Edge pairs?
What you are trying to do is actually extremely simple with OrientDB; it seems you are overthinking the issue.
Let's create your example:
V1 ---E1---> V2
---E2---> V3 --E3--> V2
In OrientDB, you would do this as follows:
/* Create nodes */
CREATE CLASS Node EXTENDS V
CREATE PROPERTY Node.name STRING (MANDATORY TRUE)
CREATE VERTEX Node SET name = 'V1'
CREATE VERTEX Node SET name = 'V2'
CREATE VERTEX Node SET name = 'V3'
/* Create edges */
CREATE CLASS Link EXTENDS E
CREATE PROPERTY Link.name STRING (MANDATORY TRUE)
CREATE EDGE Link
    FROM (SELECT FROM Node WHERE name = 'V1')
    TO (SELECT FROM Node WHERE name = 'V2')
    SET name = 'E1'
CREATE EDGE Link
    FROM (SELECT FROM Node WHERE name = 'V1')
    TO (SELECT FROM Node WHERE name = 'V3')
    SET name = 'E2'
CREATE EDGE Link
    FROM (SELECT FROM Node WHERE name = 'V3')
    TO (SELECT FROM Node WHERE name = 'V2')
    SET name = 'E3'
This creates the graph sketched above.
Now a little explanation of how to query in OrientDB. Let's say you load one vertex: SELECT * FROM Node WHERE name = 'V1'. Then, to load related records, you use:
To load all incoming vertices (skipping the edges): in()
To load all incoming vertices connected through edges of class Link (skipping the edges): in('Link')
To load all incoming edges: inE()
To load all incoming edges of class Link: inE('Link')
To load all outgoing vertices (skipping the edges): out()
To load all outgoing vertices connected through edges of class Link (skipping the edges): out('Link')
To load all outgoing edges: outE()
To load all outgoing edges of class Link: outE('Link')
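To make the difference between the vertex functions and the edge functions concrete, here is a small plain-Python model of the example graph. It does not call OrientDB at all; the names `out_e`, `out`, `in_e` and `in_` are hypothetical stand-ins that only mimic what outE(), out(), inE() and in() return for this graph.

```python
# Edges of the example graph as (edge name, source vertex, target vertex).
EDGES = [("E1", "V1", "V2"), ("E2", "V1", "V3"), ("E3", "V3", "V2")]

def out_e(vertex):
    """Names of the edges leaving `vertex` (models outE())."""
    return [name for name, src, dst in EDGES if src == vertex]

def out(vertex):
    """Vertices reached over outgoing edges, skipping the edges (models out())."""
    return [dst for name, src, dst in EDGES if src == vertex]

def in_e(vertex):
    """Names of the edges arriving at `vertex` (models inE())."""
    return [name for name, src, dst in EDGES if dst == vertex]

def in_(vertex):
    """Vertices at the other end of incoming edges, skipping the edges (models in())."""
    return [src for name, src, dst in EDGES if dst == vertex]

# One (vertex, outgoing edge) pair per row, as asked for in the question.
pairs = [(v, e) for v in ("V1", "V2", "V3") for e in out_e(v)]
print(pairs)  # [('V1', 'E1'), ('V1', 'E2'), ('V3', 'E3')]
```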
So in your case, you want to load all the vertices and their outgoing edges, so we do:
SELECT name, outE('Link') FROM Node
This loads the name of each vertex and pointers to its outgoing edges.
If you would like to have a list of the names of the outgoing edges, simply do:
SELECT name, outE('Link').name FROM Node
This returns each vertex name together with the names of its outgoing edges, which is what you asked for in your question. As you can see, this is extremely simple to do in OrientDB; you just need to realize that OrientDB is smarter than you think :)
The FLATTEN operator has to be used on its own, because it takes a field and makes it the result. I don't understand what you want to do. Can you write out the expected output, please?
The Cypher syntax, as used in Neo4j, finally rescued me.
start n=node(*) MATCH (n)-[left]->(n2)<-[right]-(n3) WHERE n.type? ='myType' AND left.line > right.line - 1 AND left.line < right.line + 1 RETURN n, left, n2, right, n3
The node n is the pivot element, on which a filter can be applied, just as on every other step within the path. For me it was important to select a further step depending on another part of the path.
With OrientDB I couldn't find an easy way to relate the properties to each other.

Data structure for efficient multi-parameters search

I have a collection of multidimensional objects (e.g. class Person = {age : int, height : int, weight : int, etc.}).
I need to query the collection with queries where some dimensions are fixed and the rest are unspecified (e.g. getAllPersonWith {age = c, height = a} or getAllPersonWith {weight = d}).
Right now I have a multimap from {age, height, ...} (i.e. all dimensions that can be fixed) to a List of Person. To perform a query I first compute the set of keys that satisfy the query, then merge the corresponding lists from the map.
Is there anything better in terms of query speed? In particular, is there anything closer to using one sorted list per dimension (which I believe to be the fastest solution, but too cumbersome to manage :) )?
Just to be clear, I am not looking for an SQL query.
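One possible alternative to a single multimap keyed on all dimensions is one index per dimension, with a query intersecting the matches of every fixed dimension. The sketch below is only an illustration of that idea in Python; `AttributeIndex` and its methods are hypothetical names, and it handles equality constraints only.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    name: str
    age: int
    height: int
    weight: int

class AttributeIndex:
    """One hash index per field; a query intersects the per-field matches."""

    def __init__(self, fields):
        self.fields = fields
        self.items = set()
        self.index = {f: defaultdict(set) for f in fields}

    def add(self, item):
        self.items.add(item)
        for f in self.fields:
            self.index[f][getattr(item, f)].add(item)

    def query(self, **fixed):
        if not fixed:
            return set(self.items)
        # Start from the smallest candidate set so the intersections stay cheap.
        candidates = sorted(
            (self.index[f].get(v, set()) for f, v in fixed.items()), key=len
        )
        result = set(candidates[0])
        for c in candidates[1:]:
            result &= c
        return result

idx = AttributeIndex(["age", "height", "weight"])
for p in (Person("Ann", 30, 170, 60), Person("Bob", 30, 180, 80), Person("Cid", 41, 180, 80)):
    idx.add(p)
print(sorted(p.name for p in idx.query(age=30, height=180)))  # ['Bob']
```

Unspecified dimensions simply do not take part in the intersection, so no key set has to be enumerated first. Range queries would need sorted structures instead of hash sets.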
For your purpose you can have a look at:
http://code.google.com/p/cqengine/
Should get you in the right direction
You mean something like:
SELECT * FROM person p
WHERE gender = 'F'
  AND age >= 18
  AND age < 30
  AND weight > 60 -- metric measures here !!
  AND weight < 70
  AND NOT EXISTS (
      SELECT * from couple c
      WHERE c.one = p.id OR c.two = p.id
  );
Why do you think I use SQL?

Help with a complex join query

Keep in mind I am using SQL 2000
I have two tables.
tblAutoPolicyList contains a field called PolicyIDList.
tblLossClaims contains two fields called LossPolicyID & PolicyReview.
I am writing a stored proc that will get the distinct PolicyID from PolicyIDList field, and loop through LossPolicyID field (if match is found, set PolicyReview to 'Y').
Sample table layout:
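PolicyIDList | LossPolicyID                                          | PolicyReview
-----------------------------------------------------------------------------------
9651XVB19    | 5021WWA85, 4421WWA20, 3314WWA31, 1121WAW11, 2221WLL99 | Y
5021WWA85    | 3326WAC35, 1221AXA10, 9863AAA44, 5541RTY33, 9651XVB19 | Y
0151ZVB19    | 4004WMN63, 1001WGA42, 8587ABA56, 8541RWW12, 9329KKB08 | N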
How would I go about writing the stored proc (looking for logic more than syntax)?
Keep in mind I am using SQL 2000.
Select LossPolicyID, * from tableName where charindex('PolicyID',LossPolicyID,1)>0
Basically, the idea is this:
'Unroll' tblLossClaims and return two columns: a tblLossClaims key (you didn't mention any, so I guess it's going to be LossPolicyID) and Item = a single item from LossPolicyID.
Find matches of unrolled.Item in tblAutoPolicyList.PolicyIDList.
Find matches of distinct matched.LossPolicyID in tblLossClaims.LossPolicyID.
Update tblLossClaims.PolicyReview accordingly.
The main UPDATE can look like this:
UPDATE claims
SET PolicyReview = 'Y'
FROM tblLossClaims claims
JOIN (
    SELECT DISTINCT unrolled.LossPolicyID
    FROM (
        -- placeholders: itemof() and unrolling_join stand for the splitting logic
        SELECT LossPolicyID, Item = itemof(LossPolicyID)
        FROM unrolling_join
    ) unrolled
    JOIN tblAutoPolicyList
        ON unrolled.Item = tblAutoPolicyList.PolicyIDList
) matched
    ON matched.LossPolicyID = claims.LossPolicyID
You can take advantage of the fixed item width and the fixed list format and thus easily split LossPolicyID without a UDF. I can see this done with the help of a number table and SUBSTRING(). unrolling_join in the above query is actually tblLossClaims joined with the number table.
Here's the definition of unrolled 'zoomed in':
...
(
    SELECT LossPolicyID,
           Item = SUBSTRING(LossPolicyID,
                            -- each item is followed by ', ', so the stride is @ItemLength + 2
                            (v.number - 1) * (@ItemLength + 2) + 1,
                            @ItemLength)
    FROM tblLossClaims c
    JOIN master..spt_values v ON v.type = 'P'
        AND v.number BETWEEN 1 AND (LEN(c.LossPolicyID) + 2) / (@ItemLength + 2)
) unrolled
...
master..spt_values is a system table that is used here as the number table. The filter v.type = 'P' gives us a rowset with number values from 0 to 2047, which is narrowed down to the numbers from 1 to the number of items in LossPolicyID. In the end, v.number serves as an array index and is used to cut out single items.
@ItemLength is of course simply LEN(tblAutoPolicyList.PolicyIDList). I would probably also declare @ItemLength2 = @ItemLength + 2 so that it isn't recalculated every time the filter is applied.
Basically, that's it, if I haven't missed anything.
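The index arithmetic behind the number-table split can be checked outside the database. The following Python sketch is only an illustration: `unroll` and `review_flag` are hypothetical helpers, and the sketch assumes, as the answer does, fixed-width items separated by a comma and a space.

```python
def unroll(loss_policy_ids, item_length, separator_length=2):
    """Cut a fixed-width, ', '-separated list into items using index arithmetic only."""
    stride = item_length + separator_length
    # Same count as (LEN(list) + 2) / (item length + 2) with integer division.
    count = (len(loss_policy_ids) + separator_length) // stride
    items = []
    for n in range(1, count + 1):       # n plays the role of v.number
        start = (n - 1) * stride        # 0-based here; SUBSTRING is 1-based
        items.append(loss_policy_ids[start:start + item_length])
    return items

def review_flag(loss_policy_ids, policy_ids, item_length=9):
    """'Y' if any unrolled item appears among the policy IDs, else 'N'."""
    return "Y" if any(i in policy_ids for i in unroll(loss_policy_ids, item_length)) else "N"

policy_ids = {"9651XVB19", "5021WWA85", "0151ZVB19"}
claims = [
    "5021WWA85, 4421WWA20, 3314WWA31, 1121WAW11, 2221WLL99",
    "3326WAC35, 1221AXA10, 9863AAA44, 5541RTY33, 9651XVB19",
    "4004WMN63, 1001WGA42, 8587ABA56, 8541RWW12, 9329KKB08",
]
print([review_flag(c, policy_ids) for c in claims])  # ['Y', 'Y', 'N']
```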
If the PolicyIDList field is a delimited list, you first have to separate the individual policy IDs and create a temporary table with all of the results. Next, run an update query on tblLossClaims with 'where exists (select * from #temptable tt where tt.PolicyID = LossPolicyID)'.
Depending on the size of the table/data, you might wish to add an index to your temporary table.

Selecting elements that don't exist

I am working on an application that has to assign numeric codes to elements. These codes are not consecutive, and my idea is not to insert them into the database until I have the related element. However, I would like to find the unassigned codes with SQL, and I don't know how to do it.
Any ideas?
Thanks!!!
Edit 1
The table can be as simple as this:
code | element
-----------------
3 | three
7 | seven
2 | two
And I would like something like this: 1, 4, 5, 6. Without any other table.
Edit 2
Thanks for the feedback, your answers have been very helpful.
This will return NULL if a code is not assigned:
SELECT assigned_codes.code
FROM codes
LEFT JOIN assigned_codes
    ON assigned_codes.code = codes.code
WHERE codes.code = @code
This will return all non-assigned codes:
SELECT codes.code
FROM codes
LEFT JOIN assigned_codes
    ON assigned_codes.code = codes.code
WHERE assigned_codes.code IS NULL
There is no pure SQL way to do exactly the thing you want.
In Oracle, you can do the following:
SELECT lvl
FROM (
    SELECT level AS lvl
    FROM dual
    CONNECT BY
        level <=
        (
            SELECT MAX(code)
            FROM elements
        )
)
LEFT OUTER JOIN
    elements
    ON code = lvl
WHERE code IS NULL
In PostgreSQL, you can do the following:
SELECT lvl
FROM generate_series(
    1,
    (
        SELECT MAX(code)
        FROM elements
    )) lvl
LEFT OUTER JOIN
    elements
    ON code = lvl
WHERE code IS NULL
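The same number-series idea can also be sketched against SQLite from Python, here with a recursive common table expression producing the numbers. This is an illustration under assumptions rather than part of the answer above: the table is assumed to be called elements, as in the queries, and it is filled with the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elements (code INTEGER, element TEXT)")
conn.executemany(
    "INSERT INTO elements VALUES (?, ?)",
    [(3, "three"), (7, "seven"), (2, "two")],
)

# Generate 1 .. MAX(code), then keep the numbers without a matching row.
UNASSIGNED = """
WITH RECURSIVE series(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM series WHERE n < (SELECT MAX(code) FROM elements)
)
SELECT n
FROM series
LEFT JOIN elements ON elements.code = series.n
WHERE elements.code IS NULL
ORDER BY n
"""

unassigned = [row[0] for row in conn.execute(UNASSIGNED)]
print(unassigned)  # [1, 4, 5, 6]
```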
Contrary to the assertion that this cannot be done using pure SQL, here is a counter example showing how it can be done. (Note that I didn't say it was easy - it is, however, possible.) Assume the table's name is value_list with columns code and value as shown in the edits (why does everyone forget to include the table name in the question?):
SELECT b.bottom, t.top
FROM (SELECT l1.code - 1 AS top
      FROM value_list l1
      WHERE NOT EXISTS (SELECT * FROM value_list l2
                        WHERE l2.code = l1.code - 1)) AS t,
     (SELECT l1.code + 1 AS bottom
      FROM value_list l1
      WHERE NOT EXISTS (SELECT * FROM value_list l2
                        WHERE l2.code = l1.code + 1)) AS b
WHERE b.bottom <= t.top
  AND NOT EXISTS (SELECT * FROM value_list l2
                  WHERE l2.code >= b.bottom AND l2.code <= t.top);
The two parallel queries in the from clause generate values that are respectively at the top and bottom of a gap in the range of values in the table. The cross-product of these two lists is then restricted so that the bottom is not greater than the top, and such that there is no value in the original list in between the bottom and top.
On the sample data, this produces the range 4-6. When I added an extra row (9, 'nine'), it also generated the range 8-8. Clearly, you also have two other possible ranges for a suitable definition of 'infinity':
-infinity .. MIN(code)-1
MAX(code)+1 .. +infinity
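To see the gap query at work, it can be run against an in-memory SQLite database from Python. This is only a sketch: the `gaps` helper is a hypothetical wrapper, an ORDER BY has been appended so the output order is stable, and SQLite stands in for whatever database is actually used.

```python
import sqlite3

GAPS = """
SELECT b.bottom, t.top
FROM (SELECT l1.code - 1 AS top
      FROM value_list l1
      WHERE NOT EXISTS (SELECT * FROM value_list l2
                        WHERE l2.code = l1.code - 1)) AS t,
     (SELECT l1.code + 1 AS bottom
      FROM value_list l1
      WHERE NOT EXISTS (SELECT * FROM value_list l2
                        WHERE l2.code = l1.code + 1)) AS b
WHERE b.bottom <= t.top
  AND NOT EXISTS (SELECT * FROM value_list l2
                  WHERE l2.code >= b.bottom AND l2.code <= t.top)
ORDER BY b.bottom
"""

def gaps(codes):
    """Load `codes` into a fresh value_list table and return the (bottom, top) gap ranges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE value_list (code INTEGER, value TEXT)")
    conn.executemany("INSERT INTO value_list (code) VALUES (?)", [(c,) for c in codes])
    return conn.execute(GAPS).fetchall()

print(gaps([3, 7, 2]))     # [(4, 6)]
print(gaps([3, 7, 2, 9]))  # [(4, 6), (8, 8)]
```

The unassigned code 1 does not show up because it lies below MIN(code), in the first of the two open-ended ranges listed above.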
Note that:
If you are using this routinely, there will generally not be many gaps in your lists.
Gaps can only appear when you delete rows from the table (or you ignore the ranges returned by this query or its relatives when inserting data).
It is usually a bad idea to reuse identifiers, so in fact this effort is probably misguided.
However, if you want to do it, here is one way to do so.
This is the same idea that Quassnoi published.
I just put all the ideas together in T-SQL-like code.
DECLARE @series TABLE (n int)
DECLARE
    @max_n int,
    @i int
SET @i = 1
-- max value in elements table
SELECT @max_n = MAX(code) FROM elements
-- fill @series with the numbers from 1 to @max_n - 1 (@max_n itself is always assigned)
WHILE @i < @max_n BEGIN
    INSERT INTO @series (n) VALUES (@i)
    SET @i = @i + 1
END
-- unassigned codes: those without a matching row in the elements table
SELECT
    n
FROM
    @series AS series
LEFT JOIN
    elements
ON
    elements.code = series.n
WHERE
    elements.code IS NULL
EDIT:
This is, of course, not an ideal solution. If you have a lot of elements or check for non-existing codes often, this could cause performance issues.