Let's say I have 4 vertex classes: V1,V2,V3,V4
And also 3 edge classes: E1,E2,E3
Then instances of them are (possibly) connected like this:
V1 --E1--> V2
V2 --E2--> V3
V2 --E3--> V4
V3 --E3--> V4
So, graph-wise something like:
V1---E1---V2
          | \
          E2 E3
          |    \
          V3---E3---V4
With directions shown above.
I'm now interested in paths from V1 to V4 that use exactly the edges shown. (There might be other edges between them that we don't know about, so only the edge types already mentioned are acceptable.)
To check whether the first path from V1 to V4 exists (more precisely, V4 is returned if the path exists):
SELECT EXPAND(out('E1').out('E3')) FROM V1 WHERE id = <someIdThatV1Has>
To check whether the other path exists (again, V4 is returned if it does):
SELECT EXPAND(out('E1').out('E2').out('E3')) FROM V1 WHERE id = <someIdThatV1Has>
All I need to know is whether at least ONE of the two paths exists. I would like to do this with a single query.
Question
How can I merge these two queries into one query to find out whether one of the two paths exists?
(If possible, a general answer to how to merge different traversal queries in OrientDB along with an explicit answer would be highly appreciated.)
Thanks!
Try with unionAll
select expand($c)
let $a = ( SELECT EXPAND(out('E1').out('E3')) FROM V1 WHERE id = <someIdThatV1Has>),
$b = ( SELECT EXPAND(out('E1').out('E2').out('E3')) FROM V1 WHERE id = <someIdThatV1Has>),
$c = unionAll( $a, $b )
You can find the documentation at the following link:
http://orientdb.com/docs/2.1/SQL.html#select-from-multiple-targets
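To make the idea behind unionAll concrete outside of OrientDB, here is a minimal Python sketch of the same logic on a small in-memory graph. The EDGES dictionary, the follow and union_all helpers and the lower-case vertex names are made up for illustration and are not part of any OrientDB API. It only shows that each path is evaluated on its own and the results are concatenated, so a non-empty result means at least one of the two paths exists.

```python
# Hypothetical in-memory graph: (source vertex, edge class) -> target vertices.
EDGES = {
    ("v1", "E1"): ["v2"],
    ("v2", "E2"): ["v3"],
    ("v2", "E3"): ["v4"],
    ("v3", "E3"): ["v4"],
}


def follow(start, edge_classes):
    """Follow the edge classes in order, like chained out('E1').out('E3')."""
    frontier = [start]
    for edge_class in edge_classes:
        frontier = [
            target
            for vertex in frontier
            for target in EDGES.get((vertex, edge_class), [])
        ]
    return frontier


def union_all(*results):
    """Concatenate result lists without removing duplicates."""
    return [item for result in results for item in result]


a = follow("v1", ["E1", "E3"])        # first path
b = follow("v1", ["E1", "E2", "E3"])  # second path
c = union_all(a, b)
path_exists = bool(c)  # True if at least one path reached a vertex
```

Because the results are simply concatenated, the same V4 can show up twice when both paths exist; for a pure existence check that does not matter.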
I looked all over the Azure Data Explorer documentation for migration scenarios and couldn't find an article on this.
What I'm trying to do is apply a migration to incoming data, and I thought of putting it in the update policy. I don't know whether this is a good idea, so let me know. Aside from this, I don't know whether what I'm doing is good enough or whether it could be improved.
I have a table Target and a table Source. Source has a dynamic Payload column, and I'm mapping that column to the table Target IF it has a certain property. I did it like this:
let new_data = Source
| where Payload.Name == 'NameImLookingFor'
;
let good_data = new_data
| where isnull(Payload.DeprecatedField)
| project
FieldA = todouble(Payload.FieldA),
FieldB = todouble(Payload.FieldB),
FieldC = todouble(Payload.FieldC)
;
let migrated_data = new_data
| where isnotnull(Payload.DeprecatedField)
| project
FieldA = iff(toint(Payload.DeprecatedField)==0,todouble(Payload.DeprecatedFieldValue), Payload.UndefinedMemeber),
FieldB = iff(toint(Payload.DeprecatedField)==1,todouble(Payload.DeprecatedFieldValue), Payload.UndefinedMemeber),
FieldC = iff(toint(Payload.DeprecatedField)==2,todouble(Payload.DeprecatedFieldValue), Payload.UndefinedMemeber)
;
good_data
| union migrated_data
I have some questions and uncertainties:
iff must have an else value specified. I want it to be null, but that type doesn't exist, so I'm using a Payload field that I'm sure doesn't exist on the object in order to get an empty value. Is this good enough? Could it be better?
I'm calling that iff 3 times; could a function be made for it? If yes, how and where? Should I place that in the update policy as well, or define it somewhere else?
Could it be done in a single query? I looked into the case statement, but I didn't feel it would make my life easier.
Thanks.
Using an update policy is valid (though, ideally, you'd fix the data at its source, if possible, before ingestion into Kusto/ADX).
You could replace your logic with the following:
Source
| where Payload.Name == 'NameImLookingFor'
| extend df = toint(Payload.DeprecatedField)
| project FieldA = case(isnull(df), todouble(Payload.FieldA), case(df == 0, todouble(Payload.DeprecatedFieldValue), double(null))),
FieldB = case(isnull(df), todouble(Payload.FieldB), case(df == 1, todouble(Payload.DeprecatedFieldValue), double(null))),
FieldC = case(isnull(df), todouble(Payload.FieldC), case(df == 2, todouble(Payload.DeprecatedFieldValue), double(null)))
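To spell out what the nested case() calls decide per row, here is a small Python sketch of the same branching. The to_double, to_int and migrate helpers are hypothetical stand-ins written for this illustration (they only roughly imitate todouble() and toint()), and the filter on Payload.Name is left out. It is not meant to replace the KQL above.

```python
FIELDS = ("FieldA", "FieldB", "FieldC")


def to_double(value):
    """Rough stand-in for todouble(): None if missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def to_int(value):
    """Rough stand-in for toint(): None if missing or not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def migrate(payload):
    """Map one payload dict to the three target fields."""
    df = to_int(payload.get("DeprecatedField"))
    if df is None:
        # No deprecated field: take the values as they are.
        return {name: to_double(payload.get(name)) for name in FIELDS}
    # Deprecated field present: its value goes to the field with that index,
    # the other two stay null (None here, double(null) in the query).
    value = to_double(payload.get("DeprecatedFieldValue"))
    return {
        name: value if index == df else None
        for index, name in enumerate(FIELDS)
    }


new_style = migrate({"FieldA": 1, "FieldB": "2.5", "FieldC": 3})
old_style = migrate({"DeprecatedField": 1, "DeprecatedFieldValue": "7"})
```

The point of the sketch is the typed null: each branch yields either a double or an explicit null, so no non-existent Payload member is needed as a placeholder.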
I am working on a mobile application which takes two inputs: Source Station Name and Destination Station Name. Upon receiving these two inputs, the application lists the names of the trains available for the given stations along with their source_arrival and destination_reach timings.
(Note: For now, I am only focusing on the unreserved local trains that operate in the state of West Bengal, India)
I am using SQLite as the RDBMS. I have the following three tables as the sources of the data:
train_table (which has the details of the trains available):
station_table (which contains the details of the stations):
route_table (which contains the route details):
Now, my aim is to produce the output in the following manner as specified earlier (suppose I gave Baruipur Jn as source and Sealdah as destination):
I am unable to figure out the query needed for this. Initially, I was trying something like the following:
select r1.trainId, r1.arrival as SrcArrive, r2.arrival
as Reach from route_table r1 cross join route_table r2
where r1.trainId = r2.trainId and r1.stationId <> r2.stationId and
r1.arrival <> r2.arrival;
(Yes, without the trainName)
But I was unable to cut down the unintended source_arrival timings. However, I was able to retrieve the number of different trains available for given two stations with the following:
select _id, trainNO, trainName from train_table where _id in
(select trainId from route_table where stationId = 109
INTERSECT
select trainId from route_table where stationId = 21);
But with this, I am not able to get to the final result that I need.
This might work; give it a try.
select routeData.*, train_table.* from (select r1.trainId, r1.arrival as SrcArrive, r2.arrival
as Reach from route_table r1 cross join route_table r2
where r1.trainId = r2.trainId and r1.stationId <> r2.stationId and
r1.arrival <> r2.arrival) routeData inner join train_table on routeData.trainId=train_table._id;
I have redefined the selection from the route table; try this updated one:
select trainName, SrcArrival, Destination from (select trainData.trainName, route.* from
(select A.trainId, A.arrival as SrcArrival, B.trainId, B.arrival as Destination from
route_table A inner join route_table B on A.trainId=B.trainId where A.stationId=109 and
B.stationId=259 and A.arrival<B.arrival) route inner join train_table trainData on
route.trainId=trainData._id) order by SrcArrival, Destination;
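For anyone who wants to try the self-join idea without the real database, here is a minimal sketch using Python's sqlite3 module and an in-memory database. The column layout is guessed from the queries above, and the train numbers, train names and timings are invented sample data; arrival is assumed to be stored as 'HH:MM' text so that string comparison follows time order. Station ids 109 and 21 are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE train_table (_id INTEGER PRIMARY KEY, trainNO TEXT, trainName TEXT);
CREATE TABLE route_table (trainId INTEGER, stationId INTEGER, arrival TEXT);
-- Invented sample data: train 2 runs in the opposite direction,
-- train 3 never reaches station 21.
INSERT INTO train_table VALUES
  (1, '34511', 'Local A'), (2, '34512', 'Local B'), (3, '34513', 'Local C');
INSERT INTO route_table VALUES
  (1, 109, '08:00'), (1, 21, '08:45'),
  (2, 21, '09:00'), (2, 109, '09:40'),
  (3, 109, '10:00');
""")

QUERY = """
SELECT t.trainName, src.arrival AS SrcArrival, dst.arrival AS Reach
FROM route_table AS src
JOIN route_table AS dst ON dst.trainId = src.trainId
JOIN train_table AS t ON t._id = src.trainId
WHERE src.stationId = ?          -- source station
  AND dst.stationId = ?          -- destination station
  AND src.arrival < dst.arrival  -- keep only the right direction
ORDER BY SrcArrival, Reach
"""

trains = conn.execute(QUERY, (109, 21)).fetchall()
```

Pinning both station ids in the WHERE clause is what removes the unintended source timings, and the arrival comparison drops trains that pass the two stations in the opposite order.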
I have the following models:
class Collection(models.Model):
    ...

class Record(models.Model):
    collection = models.ForeignKey(Collection, related_name='records')
    filename = models.CharField(max_length=256)
    checksum = models.CharField(max_length=64)

    class Meta:
        unique_together = (('filename', 'collection'),)
I want to perform the following query:
For each filename of Record I want to know the Collections that:
Do not provide a Record with that filename
or that provide such a Record, but with a differing checksum
I have in mind an output like this:
| C1 C2 C3 <- collections
-----------+------------
file-1.txt | x
file-2.txt | x
file-3.txt | ! ! !
file-4.txt | x ! !
file-5.txt | ! ! x
x = missing
! = different checksum
What I've come up with so far is one query per Collection that returns the filenames that exist in other collections but not in this one.
for collection in collections:
    other_collections = [c for c in collections if c is not collection]
    results[collection] = qs.filter(collection__in=other_collections).exclude(
        filename__in=qs.filter(
            collection=collection
        ).values_list('filename', flat=True)
    ).order_by('filename').values_list('filename', flat=True)
This somewhat solves the first part of my question, but is rather quirky and requires post-processing to get to the format I desire. And, more importantly, it does not address the checksum comparison.
Is it possible to perform the two queries in one combined step to get the results in the format I described above?
The solution would not necessarily have to use the QuerySet APIs; a fallback to raw SQL is fine by me too.
It is not possible to write a SQL query that returns a variable number of columns, although you can achieve that effect if you wrap everything in an array or JSON object.
If you know the collections, you could write SQL like this:
SELECT r.filename,
(SELECT r.checksum = r2.checksum FROM records r2 WHERE r.filename = r2.filename AND r2.collection_id = 1) AS c1,
(SELECT r.checksum = r2.checksum FROM records r2 WHERE r.filename = r2.filename AND r2.collection_id = 2) AS c2,
...
FROM records r
WHERE r.collection_id = 1
GROUP BY r.filename, r.checksum
For each filename/collection pair, you will get NULL if the collection doesn't have the record, true if the collection has it with the right checksum, or false if the collection has it with a different checksum.
I include WHERE r.collection_id = 1 because otherwise for the checksum comparison, you have to answer "different from what?"
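As a rough illustration of the "if you know the collections" case, the sketch below builds one such subquery per collection id and runs it with Python's sqlite3 module. The records table layout, the sample rows and the build_query helper are assumptions made for this example; with Django you would use the table name the ORM actually generated. SQLite returns 1, 0 or NULL for the comparison, which the sketch maps to the symbols from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (collection_id INTEGER, filename TEXT, checksum TEXT,"
    " UNIQUE (filename, collection_id))"
)
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", [
    (1, "file-1.txt", "aaa"), (1, "file-2.txt", "bbb"), (1, "file-3.txt", "ccc"),
    (2, "file-1.txt", "aaa"), (2, "file-2.txt", "zzz"),
    (3, "file-1.txt", "aaa"), (3, "file-3.txt", "ccc"),
])


def build_query(collection_ids, reference_id):
    """Build one scalar subquery column per known collection id."""
    columns = ",\n".join(
        "  (SELECT r.checksum = r2.checksum FROM records r2"
        f" WHERE r2.filename = r.filename AND r2.collection_id = {int(cid)})"
        f" AS c{int(cid)}"
        for cid in collection_ids
    )
    return (
        f"SELECT r.filename,\n{columns}\n"
        f"FROM records r WHERE r.collection_id = {int(reference_id)}\n"
        "ORDER BY r.filename"
    )


# NULL -> missing, 0 -> different checksum, 1 -> same checksum.
SYMBOLS = {None: "x", 0: "!", 1: " "}
rows = conn.execute(build_query([1, 2, 3], reference_id=1)).fetchall()
table = {name: [SYMBOLS[flag] for flag in flags] for name, *flags in rows}
```

Note that only filenames present in the reference collection appear in the result, which follows directly from the WHERE r.collection_id = 1 condition.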
I have the following sql data:
ID      | Company Name     | Customer        | Address 1   | City      | State | Zip | Date
0108500 | AAA Test         | Mish~Sara       | Newa Claims | Chtiana   | CO    | 123 | 06FE0046
0108500 | AAA.Test         | Mish~Sara       | Newa Claims | Chtiana   | CO    | 123 | 06FE0046
1802600 | AAA Test Company | Ban, Adj.~Gorge | PO Box 83   | MouLaurel | CA    | 153 | 09JS0025
1210600 | AAA Test Company | Biwel~Brce      | 97kehst ve  | Jacn      | CA    | 153 | 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered as one company.
Since their data is messy, I'm thinking of doing one of the following:
Is there a way to search all the records in the DB for company names that are almost the same and then rename them to the longest name?
In this case, AAA Test and AAA.Test would become AAA Test Company.
Or is there a way to filter only the records whose company names are almost the same, so that users have the option to change them?
If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.
Thank you in advance.
You could use a string-matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate people's names that have been typed in differently. It can take a while, but it works well for the fuzzy match you're looking for.
Something like a self join? || is ANSI SQL concat, some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas and dots etc.
Use case-insensitive collation. SOUNDEX can be used etc etc.
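Here is a small, hedged sketch of that self join, run against an in-memory SQLite database from Python. The sample rows and the NORMALISE expression are assumptions for illustration; in particular, stripping inner spaces goes a step beyond TRIM, which is what makes 'AAA Test' and 'AAA.Test' compare equal for this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, companyname TEXT)")
conn.executemany(
    "INSERT INTO tablename (companyname) VALUES (?)",
    [("AAA Test",), ("AAA.Test",), ("AAA Test Company",), ("Other Corp",)],
)

# Lower-case, trim, then drop dots, commas and spaces before comparing.
NORMALISE = "REPLACE(REPLACE(REPLACE(LOWER(TRIM({col})), '.', ''), ',', ''), ' ', '')"
n1 = NORMALISE.format(col="t1.companyname")
n2 = NORMALISE.format(col="t2.companyname")

query = f"""
SELECT t1.companyname, t2.companyname
FROM tablename t1
JOIN tablename t2
  ON t1.id <> t2.id                -- do not pair a row with itself
 AND {n1} LIKE '%' || {n2} || '%'  -- t1's name contains t2's name
ORDER BY t1.id, t2.id
"""
pairs = conn.execute(query).fetchall()
```

On a table with around a million rows this join compares every row with every other row, so it is better suited to finding candidates in a filtered subset than to a single pass over everything.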
I think most database servers support full-text search, and if so, there are some full-text search functions that support proximity.
For example, there is a NEAR function in SQL Server; here is its documentation: https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows
/** Populate dupstr column for fuzzy match **/
update vendor v
--'[.]' matches a literal dot; a bare '.' would match every character
set v.dupstr = left(upper(regex_replace(regex_replace(v.companyname,'[.]',''),' ','')),6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
--dupes to clean up
exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr
and v.id != v1.id )
and
( --keeper has longest name
length(v.companyname) =
( select max(length(v2.companyname)) from vendor v2
where v.dupstr = v2.dupstr
)
or
--keeper has latest record (assuming ID is sequential)
v.id =
( select max(v3.id) from vendor v3
where v.dupstr = v3.dupstr
)
)
;
The above SQL can be refined to add a "dupe" status to the other records, or you can do a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
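The stages above can be prototyped in a few lines of Python before committing to the SQL updates. This is only a sketch under assumptions: dup_key and pick_keepers are hypothetical helper names, the key keeps letters and digits only, and the keeper is simply the longest name in each group.

```python
import re
from collections import defaultdict


def dup_key(name, length=6):
    """Upper-case, keep only letters and digits, then take the first characters."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())[:length]


def pick_keepers(names):
    """Group names by key and pick the longest name of each duplicate group."""
    groups = defaultdict(list)
    for name in names:
        groups[dup_key(name)].append(name)
    return {
        key: max(members, key=len)
        for key, members in groups.items()
        if len(members) > 1  # only groups that actually contain duplicates
    }


names = ["AAA Test", "AAA.Test", "AAA Test Company", "Other Corp"]
keepers = pick_keepers(names)
```

With the sample names from the question, all three AAA variants share the key 'AAATES' and the keeper is 'AAA Test Company'. A short key like this will also merge unrelated companies that share a prefix, which is why the last stage leaves the remaining cases to a human.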
You can use a SQL query with SOUNDEX or DIFFERENCE.
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0 - 4 (4 = almost the same, 0 = totally different)
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017
In an OrientDB graph database, I'm trying to get some information about Vertex, Edge pairs.
For example, consider the following case:
V1 ---E1---> V2
   ---E2---> V3 --E3--> V2
I would like to have as result the following 3 rows:
V1, E1
V1, E2
V3, E3
I've tried the following:
select label, flatten(out.label) from V
select label from (select flatten(out) from V)
select label, flatten(out) from V
select flatten(out) from V
select $current, label from (traverse out from V while $depth <= 1) where $depth = 1
But none of these solutions seem to return what I want. How can I return Vertex, Edge pairs?
What you are trying to do is actually extremely simple with OrientDB; it seems you are overthinking the issue.
Let's create your example:
V1 ---E1---> V2
   ---E2---> V3 --E3--> V2
In OrientDB, you would do this as follows:
/* Create nodes */
CREATE CLASS Node EXTENDS V
CREATE PROPERTY Node.name STRING (MANDATORY TRUE)
CREATE VERTEX Node SET name = 'V1'
CREATE VERTEX Node SET name = 'V2'
CREATE VERTEX Node SET name = 'V3'
/* Create edges */
CREATE CLASS Link EXTENDS E
CREATE PROPERTY Link.name STRING (MANDATORY TRUE)
CREATE EDGE Link
FROM (SELECT FROM Node WHERE name = 'V1')
TO (SELECT FROM Node WHERE name = 'V2')
SET name = 'E1'
CREATE EDGE Link
FROM (SELECT FROM Node WHERE name = 'V1')
TO (SELECT FROM Node WHERE name = 'V3')
SET name = 'E2'
CREATE EDGE Link
FROM (SELECT FROM Node WHERE name = 'V3')
TO (SELECT FROM Node WHERE name = 'V2')
SET name = 'E3'
This creates the following graph:
Now a little explanation of how to query in OrientDB. Let's say you load one vertex: SELECT * FROM Node WHERE name = 'V1'. Then, to load other information, you use:
To load all incoming vertices (skipping the edges): in()
To load all vertices connected through incoming edges of class Link (skipping the edges): in('Link')
To load all incoming edges: inE()
To load all incoming edges of class Link: inE('Link')
To load all outgoing vertices (skipping the edges): out()
To load all vertices connected through outgoing edges of class Link (skipping the edges): out('Link')
To load all outgoing edges: outE()
To load all outgoing edges of class Link: outE('Link')
So in your case, you want to load all the vertices and their outgoing edges, so we do:
SELECT name, outE('Link') FROM Node
Which loads the name of the vertices and a pointer to the outgoing edges:
If you would like to have a list of the names of the outgoing edges, we simply do:
SELECT name, outE('Link').name FROM Node
Which gives:
Which is exactly what you asked for in your question. As you can see, this is extremely simple to do in OrientDB; you just need to realize that OrientDB is smarter than you think :)
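To show the expected shape of both results without a running database, here is a small Python sketch over the same three edges. The EDGES list and both helper functions are made up for illustration and have nothing to do with the OrientDB API: one produces the flat (vertex, edge) rows from the question, the other the per-vertex list that the outE('Link').name projection corresponds to.

```python
# Hypothetical edge list: (edge name, source vertex, target vertex).
EDGES = [
    ("E1", "V1", "V2"),
    ("E2", "V1", "V3"),
    ("E3", "V3", "V2"),
]
VERTICES = ["V1", "V2", "V3"]


def vertex_edge_pairs(edges):
    """One row per outgoing edge: (source vertex, edge name)."""
    return sorted((source, name) for name, source, _target in edges)


def outgoing_edge_names(vertices, edges):
    """One entry per vertex with the names of its outgoing edges."""
    result = {vertex: [] for vertex in vertices}
    for name, source, _target in edges:
        result[source].append(name)
    return result


pairs = vertex_edge_pairs(EDGES)
grouped = outgoing_edge_names(VERTICES, EDGES)
```

The grouped form has one row per vertex, including V2 with an empty list, whereas the question asked for one row per edge; flattening the grouped form gives the three rows.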
The FLATTEN operator has to be used alone, because it takes a field and turns it into the result. I don't understand what you want to do. Can you write the expected output, please?
The Cypher syntax, as used in Neo4j, finally rescued me.
start n=node(*) MATCH (n)-[left]->(n2)<-[right]-(n3) WHERE n.type? ='myType' AND left.line > right.line - 1 AND left.line < right.line + 1 RETURN n, left, n2, right, n3
The node n is the pivot element, on which a filter can be applied, just as on every other step within the path. For me it was important to select a further step depending on another part of the path.
With OrientDB I couldn't find a way to relate the properties to each other easily.
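For readers who don't know Cypher, the pattern above can be read as: find two different relationships that point at the same node n2, where the first one starts at a node of the wanted type and the two line properties are less than 1 apart. A plain-Python sketch of that reading follows. The NODES and EDGES data and the matches helper are invented for illustration, and the type check is simplified to a plain equality test.

```python
from itertools import permutations

# Hypothetical sample data.
NODES = {
    "a": {"type": "myType"},
    "b": {"type": "other"},
    "c": {"type": "other"},
}
EDGES = [
    {"id": 1, "source": "a", "target": "b", "line": 10},
    {"id": 2, "source": "c", "target": "b", "line": 10},
    {"id": 3, "source": "c", "target": "b", "line": 12},
]


def matches(nodes, edges, wanted_type):
    """Return (n, left, n2, right, n3) tuples for (n)-[left]->(n2)<-[right]-(n3)."""
    result = []
    for left, right in permutations(edges, 2):  # two distinct relationships
        if left["target"] != right["target"]:
            continue  # both must point at the same n2
        if nodes[left["source"]].get("type") != wanted_type:
            continue  # filter on the pivot node n
        # Filter that relates one part of the path to another part.
        if right["line"] - 1 < left["line"] < right["line"] + 1:
            result.append(
                (left["source"], left["id"], left["target"], right["id"], right["source"])
            )
    return result


found = matches(NODES, EDGES, "myType")
```

The last condition is the part that compares properties from two different steps of the path.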