OrientDB get all connected vertices from a starting vertex

OrientDB get all connected vertices from a starting vertex - properties

I'm working with Neo4j and OrientDB and I will compare their performance and functionality regarding traversal (Java Traversal API in Neo4j and Native Fluent API in OrientDB).
For OrientDB it should be possible to start at a specific vertex and get all vertices which are reachable and also an intersection of the property access on each individual step exists. If you have for example the graph A --> B --> C --> D, where A, B and C has access = www and D has the values access = https, the vertices A, B and C should be returned.
I recently asked a question and afterwards I posted my solution, how comparison on properties is possible on Neo4j.
Neo4j node property comparison during traversal
In Neo4j it is possible to create a TraversalDescription, which describes the rules and the bahavior of the traversal. For example the following: the AccesListEvaluator will compare the property of two connected nodes, as you can see on the posted solution for above mentioned Neo4j comparison.
TraversalDescription td = db.traversalDescription().depthFirst()
.relationships(RelationshipLabel.REFERENCED_BY, Direction.OUTGOING)
.evaluator(new AccessListEvaluator());
If you execute the following code example, you will get back all reachable nodes from the startNode and you could iterate over them and extract all their properties.
td.traverse(startNode).nodes();
I need alle the nodes, because I have to extract their properties and store it in a own data structure.
Is a similar solution with the OrientDB Native Fluent API possible? I checked the manual Java Traverse, but I'm not able to deviate a working solution from the given examples. If I'm execute the following code example I just get the properties and the IN and OUT connections of the target.
for (OIdentifiable oi : new OTraverse().field("in").field("out").target(new ORecordId("#24:0"))) {
System.out.println(oi);
}
For me it looks like, that you only get back the possible connections to other nodes.
Is there any chance, with the Native API to get each vertex and the properties of each vertex?
Hopefully my question is understandable.
Thanks in advance

I think I'm working on a similar problem. I'm using python (PyOrient) to connect to OrientDB and I'm generally using the SQL syntax. I've been trying to leverage either TRAVERSE or a fetchPlan.
When I use TRAVERSE I'm able to get all of the vertices and eges with properties with a query like this in console (same result in studio):
orientdb {db=FetchPlanTesting}> traverse * from level01
----+------+---------+------------------+------------+------+------+-------------
# |#RID |#CLASS |name |in_belongsTo|out |in |out_belongsTo
----+------+---------+------------------+------------+------+------+-------------
0 |#11:0 |Level01 |Item01_at_Level01 |[size=2] |null |null |null
1 |#20:1 |belongsTo|null |null |#12:1 |#11:0 |null
2 |#12:1 |Level02 |Item02_at_Level02 |[size=2] |null |null |[size=1]
3 |#20:4 |belongsTo|null |null |#13:0 |#12:1 |null
4 |#13:0 |Level03 |Item01_at_Level03 |[size=2] |null |null |[size=1]
5 |#20:19|belongsTo|null |null |#14:7 |#13:0 |null
6 |#14:7 |Level04 |Item08_at_Level04 |[size=2] |null |null |[size=1]
7 |#20:34|belongsTo|null |null |#15:6 |#14:7 |null
8 |#15:6 |Level05 |Item07_at_Level05 |null |null |null |[size=1]
9 |#20:50|belongsTo|null |null |#15:22|#14:7 |null
10 |#15:22|Level05 |Item023_at_Level05|null |null |null |[size=1]
11 |#20:27|belongsTo|null |null |#14:15|#13:0 |null
12 |#14:15|Level04 |Item016_at_Level04|[size=2] |null |null |[size=1]
13 |#20:42|belongsTo|null |null |#15:14|#14:15|null
14 |#15:14|Level05 |Item015_at_Level05|null |null |null |[size=1]
15 |#20:58|belongsTo|null |null |#15:30|#14:15|null
16 |#15:30|Level05 |Item031_at_Level05|null |null |null |[size=1]
17 |#20:8 |belongsTo|null |null |#13:4 |#12:1 |null
18 |#13:4 |Level03 |Item05_at_Level03 |[size=2] |null |null |[size=1]
19 |#20:15|belongsTo|null |null |#14:3 |#13:4 |null
----+------+---------+------------------+------------+------+------+-------------
LIMIT EXCEEDED: resultset contains more items not displayed (limit=20)
The only property I have on any nodes is name. I seem to recall that I could filter out the belongsTo edges (or at least only look for specific classes), but I've switched to trying a fetchPlan so I can get a nested JSON blob returned instead. That query is easy enough (`, but I'm trying to figure out how best to limit the depth and not actually return the edges as part of the result.
Here's that query and result in console (same in studio):
orientdb {db=FetchPlanTesting}> SELECT #this.toJSON('fetchPlan:*:-1') from 11:0
----+------+--------------------------------------------------------------------------------------------------------------------------------------------
# |#CLASS|this
----+------+--------------------------------------------------------------------------------------------------------------------------------------------
0 |null |{"name":"Item01_at_Level01","in_belongsTo":[{"out":{"name":"Item02_at_Level02","in_belongsTo":[{"out":{"name":"Item01_at_Level03","in_bel...
----+------+--------------------------------------------------------------------------------------------------------------------------------------------
1 item(s) found. Query executed in 0.024 sec(s).

Related

Data Modeling - Slow Changing Dimension type 2: How to deal with schema change (column added)?

What is the best practice to deal with schema-changing when building a Slow Changing Dimension table?
For example, a column was added:
First state:
+----------+---------------------+-------------------+
|customerId|address |updated_at |
+----------+---------------------+-------------------+
|1 |current address for 1|2018-02-01 00:00:00|
+----------+---------------------+-------------------+
New state with new column, but every other followed column constant:
+----------+---------------------+-------------------+------+
|customerId|address |updated_at |newCol|
+----------+---------------------+-------------------+------+
|1 |current address for 1|2018-03-03 00:00:00|1000 |
+----------+---------------------+-------------------+------+
My first approach is to think that schema-changing means the row has changed. So I would add a new row to my SCD table:
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|customerId|address |updated_at |newCol|active_status|active_status_start|active_status_end |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|1 |current address for 1|2018-02-01 00:00:00|null |false |2018-02-01 00:00:00|2018-03-03 00:00:00|
|1 |current address for 1|2018-03-03 00:00:00|1000 |true |2018-03-03 00:00:00|null |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
But, what if the columns were added, but for some specific row the value is null? For example, for row with customerId = 2, it is null:
+----------+---------------------+-------------------+------+
|customerId|address |updated_at |newCol|
+----------+---------------------+-------------------+------+
|2 |current address for 2|2018-03-03 00:00:00|null |
+----------+---------------------+-------------------+------+
In this case, I can take two approaches:
Consider every schema change as a row change, even for null rows (much easier to implement, but costlier from a storage perspective). It would result in:
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address |updated_at |active_status|active_status_end |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1 |current address for 1|2018-02-01 00:00:00|false |2018-03-03 00:00:00|2018-02-01 00:00:00|null |
|1 |current address for 1|2018-03-03 00:00:00|true |null |2018-03-03 00:00:00|1000 |
|2 |current address for 2|2018-02-01 00:00:00|false |2018-03-03 00:00:00|2018-02-01 00:00:00|null |
|2 |current address for 2|2018-03-03 00:00:00|true |null |2018-03-03 00:00:00|null |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
Do a check for every row, and if it has an actual value for this new column, add it; otherwise, don't do anything to this row (for now, I didn't come up with implementation to it, but it is much more complicated and likely to be error-prone). The result in SCD table for row 2 would be 'row has not changed':
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address |updated_at |active_status|active_status_end |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1 |current address for 1|2018-02-01 00:00:00|false |2018-03-03 00:00:00|2018-02-01 00:00:00|null |
|1 |current address for 1|2018-03-03 00:00:00|true |null |2018-03-03 00:00:00|1000 |
|2 |current address for 2|2018-02-01 00:00:00|true |null |2018-02-01 00:00:00|null |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
The second approuch seems more "correct", but am I right? Also, implement approuch 1 is much simpler. Approuch 2 would need some more complicated and has other trade-offs, for example:
a) What if instead of adding a columns, a columnd was droped?
b) From a query persperctive it is much more costlier.
I have done research on the subject and didn't fount this kind of situation being treated.
What is the standard approach to it? Trade-offs? Is there another approach I am missing here?
Thank you all.

Thanks for #MarmiteBomber and #MatBailie comments. Based on your comments I ended up implementing the second option, because (summary of your thoughts):
The second approach is the only one meaningful.
Implementation is a consequence of business logic, not necessarily a standard practice. In our case, we didn't need to differentiate types of nulls, so the right approach was to encapsulate known non-existing values as null, as well as unknown values, etc.
Be explicit.
The second approach also needed to add a check (is the new column present in the row?) in write time, but it saves complexity in query time, and storage. Since SCD is "slow" and this case is rare (schema changes happen, but not "every day"), adding the check in write time is better than in query time.

Getting start and end indices of string in Pandas

I have a df that looks like this:
|Index|Value|Anomaly|
---------------------
|0 |4 | |
|1 |2 |Anomaly|
|2 |1 |Anomaly|
|3 |2 | |
|4 |6 |Anomaly|
I want to get the start and end indices of the consecutive anomaly counts so in this case, it will be [[1,2],[4]]
I understand I have to use .shift and .cumsum but I am lost and I hope someone would be able to enlighten me.

Get consecutive groups taking the cumsum of the Boolean Series that checks where the value is not 'Anomoly'. Use where so that we only only take the 'Anomoly' rows. Then we can loop over the groups and grab the indices.
m = df['Anomaly'].ne('Anomaly')
[[idx[0], idx[-1]] if len(idx) > 1 else [idx[0]]
for idx in df.groupby(m.cumsum().where(~m)).groups.values()]
#[[1, 2], [4]]
Or if you want to use a much longer groupby you can get the first and last index, then drop duplicates (to deal with streaks of only 1) and get it into a list of lists. This is much slower though
(df.reset_index().groupby(m.cumsum().where(~m))['index'].agg(['first', 'last'])
.stack()
.drop_duplicates()
.groupby(level=0).agg(list)
.tolist())
#[[1, 2], [4]]

SQL Output Rows as columns

I have a table that tests an item and stores any faliures similar to:
Item|Test|FailureValue
1 |1a |"ZZZZZZ"
1 |1b | 123456
2 |1a |"MMMMMM"
2 |1c | 111111
1 |1d |"AAAAAA"
Is there a way in SQL to essential pivot these and have the failure values be output to individual columns? I know that I can already use STUFF to achieve what I want for the Test field but I would like the results as individual columns if possible.
I'm hoping to achieve something like:
Item|Tests |FailureValue1|FailureValue2|FailureValue3|Failure......
1 |1a,1b |"ZZZZZZ" |123456 |NULL |NULL ......
2 |1a,1b |"MMMMMM" |111111 |"AAAAAA" |NULL ......
Kind regards
Matt

Appropriate idea or SQL to obtain the results set

Fig 1
TxnId | TxnTypeId |BranchId |TxnNumber |LocalAmount |ItemName
--------------------------------|-----------|---------------|----------
1777486 | 101 |1099 |1804908 |65.20000000 |A
1777486 | 101 |1099 |1804908 |324.50000000 |B
1777486 | 101 |1099 |1804908 |97.20000000 |C
1777486 | 101 |1099 |1804908 |310.00000000 |D
1777486 | 101 |1099 |1804908 |48.90000000 |E
Fig 2
TxnId |TxnTypeId |BankId |Number |Check |Bank |Cash |Wallet
--------|-----------|-------|--------|-------|------|------|------
1777486 |101 |1099 |1804908 | 48.9 | 310 |389.7 |97.2
Fig 3 (Expected Output)
TxnId |BankId |ItemName |Amount |Wallet |Bank |Check |Cash
--------|-------|-----------|-------|-------|-------|-------|-------
1777486 |1099 |A |65.2 |0 0 |0 |0 |65.2
1777486 |1099 |B |324.5 |0 0 |0 |0 |324.5
1777486 |1099 |C |97.2 |97.2 |0 |0 |0
1777486 |1099 |D |48.9 |0 |0 |48.9 |0
1777486 |1099 |E |310 |0 |310 |0 |0
I have two different result set that is obtained from the different query.
Fig 1 and Fig 2.
The Result i wanted is like shown in fig 3.
Currently i do not have the flag to identify the payment mode use for each transaction(each item). I have the flag for only the complete transaction.
Fig 4
IndividualTxnPaymentDetailId| IndividualTxnId |PaymentAmount |PaymentMode
---------------------------:|:-----------------:|:-------------:|:--------------
2106163 | 1777486 |389.70000000 | Cash
2106164 | 1777486 |97.20000000 | Wallet
2106165 | 1777486 |310.00000000 | Bank
2106166 | 1777486 |48.90000000 | Check
Means if two item or more is purchased using one payment mode i do not have the proper way of identifying the payment done for each item.
Item A and B is purchased using cash as payment mode with the amount 65.2 and 324.5. Total Cash paid is 389.7
Item C is purchased using Wallet as payment mode with amount 97.2. Total Wallet amount is 97.2.
Fig 5
TxnId |LocalAmount |ItemName
--------|--------------:|:------------
1777486 |65.20000000 | A
1777486 |324.50000000 | B
1777486 |97.20000000 | C
1777486 |310.00000000 | D
1777486 |48.90000000 | E
Query by which i generated the result in Fig 4 and Fig 5
select IndividualTxnPaymentDetailId, IndividualTxnId, PaymentAmount, cc.choicecode as PaymentMode
from dbo.IndividualTxnPaymentDetail it
inner join configchoice cc on cc.configchoiceid= it.configpaymentmodeid
where IndividualTxnId = 1777486
select IndividualTxnId as TxnId, LocalAmount, CurrencyName from dbo.IndividualTxnFCYDetail where IndividualTxnId = 1777486
This is the query written to identify the transaction made through Bank. Similarly i wanted to get the transaction on all the payment mode. But could not obtain the transaction properly.
CASE
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount < 0 THEN 0
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount > txn.LocalAmount THEN txn.LocalAmount
WHEN tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount > tpm.Bank THEN tpm.Bank
ELSE tpm.Bank - SUM(txn.LocalAmount) OVER (PARTITION BY txn.BranchId, txn.TxnNumber ORDER BY CAST(txn.ItemName AS varchar(300))) + txn.LocalAmount
END AS Bank,
Can you help me to get the idea or with some sql to get the result set as in fig 3.

Updated Question - Updated Responce
I read your updated question and I'm afraid the problem still stands. Neither of those queries are summing the data - they are just pulling the same already summed numbers. You would either need to get at the numbers prior to the aggregation happening -or- to have some column in your IndividualTxnPaymentDetail table that ties each row to its counterpart rows in the other table (presumably through a cross table as in - Row 1 : ItemName A, Row 1 : ItemName B, Row 2 : ItemName C, etc).
If these are simply impossible, then perhaps your approaching this the wrong way, or to put it better, perhaps you are being asked to do something that doesn't make sense - and provable so. If there is no direct relationship between these activities in the data there's not much you can be expected to do. What's more it may indicate that your organization doesn't 'think' about them that way.
These two tables seem be payments and liabilities. Perhaps consider an approach where each payment goes toward what ever the oldest outstanding balance is and are matched to the items in Fig 4 that way. Add a column to the details table to store payment toward that item. Rather than a simple Paid/Unpaid Boolean, I would store the amount of payment that has been applied toward each item or the amount still owed on each item; that way you can handle partially applied payments. As payments come in, apply them. You would likely want a similar column in the payments table too to measure the amount of each payment that you have applied; that way you can handle over-payments, and be able to know the status of things such as pending receipts in the case that payments aren't applied immediately.
I hope this helps.

Fundamental Flaw
Your question is looking to take aggregated data (in your example, the Fig 2 Cash total of 389.7) and tease out what numbers were totaled to get the sum. You can do it here since 3 of the 4 numbers in Fig 2 are unique, one-to-one matches with numbers in Fig 1 - meaning the remaining ones have to belong to each other. But imagine 100s of numbers, many or most of them sums (i.e. not one-to-one matches like most of these). Or imagine an example as simple as yours except the numbers aren't so unique (e.g. Fig 1 = (10, 10, 10, 10, 20) and Fig 2 = (10, 20, 20, 10) - it is not possible to say which ones are which) and there only needs to be two possible combinations that could be responsible for a particular sum for the results to become ambiguous.
The weakness is in Fig 2. Do you have any control over that data source? Can grab the numbers up-stream before they are totaled?
Sorry for the negative conclusion but...
I hope this helps.

The Continuing Saga
Comment: [A version of this] report has already been made ...[but] I cannot contact the person who actually wrote that thing.
Perhaps he was also asked to do something that didn't make sense but did it anyway. The math simply doesn't work. He may have written something that finds as many one-to-one matches as it can and then sort of rolls the dice on the rest of it. He may have done something like the following:
Find and eliminate all the one-to-one matches.
Take any total and subtract any item amount from it to see if it
matches any remaining item amounts(s), if so, arbitrarily pick one,
eliminate all three numbers.
Repeat this until all combinations have been tested.
But you are still potentially left with unmatched numbers, so you next need to test for sums of three numbers by:
Arbitrarily subtract any two item amounts from any of the remaining
totals.
and so on and so on, followed by testing for sums of four items and so on.

I think part of what you're looking for is buried in here:
http://www.itprotoday.com/software-development/algorithms-still-matter
it calls it 'order fulfillment' where you go through transactions, combining them until you reach a given total
I think the solution will be in multiple parts, including cursors etc.
I'm not convinced you would be able to understand or implement any solution posted. Also, I maintain that there are cases where there are ambiguous solutions.
Lastly I see you have asked 16 questions and not marked a single one as answered.

find the closest time between two tables in spark

I am using pyspark and I have two dataframes like this:
user time bus
A 2016/07/18 12:00:00 1
B 2016/07/19 12:00:00 2
C 2016/07/20 12:00:00 3
bus time stop
1 2016/07/18 11:59:40 sA
1 2016/07/18 11:59:50 sB
1 2016/07/18 12:00:05 sC
2 2016/07/19 11:59:40 sB
2 2016/07/19 12:00:10 sC
3 2016/07/20 11:59:55 sD
3 2016/07/20 12:00:10 sE
Now I want to know at which stop the user reports according to the bus number and the closest time in the second table.
For example, in table 1, user A reports at 2016/07/18 12:00:00 and he is on bus No.1, and according to the second table, there are three records of bus No.1, but the closest time is 2016/07/18 12:00:05(the third record), so the user is in sC now.
The desired output should be like this:
user time bus stop
A 2016/07/18 12:00:00 1 sC
B 2016/07/19 12:00:00 2 sC
C 2016/07/20 12:00:00 3 sD
I have transferred the time into timestamp so that the only problem is to find the closest timestamp where the bus number is eqaul.
Because I'm not familiar with sql right now, I tried to use map function to find the closest time and its stop, which means I have to use sqlContext.sql in the map function, and spark dosen't seem to allow this:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
So how can I write a sql query to get the right output?

This can be done using window functions.
from pyspark.sql.window import Window
from pyspark.sql import Row, functions as W
def tm(str):
return datetime.strptime(str, "%Y/%m/%d %H:%M:%S")
#setup data
userTime = [ Row(user="A",time=tm("2016/07/18 12:00:00"),bus = 1) ]
userTime.append(Row(user="B",time=tm("2016/07/19 12:00:00"),bus = 2))
userTime.append(Row(user="C",time=tm("2016/07/20 12:00:00"),bus = 3))
busTime = [ Row(bus=1,time=tm("2016/07/18 11:59:40"),stop = "sA") ]
busTime.append(Row(bus=1,time=tm("2016/07/18 11:59:50"),stop = "sB"))
busTime.append(Row(bus=1,time=tm("2016/07/18 12:00:05"),stop = "sC"))
busTime.append(Row(bus=2,time=tm("2016/07/19 11:59:40"),stop = "sB"))
busTime.append(Row(bus=2,time=tm("2016/07/19 12:00:10"),stop = "sC"))
busTime.append(Row(bus=3,time=tm("2016/07/20 11:59:55"),stop = "sD"))
busTime.append(Row(bus=3,time=tm("2016/07/20 12:00:10"),stop = "sE"))
#create RDD
userDf = sc.parallelize(userTime).toDF().alias("usertime")
busDf = sc.parallelize(busTime).toDF().alias("bustime")
joinedDF = userDf.join(busDf,col("usertime.bus") == col("bustime.bus"),"inner").select(
userDf.user,
userDf.time.alias("user_time"),
busDf.bus,
busDf.time.alias("bus_time"),
busDf.stop)
additional_cols = joinedDF.withColumn("bus_time_diff", abs(unix_timestamp(col("bus_time")) - unix_timestamp(col("user_time"))))
partDf = additional_cols.select("user","user_time","bus","bus_time","stop","bus_time_diff", W.rowNumber().over(Window.partitionBy("user","bus").orderBy("bus_time_diff") ).alias("rank") ).filter(col("rank") == 1)
additional_cols.show(20,False)
partDf.show(20,False)
Output:
+----+---------------------+---+---------------------+----+-------------+
|user|user_time |bus|bus_time |stop|bus_time_diff|
+----+---------------------+---+---------------------+----+-------------+
|A |2016-07-18 12:00:00.0|1 |2016-07-18 11:59:40.0|sA |20 |
|A |2016-07-18 12:00:00.0|1 |2016-07-18 11:59:50.0|sB |10 |
|A |2016-07-18 12:00:00.0|1 |2016-07-18 12:00:05.0|sC |5 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 11:59:40.0|sB |20 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 12:00:10.0|sC |10 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 11:59:55.0|sD |5 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 12:00:10.0|sE |10 |
+----+---------------------+---+---------------------+----+-------------+
+----+---------------------+---+---------------------+----+-------------+----+
|user|user_time |bus|bus_time |stop|bus_time_diff|rank|
+----+---------------------+---+---------------------+----+-------------+----+
|A |2016-07-18 12:00:00.0|1 |2016-07-18 12:00:05.0|sC |5 |1 |
|B |2016-07-19 12:00:00.0|2 |2016-07-19 12:00:10.0|sC |10 |1 |
|C |2016-07-20 12:00:00.0|3 |2016-07-20 11:59:55.0|sD |5 |1 |
+----+---------------------+---+---------------------+----+-------------+----+

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas