How can I efficiently create unique relationships in Neo4j? - optimization

Following up on my question here, I would like to create a constraint on relationships. That is, I would like there to be multiple nodes that share the same "neighborhood" name, but each uniquely point to a particular city in which they reside.
As encouraged in user2194039's answer, I am using the following index:
CREATE INDEX ON :Neighborhood(name)
Also, I have the following constraint:
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
The following code fails to create unique relationships, and takes an excessively long period of time:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});
Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).
Profile with Limit 10,000:
+--------------+------+--------+---------------------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+---------------------------+------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 9750 | 3360 | anon[307], b, neighborhood, line | MergePattern |
| SchemaIndex | 9750 | 19500 | b, line | line.City; :City(name) |
| ColumnFilter | 9750 | 0 | line | keep columns line |
| Filter | 9750 | 0 | anon[220], line | anon[220] |
| Extract | 10000 | 0 | anon[220], line | anon[220] |
| Slice | 10000 | 0 | line | { AUTOINT0} |
| LoadCSV | 10000 | 0 | line | |
+--------------+------+--------+---------------------------+------------------------------+
Total database accesses: 22860
Following Guilherme's recommendation below, I implemented the helper yet it is raising the error py2neo.error.Finished. I've searched the documentation, and wasn't able to determine a work around from this. It looks like there's an open SO post about this exception.
def run_batch_query(queries, timeout=None):
if timeout:
http.socket_timeout = timeout
try:
graph = Graph()
authenticate("localhost:7474", "account", "password")
tx = graph.cypher.begin()
for query in queries:
statement, params = query
tx.append(statement, params)
results = tx.process()
tx.commit()
except http.SocketError as err:
raise err
except error.Finished as err:
raise err
collection = []
for result in results:
records = []
for record in result:
records.append(record)
collection.append(records)
return collection
main:
queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
# city_neighborhood_map is a defaultdict that maps city-> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
for neighborhood in neighborhoods:
params = dict(city=city, neighborhood=neighborhood)
queries.append((statement, params))
c +=1
if c % batch == 0:
print "running batch"
print c
s = time.time()*1000
r = run_batch_query(queries, 10)
e = time.time()*1000
print("\t{0}, {1:.00f}ms".format(c, e-s))
del queries[:]
print c
if queries:
s = time.time()*1000
r = run_batch_query(queries, 300)
e = time.time()*1000
print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))

If you want to create unique relationships you have 2 options:
Prevent the path from being duplicated, using MERGE, just like #user2194039 suggested. I think this is the simplest, and best approach you can take.
Turn your relationship into a node, and create an unique constraint on it. But it's hardly necessary for most cases.
If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through IMPORT CSV in 2.2.1, and I it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process, and probably gain a performance boost.
I managed to import 1M rows in just under 11 minutes.
I used an INDEX for Neighbourhood(name) and a unique constraint for City(name).
Give it a try and see if it works for you.
Edit:
The transactional endpoint is a restful endpoint that allows you do execute transactions in batch. You can read about it here.
Basically, it allows you to stream a bunch of queries to the server at once.
I don't know what programming language/stack you're using, but in python, using a package like py2neo, it would be something like this:
with open("city.csv", "r") as fp:
reader = csv.reader(fp)
queries = []
template = ["MERGE (c :`City` {name: {city}})",
"MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
for row in reader:
city, neighborhood = row
params = dict(city=city, neighborhood=neighborhood)
queries.append((statement, params))
if c % batch == 0:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 10)
e = time.time()*1000
print("\t{0}, {1:.00f}ms".format(c, e-s))
del queries[:]
c += 1
if queries:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 300)
e = time.time()*1000
print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))
Helper functions:
def run_batch_query(queries, timeout=None):
if timeout:
http.socket_timeout = timeout
try:
graph = Graph(uri) # "{protocol}://{host}:{port}/db/data/"
tx = graph.cypher.begin()
for query in queries:
statement, params = query
tx.append(statement, params)
results = tx.process()
tx.commit()
except http.SocketError as err:
raise err
collection = []
for result in results:
records = []
for record in result:
records.append(record)
collection.append(records)
return collection
You will monitor how long each transaction takes, and you can tweak the number of queries per transactions, as well as the timeout.

To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speeds on even small databases.
CREATE INDEX ON :Neighborhood(name)
And for the import:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})
If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.
Best of luck!

Related

Apache Beam Pipeline Write to Multiple BQ tables

I have a scenario where I need to do the following:
Read data from pubsub
Apply multiple Transformations to the data.
Persist the PCollection in multiple Google Big Query based on some config.
My question is how can I write data to multiple big query tables.
I searched for multiple bq writes using apache beam but could not find any solution
You can do that with 3 sinks, example with Beam Python :
def map1(self, element):
...
def map2(self, element):
...
def map3(self, element):
...
def main() -> None:
logging.getLogger().setLevel(logging.INFO)
your_options = PipelineOptions().view_as(YourOptions)
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
result_pcollection = (
p
| 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription')
| 'Map 1' >> beam.Map(map1)
| 'Map 2' >> beam.Map(map2)
| 'Map 3' >> beam.Map(map3)
)
(result_pcollection |
'Write to BQ table 1' >> beam.io.WriteToBigQuery(
project='project_id',
dataset='dataset',
table='table1',
method='STREAMING_INSERTS',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
(result_pcollection |
'Write to BQ table 2' >> beam.io.WriteToBigQuery(
project='project_id',
dataset='dataset',
table='table2',
method='STREAMING_INSERTS',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
(result_pcollection_pub_sub |
'Write to BQ table 3' >> beam.io.WriteToBigQuery(
project='project_id',
dataset='dataset',
table='table3',
method='STREAMING_INSERTS',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
if __name__ == "__main__":
main()
The first PCollection is the result of input from PubSub.
I applied 3 transformations in the input PCollection
Sink the result to the 3 different Bigquery table
res = Flow
=> Map 1
=> Map 2
=> Map 3
res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`
In this example I used STREAMING_INSERT for ingestion to Bigquery tables, but you can adapt and change it if needed in your case.
I see the previous answers satisfy your requirement of writing the same result to multiple tables. However, I assume the below scenarios, provide a bit different pipeline.
Read data from PubSub
Filter the data based on configs (from event message keys)
Apply the different/same transformation to the filtered collections
Write results from previous collections to different BigQuery Sinks
Here, we filtered the events at early stages in the pipeline, this is helpful in:
Avoid processing the same event messages multiple times.
You can skip the messages which are not needed.
Apply relevant transformations to event messages.
Overall efficient and cost-effective system.
For example, you are processing messages from all around the world and you need to process and store the data with respect to geography - storing Europe messages in the Europe region.
Also, you need to apply transformations which are relevant to the country-specific data - add an Aadhar number to messages generated from India and Social Security number to messages generated from the USA.
And you don't want to process/store any events from specific countries - data from oceanic countries are irrelevant and not required to process/stored in our use case.
So, in this made-up example, filtering the data (based on the config) at the early stage, you will be able to store country-specific data (multiple sinks), and you don't have to process all events generated from the USA/any other region for adding an Aadhar number (event specific transformations) and you will be able to skip/drop the records or simply store them in BigQuery without applying any transformations.
If the above made-up example resembles your scenario, the sample pipeline design may look like this
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions,...
from apache_beam.io.gcp.internal.clients import bigquery
class TaggedData(beam.DoFn):
def process(self, element):
try:
# filter here
if(element["country"] == "in")
yield {"indiaelements:taggedasindia"}
if(element["country"] == "usa")
yield {"usaelements:taggedasusa"}
...
except:
yield {"taggedasunprocessed"}
def addAadhar(element):
"Filtered messages - only India"
yield "elementwithAadhar"
def addSSN(element):
"Filtered messages - only USA"
yield "elementwithSSN"
p = beam.Pipeline(options=options)
messages = (
p
| "ReadFromPubSub" >> ...
| "Tagging >> "beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', ...)
)
india_messages = (
messages.india
| "AddAdhar" >> ...
| "WriteIndiamsgToBQ" >> streaming inserts
)
usa_messages = (
messages.usa
| "AddSSN" >> ...
| "WriteUSAmsgToBQ" >> streaming inserts
)
oceania_messages = (
messages.oceania
| "DoNothing&WriteUSAmsgToBQ" >> streaming inserts
)
deadletter = (
(messages.unprocessed, stage1.failed, stage2.failed)
| "CombineAllFailed" >> Flatn...
| "WriteUnprocessed/InvalidMessagesToBQ" >> streaminginserts...
)

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
so far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, but it also does not provide the hierarchy
I can of course do a path query (MATCH path = MATCH...) but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as rec
CALL {
WITH rec
MATCH (rec)-[r:canGenerate]-(g:generator)
return r.GenCost as GenCost, g.name AS GenName
}
return rec.name, GenCost , GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get double entries in the list, but what I want is separate lists for each alternative possibility, so that I can generate the cost.
This is my pattern tree:
Awesome
input1
input2
input 3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: Two lists (or n, in the general case, where I might have n possible paths), one
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative set, so that I can calculate the costs and compare.
I have no idea how to do something like that. Any suggestions?

Create variable based on value in multiple columns?

There is a rather large Stata dataset (education) with 60+ variables devoted to 'exam taken' information and a few others based on student gender, age, demographics, etc. There are tens of thousands of students (rows). Unfortunately the grades on various tests are not standard (combo of letters and numbers, and may appear in any of the 60+ columns for each student, depending on when they took the relevant exam). I'm trying to create a new variable, identifying all those who took some variation of the G40 or G41 exam at this time. The grade columns are all assigned as dx with a number, so I've started by trying the following:
gen byte event = 0
replace event = 1 if dx1 == "G40" | dx1 == "G41"| dx2 == "G40" | dx2 == "G41" | dx3 == "G40" | dx3 == "G41" | dx4 == "G40" | dx4 == "G41" | dx5 == "G40" | dx5 == "G41" & age < 12
I don't want to write out every single one of the 60+ columns each time I'm making a new variable for a new exam. Is there a faster way of doing this?
I am going to show two techniques, as one is good for the smaller code example you give and one is better for 60+ "columns" (variables!).
Just your example I would tend to write as one line
gen byte event = ( inlist("G40", dx1, dx2, dx3, dx4, dx5) | ///
inlist("G41", dx1, dx2, dx3, dx4, dx5) ) & age < 12
For 60+ such variables I would write a loop.
gen byte event = 0
foreach v of var dx* {
display "`v' " _c
replace event = 1 if inlist(`v', "G40", "G41") & age < 12
}
where for purposes of debugging, or just understanding, the output is noisier than would be customary once the operations seem routine. A standard trick with inlist() is to note that a test of the form foo == whatever is the same as a test of whatever == foo so there is often a choice about which argument is first and which other argument(s) follow.

How to ignore other instances of a word when looping it via pandas.groupby.agg?

I have a code (see below) that I used to match word occurrences per Location. My problem is that it reads all instances of the word.
FOR EXAMPLE: This is what I was hoping it to do, but the code below counted all occurrences of 'help' including 'helping' and 'helped'
tidytext2 | Location | occurrences
she used to help me | Aus | 1
help is on the way | UK | 1
Helping is a kind gift | UK | 0
She helped me when I needed it | Japan | 0
Why dont u help me? | SA | 1
Help me! Im hungry help | Rwanda | 2
words = [i[0] for i in pos_freq.most_common()]
for i in words:
positivedf[i] = positivedf.tidytext2.str.count(i)
funs = {i: 'sum' for i in words}
groupedpos = positivedf.groupby('Location').agg(funs)
I got positive_freq.most_common() using the following codes. It returns
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string
def process_text(text):
tokens = []
for line in text:
toks = tokenizer.tokenize(line)
toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
tokens.extend(toks)
return tokens
tokenizer=TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct
pos_lines = list(positivedf.tidytext2)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
[('help', 7)]
You need to use regex for this:
for i in words:
positivedf[i] = positivedf.tidytext2.str.count(r'(?<!\S)'+i+'(?!\S)')
If you want to be case insensitive:
for i in words:
positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)'+i+'(?!\S)')

SQL - postgres - shortest path in graph - recursion

I have a table which contains the edges from node x to node y in a graph.
n1 | n2
-------
a | a
a | b
a | c
b | b
b | d
b | c
d | e
I would like to create a (materialized) view which denotes the shortest number of nodes/hops a path contains to reach from x to node y:
n1 | n2 | c
-----------
a | a | 0
a | b | 1
a | c | 1
a | d | 2
a | e | 3
b | b | 0
b | d | 1
b | c | 1
b | e | 2
d | e | 1
How should I model my tables and views to facilitate this? I guess I need some kind of recursion, but I believe that is pretty difficult to accomplish in SQL. I would like to avoid that, for example, the clients need to fire 10 queries if the path happens to contain 10 nodes/hops.
This works for me, but it's kinda ugly:
WITH RECURSIVE paths (n1, n2, distance) AS (
SELECT
nodes.n1,
nodes.n2,
1
FROM
nodes
WHERE
nodes.n1 <> nodes.n2
UNION ALL
SELECT
paths.n1,
nodes.n2,
paths.distance + 1
FROM
paths
JOIN nodes
ON
paths.n2 = nodes.n1
WHERE
nodes.n1 <> nodes.n2
)
SELECT
paths.n1,
paths.n2,
min(distance)
FROM
paths
GROUP BY
1, 2
UNION
SELECT
nodes.n1,
nodes.n2,
0
FROM
nodes
WHERE
nodes.n1 = nodes.n2
Also, I am not sure how good it will perform against larger datasets. As suggested by Mark Mann, you may want to use a graph library instead, e.g. pygraph.
EDIT: here's a sample with pygraph
from pygraph.algorithms.minmax import shortest_path
from pygraph.classes.digraph import digraph
g = digraph()
g.add_node('a')
g.add_node('b')
g.add_node('c')
g.add_node('d')
g.add_node('e')
g.add_edge(('a', 'a'))
g.add_edge(('a', 'b'))
g.add_edge(('a', 'c'))
g.add_edge(('b', 'b'))
g.add_edge(('b', 'd'))
g.add_edge(('b', 'c'))
g.add_edge(('d', 'e'))
for source in g.nodes():
tree, distances = shortest_path(g, source)
for target, distance in distances.iteritems():
if distance == 0 and not g.has_edge((source, target)):
continue
print source, target, distance
Excluding the graph building time, this takes 0.3ms while the SQL version takes 0.5ms.
Expanding on Mark's answer, there are some very reasonable approaches to explore a graph in SQL as well. In fact, they'll be faster than the dedicated libraries in perl or python, in that DB indexes will spare you the need to explore the graph.
The most efficient of index (if the graph is not constantly changing) is a nested-tree variation called the GRIPP index. (The linked paper mentions other approaches.)
If your graph is constantly changing, you might want to adapt the nested intervals approach to graphs, in a similar manner that GRIPP extends nested sets, or to simply use floats instead of integers (don't forget to normalize them by casting to numeric and back to float if you do).
Rather than computing these values on the fly, why not create a real table with all interesting pairs along with the shortest path value. Then whenever data is inserted, deleted or updated in your data table, you can recalculate all of the shortest path information. (Perl's Graph module is particularly well-suited to this task, and Perl's DBI interface makes the code straightforward.)
By using an external process, you can also limit the number of recalculations. Using PostgreSQL triggers would cause recalculations to occur on every insert, update and delete, but if you knew you were going to be adding twenty pairs of points, you could wait until your inserts were completed before doing the calculations.