I am working with around 50,000 tweets as nodes, each having data similar to what is shown below.
{
"date": "2017-05-26T09:50:44.000Z",
"author_name": "djgoodlook",
"share_count": 0,
"mention_name": "firstpost",
"tweet_id": "868041705257402368",
"mention_id": "256495314",
"location": "pune india",
"retweet_id": "868039862774931456",
"type": "Retweet",
"author_id": "103535663",
"hashtag": "KamalHaasan"
}
I have tried to create relationships between tweets that have the same location by using the following command:
MATCH (a:TweetData),(b:TweetData)
WHERE a.location = b.location AND NOT a.tweet_id = b.tweet_id
CREATE (a)-[r:SameLocation]->(b)
RETURN r
Using this command I wasn't able to create the relationships; it ran for more than 20 hours and still didn't produce any results. A similar command for the hashtag relationship worked fine and took around 5 minutes.
Is there any other method to create these relationships, or any way to optimise this query?
Yes. First, make sure you have an index on :TweetData(location). That's the most important change, since without it every node lookup has to scan all 50k :TweetData nodes for a matching location (that's 50k^2 lookups).
Next, it's better to ensure one node's id is less than the other's; otherwise you'll get each pair of nodes twice with the order reversed, resulting in two relationships per pair, one in each direction, instead of the single relationship you want.
Lastly, do you really need to return all the relationships? That may kill your browser; consider returning just the count of relationships created.
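For reference, the index can be created with a single statement (a sketch; the exact syntax depends on your Neo4j version):
CREATE INDEX ON :TweetData(location)
With that in place, the rewritten query is: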
MATCH (a:TweetData)
MATCH (b:TweetData)
WHERE a.location = b.location AND a.tweet_id < b.tweet_id
CREATE (a)-[r:SameLocation]->(b)
RETURN count(r)
One other thing to (strongly) consider: instead of tracking common locations this way, create a :Location node and link all :TweetData nodes with that location to it.
You will need an index or unique constraint on :Location(name), then:
MATCH (a:TweetData)
MERGE (l:Location {name:a.location})
CREATE (a)-[:LOCATION]->(l)
This approach also lends itself more easily to batching, if 50k nodes at once is too much; you can just use SKIP and LIMIT after your match on a.
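As a rough sketch, the constraint plus one batch of the linking query could look like this (the constraint syntax is Neo4j 2.x/3.x style, and the SKIP/LIMIT values are only an illustration):
CREATE CONSTRAINT ON (l:Location) ASSERT l.name IS UNIQUE

MATCH (a:TweetData)
WITH a SKIP 0 LIMIT 10000
MERGE (l:Location {name:a.location})
CREATE (a)-[:LOCATION]->(l)
You would then repeat the second statement with increasing SKIP offsets until all 50k tweets are linked.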
I have multiple sets of data which are sourced from an Entity Framework code-first context (SQL CE). There's a GUI which displays the number of records in each query set, and upon changing some set condition (e.g. Date), the sets all need to recalculate their "count" value.
While every set's query is slightly different in some way, most of them share common conditions in some way. A simple example:
RelevantCustomers = People.Where(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0 && P.Type == "Customer")
RelevantSuppliers = People.Where(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0 && P.Type == "Supplier")
So the thing is, there are enough of these demanding queries that each time the user changes some condition (e.g. SelectedDate), it takes a really long time to recalculate the number of records in each set.
I realise that part of the reason for this is the need to query through, for example, the transactions each time to check what is really the same condition for both RelevantCustomers and RelevantSuppliers.
So my question is: given these sets share common "base conditions" which depend on the same sets of data, is there some more efficient way I could be calculating these sets?
I was thinking something with custom generic classes like this:
QueryGroup<People>(P => P.Transactions.Where(T => T.Date > SelectedDate).Count() > 0)
{
new Query<People>("Customers", P=>P.Type=="Customer"),
new Query<People>("Suppliers", P=>P.Type=="Supplier")
}
I can structure this just fine, but what I'm finding is that it makes basically no difference to the efficiency as it still needs to repeat the "shared condition" for each set.
I've also tried pulling the base condition data out as a static "ToList()" first, but this causes issues when running into navigation entities (i.e. People.Addresses don't get loaded).
Is there some method I'm not aware of here in terms of efficiency?
Thanks in advance!
Give something like this a try: combine "similar" values into fewer queries, then separate the results afterwards. Also, use Any() rather than Count() for an existence check. Your updated attempt goes part-way, but will still result in two hits to the database. It also helps to ensure that you are querying against indexed fields, and those indexes will be more efficient with numeric IDs rather than strings (i.e. a TypeID of 1 vs. 2 for "Customer" vs. "Supplier"). Normalized values are better for indexing and lead to smaller records, at the cost of more verbose queries.
var types = new string[] {"Customer", "Supplier"};
var people = People.Where(p => types.Contains(p.Type)
&& p.Transactions.Any(t => t.Date > selectedDate)).ToList();
var relevantCustomers = people.Where(p => p.Type == "Customer").ToList();
var relevantSuppliers = people.Where(p => p.Type == "Supplier").ToList();
This results in just one hit to the database, and the Any should be more performant than fetching an entire count. We split the customers and suppliers after the fact from the in-memory set. The caveat here is that any attempt to access details such as transactions on those customers and suppliers would result in lazy-load hits, since we didn't eager-load them. If you need entire entity graphs then be sure to .Include() the relevant details, or be more selective about the data extracted from the first query, i.e. select anonymous types with the applicable details rather than the entire entity.
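For example, a projection-only variant might look something like this (the Id and Name properties are assumptions about your model, not taken from your post):
var summaries = People
    .Where(p => types.Contains(p.Type)
        && p.Transactions.Any(t => t.Date > selectedDate))
    .Select(p => new { p.Id, p.Name, p.Type }) // pull only the columns the GUI needs
    .ToList();

var relevantCustomerCount = summaries.Count(p => p.Type == "Customer");
var relevantSupplierCount = summaries.Count(p => p.Type == "Supplier");
Because the projection contains no navigation properties, there is nothing left to lazy-load afterwards.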
I have some data that changes too often to be worth keeping in my Postgres tables.
I would like to compute top-N rankings out of that data.
I'm trying to figure out a way to do this considering:
Ease of use
Performance
1. Using hashes + a CRON job to build sorted sets frequently
In this case, I have a lot of user data stored in hashes like this:
u:25463:d = { "xp": 45124, "lvl": 12, "like": 15, "liked": 2 }
u:2143:d = { "xp": 4523, "lvl": 10, "like": 12, "liked": 5 }
If I want to get the top 15 highest-level users, I don't think I can do this with a single command. I think I'll need to SCAN all the u:x:d keys and build sorted sets out of them. Am I mistaken?
What about performance in this case ?
2. Multiple sorted sets
In this case, I duplicate data.
I still have the hashes from the first case, but I also update the data in the different sorted sets, so I don't need a CRON job to build them.
I feel like the best approach is the first one, but what if I have 1,000,000 users?
Or is there another way ?
One possibility would be to use a single sorted set + hashes.
The sorted set would just be used as a lookup, it would store the key of a user's hash as the value and their level as the score.
Any time you add a new player or update their level, you would both set the hash and insert the item into the sorted set. You could do this in a transaction-based pipeline or a Lua script, to be sure they both run atomically, keeping your data consistent.
Getting the top players would mean grabbing the top entries in the sorted set, and then using the keys from that set to look up the full data on those players in the hashes.
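In raw Redis terms, a minimal sketch might look like this (the key name leaderboard:lvl is just an example):
MULTI
HSET u:25463:d lvl 12
ZADD leaderboard:lvl 12 u:25463:d
EXEC

ZREVRANGE leaderboard:lvl 0 14
HGETALL u:25463:d
The MULTI/EXEC block keeps the hash and the sorted set in sync; ZREVRANGE returns the keys of the top 15 players by level, and you then HGETALL each of those keys for the full data.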
Hope that helps.
I have hundreds of soccer games saved in my Redis database. They are saved in hashes under the key games:soccer:data. I have three sorted sets to classify them into upcoming, live, and ended, all ordered by date (the score). This way I can easily retrieve them depending on whether they will start soon, are already happening, or have already ended. Now I want to be able to retrieve them by league name.
I came up with two alternatives:
First alternative: save single hashes containing the game id and the league name. This way I can get all live game ids and then check each id against its respective hash; if it matches the league name(s) I want, I push it into an array, and if not, I skip it. Finally, I return the array with all game ids for the leagues I wanted.
Second alternative: create keys for each league and have live, upcoming, and ended sets for each. This way, I think, it would be faster to retrieve the game ids; however, it would be a pain to maintain each set.
If you have any other way of doing this, please let me know. I don't know if sorting would be faster and save me some memory.
I am looking for speed and low memory usage.
EDIT (following hobbs alternative):
const multi = client.multi();
const tempSet = 'users:data:14:sports:soccer:lists:temp_' + getTimestamp();
return multi
.sunionstore(
tempSet,
[
'sports:soccer:lists:leagueNames:Bundesliga',
'sports:soccer:lists:leagueNames:La Liga'
]
)
.zinterstore(
'users:data:14:sports:soccer:lists:live',
2,
'sports:lists:live',
tempSet
)
.del(tempSet)
.execAsync()
I need to add AGGREGATE MAX to this query and I have no idea how.
One way would be to use a SET containing all of the games for each league, and use ZINTERSTORE to compute the intersection between your league sets and your existing sets. You could do the ZINTERSTORE every time you query the data (it's not a horribly expensive operation unless your data is very large), or you could do it only when writing to one of the "parent" sets, or you could treat it as a sort of cache by giving it a short TTL and creating it only if it doesn't exist when you go to query it.
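On the AGGREGATE MAX part mentioned in the edit above, the raw command would look something like this (key names copied from the edit, with tempSet standing for the temporary union key built there; treat it as a sketch rather than tested code):
ZINTERSTORE users:data:14:sports:soccer:lists:live 2 sports:lists:live tempSet AGGREGATE MAX
Members of a plain SET are counted with a score of 1 in ZINTERSTORE, so AGGREGATE MAX keeps the date score from the live sorted set instead of adding 1 to it. In node_redis you should be able to append 'AGGREGATE', 'MAX' as two extra arguments to the zinterstore call, since the client passes its arguments straight through to the server.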
What's an efficient way to find all nodes within N hops of a given node? My particular graph isn't highly connected, i.e. most nodes have only degree 2, so for example the following query returns only 27 nodes (as expected), but it takes about a minute of runtime and the CPU is pegged:
MATCH (a {id:"36380_A"})-[*1..20]-(b) RETURN a,b;
All the engine's time is spent in traversals, because if I just find that starting node by itself, the result returns instantly.
I really only want the set of unique nodes and relationships (for visualization), so I also tried adding DISTINCT to try to stop it from re-visiting nodes it's seen before, but I see no change in run time.
As you said, matching the start node alone is really fast, and even faster if your property is indexed.
However, what you are trying to do now is match the whole pattern across the graph.
Keep your idea of your fast starting point:
MATCH (a:Label {id:"1234-a"})
Once you've got it, pass it to the rest of the query with WITH:
WITH a
Then match the relationships from your fast starting point:
MATCH (a)-[:Rel*1..20]->(b)
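Put together, the whole thing would look something like this (the label, relationship type, and RETURN clause are placeholders to adapt to your model):
MATCH (a:Label {id:"36380_A"})
WITH a
MATCH (a)-[:Rel*1..20]->(b)
RETURN DISTINCT b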
We're developing an application based on neo4j and php with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity (users, companies) which holds the nodes with that property. So the users index contains 130K nodes, and the rest are in companies.
With Cypher we are querying like this:
START u=node:users('id:*')
RETURN count(u)
And the result is:
Returned 1 row. Query took 4080 ms
The server is configured with the defaults plus a few small tweaks, but 4 seconds is too long for our needs. Bear in mind that the database will grow by about 20K nodes a month, so we really need this query to perform well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if is possible to tweak this.
Thanks a lot and sorry for my poor english.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you mostly need just this single query, why not keep an integer with the total count on some node somewhere, update it together with the index insertions, and, for good measure, run an update with the query above every night?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    ...
    // take a write lock on the counter node so concurrent increments don't lose updates
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
        ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or, if you are using 2.0:
company1:COMPANY
The second approach would also let your index be updated automatically in a separate background thread, which is IMO one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" generally takes less time than reading a property from a node. It does, however, require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)