Neo4j Configuration for 4M Nodes, 10M Relationships - cypher

I am new to Neo4j and have run a few graph queries against a database with 4M nodes and 10M relationships. So far I have been thoroughly surprised by the performance of my queries.
SCHEMA
.......
(a:user{data:1})-[:follow]->(:user)-[:next*1..10]-(:activity)
Here the user with data:1 follows another 100,000 users. Each of those 100,000 users has 2-8 next nodes (let's say activities of the user) attached. Now I want to fetch the activities of those users up to level 3, i.e. [:next*1..3]. Each activity has a numeric relevance property.
So now I have 100,000 * 3 nodes to traverse.
CYPHER
.......
match (u:user{data:1})-[:follow]-(:user)-[:next*1..3]-(a:activity)
return a order by a.relevance desc limit 50
This query takes about 72,000 ms almost every time. I am new to Neo4j, and I am sure I have not tuned the OS or the database properly.
I am using the following parameters:
Initial Java Heap Size (in MB)
wrapper.java.initmemory=2000
Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=2456
Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
Please tell me where I am going wrong. I have read all the documentation on the Neo4j website, but the query time did not improve.
How can I configure a high-performance cache? What should I do so that the whole graph is loaded into memory? When I look at my RAM usage, it is always around 1.8 GB out of 4 GB. I am using an Enterprise license on Windows (Neo4j 2.0). Please help.

You are actually following not 100k * 3 paths but 100k * (2-10)^10, meaning on the order of 10^15 paths.
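To spell that arithmetic out (assuming up to ~10 next nodes per user and the full *1..10 depth from the schema):

100,000 followed users * 10^10 paths per user ≈ 10^15 paths

Even at the *1..3 depth used in the query, with up to 8 next nodes per user that is roughly 100,000 * (8 + 8^2 + 8^3) ≈ 5.8 * 10^7 paths, far more than 100,000 * 3.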
More memory in your machine would make a lot of sense, so try to get 8 GB or more.
Then you can increase the heap, e.g. to 6GB:
wrapper.java.initmemory=6000
wrapper.java.maxmemory=6000
neo4j.properties
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=10M
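As a quick sanity check on that memory budget (just adding up the values above):

100M + 500M + 200M + 200M + 10M ≈ 1 GB of mapped store files
1 GB of store mapping + 6 GB of heap ≈ 7 GB, hence the suggestion of 8 GB or more of RAM.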
If you want to pull your data through, you would most probably want to invert your query.
match (a:activity), (u:user {data:1})
with a, u
order by a.relevance desc
limit 100
match (followed:user)-[:next*1..3]-(a:activity)
where (followed)-[:follow]-(u)
return a
order by a.relevance desc
limit 50
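For completeness, here is a rough sketch of running that rewritten query from Python against the transactional HTTP endpoint that Neo4j 2.0 ships; the host, port and absence of authentication are assumptions:

import requests

query = """
match (a:activity), (u:user {data:1})
with a, u
order by a.relevance desc
limit 100
match (followed:user)-[:next*1..3]-(a:activity)
where (followed)-[:follow]-(u)
return a
order by a.relevance desc
limit 50
"""

# POST the statement to the transactional Cypher endpoint and commit in one round trip
resp = requests.post(
    "http://localhost:7474/db/data/transaction/commit",
    json={"statements": [{"statement": query}]},
)
resp.raise_for_status()
print(resp.json()["results"])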

Related

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30-40 seconds, despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under Details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count, BigQuery still needs to allocate resources (such as slot units, shards, etc.). You might be hitting some limits, which causes a delay. For example, the default number of slots per project is 2,000.
The BigQuery execution plan provides very detailed information about the process, which can help you better understand the source of the delay.
One way to work around this is to use the approximate method described in this link.
This slide deck by Google might also help you.
For more details, see this video about how to understand the execution plan.
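If the goal is simply the total row count, one option is to read it from the table metadata instead of running a count(*). A minimal sketch with the google-cloud-bigquery Python client (a recent version; project, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table = client.get_table("my-project.my_dataset.myTable")

# num_rows comes from table metadata (the same figure shown under Details -> Table Info),
# so it returns immediately and scans no data
print(table.num_rows)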

BigQuery GUI - CPU Resource Limit

Is there a way to set the CPU resource limit on BigQuery with Python or the GUI?
I'm getting an error of:
Query exceeded resource limits. 2147706.163729571 CPU seconds were used, and this query must use less than 46300.0 CPU seconds.
Looking at BigQuery's Python reference page: http://google-cloud-python.readthedocs.io/en/latest/bigquery/reference.html
It looks like there are:
1. maximum_billing_tier
2. maximum_bytes_billed
that can be set, but there is no CPU-seconds option.
You can no longer set maximum_billing_tier - it is obsolete; as long as you stay below tier 100 you are billed as if it were tier 1, and if you exceed 100 the query simply fails.
As for CPU - check the concept of slots:
Maximum concurrent slots per project for on-demand pricing — 2,000
The default number of slots for on-demand queries is shared among all queries in a single project. As a rule, if you're processing less than 100 GB of queries at once, you're unlikely to be using all 2,000 slots.
To check how many slots you're using, see Monitoring BigQuery Using Stackdriver. If you need more than 2,000 slots, contact your sales representative to discuss whether flat-rate pricing meets your needs.
See more at https://cloud.google.com/bigquery/quotas#query_jobs
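For completeness, a rough sketch of the knob that does exist in a recent google-cloud-bigquery Python client - maximum_bytes_billed caps scanned bytes, not CPU seconds, and the project/table names here are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# fail the query up front if it would bill more than 10 GB of scanned data
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024 ** 3)

query_job = client.query(
    "SELECT * FROM `my-project.my_dataset.big_table`",
    job_config=job_config,
)
rows = query_job.result()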

Why does DataStax DevCenter restrict results to 1000 rows?

There is a limit of 1000 displayed rows for tables in DataStax DevCenter. Is there any reason for this option?
After all, when you query with SELECT count(*) FROM tablename;
Cassandra's performance is going to be the same whether it displays 1000 records or the complete record set.
DevCenter version 1.6.0 introduces result set paging which allows you to browse all the rows in your result set.
In DevCenter 1.6.0 the "with limit" value sets the paging size, i.e. the number of records to view per page, and is still limited to a maximum of 1000. However, you can now page forward (and back) through all of the query results.
A related new feature allows you to export all results to a file, either as CSV or INSERT statements. Right-click in the results view area and select "Export all results to File as [CSV|Insert]".
This is by design; consider it as a safeguard that prevents you from potentially fetching thousands or millions of rows by accident, which, among other problems, could have a serious impact on your network's bandwidth usage.
When you run the query in DataStax DevCenter 1.6 it displays 1000 records in the result (the selected limit), but if you export the same result to CSV it will give you all the records you are looking for.
I run DataStax DevCenter 1.4.
I run the query with a LIMIT and it gives me the actual count.
But LIMIT is capped at the maximum value of a signed integer (2147483647):
select count(*) from users LIMIT 2147483647; -- ALLOW FILTERING
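Outside DevCenter, the same page-through-everything behaviour is available programmatically. A rough sketch with the DataStax Python driver (keyspace and table names are placeholders):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# fetch_size is the page size; the driver transparently pages through the
# full result set instead of stopping at the first 1000 rows
stmt = SimpleStatement("SELECT * FROM users", fetch_size=1000)

count = 0
for _row in session.execute(stmt):
    count += 1

print(count)
cluster.shutdown()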

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform a transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirement is to process months and years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex nor CPU-intensive, i.e. it is just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery from the command line/console. You'd probably save some dollars in instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a streaming write limit of 100,000 rows per second per table, or 6M per minute. At 31M rows of input, that would take ~5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time of the graph (read from GCS -> dispatch -> ...), this looks about right.
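Spelling that estimate out with the figures above (6M rows/minute and the 31.3m-row day from the question):

31,300,000 rows / 6,000,000 rows per minute ≈ 5.2 minutes of table writes alone.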
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net, increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
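As a rough sketch of that sharded-read idea - written here with the later Apache Beam Python SDK rather than the Dataflow Java SDK of the time, with placeholder bucket, dataset and file patterns and a stand-in transform:

import apache_beam as beam

# hypothetical shard -> file-pattern mapping; each shard gets its own pipeline and table
shard_patterns = {
    "logs_a": "gs://my-bucket/dfp/2015-06-01-0*.csv.gz",
    "logs_b": "gs://my-bucket/dfp/2015-06-01-1*.csv.gz",
    "logs_c": "gs://my-bucket/dfp/2015-06-01-2*.csv.gz",
}

def transform(line):
    # stand-in for the REGEX field transformations described in the question
    return {"raw": line}

for table, pattern in shard_patterns.items():
    # one independent pipeline per shard (run sequentially here for brevity;
    # in practice they would be submitted in parallel), each targeting its own table
    with beam.Pipeline() as p:
        (
            p
            | "read" >> beam.io.ReadFromText(pattern)
            | "transform" >> beam.Map(transform)
            | "write" >> beam.io.WriteToBigQuery(
                "my-project:adserver." + table,
                schema="raw:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

Table wildcards on the BigQuery side can then aggregate across adserver.logs_* as described above.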
Make sense?

Cassandra secondary index get_indexed_slices timing out

I am using Cassandra 0.8 with 2 secondary indexes for columns like "DeviceID" and "DayOfYear". I have these two indexes in order to retrieve data for a device within a range of dates. Whenever I get a date filter, I convert it into a DayOfYear and search with indexed slices using the .NET Thrift API. I also cannot currently upgrade the DB.
My problem is that I usually have no issues retrieving rows with the get_indexed_slices query for the current date (using the current day of year). But whenever I query for yesterday's day of year (which is one of the indexed columns), I get a timeout the first time I make the query. Most of the time it returns when I query a second time, and it always does by the third time.
Both of these columns are created as double data type in the column family, and I generally get 1 record per minute. I have 3 nodes in the cluster, and the nodetool reports suggest the nodes are up and running, though the load distribution report from nodetool looks like this:
Starting NodeTool
Address DC Rack Status State Load Owns
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 7.59 GB 51.39%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 394.24 MB 3.81%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 4.42 GB 44.80%
and my configuration in YAML is as below.
hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50
partitioner: org.apache.cassandra.dht.RandomPartitioner
commitlog_sync: periodic
commitlog_sync_period_in_ms: 120000
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 24
sliced_buffer_size_in_kb: 64
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 50000
index_interval: 128
Is there something I may be missing? Are there any problems in the config?
Duplicate your data in another column family where the key is your search data. Row slices are much faster.
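A rough sketch of that denormalization with pycassa (the Thrift-era Python client for Cassandra 0.8); the keyspace, column family and key layout are placeholders, and string comparators are assumed for simplicity:

import time
import pycassa

pool = pycassa.ConnectionPool("MyKeyspace", ["127.0.0.1:9160"])
readings_by_device_day = pycassa.ColumnFamily(pool, "ReadingsByDeviceDay")

# write path: key the row on the exact search criteria (DeviceID + DayOfYear),
# with one column per reading, named by its timestamp
row_key = "device42:158"   # "<DeviceID>:<DayOfYear>"
readings_by_device_day.insert(row_key, {str(int(time.time())): "reading-value"})

# read path: a plain row slice (optionally bounded with column_start/column_finish),
# no secondary index involved
columns = readings_by_device_day.get(row_key, column_count=1000)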
Personally, I never managed to use secondary indexes in production environments. Either I had problems with timeouts, or the speed of data retrieval through the secondary index was lower than the rate at which data was inserted. I think this is related to non-sequential reads and HDD seek time.
If you come from a relational model, playOrm is just as fast and you can be relational on a NoSQL store, BUT you just need to partition your extremely large tables. If you do that, you can then use "scalable JQL" to do your stuff:
@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t(:partId) select t FROM TABLE as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
It also has the @ManyToOne, @OneToMany, etc. annotations for a basic ORM; some things work differently in NoSQL, but a lot is similar.
I finally solved my problem in a different way. In fact, I realized the problem was with my data model.
The problem arose because we come from an RDBMS background. I restructured the data model a little, and now I get responses faster.