Having trouble with indexing in Neo4j

I have a dataset with the following details:
1.4 million nodes
2.9 million relationships
15 million properties (including gender, name, subscriber_id etc)
1 relationship type (Contacted)
I've batch imported the data into the database on my machine (64 bit, 16 core, 16 GB RAM) using https://github.com/jexp/batch-import/tree/20
I'm trying to index these nodes on Subscriber_ID, but I'm not really sure what I'm doing.
I ran
start n = node(*) set n:Subscribers
My understanding is that this creates a label for each of the nodes (is this correct?)
Next I ran
create index on :Subscribers(SUBSCRIBER_ID)
Which I think should create an index for all nodes with the 'Subscribers' label on the property 'SUBSCRIBER_ID'. (correct?)
Now when I go to Neo4j-sh and run
neo4j-sh (?)$ schema
==> Indexes
==> ON :Subscribers(SU_SUBSCRIBER_ID) ONLINE
==>
==> No constraints
But when I run the following it says there are no indices set for the nodes.
neo4j-sh (?)$ index --indexes
==> Node indexes:
==>
==> Relationship indexes:
I have a few questions:
Do I have to tell it to index the existing data? If so, how do I do that?
How can I then use the index? I've read through the documentation, but I had a bit of trouble following it.
It looks like I can have the indexes set up when I run the batch import script, but I can't really understand how... could someone explain please?
Here's an example of my data:
Nodes.txt
id SU_SUBSCRIBER_ID CU_FIRST_NAME gender SU_AGE
0 123456 Ann F 56
1 832746 ? UNKNOWN -1
2 546765 Tom UNKNOWN -1
3 768345 Anges F 72
4 267854 Aoibhlinn F 38
rels.csv
start end rel counter
0 3 CONTACTED 2
1 2 CONTACTED 1
1 4 CONTACTED 1
3 2 CONTACTED 2
4 1 CONTACTED 1

schema is the right command to look at.
Cypher uses the label indexes automatically for MERGE and MATCH.
With the Java Core-API you'd use db.findNodesByLabelAndProperty(label,property,value)
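For example, once the schema index is ONLINE, a plain Cypher lookup like the one below will use it automatically. (This is just a sketch: note that your schema output shows the indexed property as SU_SUBSCRIBER_ID rather than SUBSCRIBER_ID, so the property name in the query has to match what is actually stored on the nodes, and the value type has to match how it was imported.)

MATCH (s:Subscribers)
WHERE s.SU_SUBSCRIBER_ID = '123456'
RETURN s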
You did the right thing, except for one detail: you could have created the labels on the nodes while doing the batch import.
Just add an l:label field to your CSV file containing a comma-separated list of labels per node, as shown in the readme on that branch.
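For example, your node file header might gain a label column along these lines (tab-separated, as in your Nodes.txt; check the readme on that branch for the exact header syntax, this is only an illustration):

id	l:label	SU_SUBSCRIBER_ID	CU_FIRST_NAME	gender	SU_AGE
0	Subscribers	123456	Ann	F	56
1	Subscribers	832746	?	UNKNOWN	-1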

Related

BST Construction with multiple choices

I encountered the following question:
The height of a BST constructed from the following list of values: 10, 5, 4, 3, 2, 1, 0, 9, 13, 11, 12, 16, 20, 30, 40, 14 will be:
A) 5
B) 6
C) 7
D) 8
Now, up to a certain point I can construct the tree with no problem, because there aren't many choices for inserting the values, so from the first value (10) to the eleventh value (12) my tree goes like this:
            10
           /  \
          5    13
         / \   /
        4   9 11
       /        \
      3          12
     /
    2
   /
  1
 /
0
But after that I have to add the value 16, and I seem to have two choices: make it the right child of the value 12, or make it the right child of the value 13. The height would differ according to this choice, so what's the right approach here?
A binary search tree is a binary tree (that is, every node has at most 2 children) in which, for every node, all keys in that node's left subtree are less than the node's key and all keys in its right subtree are greater than the node's key.
Your walkthrough so far assumes a "naive" insertion algorithm that makes no attempt at balancing the tree and inserts the nodes in sequence by recursively comparing against each node from the root. Under typical circumstances, you would not want to build such an unbalanced tree because the performance of operations devolves into O(n) time instead of the optimal O(log(n)) time, which defeats the purpose of the BST structure.
Making 16 the right child of 12 would be an invalid choice, because 16 would then lie in the left subtree of 13, and 16 > 13 violates the BST property. By definition, every insertion has exactly one valid location, so there are no choices to be made, at least following this naive algorithm.
Following through with your approach, it should be clear that the final height (here, the number of nodes on the longest root-to-leaf path) will be 7 in any case, due to the unbalanced left branch 10-5-4-3-2-1-0.
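To make the naive procedure concrete, here is a small sketch (in Scala, not from the original question; Tree, Node and the value list are just illustrative names) that builds the tree by repeated naive insertion and measures height by counting nodes. It prints 7 for the given sequence.

// Naive (unbalanced) BST insertion and node-counted height.
sealed trait Tree
case object Leaf extends Tree
case class Node(value: Int, left: Tree, right: Tree) extends Tree

def insert(t: Tree, v: Int): Tree = t match {
  case Leaf                   => Node(v, Leaf, Leaf)
  case Node(x, l, r) if v < x => Node(x, insert(l, v), r)   // smaller keys go left
  case Node(x, l, r)          => Node(x, l, insert(r, v))   // larger keys go right
}

def height(t: Tree): Int = t match {
  case Leaf          => 0
  case Node(_, l, r) => 1 + math.max(height(l), height(r))  // count nodes on the longest path
}

val values = List(10, 5, 4, 3, 2, 1, 0, 9, 13, 11, 12, 16, 20, 30, 40, 14)
val tree   = values.foldLeft(Leaf: Tree)(insert)
println(height(tree)) // prints 7 with this node-counting convention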
However, if the question is to find the height of an optimal tree, then the correct answer is 5. The below illustrates a complete tree (that is, every level is full except the bottom level, and that level has all elements to the left) that can be built from the given data:
                      11
         +------------+-------------+
         4                          16
   +-----+-----+             +-----+-----+
   2           9             14          30
+--+--+     +--+--+       +--+--+     +--+--+
1     3     5     10      12    13    20    40
+
0
Take a look at self-balancing binary search tree data structures for more information on algorithms that can build such a tree.

Composite indexing using Redis in a hierarchical data model

I have a data model like this:
Fields:
counter number (e.g. 00888, 00777, 00123 etc)
counter code (e.g. XA, XD, ZA, SI etc)
start date (e.g. 2017-12-31 ...)
end date (e.g. 2017-12-31 ...)
Other counter date (e.g. xxxxx)
Current Datastructure organization is like this (root and multiple child format):
counter_num + counter_code
---> start_date + end_date --> xxxxxxxx
---> start_date + end_date --> xxxxxxxx
---> start_date + end_date --> xxxxxxxx
Example:
00888 + XA
---> Jan 10 + Jan 20 --> xxxxxxxx
---> Jan 21 + Jan 31 --> xxxxxxxx
---> Feb 01 + Dec 31 --> xxxxxxxx
00888 + ZI
---> Jan 09 + Feb 24 --> xxxxxxxx
---> Feb 25 + Dec 31 --> xxxxxxxx
00777 + XA
---> Jan 09 + Feb 24 --> xxxxxxxx
---> Feb 25 + Dec 31 --> xxxxxxxx
Today the retrieval happens in 2 ways:
//Fetch unique counter data using all the composite keys
counter_number + counter_code + date (start_date <= date <= end_date)
//Fetch all the counter codes and corresponding data matching the below conditions
counter_number + date (start_date <= date <= end_date)
What's the best way to model this in Redis, as I need to cache some of the frequently hit data? I feel sorted sets should handle this somehow, but I am unable to model it.
UPDATE:
Just to remove the confusion: the ask here is not for an SQL BETWEEN-like query, because I don't know what the start_date and end_date values are. Think of them as just column names.
What I don't want is
SELECT * FROM redis_db
WHERE counter_num AND
date_value BETWEEN start_date AND end_date
What I want is
SELECT * FROM redis_db
WHERE counter_num AND
start_date <= specific_date AND end_date >= specific_date
NOTE: The requirement is pretty close to the 2D indexing proposed in the Redis multi-dimensional indexing document:
https://redis.io/topics/indexes#multi-dimensional-indexes
I understood the concept, but I couldn't digest the implementation details given there.
I'm unlikely to get this done in time for the bounty, but what the hell...
This sounds like a job for geohashing. Geohashing is what you do when you want to index a 2-dimensional (or higher) dataset. For example, if you have a database of cities and you want to be able to quickly respond to queries like "find all the cities within 50km of X", you use geohashing.
For the purposes of this question, you can think of start_date and end_date as x and y coordinates. Normally in geohashing you're searching for points in your dataset near a particular point in space, or in a certain bounded region of space. In this case you just have a lower bound on one of the coordinates and an upper bound on the other one. But I suppose in practice the whole dataset is bounded anyway, so that's not a problem.
It would be nice if there was a library for doing this in Redis. There probably is, if you look hard enough. The newer versions of Redis have built-in geohashing functionality. See the commands starting with GEO. But it doesn't claim to be very accurate, and it's designed for the surface of a sphere rather than a flat surface.
So as far as I can see you have 3 options:
Map your search space to a small part of the sphere, preferably near the equator. Use the Redis GEO commands. To search, use GEORADIUS with a circle covering the region you're trying to search, taking into account the built-in inaccuracy and the distortion you get by mapping onto the sphere, then filter the results to get the ones that are actually inside the region.
Find some 3rd-party geohashing client for Redis which works on flat space and is more accurate than GEO.
Read the rest of this answer, or some other primer on geohashing, then implement it yourself on top of Redis. This is the hardest (but most educational) option.
If you have a database that indexes data using a numerical ordering, such that you can do queries like "find all the rows/records for which z is between a and b", you can build a geohash index on top of it. Suppose the coordinates are (non-negative) integers x and y. Then you add an integer-valued column z, and index by z. To calculate z, write x and y in binary, then take alternate digits from each. Example:
x = 969 = 0 1 1 1 1 0 0 1 0 0 1
y = 1130 = 1 0 0 0 1 1 0 1 0 1 0
z = 1750214 = 0110101011010011000110
Note that the index allows you to find, for example, all records positioned with z between 0101100000000000000000 and 0101101111111111111111 inclusive. In other words, all records for which z starts with 010110. Or to put it another way, you can find all records for which x starts with 001 and y starts with 110. This set of records corresponds to a square in the 2-dimensional space we are trying to search.
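As a rough sketch (not part of the original answer; interleave is an illustrative name), the bit interleaving could be implemented like this in Scala:

// Interleave the bits of x and y (x's bit first) to form the z value used
// as the one-dimensional index key, matching the worked example above.
def interleave(x: Long, y: Long, bits: Int): Long = {
  var z = 0L
  var i = bits - 1
  while (i >= 0) {
    z = (z << 1) | ((x >> i) & 1L) // next bit of x
    z = (z << 1) | ((y >> i) & 1L) // next bit of y
    i -= 1
  }
  z
}

interleave(969, 1130, 11) // == 1750214, as in the example above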
Not all squares can be searched in this way; we'll call the ones that can searchable squares. Suppose the client sends a request for all records for which (x,y) is inside a particular rectangle. (Or a circle, or some other reasonable geometric shape.) Then you need to find a set of searchable squares which cover the rectangle. Then, for each of the squares you've chosen, query the database for records inside that square and send the results to the client. (But you'll have to filter the results, because not all the records in the square are actually in the original rectangle.)
There's a balance to be struck. If you choose a small number of large searchable squares, you'll probably end up covering a much larger area of the map than you need; the query to the database will return lots of extra results that you'll have to filter out. Alternatively, if you use lots of little searchable squares, you'll be doing lots of queries to the database, many of which will return no results.
I said above that x and y could be start_date and end_date. But actually the distribution of your dataset won't be as symmetrical as in most uses of geohashing, so the performance might be better (or worse) if you instead use x = end_date + start_date and y = end_date - start_date.
Because your question is a bit vague about how you want to query your data, it remains unclear exactly how to solve it. With that in mind, here are my thoughts on how I might model your data:
Updated answer, detailing how to use SORTED SET
I have edited this answer to show how to store your values in a way that lets you query by dynamic date ranges. This edit assumes that each database value is keyed by a single timestamp, not by two (a start and an end date) as in your current setup.
Yes, you are correct that Sorted Sets can accomplish this. I suggest that you always use a Unix timestamp value for the score component of these sorted sets.
In case you are not already familiar with Redis, a note on its indexing limitations: Redis is a simple key-value store designed to quickly retrieve values by key. Because of this design, it does not have many features of a traditional DBMS, such as indexing a column.
In Redis, you accomplish indexing by using a key. The most nested key-like structures are available in HASH and SORTED SET, but you only get two levels of key-like structure. In a HASH, you have the key (same as any data type) and an inner hash key, which can be any string.
In a SORTED SET, you have the key (same as any data type) and a numeric score.
A HASH is nice for keeping grouped data organized.
A SORTED SET is nice if you want to query by a range of values. This could be a good fit for your data.
Your SORTED SET would look like the following:
key
00888:XA =>
score (date value)        value
1452427200 (2016-01-10)   xxxxxxxx
1452859200 (2016-01-15)   yyyyxxxx
1453291200 (2016-01-20)   zzzzxxxx
Let's use a more intuitive example, the 2017 Juventus roster:
To produce the SORTED SET in the table below, issue this command in your redis client:
ZADD JUVENTUS 32 "Emil Audero" 1 "Gianluigi Buffon" 42 "Mattia Del Favero" 36 "Leonardo Loria" 25 "Neto" 15 "Andrea Barzagli" 4 "Medhi Benatia" 19 "Leonardo Bonucci" 3 "Giorgio Chiellini" 40 "Luca Coccolo" 29 "Paolo De Ceglie" 26 "Stephan Lichtsteiner" 12 "Alex Sandro" 24 "Daniele Rugani" 43 "Alessandro Semprini" 23 "Dani Alves" 22 "Kwadwo Asamoah" 7 "Juan Cuadrado" 6 "Sami Khedira" 18 "Mario Lemina" 46 "Mehdi Leris" 38 "Rolando Mandragora" 8 "Claudio Marchisio" 14 "Federico Mattiello" 45 "Simone Muratore" 20 "Marko Pjaca" 5 "Miralem Pjanic" 28 "Tomás Rincón" 27 "Stefano Sturaro" 21 "Paulo Dybala" 9 "Gonzalo Higuaín" 34 "Moise Kean" 17 "Mario Mandzukic"
Jersey Name Jersey Name
32 Emil Audero 23 Dani Alves
1 Gianluigi Buffon 42 Mattia Del Favero
36 Leonardo Loria 25 Neto
15 Andrea Barzagli 4 Medhi Benatia
19 Leonardo Bonucci 3 Giorgio Chiellini
40 Luca Coccolo 29 Paolo De Ceglie
26 Stephan Lichtsteiner 12 Alex Sandro
24 Daniele Rugani 43 Alessandro Semprini
22 Kwadwo Asamoah 7 Juan Cuadrado
6 Sami Khedira 18 Mario Lemina
46 Mehdi Leris 38 Rolando Mandragora
8 Claudio Marchisio 14 Federico Mattiello
45 Simone Muratore 20 Marko Pjaca
5 Miralem Pjanic 28 Tomás Rincón
27 Stefano Sturaro 21 Paulo Dybala
9 Gonzalo Higuaín 34 Moise Kean
17 Mario Mandzukic
To query the roster by a range of jersey numbers:
ZRANGEBYSCORE JUVENTUS 1 5
Output:
1) "Gianluigi Buffon"
2) "Giorgio Chiellini"
3) "Medhi Benatia"
4) "Miralem Pjanic"
Note that the scores are not returned; however, the ZRANGEBYSCORE command orders the results in ascending order by score.
To add the scores, append "WITHSCORES" to the command, like so: ZRANGEBYSCORE JUVENTUS 1 5 WITHSCORES
By using ZRANGEBYSCORE, you should be able to query any key (counter number + counter code) with a date range, producing the values in that range.
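Applied to your counter data, the same pattern might look roughly like this (the keys and values are illustrative, and the scores are assumed to be Unix timestamps at UTC midnight of the single date each entry is keyed by):

ZADD 00888:XA 1452384000 xxxxxxxx
ZADD 00888:XA 1453334400 yyyyxxxx
ZADD 00888:XA 1454284800 zzzzxxxx
ZRANGEBYSCORE 00888:XA 1452384000 1453420800 WITHSCORES

The last command would return the first two entries (those whose timestamps fall between 2016-01-10 and 2016-01-22) together with their scores.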
Original: Below is my original answer, recommending HASH
Based on your examples, I recommend you use a HASH.
With a hash, you would have a main key to find the hash (Ex. 00888:XA). Then within the hash, you have key -> value pairs (Ex. 2017-01-10:2017-01-20 -> xxxxxxxx). I prefer to delimit or tokenize my keys' components with the colon char :, but you can use any delimiter.
HASH follows your example data structure very well:
key
00888:XA =>
hashkey value
2017-01-10:2017-01-20 xxxxxxxx
2017-01-21:2017-01-31 yyyyxxxx
2017-02-01:2017-12-31 zzzzxxxx
key
00888:ZI =>
hashkey value
2017-01-10:2017-01-20 xxxxxxxx
2017-01-21:2017-01-31 xxxxyyyy
2017-02-01:2017-12-31 xxxxzzzz
When querying for data, instead of GET key, you would query with HGET key hashkey. Same for setting values, instead of SET key value, use HSET key hashkey value.
Example commands
HSET 00777:XA 2017-01-10:2017-01-20 xxxxxxxx
HSET 00777:XA 2017-01-21:2017-01-31 yyyyyyyy
HSET 00777:XA 2017-02-01:2017-12-31 zzzzzzzz
(Note: there is also a HMSET to simplify this into a single command)
Then:
HGET 00777:XA 2017-01-21:2017-01-31
Would return yyyyyyyy
Unless there is some specific performance consideration, or other goal for your data, I think Hashes will work great for your system.
It's also very convenient if you want to get all hashkeys or all values for a given hash, using commands like HKEYS, HVALS, or HGETALL.
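For example, after the HSET commands above:

HKEYS 00777:XA
HVALS 00777:XA
HGETALL 00777:XA

return the stored date-range hashkeys, the stored values, and the hashkey/value pairs interleaved, respectively.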

gnuplot: Spurious data points in plots when using index

I'm trying to use gnuplot 4.6 patchlevel 6 to visualize some data from a file test.dat which looks like this:
#Pkg 1
type min max avg
small 1 10 5
medium 5 15 7
large 10 20 15
#Pkg 2
small 3 9 5
medium 5 13 6
large 11 17 13
(Note that the values are actually separated by tabs even though it shows as spaces here.)
My gnuplot commands are
reset
set datafile separator "\t"
plot 'test.dat' index 0 using 2:xticlabels(1) title col, '' using 3 title col, '' using 4 title col
This works fine as long as there is only a single data block in test.dat. When I add the second block, spurious data points appear. Why is that, and how can it be fixed?
YFTR: Using stat on the file yields only the expected results: it reports two data blocks for the full file, and correct values (for min, max and sum) when I specify one of the two blocks using index.
As mentioned in the comment to the question, one has to explicitly repeat the index 0 specification in all parts of the plot command, as in
plot 'test.dat' index 0 using 2, '' index 0 using 3, ...
Otherwise, '' refers to all blocks in the data file.
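Applied to the command from the question, that gives:

plot 'test.dat' index 0 using 2:xticlabels(1) title col, \
     '' index 0 using 3 title col, \
     '' index 0 using 4 title col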

Generating variable observations for one id to be observations for a new variable of another id

I have a data set that allows linking friends (i.e. observing peer groups), and thereby one can observe the characteristics of an individual's friends. What I have is an 8-digit identifier, id, each id's friend ids (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that are the foreign-born status of each friend.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Notice that MF1 means male friend 1 and MF1id is the id number for male friend 1. The respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
Edit: I did a poor job of explaining the data structure. I have a cross section, so 1 observation per unique id. Row 1 is the first 8-digit id number with all the variables following across the row. The repeating id numbers are between the friend ids listed for each person (mf1id, for example) and the id column. I hope that is a bit clearer.
Kevin Crow wrote vlookup, which makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except it will not work for me. After trying vlookup with my data set, the UCLA data that Dimitriy used for his example, and a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' is not unique within `key';"
            di in red /*
                */ "there are multiple observations with different `value'" /*
                */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an error, "invalid file," at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why this would create a problem, but maybe it does; I really don't know. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close
input aid mf1aid fborn
1 2 1
2 1 1
3 5 0
4 2 0
5 1 0
6 4 0
7 6 1
8 2 .
9 1 0
10 8 1
end
program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)
sort aid
list
Well, this fixed my problem. Thanks to all who responded; I would not have figured this out without you.

How to proceed with my Spark / Scala project

I am new to Spark and Scala. I am working on a Scala project where I will be accessing data from SQL Server.
There is a table in SQL Server that has info about clothes. itemCode is the primary key, and there are several attributes with Boolean 0/1 values (Designer, Exclusive, Handloom) as well as several other columns holding attributes of the product.
Code Designer Exclusive Handloom
A 1 0 1
B 1 0 0
C 0 0 1
D 0 1 0
E 0 1 0
F 1 0 1
G 0 1 0
H 0 0 0
I 1 1 1
J 1 1 1
K 0 0 1
L 0 1 0
M 0 1 0
N 1 1 0
O 0 1 1
P 1 1 0
and the list continues.
I have to select a collection of 32 items out of 320 items that have AT LEAST:
8 Designer, 8 Exclusive, 8 Handloom, 8 WeddingStyle, 8 PartyStyle,
8 Silk, 8 Georgette
I had solved the problem with the MS Excel Solver (it uses a gradient descent algorithm) by adding an extra column and using the SUMPRODUCT function between the added column and the required columns. The problem was solved there, and it took around 1 minute 30 seconds.
Also, the problem can be solved by writing an SQL query with 32 joins (a lot). For example, if I want to select 6 items out of the 16 above with at least 4 Designer, 4 Exclusive, and 4 Handloom items, the query would be like the one in my post: MYSQL - Select rows fulfilling many count conditions
In production, I have to fetch 32 rows this way, so my question is how to proceed further with the project.
I am working in Scala IDE for Eclipse and have added Spark MLlib there. I have fetched the data via JDBC, stored it in a DataFrame, and then created a temporary table:
dataFrame.registerTempTable("Data")
There is an optimizer class in MLlib's optimization package that uses gradient descent (like the Excel Solver does) to solve problems, but it is meant for machine learning and takes training data as input.
I am not able to figure out how to proceed with my project. Can I use MLlib, or should I use a simpler SQL-based approach with Spark SQL? I need serious help.
I'd recommend you use DataFrames (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes) rather than MLlib.
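For instance, building on the temp table you already registered, you can inspect the attribute totals with plain Spark SQL before deciding how to pick the 32 items (the column names are taken from the question; everything else here is illustrative):

// Assumes sqlContext and the temp table "Data" registered in the question.
val totals = sqlContext.sql(
  "SELECT SUM(Designer) AS designer, SUM(Exclusive) AS exclusive, SUM(Handloom) AS handloom FROM Data")
totals.show()

// Candidate rows that carry at least one of the required attributes
val candidates = sqlContext.sql(
  "SELECT Code, Designer, Exclusive, Handloom FROM Data " +
  "WHERE Designer = 1 OR Exclusive = 1 OR Handloom = 1")
candidates.show()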
I solved this problem through linear programming. I have now used the lpsolve library for Java in my Scala project, and it gives almost the same result as the Excel Solver.
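For reference, the selection can be written as a 0/1 program along these lines (x_i = 1 means item i is picked; the attribute coefficients are the 0/1 values from the table):

\begin{aligned}
&\text{find } x_i \in \{0,1\},\quad i = 1,\dots,320 \\
&\text{s.t.}\quad \sum_i x_i = 32,\qquad
\sum_i \mathrm{Designer}_i\, x_i \ge 8,\qquad
\sum_i \mathrm{Exclusive}_i\, x_i \ge 8,\qquad
\sum_i \mathrm{Handloom}_i\, x_i \ge 8,\ \dots
\end{aligned}

with one such constraint for each required attribute; any feasible solution is an acceptable selection.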