Solandra to replace our Lucene + RDBMS?

Currently we are using a combination of SQL Server and Lucene to index some relational data about domain names. We have a Domain table and about 10 other tables holding histories of the different metrics we calculate and store about domains. For example:
Domain
    Id BIGINT
    Domain NVARCHAR
    IsTracked BIT

SeoScore
    Id BIGINT
    DomainId BIGINT
    Score INT
    Timestamp DATETIME
We are trying to include all the domains from major zone files in our database, so we are looking at about 600 million records eventually, which seems like it's going to be a bit of a chore to scale in SQL Server. Given our reliance on Lucene to do some pretty advanced queries, Solandra seems like it may be a nice fit. I am having a hard time not thinking about our data in relational database terms.
The SeoScore table maps many-to-one to Domain (one record for each time we calculated the score). I'm thinking that in Solandra terms, the best way to achieve this would be to use two indexes, one for Domain and one for SeoScore.
Here are the querying scenarios we need to achieve:
A 'current snapshot' of the latest metrics for each domain (so the latest SeoScore for a given domain). I'm assuming we would find the Domain records we want first, and then run further queries to get the latest snapshot of each metric separately.
Domains whose SeoScore has not been checked since x datetime and that have IsTracked=1, so we know which ones need to be recalculated. We would need some sort of batching system here so we could 'check out' domains and run calculations on them without duplicating effort.
Am I way off track here? Would we be right in basically mapping our tables to separate indexes in Solandra in this case?
UPDATE
Here's some JSON notation of what I'm thinking:
Domains : {    //Index
    domain1.com : {    //Document ID
        Middle : "domain1",    //Field
        Extension : "com",
        Created : '2011-01-01 01:01:01.000',
        ContainsDashes : false,
        ContainsNumbers : false,
        IsIDNA : false
    },
    domain2.com : {
        ...
    }
}

SeoScores : {    //Index
    domain1.com : {    //Document ID
        '2011-02-01 01:01:01.000' : {
            SeoScore : 3
        },
        '2011-01-01 01:01:01.000' : {
            SeoScore : -1
        }
    },
    domain2.com : {
        ...
    }
}

For SeoScores you might want to consider using virtual cores:
https://github.com/tjake/Solandra/wiki/ManagingCores
This lets you partition the data by domain so you can have SeoScores.domain1 and make each document represent one timestamp.
The rest sounds fine.
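Since Solandra exposes a Solr-compatible interface, both query scenarios can be expressed with plain SolrJ against the two indexes. The sketch below is only an illustration: the base URLs and the field names (IsTracked, LastChecked, Domain, Timestamp, Score) are assumptions about how you might lay out the documents, not anything Solandra prescribes.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class SolandraQuerySketch {
    public static void main(String[] args) throws Exception {
        // Adjust the base URL to wherever Solandra exposes the Domains index.
        SolrServer domains = new CommonsHttpSolrServer("http://localhost:8983/solandra/Domains");

        // Scenario 2: tracked domains whose score has not been checked since a cutoff date.
        SolrQuery stale = new SolrQuery("IsTracked:true AND LastChecked:[* TO 2011-06-01T00:00:00Z]");
        stale.addSortField("LastChecked", SolrQuery.ORDER.asc);
        stale.setRows(100); // batch size for the 'check out' loop
        for (SolrDocument doc : domains.query(stale).getResults()) {
            System.out.println("needs recalculation: " + doc.getFieldValue("Domain"));
        }

        // Scenario 1: latest SeoScore for one domain, assuming the per-domain virtual core
        // suggested above (SeoScores.domain1), where each document is one timestamped score.
        SolrServer seoScores = new CommonsHttpSolrServer("http://localhost:8983/solandra/SeoScores.domain1");
        SolrQuery latest = new SolrQuery("*:*");
        latest.addSortField("Timestamp", SolrQuery.ORDER.desc);
        latest.setRows(1);
        SolrDocument latestScore = seoScores.query(latest).getResults().get(0);
        System.out.println("latest score: " + latestScore.getFieldValue("Score"));
    }
}

The 'check out' bookkeeping itself would still have to live outside of the index, for example by writing a checked-out flag back to the document or keeping that state elsewhere.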

Related

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
    "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
    "ingress_timestamp": 1523053808,
    "client": {
        "name": "Jake",
        "age": 55
    },
    "transactions": [
        {
            "name": "peanut",
            "amount": 100
        },
        {
            "name": "avocado",
            "amount": 2
        }
    ]
}
All invoices are stored in ADLS and can be queried. But it is my desire to provide access to the same data inside an ADL database.
I am not an expert on unstructured data: I have an RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables
1 table - client info could be denormalized into the invoice data. But transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like asking whether you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized, which works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions), want the benefits of normalization and the right indexing, and are not limited by the row-size limits (e.g., your array of maps needs to fit into a row). So I would recommend that approach unless your transaction data is always small, you always access the data together, and you mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that either gives you the correlation of the parent to the child (normally done by stepwise downward navigation with CROSS APPLY), using the key value of the parent data item as the foreign key, or has the extractor generate the key (as an int or GUID).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON-to-rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON reader based, so you can process larger documents without loading them entirely into memory.

How to grab all nested records in the most efficient way. Rails

Say a user has many trees and trees have many apples. Say a method takes in a user's id. I eventually want the data to look something like this:
[
{ tree_id1: [apple1, apple2, apple3] },
{ tree_id2: [apple4, apple5, apple6] },
{ tree_id3: [apple9, apple8, apple7] }
]
So I'll eventually have to iterate through the user's trees and associate the apples with them. But I don't want to have to hit the database to retrieve each tree's apples every time. That seems like it'd be an N+1 problem (get the user's trees, then each tree's apples).
What can I do to retrieve the necessary records efficiently so that I can organize the records in the data format that I need?
When you make N+1 queries... why is it so bad? Where is the database in relation to the app code when you say you are on Heroku? Is it a network call?
You can make use of eager loading
users = User.includes(trees: :apples)
Instead of firing N + 1 queries, it will fire only three queries:
1) Select * from users
2) Select * from trees where user_id IN [ids_of_users_fetched_from_above_query]
3) Select * from apples where tree_id IN [ids_of_trees_fetched_from_above_query]
So now when you write
users.first.trees.first.apples
no query is fired, as the records were already eager loaded when you fetched the users. This is a built-in feature of Rails.
You can overcome the N + 1 problem by applying eager loading.
You can do it by using includes, like below:
User.includes(trees: :apples)

Modelling NoSQL database (when converting from SQL database)

I have a SQL database that I want to convert to a NoSQL one (currently I'm using RavenDB)
Here are my tables:
Trace:
    ID (PK, bigint, not null)
    DeploymentID (FK, int, not null)
    AppCode (int, not null)
Deployment:
    DeploymentID (PK, int, not null)
    DeploymentVersion (varchar(10), not null)
    DeploymentName (nvarchar(max), not null)
Application:
    AppID (PK, int, not null)
    AppName (nvarchar(max), not null)
Currently I have these rows in my tables:
Trace:
ID: 1 , DeploymentID: 1, AppCode: 1
ID: 2 , DeploymentID: 1, AppCode: 2
ID: 3 , DeploymentID: 1, AppCode: 3
ID: 4 , DeploymentID: 2, AppCode: 1
Deployment:
DeploymentID: 1 , DeploymentVersion: 1.0, DeploymentName: "Test1"
DeploymentID: 2 , DeploymentVersion: 1.0, DeploymentName: "Test2"
Application:
AppID: 1 , AppName: "Test1"
AppID: 2 , AppName: "Test2"
AppID: 3 , AppName: "Test3"
My question is: HOW should I build my NoSQL document model ?
Should it look like:
trace/1
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test1"
}
trace/2
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test2"
}
trace/3
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test3"
}
trace/4
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test2" } ],
"Application": "Test1"
}
And what if Deployment 1 gets changed? Should I go through each document and change the data?
And when should I use references in NoSQL ?
Document databases such as Raven are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.
So in order to answer your question, what we really need to know is how you intend to use the data. For example, displaying all traces ordered by time is a distinctly different scenario than displaying traces associated with a specific deployment or application. Each one of those requirements will dictate a different design, as will supporting them both.
This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.
There are a few "do"s and "don't"s when deciding on a strategy:
DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page of web apps is a classic example. The first priority is to make sure that these are as efficient as possible. Make sure that your data model either A) allows loading those in a single IO request or B) is cache-friendly.
DON'T: Don't fall into the dreaded "N+1" trap. This pattern occurs when your data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with #3...
DO: Always cap (via the UX) the amount of data which you are willing to fetch. If the user has 3729 comments you obviously aren't going to fetch them all at once. Even if it were feasible from a database perspective, the user experience would be horrible. That's why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure to the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get.
DO: Balance the read and write requirements. Some types of systems are read-heavy and you can assume that for each write there will be many reads (StackOverflow is a good example). So there it makes sense to make writes more expensive in order to gain benefits in read performance, for example via data denormalization and duplication. Other systems are evenly balanced or even write-heavy and require other approaches.
DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens all kinds of interesting optimization possibilities in your data schema.
This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra).
How you model your documents depends mostly on your application and its domain. From there, the document model can be refined by understanding your data access patterns.
Blindly attempting to map a relational data model to a non-relational one is probably not a good idea.
UPDATE: I think Matt got the main idea of my point here. What I am trying to say is that there is no prescribed method (that I am aware of anyway) to translate a relational data model (like a normalized SQL Schema) to a non-relational data model (like a document model) without understanding and considering the domain of the application. Let me elaborate a bit here...
After looking at your SQL schema, I have no idea what a trace is besides a table that appears to join Applications and Deployments. I also have no idea how your application typically queries the data. Knowing a little about this makes a difference when you model your documents, just as it would make a difference in the way you model your application objects (or domain objects).
So the document model suggested in your question may or may not work for your application.

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and PHP with about 200k nodes, where every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity type (users, companies) which holds the nodes of that type. So the users index holds 130K nodes, and the rest are in companies.
With Cypher we are querying like this:
START u=node:users('id:*')
RETURN count(u)
And the result is:
Returned 1 row. Query took 4080 ms
The server is configured with the defaults plus a few tweaks, but 4 seconds is too slow for our needs. Consider that the database will grow by about 20K nodes per month, so we need this query to perform very well.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if it is possible to tweak this.
Thanks a lot, and sorry for my poor English.
Finally, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the Lucene index to get an "approximate" row count.
Thanks again to all.
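For anyone on the embedded Java API instead of Gremlin, the same trick of asking the Lucene index directly for its size looks roughly like this (just a sketch; "users" is the index name from the question):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.index.lucene.QueryContext;

public class UserCount {
    // Returns the (approximate) number of nodes in the "users" index.
    public static int countUsers(GraphDatabaseService graphDb) {
        IndexHits<Node> hits = graphDb.index().forNodes("users").query(new QueryContext("*"));
        try {
            return hits.size(); // the count comes straight from Lucene, same as the Gremlin version
        } finally {
            hits.close();       // release the underlying index searcher
        }
    }
}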
Mmh,
this is really about the performance of that Lucene index. If you just need this single query most of the time, why not update an integer with the total count on some node somewhere, and maybe update that together with the index insertions; for good measure, run an update on it with the query above every night?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are done guarded by write locks:
Transaction tx = db.beginTx();
try {
    ...
    ...
    // Take the write lock on the counting node so concurrent increments don't race.
    tx.acquireWriteLock( countingNode );
    countingNode.setProperty( "user_count",
            ((Integer) countingNode.getProperty( "user_count" )) + 1 );
    tx.success();
} finally {
    tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. Instead, do it like this:
company1-[:IS_ENTITY]->companyentity
Or, if you are using 2.0:
company1:COMPANY
The second would also allow you to automatically update your index in a separate background thread, by the way; IMO one of the best new features of 2.0.
The first method should also prove more efficient, since making a "hop" in general takes less time than reading a property from a node. It does however require you to create a separate index for the entities.
Your queries would look like this:
v2.0
MATCH (company:COMPANY)
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITY]->entity
RETURN count(company)
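For the 1.9 variant, the shared entity node has to be created, indexed, and wired up front. A rough sketch with the embedded Java API follows; the index name entityindex and the value='company' entry are assumptions chosen to match the Cypher above:

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

public class EntitySetup {
    private static final DynamicRelationshipType IS_ENTITY =
            DynamicRelationshipType.withName("IS_ENTITY");

    // Creates the shared "company" entity node and puts it in the entity index.
    public static Node createCompanyEntity(GraphDatabaseService graphDb) {
        Transaction tx = graphDb.beginTx();
        try {
            Node companyEntity = graphDb.createNode();
            Index<Node> entityIndex = graphDb.index().forNodes("entityindex");
            entityIndex.add(companyEntity, "value", "company"); // matches node:entityindex(value='company')
            tx.success();
            return companyEntity;
        } finally {
            tx.finish();
        }
    }

    // Links an individual company node to the shared entity node.
    public static void markAsCompany(GraphDatabaseService graphDb, Node company, Node companyEntity) {
        Transaction tx = graphDb.beginTx();
        try {
            company.createRelationshipTo(companyEntity, IS_ENTITY);
            tx.success();
        } finally {
            tx.finish();
        }
    }
}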

Elastic Search documents sorting, indexing issue

I have 9000 documents in my ElasticSearch index.
I want to sort by an analyzed string field. In order to do that, I learned (through Google) that I must update the mapping to make the field not-analyzed so I can sort by it, and I must re-index the data to reflect the change in mapping.
The re-indexing process consumed about 20 minutes on my machine.
The strange thing is that the re-indexing process consumed about 2 hours on a very powerful production server.
I checked the memory status and the processor usage on that server and everything was normal.
What I want to know is:
Is there a way to sort documents by an analyzed, tokenized field without re-indexing all the documents?
If I must re-index all the documents, why does it take such a huge amount of time on the server? Or how can I trace the reason for the slowness on that server?
As long as the field is stored in _source, I'm pretty sure you could use a script to create a custom field every time you search.
{
    "query" : { "query_string" : { "query" : "*:*" } },
    "sort" : {
        "_script" : {
            "script" : "<some sorting field>",
            "type" : "number",
            "params" : {},
            "order" : "asc"
        }
    }
}
This has the downside of re-evaluating the sorting script on the server side each time you search, but I think it solves (1).