Elasticsearch document sorting and indexing issue - Lucene

I have 9000 documents in my Elasticsearch index.
I want to sort by an analyzed string field. In order to do that, I learned (through Google) that I must update the mapping to make the field not-analyzed so I can sort on it, and that I must re-index the data to reflect the mapping change.
The re-indexing process took about 20 minutes on my machine.
The strange thing is that the same re-indexing process took about 2 hours on a very powerful production server.
I checked the memory status and the processor usage on that server and everything was normal.
What I want to know is:
1. Is there a way to sort documents by an analyzed, tokenized field without re-indexing all the documents?
2. If I must re-index all the documents, why does re-indexing take so much time on the server, and how can I trace the cause of the slowness there?

As long as the field is stored in _source, I'm pretty sure you could use a script to create a custom field every time you search.
{
    "query" : { "query_string" : { "query" : "*:*" } },
    "sort" : {
        "_script" : {
            "script" : "<some sorting field>",
            "type" : "number",
            "params" : {},
            "order" : "asc"
        }
    }
}
This has the downside of re-evaluating the sorting script on the server side each time you search, but I think it solves (1).
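For reference, here is a minimal sketch of the re-mapping route described in the question, using the Python elasticsearch client against a 1.x-era cluster (mapping types and not_analyzed, matching the question); the index, type and field names (myindex, myindex_v2, mytype, title) are placeholders. The idea is to add a not_analyzed sub-field and sort on that, so the original field stays analyzed and searchable:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# Create a new index whose mapping adds a not_analyzed sub-field "title.raw"
# next to the analyzed "title" field (names are hypothetical).
es.indices.create(
    index="myindex_v2",
    body={
        "mappings": {
            "mytype": {
                "properties": {
                    "title": {
                        "type": "string",
                        "fields": {
                            "raw": {"type": "string", "index": "not_analyzed"}
                        },
                    }
                }
            }
        }
    },
)

# Copy the existing documents over (scan + bulk under the hood).
helpers.reindex(es, source_index="myindex", target_index="myindex_v2")

# Sort on the not_analyzed sub-field; "title" itself is still analyzed for search.
res = es.search(
    index="myindex_v2",
    body={"query": {"match_all": {}}, "sort": [{"title.raw": {"order": "asc"}}]},
)
print(res["hits"]["total"])

This still requires the one-off re-index, but afterwards sorting needs no per-search scripting.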

Related

Unnesting a large number of columns in BigQuery and BigTable

I have a table in BigTable, with a single column family, containing some lead data. I was following the Google Cloud guide to querying BigTable data from BigQuery (https://cloud.google.com/bigquery/external-data-bigtable) and so far so good.
I've created the table definition file, as the docs require:
{
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/{project_id}/instances/{instance_id}/tables/{table_id}"
    ],
    "bigtableOptions": {
        "readRowkeyAsString": "true",
        "columnFamilies": [
            {
                "familyId": "leads",
                "columns": [
                    {
                        "qualifierString": "Id",
                        "type": "STRING"
                    },
                    {
                        "qualifierString": "IsDeleted",
                        "type": "STRING"
                    },
                    ...
                ]
            }
        ]
    }
}
But then, things started to go south...
This is how the BigQuery "table" ended up looking: each row is a rowkey, and inside each column there's a nested cell, where the only value I need is the value from leads.Id.cell (in this case).
After a bit of searching I found a solution to this:
https://stackoverflow.com/a/70728545/4183597
So in my case it would be something like this:
SELECT
ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.Id.cell)), "") AS Id,
...
FROM xxx
The problem is that I'm dealing with a dataset with more than 600 columns per row. It is unfeasible (and impossible, given BigQuery's subquery limits) to repeat this process more than 600 times per row/query.
I couldn't think of a way to automate this query or even think about other methods to unnest this many cells (my SQL knowledge stops here).
Is there any way to do an unnesting like this for 600+ columns with a SQL/BigQuery query, preferably in a more efficient way? If not, I'm thinking of doing a daily batch process, using a simple Python connector from BigTable to BigQuery, but I'm afraid of the costs this will incur.
Any documentation, reference or idea will be greatly appreciated.
Thank you.
In general, you're setting yourself up for a world of pain when you try to query a NoSQL database (like BigTable) using SQL. Unnesting data is a very expensive operation in SQL because you're effectively performing a cross join (which is many-to-many) every time UNNEST is called, so trying to do that 600+ times will give you either a query timeout or a huge bill.
The BigTable API will be way more efficient than SQL since it's designed to query NoSQL structures. A common pattern is to have a script that runs daily (such as a Python script in a Cloud Function) and uses the API to get that day's data, parse it, and then output that to a file in Cloud Storage. Then you can query those files via BigQuery as needed. A daily script that loops through all the columns of your data without requiring extensive data transforms is usually cheap and definitely less expensive than trying to force it through SQL.
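As an illustration only, here is a minimal sketch of that batch pattern with the Python clients (google-cloud-bigtable and google-cloud-bigquery); the project, instance, table and dataset names are placeholders, and it assumes each qualifier in the leads family holds a single current value:

from google.cloud import bigquery, bigtable

bt_client = bigtable.Client(project="my-project")
bt_table = bt_client.instance("my-instance").table("my-table")

records = []
for row in bt_table.read_rows():  # pass start_key/end_key to limit to one day's slice
    record = {"rowkey": row.row_key.decode("utf-8")}
    # Flatten the single "leads" column family: qualifier -> latest cell value.
    for qualifier, cells in row.cells.get("leads", {}).items():
        record[qualifier.decode("utf-8")] = cells[0].value.decode("utf-8")
    records.append(record)

# Load the flattened rows into a plain BigQuery table. (For a real job you would
# batch this, or write newline-delimited JSON to Cloud Storage instead.)
bq_client = bigquery.Client(project="my-project")
job = bq_client.load_table_from_json(
    records,
    "my-project.my_dataset.leads_flat",
    job_config=bigquery.LoadJobConfig(autodetect=True),
)
job.result()  # wait for the load job to finish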
That being said, if you're really set on using SQL, you might be able to use BigQuery's JSON functions to extract the nested data you need. It's hard to visualize what your data structure is without sample data, but you may be able to read the whole row in as a single column of JSON or a string. Then if you have a predictable path for the values you are looking to extract, you could use a function like JSON_EXTRACT_STRING_ARRAY to extract all of those values into an array. A Regex function could be used similarly as well. But if you need to do this kind of parsing on the whole table in order to query it, a batch job to transform the data first will still be much more efficient.
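If you do experiment with the SQL route, here is a hedged sketch of that idea using the Python BigQuery client. It uses JSON_VALUE rather than JSON_EXTRACT_STRING_ARRAY because it pulls a single scalar per row; the external table name and the JSON path are guesses and would need to match how your rows actually serialize:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Serialize each row of the BigTable-backed external table to one JSON string,
# then pull out the first cell value of leads.Id by path (path is an assumption).
sql = """
SELECT
  rowkey,
  JSON_VALUE(TO_JSON_STRING(t), '$.leads.Id.cell[0].value') AS Id
FROM `my-project.my_dataset.leads_external` AS t
"""

for row in client.query(sql).result():
    print(row["rowkey"], row["Id"])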

Changing an indexed property in Neo4j affects the search in Cypher

I am using Neo4j 2.0.
I have a label PROD. All nodes having label PROD have a property name. I have created an index on name like this:
CREATE INDEX ON :PROD(name)
After creating the index, if I change the value of the property name from "old" to "new", the following query works fine on a small test dataset, but not on our production data with 700 nodes having label PROD (where there are around a million nodes with other labels in total).
MATCH (n:PROD) WHERE n.name="new" RETURN n;
I have also created a legacy index on the same field, and after dropping and re-indexing the node on modification, it works perfectly fine on both test and production datasets.
Is there a way to ensure the indices are updated? What am I doing wrong? Why does the above query fail on the large dataset?
You can use the :schema command in Neo4j browser or schema in neo4j-shell. The output should indicate which indexes have been created and which of them are already online.
Additionally there's a programmatic way to wait until index population has finished.
Consider upgrading to 2.1.3; indexing has improved since 2.0.

Cannot insert new value to BigQuery table after updating with new column using streaming API

I'm seeing some strange behaviour with my BigQuery table. I've just added a new column to the table, and it looks good in the interface and when getting the schema via the API.
But when adding a value to the new column I get the following error:
{
    "insertErrors" : [ {
        "errors" : [ {
            "message" : "no such field",
            "reason" : "invalid"
        } ],
        "index" : 0
    } ],
    "kind" : "bigquery#tableDataInsertAllResponse"
}
I'm using the Java client and the streaming API; the only thing I added is:
tableRow.set("server_timestamp", 0)
Without that line it works correctly :(
Do you see anything wrong with it? (The name of the column is server_timestamp, and it is defined as an INTEGER.)
Updating this answer since BigQuery's streaming system has seen significant updates since Aug 2014 when this question was originally answered.
BigQuery's streaming system caches the table schema for up to 2 minutes. When you add a field to the schema and then immediately stream new rows to the table, you may encounter this error.
The best way to avoid this error is to delay streaming rows with the new field for 2 minutes after modifying your table.
If that's not possible, you have a few other options (a sketch of the first two follows this list):
Use the ignoreUnknownValues option. This flag will tell the insert operation to ignore unknown fields, and accept only those fields that it recognizes. Setting this flag allows you to start streaming records with the new field immediately while avoiding the "no such field" error during the 2 minute window--but note that the new field values will be silently dropped until the cached table schema updates!
Use the skipInvalidRows option. This flag will tell the insert operation to insert as many rows as it can, instead of failing the entire operation when a single invalid row is detected. This option is useful if only some of your data contains the new field, since you can continue inserting rows with the old format, and decide separately how to handle the failed rows (either with ignoreUnknownValues or by waiting for the 2 minute window to pass).
If you must capture all values and cannot wait for 2 minutes, you can create a new table with the updated schema and stream to that table. The downside to this approach is that you need to manage multiple tables generated by this approach. Note that you can query these tables conveniently using TABLE_QUERY, and you can run periodic cleanup queries (or table copies) to consolidate your data into a single table.
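The question above uses the Java client, but as an illustration, here is roughly how the first two options look with the Python client (project, table and field names are made up):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.my_dataset.my_table"

rows = [{"existing_field": "abc", "server_timestamp": 0}]

# ignore_unknown_values: drop fields the cached schema doesn't know about yet
# (so values for the new field are silently lost during the cache window).
# skip_invalid_rows: insert the valid rows instead of failing the whole request
# when a single row is rejected.
errors = client.insert_rows_json(
    table_id,
    rows,
    ignore_unknown_values=True,
    skip_invalid_rows=True,
)
print(errors)  # an empty list means every row was accepted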
Historical note: A previous version of this answer suggested that users stop streaming, move the existing data to another table, re-create the streaming table, and restart streaming. However, due to the complexity of this approach and the shortened window for the schema cache, this approach is no longer recommended by the BigQuery team.
I was running into this error. It turned out that I was building the insert object as if I were in "raw" mode, but had forgotten to set the flag raw: true. This caused the BigQuery client library to take my insert data and nest it again under a json: {} node.
In other words, I was doing this:
table.insert({
    insertId: 123,
    json: {
        col1: '1',
        col2: '2',
    }
});
when I should have been doing this:
table.insert({
    insertId: 123,
    json: {
        col1: '1',
        col2: '2',
    }
}, {raw: true});
The Node.js BigQuery library didn't realize that the data was already in raw form and was then trying to insert this:
{
    insertId: '<generated value>',
    json: {
        insertId: 123,
        json: {
            col1: '1',
            col2: '2',
        }
    }
}
So in my case the errors were referring to the fact that the insert was expecting my schema to have 2 columns in it (insertId and json).

Modelling NoSQL database (when converting from SQL database)

I have a SQL database that I want to convert to a NoSQL one (currently I'm using RavenDB)
Here are my tables:
Trace:
ID (PK, bigint, not null)
DeploymentID (FK, int, not null)
AppCode (int, not null)
Deployment:
DeploymentID (PK, int, not null)
DeploymentVersion (varchar(10), not null)
DeploymentName (nvarchar(max), not null)
Application:
AppID (PK, int, not null)
AppName (nvarchar(max), not null)
Currently I have these rows in my tables:
Trace:
ID: 1 , DeploymentID: 1, AppCode: 1
ID: 2 , DeploymentID: 1, AppCode: 2
ID: 3 , DeploymentID: 1, AppCode: 3
ID: 4 , DeploymentID: 2, AppCode: 1
Deployment:
DeploymentID: 1 , DeploymentVersion: 1.0, DeploymentName: "Test1"
DeploymentID: 2 , DeploymentVersion: 1.0, DeploymentName: "Test2"
Application:
AppID: 1 , AppName: "Test1"
AppID: 2 , AppName: "Test2"
AppID: 3 , AppName: "Test3"
My question is: HOW should I build my NoSQL document model?
Should it look like:
trace/1
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test1"
}
trace/2
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test2"
}
trace/3
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test1" } ],
"Application": "Test3"
}
trace/4
{
"Deployment": [ { "DeploymentVersion": "1.0", "DeploymentName": "Test2" } ],
"Application": "Test1"
}
And what if Deployment 1 gets changed? Should I go through each document and change the data?
And when should I use references in NoSQL ?
Document databases such as Raven are not relational databases. You CANNOT first build the database model and then later on decide on various interesting ways of querying it. Instead, you should first determine what access patterns you want to support, and then design the document schemas accordingly.
So in order to answer your question, what we really need to know is how you intend to use the data. For example, displaying all traces ordered by time is a distinctly different scenario than displaying traces associated with a specific deployment or application. Each one of those requirements will dictate a different design, as will supporting them both.
This in itself may be useful information to you (?), but I suspect you want more concrete answers :) So please add some additional details on your intended usage.
There are a few "dos" and "don'ts" when deciding on a strategy:
DO: Optimize for the common use-cases. There is often a 20/80 breakdown where 20% of the UX drives 80% of the load - the homepage/landing page of web apps is a classic example. First priority is to make sure that these are as efficient as possible. Make sure that your data model either A) allows loading those in a single IO request or B) is cache-friendly.
DON'T: Don't fall into the dreaded "N+1" trap. This pattern occurs when your data model forces you to make N calls in order to load N entities, often preceded by an additional call to get the list of the N IDs. This is a killer, especially together with the next point...
DO: Always cap (via the UX) the amount of data which you are willing to fetch. If the user has 3729 comments you obviously aren't going to fetch them all at once; even if it were feasible from a database perspective, the user experience would be horrible. That's why search engines use the "next 20 results" paradigm. So you can (for example) align the database structure with the UX and save the comments in blocks of 20. Then each page refresh involves a single DB get.
DO: Balance the read and write requirements. Some types of systems are read-heavy, and you can assume that for each write there will be many reads (StackOverflow is a good example). So there it makes sense to make writes more expensive in order to gain benefits in read performance, for example via data denormalization and duplication. Other systems are evenly balanced or even write-heavy and require other approaches.
DO: Use the dimension of TIME to your advantage. Twitter is a classic example: 99.99% of tweets will never be accessed after the first hour/day/week/whatever. That opens up all kinds of interesting optimization possibilities in your data schema.
This is just the tip of the iceberg. I suggest reading up a little on column-based NoSQL systems (such as Cassandra)
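Purely as an illustration of the denormalization trade-off above, and not a recommendation without knowing the access patterns, a read-optimized trace document might embed the display values it needs while keeping plain IDs as references for anything that must be updated in one place. Sketched here as a Python dict with hypothetical IDs:

# Hypothetical shape for a read-optimized trace document: the names needed for
# display are denormalized into the document, while the IDs still reference the
# source documents.
trace_1 = {
    "Id": "traces/1",
    "Deployment": {
        "Id": "deployments/1",       # reference: follow this when the full doc is needed
        "DeploymentVersion": "1.0",  # denormalized copy for cheap reads
        "DeploymentName": "Test1",
    },
    "Application": {
        "Id": "applications/1",
        "AppName": "Test1",
    },
}
# The cost: if deployments/1 is renamed, every trace that embeds its name must be
# patched - cheap if renames are rare, expensive if they are frequent.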
How you model your documents depends mostly on your application and its domain. From there, the document model can be refined by understanding your data access patterns.
Blindly attempting to map a relational data model to a non-relational one is probably not a good idea.
UPDATE: I think Matt got the main idea of my point here. What I am trying to say is that there is no prescribed method (that I am aware of anyway) to translate a relational data model (like a normalized SQL Schema) to a non-relational data model (like a document model) without understanding and considering the domain of the application. Let me elaborate a bit here...
After looking at your SQL schema, I have no idea what a trace is besides a table that appears to join Applications and Deployments. I also have no idea how your application typically queries the data. Knowing a little about this makes a difference when you model your documents, just as it would make a difference in the way you model your application objects (or domain objects).
So the document model suggested in your question may or may not work for your application.

Solandra to replace our Lucene + RDBMS?

Currently we are using a combination of SQL Server and Lucene to index some relational data about domain names. We have a Domain table, and about 10 various other tables for histories of different metrics we calculate and store about domains. For example:
Domain
Id BIGINT
Domain NVARCHAR
IsTracked BIT
SeoScore
Id BIGINT
DomainId BIGINT
Score INT
Timestamp DATETIME
We are trying to include all the domains from major zone files in our database, so we are looking at about 600 million records eventually, which seems like it's going to be a bit of a chore to scale in SQL Server. Given our reliance on Lucene to do some pretty advanced queries, Solandra seems like it may be a nice fit. I am having a hard time not thinking about our data in relational database terms.
The SeoScore table has a many-to-one relationship with Domain (one record for each time we calculated the score). I'm thinking that in Solandra terms, the best way to achieve this would be to use two indexes, one for Domain and one for SeoScore.
Here are the querying scenarios we need to achieve:
A 'current snapshot' of the latest metrics for each domain (so the latest SeoScore for a given domain). I'm assuming we would find the Domain records we want first, and then run further queries to get the latest snapshot of each metric separately.
Domains with SeoScores not having been checked since x datetime, and having IsTracked=1, so we would know which ones need to be recalculated. We would need some sort of batching system here so we could 'check out' domains and run calculations on them without duplicating efforts.
Am I way off track here? Would we be right in basically mapping our tables to separate indexes in Solandra in this case?
UPDATE
Here's some JSON notation of what I'm thinking:
Domains : { //Index
    domain1.com : { //Document ID
        Middle : "domain1", //Field
        Extension : "com",
        Created : '2011-01-01 01:01:01.000',
        ContainsDashes : false,
        ContainsNumbers : false,
        IsIDNA : false
    },
    domain2.com : {
        ...
    }
}
SeoScores : { //Index
    domain1.com : { //Document ID
        '2011-02-01 01:01:01.000' : {
            SeoScore: 3
        },
        '2011-01-01 01:01:01.000' : {
            SeoScore: -1
        }
    },
    domain2.com : {
        ...
    }
}
For SeoScores you might want to consider using virtual cores:
https://github.com/tjake/Solandra/wiki/ManagingCores
This lets you partition the data by domain, so you can have SeoScores.domain1 and make each document represent one timestamp.
The rest sounds fine.
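For what it's worth, here is a hedged sketch of querying one such per-domain virtual core through Solandra's Solr-compatible HTTP endpoint with pysolr; the URL layout, the core name (SeoScores.domain1) and the field names (SeoScore, Timestamp) are all assumptions:

import pysolr

# One virtual core per domain, e.g. "SeoScores.domain1" (assumed URL layout).
solr = pysolr.Solr("http://localhost:8983/solandra/SeoScores.domain1", timeout=10)

# Latest SeoScore snapshot for this domain: newest document first, take one.
results = solr.search("*:*", **{"sort": "Timestamp desc", "rows": 1})
for doc in results:
    print(doc.get("Timestamp"), doc.get("SeoScore"))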