How to incrementally update nested objects in Elasticsearch? - indexing

I have 2 document types (in normal form in relational db):
1: post (with title, text and author fields)
2: comment (with text, author, post_id fields)
I have only one type in Elasticsearch (post), which aggregates each post with all of its comments in nested form.
I want to index posts with their comments as nested objects to decrease query response time, but the indexing cost will increase significantly if I reindex the whole "post" document every time a new "comment" is added. How can I handle this efficiently? It is acceptable for me to have comment data appear with a 1-hour delay.
In fact there are three questions:
1- How can I update a post document with only the newly added comment data (without reconstructing the whole post document and sending it to Elasticsearch)?
2- How can I aggregate the index commands related to one document and send them to Elasticsearch as a single command?
3- Is a river plugin a solution for this? Does it index comments without reconstructing the whole post document? Does it aggregate all updates related to one document and apply them with one index request?

I think this post answers your questions:
elastic search, is it possible to update nested objects without updating the entire document?
Updating multiple items at once can be done using the bulk API.
There is no river that can help you avoid reindexing the whole document. With nested documents you always reindex the complete document. If this happens a lot and becomes a problem, parent-child mappings are the way to go.
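For the batching part, here is a rough sketch of one approach, assuming an index named "posts" with a nested "comments" field and the elasticsearch-py client (names and the script syntax are illustrative and vary between Elasticsearch versions): buffer new comments for the 1-hour window, then flush them as scripted partial updates through the bulk helper, so only the new comments travel over the wire even though Elasticsearch still rewrites each document internally.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def flush_pending_comments(pending):
    """pending: {post_id: [comment_dict, ...]} accumulated over the last hour."""
    actions = []
    for post_id, comments in pending.items():
        actions.append({
            "_op_type": "update",
            "_index": "posts",   # assumed index name
            "_id": post_id,
            # Appends the buffered comments to the nested field; assumes the post
            # document already has a (possibly empty) comments array.
            "script": {
                "source": "ctx._source.comments.addAll(params.new_comments)",
                "params": {"new_comments": comments},
            },
        })
    bulk(es, actions)  # one bulk request covers all pending posts

With parent-child (join) mappings, by contrast, each comment is indexed as its own document, so nothing is rewritten on a new comment, at the cost of queries that are slower than with nested documents.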

Related

Is Amazon's DynamoDB good for a comment-like system?

I am deciding whether to use MySQL or NoSQL for a forum site. I am very new to the whole NoSQL idea, and reading the documentation I noticed that an 'item' can't be larger than 64 KB, including attribute names and values. From the way I understand it, the only way I can think of implementing this would be something like this:
Posts: [
  { user: 'gates', text: 'first post',
    comment: [
      { user: 'jim', text: 'comment1' },
      { user: 'bobby', text: 'comment2' }
    ]
  },
  { user: 'jim', text: 'second post' }
]
The problem here is that if there are a lot of comments on a post, it would exceed the 64 KB limit for the item. Is there another way of implementing this "post/comment relationship" which would allow it to scale nicely?
You can model it in many ways; the best two I can think of are:
Use one table (let's say "forumapp"):
Post
HASH -> <uuid>
RANGE -> 0
Comment
HASH -> <uuid of the post>
RANGE -> 1..n (comments will be ordered by RANGE Key by default)
Now, querying with the hash you can get the post and all of its comments. Because they have different range keys they are different items, so you can have as many comments as you want.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
No practical limit for non-indexed tables.
For a table with local secondary indexes, there is a limit on item collection sizes: For every distinct hash key value, the total sizes of all table and index items cannot exceed 10 GB. Depending on your item sizes, this may constrain the number of range keys per hash value. For more information, see Item Collection Size Limit.
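A minimal sketch of that single-table layout with boto3 (table and attribute names such as "forumapp", "postId" and "seq" are illustrative, not something the answer prescribes):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("forumapp")

# The post lives at range key 0, each comment at range key 1..n.
table.put_item(Item={"postId": "uuid-123", "seq": 0, "user": "gates", "text": "first post"})
table.put_item(Item={"postId": "uuid-123", "seq": 1, "user": "jim", "text": "comment1"})

# One Query on the hash key returns the post and all of its comments,
# ordered by the range key.
items = table.query(KeyConditionExpression=Key("postId").eq("uuid-123"))["Items"]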
Use 2 tables
Table post
HASH -> uuid
Table Comments
HASH -> <uuid of the post>
RANGE -> comment date (so you get results ordered by date)
The difference between the two approaches is that with the first you issue one Query operation, specifying only the hash, and you get your post together with all of its comments (as long as the whole thing is under 1 MB; beyond that you have to get the next 1 MB with another API call, and so on). With the second approach you need to perform one GetItem operation for the post and one Query operation for the comments. Since you only have 5 secondary indexes available per table, you may want to have 2 tables. But generally it really depends on what access patterns your application will have on the data...
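For comparison, a sketch of the two-table variant (again with boto3; the table and key names here are assumptions): one GetItem for the post, one Query for its comments ordered by date.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
posts = dynamodb.Table("post")
comments = dynamodb.Table("comments")

post = posts.get_item(Key={"uuid": "uuid-123"})["Item"]
post_comments = comments.query(
    KeyConditionExpression=Key("postUuid").eq("uuid-123"),
    ScanIndexForward=True,  # range key is the comment date, so this is oldest-first
)["Items"]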
To better understand the differences between SQL and NoSQL, it is worth taking the time to watch this Martin Fowler presentation: https://www.youtube.com/watch?v=qI_g07C_Q5I
If you are used to SQL, it might be tricky to get an app working on a NoSQL database, because you have to think seriously about how you will retrieve your data BEFORE you store it, or you will simply not be able to read it the way you like (no SQL to the rescue...). Sometimes you will have to build your own indexes in a separate table, slowing down writes a bit to make reads faster. You should be doing this even with SQL databases, but if you fail to model a relational schema you will at best get a slow application, whereas if you fail to model a NoSQL schema you will get stuck.
For instance, if you have to retrieve all the posts that have been tagged with both tag A and tag B with a single Query operation, you will have to maintain some kind of index in another table that, given a hash calculated on A+B, gives you all the range keys (post identifiers) that match your query. You might be thinking of secondary indexes, but you can only use them INSIDE a hash value: you first give the hash, and only then do you filter with a secondary index...
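A hypothetical sketch of such a hand-rolled tag index (table and attribute names are invented for illustration): a separate table keyed by a hash of the sorted tag combination, with the post identifier as the range key.

import hashlib

import boto3
from boto3.dynamodb.conditions import Key

index = boto3.resource("dynamodb").Table("tag_index")

def tag_combo_hash(tags):
    # Sort so that ["A", "B"] and ["B", "A"] map to the same hash key.
    return hashlib.sha1("+".join(sorted(tags)).encode()).hexdigest()

# On write: register the post under the tag combinations you want to query later.
index.put_item(Item={"comboHash": tag_combo_hash(["A", "B"]), "postId": "uuid-123"})

# On read: one Query returns all posts tagged with both A and B.
posts = index.query(
    KeyConditionExpression=Key("comboHash").eq(tag_combo_hash(["A", "B"]))
)["Items"]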
If you are using Node.js, check out: https://github.com/aaaristo/dyngodb.
A post "document" with its comments attached is not a good use case if you're considering DynamoDB (It's not a document-db).
What you can do is create a Posts table and Comments table.
You will need to think about the questions you will be asking, in order to optimize the database.
For instance, if you have a rating for comments, you could keep the hashes of the top comments in the Posts table and only fetch "all comments" from the Comments table on demand.
Final note: I think it's easier to set up a post/comment system in SQL (check out PostgreSQL; don't limit yourself to MySQL). It's powerful and it will work great. WordPress uses MySQL, so why not you..

Pulling 50 RSS feeds and checking uniqueness of posts against database [Efficiency Advice]

I want to pull several RSS feeds into a database as efficiently as possible.
My site will scrape 50 RSS feeds every 4 hours - I only want to add unique posts to the database. I am a bit stuck on how to check post uniqueness. I am thinking I can compare the URL value and, when the first non-unique post comes up, break out of the loop.
So my question is: is it better to pull all of the posts and check each feed's posts against an array until one duplicate comes up (and then break)? Or is it better to search the database for the first post from each RSS feed, store only that in an array, and check the newly scraped posts against it?
I guess while I'm here I might as well ask if I should create a table for each separate feed. I am only storing 5 posts at a time so for now it is only one table and I just have a column to identify each feed.
Thanks.
Let the DBMS handle uniqueness: insert all the potential records, using WHERE NOT EXISTS to screen out the existing ones.
Nothing in what you said suggests to me that you need a table for each feed.
No one can answer an efficiency question without knowing the DBMS in question. As a rule of thumb, though, you're better off with bigger chunks of work and fewer transactions. Storing 5 rows at a time may be a lot slower than, say, 5000 rows at a time.
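A minimal sketch of that WHERE NOT EXISTS pattern, using SQLite only because the question doesn't name a DBMS (table and column names are illustrative); the whole batch goes through in one transaction, in keeping with the "bigger chunks, fewer transactions" advice.

import sqlite3

conn = sqlite3.connect("feeds.db")
conn.execute("""CREATE TABLE IF NOT EXISTS posts (
    feed_id INTEGER,
    url     TEXT,
    title   TEXT
)""")

def store_batch(scraped):
    """scraped: list of dicts with feed_id, url and title keys."""
    with conn:  # one transaction for the whole batch
        conn.executemany(
            """INSERT INTO posts (feed_id, url, title)
               SELECT ?, ?, ?
               WHERE NOT EXISTS (SELECT 1 FROM posts WHERE url = ?)""",
            [(p["feed_id"], p["url"], p["title"], p["url"]) for p in scraped],
        )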

SQL - How to tag data?

I need to store short tags (A01, B34, etc.) in a SQL table, and make sure they're indexed. Creating an INT column for each letter of the alphabet is not possible, because entries can have multiple 'A' tags, for example.
First I stored them as a long string, separated by spaces (for example "A01 B34"). But this requires a LIKE '%...%' query, which does a full table scan and ignores any indexes. So I'm looking for alternatives.
I now use SQLite FTS (full-text search) to search for these tags, but this requires a special table to store the tags in, fetching results with JOIN queries, and all kinds of other stuff I'd rather avoid.
My requirements are pretty simple: I need to store millions of short strings, each with their own tags, and do simple searches for these tags.
Is my current approach (doing FTS on the tags) the fastest? Or is it better to use a NoSQL database for this kind of data?
I will share my experience of how I did this at Pageflakes, my previous startup, on its Community site. At Pageflakes, user-created content is tagged. You can see an example here:
http://www.pageflakes.com/Community/Content/Flakes.aspx?moduleKey=4796
Each widget and pagecast has a collection of tags. When someone searches, we give the tags the highest priority, then the title, then the description of the item.
Assuming you have a Content table like this:
Content (id, title, description)
First of all, you need to create a table for all unique tags.
Tag (id, name (unique), countOfItems)
Then you need to map the tag to content rows.
TagContentMap (tagid, contentid)
You will now ask: for each new piece of content, do I have to insert into three tables? Not always. You insert into the Tag table only when you have a new tag. Most of the time, people choose existing tags. After a couple of months of tagging, users will have mostly exhausted the unique tags; from then on, 99% of the time they pick an existing tag. So that removes one insert for you, leaving only one additional insert.
Moreover, insert volume is always significantly lower than select volume. Most likely you will have 99% reads and 1% writes.
Unless you introduce these two tables, you can never have a UI where users can click on a tag and see all the content tagged with that particular tag. If you have no need for such a feature, then of course you can just add a "tags" column on the Content table itself and store the tags in comma-delimited format.
Now the most important point: how to produce the best search results. On the Content table, we have a varchar field called "SearchData". This field is populated with the tag names first, then the title, then the description. So:
SearchData = tag names, comma delimited + newline + title + newline + description
Then you use SQL Server's full-text indexing to index the SearchData column only, not any other field in the Content table.
Does this work for you?
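Here is a rough sketch of the Content / Tag / TagContentMap layout and the tag lookup it enables (SQLite is used purely for brevity; the answer itself relies on SQL Server's full-text indexing for the SearchData column):

import sqlite3

db = sqlite3.connect("tags.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS Content (
    id INTEGER PRIMARY KEY, title TEXT, description TEXT, searchData TEXT);
CREATE TABLE IF NOT EXISTS Tag (
    id INTEGER PRIMARY KEY, name TEXT UNIQUE, countOfItems INTEGER DEFAULT 0);
CREATE TABLE IF NOT EXISTS TagContentMap (
    tagId INTEGER, contentId INTEGER, PRIMARY KEY (tagId, contentId));
""")

# "All content tagged A01" becomes an indexed join instead of a LIKE scan.
rows = db.execute("""
    SELECT c.id, c.title
    FROM Content c
    JOIN TagContentMap m ON m.contentId = c.id
    JOIN Tag t ON t.id = m.tagId
    WHERE t.name = ?
""", ("A01",)).fetchall()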
You do not give us a lot of details to go on, but your design seems to be all wrong. It is not in third normal form.
@Joshua, please Google the term "normalization". Currently your data is denormalized. Denormalization is a legitimate thing, but only after normalization and as a kind of performance hack. Currently your design seems to be wrong.
As an example, instead of 1 table you should have 3 tables:
some_records (id, column1, ..., columnN)
tags (id, title)
some_records_tags (some_record_id, tag_id)
It's a classic design pattern for relational databases, and NoSQL is not needed here.
As other users have pointed out, the data is not well normalized. I'll assume that this is intentional and that there is some very large size requirement (hundreds of GB or TB) or a huge throughput requirement that you haven't mentioned. But before you start down any path, you should understand exactly what your requirements are: how often you write versus read, what the latency requirements for writes and reads are, and you have to include index maintenance in your calculations.
If you have a significant perf requirement, you might try building a near-line index system on top of what you currently have. I've used this technique in the past for large throughput requirement systems. The idea is basically that for writes, you make them as small and quick as possible, and create a batch process to come back and add the data into a secondary search table that will get it into a form that is capable of being searched. The benefit is your writes can be done quickly, and if you choose your clustered index well the reads for the batch processing can be done very efficiently. In addition, you can segment the data into different servers as necessary to support higher search throughput. The major drawback is that updates are not instantaneously reflected in search results.
If you write into a table like:
table data (id binary(6), ..., timestamp datetime, tags varchar(256))
and have a secondary table:
table search (tag char(3), dataId binary(6))
You can create a batch process that comes around, takes the last BATCH_SIZE (1000, maybe) records, splits the tags column on spaces, and inserts/deletes the tags into/from the search table. You keep a variable/row somewhere with the last timestamp value you've collected and start from there at the next batch interval. Finally, if deletes are important, each batch interval will need to find the set of records no longer in the data table. Alternatively, you could use a tombstone table if your data table is too large, or you can concurrently perform deletes against data and search if they happen infrequently enough.
Things to watch out for with batch processing are making the batch size too big and taking table locks when updating the search table. Also, you have to watch out for duplicate timestamps. And of course, when writing/updating the data table it is necessary to always update the timestamp.
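A hedged sketch of that batch pass, using SQLite for brevity and the table/column names from the example above (the cursor handling and batch size are illustrative, and a real system also has to cope with the duplicate-timestamp caveat just mentioned):

import sqlite3

BATCH_SIZE = 1000
db = sqlite3.connect("data.db")

def run_batch(last_timestamp):
    rows = db.execute(
        """SELECT id, timestamp, tags FROM data
           WHERE timestamp > ? ORDER BY timestamp LIMIT ?""",
        (last_timestamp, BATCH_SIZE),
    ).fetchall()
    with db:  # one transaction per batch
        for data_id, ts, tags in rows:
            # Re-derive this record's tags: drop the old rows, insert the current split.
            db.execute("DELETE FROM search WHERE dataId = ?", (data_id,))
            db.executemany(
                "INSERT INTO search (tag, dataId) VALUES (?, ?)",
                [(tag, data_id) for tag in tags.split(" ") if tag],
            )
    return rows[-1][1] if rows else last_timestamp  # cursor for the next interval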

How to deal with constantly changing data and SOLR indexes?

Afternoon guys,
I'm using a Solr index for searching through items on my site. The search results contain the item's average rating and the number of comments the item has. The results can be sorted by both rating and number of comments.
But obviously, with the Solr index, these numbers aren't updated until the DB (~2 million rows) is reindexed (done nightly, probably).
What would you guys think is the best way to approach this?
Well, I think you should change your DB-to-index sync policy:
First approach: when committing database changes, also post the changes (a batch of them) to the index. You should write a mapper tier to map your domain objects to Solr docs (remember: persist first, and only if that goes OK, index; this works fine for us ;-)). If you want to achieve near-real-time index updates you should look at solutions like Zoie (LinkedIn's Lucene-based search framework).
Second approach: take a look at delta imports (and schedule more frequent index updates).
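A minimal sketch of the first approach, assuming the pysolr client and illustrative field names (the mapper tier is the part the answer actually prescribes; everything else here is an assumption):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/items", timeout=10)

def on_items_committed(items):
    # Map domain objects to Solr docs only after the DB transaction has succeeded.
    docs = [{
        "id": item.id,
        "title": item.title,
        "avg_rating": item.avg_rating,
        "num_comments": item.num_comments,
    } for item in items]
    solr.add(docs, commit=False)  # let Solr's autoCommit/softCommit make it visible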

Feed aggregator using hbase. How to design the schema?

I am working on a project that involves monitoring a large number of RSS/Atom feeds. I want to use HBase for data storage and I have some problems designing the schema. For the first iteration I want to be able to generate an aggregated feed (the last 100 posts from all feeds in reverse chronological order).
Currently I am using two tables:
Feeds: column families Content and Meta: the raw feed is stored in Content:raw
Urls: column families Content and Meta: the raw post version is stored in Content:raw and the rest of the data found in the RSS is stored in Meta
I need some sort of index table for the aggregated feed. How should I build that? Is HBase a good choice for this kind of application?
Question update: Is it possible (in HBase) to design a schema that could efficiently answer queries like the one listed below?
SELECT data FROM Urls ORDER BY date DESC LIMIT 100
Peter Rietzler's answer on the hbase-user mailing list:
Hi,
In our project we are handling event lists where we have similar requirements. We do ordering by choosing our row keys wisely. We use the following key for our events (they should be ordered by time in ascending order):
eventListName/yyyyMMddHHmmssSSS-000[-111]
where eventListName is the name of the event list, 000 is a three-digit instance id to disambiguate between different running instances of the application, and -111 is optional to disambiguate events that occurred in the same millisecond on one instance.
We additionally insert an artificial row for each day with the id eventListName/yyyyMMddHHmmssSSS. This allows us to start scanning at the beginning of each day without searching through the event list.
You need to be aware of the fact that if you have a very high load of inserts, then one HBase region server is always busy inserting while the others are idle... if that's a problem for you, you have to find different keys for your purpose.
You could also use an HBase index table, but I have no experience with it, and I remember an email on the mailing list saying that this would double all requests because the API would first look up the index table and then the original table??? (please correct me if this is not right...)
Kind regards, Peter
Thanks Peter.
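For what it's worth, a small illustration of building the kind of row key Peter describes, plus an inverted-timestamp variant (not part of his answer) that is a common way to make a plain HBase scan return newest-first results for an aggregated feed like the one in the question:

from datetime import datetime, timezone

def event_row_key(event_list, instance_id, seq=None):
    # eventListName/yyyyMMddHHmmssSSS-000[-111], as in Peter's description.
    now = datetime.now(timezone.utc)
    ts = now.strftime("%Y%m%d%H%M%S") + f"{now.microsecond // 1000:03d}"
    key = f"{event_list}/{ts}-{instance_id:03d}"
    if seq is not None:
        key += f"-{seq:03d}"  # disambiguates events in the same millisecond
    return key

def inverted_row_key(event_list, epoch_millis):
    # Lexicographic order in HBase then becomes newest-first, so a scan with a
    # limit of 100 yields the reverse-chronological aggregated feed directly.
    return f"{event_list}/{10**13 - epoch_millis:013d}"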