Why not assign multiple types in an ElasticSearch index for logging, rather than multiple indices?

I am currently researching some data storage strategies with ElasticSearch and wonder why, for storing logs, this page indicates:
A standard format is to assign a new index for each day.
Would it not make more sense to create an index (database) with a new type (table) per day?
I am looking at this from the point of view of each index is tied to a different web application.
In another scenario, a web app uses one index. One of the types within that index is used for logging (what we currently do with SQL Server). Is this a good approach?

Interesting idea and, yes, you could probably do that. Why use multiple indices instead? If having control over things like shard-to-node allocation (maybe you want all of 2015 stored on one set of nodes, and 2014 on another), filter cache size, and the like is important, you lose that by going to a single-index/multi-mapping approach. For very high volume applications, that control might be significant. YMMV.
With regard to the "each index is tied to a different web application" sentiment, aliases can be (and are) used to collect multiple physical indices under a single searchable umbrella; you create one index per day/week/whatever, say, logs-20150730, logs-20150731..., and assign the logs alias to all of the indices in the series. The net effect is the same as having a single "index".
The nice part of the alias approach is that purging/pruning old data is trivial: just delete the index when its contents age out of whatever your data retention policy is. With multi-mappings, you'd have to delete the requisite mapping within the index (doable, but pretty I/O intrusive, since you'd likely be shoving stuff around inside every shard the mapping was distributed through).
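A minimal sketch of that daily-index-plus-alias pattern, assuming the older Python elasticsearch-py client (which accepts the ignore= keyword) and made-up index/alias names:

# Sketch: one index per day, all collected under a "logs" alias.
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch()                                   # assumes a local cluster
alias = "logs"
today = "logs-" + date.today().strftime("%Y%m%d")

es.indices.create(index=today, ignore=400)             # no-op if it already exists
es.indices.put_alias(index=today, name=alias)          # searches on "logs" now see it

# Pruning aged-out data is a whole-index delete, not a rewrite inside shared shards:
old = "logs-" + (date.today() - timedelta(days=30)).strftime("%Y%m%d")
es.indices.delete(index=old, ignore=[404])

Queries go against the logs alias, so the application never needs to know which physical indices currently back it.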

Related

Why use multiple ElasticSearch indices for one web application?

In asking questions about using ES for web applications, suggestions have been made to have one index for things like user profiles, another index for data, etc., and several other ones for logs.
With all of these on a cluster serving several web applications, it seems like things could get messy or disorganized.
In that case, are people using one cluster per application? I am a bit confused because when I read articles about indexing logs, they seem to refer to storing the data in multiple indices, rather than types within an index.
Secondly, why not have one index per app, with types for logs, user profiles, data, etc.?
Is there some benefit to using multiple indices rather than many types within an index for a web application?
-- UPDATE --
To add to this, the comments in this question, Elastic search, multiple indexes vs one index and types for different data sets?, don't seem to go far enough in explaining why:
data retention: for application log/metric data, use different indexes if you require different retention periods
Is that recommended because it's just simpler to delete an entire index rather than a type within an index? Does it have to do with the way the data is stored then space recovered after deleting the data?
I found the primary reason for creating multiple indices, and the one that satisfies my quest for an answer, in ElasticSearch's pagination documentation:
To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the requesting node, which then sorts all 50 results in order to select the overall top 10.
Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them!
You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query.

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, with custom attributes handled by creating one table that defines the custom attributes and another table that holds their values, which can then be joined to the existing contact records for the user. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere, I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I cannot rule out, is that a new database should be created for each user, with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Retrieval time does depend on the number of records, but there is a trade-off in that you can write efficient SQL queries to fetch the data. According to this site, you probably won't run out of memory or run into concurrency issues: with a good server the performance ought to be fine, provided your queries are efficient.
If you recreate the table set per user, you will just end up creating lots of tables, which can slow down indexing and is bad practice. Instead, stick with a single relational schema and normalize the database tables to improve efficiency.
Creating a new database for each and every user just combines the complexity of the two approaches above, resulting in shabby and disorganized database access. If you run individual database instances for every single user, you will end up consuming the server's physical resources (RAM and CPU), which will affect the quality of service for all the other users.
Take up option 1. Assign separate userIds and give them roles and privileges where needed. That is more efficient than the other two methods.
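For illustration only, here is a minimal sketch of the shape option 1 takes: one shared table set keyed by a user id, plus an attribute-definition table and an attribute-value table for the customizations. The names are invented, and sqlite3 is used purely so the sketch is self-contained:

# Option 1 sketch: shared tables keyed by user_id, EAV tables for custom fields.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,      -- owning CRM user
    name    TEXT NOT NULL
);
CREATE INDEX idx_contacts_user ON contacts(user_id);

-- definitions of the custom fields a user has added
CREATE TABLE custom_attributes (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    label   TEXT NOT NULL
);

-- one row per (contact, custom attribute) value
CREATE TABLE custom_attribute_values (
    contact_id   INTEGER NOT NULL REFERENCES contacts(id),
    attribute_id INTEGER NOT NULL REFERENCES custom_attributes(id),
    value        TEXT
);
""")

# Fetching one user's contacts plus custom values is a single indexed query:
rows = conn.execute("""
    SELECT c.name, a.label, v.value
    FROM contacts c
    LEFT JOIN custom_attribute_values v ON v.contact_id = c.id
    LEFT JOIN custom_attributes a       ON a.id = v.attribute_id
    WHERE c.user_id = ?
""", (42,)).fetchall()

With an index on user_id (and composite indexes as queries demand), per-user lookups remain index scans rather than full-table scans, which is the usual answer to the "millions of records" worry.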

How to organize primary keys for good locality?

I have a table for users and a table for documents. Documents have exactly one user as an owner, and for the application I'm building, I know that I will typically be accessing a group of documents associated with a single given user.
Let's say the average user has K documents, and certain common queries fetch all of the documents for a given user. I don't want my database (PostgreSQL) to have to do K disk seeks (on average) to fetch all the documents for a user. Ideally, the documents would be stored in contiguous blocks so that fetches would only require a few seeks.
Is it possible (and reasonable) to organize the document table schema to create such locality? I know that NoSQL implementations do this all the time. E.g., the BigTable paper talks about how row keys for web tables are assigned by URL, except that the URL is reversed, e.g. com.cnn.www, so that all the pages for CNN are located near each other in the data store. It doesn't appear possible to do something similar in Postgres, because tables cannot be index-organized, although it might be possible in MySQL with InnoDB. This post comes to a similar conclusion.
The command you're looking for is CLUSTER, but it has drawbacks. It completely rewrites the table when you run it, which requires a lock on it, so you may only want to do this when traffic is low. Also, Postgres will do nothing to keep rows in that order during INSERTs and UPDATEs, so your data will tend to fragment as the table is written to and you may have to recluster it regularly.
What you can also do is set a low fillfactor on the table, so that UPDATEs are more likely to keep a given row on the same page. This should prevent some fragmentation, which just leaves INSERTs, but with a low fillfactor INSERTs will tend to be placed on newer pages, and these will probably be commonly accessed enough to be kept in RAM. I'm making assumptions about your usage patterns which may be wrong, but regardless, your best course of action is probably to just recluster whenever you see I/O start to become a problem.
Finally, there's also a tool called pg_repack that can cluster a table without taking such a heavy lock, in a similar manner to how CREATE INDEX CONCURRENTLY works, but it's a third-party tool, so you'll want to experiment with it before running it in production.
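A short sketch pulling those suggestions together, assuming psycopg2 and invented names (a documents table whose owner column is indexed by documents_owner_idx):

# Sketch: physically order documents by owner and reduce later fragmentation.
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True
cur = conn.cursor()

# A lower fillfactor leaves slack on each page so UPDATEs tend to stay in place
# (affects pages written from now on).
cur.execute("ALTER TABLE documents SET (fillfactor = 80)")

# One-time rewrite in owner order; takes an exclusive lock, so run it off-peak.
cur.execute("CLUSTER documents USING documents_owner_idx")

# Postgres will not maintain this order, so recluster when I/O degrades;
# a bare CLUSTER reuses the index recorded by the previous run.
cur.execute("CLUSTER documents")

pg_repack would replace the blocking CLUSTER calls if that lock is unacceptable.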

Redis full text search: reverse indexing or Sunspot?

I have 3.5 million records (read-only) currently stored in a MySQL DB that I would like to pull out to Redis for performance reasons. So far, I've managed to store things like this in Redis:
1 {"type":"Country","slug":"albania","name_fr":"Albanie","name_en":"Albania"}
2 {"type":"Country","slug":"armenia","name_fr":"Arménie","name_en":"Armenia"}
...
The key I use here is the legacy MySQL id, so with some Ruby glue I can break as few things as possible in this existing app (and that is a serious concern here).
Now the problem is when I need to perform a search on the keyword "Armenia" inside the value part. It seems like there are only two ways out:
Either I multiply the Redis indexes:
id => JSON values (as shown above)
slug => id (reverse indexing based on the slug, that could do the basic search trick)
finally, another huge index specifically for autocomplete, as shown in this post: http://oldblog.antirez.com/post/autocomplete-with-redis.html
Or I use Sunspot or some other full-text search engine (unfortunately, I currently use ThinkingSphinx, which is too tied to MySQL :-( ).
So, what would you do? Do you think moving a single table from MySQL to Redis is even a good idea? I'm afraid of the memory footprint those gigantic Redis key/values could take on a server with 16GB of RAM.
Any feedback on a similar Redis usage ?
Before I start with a real answer, I wanted to mention that I don't see a good reason for you to be using Redis here. Based on the kinds of use cases you seem to be describing, something like elasticsearch would be more appropriate for you.
That said, if you just want to be able to search for a few different fields within your JSON, you've got two options:
Auxiliary index that points field_key -> list_of_ids (in your case, "Armenia" -> 1); a sketch of this appears below.
Use Lua on top of Redis with JSON encoding and decoding to get at what you want. This is way more flexible and space efficient, but will be slower as your table grows.
Again, I don't think either is appropriate for you because it doesn't sound like Redis is going to be a good choice for you, but if you must, those should work.
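Here is a rough sketch of the auxiliary-index option, assuming redis-py and the country records from the question; the key layout is made up for illustration:

# Option 1 sketch: JSON stored under the legacy id, plus sets mapping
# a searchable term to the ids that contain it.
import json
import redis

r = redis.Redis()

def store(record_id, doc):
    r.set(record_id, json.dumps(doc))
    # index a few fields; lowercase so lookups are case-insensitive
    for term in (doc["slug"], doc["name_en"].lower(), doc["name_fr"].lower()):
        r.sadd("idx:" + term, record_id)

def lookup(term):
    ids = r.smembers("idx:" + term.lower())
    return [json.loads(r.get(i)) for i in ids]

store(2, {"type": "Country", "slug": "armenia",
          "name_fr": "Arménie", "name_en": "Armenia"})
print(lookup("Armenia"))   # -> the Armenia record

Every indexed term costs extra memory, which is exactly the footprint concern raised in the question, so measure before committing to it.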
Here's my take on Redis.
Basically I think of it as an in-memory cache that can be configured to evict the least recently used data (LRU). That is the role I made it play in my use case, and the reasoning may help you think about yours.
I'm currently using Redis to cache results for a search engine backed by some complex (slow) queries against data in another DB (similar to your case). So Redis serves as a cache for answering queries: every query is either served from Redis or, on a cache miss, from the DB. Note that Redis is not replacing the DB here; in my case it is merely a caching extension of it.
This fit my specific use case, because the addition of Redis was supposed to assist future scalability. The idea is that repeated access of recent data (in my case, if a user does a repeated query) can be served by Redis, and take some load off of the DB.
Basically my Redis schema ended up looking somewhat like the index duplication you outlined above. I used sets and sorted sets to create "batches / sets" of Redis keys, each of which pointed to specific query results stored under a particular Redis key. And in the DB, I still had the complete data set and an index.
If your data set fits in RAM, you could do the "table dump" into Redis and get rid of the need for MySQL. I could see this working, as long as you plan for persistent Redis storage and for the possible growth of your data, if this "table" will grow in the future.
So depending on your actual use case, how you see Redis fitting into your stack, and the load your DB serves, don't rule out the possibility of having to do both of the options you outlined above (which is what happened in my case).
Hope this helps!
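For what it's worth, here is a sketch of that cache-aside role (the pattern, not the exact schema described above), assuming redis-py and an arbitrary run_query function standing in for the slow DB search:

# Cache-aside sketch: answer repeated queries from Redis, fall back to the DB.
import hashlib
import json
import redis

r = redis.Redis()
# Evict least-recently-used keys once memory fills (also requires maxmemory
# to be set in redis.conf).
r.config_set("maxmemory-policy", "allkeys-lru")

def cached_search(query, run_query, ttl=3600):
    key = "search:" + hashlib.sha1(query.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)             # served from Redis
    result = run_query(query)              # cache miss: hit the DB
    r.setex(key, ttl, json.dumps(result))  # cache for next time
    return result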
Redis does provide full-text search with RediSearch.
RediSearch implements a search engine on top of Redis. It also enables more advanced features, like exact phrase matching, auto-suggestions, and numeric filtering for text queries, that are not possible or efficient with traditional Redis search approaches.
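Assuming the RediSearch 2.x module is loaded, its commands can be driven from redis-py with execute_command; the index name and schema here are only an example built around the country records above:

# RediSearch sketch: index country hashes and run a full-text query.
import redis

r = redis.Redis()

# Store records as hashes so the module can index their fields
# (hset(..., mapping=...) needs redis-py >= 3.5).
r.hset("country:2", mapping={"slug": "armenia",
                             "name_en": "Armenia",
                             "name_fr": "Arménie"})

# Create the full-text index once.
r.execute_command("FT.CREATE", "countryIdx", "ON", "HASH",
                  "PREFIX", "1", "country:",
                  "SCHEMA", "name_en", "TEXT", "name_fr", "TEXT")

# Query it: terms, prefixes, exact phrases, numeric filters, etc.
print(r.execute_command("FT.SEARCH", "countryIdx", "Armenia"))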

Postgres SQL: Best way to check for new data in a database I don't control

For an application I am writing, I need to be able to identify when new data is inserted into several tables of a database.
The problem is twofold: this data will be inserted many times per minute into sometimes very large databases (and I need to be sensitive to demand / database polling issues), and I have no control over the application creating this data (so as far as I know, I can't use the notify / listen functionality available within Postgres for exactly this kind of task*).
Any suggestion regarding a good strategy would be much appreciated.
*I believe the application controlling this data uses the notify / listen functionality itself, but I haven't a clue how (if it is possible at all) to discover externally which "channel" it uses, or whether I could ever latch on to it.
Generally, you need something in the table that you can use to determine newness, and there are a few approaches.
A timestamp column would let you query for rows newer than the last time you checked, but you'd still have the application issue of storing that last-checked timestamp outside of your database, and data that isn't in the database means another realm of data to manage. Yuck.
A tracking table that stored last update/insert timestamps on a per-table basis could give you what you want. You'd want to use a trigger to maintain the last-DML timestamp.
A solution you don't want to use is a serial (integer) id that comes from nextval, for any purpose other than uniqueness. The standard/common mistake is to presume serial keys will be contiguous (they're not) or monotonic (they're not).
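A rough sketch of the trigger-maintained tracking table, assuming psycopg2, Postgres 9.5+ (for ON CONFLICT), permission to add triggers, and an invented orders table standing in for one of the watched tables:

# Sketch: a trigger stamps each watched table's last-DML time; the
# application polls one tiny table instead of scanning the big ones.
import psycopg2

SETUP_SQL = """
CREATE TABLE IF NOT EXISTS table_activity (
    table_name text PRIMARY KEY,
    last_dml   timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION record_dml() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_activity (table_name, last_dml)
    VALUES (TG_TABLE_NAME, now())
    ON CONFLICT (table_name) DO UPDATE SET last_dml = now();
    RETURN NULL;                     -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_dml AFTER INSERT OR UPDATE ON orders
    FOR EACH STATEMENT EXECUTE PROCEDURE record_dml();
"""

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute(SETUP_SQL)

# Polling then reads one small row per watched table:
with conn, conn.cursor() as cur:
    cur.execute("SELECT last_dml FROM table_activity WHERE table_name = %s",
                ("orders",))
    print(cur.fetchone())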