The hash table (HT) does not rehash.
We use the simple division method as the hash function.
We assume the hash function distributes the entries evenly.
The goal is O(1) insertion, deletion and find.
The optimal number of buckets is a compromise between memory consumption and hash collisions, for intended usage patterns.
For example, if something is used very frequently, you might limit the size of the hash table to half the size of a CPU's cache to reduce the chance of a cache miss when accessing the hash table; this can be faster than using a larger hash table (with more cache misses but a lower chance of hash collisions). Alternatively, if it's used infrequently (and you therefore expect cache misses regardless of hash table size), a larger size is more likely to be optimal.
Of course real systems have multiple caches (L1, L2, L3) plus virtual memory translation caches (TLBs) plus RAM limits (plus swap space limits); real software has more than just one hash table competing for resources in the memory hierarchy; and often the software developers have no idea what other processes might be running (competing for physical RAM, polluting caches, etc) or what any end user's hardware is (sizes of caches, etc). All of this makes it virtually impossible to determine "optimal" with any method (including extensive benchmarking).
The only practical option is to take an educated guess based on various assumptions (about usage, the amount of data and how good the hashing function will be in practice, the CPU, the other things that might be using CPUs and memory, ...); and make the source code configurable (e.g. #define HASH_TABLE_SIZE ..) so you can easily re-assess the guess later.
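As a rough illustration of the setup in the question, here is a minimal sketch of a fixed-size table that never rehashes and uses the division method. It assumes non-negative integer keys and separate chaining for collisions (the question does not say how collisions are handled), and the names `FixedHashTable` and `HASH_TABLE_SIZE` with its value of 1024 are placeholders, not a recommendation.

```python
# Hypothetical default bucket count; this is the "educated guess" to revisit later.
HASH_TABLE_SIZE = 1024


class FixedHashTable:
    """Separate-chaining hash table with a fixed bucket count (no rehash)."""

    def __init__(self, size=HASH_TABLE_SIZE):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Division method: h(k) = k mod m, where m is the bucket count.
        return key % self.size

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def find(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False
```

Each operation only walks one bucket, so it stays close to O(1) as long as the chains stay short; with a fixed bucket count the chains grow linearly once the number of entries far exceeds the table size.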
As far as I know, a wide-column model is not applicable here.
But does it make a difference in efficiency to put big data into a node?
I'd like to add an index to distinguish the values, and want to know how efficient that is.
"There are a few efficiency considerations when putting big data into nodes (more precisely, into properties). Search filtering may become slower because the search scope and size increase. Changes may incur overhead such as WAL (write-ahead log) writes.
It's difficult to tell how much big data you have, but I think you should save it as a file and store its description as a property. Information saved in a property can be accessed more cheaply by creating a separate property index."
Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of
these (such as indexes) are less important due to Spark SQL’s
in-memory computational model. Others are slotted for future releases
of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use
case the data size far exceeds the size of available memory.
Assuming this is not uncommon, what is meant by "Spark SQL’s
in-memory computational model"? Is Spark SQL recommended only for
cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large
dataset can take a long time. I read this argument against
indexing in in-memory databases, but I was not convinced. The example
there discusses a scan of a 10,000,000 records table, but that's not
really big data. Scanning a table with billions of records can cause
simple queries of the "SELECT x WHERE y=z" type to take forever instead
of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in Spark's scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
If the data source supports indexing, Spark can use it indirectly through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems.
That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.
Of course this raises a question: if you require efficient random access, why not use a system which is designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
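To make the data skipping idea more concrete, here is a generic sketch in plain Python. It is not how Databricks or Spark implement it; it only illustrates the principle of keeping min/max statistics per chunk of sorted data so that a query of the "SELECT x WHERE y=z" kind can skip chunks that cannot contain a match. The names `Chunk`, `build_chunks` and `select_x_where_y` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    rows: list   # rows stored together, each a dict with "x" and "y"
    min_y: int   # smallest y in this chunk
    max_y: int   # largest y in this chunk


def build_chunks(rows, chunk_size):
    """Sort by y, split into chunks and record min/max of y per chunk."""
    ordered = sorted(rows, key=lambda r: r["y"])
    chunks = []
    for start in range(0, len(ordered), chunk_size):
        part = ordered[start:start + chunk_size]
        chunks.append(Chunk(part, part[0]["y"], part[-1]["y"]))
    return chunks


def select_x_where_y(chunks, z):
    """Return (matching x values, number of chunks actually scanned)."""
    result, scanned = [], 0
    for chunk in chunks:
        if z < chunk.min_y or z > chunk.max_y:
            continue  # the statistics rule this chunk out, so skip it
        scanned += 1
        result.extend(r["x"] for r in chunk.rows if r["y"] == z)
    return result, scanned
```

The statistics are tiny compared to the data and only need to be computed once when the data is laid out, which is why this fits a write-once, read-many workload better than a classic per-row index.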
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriously. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
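For anyone who wants to check the arithmetic, here is a small sketch using only the figures quoted above (350 head movements per second, about 100 MB per second, 6 head movements per normalized record). The exact result is a little under the rounded 300 KB used in the text, which is why 6 seeks come out slightly below 1.8 MB.

```python
SEEKS_PER_SECOND = 350                       # head movements per second, from the text
SEQUENTIAL_BYTES_PER_SECOND = 100 * 1000**2  # about 100 MB/s, from the text

# Data that could have been streamed in the time one head movement takes.
bytes_per_seek = SEQUENTIAL_BYTES_PER_SECOND / SEEKS_PER_SECOND

# A normalized customer record that costs 6 head movements to assemble.
SEEKS_PER_NORMALIZED_RECORD = 6
break_even_bytes = SEEKS_PER_NORMALIZED_RECORD * bytes_per_seek

print(f"{bytes_per_seek / 1000:.0f} KB per seek")     # prints "286 KB per seek"
print(f"{break_even_bytes / 1e6:.2f} MB for 6 seeks")  # prints "1.71 MB for 6 seeks"
```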
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
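The partitioning idea above can be sketched in a few lines. This is a toy, in-memory illustration, assuming records are dicts with a `customer_id` field; `NUM_PARTITIONS`, `partition_for`, `partition_records` and `customer_history` are hypothetical names, and a real system would write each partition to its own contiguous file or block.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(customer_id):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(customer_id.encode("utf-8")) % NUM_PARTITIONS


def partition_records(records):
    """Group records so that each customer lives in exactly one partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_for(record["customer_id"])].append(record)
    return partitions


def customer_history(partitions, customer_id):
    """Read one partition sequentially and keep only this customer's rows."""
    part = partitions.get(partition_for(customer_id), [])
    return [r for r in part if r["customer_id"] == customer_id]
```

Finding a customer's full history touches exactly one partition, which corresponds to the "one disk head move" followed by a linear read.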
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records, stored in a way that optimizes those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
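The "store the record in multiple locations" idea can be sketched like this. The dictionaries stand in for separate on-disk layouts, and the field names (`region`, `sales_rep`), the sample customers and the helper `materialize_by` are invented for the example.

```python
from collections import defaultdict


def materialize_by(records, key):
    """Write a full copy of every record, grouped by the given field."""
    copy = defaultdict(list)
    for record in records:
        copy[record[key]].append(dict(record))  # a real copy, not a pointer
    return dict(copy)


customers = [
    {"name": "Acme", "region": "EU", "sales_rep": "kim"},
    {"name": "Bolt", "region": "US", "sales_rep": "kim"},
    {"name": "Core", "region": "EU", "sales_rep": "lee"},
]

# One denormalized copy per anticipated search, instead of one copy plus indexes.
by_region = materialize_by(customers, "region")
by_sales_rep = materialize_by(customers, "sales_rep")

print([c["name"] for c in by_region["EU"]])      # prints ['Acme', 'Core']
print([c["name"] for c in by_sales_rep["kim"]])  # prints ['Acme', 'Bolt']
```

Each search reads one pre-grouped copy from start to end; the price is the extra storage and having to rewrite every copy when the data is reloaded.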
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.
I am trying to understand how indexing can be optimized on Elasticsearch. Let me clarify my needs:
I have two indices right now. Let's say indexA and indexB (the two indices can be considered approximately the same size).
I have 6 machines dedicated to Elasticsearch (we can say exactly the same hardware).
The most important part of my Elasticsearch usage is writing, since I am doing heavy writing in real time.
So my question is, how can I optimize the write operations using those 6 machines?
Should I separate the machines into two parts, like 3 machines for indexA and 3 machines for indexB?
or
Should I use all 6 machines to index both indexA and indexB?
and
What else should I pay attention to in order to optimize write operations?
Thank you in advance
It depends, but let me point you in a direction based on your problem statement, which leads to the following assumptions:
you want to do more write operations (not worried about search performance)
both the indices are in the same cluster
in future more systems can get added
For better indexing performance, the first thing you may want is a single shard per index (unless you are using routing). But since you have 6 servers, a single shard would waste resources, so you can assign 3 shards each to indexA and indexB. This is for the current scenario; around 10 shards is recommended for future scalability, depending on your data size.
Turn off replicas if possible, as index requests wait for the replicas to respond before returning. In a production environment, though, it is highly recommended to have at least one replica for high availability.
Set the refresh interval to "-1", or at least to a larger value such as "30m". (You will lose near-real-time (NRT) search if you do so, but as you mentioned, your concern is indexing.)
Turn off index warmers if you have any.
Avoid using "doc_values" in your field mappings. (Although they reduce the memory footprint at search time, they increase indexing time because field values are prepared during indexing.)
If they are not required, disable "norms" in your mapping.
Lastly read this.
Word of caution: some of the approaches above will impact your search performance.
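As a rough sketch, the shard, replica and refresh suggestions above could be expressed as an index settings body like the one built below. The setting names `number_of_shards`, `number_of_replicas` and `refresh_interval` are standard Elasticsearch index settings; the helper name `settings_body` and the chosen values are only assumptions taken from this answer, and how the body is sent to the cluster (for example when creating the index) is left out on purpose.

```python
import json


def settings_body(shards=3, replicas=0, refresh_interval="-1"):
    """Build a write-oriented index settings body as a plain dict."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,            # 3 per index on 6 machines
                "number_of_replicas": replicas,        # 0 trades availability for write speed
                "refresh_interval": refresh_interval,  # "-1" disables periodic refresh
            }
        }
    }


# Same settings for both indices from the question.
for index_name in ("indexA", "indexB"):
    print(index_name, json.dumps(settings_body()))
```

Once the heavy write phase is over, the replica count and refresh interval can be raised again with other argument values, since both are dynamic settings; the number of shards is fixed when the index is created.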
My domain object has 20 properties (columns, attributes, whatever you call them) and simple relationships. I want to index 5 properties for full-text search and 3 for sorting. There might be 100,000 records.
To keep my application simple, I want to store the fields in a Lucene index file to avoid introducing a database. Will there be a performance problem?
Depending on how you access stored fields, they may all be loaded into memory (basically, if you use a FieldCache everything will be cached into memory after the first use). And if you have a gig of storage which is taking up memory, that's a gig less to use for your actual index.
Depending on how much memory you have, this may be a performance enhancement, or a performance detriment.