I have two huge regions, REGION-A and REGION-B (200 million rows in each region).
Now I am checking to see whether both regions are identical or not.
Also, I want to return the mismatches. Is there any way in GemFire to achieve this?
No, there's no way to achieve this in GemFire out of the box.
You'll need to implement the logic yourself, iterating through the keys and doing the value comparison manually, either inside a custom function or from the client side.
Another approach, maybe less impactful in terms of performance, would be to take a snapshot of both regions and compare the contents offline.
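As a rough illustration of the client-side variant, a minimal sketch (the locator address, key/value types, and package names are assumptions; Apache Geode uses org.apache.geode, while older GemFire releases use com.gemstone.gemfire instead):

import java.util.HashMap;
import java.util.Map;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;

public class RegionDiff {
    public static void main(String[] args) {
        // Connect to the cluster; locator host/port are assumptions.
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("localhost", 10334)
                .create();
        Region<Object, Object> a = cache.getRegion("REGION-A");
        Region<Object, Object> b = cache.getRegion("REGION-B");

        Map<Object, Object> mismatches = new HashMap<>();
        // keySetOnServer() pulls every key to the client, which is very
        // expensive at 200 million entries; a server-side function would
        // avoid the per-key round trips.
        for (Object key : a.keySetOnServer()) {
            Object valueA = a.get(key);
            Object valueB = b.get(key);
            if (valueB == null || !valueB.equals(valueA)) {
                mismatches.put(key, valueA);
            }
        }
        // A second pass over b.keySetOnServer() would catch keys that
        // exist only in REGION-B.
        System.out.println("Mismatched entries: " + mismatches.size());
        cache.close();
    }
}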
Hope this helps.
Cheers.
Suppose I am searching against two types, [cars] and [buildings], and I want the results to be separated. Is there a way one can group results by type?
I understand one simple way is to query each type separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way, or a hacky way (like using sort), to achieve this?
This type of grouping behavior is (currently) not available in elasticsearch. It has been a long-standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying and then bucket those yourself.
Using multi-search. This allows you to easily pass down some number of queries in a single batch, but will have potential scaling problems if the number of queries gets too large.
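For the multi-search route, a sketch of the raw request (the index name and sizes are made up; each query's hits come back as a separate element of the responses array, in order):

GET /my_index/_msearch
{}
{"query": {"term": {"_type": "cars"}}, "size": 10}
{}
{"query": {"term": {"_type": "buildings"}}, "size": 10}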
Result grouping is one feature that Solr has and elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated into groups of documents, you're going to have to restructure your documents, since elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add a different type like [price_aggregator] and put the fields [type] and [price] into it. Then you could easily build your query against just one type. This requires some additional work while indexing, and more memory to store the index, but it's much more performant when you search.
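For illustration, each [price_aggregator] document might then be as small as this (the source_id field linking back to the original document is a made-up name):

{"type": "cars", "price": 18500, "source_id": "car-42"}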
I have just begun working with Redis and have run into a problem. I have a list of users.
I have a page that displays the list of users, with pagination, sorting, and filtering by name, address, and so on. How can I design the Redis keys and values to make this easy?
Redis is not exactly suited to SQL-like usage. What I mean is that with Redis, you usually get data out the same way you put it in.
If you don't need too much filtering, or only limited filtering, a list of users with pagination can be a good use case for the sorted set data type, where you have your user IDs as values and the unix time as score. If you need another listing sorted by a different field, you'll likely need an additional sorted set, and so forth.
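A minimal sketch of the pagination part using the Jedis client (the key name, host, and using signup time as the score are all assumptions):

import redis.clients.jedis.Jedis;

public class UserPages {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Score is each user's signup time as a unix timestamp.
            jedis.zadd("users:by_signup", 1700000000, "user:1");
            jedis.zadd("users:by_signup", 1700000500, "user:2");
            jedis.zadd("users:by_signup", 1700001000, "user:3");

            // Page 0, two users per page, newest first.
            long page = 0, perPage = 2;
            System.out.println(jedis.zrevrange("users:by_signup",
                    page * perPage, page * perPage + perPage - 1));
            // Prints [user:3, user:2]
        }
    }
}

For a name-sorted listing, the usual trick is a second sorted set where every member has score 0, paged lexicographically with ZRANGEBYLEX.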
As far as filtering is concerned, you may do it application-side, fetching ranges from the sorted set and dropping the non-matching items if they are sparse. However, you can see how this will not scale if your filter selects 10 elements out of millions.
So the applicability of Redis to your use case depends on the exact details; in general, it sounds like you may want a database more suitable for complex queries, even if you are likely going to pay a performance price.
For example, in order to provide an effective way to query respondents' answers to a dynamic questionnaire, where responses are stored as keyword/response pairs.
I am aware that there may be some latency in updating the catalogue/text index as new entries are added, but this may not matter if reporting/querying is not a real-time concern (i.e. it is performed at some later date).
So, in answer to my own question, the transactional aspect of this doesn't actually matter, does it?
I would distinguish between data consistency in the selected storage and the gap between data arriving and appearing in search results for the user. You might use external or even remote search solutions for your application, and the index update might take significant time, depending on the case.
I am writing an application that uses MySQL, and I need to return the difference between 2 dates. Should MySQL do this, or should I let PHP handle it?
I also just need the sum of all the results I am getting back. Should I return them and add them up on the PHP side, or is there a way to add all the results together on the MySQL server side?
It depends somewhat on the application, but in general I'd push it to PHP: you're normally building a web site for multiple concurrent accesses, so why put the calculation into the database and create a potential bottleneck?
I think that you have two separate cases here. In the case where you are returning two values and performing a calculation on them, then doing that on the front end probably makes the most sense as long as it's not a complex calculation that requires significant business logic. If it does involve complex or specialized business logic then you should have a central place for that logic, whether it is in a business layer or in the database, so that it is done consistently. If you're just finding the difference between two dates or something, then just do it on the front end.
In the second case, where you are summing values, that sounds like something that should probably be done in the database. Networks tend to be much more of a bottleneck than modern databases on today's hardware. Avoid sending a bunch of rows over the network just to add them up when you can do it in the database.
There's no good answer to that. I personally do as much as possible in SQL but mostly because I can, not because I should.
And yes, you can ask MySQL to calculate a sum:
SELECT id, SUM(price) FROM items GROUP BY id WITH ROLLUP
The last row, the one where id is NULL, will contain the sum over all the rows.
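For the date half of the question, MySQL can compute the difference directly too; DATEDIFF, for instance, returns the number of days between two dates (TIMESTAMPDIFF offers other units):

SELECT DATEDIFF('2024-02-01', '2024-01-01'); -- 31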
If it's thousands of different numbers, I'd try to do it on the database side. What Charlie said is pretty much the usual approach. I often do calculations in the database and add them as an additional column in case I need server-side sorting, but that is obviously not your case.
I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
1. Always query for exactly the rows I need with a WHERE clause that's different all the time.
2. Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly.
3. Always query the whole table (without a WHERE clause), let the SQL server handle the cache (it's always the same query, so it can cache the result), and filter the output as needed.
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, option 1 is the clear winner to me. If it were <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, option 1 is the best bet.
If you were pulling the same set of data over and over each time, then caching the results might be a better bet, but when you are going to have a different WHERE clause all the time, it is best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
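For instance, assuming a hypothetical users table that is usually filtered on last_name and city (both names are made up), a composite index would look like:

CREATE INDEX idx_users_lastname_city ON users (last_name, city);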
Seems to me that a system designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developer's code. On the other hand, one factor you don't mention is the location or potential location of the database server relative to the application: returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that in the general case I'd recommend querying for just what you want, but that in special circumstances other options may be better.
I firmly believe option 1 should be preferred initially.
When you encounter performance problems, you can look at how to optimize using caching. ("Premature optimization is the root of all evil," as Knuth said.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
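For example, in MySQL the plan is shown with EXPLAIN (PostgreSQL and others have equivalents); the table and column here are hypothetical:

EXPLAIN SELECT * FROM users WHERE last_name = 'Smith';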
First of all, let us dismiss #2. Searching tables is a data server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is being done. If it's in the application code, then you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if it identifies each row individually, e.g. WHERE id = 3 OR id = 4 OR id = 32 OR ...).
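Even then, such a clause is usually written more compactly with IN; a made-up example:

SELECT * FROM items WHERE id IN (3, 4, 32);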
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and did not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise).
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
If you do this:
SELECT * FROM users;
MySQL first has to look up the table's field list and only then bring back the data you asked for.
Doing
SELECT id, email, password FROM users;
MySQL goes straight to the data, since the fields are explicit.
About limits: it is always best to query exactly the number of rows you will need, no more, no less. More data means more time to transfer it.
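For example, to fetch only the third page of 20 users with the explicit field list from above:

SELECT id, email FROM users LIMIT 20 OFFSET 40;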