distribute by / sort by / cluster by clear-cut idea - Hive

I'm not getting a clear-cut idea of distribute by / sort by / cluster by in Hive.
My understanding is that multiple reducers are used when we use distribute by / sort by / cluster by in Hive, so that sorting happens faster.
But why are reducers needed for sorting columns at all? Sorting could be done by a map task, and it does not involve any aggregate function.
Does it have any relation to the clustered by / sorted by clauses that we use when we create a table?
The problem I'm facing is this:
select * from order_items cluster by order_item_order_id limit 10;
For the above query, the reducer count does not change even though I use the command
set mapreduce.job.reduce=4
It still remains 1.
There is a related post, but the answer given there does not clear up my doubts.
Thanks in advance.
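A note on the two things going on here, with a minimal HiveQL sketch (the property names are the standard Hadoop/Hive ones; verify them against your version). First, the MapReduce property is mapreduce.job.reduces, plural, so the singular form in the command above is silently ignored. Second, reducers are needed for sorting because each map task sees only its own input split; any ordering across splits requires shuffling rows to reducers.
-- a minimal sketch; property names are the standard ones, check your distribution
-- correct (plural) property name:
set mapreduce.job.reduces=4;
-- older property that Hive also honors:
set mapred.reduce.tasks=4;
-- distribute by: rows with equal keys go to the same reducer (no sorting)
-- sort by: each reducer's output is sorted (no global distribution guarantee)
-- cluster by x: shorthand for distribute by x sort by x
select * from order_items cluster by order_item_order_id;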

Related

sdiff - limit the result set to X items

I want to get the diff of two sets in redis, but I don't need to return the entire array, just 10 items for example. Is there any way to limit the results?
I was thinking something like this:
SDIFF set1 set2 LIMIT 10
If not, are there any other options to achieve this in a performant way, considering that set1 can contain millions of objects while set2 is much, much smaller (hundreds)?
More info on what you want to achieve would be helpful. Something like this might require you to duplicate your data, though I don't know if that is something you want.
An option is chunking them:
1. Create a set with a uniquely generated id that can hold a max of 10 items.
2. Create a sorted set like so:
zadd(key, timestamp, chunkid)
where the timestamp is a Unix time and the chunkid is the key that connects to the set from step 1. The key can be whatever name you would like it to be, or it could also be a uniquely generated id.
3. Use zrange to grab a specific one.
4. Repeat steps 1-3 for the second set.
Once you have one result from each of your sorted sets ("zsets"), you can do your sdiff using the chunkids.
Note that there are advantages and disadvantages to doing this, like more connection consumption (if calling from a client), and the obvious one being a little more processing. It will help immensely if you put this in a Lua script.
Hope this helps, or at least gives you an idea of how to model your data. If this is critical data, you might need an automated script of some sort to move your data around to meet the modeling requirement.
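A minimal redis-cli sketch of the scheme above; all key names (chunk:A:1, chunks:A, and so on) are hypothetical placeholders:
# Step 1: chunk sets holding at most 10 members each
SADD chunk:A:1 item1 item2 item3
SADD chunk:B:1 item2
# Step 2: index the chunks in sorted sets, scored by a Unix timestamp
ZADD chunks:A 1390000000 chunk:A:1
ZADD chunks:B 1390000000 chunk:B:1
# Step 3: grab one chunk id from each index
ZRANGE chunks:A 0 0
ZRANGE chunks:B 0 0
# Step 4: diff the two chunk sets by their ids
SDIFF chunk:A:1 chunk:B:1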

Alphabetical index with millions of rows in redis

For my application, I need an alphabetical index on a set with millions of rows.
When I use a sorted set, and give all members the same score, the result looks perfect.
Performance is also great: with a test set of 2 million rows, the last third does not perform noticeably worse than the first third of the set.
However, I need to query those results. For example, get the first (at most) 100 items that start with "goo". I played around with zscan and sort, but they do not give me a working and performant result.
Since Redis is very fast when inserting a new member into the sorted set, it must be technically possible to immediately (well, very quickly) seek to the right location. I suppose Redis uses some kind of ordered structure (a skip list, in fact) to accomplish this.
But I don't seem to be able to exploit that when I just want to query the data, and not write to it.
We use replicated slaves for read actions, and we prefer the (default) read-only config switch. So creating a dummy key and deleting it afterwards (however inelegant) is not really an option.
I'm a bit stuck, and I'm thinking about writing a ZLEX command in redis-server itself, which I could use like this:
HELP "ZLEX" -> (ZLEX set score startswith)
-- Query the lexicographical index of a sorted set, supplying a 'startswith' string.
127.0.0.1:12345> ZLEX myset 0 goo LIMIT 0 100
1) goo
2) goof
3) goons
4) goozer
What are your thoughts? Am I missing something in the standard redis commands?
We're using Redis 2.8.4 x64 on Debian.
Kind regards, TW
Edits:
Note: related issue: indexing-using-redis-sorted-sets -> At least the name I gave ZLEX seems to conform to Antirez' (Salvatore's) standards. As of 24-1-2014, I'm working on implementing ZLEX. It seems to be the easiest and most straightforward solution for this use case, and Antirez could merge it into the main branch for everyone's benefit.
I've implemented ZLEX.
Here are the full specs.
You can grab the new functionality from here: github tw-bert
I also posted a pull request to Antirez here.
Kind regards, TW
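For readers finding this later: lexicographic range queries on sorted sets did land in Redis core as the ZRANGEBYLEX family of commands (Redis 2.8.9 and later). Assuming all members share the same score, a prefix query equivalent to the ZLEX example above looks like this ([ is an inclusive bound, ( an exclusive one, and gop is the successor string of the prefix goo):
127.0.0.1:12345> ZRANGEBYLEX myset [goo (gop LIMIT 0 100
1) "goo"
2) "goof"
3) "goons"
4) "goozer"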
Have you had a look at this?
It can be useful depending on the length of the field by which you sort: this method requires b*(a^2) keys, where a is the length of the field and b is the number of rows for that field.

Solr sort different criteria for each subset

We are using Apache Solr for full-text search. We have a specific requirement for sorting the search results: when querying for data, we need two sets of data, A and B, but each set should have its own sorting criteria, and we cannot make two different calls. We can get the two sets by using an OR condition, but how do we sort each set differently? To illustrate, if:
Set A = {3,1,2}
Set B = {8,5,9}
So the expected response can have set A returned in ascending order {1,2,3}, while set B is returned in descending order {9,8,5}.
I believe the default sort in Solr will sort the entire result set. Any suggestions? If the question is not clear, let me know.
You can possibly achieve this using FieldCollapsing.
You might need to do a little more work, i.e. have a display-order field (it could be an integer) so that Solr has a single field it can sort by.
Next you could use a query like this:
q=*:*&group=true&group.field=set&group.sort=display_order asc
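A hedged sketch of how that single query can yield different per-set orders (the field names set, value, and display_order are hypothetical): index set A documents with display_order equal to the value, and set B documents with the negated value, so one ascending sort orders each group as desired:
{"id": "a1", "set": "A", "value": 3, "display_order": 3}
{"id": "a2", "set": "A", "value": 1, "display_order": 1}
{"id": "a3", "set": "A", "value": 2, "display_order": 2}
{"id": "b1", "set": "B", "value": 8, "display_order": -8}
{"id": "b2", "set": "B", "value": 5, "display_order": -5}
{"id": "b3", "set": "B", "value": 9, "display_order": -9}
Grouping on set and sorting each group by display_order asc then returns group A as {1,2,3} and group B as {9,8,5}.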
I would recommend keeping logic such as this out of Solr. It isn't meant to be a substitute for relational databases, and getting it to do complex SQL-like operations (while some are possible) is going to be tricky.
By the way, there is an open issue in Solr's JIRA that addresses batch processing of multiple queries, which means that once it is merged into a release, you could fire n different queries to fetch these sets in one call to Solr.
If you are keen to have Solr perform this task for you, the patch is available in the JIRA card; you could create a build for yourself and let us all know how it goes :)

Prevent output for query --destination_table command

Is there a way to prevent screen output for bq query --destination_table?
I want to move data sets through the workflow, but not necessarily see all the rows.
bug on job_73d3dffab7974d9db360f5c31a3a9fa7
This is a known issue; we'll fix it in the next version of bq. To work around it, you can add --max_rows=0. This only changes the number of rows that get sent back, not the number of rows returned by the query (you can use LIMIT N in the query for that).
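A hedged example of the workaround (the dataset and table names are placeholders; the query uses the public shakespeare sample):
bq query --destination_table=mydataset.mytable --max_rows=0 \
    'SELECT word FROM [publicdata:samples.shakespeare] LIMIT 1000'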

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
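For instance, in a query like the following (the table and column names are hypothetical), every row of t has to be scanned before even the first group can be emitted, and the LIMIT merely trims the already-computed output:
-- all rows must be read to compute the counts;
-- limit 5 applies to the finished, aggregated output
select category, count(*) as cnt
from t
group by category
limit 5;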
One thing you may consider is sampling your tables down. This is good practice in data analysis in general, to get your iteration speed up when you're writing code.
For example, suppose you have table-creation privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, 1013) = 1;  -- 1013: some large prime
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you would get on the full data. Note that the number you mod by should be prime; otherwise patterned IDs can bias the sample. The example above shrinks your table down to about 0.1% of the original size (0.0987%, to be exact).
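A quick sketch of the inner-join idea (other_table and its columns are hypothetical):
-- test expensive joins against the 0.1% sample instead of all of X
select s.unique_id, s.data_value, o.other_value
from sample_table s
inner join other_table o
  on o.unique_id = s.unique_id;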
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs the whole result set before producing output, such as might happen for queries with GROUP BY, ORDER BY, or HAVING clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result as an attribute of the database handle rather than the default mysql_store_result. This is true for the Perl and Java interfaces; in the C API, you call mysql_use_result() instead of mysql_store_result() to fetch rows as the server produces them.
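A minimal Perl DBI sketch of the unbuffered approach, assuming DBD::mysql (the connection parameters and table name are placeholders):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder credentials. mysql_use_result makes the driver stream rows
# from the server instead of buffering the whole result set client-side.
my $dbh = DBI->connect('DBI:mysql:database=test;host=localhost',
                       'user', 'password', { RaiseError => 1 });
my $sth = $dbh->prepare('select id, payload from big_table',
                        { mysql_use_result => 1 });
$sth->execute();

# Rows become available as the server produces them, so early results
# can be inspected while the query is still running.
while (my @row = $sth->fetchrow_array) {
    print join("\t", @row), "\n";
}
$dbh->disconnect;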