I am fairly new to Hadoop and I need help modelling a MapReduce job.
I have two groups of files: GroupA and GroupB. Both groups have the same structure: one key,value pair per line, and both share the same set of keys. However, the values in the two groups represent different properties. The files are large enough that Hadoop seemed like the right option.
The task is to combine the properties from group A and group B for each individual key into a third property for that key, and then to sum up that third property across all the keys.
At first glance, the job seems to break down like this:
Map -> collect the key-value pairs from both groups of files.
Combine/partition/sort/shuffle -> group the entries with the same key into the same partition so they end up at the same reducer (handled internally by Hadoop).
Reduce -> combine the values for each key into the third property and write the results to the output files.
I am not sure how to model the third step of adding up the third property across the keys. One way I can think of is to have another MapReduce job after this one, which takes these files and combines them through a single reducer into the final result. Is this the right way to model it? Is there any other way I could model this? Is it possible to chain reducers, along the lines of map -> reduce -> reduce?
In Hadoop this would be modelled as two MapReduce jobs triggered one after the other. If you use Spark on top of Hadoop, you can instead finish with an aggregating action (such as count, or reduce/sum for a total) invoked after the map/reduce transformations to get the final output within a single application.
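For illustration, here is a minimal Spark (Java) sketch of the whole pipeline under some assumptions: the HDFS paths are hypothetical, and the "third property" is taken to be simply the product of the two values. The join plays the role of the per-key reduce, and the final sum plays the role of the second, global reduce.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CombineAndSum {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("combine-and-sum"));

        // Parse "key,value" lines from each group into (key, value) pairs.
        JavaPairRDD<String, Double> groupA = sc.textFile("hdfs:///data/groupA")
            .mapToPair(line -> {
                String[] parts = line.split(",");
                return new Tuple2<>(parts[0], Double.parseDouble(parts[1]));
            });
        JavaPairRDD<String, Double> groupB = sc.textFile("hdfs:///data/groupB")
            .mapToPair(line -> {
                String[] parts = line.split(",");
                return new Tuple2<>(parts[0], Double.parseDouble(parts[1]));
            });

        // Per-key "reduce": join the groups by key and derive the third property.
        // Global "reduce": sum the third property over all keys in a single action.
        double total = groupA.join(groupB)
            .mapToDouble(kv -> kv._2()._1() * kv._2()._2())
            .sum();

        System.out.println("Total of third property: " + total);
        sc.stop();
    }
}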
The task is this: I have several models, and after each run of a model I would like to increment a processing counter by 1 in a separate table (which always contains a single row). Please give me a hint on the best way to do this.
I think this is a job for post_hook but I'd like a concrete example.
I wrote some code to store a graph in RedisGraph. Initially it stores a single graph, but if I execute the same code a second time it stores the same graph again without replacing the previous one, so now I have two copies of the same graph under a single key in the database. I don't want any duplicate graph or any duplicate node; if I execute the same code again it should replace the previous graph. How can I do that?
If your code consists of a series of CREATE commands (whether through Cypher or one of the RedisGraph clients), running it twice will duplicate all of your data. This is not to say that the key stores two graphs; rather, it is one graph with every entity repeated.
If you would like to replace an existing graph, you should delete the existing graph first. You can delete a graph using a Redis command:
DEL [graph key]
Or a RedisGraph command:
GRAPH.DELETE [graph key]
The two are functionally identical.
Conversely, if you want to update an existing graph without introducing duplicates, you should use the MERGE clause as described in the RedisGraph documentation.
You can use the MERGE clause to prevent inserting duplicate data.
Below is a query to remove duplicate records from the existing data:
MATCH (p:LabelName)
WITH p.id as id, collect(p) AS nodes
WHERE size(nodes) > 1
UNWIND nodes[1..] AS node
DELETE node
MERGE works like a find-or-create: if the node, edge or path you describe does not exist, it will be created; otherwise the existing one is matched.
That's the recommended way to avoid duplicate entities if they are not permitted.
We would like to run computations on a large, partitionable dataset of 'products' in Ignite (100,000+ products, each linked to a large amount of extra data in different caches). We have several use cases:
1) Launch a compute job, limited to a large number (100's) of products, with a strong focus on responsiveness (<200ms). We can use the product ID as an affinity key to collocate all extra data with the products. But affinityRun only allows a single key to be specified, which would mean we need to launch 100's of compute jobs. Ideally we would be able to do an affinityRun on the entire set of product IDs at once, and let Ignite distribute the compute job to the relevant nodes, but we struggle to find a way to do this. (The compute job would then use local queries only on those compute nodes.)
2) Launch a compute job over the entire space of products in an efficient manner. We could launch the compute job on each compute node and use local queries, but that would no longer give us the benefits of falling back to backup partitions in case a primary partition is unavailable. This is an extreme case of problem number 1, just with a huge (all) number of product IDs as input.
We've been brainstorming about this for a while now, but it seems like we're missing something. Any ideas?
There is a version of affinityRun that takes a partition number as a parameter. Distribute your task per partition, and each node on the receiving end will process the data residing in that partition (just run a scan query for that partition). In case of failure, you simply restart the process for that partition and can filter out already-processed items with custom logic.
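A rough Java sketch of that approach, assuming a cache named "products" and leaving the per-entry processing as a placeholder: one affinityRun per partition, routed to the node that currently owns that partition, with a local scan query over just that partition.

import java.util.Collections;
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

public class PerPartitionCompute {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        int partitions = ignite.affinity("products").partitions();

        for (int part = 0; part < partitions; part++) {
            final int p = part;
            // Routed to the node that currently owns partition p.
            ignite.compute().affinityRun(Collections.singletonList("products"), p, () -> {
                IgniteCache<Long, Object> cache = Ignition.localIgnite().cache("products");
                // Scan only the local copy of this partition.
                try (QueryCursor<Cache.Entry<Long, Object>> cur =
                         cache.query(new ScanQuery<Long, Object>().setPartition(p).setLocal(true))) {
                    for (Cache.Entry<Long, Object> e : cur) {
                        // ... process e.getKey() / e.getValue() here ...
                    }
                }
            });
        }
    }
}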
An affinity job is simply one that executes on the data node where the key/value resides.
There are several ways to send a job to a particular node, not only via an affinity key. For example, you can target a node by its consistent ID, and in 2.4.10 (if I remember correctly) they added a way to query backups explicitly.
Regarding your scenario, I can think of the following solution:
SqlFieldsQuery query = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
You can prepare an affinity job with the above SQL, which selects all products from that node only; iterate over them and run all further queries locally to gather the product information. Send that job to each required node, do your computation there, then reduce the results and return them to the client.
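A rough sketch of how that could be wired up, reusing the query above and assuming the cache is also named CacheTable; the per-node computation is a placeholder. The job is broadcast to the server nodes, each copy runs the SQL locally, and the client reduces the partial results.

import java.util.Collection;
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.lang.IgniteCallable;

public class LocalSqlBroadcast {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Run the callable once on every server node; each copy only sees local data.
        Collection<Integer> partials = ignite.compute(ignite.cluster().forServers())
            .broadcast((IgniteCallable<Integer>) () -> {
                Ignite local = Ignition.localIgnite();
                SqlFieldsQuery q = new SqlFieldsQuery("select productID from CacheTable").setLocal(true);
                List<List<?>> rows = local.cache("CacheTable").query(q).getAll();
                // ... fetch the extra product data and compute here, all with local queries ...
                return rows.size(); // placeholder partial result for this node
            });

        // Reduce the per-node partial results on the client.
        int total = partials.stream().mapToInt(Integer::intValue).sum();
        System.out.println("Processed products: " + total);
    }
}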
I want all nodes in a cluster to carry an equal data load. With the default affinity function it is not happening.
As of now, we have 3 nodes. We use the group ID as the affinity key, and we have 3 group IDs (1, 2 and 3). We also limit the cache partitions to the group IDs, so overall nodes = group IDs = cache partitions, and each node holds an equal number of partitions.
Would it be okay to write a custom affinity function? What would we lose by doing so? Has anyone written a custom affinity function?
The affinity function doesn't guarantee an even distribution across all nodes. It's statistical... and three values isn't really enough to make sure the data is "fairly" distributed.
So, yes, writing a new affinity function would work. The downsides are that you need to make it fast (it's called a lot) and that you'd be hard-coding it to your current node topology. What happens when you choose to add a new node? What happens when a node fails? Also, you'd potentially be putting all your data into three partitions, which makes it harder to scale out (one of the main advantages of Ignite's architecture).
As an alternative, I'd look at your data model. Splitting your data into three chunks is too coarse for things to work automatically.
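As a sketch of that alternative (the cache name, key type and partition count here are assumptions): key the entries by something with high cardinality, such as a per-record ID, and keep the default RendezvousAffinityFunction with its default 1024 partitions instead of collapsing everything into three.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

public class EvenDistributionSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("records");
        // Keep many partitions (1024 is the default) rather than limiting them to the 3 group IDs.
        cfg.setAffinity(new RendezvousAffinityFunction(false, 1024));
        cfg.setBackups(1);

        IgniteCache<Long, String> cache = ignite.getOrCreateCache(cfg);
        // With per-record keys, the default affinity spreads partitions (and data) evenly across nodes.
        cache.put(42L, "record-42");
    }
}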
I have to inner join two relational tables extracted from Oracle.
Actually I want to perform a 1-to-1 join to get one row per primary key, with the values from the second table aggregated into a list. So before joining the two tables 1-to-1, I have to reduce all the rows for a key down to one row, with the values kept in a list.
Here is an illustration of what I need:
(image: tables aggregation)
And here I've run into a problem: when do I stop the aggregation for a key and pass the aggregated entity to the next step? Spark offers solutions for this in the form of window intervals and watermarking for late data, so the assumption underpinning data consistency is the time at which the data is received. That is feasible and appropriate for unbounded datasets, but in my case I know exactly how many rows need to be aggregated for each key. For example, for customer_id 1000 I know there are exactly 3 products, and after I have aggregated 3 products I know I can stop the aggregation and move on to the next step of my streaming pipeline. How can this be implemented using Spark and streaming? I know there is a reduceByKeyAndWindow operation, but in my case I need something like reduceByKeyAndWindowByCount.
The count will be stored in a static dataset, or simply stored in each row as additional data.
In the end we decided to switch from streaming to core Spark with batch processing, because we have a finite dataset and that works well for our use case. We came to the conclusion that Spark Streaming was designed for processing continuous datasets (which, in fairness, is obvious from its name), and that is why it only offers time-based window intervals and watermarks to compensate for network or other delays in transport. We also found our design with counters ugly and complex, in other words bad. It was a live example of bad design, and the growing complexity was a sign that we were moving in the wrong direction and trying to use a tool for a purpose it was not designed for.
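For reference, a rough sketch of what such a batch version can look like (the paths, table and column names here are made up): join the two extracted tables, then collect the second table's values into a list per primary key with collect_list.

import static org.apache.spark.sql.functions.collect_list;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchAggregationSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("batch-aggregation").getOrCreate();

        // Two tables previously extracted from Oracle (hypothetical paths and columns).
        Dataset<Row> customers = spark.read().parquet("/data/customers");
        Dataset<Row> products = spark.read().parquet("/data/products");

        // One row per primary key, with the second table's values collected into a list.
        Dataset<Row> result = customers
            .join(products, "customer_id")
            .groupBy("customer_id", "customer_name")
            .agg(collect_list("product_name").as("products"));

        result.show();
        spark.stop();
    }
}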