How to write this Pig query? - apache-pig

I have a many-to-many mapping table between two collections. Each row in the mapping table represents a possible mapping with a weight score.
mapping(id1, id2, weight)
Query: Generate a one-to-one mapping between id1 and id2. Use the lowest weight to remove duplicate mappings. If there is a tie, output an arbitrary one.
Example input:
(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)
Output
(1, X)
(2, Y)
(3, Z)
1 and 2 are both mapped to X and Y. We pick mapping (1, X) and (2, Y) because they have the lowest weight.

I will assume that you are only interested in mappings where the weight is the lowest of any mapping involving id1, and also the lowest of any mapping involving id2. For example, if you additionally had the mapping (2, Y, 4), it would not conflict with (1, X, 1), but I will still exclude such mappings because the weight is larger than those of (1, Y, 2) and (2, X, 3), which were disqualified.
My solution proceeds as follows: find the minimum mapping weight for each id1, and then join that into the mapping relation for future reference. Use a nested foreach to go through each id2: use ORDER and LIMIT to select the record with the smallest weight for that id2, and then only keep it if the weight is also the minimum for that id1.
Here is the full script, tested on your input:
mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);
id1_weights =
FOREACH (GROUP mapping BY id1)
GENERATE group AS id1, MIN(mapping.weight) AS id1_min_weight;
mapping_with_id1_mins =
FOREACH (JOIN mapping BY id1, id1_weights BY id1)
GENERATE mapping::id1, id2, weight, id1_min_weight;
accepted_mappings =
FOREACH (GROUP mapping_with_id1_mins BY id2)
{
ordered = ORDER mapping_with_id1_mins BY weight;
selected = LIMIT ordered 1;
acceptable = FILTER selected BY weight == id1_min_weight;
GENERATE FLATTEN(acceptable);
};
DUMP accepted_mappings;
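For readers who want to check the selection rule outside Pig, here is a minimal plain-Java sketch of the same idea: keep a row only if it has the lowest weight for its id2 and that weight is also the minimum for its id1. The class MutualMinMapping, the Mapping row type, and the select method are hypothetical names introduced for illustration; they are not part of the Pig script above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MutualMinMapping {
    // Hypothetical row type standing in for one tuple of the mapping relation.
    static final class Mapping {
        final String id1;
        final String id2;
        final double weight;

        Mapping(String id1, String id2, double weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }
    }

    // Keeps a row only if it is the lowest-weight row for its id2
    // and its weight also equals the minimum weight seen for its id1.
    static List<Mapping> select(List<Mapping> rows) {
        Map<String, Double> minWeightById1 = new HashMap<>();
        Map<String, Mapping> bestById2 = new HashMap<>();
        for (Mapping m : rows) {
            minWeightById1.merge(m.id1, m.weight, Math::min);
            // On a tie the row seen first is kept (an arbitrary choice).
            bestById2.merge(m.id2, m, (a, b) -> b.weight < a.weight ? b : a);
        }
        List<Mapping> accepted = new ArrayList<>();
        for (Mapping m : bestById2.values()) {
            if (m.weight == minWeightById1.get(m.id1)) {
                accepted.add(m);
            }
        }
        // Sort by id1 only to make the output order deterministic.
        accepted.sort(Comparator.comparing((Mapping m) -> m.id1));
        return accepted;
    }

    public static void main(String[] args) {
        List<Mapping> rows = Arrays.asList(
                new Mapping("1", "X", 1), new Mapping("1", "Y", 2),
                new Mapping("2", "X", 3), new Mapping("2", "Y", 1),
                new Mapping("3", "Z", 2));
        for (Mapping m : select(rows)) {
            System.out.println("(" + m.id1 + ", " + m.id2 + ")");
        }
    }
}
```

On the example input from the question this keeps (1, X), (2, Y) and (3, Z).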

I solved it by using a Java UDF. It's not perfect, in the sense that it won't maximize the number of one-to-one mappings, but it's good enough.
Pig:
d = load 'test' as (fid, iid, priority:double);
g = group d by fid;
o = foreach g generate FLATTEN(com.propeld.pig.DEDUP(d)) as (fid, iid, priority);
store o into 'output';
g2 = group o by iid;
o2 = foreach g2 generate FLATTEN(com.propeld.pig.DEDUP(o)) as (fid, iid, priority);
store o2 into 'output2';
Java UDF:
package com.propeld.pig;
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class DEDUP extends EvalFunc<Tuple> implements Algebraic{
public String getInitial() {return Initial.class.getName();}
public String getIntermed() {return Intermed.class.getName();}
public String getFinal() {return Final.class.getName();}
static public class Initial extends EvalFunc<Tuple> {
private static TupleFactory tfact = TupleFactory.getInstance();
public Tuple exec(Tuple input) throws IOException {
// Initial is called in the map.
// we just send the tuple down
try {
// input is a bag with one tuple containing
// the column we are trying to operate on
DataBag bg = (DataBag) input.get(0);
if (bg.iterator().hasNext()) {
Tuple dba = (Tuple) bg.iterator().next();
return dba;
} else {
// make sure that we call the object constructor, not the list constructor
return tfact.newTuple((Object) null);
}
} catch (ExecException e) {
throw e;
} catch (Exception e) {
int errCode = 2106;
throw new ExecException("Error executing an algebraic function", errCode, PigException.BUG, e);
}
}
}
static public class Intermed extends EvalFunc<Tuple> {
public Tuple exec(Tuple input) throws IOException {
return dedup(input);
}
}
static public class Final extends EvalFunc<Tuple> {
public Tuple exec(Tuple input) throws IOException {return dedup(input);}
}
static protected Tuple dedup(Tuple input) throws ExecException, NumberFormatException {
DataBag values = (DataBag)input.get(0);
Double min = Double.MAX_VALUE;
Tuple result = null;
for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
Tuple t = (Tuple) it.next();
if ((Double)t.get(2) < min){
min = (Double)t.get(2);
result = t;
}
}
return result;
}
@Override
public Tuple exec(Tuple input) throws IOException {
return dedup(input);
}
}
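The two-pass idea in the Pig script above (keep the lowest-priority row per fid, then the lowest-priority row per iid among the survivors) can be sketched in plain Java without the Pig API. This is only an illustration under assumed names: TwoPassDedup, Row, keepMinPer and dedup are hypothetical and do not appear in the UDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TwoPassDedup {
    // Hypothetical row type mirroring the (fid, iid, priority) schema.
    static final class Row {
        final String fid;
        final String iid;
        final double priority;

        Row(String fid, String iid, double priority) {
            this.fid = fid;
            this.iid = iid;
            this.priority = priority;
        }
    }

    // For each distinct key, keep the row with the smallest priority.
    static Collection<Row> keepMinPer(Collection<Row> rows, Function<Row, String> key) {
        Map<String, Row> best = new LinkedHashMap<>();
        for (Row r : rows) {
            best.merge(key.apply(r), r, (a, b) -> b.priority < a.priority ? b : a);
        }
        return best.values();
    }

    // Pass 1 groups by fid, pass 2 groups the survivors by iid.
    static List<Row> dedup(List<Row> rows) {
        return new ArrayList<>(keepMinPer(keepMinPer(rows, r -> r.fid), r -> r.iid));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("1", "X", 1), new Row("1", "Y", 2),
                new Row("2", "X", 3), new Row("2", "Y", 1),
                new Row("3", "Z", 2));
        for (Row r : dedup(rows)) {
            System.out.println("(" + r.fid + ", " + r.iid + ")");
        }
    }
}
```

The sketch also shows why the result is not always maximal: for the input (1, X, 1), (1, Y, 2), (2, X, 3), pass 1 keeps (1, X, 1) and (2, X, 3), and pass 2 then keeps only (1, X, 1), even though the pairing (1, Y), (2, X) would map both ids.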


optaplanner can't get the best solution, and different input orders produce different solutions

I'm trying to build a demo using OptaPlanner: there are several schemes, each scheme has a gain and a cost, and a scheme may conflict with one or more other schemes. The goal is to find a group of schemes that satisfies the following constraints:
hard constraint: the selected schemes may not conflict with each other
soft constraint: make the difference between total gain and total cost as high as possible
I wrote the following code to try to solve the problem:
@PlanningEntity
@Data
@NoArgsConstructor
public class Scheme {
@PlanningId
private String id;
private int gain;
private int cost;
@PlanningVariable(valueRangeProviderRefs = {"validRange"})
// when valid is true, this scheme is selected into the solution group
private Boolean valid;
private Set<String> conflicts = new HashSet<>();
public void addConflict(String id) {
conflicts.add(id);
}
public Scheme(String id, int gain, int cost, String[] conflicts) {
this.id = id;
this.gain = gain;
this.cost = cost;
for (String s : conflicts) {
addConflict(s);
}
}
}
@PlanningSolution
public class SchemeSolution {
private HardSoftScore score;
private List<Scheme> schemeList;
@ProblemFactCollectionProperty
@ValueRangeProvider(id = "validRange")
public List<Boolean> getValidRange() {
return Arrays.asList(Boolean.FALSE, Boolean.TRUE);
}
@PlanningScore
public HardSoftScore getScore() {
return score;
}
public void setScore(HardSoftScore score) {
this.score = score;
}
@PlanningEntityCollectionProperty
public List<Scheme> getSchemeList() {
return schemeList;
}
public void setSchemeList(List<Scheme> schemeList) {
this.schemeList = schemeList;
}
}
And the constraint rule as below:
rule "conflictCheck"
when
Boolean(this==true) from accumulate (
$schs: List() from collect (Scheme(valid==true)),
init(boolean cfl = false;Set cfSet = new HashSet();List ids = new ArrayList()),
action(
for(int i = 0; i < $schs.size(); ++i) {
Scheme sch = (Scheme)$schs.get(i);
cfSet.addAll(sch.getConflicts());
ids.add(sch.getId());
}
for( int i = 0; i < ids.size(); ++i) {
String id = (String)ids.get(i);
if(cfSet.contains(id)) {
cfl = true;
return true;
}
}
),
result(cfl)
)
then
scoreHolder.addHardConstraintMatch(kcontext, -10000);
end
rule "bestGain"
when
$gc : Number() from
accumulate(
Scheme(valid==true, $gain : gain, $cost: cost),
sum($gain - $cost)
)
then
scoreHolder.addSoftConstraintMatch(kcontext, $gc.intValue());
end
Then I constructed three schemes as input of the test. Oddly, I found that optaplanner can't get the best solution, and different input orders produce different solutions.
When I set input as following:
private static List<Scheme> getSchemes() {
List<Scheme> ret = new ArrayList();
ret.add(new Scheme("S1", 5, 2, new String[]{"S3"}));
ret.add(new Scheme("S2", 3, 1, new String[]{"S3"}));
ret.add(new Scheme("S3", 10, 4, new String[]{"S1", "S2"}));
return ret;
}
the output is :
0hard/5soft
Scheme(id=S1, gain=5, cost=2, valid=true, conflicts=[S3])
Scheme(id=S2, gain=3, cost=1, valid=true, conflicts=[S3])
Scheme(id=S3, gain=10, cost=4, valid=false, conflicts=[S1, S2])
And when I set input as following:
private static List<Scheme> getSchemes() {
List<Scheme> ret = new ArrayList();
ret.add(new Scheme("S3", 10, 4, new String[]{"S1", "S2"}));
ret.add(new Scheme("S1", 5, 2, new String[]{"S3"}));
ret.add(new Scheme("S2", 3, 1, new String[]{"S3"}));
return ret;
}
I get the best solution and the output is :
0hard/6soft
Scheme(id=S3, gain=10, cost=4, valid=true, conflicts=[S1, S2])
Scheme(id=S1, gain=5, cost=2, valid=false, conflicts=[S3])
Scheme(id=S2, gain=3, cost=1, valid=false, conflicts=[S3])
Could anyone help me with this?

How to output elements of an ArrayList based on the frequency?

My ArrayList looks like [A, B, D, E, C, A, A, B]
and I want to print the following to the console:
A,3
B,2
C,1
D,1
E,1
How can I do this in Java?
Create a HashMap with one entry per symbol. The symbol itself is the key, and the number of times that symbol shows up is the value.
Iterate through your list: each time you find a new symbol, add a matching entry to the map, and each time you find an existing symbol, add one to its value in the map.
Add the map's entry set to a list, sort that list using a comparator, then print the sorted list.
public static void main(String[] args) {
String mySymbols = "A,B,D,E,C,A,A,B";
ArrayList<String> myList = new ArrayList<String>();
Collections.addAll(myList, mySymbols.split(","));
HashMap<String, Integer> countingMap = new HashMap<String, Integer>();
for (String s : myList){
if(countingMap.containsKey(s)){
Integer newValue = countingMap.get(s) + 1;
countingMap.put(s, newValue);
}
else{
countingMap.put(s, 1);
}
}
List<Entry> entries = new ArrayList<Entry>();
entries.addAll(countingMap.entrySet());
entries.sort(new Comparator<Entry>() {
@Override
public int compare(Entry o1, Entry o2) {
return (Integer)o2.getValue() - (Integer)o1.getValue();
}
});
Iterator iter = entries.iterator();
while(iter.hasNext()){
Entry thisEntry = (Entry) iter.next();
Object key = thisEntry.getKey();
Object value = thisEntry.getValue();
System.out.println(key+", "+value);
}
}
You could accomplish this with a Stream.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
class MyClass {
public static void main(String...args) {
printListElementFrequencies(Arrays.asList("A", "B", "D", "E", "C", "A", "A", "B"));
}
static void printListElementFrequencies(List<String> list) {
Map<String, Long> frequencies = list.stream().collect(Collectors.groupingBy(s -> s, TreeMap::new, Collectors.counting()));
Comparator<String> sortByFrequencyThenLexicographically = Comparator.<String, Long>comparing(frequencies::get).reversed().thenComparing(Comparator.comparing(s -> s));
// distinct() so that each element is printed once rather than once per occurrence
list.stream().distinct().sorted(sortByFrequencyThenLexicographically).forEach(s -> System.out.println(s + ", " + frequencies.get(s)));
}
}
Edit: I missed the sorting part.
If the contents of your list are single uppercase letters, this can be approached using a second array of counters, one slot per letter:
int[] foo = new int['Z' - 'A' + 1];
Then loop through your original array, incrementing the slot for each char in foo (index c - 'A'). Afterwards, sort the letters by count with a fast sorting algorithm and print them out; where two neighbouring counts are equal, order them by char value.
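A minimal sketch of this counting-array approach, assuming every symbol is a single uppercase letter from A to Z. The class LetterCounts and the method frequencies are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LetterCounts {
    // Returns lines such as "A,3", ordered by descending count and then alphabetically.
    static List<String> frequencies(char[] symbols) {
        int[] counts = new int['Z' - 'A' + 1]; // one counter per letter A..Z
        for (char c : symbols) {
            counts[c - 'A']++;
        }
        // Collect the letters that occur, already in alphabetical order.
        List<Character> seen = new ArrayList<>();
        for (char c = 'A'; c <= 'Z'; c++) {
            if (counts[c - 'A'] > 0) {
                seen.add(c);
            }
        }
        // List.sort is stable, so letters with equal counts stay alphabetical.
        seen.sort((a, b) -> counts[b - 'A'] - counts[a - 'A']);
        List<String> lines = new ArrayList<>();
        for (char c : seen) {
            lines.add(c + "," + counts[c - 'A']);
        }
        return lines;
    }

    public static void main(String[] args) {
        char[] input = {'A', 'B', 'D', 'E', 'C', 'A', 'A', 'B'};
        for (String line : frequencies(input)) {
            System.out.println(line);
        }
    }
}
```

For the list in the question this prints A,3 then B,2 then C,1, D,1 and E,1. The approach only works for a small, fixed alphabet; for arbitrary strings a map-based count is the more general choice.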
It could be done by converting the list into frequency map and sorting the entry set by values in descending order and then by key:
List<String> list = Arrays.asList("A", "B", "D", "E", "C", "A", "A", "B", "Z", "D");
list.stream()
.collect(Collectors.groupingBy(x -> x, Collectors.counting())) // Map<String, Long>
.entrySet()
.stream()
.sorted(Map.Entry.<String, Long>comparingByValue().reversed()
.thenComparing(Map.Entry.comparingByKey())
) // Stream<Map.Entry<String, Long>>
.forEach(e -> System.out.println(e.getKey() + ", " + e.getValue()));
Output:
A, 3
B, 2
D, 2
C, 1
E, 1
Z, 1
Also it is possible to use Collectors::toMap with a merge function (here the frequency is calculated as Integer):
list.stream()
.collect(Collectors.toMap(x -> x, x -> 1, Integer::sum)) // Map<String, Integer>
.entrySet()
.stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
.thenComparing(Map.Entry.comparingByKey())
)
.forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));

Comparator in binary search

I am not sure how to write a comparator for Collections.binarySearch(). Can anyone help? Sample code:
List<Object> list1 = new ArrayList<>();
List<List<Object>> list2 = new ArrayList<>();
//loop starts
// adds elements into list1
list1.add(values);//values is an object containing elements like [3, John, Smith]
if (list2.size() == 0) {
list2.add(list1);//first element
} else {
int index = Collections.binarySearch(list2, list1, comparator);
if (index < 0) {
index = -index - 1; // not found: convert the result to the insertion point
}
list2.add(index, list1);//I want to add these elements in ascending order ?
}
//loop ends
How do I write the comparator so that elements in list2 are added in ascending or descending order?
You can use an anonymous class which implements a Comparator<List<Object>>:
int index = Collections.binarySearch(list2, list1, new Comparator<List<Object>>() {
@Override
public int compare(List<Object> o1, List<Object> o2) {
// Your implementation here
return 0;
}
});
You could implement a Comparator<List<Object>> class, or use a lambda expression.
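As a rough sketch of the lambda variant, the following keeps a list of rows sorted by inserting each new row at the position reported by Collections.binarySearch. It assumes, purely for illustration, that the first element of every row is an Integer key (as in [3, John, Smith]); SortedInsert, BY_FIRST and insertSorted are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortedInsert {
    // Assumption: element 0 of each row is an Integer used as the sort key.
    static final Comparator<List<Object>> BY_FIRST =
            (a, b) -> Integer.compare((Integer) a.get(0), (Integer) b.get(0));

    // Inserts row so that rows stays in ascending order of the key.
    static void insertSorted(List<List<Object>> rows, List<Object> row) {
        int index = Collections.binarySearch(rows, row, BY_FIRST);
        if (index < 0) {
            // Not found: binarySearch returns -(insertion point) - 1.
            index = -index - 1;
        }
        rows.add(index, row);
    }

    public static void main(String[] args) {
        List<List<Object>> rows = new ArrayList<>();
        insertSorted(rows, Arrays.<Object>asList(3, "John", "Smith"));
        insertSorted(rows, Arrays.<Object>asList(1, "Ann", "Lee"));
        insertSorted(rows, Arrays.<Object>asList(2, "Bob", "Ray"));
        System.out.println(rows);
    }
}
```

For descending order, the same comparator can be reversed with BY_FIRST.reversed(), provided the list is kept sorted with that same reversed comparator.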
You just need to create a class that implements the Comparator interface.
For example, you can do this inline with an anonymous class:
Comparator<List<Object>> comparator = new Comparator<List<Object>>() {
@Override
public int compare(List<Object> x, List<Object> y) {
// custom logic to compare x and y here. Return a negative number
// if x < y, a positive number if x > y, and 0 otherwise
return 0; // placeholder so the method compiles
}
};
Collections.binarySearch(list, key, comparator); // key is the element to search for

apache pig Java UDF - changing values in attributes doesn't seem to stick

I'm trying to write a Java UDF that will rank the tuples in a bag.
The tuples have a value column that is the criteria for the ranking and a rank column which is initially set to 0.
The tuples are sorted based on the value column.
All the tuples are placed in a bag and that bag is placed inside a new tuple which is passed to the UDF.
The UDF modifies the rank column; however, once the method exits, the values have all become 0 again. I'm not sure how to get the values to "stick".
Any help would be greatly appreciated.
Here is my java class
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataBag;
import org.apache.pig.impl.logicalLayer.FrontendException;
import java.util.Iterator;
import org.apache.pig.PigWarning;
/**
*
* @author Winter
*/
public class Ranker extends EvalFunc<String>{
@Override
public String exec(Tuple tuple) throws IOException {
if (tuple == null || tuple.size() == 0) {
return null;
}
List<Object> list = tuple.getAll();
DataBag db = (DataBag) list.get(0);
Integer num = (Integer)list.get(1);
Iterator<Tuple>itr = db.iterator();
boolean containsNonNull = false;
int i = 1;
double previous=0;
while (itr.hasNext()) {
Tuple t= itr.next();
double d = (Double)t.get(num.intValue());
int rankCol = t.size()-1;
Integer rankVal = (Integer)t.get(rankCol);
if(i == 0){
System.out.println("i==0");
previous = d;
t.set(rankCol, i);
} else {
if(d == previous)
t.set(rankCol, i);
else{
System.out.print("d!==previous|" + d + "|"+ previous+"|"+rankVal);
t.set(rankCol, ++i);
rankVal = (Integer)t.get(rankCol);
System.out.println("|now rank val" + rankVal);
previous = d;
}
}
}
return "Y";
}
}
Here is how I am calling everything in Pig -
REGISTER /myJar.jar;
A = LOAD '/Users/Winter/milk-tea-coffee.tsv' as (year:chararray, milk:double);
B = foreach A generate year, milk, 0 as rank;
C = order B by milk asc;
D = group C by rank order C by milk;
E = foreach D generate D.C.year,D.C.milk,D.C.rank, piglet3.evalFunctions.Ranker(D.C,1);
dump E;
I can tell it's working inside the UDF because of the print statements inside the UDF -
d!==previous|21.2|0.0|0|now rank val2
d!==previous|21.6|21.2|0|now rank val3
d!==previous|21.9|21.6|0|now rank val4
d!==previous|22.0|21.9|0|now rank val5
d!==previous|22.5|22.0|0|now rank val6
d!==previous|22.9|22.5|0|now rank val7
d!==previous|23.0|22.9|0|now rank val8
d!==previous|23.4|23.0|0|now rank val9
d!==previous|23.8|23.4|0|now rank val10
d!==previous|23.9|23.8|0|now rank val11
but when I dump out E or D or C the rank column only contains 0s.
The exec function must return the output you want from the UDF. You are currently modifying the Tuple that is passed to the exec function and then returning the String "Y", so all that Pig sees as output from your UDF is "Y". In this case, you should return the Tuple instead of "Y".
I think the following code is close to your intent, but I'm not quite clear on what you are trying to do:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataBag;
import org.apache.pig.impl.logicalLayer.FrontendException;
import java.util.Iterator;
import org.apache.pig.PigWarning;
/**
*
* @author Winter
*/
public class Ranker extends EvalFunc<Tuple>{
@Override
public Tuple exec(Tuple tuple) throws IOException {
if (tuple == null || tuple.size() == 0) {
return null;
}
List<Object> list = tuple.getAll();
DataBag db = (DataBag) list.get(0);
Integer num = (Integer)list.get(1);
Iterator<Tuple>itr = db.iterator();
boolean containsNonNull = false;
int i = 1;
double previous=0;
while (itr.hasNext()) {
Tuple t= itr.next();
double d = (Double)t.get(num.intValue());
int rankCol = t.size()-1;
Integer rankVal = (Integer)t.get(rankCol);
if(i == 0){
System.out.println("i==0");
previous = d;
t.set(rankCol, i);
} else {
if(d == previous)
t.set(rankCol, i);
else{
System.out.print("d!==previous|" + d + "|"+ previous+"|"+rankVal);
t.set(rankCol, ++i);
rankVal = (Integer)t.get(rankCol);
System.out.println("|now rank val" + rankVal);
previous = d;
}
}
}
return tuple;
}
}
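Independently of the Pig API, the ranking logic the UDF seems to be aiming for can be sketched in plain Java: walk over values that are already sorted ascending and give equal neighbours the same rank. This is an assumption about the intent, not the asker's exact behaviour. In the question's log the first value receives rank 2 because the counter starts at 1 and is pre-incremented, whereas this sketch starts at 1. DenseRank and rank are hypothetical names.

```java
import java.util.Arrays;

class DenseRank {
    // Assumes sorted is in ascending order; equal values share a rank.
    static int[] rank(double[] sorted) {
        int[] ranks = new int[sorted.length];
        int rank = 0;
        for (int i = 0; i < sorted.length; i++) {
            // A new rank starts at the first value and whenever the value changes.
            if (i == 0 || sorted[i] != sorted[i - 1]) {
                rank++;
            }
            ranks[i] = rank;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] milk = {21.2, 21.6, 21.6, 21.9};
        System.out.println(Arrays.toString(rank(milk)));
    }
}
```

The important point from the answer still applies: the computed ranks have to be part of what the function returns, since changes made only to the input are not what the caller receives.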

How to multiply several fields in a tuple by a given field of the tuple

For each row of data, I would like to multiply fields 1 through N by field 0. The data could have hundreds of fields per row (or a variable number of fields, for that matter), so writing out each pair is not feasible. Is there a way to specify a range of fields, sort of like the following (incorrect) snippet?
A = LOAD 'foo.csv' USING PigStorage(',');
B = FOREACH A GENERATE $0*($1,..);
A UDF could come in handy here.
Implement exec(Tuple input) and iterate over all fields of the tuple as follows (not tested):
public class MultiplyField extends EvalFunc<Long> {
public Long exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
long retVal = 1L; // a plain int literal cannot be assigned to a Long
for (int i = 0; i < input.size(); i++) {
Long j = (Long)input.get(i);
retVal *= j;
}
return retVal;
} catch(Exception e) {
throw WrappedIOException.wrap("Caught exception processing input row ", e);
}
}
}
Then register your UDF and call it from your FOREACH.
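Note that the UDF above returns a single product of all fields, whereas the question asks for each of fields 1 through N to be multiplied by field 0. The per-row loop for that can be sketched in plain Java as follows. This is not Pig API code: ScaleByFirst and scale are hypothetical names, and the fields are assumed to be doubles.

```java
import java.util.Arrays;

class ScaleByFirst {
    // Returns fields 1..N of the row, each multiplied by field 0.
    static double[] scale(double[] row) {
        if (row.length == 0) {
            return new double[0];
        }
        double[] out = new double[row.length - 1];
        for (int i = 1; i < row.length; i++) {
            out[i - 1] = row[0] * row[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scale(new double[] {2, 3, 4.5})));
    }
}
```

Inside a real UDF, the same loop would read each field from the input tuple and write the products into an output tuple, so the function would return a tuple rather than a single number.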