Spark JavaPairRDD iteration - apache-pig

How can I iterate over a JavaPairRDD? I have done a groupBy and got back an RDD as shown below: a JavaPairRDD (a Tuple7 of Strings as the key and a List of objects as the value).
Now I have to iterate over this RDD and do some calculations, like FOREACH in Pig.
Basically I would like to iterate over the key and the list of values, do some operations, and then return a JavaPairRDD.
JavaPairRDD<Tuple7<String, String, String, String, String, String, String>, List<Records>> sizes =
    piTagRecordData.groupBy(new Function<Records, Tuple7<String, String, String, String, String, String, String>>() {
        private static final long serialVersionUID = 2885738359644652208L;

        @Override
        public Tuple7<String, String, String, String, String, String, String> call(Records row) throws Exception {
            Tuple7<String, String, String, String, String, String, String> compositeKey =
                new Tuple7<String, String, String, String, String, String, String>(
                    row.getAsset_attribute_id(), row.getDate_time_value(), row.getOperation(), row.getPi_tag_count(),
                    row.getAsset_id(), row.getAttr_name(), row.getCalculation_type());
            return compositeKey;
        }
    });
After this I want to perform an operation FOR EACH member of sizes (the JavaPairRDD), something like:
rejected_records = FOREACH sizes GENERATE FLATTEN(Java function on the List of Records based on the group key)
I am using Spark 0.9.0

Even though you are talking about "FOR EACH", it really sounds like you want the flatMap operation, since you want to produce new values and flatten them. This is available for Java RDDs, including a JavaPairRDD.
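For example, a minimal sketch under the assumption that your per-group logic lives in a hypothetical rejectRecords(key, records) helper that returns the records to emit (mirroring the FOREACH ... GENERATE FLATTEN step):

JavaRDD<Records> rejectedRecords = sizes.flatMap(
    new FlatMapFunction<Tuple2<Tuple7<String, String, String, String, String, String, String>, List<Records>>, Records>() {
        @Override
        public Iterable<Records> call(
                Tuple2<Tuple7<String, String, String, String, String, String, String>, List<Records>> group) {
            // group._1() is the composite key, group._2() the grouped records;
            // rejectRecords(...) is a placeholder for your own per-group function
            return rejectRecords(group._1(), group._2());
        }
    });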

You can use the void foreach(VoidFunction<T> f) method. More info and methods: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/api/java/JavaRDDLike.html#foreach(org.apache.spark.api.java.function.VoidFunction)
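A minimal sketch against the grouped RDD from the question (printing only to illustrate; on a cluster this output lands in the executor logs):

sizes.foreach(new VoidFunction<Tuple2<Tuple7<String, String, String, String, String, String, String>, List<Records>>>() {
    @Override
    public void call(Tuple2<Tuple7<String, String, String, String, String, String, String>, List<Records>> group) {
        // group._1() is the composite key, group._2() the grouped records
        System.out.println(group._1() + " -> " + group._2().size() + " records");
    }
});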

If you want to view some values of a JavaPairRDD, I would do it like this:
for (Tuple2<String, String> test : pairRdd.take(10)) { // or pairRdd.collect()
    System.out.println(test._1);
    System.out.println(test._2);
}
Note: Tuple2<String, String> assumes you have Strings inside the JavaPairRDD; change the data types according to what is actually stored in the JavaPairRDD.

Related

Combining Two List in Kotlin with Index

There is a data class for fruits:
data class Fruits(
val code: String, //Unique
val name: String
)
The index list for the base list (one boolean per fruit) is as below:
val indexList: MutableList<Boolean> = MutableList(baseFruitList.size) { false }
Now the favourite index list is as below:
val favList: MutableList<Boolean> = MutableList(favFruitList.size) { true }
I want a combined full list which basically has the fav items indicated as true.
Ex:
baseFruitList = {[FT1,apple],[FT2,grapes],[FT3,banana],[FT4,mango],[FT5,pears]}
favList = {[FT2,grapes],[FT4,mango]}
The final index list should have
finalIndexed = {false,true,false,true,false}
How can we achieve this in Kotlin, without iterating through each element?
You can do
val finalIndexed = baseFruitList.map { it in favList }
assuming, like @Tenfour04 is asking, that name is guaranteed to be a specific value (including matching case) for a specific code (since that combination is how a data class matches another, e.g. when checking if it's in another list)
If you can't guarantee that, this is safer:
val finalIndexed = baseFruitList.map { fruit ->
    favList.any { fav -> fav.code == fruit.code }
}
but here you have to iterate over all the favs (at least until you find a match) looking to see if one has the code.
But really, if code is the unique identifier here, why not just store those in your favList?
favList = listOf("FT2", "FT4") // or a Set would be more efficient, and more correct!
val finalIndexed = baseFruitList.map { it.code in favList }
I don't know what you mean by "without iterating through each element" - if you mean without an explicit indexed for loop, then you can use simple functions like I have here. But there's always some amount of iteration involved. Sets are always an option to help you minimise that.

Java8 Streams - Compare Two List's object values and add value to sub object of first list?

I have two classes:
public class ClassOne {
private String id;
private String name;
private String school;
private String score; //Default score is null
..getters and setters..
}
public class ClassTwo {
private String id;
private String marks;
..getters and setters..
}
And, I have two Lists of the above classes,
List<ClassOne> listOne;
List<ClassTwo> listTwo;
How can I compare the two lists and assign marks from listTwo to score in listOne when the IDs are equal? I know we can use two for loops and do it, but I want to implement it using Java 8 streams.
List<ClassOne> result = new ArrayList<>();
for(ClassOne one : listOne) {
for(ClassTwo two : listTwo) {
if(one.getId().equals(two.getId())) {
one.setScore(two.getMarks());
result.add(one);
}
}
}
return result;
How can I implement this using Java8 lambda and streams?
Let listOne.size() be N and listTwo.size() be M.
Then the 2-for-loop solution has complexity O(M*N).
We can reduce it to O(M+N) by indexing listTwo by id.
Case 1 - assuming listTwo has no objects with the same id
// pair each id with its marks
Map<String, String> marksIndex = listTwo.stream().collect(Collectors.toMap(ClassTwo::getId, ClassTwo::getMarks));
// go through the list of `ClassOne`s and look up marks in the index
listOne.forEach(o1 -> o1.setScore(marksIndex.get(o1.getId())));
Case 2 - assuming listTwo has objects with the same id
final Map<String, List<ClassTwo>> marksIndex = listTwo.stream()
    .collect(Collectors.groupingBy(ClassTwo::getId, Collectors.toList()));
final List<ClassOne> result = listOne.stream()
    .flatMap(o1 -> marksIndex.get(o1.getId()).stream().map(o2 -> {
        // make a copy of the ClassOne instance to avoid overwriting scores
        ClassOne copy = copy(o1);
        copy.setScore(o2.getMarks());
        return copy;
    }))
    .collect(Collectors.toList());
To implement the copy method you need to create a new object and copy the fields one by one, but in such cases I prefer to follow the Builder pattern. It also results in more "functional" code.
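A minimal field-by-field sketch, assuming ClassOne exposes the getters and setters implied in the question:

private static ClassOne copy(ClassOne original) {
    // straight field-by-field copy; a Builder would replace the setter calls
    ClassOne copy = new ClassOne();
    copy.setId(original.getId());
    copy.setName(original.getName());
    copy.setSchool(original.getSchool());
    copy.setScore(original.getScore());
    return copy;
}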
The following code copies marks from ClassTwo to score in ClassOne if both ids are equal; it doesn't need the intermediate List<ClassOne> result.
listOne.stream()
    .forEach(one -> listTwo.stream()
        .filter(two -> two.getId().equals(one.getId()))
        .limit(1)
        .forEach(two -> one.setScore(two.getMarks())));
This should work.
Map<String, String> collect = listTwo.stream().collect(Collectors.toMap(ClassTwo::getId, ClassTwo::getMarks));
listOne.stream()
    .filter(item -> collect.containsKey(item.getId()))
    .forEach(item -> item.setScore(collect.get(item.getId())));

Comparing and removing object from ArrayLists using Java 8

My apologies if this is a simple basic info that I should be knowing. This is the first time I am trying to use Java 8 streams and other features.
I have two ArrayLists containing the same type of objects. Let's say list1 and list2. Let's say the lists hold Person objects with a property "employeeId".
The scenario is that I need to merge these lists. However, list2 may have some objects that are the same as in list1. So I am trying to remove the objects from list2 that are the same as in list1, getting a result list that I can then merge into list1.
I am trying to do this with Java 8 removeIf() and stream() features. Following is my code:
public List<PersonDto> removeDuplicates(List<PersonDto> list1, List<PersonDto> list2) {
List<PersonDto> filteredList = list2.removeIf(list2Obj -> {
list1.stream()
.anyMatch( list1Obj -> (list1Obj.getEmployeeId() == list2Obj.getEmployeeId()) );
} );
}
The above code is giving compile error as below:
The method removeIf(Predicate) in the type Collection is not applicable for the arguments (( list2Obj) -> {})
So I changed the list2Obj at the start of "removeIf()" to (<PersonDto> list2Obj) as below:
public List<PersonDto> removeDuplicates(List<PersonDto> list1, List<PersonDto> list2) {
List<PersonDto> filteredList = list2.removeIf((<PersonDto> list2Obj) -> {
list1.stream()
.anyMatch( list1Obj -> (list1Obj.getEmployeeId() == list2Obj.getEmployeeId()) );
} );
}
This gives me an error as below:
Syntax error on token "<", delete this token for the '<' in (<PersonDto> list2Obj) and Syntax error on token(s), misplaced construct(s) for the part from '-> {'
I am at loss on what I really need to do to make it work.
Would appreciate if somebody can please help me resolve this issue.
I've simplified your function just a little bit to make it more readable:
public static List<PersonDto> removeDuplicates(List<PersonDto> left, List<PersonDto> right) {
left.removeIf(p -> {
return right.stream().anyMatch(x -> (p.getEmployeeId() == x.getEmployeeId()));
});
return left;
}
Also notice that you are modifying the left parameter, you are not creating a new List.
You could also use left.removeAll(right), but you need equals and hashCode for that, and it seems you don't have them, or they are based on something other than employeeId; a sketch of id-based implementations follows.
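A minimal sketch, assuming employeeId is a primitive long (adjust the comparison and hash if it's an object):

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof PersonDto)) return false;
    // two PersonDto instances are equal iff their employee ids match
    return getEmployeeId() == ((PersonDto) o).getEmployeeId();
}

@Override
public int hashCode() {
    return Long.hashCode(getEmployeeId());
}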
Another option would be to collect those lists to a TreeSet and use removeAll:
TreeSet<PersonDto> leftTree = left.stream()
.collect(Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(PersonDto::getEmployeeId))));
TreeSet<PersonDto> rightTree = right.stream()
.collect(Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(PersonDto::getEmployeeId))));
leftTree.removeAll(rightTree);
I understand you are trying to merge both lists without duplicating the elements that belong to the intersection. There are many ways to do this. One is the way you've tried, i.e. remove elements from one list that belong to the other, then merge. And this, in turn, can be done in several ways.
One of these ways would be to keep the employee ids of one list in a HashSet and then use removeIf on the other list, with a predicate that checks whether each element has an employee id that is contained in the set. This is better than using anyMatch on the second list for each element of the first list, because HashSet.contains runs in O(1) amortized time. Here's a sketch of the solution:
// Determine larger and smaller lists
boolean list1Smaller = list1.size() < list2.size();
List<PersonDto> smallerList = list1Smaller ? list1 : list2;
List<PersonDto> largerList = list1Smaller ? list2 : list1;
// Create a Set with the employee ids of the larger list
// Assuming employee ids are long
Set<Long> largerSet = largerList.stream()
.map(PersonDto::getEmployeeId)
.collect(Collectors.toSet());
// Now remove elements from the smaller list
smallerList.removeIf(dto -> largerSet.contains(dto.getEmployeeId()));
The logic behind this is that HashSet.contains will take the same time for both a large and a small set, because it runs in O(1) amortized time. However, traversing a list and removing elements from it will be faster on smaller lists.
Then, you are ready to merge both lists:
largerList.addAll(smallerList);

Reading Hadoop SequenceFiles with Hive

I have some mapred data from the Common Crawl that I have stored in a SequenceFile format. I have tried repeatedly to use this data "as is" with Hive so I can query and sample it at various stages. But I always get the following error in my job output:
LazySimpleSerDe: expects either BytesWritable or Text object!
I have even constructed a simpler (and smaller) dataset of [Text, LongWritable] records, but that fails as well. If I output the data to text format and then create a table on that, it works fine:
hive> create external table page_urls_1346823845675
> (pageurl string, xcount bigint)
> location 's3://mybucket/text-parse/1346823845675/';
OK
Time taken: 0.434 seconds
hive> select * from page_urls_1346823845675 limit 10;
OK
http://0-italy.com/tag/package-deals 643 NULL
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html 9 NULL
http://01fishing.com/fly-fishing-knots/ 3437 NULL
http://01fishing.com/flyin-slab-creek/ 1005 NULL
...
I tried using a custom inputformat:
// My custom input class--very simple
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
public class UrlXCountDataInputFormat extends
SequenceFileInputFormat<Text, LongWritable> { }
I then create the table with:
create external table page_urls_1346823845675_seq
(pageurl string, xcount bigint)
stored as inputformat 'my.package.io.UrlXCountDataInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
location 's3://mybucket/seq-parse/1346823845675/';
But I still get the same SerDe error.
I'm sure there's something really basic I'm missing here, but I can't seem to get it right. Additionally, I have to be able to parse the SequenceFiles in place (i.e. I can't convert my data to text). So I need to figure out the SequenceFile approach for future portions of my project.
Solution:
As @mark-grover pointed out below, the issue is that Hive ignores the key by default. With only one column (i.e. just the value), the SerDe was unable to map my second column.
The solution was to use a custom InputFormat that was a great deal more complex than what I had used originally. I tracked down one answer (a link to a Git repo) about using the keys instead of the values, and then I modified it to suit my needs: take the key and value from an internal SequenceFile.Reader and then combine them into the final BytesWritable. I.e. something like this (from the custom Reader, as that's where all the hard work happens):
// I used generics so I can use this all with
// other output files with just a small amount
// of additional code ...
public abstract class HiveKeyValueSequenceFileReader<K, V> implements RecordReader<K, BytesWritable> {

    public synchronized boolean next(K key, BytesWritable value) throws IOException {
        if (!more) return false;
        long pos = in.getPosition();
        V trueValue = (V) ReflectionUtils.newInstance(in.getValueClass(), conf);
        boolean remaining = in.next((Writable) key, (Writable) trueValue);
        if (remaining) combineKeyValue(key, trueValue, value);
        if (pos >= end && in.syncSeen()) {
            more = false;
        } else {
            more = remaining;
        }
        return more;
    }

    protected abstract void combineKeyValue(K key, V trueValue, BytesWritable newValue);
}
// from my final implementation
public class UrlXCountDataReader extends HiveKeyValueSequenceFileReader<Text, LongWritable> {
    @Override
    protected void combineKeyValue(Text key, LongWritable trueValue, BytesWritable newValue) {
        // TODO I think we need to use straight bytes--I'm not sure this works?
        StringBuilder builder = new StringBuilder();
        builder.append(key);
        builder.append('\001');
        builder.append(trueValue.get());
        newValue.set(new BytesWritable(builder.toString().getBytes()));
    }
}
With that, I get all my columns!
http://0-italy.com/tag/package-deals 643
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html 9
http://01fishing.com/fly-fishing-knots/ 3437
http://01fishing.com/flyin-slab-creek/ 1005
http://01fishing.com/pflueger-1195x-automatic-fly-reels/ 1999
Not sure if this is impacting you but Hive ignores keys when reading SequenceFiles. You may need to create a custom InputFormat (unless you can find one online:-))
Reference: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%3C5573211B-634D-4BB0-9123-E389D90A786C#metaweb.com%3E

Can I pass parameters to UDFs in Pig script?

I am relatively new to Pig scripting. I would like to know if there is a way of passing parameters to Java UDFs in Pig.
Here is the scenario:
I have a log file which has different columns (each representing a Primary Key in another table). My task is to get the count of distinct primary key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? For instance, if I could pass a row number as a parameter to the UDF, it would avoid the need for me to write multiple UDFs.
The way to do it is by using DEFINE and the constructor of the UDF. So here is an example of a custom "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.
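For completeness, the parameter from the DEFINE line arrives through the UDF's constructor. A hypothetical sketch of the Java side (the class name mirrors the example above; the bag-of-tuples return type and the splitting logic are assumptions for illustration, not from the question):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MySplitter extends EvalFunc<DataBag> {
    // the ',' from the DEFINE statement arrives here
    private final String delimiter;

    public MySplitter(String delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // split the first field on the configured delimiter and emit one tuple per piece
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        for (String piece : input.get(0).toString().split(delimiter)) {
            bag.add(TupleFactory.getInstance().newTuple(piece));
        }
        return bag;
    }
}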
To pass parameters you do the following in your Pig script:
UDF(document, '$param1', '$param2', '$param3')
edit: Not sure if those params need to be wrapped in ' ' or not
while in your UDF you do, for example:
public class UDF extends EvalFunc<Boolean> {
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;

        // the passed parameters arrive as additional fields of the input tuple
        FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
        String var1 = input.get(1).toString();
        InputStream var1In = fs.open(new Path(var1));
        String var2 = input.get(2).toString();
        InputStream var2In = fs.open(new Path(var2));
        String var3 = input.get(3).toString();
        InputStream var3In = fs.open(new Path(var3));

        return doyourthing(input.get(0).toString());
    }
}
Yes, you can pass any parameter in the Tuple parameter input of your UDF:
exec(Tuple input)
and access it using
input.get(index)