How to assign a sequentially increasing number to a relation - apache-pig

I am wondering if there is a way to create a new field in a relation and then assign some sequentially increasing number to it? Here is one example:
ordered_products = ORDER products BY price ASC;
ordered_products_with_sequential_id = FOREACH ordered_products GENERATE price, some_sequential_id;
How can I create some_sequential_id? I am not sure whether that's doable in Pig though.

I suspect you have to write your own UDF to get that running. One way to do it is to increment a static counter (an AtomicLong) in the exec implementation of your UDF. Note that a static counter lives in a single JVM, so the numbers are only sequential within one task.
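package com.eval;

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class IncrEval extends EvalFunc<Long> {
    // Shared counter; static, so there is one per JVM (each task gets its own).
    static final AtomicLong res = new AtomicLong(0);

    @Override
    public Long exec(Tuple tip) throws IOException {
        if (tip == null || tip.size() == 0) {
            return null;
        }
        // Return the incremented value directly so increment and read are one atomic step.
        return res.incrementAndGet();
    }
}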
Pig script entry:
b = FOREACH a GENERATE <something>, com.eval.IncrEval() AS ID:long;
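To see why it matters to use the value returned by incrementAndGet() rather than reading the counter again afterwards, here is a minimal plain-Java sketch. It does not use Pig at all; SequenceDemo, nextId and distinctIds are hypothetical names made up for this illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the counter inside the UDF; no Pig classes involved.
class SequenceDemo {
    private static final AtomicLong COUNTER = new AtomicLong(0);

    // incrementAndGet() increments and returns in one atomic step,
    // so two threads can never be handed the same value.
    static long nextId() {
        return COUNTER.incrementAndGet();
    }

    // Calls nextId() from several threads and returns how many distinct ids were handed out.
    static int distinctIds(int threads, int idsPerThread) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < idsPerThread; j++) {
                    seen.add(nextId());
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return seen.size();
    }
}
```

Calling SequenceDemo.distinctIds(4, 1000) should report 4000 distinct ids. The same guarantee only holds inside one JVM, which is why ids from different Pig tasks can still overlap.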

Related

How to create filters in Spring boot based on a List

I have to filter data in Spring Boot. From the front end, I am sending a list containing the selected id of each category. I need to return items based on this list. I can do it this way:
Service:
public List<ProductModel> getProductFilter(Integer[] categories) {
int size = categories.length;
if(size == 1){
return productRepository.getProductFilteredByOneCategory(Long.valueOf(categories[0]));
}else {
return productRepository.getProductFilteredByTwoCategory(Long.valueOf(categories[0]),Long.valueOf(categories[1]));
}
}
}
Repository:
@Query("SELECT c FROM ProductModel c WHERE c.categoryModel.id = :Category_id")
List<ProductModel> getProductFilteredByOneCategory(Long Category_id);
@Query("SELECT c FROM ProductModel c WHERE c.categoryModel.id = :Category_idOne OR c.categoryModel.id = :Category_idTwo")
List<ProductModel> getProductFilteredByTwoCategory(Long Category_idOne, Long Category_idTwo);
But if there are 50 of these categories, this approach is useless. Can anyone tell me how to make it better? Is there some way to generate a query from a list?
You can use IN instead of multiple OR operations, as follows. It selects all ProductModels that match any categoryModel id in the list.
@Query("SELECT c FROM ProductModel c WHERE c.categoryModel.id IN :categoryIds")
List<ProductModel> getProductFilteredByCategoryIds(@Param("categoryIds") List<Long> categoryIds);
As @YJR said, the IN clause is the solution, but you should also consider declaring a derived query method as shown below, which doesn't require writing JPQL.
public List<ProductModel> findByCategoryModel_IdIn(List<Long> categoryIds);
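If you ever need the same thing without Spring Data, "generating a query from a list" usually means emitting one placeholder per id. A minimal sketch for plain JDBC-style SQL follows; InClauseBuilder, the product table and the category_id column are hypothetical names chosen for illustration, not taken from the question.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the table and column names are placeholders.
class InClauseBuilder {
    // Builds "... IN (?, ?, ?)" with one positional placeholder per id.
    static String buildQuery(List<Long> categoryIds) {
        if (categoryIds.isEmpty()) {
            // An empty IN () list is not valid SQL, so reject it up front.
            throw new IllegalArgumentException("categoryIds must not be empty");
        }
        String placeholders = String.join(", ", Collections.nCopies(categoryIds.size(), "?"));
        return "SELECT * FROM product WHERE category_id IN (" + placeholders + ")";
    }
}
```

The ids themselves would then be bound to the placeholders one by one, which keeps the values out of the SQL string. With Spring Data, the two repository methods above already do this for you.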

How to use Lambda expressions in java for nested if-else and for loop together

I have the following code, which receives a list of names as a parameter. In the loop, I first assign the value at index 0 of the list to the local variable name. After that, I compare each subsequent value in the list with name. If any value is not equal, I set a flag (a in the code) to 1 and fail the test case.
Below is the ArrayList:
List<String> names= new ArrayList<String>();
names.add("John");
names.add("Mark");
Below is my selenium test method
public void test(List<String> names) {
    String name = null;
    int a = 0;
    for (String value : names) {
        if (name == null) {
            System.out.println("Value is null");
            name = value;
        } else if (name.equals(value)) {
            System.out.println("Received Same name");
            name = value;
        } else {
            a = 1;
            Assert.fail("Received different name in between");
        }
    }
}
How can I convert the above code into lambda expressions? I'm using the Cucumber data model, so I receive the data as a list from the feature file. Since I can't give a clear explanation, I have just posted the example logic I need to convert to a lambda expression.
Here's a solution: it goes through every element in your list and checks whether they are all the same.
You can try adding to or editing the list to get different outputs. I've written the logic; you can easily put it into a JUnit test.
List<String> names= new ArrayList<>();
names.add("John");
names.add("Mark");
String firstEntry = names.get(0);
boolean allMatch = names.stream().allMatch(name -> firstEntry.equals(name));
System.out.println("All names are the same: "+allMatch);
Are you checking whether the list contains differing values? Whenever there is more than one distinct value, set a = 1 and fail the assertion. You can achieve this as follows:
List<String> names = new ArrayList<String>();
names.add("John");
names.add("Mark");
int a = 0;
if (names.stream().distinct().limit(2).count() > 1) {
    a = 1;
    Assert.fail("Received different name in between");
} else {
    System.out.println("Received Same name");
}
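Both answers assume a non-empty list without null entries. A small helper that also covers those cases could look like the sketch below; NameChecks and allSame are hypothetical names, and treating an empty list as "all the same" is an assumption you may want to change.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical helper wrapping the allMatch idea from the answer above.
class NameChecks {
    // True when every element equals the first one.
    // An empty list is treated as "all the same" (an assumption of this sketch).
    static boolean allSame(List<String> names) {
        if (names.isEmpty()) {
            return true;
        }
        String first = names.get(0);
        // Objects.equals avoids a NullPointerException if the first entry is null.
        return names.stream().allMatch(name -> Objects.equals(first, name));
    }
}
```

In a test you could then write Assert.assertTrue("Received different name in between", NameChecks.allSame(names)); instead of the loop.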

Need to implement ETL logic of generating sequence number through Pig Script

I have a current ETL Logic which I have to implement in pig.
ETL Logic is creating a unique sequence number for the column if incoming value is null or blank.
Need to do this through pig.
You can generate a sequence number using RANK, but your case is a little different: you only assign a sequence number when the value is 0 or null.
In my view, you should use a UDF for this.
package pig.test;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
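public class SequenceNumber extends EvalFunc<Integer> {
    // Static counter: shared by all calls in one JVM, not across tasks.
    static int cnt = 0;

    @Override
    public Integer exec(Tuple v) throws IOException {
        int a = (Integer) v.get(0);
        if (a == 0) {
            // 0 marks a missing id (see the Pig script), so hand out the next number.
            cnt++;
            return cnt;
        }
        return a;
    }
}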
In pig:
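-- Assumes the jar containing pig.test.SequenceNumber has already been registered.
DEFINE sqno pig.test.SequenceNumber();
-- Step 1: replace all nulls with 0
A1 = FOREACH A GENERATE *, (id is null ? 0 : id) AS sq;
-- Step 2: put sq first so the UDF receives it as its first argument
T1 = FOREACH A1 GENERATE sq, <your_fields>, <your_fields>;
-- Step 3: generate a sequence number wherever sq is 0
Result = FOREACH T1 GENERATE sqno(*), <your_fields>, <your_fields>;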
Hope this will help!!
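The logic the UDF applies row by row can be tried out in plain Java first. This is only a sketch of the idea, with no Pig classes; SequenceFiller and fillMissing are hypothetical names, and it uses null rather than 0 as the "missing" marker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java model of the UDF's behaviour.
class SequenceFiller {
    // Returns a copy of ids where each null entry is replaced by the
    // next value of a running counter (1, 2, 3, ...).
    static List<Integer> fillMissing(List<Integer> ids) {
        List<Integer> result = new ArrayList<>(ids.size());
        int next = 0;
        for (Integer id : ids) {
            if (id == null) {
                next++;
                result.add(next);
            } else {
                result.add(id);
            }
        }
        return result;
    }
}
```

Note that nothing here stops a generated number from colliding with an id that already exists in the data; the UDF above has the same property, so you may need to start the counter above the largest existing id.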

Pig: Get index in nested foreach

I have a pig script with code like :
scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
scoresGrouped = GROUP scores by id;
top10s = foreach scoresGrouped{
sorted = order scores by score DESC;
sorted10 = LIMIT sorted 10;
GENERATE group as id, sorted10.scoreid as top10candidates;
};
It gets me a bag like
id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}
However, I wish to include the index of items as well, so I'd have results like
id1, {(scoreidA,1),(scoreidB,2),(scoreIdC,3)..(scoreIdFoo,10)}
Is it possible to include the index somehow in the nested foreach, or would I have to write my own UDF to add it in afterwards?
For indexing elements in a bag you may use the Enumerate UDF from LinkedIn's DataFu project:
register '/path_to_jar/datafu-0.0.4.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
scores = ...
...
result = foreach top10s generate id, Enumerate(top10candidates);
You'll need a UDF, whose only argument is the sorted bag you want to add a rank to. I've had the same need before. Here's the exec function to save you a little time:
public DataBag exec(Tuple b) throws IOException {
    // newBag (DataBag) and n (long) are assumed to be fields of the UDF class;
    // reset them here so each input bag is numbered starting from 1.
    newBag = BagFactory.getInstance().newDefaultBag();
    n = 1L;
    try {
        DataBag bag = (DataBag) b.get(0);
        Iterator<Tuple> it = bag.iterator();
        while (it.hasNext()) {
            Tuple t = it.next();
            if (t != null && t.size() > 0 && t.get(0) != null) {
                // Append the current index as an extra field, then advance it.
                t.append(n++);
            }
            newBag.add(t);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2106;
        String msg = "Error while computing item number in " + this.getClass().getSimpleName();
        throw new ExecException(msg, errCode, PigException.BUG, e);
    }
    return newBag;
}
(The counter n and the output bag newBag are class variables declared outside the exec function.)
You can also implement the Accumulator interface, which will allow you to do this even if your entire bag won't fit in memory. (The COUNT built-in function does this.) Just be sure to set n = 1L; in the cleanup() method and return newBag; in getValue(), and everything else is the same.
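To make the accumulate / getValue / cleanup lifecycle concrete, here is a plain-Java model of it. It deliberately does not implement Pig's Accumulator interface (strings stand in for tuples and a list for the bag), and IndexingAccumulator is a hypothetical name, so treat it as a sketch of the state handling only.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the lifecycle; not Pig's actual Accumulator interface.
class IndexingAccumulator {
    private long n = 1L;
    private List<String> indexed = new ArrayList<>();

    // Called once per chunk of the bag; the index keeps counting across chunks.
    void accumulate(List<String> chunk) {
        for (String item : chunk) {
            indexed.add(item + "," + n++);
        }
    }

    // Called once all chunks of the current bag have been seen.
    List<String> getValue() {
        return indexed;
    }

    // Called before the next bag starts; resets the counter and the output.
    void cleanup() {
        n = 1L;
        indexed = new ArrayList<>();
    }
}
```

Feeding it the chunks [a, b] and [c] yields a,1 b,2 c,3, and after cleanup() the next bag starts again at 1. Forgetting the reset in cleanup() is what makes indexes run on from one group into the next.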

Can I pass parameters to UDFs in Pig script?

I am relatively new to PigScript. I would like to know if there is a way of passing parameters to Java UDFs in Pig?
Here is the scenario:
I have a log file which has different columns (each representing a primary key in another table). My task is to get the count of distinct primary key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? Like if I can pass a row number as parameter to UDF, it avoids the need for me writing multiple UDFs.
The way to do it is by using DEFINE and the constructor of the UDF. Here is an example of a custom "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.
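On the Java side, the arguments given in DEFINE arrive as strings in the UDF's constructor. The sketch below shows that shape with a plain class so it runs without Pig; DelimiterSplitter is a hypothetical stand-in for com.sample.MySplitter, whose real code is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the UDF: configuration goes in through the
// constructor once, and each call only receives the data to process.
class DelimiterSplitter {
    private final String delimiter;

    // In a real UDF, this is where the DEFINE argument (',' above) would arrive.
    DelimiterSplitter(String delimiter) {
        this.delimiter = delimiter;
    }

    // Pattern.quote makes delimiters such as "|" or "." split literally
    // instead of being interpreted as regular expressions.
    List<String> split(String value) {
        return Arrays.asList(value.split(Pattern.quote(delimiter)));
    }
}
```

The same idea answers the original question: one UDF class taking the column position as a constructor argument, with a separate DEFINE alias per column, instead of one class per column.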
To pass parameters, do the following in your Pig script:
UDF(document, '$param1', '$param2', '$param3')
edit: Not sure if those params need to be wrapped in ' ' or not
while in your UDF you do:
public class UDF extends EvalFunc<Boolean> {
public Boolean exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return false;
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
String var1 = input.get(1).toString();
InputStream var1In = fs.open(new Path(var1));
String var2 = input.get(2).toString();
InputStream var2In = fs.open(new Path(var2));
String var3 = input.get(3).toString();
InputStream var3In = fs.open(new Path(var3));
return doyourthing(input.get(0).toString());
}
}
for example
Yes, you can pass any parameter in the Tuple parameter input of your UDF:
exec(Tuple input)
and access it using
input.get(index)