Pig: Get index in nested foreach - apache-pig

I have a pig script with code like :
scores = LOAD 'file' as (id:chararray, scoreid:chararray, score:int);
scoresGrouped = GROUP scores by id;
top10s = foreach scoresGrouped{
sorted = order scores by score DESC;
sorted10 = LIMIT sorted 10;
GENERATE group as id, sorted10.scoreid as top10candidates;
};
It gets me a bag like
id1, {(scoreidA),(scoreidB),(scoreIdC)..(scoreIdFoo)}
However, I wish to include the index of items as well, so I'd have results like
id1, {(scoreidA,1),(scoreidB,2),(scoreIdC,3)..(scoreIdFoo,10)}
Is it possible to include the index somehow in the nested foreach, or would I have to write my own UDF to add it in afterwards?

For indexing elements in a bag you may use the Enumerate UDF from LinkedIn's DataFu project:
register '/path_to_jar/datafu-0.0.4.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
scores = ...
...
result = foreach top10s generate id, Enumerate(top10candidates);

You'll need a UDF, whose only argument is the sorted bag you want to add a rank to. I've had the same need before. Here's the exec function to save you a little time:
public DataBag exec(Tuple b) throws IOException {
try {
DataBag bag = (DataBag) b.get(0);
Iterator<Tuple> it = bag.iterator();
while (it.hasNext()) {
Tuple t = (Tuple)it.next();
if (t != null && t.size() > 0 && t.get(0) != null) {
t.append(n++);
}
newBag.add(t);
}
} catch (ExecException ee) {
throw ee;
} catch (Exception e) {
int errCode = 2106;
String msg = "Error while computing item number in " + this.getClass().getSimpleName();
throw new ExecException(msg, errCode, PigException.BUG, e);
}
return newBag;
}
(The counter n is initialized as a class variable outside the exec function.)
You can also implement the Accumulator interface, which will allow you to do this even if your entire bag won't fit in memory. (The COUNT built-in function does this.) Just be sure to set n = 1L; in the cleanup() method and return newBag; in getValue(), and everything else is the same.

Related

Hibernate Search manual indexing throw a "org.hibernate.TransientObjectException: The instance was not associated with this session"

I use Hibernate Search 5.11 on my Spring Boot 2 application, allowing to make full text research.
This librairy require to index documents.
When my app is launched, I try to re-index manually data of an indexed entity (MyEntity.class) each five minutes (for specific reason, due to my server context).
I try to index data of the MyEntity.class.
MyEntity.class has a property attachedFiles, which is an hashset, filled with a join #OneToMany(), with lazy loading mode enabled :
#OneToMany(mappedBy = "myEntity", cascade = CascadeType.ALL, orphanRemoval = true)
private Set<AttachedFile> attachedFiles = new HashSet<>();
I code the required indexing process, but an exception is thrown on "fullTextSession.index(result);" when attachedFiles property of a given entity is filled with one or more items :
org.hibernate.TransientObjectException: The instance was not associated with this session
The debug mode indicates a message like "Unable to load [...]" on entity hashset value in this case.
And if the HashSet is empty (not null, only empty), no exception is thrown.
My indexing method :
private void indexDocumentsByEntityIds(List<Long> ids) {
final int BATCH_SIZE = 128;
Session session = entityManager.unwrap(Session.class);
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
CriteriaBuilder builder = session.getCriteriaBuilder();
CriteriaQuery<MyEntity> criteria = builder.createQuery(MyEntity.class);
Root<MyEntity> root = criteria.from(MyEntity.class);
criteria.select(root).where(root.get("id").in(ids));
TypedQuery<MyEntity> query = fullTextSession.createQuery(criteria);
List<MyEntity> results = query.getResultList();
int index = 0;
for (MyEntity result : results) {
index++;
try {
fullTextSession.index(result); //index each element
if (index % BATCH_SIZE == 0 || index == ids.size()) {
fullTextSession.flushToIndexes(); //apply changes to indexes
fullTextSession.clear(); //free memory since the queue is processed
}
} catch (TransientObjectException toEx) {
LOGGER.info(toEx.getMessage());
throw toEx;
}
}
}
Does someone have an idea ?
Thanks !
This is probably caused by the "clear" call you have in your loop.
In essence, what you're doing is:
load all entities to reindex into the session
index one batch of entities
remove all entities from the session (fullTextSession.clear())
try to index the next batch of entities, even though they are not in the session anymore... ?
What you need to do is to only load each batch of entities after the session clearing, so that you're sure they are still in the session when you index them.
There's an example of how to do this in the documentation, using a scroll and an appropriate batch size: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#search-batchindex-flushtoindexes
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
Thanks for the explanations #yrodiere, they helped me a lot !
I chose your alternative solution :
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
...and everything works perfectly !
Well seen !
See the code solution below :
private List<List<Object>> splitList(List<Object> list, int subListSize) {
List<List<Object>> splittedList = new ArrayList<>();
if (!CollectionUtils.isEmpty(list)) {
int i = 0;
int nbItems = list.size();
while (i < nbItems) {
int maxLastSubListIndex = i + subListSize;
int lastSubListIndex = (maxLastSubListIndex > nbItems) ? nbItems : maxLastSubListIndex;
List<Object> subList = list.subList(i, lastSubListIndex);
splittedList.add(subList);
i = lastSubListIndex;
}
}
return splittedList;
}
private void indexDocumentsByEntityIds(Class<Object> clazz, String entityIdPropertyName, List<Object> ids) {
Session session = entityManager.unwrap(Session.class);
List<List<Object>> splittedIdsLists = splitList(ids, 128);
for (List<Object> splittedIds : splittedIdsLists) {
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
Transaction transaction = fullTextSession.beginTransaction();
CriteriaBuilder builder = session.getCriteriaBuilder();
CriteriaQuery<Object> criteria = builder.createQuery(clazz);
Root<Object> root = criteria.from(clazz);
criteria.select(root).where(root.get(entityIdPropertyName).in(splittedIds));
TypedQuery<Object> query = fullTextSession.createQuery(criteria);
List<Object> results = query.getResultList();
int index = 0;
for (Object result : results) {
index++;
try {
fullTextSession.index(result); //index each element
if (index == splittedIds.size()) {
fullTextSession.flushToIndexes(); //apply changes to indexes
fullTextSession.clear(); //free memory since the queue is processed
}
} catch (TransientObjectException toEx) {
LOGGER.info(toEx.getMessage());
throw toEx;
}
}
transaction.commit();
}
}

How Do I Return the Index of type T in a Collection based on some criteria?

Kotlin has some pretty cool functions for collections. However, I have come across a problem in which the solution is not apparent to me.
I have a List of Objects. Those Objects have an ID field which coincides with a SQLite database. SQL operations are performed on the database, and a new list is generated. How can the index of an item from the new list be found based on the "ID" field (or any other field for that matter)?
the Collection.find{} function return the object, but not the index.
indexOfFirst can find the index of the first element of a collection that satisfies a specified predicate.
We have a DB SQlite that a call is made to to retrieve parentList We can obtain the items in the ArrayList with this type of code
fun onDoIt(view: View){
initDB()
for (t in 0..X-1) {
var N:String = parentList[t].dept
// NOTE two syntax here [t] and get(t)
if(t == 1){
var B:String = parentList[0].idD.toString()
println("$$$$$$$$$$$$$$$$$$$$$ ====== "+B)
}
var I:String = parentList.get(t).idD.toString()
println("################### id "+I+" for "+N)
}
}
private fun initDB() {
parentList = db.querySPDept()
if (parentList.isEmpty()) {
title = "No Records in DB"
} else {
X = parentList.size
println("**************************************** SIZE " + X)
title = "SP View Activity"
}
}

Why does this Linq Query Return Different Results than SQL Equivalent?

I'm sure I'm missing something simple but I have a linq query here:
public static List<Guid> GetAudience()
{
var createdOn = new DateTime(2018, 6, 30, 0, 0, 0);
var x = new List<Guid>();
try
{
var query = from acc in Account
where acc.num != null
&& acc.StateCode.Equals(0)
&& acc.CreatedOn < createdOn
select new
{
acc.Id
};
foreach (var z in query)
{
if (z.Id != null)
{
x.Add(z.Id.Value);
}
}
}
catch (Exception e)
{
Console.WriteLine(e);
}
return x;
}
I wanted to verify the count in SQL because it would only take a couple seconds so:
select count(*)
from Account a
where a.num is not null
and a.statecode = 0
and a.createdon < '2018-06-30 00:00:00'
And now the SQL query is returning 9,329 whereas Linq is returning 10,928. Why are my counts so far off when the queries are doing the same thing (so I thought)? What simple thing am I missing?
Thanks in advance--
Your method is returning a list of records where the Id values are not null (plus the other criteria). The SQL query is returning a count of the number of records (plus the other criteria). Without the definition of your table, it's hard to know whether that is significant.
Unrelated tip: it's not a good idea to catch and swallow exceptions like that - the caller of your method will have no idea that anything went wrong, so processing will continue; but it will be using incomplete data, potentially leading to other problems in your program later.

How to use Lambda expressions in java for nested if-else and for loop together

I have following Code where i will receive list of names as parameter.In the loop, first i'm assigning index 0 value from list to local variable name. There after comparing next values from list with name. If we receive any non-equal value from list, i'm assigning value of result as 1 and failing the test case.
Below is the Array list
List<String> names= new ArrayList<String>();
names.add("John");
names.add("Mark");
Below is my selenium test method
public void test(List<String> names)
String name=null;
int a=0;
for(String value:names){
if(name==null){
System.out.println("Value is null");
name=value;
}
else if(name.equals(value)){
System.out.println("Received Same name");
name=value;
}
else{
a=1;
Assert.fail("Received different name in between");
}
}
How can i convert above code into lambda expressions?. I'm using cucumber data model, hence i receive data as list from feature file. Since i can't give clear explanation, just posted the example logic i need to convert to lambda expression.
Here's the solution: it cycles all element in your list checking if are all the same.
You can try adding or editing the list so you can have different outputs. I've written the logic, you can easly put it into a JUnit test
List<String> names= new ArrayList<>();
names.add("John");
names.add("Mark");
String firstEntry = names.get(0);
boolean allMatch = names.stream().allMatch(name -> firstEntry.equals(name));
System.out.println("All names are the same: "+allMatch);
Are you looking for duplicates, whenever you have distinct value , set a=1 and say assert to fail. You can achieve this by :
List<String> names= new ArrayList<String>();
names.add("John");
names.add("Mark");
if (names.stream().distinct().limit(2).count() > 1) {
a= 1,
Assert.fail("Received different name in between");
} else {
System.out.println("Received Same name");
}

how to assign a sequentially increasing number to a relation

I am wondering if there is a way to create a new field in a relation and then assign some sequentially increasing number to it? Here is one example:
ordered_products = ORDER products BY price ASC;
ordered_products_with_sequential_id = FOREACH ordered_products GENERATE price, some_sequential_id;
How can I create some_sequential_id? I am not sure whether that's doable in Pig though.
I suspect you have to write your own UDF to get that running. One way to do it would be by incrementing a static variable (AtomicInteger) in your UDF in the exec implementation.
public class IncrEval extends EvalFunc<Long> {
final static AtomicLong res = new AtomicLong(0);
#Override
public Long exec (Tuple tip) throws IOException {
if (tip == null || tip.size() == 0) {
return null;
}
res.incrementAndGet();
return res.longValue();
}
}
Pig script entry:
b = FOREACH a GENERATE <something>, com.eval.IncrEval() AS ID:long;