How to iterate multiple times over data during PIG store function - apache-pig

I wonder if it is possible to write a user-defined store function for PIG that iterates twice over the data / input tuples.
I read here http://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions how to write your own store function, e.g. by implementing your own "getNext()" method.
For my use case, however, it is necessary to see every tuple twice in the "getNext()" method, so I wonder whether there is a way to do that, for example by resetting the reader somehow or by overriding some other method...
Additional information: I am looking for a way to iterate from tuple 1 to tuple n and then again from 1 to n.
Does anyone have an idea how to do something like that?
Thanks!
Sebastian

This is off the top of my head, but you could try something like this:
import java.io.IOException;

import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

class MyStorage extends PigStorage {
    private int counter = 0;
    private Tuple cachedTuple = null;

    @Override
    public Tuple getNext() throws IOException {
        // Read a fresh tuple on every other call; in between, hand the cached
        // tuple out again, so each input tuple is returned twice in a row.
        if (this.counter++ % 2 == 0) {
            this.cachedTuple = super.getNext();
        }
        return this.cachedTuple;
    }
}
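If you literally need two full passes (tuples 1 to n and then 1 to n again) rather than each tuple twice in a row, a hypothetical variant would be to buffer the tuples during the first pass and replay them afterwards. The class name TwoPassStorage is mine, it assumes the tuples of one input split fit in memory, and note that each task would only see and replay its own split:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

class TwoPassStorage extends PigStorage {
    private final List<Tuple> buffer = new ArrayList<Tuple>();
    private int replayIndex = -1;   // -1 while the first pass is still running

    @Override
    public Tuple getNext() throws IOException {
        if (replayIndex < 0) {
            Tuple t = super.getNext();
            if (t != null) {
                buffer.add(t);      // first pass: tuples 1..n
                return t;
            }
            replayIndex = 0;        // input exhausted, start the second pass
        }
        if (replayIndex < buffer.size()) {
            return buffer.get(replayIndex++);   // second pass: tuples 1..n again
        }
        return null;                // both passes done
    }
}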

Related

Ignite :Remove data from cache on count of 10 put operation in cache

I have a JSON object and I am putting it into the cache from a thread that runs every 5 seconds. I want to remove the cached data after every 10 put operations and write that data into a third-party database. How can I do this, and what are the techniques for it? If you have a sample example, please share. Thanks.
You can achieve a similar behaviour by using a cache store with write-behind along with an expiry policy.
But given the number of records that you want to keep in the cache, I would do something like this:
private static final int BATCH_SIZE = 10;

private final Map<K, V> batch = new HashMap<>();

public void addRecord(K key, V val) {
    batch.put(key, val);          // buffer the record
    if (batch.size() == BATCH_SIZE) {
        flush(batch);             // write the batched data into the database
        batch.clear();
    }
}
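For completeness, here is a minimal sketch of the write-behind plus expiry-policy approach mentioned above, assuming Ignite's standard CacheConfiguration API. JsonDbStore and WriteBehindConfigSketch are hypothetical names, and the actual database calls are left as comments:

import java.util.concurrent.TimeUnit;

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

// Hypothetical store that persists cache entries to the third-party database.
class JsonDbStore extends CacheStoreAdapter<String, String> {
    @Override public String load(String key) {
        return null; // read-through not needed for this use case
    }
    @Override public void write(Cache.Entry<? extends String, ? extends String> entry) {
        // INSERT entry.getKey() / entry.getValue() into the database here
    }
    @Override public void delete(Object key) {
        // DELETE the row for this key here
    }
}

class WriteBehindConfigSketch {
    static CacheConfiguration<String, String> jsonCacheConfig() {
        CacheConfiguration<String, String> cfg = new CacheConfiguration<>("jsonCache");

        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(JsonDbStore.class));
        cfg.setWriteThrough(true);

        // Write-behind batches the store writes; flush once ~10 entries are pending.
        cfg.setWriteBehindEnabled(true);
        cfg.setWriteBehindFlushSize(10);

        // Expire entries a while after creation so the cache does not keep growing.
        cfg.setExpiryPolicyFactory(
            CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 60)));

        return cfg;
    }
}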

Lodash functions are "type" sensitives

I've been using the latest version of lodash for quite some time and I like it. I have one question, though.
I noticed lodash functions are "type" sensitive:
_.find(users, {'age': 1}); will not work well if 1 is "1"
_.filter(users, {'age': "36"}); will not work if "36" is 36
Question
Is there a way to make lodash able to filter or find objects without taking the type into account?
_.find(users, {'age': 1}) would then return all objects whose age is a string or a number equal to 1.
It's because the comparison is done with === when you pass a condition object; however, you can always pass a callback for your own kind of checking.
For your purpose:
_.filter(users, function(user){return user.age==36});
This is a plain and simple way of finding things even in native JavaScript. However, if you really want the benefit of not writing callback code every time you have an object literal as query data, you can write a function which converts an object to its corresponding callback.
function convertToFilterCallback(obj) {
    var keys = Object.keys(obj);
    return function(each) {
        for (var idx = 0; idx < keys.length; idx++) {
            var key = keys[idx];
            // loose (==) comparison, so 36 and "36" are treated as equal
            if (obj[key] != each[key]) {
                return false;
            }
        }
        return true;
    };
}
and then use it like,
_.filter(users, convertToFilterCallback({..<your object literal>..}));
However, if you are doing so, you can just use the native find or filter methods, and there is no specific advantage over lodash.

How to enable parallelism for a custom U-SQL Extractor

I’m implementing a custom U-SQL Extractor for our internal file format (binary serialization). It works well in the "Atomic" mode:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class BinaryExtractor : IExtractor
If I switch off the "Atomic" mode, it looks like U-SQL is splitting the file in a random place (I guess just into 250MB chunks). This is not acceptable for me. The file format has a special row delimiter. Can I define a custom row delimiter in my Extractor and enable parallelism for it? Technically I can change our row delimiter to a new one if that helps.
Could anyone help me with this question?
The file is indeed split into chunks (I think it is 1 GB at the moment, but the exact value is implementation defined and may change for performance reasons).
If the file is indeed row delimited, and assuming your raw input data for the row is less than 4MB, you can use the input.Split() function inside your UDO to do the splitting into rows. The call will automatically handle the case if the raw input data spans the chunk boundary (assuming it is less than 4MB).
Here is an example:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // this._row_delim = this._encoding.GetBytes(row_delim); in class ctor
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            int num = 0;
            string[] array = streamReader.ReadToEnd().Split(new string[]{this._col_delim}, StringSplitOptions.None);
            for (int i = 0; i < array.Length; i++)
            {
                // DO YOUR PROCESSING
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
Please note that you cannot read across chunk boundaries yourself and you should make sure your data is indeed splittable into rows.

How to fetch subquery columns in Phalcon

What would be the best way (or any way, because I can't figure out anything that blends with Phalcon models at the moment) to fetch data from a query like this:
SELECT *, (select sum(l.log_volume) from logs l where l.parcel_id = p.parcel_id) as parcel_total_volume
FROM PARCELS p
What I want is basically to have calculated fields easily accessible from the model, preferably calculated on the SQL side and fetched with every record.
If this must be done in the model's PHP code instead, then how?
I would go for something like this:
<?php

class Parcels extends Phalcon\Mvc\Model
{
    private static $sumsCache = [];

    /*
    ...
    */

    public function getTotalVolume()
    {
        // Fetch the sum lazily and keep it in a static per-parcel_id cache.
        if (!isset(self::$sumsCache[$this->parcel_id])) {
            self::$sumsCache[$this->parcel_id] = Logs::sum([
                "column"     => "log_volume",
                "conditions" => "parcel_id = " . $this->parcel_id,
            ]);
        }
        return self::$sumsCache[$this->parcel_id];
    }
}
This way you fetch the sum JIT and you store a static cache for each parcel_id.

New ArrayList filtering from another ArrayList using a String as Filter

In my program, a Die (dice for embroidery) is a class with different fields. One of them is of the type String and it is called haveIt. So, if the user of the program enters the word "Yes" on the haveIt field, he should be able to track a list of all the Dies he has, on the myInventory list.
How do I do this? Should I create the myInventory ArrayList<Die> in the fields and constructor of my Controller class, or should I build it inside a special method in that class?
I have tried everything and nothing works. But I am really new at this.
Here is my last attempt, creating a loop to create the new ArrayList<Die> (that has "Yes" on the haveIt field) from a special getMyInventory method in my Controller class:
public ArrayList<Die> getMyInventory(Die anyDie) {
    for (int counting = 0; counting < diesBigList.Count; counting++);
    {
        if (((Die)diesBigList[counting]).doIHaveIt.contains("Yes"))
            myInventory.add(diesBigList[counting]);
        return myInventory;
    }
}
It does not compile. It tells me that the result should be an Array type but it is resolved as ArrayList... (and I do not understand that).
Thanks in advance.
You're missing a return statement. What if this is never true?
if (((Die)diesBigList[counting]).doIHaveIt.contains("Yes"))
Then you never reach your return statement.
Here is the answer
public ArrayList<Die> getMyInventory(Die anyDie) {
    ArrayList<Die> myInventory = new ArrayList<Die>();
    for (int counting = 0; counting < diesBigList.Count; counting++) {
        if (((Die)diesBigList[counting]).doIHaveIt.contains("Yes")) {
            myInventory.add(diesBigList[counting]);
        }
    }
    return myInventory;
}
Also, there could be a problem with this: diesBigList.Count. I have no idea where you got that object or what its methods look like, but I presume in my code you're making that call correctly.
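As a side note, assuming diesBigList is a plain java.util.List<Die> field on the Controller (so the C#-style Count and indexer become size() and iteration), a minimal Java sketch of the same filtering, with the unused anyDie parameter dropped, could look like this:

import java.util.ArrayList;
import java.util.List;

public List<Die> getMyInventory() {
    List<Die> myInventory = new ArrayList<Die>();
    for (Die die : diesBigList) {
        // doIHaveIt is the String field that the user sets to "Yes"
        if (die.doIHaveIt != null && die.doIHaveIt.contains("Yes")) {
            myInventory.add(die);
        }
    }
    return myInventory;
}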