Can mrjob tasks output sets?

I tried outputting a python set from a mapper in mrjob. I changed the function signatures of my combiners and reducers accordingly.
However, I get this error:
Counters From Step 1
  Unencodable output:
    TypeError: 172804
When I change the sets to lists, this error disappears. Are there certain python types that cannot be outputted by mappers in mrjob?

Values are moved between the stages of a MapReduce using protocols, generally raw, JSON or pickle.
You must make sure that the values being moved around can be handled by the protocol you pick. JSON has no representation of a set, so the default JSONProtocol cannot encode one, and the raw protocols only pass strings through, so they are no help here either.
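To see the difference outside mrjob, try both standard-library encoders on a set directly:

import json
import pickle

value = {'a', 'b', 'c'}
pickle.dumps(value)       # fine: pickle can serialize sets
json.dumps(list(value))   # fine once the set is converted to a list
json.dumps(value)         # raises TypeError: sets are not JSON serializable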
Try setting INTERNAL_PROTOCOL to the pickle protocol, like so:
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class yourMR(MRJob):
    # pickle can serialize sets, so they survive the trip between steps
    INTERNAL_PROTOCOL = PickleProtocol

    def mapper(self, key, value):
        pass  # mapper body; it may yield sets now

    def reducer(self, key, values):
        pass  # reducer body
Note: mrjob will handle the pickling and unpickling for you, so don't worry about that aspect. You can also set INPUT_PROTOCOL and OUTPUT_PROTOCOL if necessary (for multiple steps, or to emit sets from the final reducer).

Send heavy data through protobuf. Custom field

I'm developing an API for an application using protobuf and gRPC.
I need to send data of arbitrary size, for example a NumPy array. Sometimes it is small, sometimes huge. If the size is small I want to send the data itself through protobuf; if the size is huge I want to dump the data into a file and send the path to that file through protobuf instead.
To do so I've created the following .proto messages:
message NumpyTroughProtobuf {
  repeated int32 shape = 1;
  repeated float array = 2;
}

message NumpyTroughfile {
  string filepath = 1;
}

message NumpyTrough {
  google.protobuf.Any data = 1;
}
The logic is simple: if the size is big I put a NumpyTroughfile into data, otherwise a NumpyTroughProtobuf.
Problem (what I want to avoid):
The mechanism of data transformation is part of my app.
In the current approach I have to check and convert the data before I create the NumpyTrough message. So I have to add logic to my application that takes care of checking and casting the data, and I have to do the same for every language I use (for example if I send messages from Python to C++).
What I want to do:
The mechanism of data transformation should be part of a customized protobuf.
I want to hide the data transformation: my app should simply put a plain NumPy array into the NumpyTrough.data field, and all transformation should be hidden.
So the logic of data transformation should live in a custom protobuf field, not in my application.
That means I would like to create a custom field type. I would implement the behaviour of this field (marshal/unmarshal) once for each language I use. Then I could send NumPy data directly into this custom field, and the field itself would decide how to proceed: dump the data to a file or use some other method, send it through protobuf, and restore it on the receiver side.
Something like this: https://github.com/gogo/protobuf/blob/master/custom_types.md, but it seems this is not part of the protobuf ecosystem.
Protobuf only defines the schema.
You can't add logic to a protobuf definition.
Protobuf's Any represents arbitrary binary data, so somewhere you'll need to explain to your users what it represents so that they can ship data in the correct format to your service.
You get to decide how to distribute the processing of the data:
Either partly client-side functionality that preprocesses the data and ships the output (as structured data using non-Any types or, if still necessary, as Any),
Or partly server-side functionality that receives entirely unprocessed client-side data shipped through Any,
Or some combination of the two.
NOTE: You may want to consider shipping the data as file references regardless of size to simplify your implementation. You're correct to bias protobuf toward smaller message sizes but, depending on the file size distribution, does it make sense to complicate your implementation with two paths?
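If you do keep the two-path design, the client-side preprocessing option could be a small helper along these lines. This is only a sketch: it assumes the generated classes from the .proto above live in a hypothetical module numpy_trough_pb2, and the 1 MiB threshold and spill path are arbitrary placeholders.

import numpy as np
import numpy_trough_pb2 as pb  # hypothetical generated module from the .proto above

SIZE_THRESHOLD = 1 << 20  # 1 MiB, arbitrary cutoff

def pack_array(arr, spill_path='/tmp/payload.npy'):
    """Wrap a NumPy array either inline or as a file reference inside Any."""
    msg = pb.NumpyTrough()
    if arr.nbytes <= SIZE_THRESHOLD:
        inline = pb.NumpyTroughProtobuf(
            shape=list(arr.shape),
            array=arr.astype(np.float32).ravel().tolist())
        msg.data.Pack(inline)                 # google.protobuf.Any.Pack
    else:
        np.save(spill_path, arr)              # dump the data to a file instead
        msg.data.Pack(pb.NumpyTroughfile(filepath=spill_path))
    return msg

The receiver can then check msg.data.Is(pb.NumpyTroughProtobuf.DESCRIPTOR) to decide which branch to unpack, which is exactly the kind of logic that has to live in application code rather than in the schema.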

TensorFlow Dataset `.map` - Is it possible to ignore errors?

Short version:
When using Dataset map operations, is it possible to specify that any 'rows' where the map invocation results in an error are quietly filtered out rather than having the error bubble up and kill the whole session?
Specifics:
I have an input pipeline set up that (more or less) does the following:
reads a set of file paths of images stored locally (images of varying dimensions)
reads a suggested set of 'bounding boxes' from a CSV
produces the set of all image-path-to-bounding-box combinations
reads and decodes each image, then produces the set of 'cropped' images for each of these combinations using tf.image.crop_to_bounding_box
My issue is that there are (very rare) instances where my suggested bounding boxes are outside the bounds of a given image so (understandably) tf.image.crop_to_bounding_box throws an error something like this:
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [width must be >= target + offset.]
which kills the session.
I'd prefer it if these errors were simply ignored and that the pipeline moved onto the next combination.
(I understand that the correct fix for this specific issue would be to commit the time to checking, in the step before, that each bounding box actually fits within its image's dimensions, and to filter the impossible combinations out with a filter operation before they reach the map with the cropping operation. I was wondering whether there is an easy way to just ignore an error and move on to the next case, both for ease of implementation in this specific case and in more general cases.)
For TensorFlow 2:
dataset = dataset.apply(tf.data.experimental.ignore_errors())
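A self-contained toy example of the TF2 form, where tf.debugging.check_numerics stands in for any op that raises InvalidArgumentError for some elements:

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 0.0, 4.0])
# 1/0 produces Inf, which check_numerics turns into an InvalidArgumentError
ds = ds.map(lambda x: tf.debugging.check_numerics(1.0 / x, "bad element"))
ds = ds.apply(tf.data.experimental.ignore_errors())
print(list(ds.as_numpy_iterator()))  # [1.0, 0.5, 0.25] -- the bad element is dropped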
There is tf.contrib.data.ignore_errors. I've never tried this myself, but according to the docs the usage is simply
dataset = dataset.map(some_map_function)
dataset = dataset.apply(tf.contrib.data.ignore_errors())
It should simply pass through the inputs (i.e. return the same dataset) but drop any elements that throw an error.
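For completeness, the pre-filtering alternative mentioned in the question could look roughly like the sketch below. It assumes each dataset element is an (image, box) pair with box = [offset_height, offset_width, target_height, target_width] as int32, matching the argument order of tf.image.crop_to_bounding_box:

import tensorflow as tf

def box_fits(image, box):
    # keep only combinations whose crop window lies fully inside the image
    shape = tf.shape(image)
    return tf.logical_and(shape[0] >= box[0] + box[2],
                          shape[1] >= box[1] + box[3])

def crop(image, box):
    return tf.image.crop_to_bounding_box(image, box[0], box[1], box[2], box[3])

# `dataset` yields the (image, box) pairs produced by the earlier pipeline steps
dataset = dataset.filter(box_fits).map(crop)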

How to add a new syntax element in HM (HEVC Test Model)

I've been working on the HM reference software for a while to improve something in the intra prediction part. A new intra prediction algorithm has now been added to the code, and I let the encoder choose between my algorithm and HM's default algorithm (according to the RD cost, of course).
What I need now is to signal a flag for each PU, so that the decoder can perform the same algorithm that the encoder chose in the rate-distortion loop.
I want to know what exactly I should do to properly add this one-bit flag to the bitstream without breaking anything in the code.
Assuming that I want to use a CABAC context model to keep track of my flag's statistics, what else should I do besides the following:
adding a new context model like ContextModel3DBuffer m_cCUIntraAlgorithmSCModel to the TEncSbac.h file;
properly initializing the model (on both the encoder and decoder sides) by looking at how HM initializes its other context models;
calling m_pcBinIf->encodeBin(myFlag, cCUIntraAlgorithmSCModel) on the encoder side and m_pcTDecBinIf->decodeBin(myFlag, cCUIntraAlgorithmSCModel) on the decoder side.
I have taken these three steps, but apparently something breaks.
PS: Even equiprobable signaling (i.e. without using CABAC contexts) would be useful. I just want to send this flag peacefully!
Thanks in advance.
I was finally able to solve this problem. It was a bug in the CABAC context initialization.
But I want to share the experience, as many people may want to do the same thing.
The three steps that I described are essentially all that is necessary to add a new syntax element, but one must be very careful with the following:
First, you need to decide whether you want to use a separate context model for your syntax element or reuse an existing one. If you use a separate one, you should define a ContextModel3DBuffer, and the best way to do that is to find a similar syntax element in the code, then duplicate its ContextModel3DBuffer definition and ALL of its occurrences in the code. This way you make sure nothing is missed.
Encoding of each syntax element happens in two different places: first in the RDO loop, to make a "decision", and second during the actual encoding phase, when the decisions are encoded (e.g. the encodeCtu function).
The order of encoding/decoding syntax elements must be the same on the encoder and decoder sides. For example, if your new syntax element is encoded after splitFlag and before predMode on the encoder side, you should decode it exactly between splitFlag and predMode on the decoder side.
The context model is implemented as a 3D matrix in order to track the statistics of syntax elements separately for different block sizes, components, etc. This means that when you call encodeBin, you must make sure the correct index is being used. I made stupid mistakes in this part!
Apart from the above remarks, I found the function getState very useful for debugging. It returns the state of your CABAC context model at any point in the code where you have access to it. It is very useful for comparing the state at the same point on the encoder and decoder sides when you have a mismatch. For example, it happens a lot that you encode a 1 but decode a 0. In that case you need to check the state of your CABAC context before encoding and before decoding; they should be the same. If they are not, trace the error back to find the first point of mismatch.
I hope it was helpful.

GNU Radio block with variable number of inputs/outputs

I am currently trying to do the signal processing of multiple channels in parallel using a custom source block. So far I have created an OOT source block which streams data for a single channel into one output perfectly fine.
Now I am searching for a way to expand this module so that it can support a higher number of channels (= outputs of the source block; up to 64) in parallel. Because the protocol used to pull the samples pulls them for all channels at once, it is not possible to use multiple instances of the same source block.
Things I have found so far:
A PDF which seems to explain exactly what to do, but it appears to be outdated and this functionality no longer seems to be supported in GNU Radio.
The description of a feature which should be implemented in the future.
Is there a known solution or workaround for this problem?
Look at the add block: it has a configurable number of inputs!
Now, the trick here is threefold:
define an io_signature as input and output that allows for adjustable numbers. If you use gr_modtool add to create a new block, your io_signatures will be filled with <+MIN_IN+>, <+MAX_IN+>, <+MIN_OUT+> and <+MAX_OUT+>. Adjust these to reflect your actual minimum and maximum in- and output port numbers. If you want to have 1 to infinity inputs, use 1, -1.
in your (general_)work method, check for the number of inputs by doing something like ninputs = input_items.size(), and for the number of outputs by doing noutputs = output_items.size().
(optionally, if you want to use GRC) modify the <sink>/<source> definitions in your block GRC XML:
<sink>
  <name>in</name>
  <type>complex</type>
  <nports>$num_inputs</nports>
</sink>
num_inputs could be a block parameter; compare the add_XX block source code.
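If your block lives in Python rather than C++, the same idea applies: the number of ports is just a constructor parameter that you use to build the signature. A minimal sketch of a source block with a configurable number of outputs (the zero-fill is a placeholder for however your protocol actually delivers the per-channel samples):

import numpy as np
from gnuradio import gr

class multi_channel_source(gr.sync_block):
    """Source block with a configurable number of output channels."""

    def __init__(self, num_channels=4):
        gr.sync_block.__init__(
            self,
            name='multi_channel_source',
            in_sig=None,                            # a source block has no inputs
            out_sig=[np.complex64] * num_channels,  # one output stream per channel
        )

    def work(self, input_items, output_items):
        nout = len(output_items[0])
        for out in output_items:
            # placeholder: copy this channel's samples from your protocol here
            out[:nout] = np.zeros(nout, dtype=np.complex64)
        return nout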

How to find large objects in ZODB

I'm trying to analyze my ZODB because it grew really large (it's also large after packing).
The package zodbbrowser has a feature that displays the amount of bytes of an object. It does so by getting the length of the pickled state (name of the variable), but it also does a bit of magic which I don't fully understand.
How would I go to find the largest objects in my ZODB?
I've written a method which should do exactly this. Feel free to use it, but be aware that this is very memory consuming. The package zodbbrowser must be installed.
def zodb_objects_by_size(self):
    """
    Recurse over the ZODB tree starting from self.aq_parent. For
    each object, use zodbbrowser's implementation to get the raw
    object state. Put each length into a Counter object and
    return a list of the biggest objects, specified by path and
    size.
    """
    from zodbbrowser.history import ZodbObjectHistory
    from collections import Counter

    def recurse(obj, results):
        # Retrieve the state pickle from the ZODB and get its length
        history = ZodbObjectHistory(obj)
        pstate = history.loadStatePickle()
        length = len(pstate)

        # Add the length to the Counter instance, keyed by path
        path = '/'.join(obj.getPhysicalPath())
        results[path] = length

        # Recursion
        for child in obj.contentValues():
            # Work around portal tools and other weird objects which
            # seem to contain themselves
            if child.contentValues() == obj.contentValues():
                continue
            # Rolling in the deep
            try:
                recurse(child, results)
            except (RuntimeError, AttributeError):
                import pdb; pdb.set_trace()  # go debug

    results = Counter()
    recurse(self.aq_parent, results)
    return results.most_common()
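The return value is a list of (path, size) pairs sorted largest first, as produced by Counter.most_common(), so inspecting the biggest offenders might look like this (some_content_object is hypothetical; any object rooted in the site that carries the method works as self):

biggest = some_content_object.zodb_objects_by_size()
for path, size in biggest[:10]:
    print('{:>12,} bytes  {}'.format(size, path))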