mrjob: suppress key (or value) in reducer output - mrjob

By default, mrjob writes both the key and the value of the output in key[tab]value format.
This happens even if the key (or the value) is empty, null, or otherwise not interesting. Suppose my key/value pair is None, {"a":1, "b":2}. Then I get this:
None {"a":1, "b":2}
Is there a way to suppress the key or the value? I just want this:
{"a":1, "b":2}
BTW, I've already tried this. Am I missing something...?
import mrjob.protocol
from mrjob.job import MRJob

class MyMrJobClass(MRJob):
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

    def step1_mapper(self, _, line):
        ...
        yield my_key, my_value

    def step1_reducer(self, key, values):
        for v in values:
            ...
            yield None, my_data

    def steps(self):
        return [
            self.mr(
                mapper=self.step1_mapper,
                reducer=self.step1_reducer,
            ),
        ]
NB: I know that I don't need to override steps for a single-step job. This will eventually be a multi-step job, so it's important to build the class that way.
Thanks!

You can use mrjob.protocol.JSONValueProtocol (note the Value; see the documentation) as the output protocol instead of mrjob.protocol.JSONProtocol.
The documentation has more information on using custom protocols.
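For example, a minimal sketch of the job above with JSONValueProtocol swapped in (the mapper/reducer bodies and the yielded key/value here are placeholder assumptions, not part of the original job):

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

class MyMrJobClass(MRJob):
    # JSONValueProtocol writes only the value, so the None key is dropped
    OUTPUT_PROTOCOL = JSONValueProtocol

    def step1_mapper(self, _, line):
        yield "my_key", {"a": 1, "b": 2}

    def step1_reducer(self, key, values):
        for v in values:
            yield None, v  # only the JSON-encoded value reaches the output

    def steps(self):
        return [
            self.mr(
                mapper=self.step1_mapper,
                reducer=self.step1_reducer,
            ),
        ]

if __name__ == '__main__':
    MyMrJobClass.run()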

Related

How do I use fuzzy matching validation for either a non-present key or an empty array?

In karate version 0.9.6 I used the following match statement in a .feature file and it worked for validating the value to be an empty array or a key that was not present.
* def example = {}
* match example.errors == '##[0]'
In 1.0, the documentation example suggests that this should check for the key being present and either null or an empty array, and testing this fails with a validation error that the value is not present.
From https://karatelabs.github.io/karate/#schema-validation
# should be null or an array of strings
* match foo == '##[] #string'
This appears to be an undocumented breaking change from pre-1.0 to 1.0.
My question is: how do I construct a validator to cover this case correctly when the key is allowed to be absent but if it is present it must be an empty array?
I've found an undesirable solution for now but am leaving this open in case someone has a better answer.
I'm validating the entire parent object with a minimal schema:
Replace
* match $.errors == '##[0]'
With
* match $ == { data: '#object', extensions: '##object', errors: '##[0]' }
While more brittle and verbose, it technically works.
This indeed looks like an unintended breaking change. Here is another workaround:
* def example = {}
* def expected = example.errors ? '#[0]' : '#notpresent'
* match example.errors == expected
I see you have opened an issue here: https://github.com/karatelabs/karate/issues/1825
EDIT: this might be an improvement over the workaround you came up with in your answer:
* match example contains { errors: '##[0]' }
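Putting that together, a quick sketch (the variable names below are made up) that should pass both when the key is absent and when it is present as an empty array:

* def absent = {}
* match absent contains { errors: '##[0]' }
* def present = { errors: [] }
* match present contains { errors: '##[0]' }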

How to specify which key/value pairs to exclude in spaCy's Doc.to_disk(path, exclude=['user_data'])?

My nlp pipeline has some doc extensions that store 3 items (a string for the file name and two dicts which map non-serializable objects). I'd like to exclude only the non-serializable key/value pairs in the user data, but keep the filename.
doc.to_disk(path, exclude=['user_data'])
works as expected, excluding all user data. There are apparently options to instead exclude either 'user_data_keys' or 'user_data_values' but I find no explanation of their usage, and furthermore I can't think of any good reason to store either all the keys without the values or all the values without the keys!
I would like to exclude both keys and values of only certain fields in the doc.user_data. If this is possible, how is it done?
You will need to specify which keys or values you want to exclude.
https://spacy.io/api/doc#serialization-fields
data = doc.to_bytes(exclude=["text", "tensor"])
doc.from_disk("./doc.bin", exclude=["user_data"])
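If the goal is to keep the filename but drop the two non-serializable dicts, one hedged workaround (not a built-in spaCy option) is to filter doc.user_data before saving; pickling below is just a stand-in test for "serializable", so swap in whatever check matches the objects your pipeline actually stores:

import pickle

def strip_unserializable_user_data(doc):
    # keep only user_data entries whose values survive pickling
    clean = {}
    for key, value in doc.user_data.items():
        try:
            pickle.dumps(value)
            clean[key] = value
        except Exception:
            pass
    doc.user_data = clean
    return doc

strip_unserializable_user_data(doc)
doc.to_disk(path)  # no exclude needed; only serializable entries remain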
Per this thread here, you can try the following workaround:
def remove_unserializable_results(doc):
    # clear doc-level user data and null out all custom doc extensions
    doc.user_data = {}
    for x in dir(doc._):
        if x in ['get', 'set', 'has']: continue
        setattr(doc._, x, None)
    # do the same for every token's custom extensions
    for token in doc:
        for x in dir(token._):
            if x in ['get', 'set', 'has']: continue
            setattr(token._, x, None)
    return doc

nlp.add_pipe(remove_unserializable_results, last=True)
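If you are on spaCy v3 (an assumption; the snippet above is v2-style), the function has to be registered as a component and added to the pipeline by name instead:

from spacy.language import Language

@Language.component("remove_unserializable_results")
def remove_unserializable_results(doc):
    doc.user_data = {}  # ...rest of the body as in the v2 function above
    return doc

nlp.add_pipe("remove_unserializable_results", last=True)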

Default values for query parameters

Please forgive me if my question does not make sense.
What I'm trying to do is to inject values for query parameters.
GET1 File
Scenario:
Given path 'search'
And param filter[id] = id (default value or variable from another feature file)
POST1 File
Scenario:
def newid = new id made by a post call
def checkid = read call(GET1) {id : newid}
So if one of my feature files creates a new id, I want to do a GET call with the above scenario; therefore I need a parameter there which takes in the new id.
On the other hand, if I do not have a newly created id, or the test creating it is not part of the suite, I still want to be able to run the above scenario, but this time with a default value.
Instead of param use params. It is designed so that any keys with null values are ignored.
After the null is set on the first line below, you can make a call to another feature, and overwrite the value of criteria. If it still is null, no params will be set.
* def criteria = null
Given path 'search'
And params { filter: '#(criteria)' }
There are multiple other ways to do this, also refer to this set of examples for data-driven search params: dynamic-params.feature
The doc on conditional logic may also give you some ideas.
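For the default-value part, one hedged option (the variable name newid and the fallback string below are made-up examples) is karate.get, which returns a fallback when the variable does not exist:

* def id = karate.get('newid', 'default-id')
Given path 'search'
And params { filter: '#(id)' }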

Pandas HDF5 Select with Where on non natural-named columns

In my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-naturally named columns (NB: for a good reason, with negative numbers being "system" ids etc.), which normally doesn't give an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
However, my select statement does fall over it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my negative-numbered columns from "a-1" to "a_1", but that means reloading all of the data in my system, which is rather a lot! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
a-6
0 1
1 2
2 3
3 NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
There is a way to do this, but it would have to be built into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master):
def select(self):
    """
    generate the selection
    """
    if self.condition is not None:
        return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
    elif self.coordinates is not None:
        return self.table.table.readCoordinates(self.coordinates)
    return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
condvars={ 'x' : getattr(self.table.table.cols,'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by substituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.
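If reloading is an option after all, a minimal sketch of the rename route (the column and file names are just the ones from the example above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a-6': [1, 2, 3, np.nan]})
# make the column a valid Python identifier so `where=` can reference it
df = df.rename(columns=lambda c: c.replace('-', '_'))
df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

# the where clause now works without a NameError
subset = pd.read_hdf('test.h5', 'df', where='a_6 = 2')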

Rails postgres hstore - how to omit a nil input value from hash

I have a form that displays inputs based on user preferences. I am storing the values as an hstore hash since I don't know ahead of time exactly what the form input for each user will be. The problem I am running into is that even though a user has an input preference, it doesn't mean they have to enter a value for it each time, which can result in :foo => "".
All the doc examples show you how to find records when you know the key name. In my case, I don't know the key name... I need to find all the keys in a hash whose value => "".
Then, I should be able to do something like the docs show... for each empty value
person.destroy_key(:data, :foo).destroy_key(:data, :bar).save
avals(hstore) is likely what I need to use... How do you use avals with Rails?
Since hstore is just a hash in Rails, you just need to evaluate the hash before saving it.
...in model
before_save :remove_blanks

private

def remove_blanks
  self.hstore = self.hstore.reject { |k, v| v.blank? }
end
Replace 'hstore' with your hstore column name.
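A quick sketch of how that looks in context (the model and column names are assumptions based on the question's person.destroy_key(:data, :foo) example):

class Person < ActiveRecord::Base
  before_save :remove_blanks

  private

  # drop any hstore entries whose value is blank before the record is saved
  def remove_blanks
    self.data = data.reject { |_k, v| v.blank? } if data.present?
  end
end

# Person.create(data: { 'foo' => '', 'bar' => '1' }).data
# #=> { 'bar' => '1' }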