How to specify which key/value pairs to exclude in spaCy's Doc.to_disk(path, exclude=['user_data'])?

My nlp pipeline has some doc extensions that store three items (a string for the file name and two dicts that map non-serializable objects). I'd like to exclude only the non-serializable key/value pairs in the user data, but keep the filename.
doc.to_disk(path, exclude=['user_data'])
works as expected, excluding all user data. There are apparently options to instead exclude either 'user_data_keys' or 'user_data_values' but I find no explanation of their usage, and furthermore I can't think of any good reason to store either all the keys without the values or all the values without the keys!
I would like to exclude both keys and values of only certain fields in the doc.user_data. If this is possible, how is it done?

You will need to specify which keys or values you want to exclude.
https://spacy.io/api/doc#serialization-fields
data = doc.to_bytes(exclude=["text", "tensor"])
doc.from_disk("./doc.bin", exclude=["user_data"])
Per this thread, you can try the following workaround:
def remove_unserializable_results(doc):
    doc.user_data = {}
    for x in dir(doc._):
        if x in ['get', 'set', 'has']: continue
        setattr(doc._, x, None)
    for token in doc:
        for x in dir(token._):
            if x in ['get', 'set', 'has']: continue
            setattr(token._, x, None)
    return doc

nlp.add_pipe(remove_unserializable_results, last=True)
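If the goal is to keep the filename but drop only the two non-serializable dicts, a more targeted variant of the above can filter doc.user_data by attribute name before serialization. This is a minimal sketch, assuming the nlp pipeline from the question and using obj_map_a and obj_map_b as hypothetical names for the two extensions holding non-serializable objects; spaCy stores extension values in user_data under tuple keys of the form ("._.", attr_name, start, end), so matching on the attribute name leaves the filename entry intact:

def drop_nonserializable(doc):
    # Hypothetical names for the two extensions holding non-serializable dicts
    drop_attrs = {'obj_map_a', 'obj_map_b'}
    doc.user_data = {
        k: v for k, v in doc.user_data.items()
        if not (isinstance(k, tuple) and len(k) > 1 and k[1] in drop_attrs)
    }
    return doc

nlp.add_pipe(drop_nonserializable, last=True)  # then doc.to_disk(path) keeps only the filename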

Related

Nextflow: split or subset a channel containing tuples

I have joined two channels as a way of filtering out items that do not have all the necessary files. The resulting items of the joined channel look like:
[sample1, [sample1.csv], [sample1_1.fastq, sample1_2.fastq]]
I now wish to remove the csv entry so that the items of the channel have the form:
[sample1, [sample1_1.fastq, sample1_2.fastq]]
for use in existing downstream processes.
I've been looking at multiMap and branch but can't seem to find anything that does what I want. What am I missing?
You can use the map operator for this:
joined_ch
    .map { sample, csv_files, fastq_files ->
        tuple( sample, fastq_files )
    }
    .view()

How to show Feature Names in Graphviz?

I'm building a tree in Graphviz and I can't seem to get the feature names to show up. I have defined a list with the feature names like so:
names = list(df.columns.values)
Which prints:
['Gender',
'SuperStrength',
'Mask',
'Cape',
'Tie',
'Bald',
'Pointy Ears',
'Smokes']
So the list is being created. Later, I build the tree like so:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=names)
But the final image still has the feature names listed generically, like X[3].
How can I get the actual feature names to show up? (Cape instead of X[3], etc.)
I can only imagine this has to do with passing the names as an array of the values. It works fine if you pass the columns directly:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=df.columns)
If needed, you can also slice the columns:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=df.columns[5:])
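For a self-contained check, here is a minimal sketch with made-up data (the DataFrame, target, and classifier below are hypothetical) showing that passing the columns as feature_names makes the node labels use the real names:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

df = pd.DataFrame({'Cape': [1, 0, 1, 0], 'Mask': [1, 1, 0, 0]})
y = [1, 1, 0, 0]

tree = DecisionTreeClassifier().fit(df, y)

# out_file=None returns the DOT source as a string instead of writing to a file
dot = export_graphviz(tree, out_file=None, filled=True, rounded=True,
                      impurity=False, feature_names=df.columns)
print(dot)  # node labels now read e.g. "Cape <= 0.5" rather than "X[0] <= 0.5"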

How to get name of streams in MinibatchSource?

How can I get the names of each of the streams in a MinibatchSource?
Can I get the names associated with the stream information returned by stream_infos?
minibatch_source.stream_infos()
I also have a follow-up question:
The result from:
print(reader_train.streams.keys())
is
dict_keys(['labels', 'features'])
How do these names relate to the construction of the MinibatchSource, which is done like this?
return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
    features = StreamDef(field='image', transforms=transforms), # first column in map file is referred to as 'image'
    labels = StreamDef(field='label', shape=num_classes)        # and second as 'label'
)))
I would have thought that my streams would be named ‘image’ and ‘label’, but they were named ‘labels’ and ‘features’.
I guess those names are somehow default names?
For your original question:
minibatch_source.streams.keys()
See for example this tutorial under the section "A Brief Look at Data and Data Reading".
For your follow-up question: the names returned by keys() are the keyword argument names used in StreamDefs(). This is all you need in your program. If you define your MinibatchSource like this:
return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
    image = StreamDef(field='image', transforms=transforms), # first column in map file is referred to as 'image'
    label = StreamDef(field='label', shape=num_classes)      # and second as 'label'
)))
then the names will match. You can choose any names you want but the value of the field inside StreamDef() should match the source (which depends on your input data and the Deserializer you are using).
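As a usage note (a hedged sketch, where input_var and label_var stand in for your network's input variables): whichever names you choose in StreamDefs are the ones you later reference when wiring the reader into training, via attribute access on .streams:

# The StreamDefs argument names become attributes on minibatch_source.streams
input_map = {
    input_var: minibatch_source.streams.image,   # or .features with the original naming
    label_var: minibatch_source.streams.label,   # or .labels
}
data = minibatch_source.next_minibatch(64, input_map=input_map)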

Pandas HDF5 Select with Where on non natural-named columns

In my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-natural-named columns (NB: for a good reason, with negative numbers being "system" ids, etc.), which normally doesn't give an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
However, my select statement with a where clause does fall over it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my negative values from "a-1" to "a_1", but that means reloading all of the data in my system, which is rather a lot! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
   a-6
0    1
1    2
2    3
3  NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
There is a way to do this, but it would have to be built into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master):
def select(self):
    """
    generate the selection
    """
    if self.condition is not None:
        return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
    elif self.coordinates is not None:
        return self.table.table.readCoordinates(self.coordinates)
    return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
condvars={ 'x' : getattr(self.table.table.cols,'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by substituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.
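If a one-time reload is acceptable after all, the rename route is straightforward. Here is a minimal sketch (using the modern format='table' spelling rather than the older table=True shown above) that sanitizes the column names once on write, after which where clauses work normally:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a-6': [1, 2, 3, np.nan]})

# Turn the columns into valid Python identifiers so they can be used in `where`
df = df.rename(columns=lambda c: c.replace('-', '_'))

df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)
print(pd.read_hdf('test.h5', 'df', where='a_6 == 2'))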

Redis sorted sets and best way to store uids

I have data consisting of user_ids and tags of these user ids.
The user_ids occur multiple times and have a pre-specified number of tags (500), though that might change in the future. What must be stored is the user_id, their tags, and their counts.
I want to be able to easily find the tags with the top scores later, etc. Every time a tag appears, its count is incremented.
My implementation in Redis uses sorted sets:
every user_id is a sorted set
the key is the user_id, which is a hex number
It works like this:
zincrby user_id:x 1 "tag0"
zincrby user_id:x 1 "tag499"
zincrby user_id:y 1 "tag3"
and so on
Keeping in mind that I want to get the tags with the highest scores, is there a better way?
The second issue is that right now I'm using "keys *" to retrieve these keys for client-side manipulation, which I know is not aimed at production systems.
Plus, it would help with memory problems to iterate through a specified number of keys at a time (in the range of 10,000). I know that the keys have to be stored in memory; however, they don't follow a specific pattern that would allow partial retrieval so that I could avoid the "zmalloc" error (4 GB, 64-bit Debian server).
The keys amount to around 20 million.
Any thoughts?
My first point would be to note that 4 GB is tight to store 20M sorted sets. A quick try shows that 20M users, each of them with 20 tags, would take about 8 GB on a 64-bit box (and this accounts for the sorted set ziplist memory optimizations provided with Redis 2.4 - don't even try this with earlier versions).
Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.
As you pointed out, KEYS cannot be used to iterate on keys. It is rather meant as a debug command. To support key iteration, you need to add a data structure to provide this access path. The only structures in Redis which can support iteration are the list and the sorted set (through the range methods). However, they tend to transform O(n) iteration algorithms into O(n^2) (for list), or O(nlogn) (for zset). A list is also a poor choice to store keys since it will be difficult to maintain it as keys are added/removed.
A more efficient solution is to add an index composed of regular sets. You need to use a hash function to associate a specific user with a bucket, and add the user id to the set corresponding to this bucket. If the user ids are numeric values, a simple modulo function will be enough. If they are not, a simple string hashing function will do the trick.
So to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0 while user:1001 will be put in bucket index:1.
So on top of the zsets, we now have the following keys:
index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]
In the sets, the prefix of the keys is not needed, and it allows Redis to optimize the memory consumption by serializing the sets provided they are kept small enough (integer sets optimization proposed by Sripathi Krishnan).
The global iteration consists of a simple loop on the buckets from 0 to 1000 (excluded). For each bucket, the SMEMBERS command is applied to retrieve the corresponding set, and the client can then iterate on the individual items.
Here is an example in Python:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ----------------------------------------------------
import redis, random

POOL = redis.ConnectionPool(host='localhost', port=6379, db=0)

NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000

# ----------------------------------------------------
# Fill redis with some random data
def fill(r):
    p = r.pipeline()
    # Create only 10000 users for this example
    for id in range(0, NUSERS):
        user = "user:%d" % id
        # Add the user in the index: a simple modulo is used to hash the user id
        # and put it in the correct bucket
        p.sadd("index:%d" % (id % NBUCKETS), id)
        # Add random tags to the user
        for x in range(0, 20):
            tag = "tag:%d" % (random.randint(0, NTAGS))
            p.zincrby(user, tag, 1)
        # Flush the pipeline every 1000 users
        if id % 1000 == 0:
            p.execute()
            print id
    # Flush one last time
    p.execute()

# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags
def iterate(r):
    # Iterate on the buckets of the key index
    # The range depends on the function used to hash the user id
    for x in range(0, NBUCKETS):
        # Iterate on the users in this bucket
        for id in r.smembers("index:%d" % (x)):
            user = "user:%d" % int(id)
            print user, r.zrevrangebyscore(user, "+inf", "-inf", 0, 5, True)

# ----------------------------------------------------
# Main function
def main():
    r = redis.Redis(connection_pool=POOL)
    r.flushall()
    m = r.info()["used_memory"]
    fill(r)
    info = r.info()
    print "Keys: ", info["db0"]["keys"]
    print "Memory: ", info["used_memory"] - m
    iterate(r)

# ----------------------------------------------------
main()
By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.
IMO this strategy is simple and efficient, because it offers O(1) complexity to add/remove users, and true O(n) complexity to iterate on all items. The only downside is the key iteration order is random.
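As an aside that postdates this answer: Redis 2.8 introduced the cursor-based SCAN command (exposed as scan_iter in redis-py), which lets a client walk the keyspace incrementally without maintaining an index. The bucket index above remains useful on older servers, or when you need O(1) membership updates and deterministic bucketing. A minimal sketch:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Walk the user keys incrementally, hinting ~1000 keys per round trip,
# without blocking the server the way KEYS * would.
for key in r.scan_iter(match='user:*', count=1000):
    print(key)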