find dtype of root attributes in hdf5 file - numpy

I am attempting to determine the dtype of the root-level attributes of an existing HDF5 file using h5py. I have a crude solution that works, but I find it unattractive and hope there is a better way. The file's root-level attributes are the ones shown by the HDF5View program.
I need to know, for example, that the attribute CHS Data Format has type 'string, length = 9' or that Save Point ID is 64-bit floating-point, so that I can transform the values properly in code. I can get this information in a brute-force manner, but I am hoping there is a cleaner way.
hf = h5py.File(hdf5_filename, 'r')
for k in hf.attrs.keys():
    print(k, hf.attrs[k], type(hf.attrs[k]), hf.attrs[k].dtype)
which yields:
CHS Data Format b'Version_1' <class 'numpy.bytes_'> |S9
Grid Name b'SMA' <class 'numpy.bytes_'> |S3
Latitude Units b'deg' <class 'numpy.bytes_'> |S3
Longitude Units b'deg' <class 'numpy.bytes_'> |S3
Project b'USACE_NACCS' <class 'numpy.bytes_'> |S11
Region b'Virginia_to_Maine' <class 'numpy.bytes_'> |S17
Save Point ID 1488.0 <class 'numpy.float64'> float64
Save Point Latitude b'41.811900' <class 'numpy.bytes_'> |S9
Save Point Longitude b'-71.398700' <class 'numpy.bytes_'> |S10
Vertical Datum b'MSL' <class 'numpy.bytes_'> |S3
This gives me the information I need, but it requires parsing the output, and I would also like the additional detail (such as string length) that HDF5View displays.
Even though this is likely not the best/clearest solution, I am posting in case it is of some assistance to others trying to achieve this same goal.
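A somewhat cleaner sketch, assuming h5py's low-level AttributeManager.get_id() handle exposes the stored dtype (the printed wording is my own, not HDF5View's):
import h5py

with h5py.File(hdf5_filename, 'r') as hf:
    for name in hf.attrs:
        dt = hf.attrs.get_id(name).dtype   # dtype as stored in the file
        if dt.kind == 'S':
            print(f"{name}: fixed-length string, length={dt.itemsize}")
        else:
            print(f"{name}: {dt}")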

Is there a way to get tree data as a list with the LightGBM Classifier

In random-forest-type models, there is usually an attribute like "estimators" which returns all the tree splits as a list of lists.
I can't seem to find something similar with lightgbm. The closest I can come is lgb.plot_tree, which gives a nice visualization of a single tree, but I would like to use the data shown in the visualization in variables.
How can I get at this data?
There's not something exactly the same in LightGBM. But you could use the dump_model or trees_to_dataframe methods of the booster_ attribute of the scikit-learn estimator, i.e.
import lightgbm as lgb

clf = lgb.LGBMClassifier().fit(X, y)
clf.booster_.dump_model()
clf.booster_.trees_to_dataframe()
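For reference, a minimal runnable sketch; the dataset and the selected columns are only illustrative, and column names may vary slightly across LightGBM versions:
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = lgb.LGBMClassifier(n_estimators=5).fit(X, y)

tree_df = clf.booster_.trees_to_dataframe()        # one row per node across all trees
print(tree_df[['tree_index', 'node_index', 'split_feature', 'threshold']].head())

model_dict = clf.booster_.dump_model()             # nested dict; one entry per tree
print(len(model_dict['tree_info']))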

Correct annotation to train spaCy's NER

I'm having some trouble finding the right way to annotate my data. I'm dealing with laboratory-test-related texts and I am using the following labels:
1) Test specification (e.g. voltage, length, ...)
2) Test object (e.g. battery, steel beam, ...)
3) Test value (e.g. 5 V; 5 m, ...)
Let's take this example sentence:
The battery voltage should be 5 V.
I would annotate this sentence like this:
The
battery voltage (test specification)
should
be
5 V (Test value)
.
However, if the sentence looks like this:
The voltage of the battery should be 5 V.
I would use the following annotation:
The
voltage (Test specification)
of
the
battery (Test object)
should
be
5 V (Test value)
.
Is anyone experienced in annotating data who can explain whether this is the right way? Or, in the first example, should I use the Test object label for battery as well? Or, in the second example, should I combine the labels and annotate voltage of the battery as Test specification?
I am annotating the data to perform information extraction.
Thanks for any help!
All of your examples are unusual annotation formats. The typical way to tag NER data (in text) is to use an IOB/BILOU format, where each token is on one line, the file is a TSV, and one of the columns is a label. So for your data it would look like:
The
voltage U-SPEC
of
the
battery U-OBJ
should
be
5 B-VALUE
V L-VALUE
.
Pretend that is TSV, and I have omitted O tags, which are used for "other" items.
You can find documentation of these schemes in the spaCy docs.
If you already have data in the format you provided, or you find it easier to produce it that way, it should at least be easy to convert. For training NER, spaCy requires the data to be provided in a particular format (see the docs for details), but basically you need the input text, character spans, and the labels of those spans. Here's example data:
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
This format is trickier to produce manually than the above TSV type format, so generally you would produce the TSV-like format, possibly using a tool, and then convert it.
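A rough conversion sketch, assuming spaCy 3.x where the helper is spacy.training.biluo_tags_to_offsets (in spaCy 2.x it was spacy.gold.offsets_from_biluo_tags); the Doc is built directly from the token list so the tags and tokens stay aligned:
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.training import biluo_tags_to_offsets

words  = ["The", "voltage", "of", "the", "battery", "should", "be", "5", "V", "."]
spaces = [True, True, True, True, True, True, True, True, False, False]
tags   = ["O", "U-SPEC", "O", "O", "U-OBJ", "O", "O", "B-VALUE", "L-VALUE", "O"]

doc = Doc(Vocab(), words=words, spaces=spaces)
entities = biluo_tags_to_offsets(doc, tags)
# e.g. [(4, 11, 'SPEC'), (19, 26, 'OBJ'), (37, 40, 'VALUE')]
print((doc.text, {"entities": entities}))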
The main rule for correctly annotating entities is to be consistent (i.e. you always apply the same rules when deciding which entity is what). I can see you already have some rules for when battery voltage should be considered a test object or a test specification.
Apply those rules consistently and you'll be ok.
Have a look at spacy-annotator.
It is a library that helps you annotate data in the way you want.
Example:
import pandas as pd
import re
from spacy_annotator.pandas_annotations import annotate as pd_annotate
# Data
df = pd.DataFrame.from_dict({'full_text': ['The battery voltage should be 5 V.', 'The voltage of the battery should be 5 V.']})
# Annotations
pd_dd = pd_annotate(df,
                    col_text='full_text',  # column in the pandas dataframe containing the text to be labelled
                    labels=['test_specification', 'test object', 'test_value'],  # list of labels
                    sample_size=1,         # size of the sample to be labelled
                    delimiter=',',         # delimiter to separate entities in the GUI
                    model=None,            # spaCy model for noisy pre-labelling
                    regex_flags=re.IGNORECASE)  # one (or more) regex flags applied when searching for entities in text
# Example output
pd_dd['annotations'][0]
The code will show you a user interface you can use to annotate each relevant entity.

Can we Ignore unnecessary classes in the Tensorflow object detection API by only omitting labels in pbtxt label map file?

So I am trying to create custom datasets for object detection using the TensorFlow Object Detection API. When working with open-source datasets, the annotation files I have come across are PASCAL VOC XMLs or JSONs. These contain a label for each object, for example:
<annotation>
  <folder>open_images_volume</folder>
  <filename>0d2471ff2d033ccd.jpg</filename>
  <path>/mnt/open_images_volume/0d2471ff2d033ccd.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>1024</width>
    <height>1024</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>Chair</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>8</xmin>
      <ymin>670</ymin>
      <xmax>409</xmax>
      <ymax>1020</ymax>
    </bndbox>
  </object>
  <object>
    <name>Table</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>232</xmin>
      <ymin>654</ymin>
      <xmax>555</xmax>
      <ymax>1020</ymax>
    </bndbox>
  </object>
</annotation>
Here the annotation file describes two classes, Table and Chair. I am only interested in detecting chairs, which is why the pbtxt file I have generated is simply:
item {
  id: 1
  display_name: "Chair"
}
My question is: will the model train only on the annotations of the class "Chair", because that's what I have defined via label_map.pbtxt, or do I need to manually scrape all the annotation files and remove the bounding-box coordinates through regex or xmltree to make sure the additional bounding boxes do not interfere with training? In other words, is it possible to select only certain classes for training through the TF API even if the annotation files contain additional classes, or is it necessary to clean up the entire dataset and manually remove the unnecessary class labels? Will it affect training in any way?
You can use a .pbtxt that only has the classes that you need to train on, and you don't have to change the XMLs.
Also, make sure to set num_classes accordingly in your pipeline config (num_classes: your_num_classes).
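For example, with an SSD-based pipeline config the relevant field sits at the top of the model block (shown here for the single Chair class; the surrounding fields are elided):
model {
  ssd {
    num_classes: 1
    ...
  }
}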

Comparing the structures of two graphs

Is there a way in TensorFlow to find out if two graphs have the same structure?
I am designing an abstract class whose individual instances are expected to represent different architectures. I have provided an abc.abstractmethod get() which defines the graph. However, I also want to be able to load a pre-trained graph from disk. I want to check if the pre-trained graph has the same definition as the one mentioned in the get() method of a concrete class.
How may I achieve this structural comparison?
You can get the graph definition of the current graph as str(tf.get_default_graph().as_graph_def()) and compare it for exact equality against your previous result.
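A minimal sketch of that string comparison (graphs are built explicitly here rather than via the default graph, and the two toy graphs just stand in for yours):
import tensorflow as tf

with tf.Graph().as_default() as g1:
    tf.constant(1.0, name="a")

with tf.Graph().as_default() as g2:
    tf.constant(1.0, name="a")

# True only if the serialized GraphDefs match exactly
print(str(g1.as_graph_def()) == str(g2.as_graph_def()))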
Also, TensorFlow's tests have a more advanced function, EqualGraphDef, which can tell that two graphs are equal even when the graph format has changed. I.e., if actual and expected are GraphDef proto objects, you could do:
from tensorflow.python import pywrap_tensorflow

diff = pywrap_tensorflow.EqualGraphDefWrapper(actual.SerializeToString(),
                                              expected.SerializeToString())
assert not diff

How to serialize a graph structure?

Flat files and relational databases give us a mechanism to serialize structured data. XML is superb for serializing un-structured tree-like data.
But many problems are best represented by graphs. A thermal simulation program will, for instance, work with temperature nodes connected to each others through resistive edges.
So what is the best way to serialize a graph structure? I know XML can, to some extent, do it---in the same way that a relational database can serialize a complex web of objects: it usually works but can easily get ugly.
I know about the dot language used by the graphviz program, but I'm not sure this is the best way to do it. This question is probably the sort of thing academia might be working on and I'd love to have references to any papers discussing this.
How do you represent your graph in memory?
Basically you have two (good) options:
an adjacency list representation
an adjacency matrix representation
where the adjacency-list representation is best suited to sparse graphs and the matrix representation to dense graphs.
If you use such a representation, then you can serialize that representation instead.
If it has to be human readable you could still opt for creating your own serialization algorithm. For example you could write down the matrix representation like you would do with any "normal" matrix: just print out the columns and rows, and all the data in it like so:
1 2 3
1 #t #f #f
2 #f #f #t
3 #f #t #f
(this is a non-optimized, unweighted representation, but it can be used for directed graphs)
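A small Python sketch of the adjacency-list variant of this idea (the one-line-per-node text format is made up for illustration):
# Hypothetical directed graph as an adjacency list
graph = {0: [1, 2], 1: [2], 2: []}

def dump(graph):
    # One line per node: "node: neighbour neighbour ..."
    return "\n".join(f"{node}: {' '.join(map(str, nbrs))}"
                     for node, nbrs in sorted(graph.items()))

def load(text):
    result = {}
    for line in text.splitlines():
        node, _, rest = line.partition(":")
        result[int(node)] = [int(n) for n in rest.split()]
    return result

assert load(dump(graph)) == graph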
Typically relationships in XML are shown by the parent/child relationship. XML can handle graph data but not in this manner. To handle graphs in XML you should use the xs:ID and xs:IDREF schema types.
As an example, assume that node/@id is of type xs:ID and that link/@ref is of type xs:IDREF. The following XML shows a cycle of three nodes: 1 -> 2 -> 3 -> 1.
<data>
  <node id="1">
    <link ref="2"/>
  </node>
  <node id="2">
    <link ref="3"/>
  </node>
  <node id="3">
    <link ref="1"/>
  </node>
</data>
Many development tools have support for ID and IDREF too. I have used Java's JAXB (Java XML Binding), which supports these through the @XmlID and @XmlIDREF annotations. You can build your graph using plain Java objects and then use JAXB to handle the actual serialization to XML.
XML is very verbose. Whenever I do it, I roll my own. Here's an example of a 3-node directed acyclic graph. It's pretty compact and does everything I need it to do:
0: foo
1: bar
2: bat
----
0 1
0 2
1 2
Adjacency lists and adjacency matrices are the two common ways of representing graphs in memory. The first decision you need to make when deciding between these two is what you want to optimize for. Adjacency lists are very fast if you need to, for example, get the list of a vertex's neighbors. On the other hand, if you are doing a lot of testing for edge existence or have a graph representation of a Markov chain, then you'd probably favor an adjacency matrix.
The next question you need to consider is how much you need to fit into memory. In most cases, where the number of edges in the graph is much much smaller than the total number of possible edges, an adjacency list is going to be more efficient, since you only need to store the edges that actually exist. A happy medium is to represent the adjacency matrix in compressed sparse row format in which you keep a vector of the non-zero entries from top left to bottom right, a corresponding vector indicating which columns the non-zero entries can be found in, and a third vector indicating the start of each row in the column-entry vector.
[[0.0, 0.0, 0.3, 0.1]
[0.1, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0]
[0.5, 0.2, 0.0, 0.3]]
can be represented as:
vals: [0.3, 0.1, 0.1, 0.5, 0.2, 0.3]
cols: [2, 3, 0, 0, 1, 3]
rows: [0, 2, null, 3]
Compressed sparse row is effectively an adjacency list (the column indices function the same way), but the format lends itself a bit more cleanly to matrix operations.
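If SciPy is available, you can sanity-check the layout directly; note that SciPy's indptr is the standard row-pointer form of length n+1, slightly different from the per-row start vector above:
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0.0, 0.0, 0.3, 0.1],
                  [0.1, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.2, 0.0, 0.3]])
sparse = csr_matrix(dense)
print(sparse.data)     # [0.3 0.1 0.1 0.5 0.2 0.3]
print(sparse.indices)  # [2 3 0 0 1 3]
print(sparse.indptr)   # [0 2 3 3 6]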
One example you might be familiar with is Java serialization. This effectively serializes the object graph, with each object instance being a node and each reference being an edge. The algorithm used is recursive but skips duplicates, so the pseudocode would be:
serialize(x):
    done - a set of already-serialized objects
    if serialized(x, done) then return
    otherwise:
        record properties of x
        record x as serialized in done
        for each neighbour/child of x: serialize(child)
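A rough Python rendering of that idea (the children attribute is a hypothetical stand-in for however your objects expose their neighbours):
def serialize(x, done=None, nodes=None, edges=None):
    if done is None:
        done, nodes, edges = set(), [], []
    if id(x) in done:                     # already serialized, skip
        return nodes, edges
    done.add(id(x))
    nodes.append((id(x), repr(x)))        # record the node's properties
    for child in getattr(x, "children", []):
        edges.append((id(x), id(child)))  # record the edge
        serialize(child, done, nodes, edges)
    return nodes, edges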
Another way of course is as a list of nodes and edges, which can be done as XML, or in any other preferred serialization format, or as an adjacency matrix.
On a less academic, more practical note: in CubicTest we use XStream (Java) to serialize tests to and from XML. XStream handles graph-structured object relations, so you might learn a thing or two from looking at its source and the resulting XML. You're right about the ugly part, though; the generated XML files don't look pretty.