What are the advantages of using tf.train.SequenceExample over tf.train.Example for variable length features? - tensorflow

Recently I read this guide on undocumented features in TensorFlow, as I needed to pass variable length sequences as input. However, I found the protocol for tf.train.SequenceExample relatively confusing (especially due to the lack of documentation), and managed to build an input pipeline using tf.train.Example just fine instead.
Are there any advantages to using tf.train.SequenceExample? Using the standard example protocol when there is a dedicated one for variable length sequences seems like a cheat, but does it carry any consequences?

Here are the definitions of the Example and SequenceExample protocol buffers, and all the protos they may contain:
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
  Features context = 1;
  FeatureLists feature_lists = 2;
};
An Example contains a Features, which contains a mapping from feature name to Feature, which contains either a bytes list, a float list, or an int64 list.
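For instance, a flat Example could be built like this (a minimal sketch; the "words" feature name is just an illustration):
from tensorflow.train import BytesList, Feature, Features, Example

# a single named feature holding a flat list of byte strings
words = Feature(bytes_list=BytesList(value=[b"hello", b"world"]))
example = Example(features=Features(feature={"words": words}))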
A SequenceExample also contains a Features, but it also contains a FeatureLists, which contains a mapping from list name to FeatureList, which contains a list of Feature. So it can do everything an Example can do, and more. But do you really need that extra functionality? What does it do?
Since each Feature contains a list of values, a FeatureList is a list of lists. And that's the key: if you need lists of lists of values, then you need SequenceExample.
For example, if you handle text, you can represent it as one big string:
from tensorflow.train import BytesList
BytesList(value=[b"This is the first sentence. And here's another."])
Or you could represent it as a list of words and tokens:
BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b".", b"And", b"here",
b"'s", b"another", b"."])
Or you could represent each sentence separately. That's where you would need a list of lists:
from tensorflow.train import BytesList, Feature, FeatureList
s1 = BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b"."])
s2 = BytesList(value=[b"And", b"here", b"'s", b"another", b"."])
fl = FeatureList(feature=[Feature(bytes_list=s1), Feature(bytes_list=s2)])
Then create the SequenceExample:
from tensorflow.train import SequenceExample, FeatureLists
seq = SequenceExample(feature_lists=FeatureLists(feature_list={
    "sentences": fl
}))
And you can serialize it and perhaps save it to a TFRecord file.
data = seq.SerializeToString()
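For instance, saving it to a TFRecord file could look like this (a sketch; the filename is just an illustration):
import tensorflow as tf

# write the serialized SequenceExample as a single record
with tf.io.TFRecordWriter("sentences.tfrecord") as writer:
    writer.write(data)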
Later, when you read the data, you can parse it using tf.io.parse_single_sequence_example().
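A minimal parsing sketch (assuming the "sentences" key used above; each row of the resulting SparseTensor is one sentence):
import tensorflow as tf

context, sequences = tf.io.parse_single_sequence_example(
    data,
    sequence_features={
        # one entry per feature list; VarLenFeature allows ragged rows
        "sentences": tf.io.VarLenFeature(tf.string),
    })
sentences = tf.sparse.to_dense(sequences["sentences"], default_value=b"")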

The link you provided lists some benefits. You can see how parse_single_sequence_example is used here: https://github.com/tensorflow/magenta/blob/master/magenta/common/sequence_example_lib.py
If you managed to get the data into your model with Example, it should be fine. SequenceExample just gives a little more structure to your data and some utilities for working with it.

Related

Transcoding HTTP to gRPC: Same endpoint with different parameters

I already have a working gRPC project. I'm looking to build an API so that I can make some HTTP requests to it.
I have the following 2 types:
message FindRequest {
  ModelType model_type = 1;
  oneof by {
    string id = 2;
    string name = 3;
  }
}
message GetAllRequest {
  ModelType model_type = 1;
  int32 page_size = 2;
  oneof paging {
    int32 page = 3;
    bool skip_paging = 4;
  }
}
And then, I would like to have those 2 endpoints:
// Get a data set by ID or name. Returns an empty data set if there is no such
// data set in the database. If there are multiple data sets with the same
// name in the database it returns the first of these data sets.
rpc Find(FindRequest) returns (DataSet) {
  option (google.api.http) = { get: "/datasets" };
}
// Get (a page of) all data sets of a given type. If no page size is given
// (page_size <= 0) it defaults to 100. An unset page (page <= 0) defaults to
// the first page.
rpc GetAll(GetAllRequest) returns (GetAllResponse) {
  option (google.api.http) = { get: "/datasets" };
}
It makes sense to me to have 2 different endpoints with the same name, but that differ in their parameters.
For instance, requesting /datasets?model-type=XXX should be mapped to the GetAll function, while requesting /datasets?model-type=XXX&name=YYY should be mapped to the Find function.
However, it doesn't work; I guess the mapping fails, so neither of these endpoints returns me anything.
I think the solution to make the mapping work would be to force the parameter to be required; however, I am working with proto3, which disallows the required field rule.
So how can I have 2 endpoints with the same name, but different parameters, with proto3?
I know that it works if I use different endpoint names; for example, for the FindRequest I could have the endpoint /findDatasets. But regarding best practice for API naming conventions, that is not advisable, or is it?
The conventional way to solve this problem is to use different methods. My hunch is that it's an anti-pattern to try to differentiate using the fields in the request string.
service YourService {
  rpc FindSomething(FindSomethingRequest) returns (FindSomethingResponse) {
    option (google.api.http) = { get: "/something/find" };
  }
  rpc ListSomething(ListSomethingRequest) returns (ListSomethingResponse) {
    option (google.api.http) = { get: "/something/list" };
  }
}
message FindSomethingRequest {
  ModelType model_type = 1;
  string id = 2;
  string name = 3;
}
message ListSomethingRequest {
  int32 page_size = 2;
  string page_token = 3;
}
message ListSomethingResponse {
  repeated ModelType model_types = 1;
  int32 page_size = 2;
  string next_page_token = 3;
}
I'm unsure of your underlying structure, but I think it's better practice to model things with all possible properties and permit leaving some unset (e.g. either id or name, or possibly both, in FindSomethingRequest), rather than creating different message types for all possible queries. You model the thing, not how you interact with it.
In your implementation (!) of FindSomething, you then deal with the permutations of how users of the message may construct the fields, perhaps reporting an error like "Either id or name is required".
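A minimal sketch of that check on the handler side (Python-flavoured; the lookup helpers are hypothetical):
def FindSomething(request):
    # the proto allows both fields to be unset, so the handler
    # enforces the contract and reports a useful error
    if not request.id and not request.name:
        raise ValueError("Either id or name is required")
    if request.id:
        return find_by_id(request.id)      # hypothetical lookup helper
    return find_by_name(request.name)      # hypothetical lookup helper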
I think ListSomething's messages could be simpler too. You request a list of ModelTypes and give a page_size and a page_token (which could be ""). It returns a list of ModelTypes, the size of the page returned (possibly less than requested), and a next_page_token if there is more data, which you can use to make the next ListSomething request; see the sketch below.
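A rough sketch of that loop from the client side (Python-flavoured; the stub and the handle() helper are hypothetical, following the messages sketched above):
page_token = ""  # an empty token requests the first page
while True:
    response = stub.ListSomething(
        ListSomethingRequest(page_size=100, page_token=page_token))
    for model_type in response.model_types:
        handle(model_type)  # hypothetical per-item processing
    if not response.next_page_token:  # no token means no more data
        break
    page_token = response.next_page_token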

How to get all stored values not terms for a field in Lucene.Net?

I saw an example of extracting all available terms for a field here
The reason it doesn't fit my purposes is that terms and stored values are different: e.g. the stored value "black cat" will be represented as the two terms "black" and "cat". In my code I need to extract whole stored values, in this case "black cat".
Yes, you could do that. I'm not a C# programmer, but hopefully you will understand the Java code.
IndexReader reader = DirectoryReader.open(dir);
final int len = reader.maxDoc();
for (int i = 0; i < len; ++i) {
    Document document = reader.document(i);
    List<IndexableField> fields = document.getFields();
    for (IndexableField field : fields) {
        if (field.fieldType().stored()) {
            System.out.println(field.stringValue());
        }
    }
}
So, basically, I'm traversing all docs, getting all their fields, and if a field is stored, printing its data. You could filter by the names of the fields you need.
The full test can be found here: https://raw.githubusercontent.com/MysterionRise/information-retrieval-adventure/master/src/main/java/org/mystic/GetAllStoredFieldValues.java (also with proof that it works correctly).

Why does this documentation example fail? Is my workaround an acceptable equivalent?

While exploring the documented example raised in this perl6 question that was asked here recently, I found that the final implementation option (my interpretation of the example is that it provides three different ways to do something) doesn't work. Running this:
class HTTP::Header does Associative {
    has %!fields handles <iterator list kv keys values>;
    sub normalize-key ($key) { $key.subst(/\w+/, *.tc, :g) }

    method EXISTS-KEY ($key) { %!fields{normalize-key $key}:exists }
    method DELETE-KEY ($key) { %!fields{normalize-key $key}:delete }
    method push (*@_) { %!fields.push: @_ }

    multi method AT-KEY (::?CLASS:D: $key) is rw {
        my $element := %!fields{normalize-key $key};
        Proxy.new(
            FETCH => method () { $element },
            STORE => method ($value) {
                $element = do given $value».split(/',' \s+/).flat {
                    when 1  { .[0] }    # a single value is stored as a string
                    default { .Array }  # multiple values are stored as an array
                }
            }
        );
    }
}
my $header = HTTP::Header.new;
say $header.WHAT; #-> (Header)
$header<Accept> = "text/plain";
$header{'Accept-' X~ <Charset Encoding Language>} = <utf-8 gzip en>;
$header.push('Accept-Language' => "fr"); # like .push on a Hash
say $header<Accept-Language>.perl; #-> $["en", "fr"]
... produces the expected output. Note that the third-last line with the X meta-operator assigns a literal list (built with angle brackets) to a hash slice (given a flexible definition of "hash"). My understanding is this results in three separate calls to method AT-KEY, each with a single string argument (apart from self), and therefore does not exercise the default clause of the given statement. Is that correct?
When I invent a use case that exercises that part of the code, it appears to fail:
... as above ...
$header<Accept> = "text/plain";
$header{'Accept-' X~ <Charset Encoding Language>} = <utf-8 gzip en>;
$header{'Accept-Language'} = "en, fr, cz";
say $header<Accept-Language>.perl; #-> ["en", "fr", "cz"] ??
# outputs
(Header)
This Seq has already been iterated, and its values consumed
(you might solve this by adding .cache on usages of the Seq, or
by assigning the Seq into an array)
in block at ./hhorig.pl line 20
in method <anon> at ./hhorig.pl line 18
in block <unit> at ./hhorig.pl line 32
The error message provides an awesome explanation - the topic is a sequence produced by the split and is now spent and hence can't be referenced in the when and/or default clauses.
Have I correctly "lifted" and implemented the example? Is my invented use case of several language codes in the one string wrong, or is the example code wrong/out-of-date? I say out-of-date as my recollection is that Seqs came along pretty late in the perl6 development process, so perhaps this code used to work but doesn't now. Can anyone clarify/confirm?
Finally, taking the error message into account, the following code appears to solve the problem:
... as above ...
STORE => method ($value) {
    my @values = $value».split(/',' \s+/);
    $element = do given @values.flat {
        when 1  { $value }   # a single value is stored as a string
        default { @values }  # multiple values are stored as an array
    }
}
... but is it an exact equivalent?
That code works now (Rakudo 2018.04) and prints
$["en", "fr", "cz"]
as intended. It was probably a bug which was eventually solved.

How do I loop through each DB field to see if the range is correct

I have this response in soapUI:
<pointsCriteria>
  <calculatorLabel>Have you registered for inContact, signed up for marketing news from FNB/RMB Private Bank, updated your contact details and chosen to receive your statements</calculatorLabel>
  <description>Be registered for inContact, allow us to communicate with you (i.e. update your marketing consent to 'Yes'), receive your statements via email and keep your contact information up to date</description>
  <grades>
    <points>0</points>
    <value>No</value>
  </grades>
  <grades>
    <points>1000</points>
    <value>Yes</value>
  </grades>
  <label>Marketing consent given and Online Contact details updated in last 12 months</label>
  <name>c21_mrktng_cnsnt_cntct_cmb_point</name>
</pointsCriteria>
There are many, many pointsCriteria elements, and I use the XQuery below to give me the DB value and the range that field is meant to be in:
<return>
{
  for $x in //pointsCriteria
  return <DBRange>
    <db>{data($x/name/text())}</db>
    <points>{data($x//points/text())}</points>
  </DBRange>
}
</return>
And I get the below response:
<return><DBRange><db>c21_mrktng_cnsnt_cntct_cmb_point</db><points>0 1000</points></DBRange>
That last bit sits in a property transfer. I need SQL to bring back all rows where that DB field is not in that points range (the field can only be 0 or 1000 in this case). My problem is I don't know how to loop through each DBRange in this manner. Please help.
I'm not sure that I really understand your question; however, I think that you want to query a specific table in your DB, using the column name defined in the <db> field of your XML and the values defined in the <points> field of the same XML.
So you can try using a groovy TestStep: first parse your XML to get back your column name and your points. Since the points values are separated by a blank space, you can split(" ") to get a list and then use each() to iterate over the points in this list. Then, using groovy.sql.Sql, you can perform the queries against your DB.
Only one more thing: you need to put the JDBC drivers for your DB vendor in $SOAPUI_HOME/bin/ext and then restart SoapUI so that it can load the necessary driver classes.
So the following code approach can achieve your goal:
import groovy.sql.Sql
import groovy.util.XmlSlurper

// soapui groovy testStep requires that you first register your
// db vendor drivers; as an example I use oracle drivers...
com.eviware.soapui.support.GroovyUtils.registerJdbcDriver("oracle.jdbc.driver.OracleDriver")

// connection properties for the db (example for an oracle database)
def db = [
    url      : 'jdbc:oracle:thin:@db_host:db_port/db_name',
    username : 'yourUser',
    password : '********',
    driver   : 'oracle.jdbc.driver.OracleDriver'
]

// create the db instance
def sql = Sql.newInstance("${db.url}", "${db.username}", "${db.password}", "${db.driver}")

def result = '''<return>
<DBRange>
<db>c21_mrktng_cnsnt_cntct_cmb_point</db>
<points>0 1000</points>
</DBRange>
</return>'''

def resXml = new XmlSlurper().parseText(result)
// get the field
def field = resXml.DBRange.db.text()
// get the points
def points = resXml.DBRange.points.text()
// points are separated by a blank space,
// so split to get an array with the points
def pointList = points.split(" ")
// for each point make your query
pointList.each {
    def sqlResult = sql.rows "select * from your_table where ${field} = ?", [it]
    log.info sqlResult
}
sql.close()
Hope this helps,
Thanks again for your help @albciff. I had to add this into a multidimensional array (I renamed field to column, and result is a large return from the XQuery above):
def resXml = new XmlSlurper().parseText(result)
// get the columns and points ranges
def Column = resXml.DBRange.db*.text()
def Points = resXml.DBRange.points*.text()

// sorting it all out into a multidimensional array (index per index)
count = 0
bigList = Column.collect {
    [it, Points[count++]]
}

// iterating through the array
bigList.each {
    // creating two smaller lists and making it readable for the sql part later
    def column = it[0]
    def points = it[1]
    // further splitting the points to test each
    pointList = points.split(" ")
    pointList.each {
        // test each points range per column
        def sqlResult = sql.rows "select * from my_table where ${column} <> ?", [it]
        log.info sqlResult
    }
}
sql.close()
return

Proper Way to Retrieve More than 128 Documents with RavenDB

I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a WHERE clause should be an expression tree instead of a Func, so that it's treated as Queryable instead of Enumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
    using (IDocumentSession session = GetRavenSession())
    {
        return session.Query<T>().Where(whereClause).ToList();
    }
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and use them as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
    "Header_ID": 3525880,
    "Sub_ID": "120403261139",
    "TimeStamp": "2012-04-05T15:14:13.9870000",
    "Equipment_ID": "PBG11A-CCM",
    "AverageAbsorber1": "284.451",
    "AverageAbsorber2": "108.442",
    "AverageAbsorber3": "886.523",
    "AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has an "unbounded results API" to allow streaming. The example from the docs shows how to use this:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        User activeUser = enumerator.Current.Document;
    }
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can use it together with Skip(n) to get all documents:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats; // needed for the out parameter below

do
{
    nextGroupOfPoints = session.Query<T>()
        .Statistics(out stats)
        .Where(whereClause)
        .Skip(i * ElementTakeCount + skipResults)
        .Take(ElementTakeCount)
        .ToList();
    i++;
    skipResults += stats.SkippedResults;
    points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);

return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have few calls issued over them.
If you are getting more than 10 of anything from the store (even fewer than the default 128) for human consumption, then something is wrong, or your problem requires different thinking than hauling a truckload of documents out of the data store.
RavenDB indexing is quite sophisticated. A good article about indexing is here, and about facets here.
If you need to perform data aggregation, create a map/reduce index which results in aggregated data, e.g.:
Index:
// map
from post in docs.Posts
select new { post.Author, Count = 1 }

// reduce
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count").Where(x => x.Author == "someAuthor").ToList(); // "someAuthor" is an illustrative filter value
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
// or, with a Where clause on an indexed field:
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);

using (var enumerator = session.Advanced.Stream<User>(query))
{
    while (enumerator.MoveNext())
    {
        var user = enumerator.Current.Document;
        // do something
    }
}
Example index:
public class MyUserIndex : AbstractIndexCreationTask<User>
{
    public MyUserIndex()
    {
        this.Map = users =>
            from u in users
            select new
            {
                u.IsDeleted,
                u.Username,
            };
    }
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.