Schema of codec is part of the data stream - scodec

I'm currently evaluating whether scodec is the right tool for my task. I have to parse an InputStream (file or network) which is structured as follows:
| Header - FieldDesc1 - FieldDesc2 - ... - FieldDescM - Record1 - Record2 - ... - RecordN |
This means the stream starts with some metadata describing what will follow. Each element is separated by a delimiter ( - ) which identifies what type it is. The M field descriptions contain the information about which structure and size each of the N records will have.
I was readily able to parse the header as well as the sequence of field descriptions, because I could formulate a codec that is known at compile time. But I'm puzzled about how to build a codec at runtime from the information in the field descriptions.
Is that possible? If yes, perhaps you can point me to an example?

Here is a starting point if it's still relevant:
Use a DiscriminatorCodec (and a potentially related question), or
Use consume() on the codec decoding the type identifier (which I guess is a simple number), and pass the type to a function returning the appropriate Codec.
For example using consume(), you can determine what codec to use at decode time:
def variableTypeC =
  int8.consume(tid => selectCodec(tid))(selectTypeId(_))
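Fleshed out, a minimal sketch could look like this (the Field model, selectCodec, and selectTypeId are hypothetical stand-ins for your own record types and type-id mapping):

import scodec.{ Codec, Err }
import scodec.codecs._

// Hypothetical payloads: type id 1 = 32-bit int, type id 2 = length-prefixed UTF-8 text
sealed trait Field
case class IntField(value: Int) extends Field
case class TextField(value: String) extends Field

// Pick the payload codec based on the decoded type id...
def selectCodec(tid: Int): Codec[Field] = tid match {
  case 1     => int32.xmap[Field](IntField(_), { case IntField(v) => v })
  case 2     => utf8_32.xmap[Field](TextField(_), { case TextField(s) => s })
  case other => fail(Err(s"unknown type id: $other"))
}

// ...and recover the type id from a value when encoding.
def selectTypeId(f: Field): Int = f match {
  case _: IntField  => 1
  case _: TextField => 2
}

val variableTypeC: Codec[Field] = int8.consume(selectCodec)(selectTypeId)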
I had to work on a similar problem and went for the consume() solution (I had the feeling it provided a bit more flexibility, and I was only discovering scodec at the time).
I'd be happy to build an example using DiscriminatorCodec if there is any need for it :).

Related

protocol-buffers: string or byte sequence of the exact length

Looking at https://developers.google.com/protocol-buffers/docs/proto3#scalar it appears that the string and bytes types don't limit the length? Does that mean we're expected to specify the length of a transmitted string in a separate field, e.g.:
message Person {
  string name = 1;
  int32 name_len = 2;
  int32 user_id = 3;
  ...
}
The wire type used for string/bytes is Length-delimited. This means that the message includes the string's length. How this is made available to you will depend upon the language you are using - for example, the table says that in C++ a string type is used, so you can call name.length() to retrieve the length.
So there is no need to specify the length in a separate field.
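For example, the message from the question could simply drop the separate length field (keeping user_id's original field number), a minimal sketch:

message Person {
  // the wire format already carries the length of `name`
  string name = 1;
  int32 user_id = 3;
}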
One of the things that I wished GPB did was allow the schema to be used to set constraints on such things as list/array length, or numerical value ranges. The best you can do is to have a comment in the .proto file and hope that programmers pay attention to it!
Other serialisation technologies do do this, like XSD (though often the tools are poor), ASN.1 and JSON schema. It's very useful. If GPB added these (it wouldn't change the wire format), GPB would be pretty well "complete".

Automatically detect security identifier columns using Visions

I'm interested in using the Visions library to automate the process of identifying certain types of security (stock) identifiers. The documentation mentions that it could be used in such a way for ISBN codes but I'm looking for a more concrete example of how to do it. I think the process would be pretty much identical for the fields I'm thinking of as they all have check digits (ISIN, SEDOL, CUSIP).
My general idea is that I would create custom types for the different identifier types and could use those types to
Take a dataframe where the types are unknown and identify columns matching the types (even if it's not a 100% match)
Validate the types on a dataframe where the intended type is known
Great question and use-case! Unfortunately, the documentation on making new types probably needs a little love right now, as there were API-breaking changes with the 0.7.0 release. Both the previous link and this post from August 2020 should cover the conceptual idea of type creation in greater detail. If any of those examples break, then mea culpa and our apologies; we switched to a dispatch-based implementation to support different backends (pandas, numpy, dask, spark, etc.) for each type. You shouldn't have to worry about that for now, but if you're interested you can find the default type definitions here with their backends here.
Building an ISBN Type
We need to make two basic decisions when defining a type:
What defines the type?
What other types are our new type related to?
For the ISBN use-case O'Reilly provides a validation regex to match ISBN-10 and ISBN-13 codes. So,
What defines a type?
We want every element in the sequence to be a string which matches the corresponding ISBN-10 or ISBN-13 regex.
What other types are our new type related to?
Since ISBNs are themselves strings, we can use the default String type provided by visions.
Type Definition
from typing import Sequence

import pandas as pd

from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType

# O'Reilly's ISBN-10/ISBN-13 validation regex (groups are separated by a
# hyphen or a space)
isbn_regex = "^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"


class ISBN(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(String),
        ]
        return relations

    @staticmethod
    def contains_op(series: pd.Series, state: dict) -> bool:
        return series.str.contains(isbn_regex).all()
Looking at this closely, there are three things to take note of:
The new type inherits from VisionsBaseType.
We had to define a get_relations method, which is how we relate a new type to others we might want to use in a typeset. In this case, I've used an IdentityRelation to String, which means ISBNs are a subset of String. We can also use InferenceRelations when we want to support relations which change the underlying data (say, converting the string '4.2' to the float 4.2).
A contains_op: this is our definition of the type. In this case, we are applying a regex to every element in the input and verifying that it matches the pattern provided by O'Reilly.
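To see the type in action, visions types support Python's in operator for exactly this membership check; a small usage sketch (the sample values are made up):

import pandas as pd

books = pd.Series(["ISBN 978-0-596-52068-7", "0-306-40615-2"])
print(books in ISBN)                       # True: every element matches the regex
print(pd.Series(["not an isbn"]) in ISBN)  # False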
Extensions
In theory, ISBNs can be encoded in what looks like a 10- or 13-digit integer as well - to work with those, you might want to create an InferenceRelation between Integer and ISBN. A simple implementation would involve coercing Integers to string and applying the above regex.
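A rough sketch of that coercion (the helper names are hypothetical, and the exact InferenceRelation constructor arguments differ between visions releases, so check the signature of the version you have installed):

import pandas as pd

def integer_to_isbn(series: pd.Series, state: dict) -> pd.Series:
    # Assumption: zero-padding to 10 digits recovers leading zeros that an
    # integer encoding would have dropped (13-digit values are left as-is).
    return series.astype(str).str.zfill(10)

def is_isbn_integer(series: pd.Series, state: dict) -> bool:
    return integer_to_isbn(series, state).str.contains(isbn_regex).all()

These two functions would then be wired into an InferenceRelation from Integer inside ISBN.get_relations().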

How can I decode WordNet entities?

In my experiment, I am using WordNet18 data, which contains triplets of the form (subject, predicate, object); some examples are as follows:
03964744 _hyponym 04371774
00260881 _hypernym 00260622
02199712 _member_holonym 02188065
I would like to know what the entity IDs, like 03964744, stand for. Does anybody know how to decode the entities?
Thank you in advance. 
The 8-digit numbers you see are probably the byte offset of the entry in the data files. See http://wordnet.princeton.edu/wordnet/man/wnintro.5WN.html
After quite a bit of hunting around, I think you are looking at the numbers of WordNet 3.0 (the byte offsets for a given synset differ between versions; 3.1 is the latest version).
Your first entry seems to be saying that swing is a type of toy:
http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi?usrname=&gridmode=grid&synset=04371774-n&lang=eng&lang2=eng
http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi?synset=03964744-n
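If you have NLTK installed (it ships the WordNet 3.0 data, matching these offsets), you can resolve the IDs programmatically; a small sketch, assuming a reasonably recent NLTK:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

# The IDs are byte offsets into WordNet 3.0's data files; the '-n' in the
# links above indicates noun synsets, so look them up with pos='n'.
subj = wn.synset_from_pos_and_offset('n', 3964744)  # leading zeros drop out as an int
obj = wn.synset_from_pos_and_offset('n', 4371774)
print(subj.name(), obj.name())  # should print the plaything (toy) and swing synsets linked above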

How can I test the equality of two MediaStreams?

I'm wondering if there is a way to determine if two MediaStreams are equal.
What do you mean by "equal"?
I'd like to determine if the two streams are using the same hardware sources (Same microphone and camera are being used).
Acquiring streamB with the exact same constraints as streamA would mean they are equal.
Here is what I've tried so far:
comparing via the MediaStream id, e.g. streamA.id == streamB.id
This is ruled out, since according to the spec:
When a MediaStream object is created, the User Agent must generate an identifier string, and must initialize the object's id attribute to that string. A good practice is to use a UUID [rfc4122], which is 36 characters long in its canonical form. To avoid fingerprinting, implementations should use the forms in section 4.4 or 4.5 of RFC 4122 when generating UUIDs.
Compare the ids of the MediaStreamTracks - same story, a UUID is generated per track.
Compare the tracks' labels, which in current Chrome contain names/identifiers of the hardware. This is very close to what I'm looking for, however (emphasis mine):
User Agents may label audio and video sources (e.g., "Internal microphone" or "External USB Webcam"). The label attribute must return the label of the object's corresponding source, if any. If the corresponding source has or had no label, the attribute must instead return the empty string
Is there a different approach I could take? Should I never end up in a situation where I compare two media streams? Would you say I can trust the label attribute?
Thanks for your time.
groupId together with kind is probably the closest thing you will get. Until you get multiple mics/cams on the same device...
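A rough sketch of that comparison (this assumes the browser exposes groupId via MediaStreamTrack.getSettings(), which not all browsers/devices do):

// Compare two MediaStreams by the (kind, groupId) pairs of their tracks.
function sameSources(streamA, streamB) {
  const keys = stream =>
    stream.getTracks()
      .map(t => `${t.kind}:${t.getSettings().groupId}`)
      .sort()
      .join('|');
  return keys(streamA) === keys(streamB);
}

Note that devices belonging to the same physical unit (e.g. a headset's mic and camera) share a groupId, so treat a match as a strong hint rather than proof.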

Lossless assignment between Field-Symbols

I'm currently trying to perform a dynamic lossless assignment in an ABAP 7.0 SP26 environment.
Background:
I want to read in a CSV file and move it into an internal structure without any data loss. Therefore, I declared the field-symbols:
<lfs_field> TYPE any which represents a structure component
<lfs_element> TYPE string which holds a csv value
Approach:
My current "solution" is this (lo_field is an element description of <lfs_field>):
IF STRLEN( <lfs_element> ) > lo_field->output_length.
  RAISE EXCEPTION TYPE cx_sy_conversion_data_loss.
ENDIF.
I don't know precisely how it works, but it seems to catch the most obvious cases.
Attempts:
MOVE EXACT <lfs_field> TO <lfs_element>.
...gives me...
Unable to interpret "EXACT". Possible causes: Incorrect spelling or comma error
...while...
COMPUTE EXACT <lfs_field> = <lfs_element>.
...results in...
Incorrect statement: "=" missing .
As the ABAP version is too old, I also cannot use EXACT #( ... ).
Example:
In this case I'm using normal variables. Let's just pretend they are field-symbols:
DATA: lw_element TYPE string VALUE '10121212212.1256',
      lw_field   TYPE p DECIMALS 2.

lw_field = lw_element.
* lw_field now contains 10121212212.13 without any notice about the precision loss
So, how would I do a perfectly valid lossless assignment with field-symbols?
Don't see an easy way around that. Guess that's why they introduced MOVE EXACT in the first place.
Note that output_length is not a clean solution. For example, string always has output_length 0, but will of course be able to hold a CHAR3 with output_length 3.
Three ideas how you could go about your question:
Parse and compare types. Parse the source field to detect format and length, e.g. "character-like", "60 places". Then get an element descriptor for the target field and check whether the source fits into the target. Don't think it makes sense to start collecting the possibly large CASEs for this here. If you have access to a newer ABAP, you could try generating a large test data set there and use it to reverse-engineer the compatibility rules from MOVE EXACT.
Back-and-forth conversion. Move the value from source to target and back, and see whether it changes. If it changes, the fields aren't compatible. This is imprecise, as some formats will change even though the values remain the same; for example, -42 could change to 42-, although this is the same value in ABAP.
To-longer conversion. Move the field from source to target. Then construct a slightly longer version of target, and move source also there. If the two targets are identical, the fields are compatible. This fails at the boundaries, i.e. if it's not possible to construct a slightly-longer version, e.g. because the maximum number of decimal places of a P field is reached.
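A minimal sketch of the back-and-forth idea, using the field-symbols from the question (with the caveat above that representation changes like -42 vs. 42- produce false alarms):

DATA lv_back TYPE string.

<lfs_field> = <lfs_element>.
lv_back = <lfs_field>.
IF lv_back <> <lfs_element>.
  " Possible data loss (or merely a representation change, see above)
  RAISE EXCEPTION TYPE cx_sy_conversion_data_loss.
ENDIF.

And here is the third idea, the to-longer conversion, worked out (note this uses 7.40 inline declarations; on 7.0 you'd declare the variables with DATA and FIELD-SYMBOLS statements):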
DATA target TYPE char3.
DATA source TYPE string VALUE `123.5`.

DATA(lo_target) = CAST cl_abap_elemdescr( cl_abap_elemdescr=>describe_by_data( target ) ).
DATA(lo_longer) = cl_abap_elemdescr=>get_by_kind(
    p_type_kind = lo_target->type_kind
    p_length    = lo_target->length + 1
    p_decimals  = lo_target->decimals + 1 ).

DATA lv_longer TYPE REF TO data.
CREATE DATA lv_longer TYPE HANDLE lo_longer.
ASSIGN lv_longer->* TO FIELD-SYMBOL(<longer>).

<longer> = source.
target = source.

IF <longer> = target.
  WRITE `Fits`.
ELSE.
  WRITE `Doesn't fit, ` && target && ` is different from ` && <longer>.
ENDIF.