How to see all the possible options for schema metadata in TensorFlow?

I am using TensorFlow Data Validation and I am trying to build schemas around my datasets. I've built the initial schemas and I can see/edit them in Notepad, but I'm having a hard time finding a resource that shows me exactly what kind of parameters I can set in the file for a given data type (i.e. min or max values, or data shapes).
Does anyone know of a good resource or even a comprehensive schema I can use to further edit my schema file?

Schemas are just a kind of protocol buffer message, defined in TensorFlow Metadata. You can find the protocol buffer definition in tensorflow_metadata/proto/v0/schema.proto, which describes and documents all the possible properties and options.
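For example, here is a minimal sketch (assuming the tensorflow_data_validation package and a feature named 'age', which is just a placeholder) of loading a schema text file and tightening a couple of the constraints documented in schema.proto:

# Sketch: tighten constraints on a feature in a TFDV schema.
# 'schema.pbtxt' and the feature name 'age' are placeholders for your own.
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

schema = tfdv.load_schema_text('schema.pbtxt')

# Constrain an integer feature to a value range (IntDomain is one of the
# options documented in schema.proto).
tfdv.set_domain(schema, 'age', schema_pb2.IntDomain(min=0, max=120))

# Require the feature to be present in at least 90% of examples.
feature = tfdv.get_feature(schema, 'age')
feature.presence.min_fraction = 0.9

tfdv.write_schema_text(schema, 'schema.pbtxt')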

Related

How exactly do references in flatbuffers work?

According to the "Writing a schema" guide for Google FlatBuffers it is possible to share data using references: "Remember that you can share data (refer to the same string/table within a buffer), so factoring out repeating data into its own data structure may be worth it."
However, I don't quite understand how this is meant to be accomplished. I have a FlatBuffer that I'm trying to reverse engineer, and I discovered that there are multiple offsets pointing to the same string value. When I compile the decoded JSON file again, there are multiple occurrences of that string. What exactly do I have to specify in the schema file to prevent this?
Thank you :)
JSON has no way to represent such references, so a buffer is "flattened" to a tree when output as JSON. Only at the binary level can a FlatBuffer represent a DAG. You can construct such a DAG simply by using a child offset twice in parent(s) while serializing.
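For illustration, a minimal Python sketch of that offset reuse; the Root module and its Add* accessors stand in for whatever flatc generates from your schema and are purely hypothetical:

# Sketch: share one string between two fields at the binary level.
# 'Root' stands in for a flatc-generated module; its Add* functions are
# hypothetical and depend entirely on your own schema.
import flatbuffers
import Root  # hypothetical flatc-generated code

builder = flatbuffers.Builder(0)

# Create the string once; 'shared' is its offset inside the buffer.
shared = builder.CreateString("common value")

Root.Start(builder)
Root.AddFirstLabel(builder, shared)   # same offset used here...
Root.AddSecondLabel(builder, shared)  # ...and here, so the bytes exist only once
root = Root.End(builder)
builder.Finish(root)

buf = builder.Output()  # a binary FlatBuffer containing a small DAG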

One time migration of VSAM files from Mainframe to Cloud Azure

We want to migrate bulk files (e.g. VSAM) from the mainframe to Azure at the beginning of the project. How can that be achieved?
Is there a utility for this, or do we need to write our own scripts?
I suspect there are some utilities out there, but most or all of them are likely priced products. Since VSAM datasets are not defined using a language construct like DDL, you will likely have to do most of the heavy lifting yourself, either writing your own programs or custom scripts. You didn't mention the operating system, but I assume you're working on z/OS.
Here are some things to consider:
The structure of the VSAM dataset is basically record oriented. There are three basic types you’ll run into that host application data:
Key Sequenced Datasets (KSDS)
Entry Sequenced Datasets (ESDS)
Relative Record datasets (RRDS)
Familiarize yourself with the means of defining the datasets, as it will give you some insight into the dataset specifics. DFSMS Access Method Services Commands will show you the utilities used to create them and to get information like the key length and offset of the key. DEFINE CLUSTER is the command to create a dataset. You mentioned you are moving the data to Azure, but this will help you understand the characteristics of the data you are moving.
Since there is no DDL for VSAM datasets, you will generally find the structure in the programs that manipulate them, in the form of COBOL copybooks, HLASM DSECTs and similar constructs. This is the long pole in the tent for you.
Consider the semantics of accessing the data. VSAM as an access method does have some ability to control read/write access at a macro level using a DEFINE CLUSTER option called SHAREOPTIONS. The SHAREOPTIONS instruct the operating system how to handle the VSAM buffers in terms of reading and writing so that multiple processes can access the same data. It's primitive compared to shared file systems like NFS. VSAM also allows the application to control access (or serialization) using ENQ / DEQ functions, which enable applications to express their intent regarding a VSAM cluster and coordinate their own activities.
You might find that converting a VSAM file to a relational form like Db2 is better for you. Again, you’ll have to create the DDL to describe the tables, data formats and the like.
Another consideration is data conversion. You’ll find there is character data that is most likely in EBCDIC and needs to be converted to a new code page. Numeric data can be in Packed Decimal, Binary, or even text and will need to be converted.
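To give a feel for that conversion work, here is a rough Python sketch; the record layout (a 20-byte EBCDIC name followed by a 5-byte packed-decimal amount), the cp037 code page and the field offsets are invented examples, not something derived from your copybooks:

# Sketch: decode one fixed-length VSAM record exported as raw bytes.
# The layout (a 20-byte EBCDIC name followed by a 5-byte packed-decimal
# amount with two implied decimal places) is an invented example.
import decimal

def unpack_comp3(raw: bytes, scale: int = 0) -> decimal.Decimal:
    # IBM packed decimal (COMP-3): two digits per byte, sign in the final nibble.
    digits = []
    for b in raw:
        digits.append((b >> 4) & 0x0F)
        digits.append(b & 0x0F)
    sign_nibble = digits.pop()              # 0xD = negative; 0xC / 0xF = positive
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return decimal.Decimal(value).scaleb(-scale)

def convert_record(record: bytes) -> dict:
    name = record[0:20].decode("cp037").rstrip()    # EBCDIC -> Unicode
    amount = unpack_comp3(record[20:25], scale=2)   # e.g. PIC S9(7)V99 COMP-3
    return {"name": name, "amount": str(amount)}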
The short answer is there isn't an "Easy Button" to do what you want. The data itself is only one of the questions that needs to be answered: you also have to consider serialization and access to the data, code page conversion, and, if you are moving some data but not all of it, whether you will need to map some of the converted data back to data on the mainframe.
Consider exploring IBM CDC classic replication. You can achieve this with a few clicks.
I have not done it for Azure, though, so I'm not sure about support.

Does Semantic tools like Anzo create a copy of data?

I'm new to semantic technologies. I understand what RDF, OWL, ontologies and other basic terminology are, and how semantic search uses them. When we create a semantic search module using Anzo with enterprise search capabilities, it connects to various data sources and creates relationships between them. Now I'm interested in knowing what a semantic tool like Anzo does internally.
1. Does it create a copy of the data on the local machine, or does it hit the data sources every time we execute a SPARQL query?
2. If it stores data, is the data stored in its raw form, or after cleaning and creating semantic relations between the sources?
3. What happens to the data after a query is executed? How does it get current data every time?
Any thoughts on this would be valuable to me.
Thanks a lot in advance!
Based on your comments, it appears you're using the Anzo Graph Query Engine? If so, then the answers to your questions are:
1. A copy of the data is held in memory.
2. Not clear from any of the published information.
3. It doesn't. You need to load the data in using the 'LOAD' command.
A bit more on 3: you would be responsible for implementing a mechanism to keep the data held there up to date with the underlying data source (which might be as simple as rebuilding the graph from a nightly dump, or as involved as implementing change data capture against the underlying store that replicates CRUD operations onto the graph).
My answers are based on the marketing and support information available on the CambridgeSemantics site.
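A minimal sketch of the nightly-rebuild approach mentioned above, assuming a SPARQL 1.1 Update endpoint and the SPARQLWrapper library; the endpoint URL, graph IRI and dump location are invented placeholders, not AnzoGraph-specific values:

# Sketch: drop and reload a named graph from a nightly dump via SPARQL Update.
# The endpoint URL, graph IRI and dump file IRI are placeholders.
from SPARQLWrapper import SPARQLWrapper, POST

ENDPOINT = "http://example.com/sparql"            # hypothetical update endpoint
GRAPH = "http://example.com/graphs/enterprise"    # hypothetical named graph
DUMP = "file:///data/nightly-dump.ttl"            # hypothetical nightly export

sparql = SPARQLWrapper(ENDPOINT)
sparql.setMethod(POST)
sparql.setQuery(f"""
    DROP SILENT GRAPH <{GRAPH}> ;
    LOAD <{DUMP}> INTO GRAPH <{GRAPH}>
""")
sparql.query()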

Liquibase load data in a format other than CSV

With the load data option that Liquibase provides, one can specify seed data in CSV format. Is there a way I can provide, say, a JSON or XML file with data that Liquibase would understand?
The use case is that we are trying to put in some sample data which is hierarchical, e.g. a Category - Subcategory relation, which would require putting in the parent id for all related subcategories. Is there a way to avoid including the ids in the seed data by using, say, JSON like the following?
{
"MainCat1": ["SubCat11", "SubCat12"],
"MainCat2": ["SubCat21", "SubCat22"]
}
Very likely this is not supported (I couldn't make Google help me), but is there a way to write a plugin or something that does this? A pointer to a guide (if any) would help.
NOTE: This is not about specifying the change log in that format.
This is not currently supported, and supporting it robustly would be pretty difficult. The main difficulty lies in the fact that Liquibase is designed to be database-platform agnostic, combined with the design goal of being able to generate the SQL required to do an operation without actually performing the operation live.
Inserting data like you want without knowing the keys and just generating SQL that could be run later is going to be very difficult, perhaps even impossible. I would suggest approaching Nathan, who is the main developer for Liquibase, more directly. The best way to do that might be through the JIRA bug database for Liquibase.
If you want to have a crack at implementing it, you could start by looking at the code for the LoadDataChange class (source on GitHub), which is where the CSV support currently lives.
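In the meantime, one possible workaround outside Liquibase itself is to pre-process the hierarchical JSON into the flat CSV files that loadData already understands, generating the parent ids yourself. A rough sketch (the file names, column names and sequential id scheme are invented for illustration, not anything Liquibase prescribes):

# Sketch: flatten hierarchical JSON seed data into the CSVs loadData expects.
# File names, column names and the sequential id scheme are invented examples;
# align them with your own table definitions.
import csv, json

with open("categories.json") as f:
    data = json.load(f)

with open("category.csv", "w", newline="") as cat_file, \
     open("subcategory.csv", "w", newline="") as sub_file:
    cat_writer = csv.writer(cat_file)
    sub_writer = csv.writer(sub_file)
    cat_writer.writerow(["id", "name"])
    sub_writer.writerow(["id", "name", "parent_id"])

    sub_id = 0
    for cat_id, (category, subcategories) in enumerate(data.items(), start=1):
        cat_writer.writerow([cat_id, category])
        for subcategory in subcategories:
            sub_id += 1
            sub_writer.writerow([sub_id, subcategory, cat_id])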

Avro a replacement for Writables

I am very new to Hadoop and have to delve into its serialization. I know that Hadoop comes with its own serialization mechanism called Writables. I was curious to know whether Avro (or Protobuf, or Thrift) replaces the Writables interface, or whether Avro is just meant for serializing the MR client data and not the internal communication between, say, the namenode and datanode.
Avro is a serialization library (with APIs for a number of languages). Avro is an alternative to using/implementing your key/value objects as Writables, but Hadoop still uses its own RPC data structures when communicating between the various services (datanodes, namenodes, job and task trackers).
I've read somewhere that Avro may well end up being the standard internal data exchange mechanism/serialization framework within Hadoop, which makes sense as it is based on inheritance, much like the "new" Hadoop API (the one that uses the mapreduce namespace for its libraries), whereas the "old" API (mapred libraries) is based on interfaces. That means, in practice, that you can certainly use Avro with both APIs, although one or two things may require custom code if you're using the mapred libs (e.g. multiple output formats, chain mappers).
But Avro offers far more than "just" doing away with the need for your own Writables (although that is, in my view, a considerable plus): it offers fairly efficient serialization, the choice between serializing against generated entity classes (as Thrift requires) or using a so-called GenericRecord structure instead, and not having to have tagged data. This is possible because Avro always has its data schema available at read and write time (it's actually saved in JSON format as a header in the data file), which means you have the option of "projecting" from one set of fields to a subset of those fields by simply providing this information implicitly in the schema used to read the data. You can then adapt to changes in input data structure by tweaking your schemas, rather than changing your code in multiple places. You can also change the way your data is sorted by defining your schema appropriately (as there is an optional ORDER attribute available).
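To make the projection idea concrete, here is a small sketch using the fastavro library (the record fields and file path are invented); the writer schema travels in the file header, so only a narrower reader schema needs to be supplied to project onto a subset of fields:

# Sketch: write records with a full schema, then read back only a subset of
# fields by supplying a narrower reader schema (Avro schema resolution).
# The field names and file path are invented examples.
from fastavro import writer, reader, parse_schema

full_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
    ],
})

# The writer schema is stored in the file header alongside the data.
with open("users.avro", "wb") as out:
    writer(out, full_schema, [{"id": 1, "name": "Ada", "email": "ada@example.com"}])

# Project onto a subset of the fields by passing a reader schema.
projected_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})
with open("users.avro", "rb") as fo:
    for record in reader(fo, reader_schema=projected_schema):
        print(record)   # {'id': 1, 'name': 'Ada'}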