I'm doing some metaprogramming with BigQuery and noticed something I didn't expect.
I'm using this query:
SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 1000
which runs against a public dataset. If you analyze that query, you will see that the schema looks like this:
"schema": {
"fields": [
{
"name": "word",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "word_count",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "corpus",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "corpus_date",
"type": "INTEGER",
"mode": "NULLABLE"
}
]
},
This might look good at first, but if you look at the table definition for bigquery-public-data.samples.shakespeare, you will notice that every field in that SELECT is REQUIRED in the table. So why does it end up being NULLABLE in the schema for the SELECT?
Some context:
I'm working on an F# type provider where I try to encode all values as correctly as possible: nullable fields as option types and non-nullable fields as regular types. If I always get NULLABLE, usage becomes much more cumbersome for fields that can't be null.
Even though fields are REQUIRED in the table schema, a query can apply transformations that convert non-NULL values to NULL values, so the query result may have a different schema (both with respect to nullability and to data types) than the original table had.
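You can observe the difference from code; here is a minimal sketch using the google-cloud-bigquery Python client (my tooling choice; the question only shows the raw JSON response):

from google.cloud import bigquery

client = bigquery.Client()

# Schema of the query result: every field comes back NULLABLE.
rows = client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 1000"
).result()
for field in rows.schema:
    print(field.name, field.field_type, field.mode)  # mode == NULLABLE

# Schema of the table itself: the same fields are REQUIRED.
table = client.get_table("bigquery-public-data.samples.shakespeare")
for field in table.schema:
    print(field.name, field.field_type, field.mode)  # mode == REQUIRED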
Related
Suppose I have JSON like this:
{"1": {"first_name": "a", "last_name": "b"},
"2": {"first_name": "c", "last_name": "d"}}
As you can see, the values follow this schema:
{"type": "object",
"properties": {
"first_name": {"type": "string"},
"last_name": {"type": "string"}
},
"additionalProperties": false,
"required": ["first_name", "last_name"]}
How can I define a schema that validates the above JSON?
additionalProperties takes a JSON Schema as its value. (Yes, a boolean is a valid JSON Schema!)
Let's recap what the additionalProperties keyword does...
The behavior of this keyword depends on the presence and annotation
results of "properties" and "patternProperties" within the same schema
object. Validation with "additionalProperties" applies only to the
child values of instance names that do not appear in the annotation
results of either "properties" or "patternProperties".
For all such properties, validation succeeds if the child instance
validates against the "additionalProperties" schema.
https://json-schema.org/draft/2020-12/json-schema-core.html#additionalProperties
In simplest terms, if you don't use properties or patternProperties within the same schema object, the value schema of additionalProperties applies to ALL values of the applicable object in your instance.
As such, you only need to nest your existing schema as follows.
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"additionalProperties": YOUR SCHEMA
}
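For instance, here is a runnable check using the Python jsonschema package (my choice of validator, not something from the original answer), with the question's schema nested under additionalProperties:

from jsonschema import validate

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "additionalProperties": {
        "type": "object",
        "properties": {
            "first_name": {"type": "string"},
            "last_name": {"type": "string"}
        },
        "additionalProperties": False,
        "required": ["first_name", "last_name"]
    }
}

instance = {
    "1": {"first_name": "a", "last_name": "b"},
    "2": {"first_name": "c", "last_name": "d"}
}

# Passes silently; raises ValidationError if any value is missing a name.
validate(instance=instance, schema=schema)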
I want to convert multiple types (supported in the latest drafts of JSON Schema, and also in OpenAPI v3.1) to anyOf or oneOf, but I am a bit confused about which of the two the types should be mapped to. Or can I map to either of the two?
PS: I do know about anyOf, oneOf, etc., but the behavior of multiple types is a little ambiguous. (I know the schema is invalid; it is just an example focused on type conversion.)
{
  "type": ["null", "object", "integer", "string"],
  "properties": {
    "prop1": {
      "type": "string"
    },
    "prop2": {
      "type": "string"
    }
  },
  "enum": [2, 3, 4, 5],
  "const": "sample const entry",
  "exclusiveMinimum": 1.22,
  "exclusiveMaximum": 50,
  "maxLength": 10,
  "minLength": 2,
  "format": "int32"
}
I am converting it this way.
{
  "anyOf": [
    {
      "type": "null"
    },
    {
      "type": "object",
      "properties": {
        "prop1": {
          "type": "string"
        },
        "prop2": {
          "type": "string"
        }
      }
    },
    {
      "type": "integer",
      "enum": [2, 3, 4, 5],
      "exclusiveMinimum": 1.22,
      "exclusiveMaximum": 50,
      "format": "int32"
    },
    {
      "type": "string",
      "maxLength": 10,
      "minLength": 2,
      "const": "sample const entry"
    }
  ]
}
anyOf gives you a closer match for the semantics than oneOf.
The problem (or benefit!) of oneOf is that it will fail if you happen to match two different cases.
That is unlikely to be what you want, given that the source of your conversion has those looser semantics.
Imagine converting ["integer", "number"], for example: if the input was a 1, you'd match both and fail using oneOf.
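A quick demonstration of that failure mode with the Python jsonschema package (my tooling choice, not the answerer's):

from jsonschema import validate, ValidationError

any_of = {"anyOf": [{"type": "integer"}, {"type": "number"}]}
one_of = {"oneOf": [{"type": "integer"}, {"type": "number"}]}

validate(1, any_of)  # passes: at least one subschema matches
try:
    validate(1, one_of)
except ValidationError:
    print("oneOf fails: 1 is valid against both subschemas")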
First of all, your example is not valid:
The initial schema doesn't match anything; it's an "impossible" schema. The "enum": [2, 3, 4, 5] and "const": "sample const entry" constraints are mutually exclusive, and so are "const": "sample const entry" and "maxLength": 10.
The rewritten schema is not equivalent to the original schema because the enum and const were moved from the root level into subschemas. Yes, this way the schema makes more sense and will sort of work (e.g. it will match the specified numbers, but not strings, because of the const vs maxLength contradiction), but it's not the same as the original schema.
With regard to oneOf/anyOf:
It depends.
The choice between anyOf and oneOf depends on the context, i.e. whether an instance can match more than one subschema or exactly one subschema. In other words, whether multiple subschema matches are considered OK or an error. Nullable references typically need anyOf rather than oneOf, but other cases vary from schema to schema.
For example,
"type": ["number", "integer"]
corresponds to anyOf because there's an overlap - integer values are also valid "number" values in JSON Schema.
Whereas
"type": ["string", "integer"]
can be represented using either oneOf or anyOf. oneOf is semantically closer since strings and integers are totally different data types with no overlap. But technically anyOf also works; it's just that there won't be more than one subschema match in this particular case.
In your example, all base type values are distinct with no overlap, so I would use oneOf, but technically anyOf will also work.
On using the typedef() method from the pyral (version 1.4.2) library, I am able to get the fields for the respective artifact types (e.g. defect),
but in the response the data type (AttributeType) for the fields differs from the one seen in the UI, e.g. for a drop-down field I'm getting 'Rating/String' as the AttributeType instead of 'DROP_DOWN'.
How could I get the real data type, as seen in the UI, from the response?
Is there any other API that I can use to get all the fields and their data types and allowed values for a defect?
If you take a look at the WSAPI docs, you'll find that the AttributeDefinition object has many different elements. The AttributeType is the one you're accessing which can be one of the following values:
"BINARY_DATA", "BOOLEAN", "COLLECTION", "DATE", "DECIMAL", "INTEGER", "OBJECT", "QUANTITY", "RATING", "RAW", "STATE", "STRING", "TEXT", "MAP", "WEB_LINK"
But it also has an element called RealAttributeType which can be one of the following values:
"BINARY_DATA", "OBJECT", "QUANTITY", "RATING", "STATE", "RAW", "COLLECTION", "TEXT", "BOOLEAN", "INTEGER", "DECIMAL", "WEB_LINK", "DATE", "STRING", "DROP_DOWN", "MULTI_VALUE", "USER"
Have you tried accessing the RealAttributeType to see its value?
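Something like this untested sketch should show it (the server URL and credentials are placeholders, and the .Attributes / .ElementName element names are assumptions based on the WSAPI object model rather than the pyral docs):

from pyral import Rally

rally = Rally("rally1.rallydev.com", user="me@example.com", password="...")

defect_type = rally.typedef("Defect")  # same typedef() call as in the question
for attr in defect_type.Attributes:
    # RealAttributeType is where DROP_DOWN, MULTI_VALUE and USER show up.
    print(attr.ElementName, attr.AttributeType, attr.RealAttributeType)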
I found that, for one specific Sheets document I was trying to reference as an external table, the heading row was being included in the data when executing queries. I decided to drop the table and recreate it using a definition file, which would definitely expose the options from the docs. It didn't seem to work: no schema is created, despite being defined in the file.
I've recreated the issue with a simple sheet with 3 columns and a frozen header row and the following test.def file:
{
  "autodetect": false,
  "schema": {
    "fields": [
      {"name": "c1", "type": "STRING", "mode": "NULLABLE"},
      {"name": "c2", "type": "STRING", "mode": "NULLABLE"},
      {"name": "c3", "type": "STRING", "mode": "NULLABLE"}
    ]
  },
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "https://docs.google.com/spreadsheets/..."
  ],
  "googleSheetsOptions": {
    "skipLeadingRows": 1
  }
}
and then I try to create the table using:
bq mk myproject:mydataset.mytable < test.def
the table is created but no schema is present - what am I doing wrong?
This issue remains, but I cannot identify why 95% of the time the table is created OK and the first/header row is correctly excluded from the data returned by a query, yet in one specific case, created the same way as all the others, the header row is returned in the data.
Odd :(
OK so the correct syntax is:
bq mk --external_table_definition=myfile.def project:dataset.table
This also allows you to tell Google to skip leading rows on the sheet (at the time of writing, not possible from the BQ UI).
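Applied to the example above, that would be:

bq mk --external_table_definition=test.def myproject:mydataset.mytable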
I am confused about which situation the properties in my JSON schemas should be defined for.
Assume I have a product item for which I am trying to define a schema. In my database the products table has id, brand_id, name, item_number and description. All except description are required fields. The id is autogenerated by the database, and the brand_id is set automatically by the API upon creation, based on the creating user.
This means I can POST /api/products using only the following data:
{
  "product": {
    "name": "Product Name",
    "item_number": "item001"
  }
}
However, how should I now define the product schema? Should I include the properties id and brand_id? If so, should I label them as required, even though they are set automatically?
This is what I came up with:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net/products",
  "type": "object",
  "properties": {
    "item_number": {
      "id": "http://jsonschema.net/products/item_number",
      "type": "string"
    },
    "name": {
      "id": "http://jsonschema.net/products/name",
      "type": "string"
    },
    "description": {
      "id": "http://jsonschema.net/products/description",
      "type": "string",
      "default": "null"
    }
  },
  "required": [
    "item_number",
    "name"
  ]
}
You should define in your JSON Schema only those properties that are dealt with by the user of the API.
In your case, it makes no sense to have id and brand_id in the schema that defines the POST body entity for the creation of a new product because these values are not provided by the API user.
That said, you may have another schema for existing product entities where these two fields are present, if it's OK to expose them publicly.
If that's the case, you can use the schema union mechanism: have the "existing product" schema use allOf with new_product.json and add id and brand_id to it.
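As an illustration, here is a hedged sketch of that union in Python with the jsonschema package (the integer types for id and brand_id are my assumption; in a real setup the first element of allOf would be a $ref to new_product.json):

from jsonschema import validate

# The POST-body schema from the question, minus the draft-04 "id" URIs.
new_product_schema = {
    "type": "object",
    "properties": {
        "item_number": {"type": "string"},
        "name": {"type": "string"},
        "description": {"type": "string"}
    },
    "required": ["item_number", "name"]
}

# "Existing product" schema: everything above, plus the server-set fields.
existing_product_schema = {
    "allOf": [
        new_product_schema,  # in practice: {"$ref": "new_product.json"}
        {
            "properties": {
                "id": {"type": "integer"},       # assumed type
                "brand_id": {"type": "integer"}  # assumed type
            },
            "required": ["id", "brand_id"]
        }
    ]
}

validate(
    {"id": 1, "brand_id": 2, "name": "Product Name", "item_number": "item001"},
    existing_product_schema
)  # passes; omitting id or brand_id would raise ValidationError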