How do I enforce mutually-exclusive properties in two sibling objects using JSON schema? - jsonschema

I have a JSON object with two constant properties - foo and bar. The values are also objects with arbitrary keys. How do I assert, using JSON schema, that the properties of the foo and bar objects are mutually exclusive?
For instance, this is a good (=should pass validation) object, because foo has properties a, b while bar has properties c, d:
{
  "foo": {
    "a": 1,
    "b": 2
  },
  "bar": {
    "c": 3,
    "d": 4
  }
}
While this is a bad (=should fail validation) object, because both foo and bar have property a in them:
{
  "foo": {
    "a": 1,
    "b": 2
  },
  "bar": {
    "c": 3,
    "a": 4
  }
}
Note that a, b, c were chosen arbitrarily: I don't require any particular properties, only that if a key appears in one object it must not appear in the other (and vice versa). Also, there are only the two objects foo and bar, not an arbitrarily long list of objects.
Here's a simple schema (modified from the one generated using genson) that would validate both the good and bad examples above (because the constraint is missing):
{
    '$schema': 'http://json-schema.org/schema#',
    'type': 'object',
    'properties': {
        'foo': {'type': 'object'},
        'bar': {'type': 'object'}
    },
    'required': ['bar', 'foo']
}
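For reference, a quick check with the Python jsonschema package (my own illustration, not part of the schema above) confirms that both examples pass this schema as-is:

# Minimal check sketch: both the "good" and the "bad" object validate against
# the schema above, because the exclusivity constraint is not expressed yet.
from jsonschema import validate

schema = {
    '$schema': 'http://json-schema.org/schema#',
    'type': 'object',
    'properties': {
        'foo': {'type': 'object'},
        'bar': {'type': 'object'}
    },
    'required': ['bar', 'foo']
}

good = {"foo": {"a": 1, "b": 2}, "bar": {"c": 3, "d": 4}}
bad = {"foo": {"a": 1, "b": 2}, "bar": {"c": 3, "a": 4}}

validate(good, schema)  # passes, as expected
validate(bad, schema)   # also passes, which is the problem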

Related

BigQuery: Get field names of a STRUCT

I have some data in a STRUCT in BigQuery. Below I have visualised an example of the data as JSON:
{
  ...
  siblings: {
    david: { a: 1 }
    sarah: { b: 1, c: 1 }
  }
  ...
}
I want to produce a field from a query that resembles ["david", "sarah"]. Essentially I just want to get the keys from the STRUCT (object). Note that every user will have different key names in the siblings STRUCT.
Is this possible in BigQuery?
Thanks,
A
Your struct's schema must be consistent throughout the table. The keys can't change from row to row, because they're part of the table schema. To get the keys, you simply take a look at the table schema.
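For example, the sub-fields of a STRUCT column can be read straight off the table metadata; a minimal sketch with the Python client (the client usage and the table id are my assumptions, not part of this answer):

# Sketch: list the keys of every STRUCT (RECORD) column from the table schema.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # hypothetical table id

for field in table.schema:
    if field.field_type == "RECORD":
        # sub-fields of a STRUCT column are part of the table schema itself
        print(field.name, [sub.name for sub in field.fields])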
If the values change from row to row, they're probably values in an array; I guess you might have something like this:
WITH t AS (
  SELECT 1 AS id, [STRUCT('david' AS name, 33 AS age), ('sarah', 42)] AS siblings
  UNION ALL
  SELECT 2, [('ken', 19), ('ryu', 21), ('chun li', 23)]
)
SELECT * FROM t
If you tried to introduce new keys in the second row or within the array, you'd get an error Array elements of types {...} do not have a common supertype at ....
The first element of the above example in json representation looks like this:
{
  "id": "1",
  "siblings": [
    {
      "name": "david",
      "age": "33"
    },
    {
      "name": "sarah",
      "age": "42"
    }
  ]
}
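With that layout, the ["david", "sarah"] style list the question asked for can be produced from the array; a sketch with the Python client (client usage is my assumption, and the WITH clause just reuses the example data above):

# Sketch: collect the 'name' values of the repeated STRUCT into an array per row.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
WITH t AS (
  SELECT 1 AS id, [STRUCT('david' AS name, 33 AS age), ('sarah', 42)] AS siblings
  UNION ALL
  SELECT 2, [('ken', 19), ('ryu', 21), ('chun li', 23)]
)
SELECT id, ARRAY(SELECT s.name FROM UNNEST(siblings) AS s) AS sibling_names
FROM t
"""

for row in client.query(sql).result():
    print(row.id, list(row.sibling_names))  # e.g. 1 ['david', 'sarah']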

Sort axis by another field in Vega Lite

I'm trying to sort ordinal data on the x axis by a different field from the one I'm using as a label. Both fields (I'll call them 'sortable' and 'nonsortable') are one-to-one, meaning one is computed from the other and there will never be an instance when one 'sortable' value will correspond to two different 'nonsortable' values, and vice versa.
I tried two approaches:
Changing the sort order to use a different field like this:
...
x: {
  field: 'nonsortable',
  sort: {
    field: 'sortable',
    op: 'count',
  },
},
...
I wasn't sure which aggregate operation to use, but since the two fields are one-to-one, that shouldn't matter, right?
This changed the order in a way that I don't understand, but it certainly didn't sort the axis by the 'sortable' field as intended.
Changing the label to a different field like this:
...
x: {
  field: 'sortable',
  axis: {
    labelExpr: 'datum.nonsortable',
  },
}
...
This didn't work at all. I think maybe I misunderstood how the label expressions work.
Is there another way to do this, or maybe a way to salvage one of these attempts?
If no aggregation is required, you should pass the sort attribute without an aggregate. For example (vega editor link):
{
  "data": {
    "values": [
      {"sortable": 5, "nonsortable": "five"},
      {"sortable": 2, "nonsortable": "two"},
      {"sortable": 3, "nonsortable": "three"},
      {"sortable": 1, "nonsortable": "one"},
      {"sortable": 4, "nonsortable": "four"}
    ]
  },
  "mark": "point",
  "encoding": {
    "x": {
      "type": "nominal",
      "field": "nonsortable",
      "sort": {"field": "sortable"}
    }
  }
}
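If you happen to build the spec from Python, the same idea can be written with Altair (my assumption about tooling, not something the question uses); the sort is again a plain field reference with no aggregate:

# Sketch: the Vega-Lite spec above, expressed with Altair.
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "sortable": [5, 2, 3, 1, 4],
    "nonsortable": ["five", "two", "three", "one", "four"],
})

chart = alt.Chart(df).mark_point().encode(
    x=alt.X("nonsortable:N", sort=alt.EncodingSortField(field="sortable"))
)
chart  # renders in a notebook; chart.save("chart.html") works elsewhere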

How to use Schema.from_dict() for nested dictionaries?

I am trying to create a Schema class from nested dictionaries that have some lists as elements. However, when I call dumps(), only the top-level elements are dumped.
I have a REST API that returns a list of certain things, e.g. a list of users, but the schema is such that certain aggregate details are sent at the top level. The data looks something like this, and it is also what I am expecting as output:
{
  "field1": 5,
  "field2": false,
  "field3": {
    "field4": 40,
    "field5": [
      {
        "field6": "goo goo gah gah",
        "field7": 99.341879,
        "field8": {
          "field9": "goo goo gah gah",
          "field10": "goo goo gah gah"
        }
      }
    ]
  }
}
Here's my code:
MySchema = Schema.from_dict(
    {
        "field1": fields.Int(),
        "field2": fields.Bool(),
        "field3": {
            "field4": fields.Int(),
            "field5": [
                {
                    "field6": fields.Str(),
                    "field7": fields.Float(),
                    "field8": {
                        "field9": fields.Str(),
                        "field10": fields.Str()
                    }
                }
            ]
        }
    }
)

# Then use it like:
response = MySchema().dumps(data)
Actual result:
"{\"field1\": 5, \"field2\": false}"
Option 1
You're looking for several nested schemas, interconnected through fields.Nested:
from marshmallow import Schema, fields
Field8Schema = Schema.from_dict({
    "field9": fields.Str(),
    "field10": fields.Str()
})

Field5Schema = Schema.from_dict({
    "field6": fields.Str(),
    "field7": fields.Float(),
    "field8": fields.Nested(Field8Schema),
})

Field3Schema = Schema.from_dict({
    "field4": fields.Int(),
    "field5": fields.List(fields.Nested(Field5Schema))
})

MySchema = Schema.from_dict({
    "field1": fields.Int(),
    "field2": fields.Bool(),
    "field3": fields.Nested(Field3Schema),
})

MySchema().dump(data)
# {'field2': False,
#  'field1': 5,
#  'field3': {'field4': 40,
#             'field5': [{'field6': 'goo goo gah gah',
#                         'field8': {'field9': 'goo goo gah gah', 'field10': 'goo goo gah gah'},
#                         'field7': 99.341879}]}}
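Here data is the nested dict from the question, i.e. something like:

# Sample input mirroring the expected output shown in the question
data = {
    "field1": 5,
    "field2": False,
    "field3": {
        "field4": 40,
        "field5": [{
            "field6": "goo goo gah gah",
            "field7": 99.341879,
            "field8": {"field9": "goo goo gah gah", "field10": "goo goo gah gah"},
        }],
    },
}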
Option 2
If the nesting won't be that deep, it might be simpler to use decorators, i.e. nest and unnest data as suggested in the docs:
from marshmallow import Schema, pre_load, post_dump

class UserSchema(Schema):
    @pre_load(pass_many=True)
    def remove_envelope(self, data, many, **kwargs):
        namespace = 'results' if many else 'result'
        return data[namespace]

    @post_dump(pass_many=True)
    def add_envelope(self, data, many, **kwargs):
        namespace = 'results' if many else 'result'
        return {namespace: data}
It feels like it fits your case nicely.
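For instance, with a single declared field the envelope round-trips like this (the name field and the sample payload are made up for illustration):

# Usage sketch for the envelope approach; field name and data are hypothetical.
from marshmallow import Schema, fields, pre_load, post_dump

class WrappedUserSchema(Schema):
    name = fields.Str()

    @pre_load(pass_many=True)
    def remove_envelope(self, data, many, **kwargs):
        namespace = 'results' if many else 'result'
        return data[namespace]

    @post_dump(pass_many=True)
    def add_envelope(self, data, many, **kwargs):
        namespace = 'results' if many else 'result'
        return {namespace: data}

users = WrappedUserSchema(many=True).load({"results": [{"name": "Mick"}, {"name": "Keith"}]})
# users == [{'name': 'Mick'}, {'name': 'Keith'}]
print(WrappedUserSchema(many=True).dump(users))
# {'results': [{'name': 'Mick'}, {'name': 'Keith'}]}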
Comments
I'd suggest not using from_dict, as it is less readable for such complex data, and instead switching to a class-based schema.
There are plenty of good examples of nesting in the docs.

BigQuery: Create column of JSON datatype

I am trying to load json with the following schema into BigQuery:
{
  key_a: value_a,
  key_b: {
    key_c: value_c,
    key_d: value_d
  },
  key_e: {
    key_f: value_f,
    key_g: value_g
  }
}
The keys under key_e are dynamic, i.e. in one response key_e will contain key_f and key_g, and in another response it will instead contain key_h and key_i. New keys can be created at any time, so I cannot create a record with nullable fields for all possible keys.
Instead I want to create a column with JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e as a column with STRING datatype but value_e is analysed as JSON and so fails.
How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?
Having your JSON as a single string column inside BigQuery is definitely an option. If you have a large volume of data, this can end up with a high query price, since all your data will end up in one column, and the actual querying logic can become quite messy.
If you have the luxury of slightly changing your "design", I would recommend considering the one below, where you can employ REPEATED mode.
Table schema:
[
  {
    "name": "key_a",
    "type": "STRING"
  },
  {
    "name": "key_b",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      { "name": "key", "type": "STRING" },
      { "name": "value", "type": "STRING" }
    ]
  },
  {
    "name": "key_e",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      { "name": "key", "type": "STRING" },
      { "name": "value", "type": "STRING" }
    ]
  }
]
Example of JSON to load
{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}
Please note: it should be newline-delimited JSON, so each row must be on one line.
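With this layout, the value stored under a given dynamic key can be pulled out with a correlated subquery over the repeated field; a sketch with the Python client (client usage and the table id are my assumptions, not part of this answer):

# Sketch: extract the value stored under the dynamic key 'key_f', if present.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  key_a,
  (SELECT value FROM UNNEST(key_e) WHERE key = 'key_f') AS key_f
FROM `my-project.my_dataset.my_table`
"""

for row in client.query(sql).result():
    print(row.key_a, row.key_f)  # key_f is NULL for rows without that key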
You can't do this directly with BigQuery, but you can make it work in two passes:
(1) Import your JSON data as a CSV file with a single string column.
(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.
Example
I'll start with some JSON:
{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}
Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn, 'þ'), or it tripped over the default ','.
Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.
Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:
a (INTEGER)
b (STRING)
c (STRING -- packed JSON)
We can do this with a UDF:
// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
  'extractAndRepack',                  // Name of the function exported to SQL
  ['blob'],                            // Names of input columns
  [{'name': 'a', 'type': 'integer'},   // Output schema
   {'name': 'b', 'type': 'string'},
   {'name': 'c', 'type': 'string'}],
  function (row, emit) {
    var parsed = JSON.parse(row.blob);
    var repacked = JSON.stringify(parsed.c);
    emit({a: parsed.a, b: parsed.b, c: repacked});
  }
);
And a corresponding query:
SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)
Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.
Row  a  b      c
1    0  zero   {"woodchuck":"a"}
2    1  one    {"chipmunk":"b"}
3    2  two    {"squirrel":"c"}
4    3  three  {"chinchilla":"d"}
5    4  four   {"capybara":"e"}
6    5  five   {"housemouse":"f"}
7    6  six    {"molerat":"g"}
8    7  seven  {"marmot":"h"}
9    8  eight  {"badger":"i"}
One way to do it is to load this file as CSV instead of JSON (and quote the values or eliminate newlines in the middle); then it will become a single STRING column inside BigQuery.
P.S. You are right that having a native JSON data type would have made this scenario much more natural, and the BigQuery team is well aware of it.

Is it possible to turn an array returned by the Mongo GeoNear command (using Ruby/Rails) into a Plucky object?

As a total newbie I have been trying to get the geoNear command working in my Rails application, and it appears to be working fine. The major annoyance for me is that it returns an array with strings rather than keys which I can call on to pull out data.
Having dug around, I understand that MongoMapper uses Plucky to turn the query result into a friendly object which can be handled easily, but I haven't been able to find out how to transform the result of my geoNear query into a Plucky object.
My questions are:
(a) Is it possible to turn this into a Plucky object, and how do I do that?
(b) If it is not possible, how can I most simply and systematically extract each record and each field?
Here is the query in my controller:
@mult = 3963 * (3.14159265 / 180)  # Scale to miles on earth
@results = @db.command({'geoNear' => "places", 'near' => @search.coordinates, 'distanceMultiplier' => @mult, 'spherical' => true})
Here is the object I'm getting back (with document content removed for simplicity):
{"ns"=>"myapp-development.places", "near"=>"1001110101110101100100110001100010100010000010111010", "results"=>[{"dis"=>0.04356444023196527, "obj"=>{"_id"=>BSON::ObjectId('4ee6a7d210a81f05fe000001'),...}}], "stats"=>{"time"=>0, "btreelocs"=>0, "nscanned"=>1, "objectsLoaded"=>1, "avgDistance"=>0.04356444023196527, "maxDistance"=>0.0006301239824196907}, "ok"=>1.0}
Help is much appreciated!!
OK, so let's say you store the results in a variable called places_near:
places_near = t.command( {'geoNear' => "places", 'near'=> [50,50] , 'distanceMultiplier' => 1, 'spherical' => true})
This command returns a hash that has a key (results) which maps to a list of results for the query. The returned document looks like this:
{
  "ns": "test.places",
  "near": "1100110000001111110000001111110000001111110000001111",
  "results": [
    {
      "dis": 69.29646421910687,
      "obj": {
        "_id": ObjectId("4b8bd6b93b83c574d8760280"),
        "y": [1, 1],
        "category": "Coffee"
      }
    },
    {
      "dis": 69.29646421910687,
      "obj": {
        "_id": ObjectId("4b8bd6b03b83c574d876027f"),
        "y": [1, 1]
      }
    }
  ],
  "stats": {
    "time": 0,
    "btreelocs": 1,
    "nscanned": 2,
    "objectsLoaded": 2,
    "avgDistance": 69.29646421910687
  },
  "ok": 1
}
To iterate over the results, just iterate as you would over any list in Ruby:
places_near['results'].each do |result|
  # do stuff with result object
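  # each result is a plain hash, for example:
  #   result['dis']         # distance from the query point
  #   result['obj']         # the matched document
  #   result['obj']['_id']  # its BSON::ObjectId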
end