How to prepare Google Natural Language Proscessing output (json) for Big Query - sql

I'm trying to query the output of a Natural Language Processing (NLP) call in Big Query (BQ) but I'm struggling to get the output in the right format for BQ.
I understand that BQ takes json files (as newline delimited) - but just not sure that (a) the output of NLP is json newline delimited and (b) if my schema is correct.
Here's the json output I'm working with:
{
"entities": [
{
"name": "Rowling",
"type": "PERSON",
"metadata": {
"wikipedia_url": "http://en.wikipedia.org/wiki/J._K._Rowling"
},
"salience": 0.65751493,
"mentions": [
{
"text": {
"content": " J.",
"beginOffset": -1
}
},
{
"text": {
"content": "K. Rowl",
"beginOffset": -1
}
}
]
},
{
"name": "LONDON",
"type": "LOCATION",
"metadata": {
"wikipedia_url": "http://en.wikipedia.org/wiki/London"
},
"salience": 0.14284456,
"mentions": [
{
"text": {
"content": "\ufeffLON",
"beginOffset": -1
}
}
]
},
{
"name": "Harry Potter",
"type": "WORK_OF_ART",
"metadata": {
"wikipedia_url": "http://en.wikipedia.org/wiki/Harry_Potter"
},
"salience": 0.0726779,
"mentions": [
{
"text": {
"content": "th Harry Pot",
"beginOffset": -1
}
},
{
"text": {
"content": "‘Harry Pot",
"beginOffset": -1
}
}
]
},
{
"name": "Deathly Hallows",
"type": "WORK_OF_ART",
"metadata": {
"wikipedia_url": "http://en.wikipedia.org/wiki/Harry_Potter_and_the_Deathly_Hallows"
},
"salience": 0.022565609,
"mentions": [
{
"text": {
"content": "he Deathly Hall",
"beginOffset": -1
}
}
]
}
],
"language": "en"
}
Is there a way to send the output directly to big query via the command line in Google Cloud shell?
Any information would be greatly appreciated!
Thanks

Glad you found my Harry Potter blog post! I'd recommend storing the NL API's JSON response as a string in BigQuery and then using a user-defined function to query it. You should be able to run the following (the table is publicly viewable) to get a count of how often each entity appears in the JSON you posted:
SELECT
COUNT(*) as entity_count, entity
FROM
JS(
(SELECT entities FROM [sara-bigquery:samples.hp_udf]),
entities,
"[{ name: 'entity', type: 'string'}]",
"function(row, emit) {
try {
x = JSON.parse(row.entities);
entities = x['entities'];
entities.forEach(function(data) {
emit({ entity: data.name });
});
} catch (e) {}
}"
)
GROUP BY entity
ORDER BY entity_count DESC

send the output directly to big query via the command line in Google Cloud shell
Look at this page, and search for "bq load"
https://cloud.google.com/bigquery/bq-command-line-tool
Here they have some example about json schema.
Schema to load json data to google big query

Related

How to check a particular value in karate on basis of condition when there are more than one key in response

There are multiple keys present in x.details[0].user so how to compare in such condition. It works fine where there is only one key while when there are more than one key it fails with error as there are multiple keys for user.
Please guide
* def array =
"""
{
"type": "1",
"array": 2,
"details": [
{
"path": "path",
"user": {
"school": {
"name": [
{
"value": "this is school",
"codeable": {
"details": [
{
"hello": "yty",
"condition": "check1"
}
],
"text": "123S"
}
}
]
},
"sample": "test1",
"id": "22222"
},
"data": {
"check": "abc"
}
},
{
"path": "path",
"user": {
"school": {
"name": [
{
"value": "this is school",
"codeable": {
"details": [
{
"hello": "def",
"condition": "check2"
}
],
"text": "123O"
}
}
]
},
"sample": "test",
"id": "11111"
},
"data": {
"check": "xyz"
}
}
]
}
"""
* def lookup = { 'check1': 'yty', 'check2': 'def' }
* match each array.details contains { user: { school: { name[0]: { codeable: { details[0]: { hello: '#(lookup[_$.user.school.name[0].codeable.details[0].condition])' } } } } } }
I have tried multiple ways but not sure how to make it work when there are more than one keys getting returned.
First let me say this. Please, please do not try to be "too clever" in your tests. Read this please: https://stackoverflow.com/a/54126724/143475
Just have simple tests that test for specific, predictable responses and stop there. The maximum complexity should be match contains, that's it. What you are trying to do is in my opinion a waste of time, and will just cause grief for anyone who tries to maintain these tests in future.
That said, I'm giving you a completely different approach here. There are a few different ways, this is just one. Here we attack the problem piece by piece instead of trying to write one massive match.
* def allXyz = array.details.filter(x => x.data.check == 'xyz')
* match each allXyz..details == [{ hello: 'def', condition: 'check2' }]
You should be able to extend this to do all the weird assertions that you want. If you are wondering what allXyz..details does, you can print it like this:
* def temp = $allXyz..details
* print temp

Unexpected behavior of ARRAY_SLICE in Cosmos Db SQL API

I have Cosmos DB collection (called sample) containing the following documents:
[
{
"id": "id1",
"messages": [
{
"messageId": "message1",
"Text": "Value1"
},
{
"messageId": "message2",
"Text": "Value2"
}
]
},
{
"id": "id2",
"messages": [
{
"messageId": "message3",
"Text": "Value3"
},
{
"messageId": "message4",
"Text": "Value1"
}
]
},
{
"id": "id3",
"messages": [
{
"messageId": "message5",
"Text": "Value1"
},
{
"messageId": "message6",
"Text": "Value2"
}
]
},
{
"id": "id4",
"messages": [
{
"messageId": "message7",
"Text": "Value5"
},
{
"messageId": "message8",
"Text": "Value2"
}
]
},
]
I am trying to retrieve all the Documents, having messages and the first message has the field "Text"= 'Value1'.
In this sample the documents with the ids '1' and '3' would be retrieved. Please notice that the document with id='id2' wouldn't be retrieved,
since the value of the text of the first message is 'Value3'.
The collection as mentioned is called sample and I am running the following Query:
"select sample.id, sample.messages, ARRAY_SLICE(sample.messages, 0, 1)[0].Text as valueOfText from sample"
As you can see in the first two images, I retrieve all Documents and every one of them have the field "valueOfText" set to value of the first message, as expected.
Now when I filter the collection (the third image), I retrieve no results at all.
Is this an expected behavior?
Following your sql, got same results:
But why you have to use ARRAY_SLICE,it is used to return truncated array.Since your requirement is specific:
trying to retrieve all the Documents, having messages and the first
message has the field "Text"= 'Value1'
Just use sql:
SELECT c.id,c.messages,c.messages[0].Text as valueOfText FROM c
where c.messages[0].Text = 'Value1'
Output:

How to correctly validate array of objects using JustinRainbow/JsonSchema

I have code that correctly validates an article returned from an endpoint that returns single articles. I'm pretty sure it's working correctly as it gives a validation error when I deliberately don't include a required field in the article.
I also have this code that tries to validate an array of articles returned from an endpoint that returns an array of articles. However, I'm pretty sure that isn't working correctly, as it always says the data is valid, even when I deliberately don't include a required field in the articles.
How do I correctly validate an array of data against the schema?
The full test code is below as a standalone runnable test. Both of the tests should fail, however only one of them does.
<?php
declare(strict_types=1);
error_reporting(E_ALL);
require_once __DIR__ . '/vendor/autoload.php';
// Return the definition of the schema, either as an array
// or a PHP object
function getSchema($asArray = false)
{
$schemaJson = <<< 'JSON'
{
"swagger": "2.0",
"info": {
"termsOfService": "http://swagger.io/terms/",
"version": "1.0.0",
"title": "Example api"
},
"paths": {
"/articles": {
"get": {
"tags": [
"article"
],
"summary": "Find all articles",
"description": "Returns a list of articles",
"operationId": "getArticleById",
"produces": [
"application/json"
],
"responses": {
"200": {
"description": "successful operation",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/Article"
}
}
}
},
"parameters": [
]
}
},
"/articles/{articleId}": {
"get": {
"tags": [
"article"
],
"summary": "Find article by ID",
"description": "Returns a single article",
"operationId": "getArticleById",
"produces": [
"application/json"
],
"parameters": [
{
"name": "articleId",
"in": "path",
"description": "ID of article to return",
"required": true,
"type": "integer",
"format": "int64"
}
],
"responses": {
"200": {
"description": "successful operation",
"schema": {
"$ref": "#/definitions/Article"
}
}
}
}
}
},
"definitions": {
"Article": {
"type": "object",
"required": [
"id",
"title"
],
"properties": {
"id": {
"type": "integer",
"format": "int64"
},
"title": {
"type": "string",
"description": "The title for the link of the article"
}
}
}
},
"schemes": [
"http"
],
"host": "example.com",
"basePath": "/",
"tags": [],
"securityDefinitions": {
},
"security": [
{
"ApiKeyAuth": []
}
]
}
JSON;
return json_decode($schemaJson, $asArray);
}
// Extract the schema of the 200 response of an api endpoint.
function getSchemaForPath($path)
{
$swaggerData = getSchema(true);
if (isset($swaggerData["paths"][$path]['get']["responses"][200]['schema']) !== true) {
echo "response not defined";
exit(-1);
}
return $swaggerData["paths"][$path]['get']["responses"][200]['schema'];
}
// JsonSchema needs to know about the ID used for the top-level
// schema apparently.
function aliasSchema($prefix, $schemaForPath)
{
$aliasedSchema = [];
foreach ($schemaForPath as $key => $value) {
if ($key === '$ref') {
$aliasedSchema[$key] = $prefix . $value;
}
else if (is_array($value) === true) {
$aliasedSchema[$key] = aliasSchema($prefix, $value);
}
else {
$aliasedSchema[$key] = $value;
}
}
return $aliasedSchema;
}
// Test the data matches the schema.
function testDataMatches($endpointData, $schemaForPath)
{
// Setup the top level schema and get a validator from it.
$schemaStorage = new \JsonSchema\SchemaStorage();
$id = 'file://example';
$swaggerClass = getSchema(false);
$schemaStorage->addSchema($id, $swaggerClass);
$factory = new \JsonSchema\Constraints\Factory($schemaStorage);
$jsonValidator = new \JsonSchema\Validator($factory);
// Alias the schema for the endpoint, so JsonSchema can work with it.
$schemaForPath = aliasSchema($id, $schemaForPath);
// Validate the things
$jsonValidator->check($endpointData, (object)$schemaForPath);
// Process the result
if ($jsonValidator->isValid()) {
echo "The supplied JSON validates against the schema definition: " . \json_encode($schemaForPath) . " \n";
return;
}
$messages = [];
$messages[] = "End points does not validate. Violations:\n";
foreach ($jsonValidator->getErrors() as $error) {
$messages[] = sprintf("[%s] %s\n", $error['property'], $error['message']);
}
$messages[] = "Data: " . \json_encode($endpointData, JSON_PRETTY_PRINT);
echo implode("\n", $messages);
echo "\n";
}
// We have two data sets to test. A list of articles.
$articleListJson = <<< JSON
[
{
"id": 19874
},
{
"id": 19873
}
]
JSON;
$articleListData = json_decode($articleListJson);
// A single article
$articleJson = <<< JSON
{
"id": 19874
}
JSON;
$articleData = json_decode($articleJson);
// This passes, when it shouldn't as none of the articles have a title
testDataMatches($articleListData, getSchemaForPath("/articles"));
// This fails correctly, as it is correct for it to fail to validate, as the article doesn't have a title
testDataMatches($articleData, getSchemaForPath("/articles/{articleId}"));
The minimal composer.json is:
{
"require": {
"justinrainbow/json-schema": "^5.2"
}
}
Edit-2: 22nd May
I have been digging further turns out that the issue is because of your top level conversion to object
$jsonValidator->check($endpointData, (object)$schemaForPath);
You shouldn't have just done that and it would have all worked
$jsonValidator->check($endpointData, $schemaForPath);
So it doesn't seem to be a bug it was just a wrong usage. If you just remove (object) and run the code
$ php test.php
End points does not validate. Violations:
[[0].title] The property title is required
[[1].title] The property title is required
Data: [
{
"id": 19874
},
{
"id": 19873
}
]
End points does not validate. Violations:
[title] The property title is required
Data: {
"id": 19874
}
Edit-1
To fix the original code you would need to update the CollectionConstraints.php
/**
* Validates the items
*
* #param array $value
* #param \stdClass $schema
* #param JsonPointer|null $path
* #param string $i
*/
protected function validateItems(&$value, $schema = null, JsonPointer $path = null, $i = null)
{
if (is_array($schema->items) && array_key_exists('$ref', $schema->items)) {
$schema->items = $this->factory->getSchemaStorage()->resolveRefSchema((object)$schema->items);
var_dump($schema->items);
};
if (is_object($schema->items)) {
This will handle your use case for sure but if you don't prefer changing code from the dependency then use my original answer
Original Answer
The library has a bug/limitation that in src/JsonSchema/Constraints/CollectionConstraint.php they don't resolve a $ref variable as such. If I updated your code like below
// Alias the schema for the endpoint, so JsonSchema can work with it.
$schemaForPath = aliasSchema($id, $schemaForPath);
if (array_key_exists('items', $schemaForPath))
{
$schemaForPath['items'] = $factory->getSchemaStorage()->resolveRefSchema((object)$schemaForPath['items']);
}
// Validate the things
$jsonValidator->check($endpointData, (object)$schemaForPath);
and run it again, I get the exceptions needed
$ php test2.php
End points does not validate. Violations:
[[0].title] The property title is required
[[1].title] The property title is required
Data: [
{
"id": 19874
},
{
"id": 19873
}
]
End points does not validate. Violations:
[title] The property title is required
Data: {
"id": 19874
}
You either need to fix the CollectionConstraint.php or open an issue with developer of the repo. Or else manually replace your $ref in the whole schema, like had shown above. My code will resolve the issue specific to your schema, but fixing any other schema should not be a big issue
EDIT: Important thing here is that provided schema document is instance of Swagger Schema, which employs extended subset of JSON Schema to define some cases of request and response. Swagger 2.0 Schema itself can be validated by its JSON Schema, but it can not act as a JSON Schema for API Response structure directly.
In case entity schema is compatible with standard JSON Schema you can perform validation with general purpose validator, but you have to provide all relevant definitions, it can be easy when you have absolute references, but more complicated for local (relative) references that start with #/. IIRC they must be defined in the local schema.
The problem here is that you are trying to use schema references detached from resolution scope. I've added id to make references absolute, therefore not requiring being in scope.
"$ref": "http://example.com/my-schema#/definitions/Article"
The code below works well.
<?php
require_once __DIR__ . '/vendor/autoload.php';
$swaggerSchemaData = json_decode(<<<'JSON'
{
"id": "http://example.com/my-schema",
"swagger": "2.0",
"info": {
"termsOfService": "http://swagger.io/terms/",
"version": "1.0.0",
"title": "Example api"
},
"paths": {
"/articles": {
"get": {
"tags": [
"article"
],
"summary": "Find all articles",
"description": "Returns a list of articles",
"operationId": "getArticleById",
"produces": [
"application/json"
],
"responses": {
"200": {
"description": "successful operation",
"schema": {
"type": "array",
"items": {
"$ref": "http://example.com/my-schema#/definitions/Article"
}
}
}
},
"parameters": [
]
}
},
"/articles/{articleId}": {
"get": {
"tags": [
"article"
],
"summary": "Find article by ID",
"description": "Returns a single article",
"operationId": "getArticleById",
"produces": [
"application/json"
],
"parameters": [
{
"name": "articleId",
"in": "path",
"description": "ID of article to return",
"required": true,
"type": "integer",
"format": "int64"
}
],
"responses": {
"200": {
"description": "successful operation",
"schema": {
"$ref": "http://example.com/my-schema#/definitions/Article"
}
}
}
}
}
},
"definitions": {
"Article": {
"type": "object",
"required": [
"id",
"title"
],
"properties": {
"id": {
"type": "integer",
"format": "int64"
},
"title": {
"type": "string",
"description": "The title for the link of the article"
}
}
}
},
"schemes": [
"http"
],
"host": "example.com",
"basePath": "/",
"tags": [],
"securityDefinitions": {
},
"security": [
{
"ApiKeyAuth": []
}
]
}
JSON
);
$schemaStorage = new \JsonSchema\SchemaStorage();
$schemaStorage->addSchema('http://example.com/my-schema', $swaggerSchemaData);
$factory = new \JsonSchema\Constraints\Factory($schemaStorage);
$validator = new \JsonSchema\Validator($factory);
$schemaData = $swaggerSchemaData->paths->{"/articles"}->get->responses->{"200"}->schema;
$data = json_decode('[{"id":1},{"id":2,"title":"Title2"}]');
$validator->validate($data, $schemaData);
var_dump($validator->isValid()); // bool(false)
$data = json_decode('[{"id":1,"title":"Title1"},{"id":2,"title":"Title2"}]');
$validator->validate($data, $schemaData);
var_dump($validator->isValid()); // bool(true)
I'm not sure I fully understand your code here, but I have an idea based on some assumptions.
Assuming $typeForEndPointis the schema you're using for validation, your item key word needs to be an object rather than an array.
The items key word can be an array or an object. If it's an object, that schema is applicable to every item in the array. If it is an array, each item in that array is applicable to the item in the same position as the array being validated.
This means you're only validating the first item in the array.
If "items" is a schema, validation succeeds if all elements in the
array successfully validate against that schema.
If "items" is an array of schemas, validation succeeds if each element
of the instance validates against the schema at the same position, if
any.
https://datatracker.ietf.org/doc/html/draft-handrews-json-schema-validation-01#section-6.4.1
jsonValidator don't like mixed of object and array association,
You can use either:
$jsonValidator->check($endpointData, $schemaForPath);
or
$jsonValidator->check($endpointData, json_decode(json_encode($schemaForPath)));

How to query microsoft academic graph for citation and co-citation?

Reading through:
https://www.microsoft.com/cognitive-services/en-us/Academic-Knowledge-API/documentation/GraphSearchMethod
It is a bit obscure the meaning of "path":
"path": "/paper/AuthorIDs/author" - I don't see authorIds object in the returned results.
# post data query
{
"path": "/paper/AuthorIDs/author",
"paper": {
"type": "Paper",
"NormalizedTitle": "graph engine",
"select": [
"OriginalTitle"
]
},
"author": {
"return": {
"type": "Author",
"Name": "bin shao"
}
}
}
#results
{
"Results": [
[
{
"CellID": 2160459668,
"OriginalTitle": "Trinity: a distributed graph engine on a memory cloud"
},
{
"CellID": 2093502026
}
],
[
{
"CellID": 2171539317,
"OriginalTitle": "A distributed graph engine for web scale RDF data"
},
{
"CellID": 2093502026
}
],
[
{
"CellID": 2411554868,
"OriginalTitle": "A distributed graph engine for web scale RDF data"
},
{
"CellID": 2093502026
}
],
[
{
"CellID": 73304046,
"OriginalTitle": "The Trinity graph engine"
},
{
"CellID": 2093502026
}
]
]
}
Which is the correct path (or data to post) to query for citation and co-citation of an article, and paginate results?
You will find AuthorIDs on the graph schema from Microsoft Academic Search:
Assuming you know the ID of the source paper (2118322263 in the following example), here is the POST part of the request:
{
"path": "/paper/CitationIDs/citation",
"paper": {
"type": "Paper",
"id": [ 2118322263 ],
"select": [
"OriginalTitle"
]
},
"citation": {
"return": {
"type": "Paper"
},
"select": [
"OriginalTitle"
]
}
}
This returns 634 results in one response, while a query to the paper itself shows a citation count of 732. I have no idea why there is a difference, nor how to do pagination.

Finding similar documents with Elasticsearch

I'm using ElasticSearch to develop service that will store uploaded files or web pages as attachment (file is one field in document). This part works fine as I can search these files using like_text as input. However, the second part of this service should compare the file that is just uploaded with the existing files in order to find duplicates or very similar files, so it doesn't recommend users same files or same web pages. The problem is that I can't get expected results for documents that are the same. Similarity between same files varies, but is never more then 0.4. Even worse, sometimes I get better scores for files which are not the same then for two exactly the same files. The java code give bellow gives me always the set of documents which are in the same order, regardless of the input. It looks like like_text extracted from uploaded file is always the same.
String mapping = copyToStringFromClasspath("/org/prosolo/services/indexing/documents- mapping.json");
byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);
Client client = ElasticSearchFactory.getClient();
client.admin().indices().putMapping(putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();
IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
.source(jsonBuilder()
.startObject()
.field("file", txt)
.field("title",title)
.field("visibility",visibilityType.name().toLowerCase())
.field("ownerId",ownerId)
.field("description",description)
.field("contentType",DocumentType.DOCUMENT.name().toLowerCase())
.field("dateCreated",dateCreated)
.field("url",link)
.field("relatedToType",relatedToType)
.field("relatedToId",relatedToId)
.endObject()))
.actionGet();
client.admin().indices().refresh(refreshRequest()).actionGet();
MoreLikeThisRequestBuilder mltRequestBuilder=new MoreLikeThisRequestBuilder(client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
mltRequestBuilder.setField("file");
SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();
SearchHits searchHits= response.getHits();
System.out.println("getTotalHits:"+searchHits.getTotalHits());
Iterator<SearchHit> hitsIter=searchHits.iterator();
while(hitsIter.hasNext()){
SearchHit searchHit=hitsIter.next();
System.out.println("FOUND DOCUMENT:"+searchHit.getId()+" title:"+searchHit.getSource().get("title")+" score:"+searchHit.score());
}
And the query from browser which looks like:
http://localhost:9200/documents/document/m2HZM3hXS1KFHOwvGY1pVQ/_mlt?mlt_fields=file&min_doc_freq=1
Gives me results:
{"took":120,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":4,
"max_score":0.41059873,
"hits":
[{"_index":"documents","_type":"document",
"_id":"gIe6NDEWRXWTMi4kMPRbiQ",
"_score":0.41059873,
"_source" :
{"file":"PCFET0NUWVBFIGh..._skiping_the_file_content_here...",
"title":"Univariate Analysis",
"visibility":"public",
"description":"Univariate Analysis Simple Tools for Description ",
"contentType":"webpage",
"dateCreated":"null",
"url":"http://www.slideshare.net/christineshearer/univariate-analysis"}}
This is exactly the same web page, so I'm expecting the score to be 1.0, not 0.41 as there is not difference between two documents except in _id. The results are even worse with files.
Mapping I was using is:
{
"document":{
"properties":{
"title":{
"type":"string",
"store":true
},
"description":{
"type":"string",
"store":"yes"
},
"contentType":{
"type":"string",
"store":"yes"
},
"dateCreated":{
"store":"yes",
"type":"date"
},
"url":{
"store":"yes",
"type":"string"
},
"visibility": {
"store":"yes",
"type":"string"
},
"ownerId": {
"type": "long",
"store":"yes"
},
"relatedToType": {
"type": "string",
"store":"yes"
},
"relatedToId": {
"type": "long",
"store":"yes"
},
"file":{
"path": "full",
"type":"attachment",
"fields":{
"author": {
"type": "string"
},
"title": {
"store": true,
"type": "string"
},
"keywords": {
"type": "string"
},
"file": {
"store": true,
"term_vector": "with_positions_offsets",
"type": "string"
},
"name": {
"type": "string"
},
"content_length": {
"type": "integer"
},
"date": {
"format": "dateOptionalTime",
"type": "date"
},
"content_type": {
"type": "string"
}
} } } } }
Does anyone have an idea what could be wrong here?
Thanks