How to write a de-identification template in Google BigQuery

I am trying to de-identify certain columns of a CSV file in Google Cloud. The CSV file contains 10 columns: ID, FirstName, LastName, D-O-B, etc. I want to mask the FirstName and LastName fields by replacing their characters with the * character.
I read the procedure for writing a de-identification template from this link.
I am trying to mask only the FirstName and LastName fields using record transformations, but I get an array-out-of-bounds error when I run the job.
Do I have to list all the columns in the de-identification template, or only the fields that I need to mask?
The CSV file looks something like this:
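+-----+-----------+----------+-----------+----------+--------------+---------------+-------------------------+----------+
| ID  | FirstName | LastName | D_O_B     | Facility | EncounterNum | EncounterDate | EncounterTime           | visitNum |
+-----+-----------+----------+-----------+----------+--------------+---------------+-------------------------+----------+
| 101 | Sean      | John     | 8/27/1968 | LI       | 333          | 4/8/2016      | 2018-09-02 13:00:00 UTC | 1        |
| 501 | bla       | bla      | 7/13/1947 | LI       | 337          | 3/14/2016     | 2018-09-03 21:05:00 UTC | 67       |
| 851 | Julius    | Caesar   | 8/15/1988 | LI       | 339          | 5/17/2016     | 2018-09-03 21:25:00 UTC | 89       |
+-----+-----------+----------+-----------+----------+--------------+---------------+-------------------------+----------+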
The de-identification template I am using is as follows:
{
"deidentifyTemplate": {
"description": "Record transformation on Names trial",
"deidentifyConfig": {
"recordTransformations": {
"fieldTransformations": [
{
"fields": [
{
"name": "FirstName"
},
{
"name": "LastName"
}
],
"primitiveTransformation": {
"characterMaskConfig": {
"maskingCharacter": "*"
}
}
}
]
}
}
}
}
I expect the output to be a table in BigQuery containing masked FirstName and LastName columns. Instead, I am getting an array-out-of-bounds error.

I am not an expert on the DLP API, but I tried the following de-identify configuration and it worked for me, using the following endpoint for the Cloud DLP API.
{
"item": {
"value": "My name is John Doe and I live nowhere."
},
"inspectConfig": {
"includeQuote": true,
"infoTypes": [
{
"name": "FIRST_NAME"
},
{
"name": "LAST_NAME"
}
]
},
"deidentifyConfig": {
"infoTypeTransformations": {
"transformations": [
{
"infoTypes": [
{
"name": "FIRST_NAME"
},
{
"name": "LAST_NAME"
}
],
"primitiveTransformation": {
"characterMaskConfig": {
"maskingCharacter": "*"
}
}
}
]
}
}
}
Results:
{
"item": {
"value": "My name is **** *** and I live nowhere."
},
"overview": {
"transformedBytes": "7",
"transformationSummaries": [
{
"infoType": {
"name": "FIRST_NAME"
},
"transformation": {
"characterMaskConfig": {
"maskingCharacter": "*"
}
},
"results": [
{
"count": "1",
"code": "SUCCESS"
}
],
"transformedBytes": "4"
},
{
"infoType": {
"name": "LAST_NAME"
},
"transformation": {
"characterMaskConfig": {
"maskingCharacter": "*"
}
},
"results": [
{
"count": "1",
"code": "SUCCESS"
}
],
"transformedBytes": "3"
}
]
}
}
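For comparison with the table in the question, the effect that masking two columns is meant to have can be sketched locally in plain Python. This is only an illustration and does not call Cloud DLP: `mask_columns` is a hypothetical helper, the sample data is a comma-separated stand-in for the question's file, and it assumes every character of the selected fields is replaced.

```python
import csv
import io

def mask_columns(csv_text, columns, masking_character="*"):
    """Return the rows as dicts with every character of the named columns masked."""
    reader = csv.DictReader(io.StringIO(csv_text))
    masked_rows = []
    for row in reader:
        for name in columns:
            # Only the listed columns are touched; all others pass through unchanged.
            row[name] = masking_character * len(row[name])
        masked_rows.append(row)
    return masked_rows

# Hypothetical comma-separated stand-in for a few columns of the question's file.
SAMPLE = (
    "ID,FirstName,LastName,D_O_B\n"
    "101,Sean,John,8/27/1968\n"
    "851,Julius,Caesar,8/15/1988\n"
)

rows = mask_columns(SAMPLE, ["FirstName", "LastName"])
for row in rows:
    print(row)
```

Only the two listed columns are named here, mirroring the intent of listing just FirstName and LastName in the template.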

How to check a particular value on basis of condition in karate

Goal: verify that the check value is correct for the 123S and 123O entries in the API response.
First, look at the value at x.details[0].user.school.name[0].codeable.text; if it is 123S, then x.details[0].data.check must be abc.
Then, if the value at x.details[1].user.school.name[0].codeable.text is 123O, x.details[1].data.check must be xyz.
The order of the array elements in the response is not fixed: the first element is not always 123S, and sometimes the API returns 123O first.
Sample JSON:
{
"type": "1",
"array": 2,
"details": [
{
"path": "path",
"user": {
"school": {
"name": [
{
"value": "this is school",
"codeable": {
"details": [
{
"hello": "yty",
"condition": "check1"
}
],
"text": "123S"
}
}
]
},
"sample": "test1",
"id": "22222"
},
"data": {
"check": "abc"
}
},
{
"path": "path",
"user": {
"school": {
"name": [
{
"value": "this is school",
"codeable": {
"details": [
{
"hello": "def",
"condition": "check2"
}
],
"text": "123O"
}
}
]
},
"sample": "test",
"id": "11111"
},
"data": {
"check": "xyz"
}
}
]
}
This is how I did it in Postman. How can I replicate the same in Karate?
var jsonData = pm.response.json();
pm.test("Body matches string", function () {
for(var i=0;i<jsonData.details.length;i++){
if(jsonData.details[i].user.school.name[0].codeable.text == '123S')
{
pm.expect(jsonData.details[i].data.check).to.equal('abc');
}
if(jsonData.details[i].user.school.name[0].codeable.text == '123O')
{
pm.expect(jsonData.details[i].data.check).to.equal('xyz');
}
}
});
It takes 2 lines, and this takes care of any number of combinations of lookup values :)
* def lookup = { '123S': 'abc', '123O': 'xyz' }
* match each response.details contains { data: { check: '#(lookup[_$.user.school.name[0].codeable.text])' } }
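For readers more familiar with Python than Karate, the same lookup idea can be sketched as follows. This is an illustration only, not Karate code: `check_details` and `make_detail` are hypothetical helpers, and the sketch follows the Postman script in skipping entries whose code is not in the lookup table.

```python
def check_details(response, lookup):
    """Return a list of mismatches; an empty list means every entry passed."""
    mismatches = []
    for index, detail in enumerate(response["details"]):
        code = detail["user"]["school"]["name"][0]["codeable"]["text"]
        expected = lookup.get(code)
        actual = detail["data"]["check"]
        # Codes missing from the lookup are skipped, like the two ifs in Postman.
        if expected is not None and actual != expected:
            mismatches.append((index, code, expected, actual))
    return mismatches

def make_detail(code, check):
    """Build a trimmed-down entry with only the fields the check reads."""
    return {
        "user": {"school": {"name": [{"codeable": {"text": code}}]}},
        "data": {"check": check},
    }

lookup = {"123S": "abc", "123O": "xyz"}
# The order is deliberately swapped to show that position does not matter.
response = {"details": [make_detail("123O", "xyz"), make_detail("123S", "abc")]}
print(check_details(response, lookup))
```

Because the expected value is looked up per element, the check does not depend on which entry the API returns first.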

Error creating protected columns with google sheet create api

I am following the Google Sheets v4 API documentation to create a Google Sheet with protected columns (https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/create).
I am able to create the sheet when I leave protectedRanges out of the request, but including it always gives an error. Below are the request and the response I am getting.
{
"properties": {
"title": "NEW SHEET"
},
"sheets": [
{
"data": [
{
"rowData": [
{
"values": [
{
"userEnteredValue": {
"numberValue": 10
}
},
{
"userEnteredValue": {
"numberValue": 20
}
},
{
"userEnteredValue": {
"numberValue": 30
}
}
]
}
]
}
]
},
{
"protectedRanges": [
{
"description": "Locked columns",
"range": {
"sheetId": 0,
"startColumnIndex": 0,
"endColumnIndex": 2
}
}
]
}
]
}
Response:
{
"error": {
"code": 400,
"message": "Invalid sheets[1].protectedRanges[0]: No grid with id: 0",
"status": "INVALID_ARGUMENT"
}
}
You want to create a new Spreadsheet.
When the new Spreadsheet is created, you want to add the protected ranges.
In your sample, you want to create a new Spreadsheet including a sheet that has the protected columns "A" and "B" and the values 10, 20, 30 in the cells "A1:C1".
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Issue and solution:
In your request body, protectedRanges is placed in a second element of sheets (sheets[1]). Please move it into the first element of sheets (sheets[0]), the same element that contains data.
Please also set the sheet ID explicitly in the properties of that sheet.
Modified request body:
{
"properties": {
"title": "NEW SHEET"
},
"sheets": [
{
"data": [
{
"rowData": [
{
"values": [
{
"userEnteredValue": {
"numberValue": 10
}
},
{
"userEnteredValue": {
"numberValue": 20
}
},
{
"userEnteredValue": {
"numberValue": 30
}
}
]
}
]
}
],
"protectedRanges": [
{
"description": "Locked columns",
"range": {
"startColumnIndex": 0,
"endColumnIndex": 2,
"sheetId": 0
}
}
],
"properties": {
"sheetId": 0
}
}
]
}
For example, when "sheetId": 123 is set, the sheet is created with the sheet ID 123.
You can also test the above request body at "Try this API".
Reference:
Method: spreadsheets.create
If I misunderstood your question and this was not the direction you want, I apologize.
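As a rough sketch of the point above, the request body can be assembled in Python so that the protected range and the sheet always share the same sheet ID. `build_create_body` is a hypothetical helper; it only builds the JSON and does not call the Sheets API.

```python
import json

def build_create_body(title, first_row, sheet_id, locked_start, locked_end):
    """Build a spreadsheets.create body with one sheet and one protected range."""
    values = [{"userEnteredValue": {"numberValue": n}} for n in first_row]
    return {
        "properties": {"title": title},
        "sheets": [
            {
                # The sheet ID is declared here ...
                "properties": {"sheetId": sheet_id},
                "data": [{"rowData": [{"values": values}]}],
                "protectedRanges": [
                    {
                        "description": "Locked columns",
                        "range": {
                            # ... and the same ID is reused by the range.
                            "sheetId": sheet_id,
                            "startColumnIndex": locked_start,
                            "endColumnIndex": locked_end,
                        },
                    }
                ],
            }
        ],
    }

body = build_create_body("NEW SHEET", [10, 20, 30], 0, 0, 2)
print(json.dumps(body, indent=2))
```

Passing the ID once and using it in both places avoids a range that points at a sheet the request never defines.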

ES6: Joining of subqueries to two different rows through the AND operator

I have following index:
+-----+-----+-------+
| oid | tag | value |
+-----+-----+-------+
| 1 | t1 | aaa |
| 1 | t2 | bbb |
| 2 | t1 | aaa |
| 2 | t2 | ddd |
| 2 | t3 | eee |
+-----+-----+-------+
where: oid - object ID, tag - property name, value - property value.
Mappings:
"mappings": {
"document": {
"_all": { "enabled": false },
"properties": {
"oid": { "type": "integer" },
"tag": { "type": "text" },
"value": { "type": "text" }
}
}
}
This simple structure allows storing any number of object properties, and it is quite simple to search by one property, or by several using the OR logical operator.
E.g. get the oids of objects where:
(tag='t1' AND value='aaa') OR (tag='t2' AND value='ddd')
ES query:
{
"_source": { "includes":["oid"] },
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{ "term": { "tag": "t1" } },
{ "term": { "value": "aaa" } }
]
}
},
{
"bool": {
"must": [
{ "term": { "tag": "t2" } },
{ "term": { "value": "ddd" } }
]
}
}
],
"minimum_should_match": "1"
}
}
}
But it is hard to search by two or more properties using the AND logical operator. So the question is how to join two sub-queries that match two different records through the AND operator. E.g. get the oids of objects where:
(tag='t1' AND value='aaa') AND (tag='t2' AND value='ddd')
In this case result must be: { "oid": "2" }
The data being searched is contained in two different records, so applying MUST instead of SHOULD in the previous example returns nothing in this case.
I have two SQL equivalents of what I need:
SELECT i1.[oid]
FROM [index] i1 INNER JOIN [index] i2 ON i1.oid = i2.oid
WHERE
(i1.tag='t1' AND i1.value='aaa')
AND
(i2.tag='t2' AND i2.value='ddd')
---------
SELECT [oid] FROM [index] WHERE tag='t1' AND value='aaa'
INTERSECT
SELECT [oid] FROM [index] WHERE tag='t2' AND value='ddd'
Doing the two requests and merging them on the client is not an option.
The Elasticsearch version is 6.1.1.
In order to achieve what you want, you need to use the nested type, i.e. your mapping should look like this:
PUT my-index
{
"mappings": {
"doc": {
"properties": {
"oid": {
"type": "keyword"
},
"data": {
"type": "nested",
"properties": {
"tag": {
"type": "keyword"
},
"value": {
"type": "text"
}
}
}
}
}
}
}
The documents would be indexed like this:
PUT /my-index/doc/_bulk
{ "index": {"_id": 1}}
{ "oid": 1, "data": [ {"tag": "t1", "value": "aaa"}, {"tag": "t2", "value": "bbb"}] }
{ "index": {"_id": 2}}
{ "oid": 2, "data": [ {"tag": "t1", "value": "aaa"}, {"tag": "t2", "value": "ddd"}, {"tag": "t3", "value": "eee"}] }
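Getting from the flat (oid, tag, value) rows of the question to documents of this shape is a simple grouping step. A minimal sketch, assuming the rows are available as tuples; `rows_to_nested_docs` and `to_bulk_lines` are hypothetical helpers and nothing is sent to Elasticsearch.

```python
import json

def rows_to_nested_docs(rows):
    """Group flat (oid, tag, value) rows into one document per oid."""
    docs = {}
    for oid, tag, value in rows:
        doc = docs.setdefault(oid, {"oid": oid, "data": []})
        doc["data"].append({"tag": tag, "value": value})
    return list(docs.values())

def to_bulk_lines(docs):
    """Render the documents as newline-delimited JSON for a bulk request body."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_id": doc["oid"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

rows = [
    (1, "t1", "aaa"),
    (1, "t2", "bbb"),
    (2, "t1", "aaa"),
    (2, "t2", "ddd"),
    (2, "t3", "eee"),
]
docs = rows_to_nested_docs(rows)
print(to_bulk_lines(docs))
```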
Then you can make your query work like this:
POST my-index/_search
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "data",
"query": {
"bool": {
"filter": [
{
"term": {
"data.tag": "t1"
}
},
{
"term": {
"data.value": "aaa"
}
}
]
}
}
}
},
{
"nested": {
"path": "data",
"query": {
"bool": {
"filter": [
{
"term": {
"data.tag": "t2"
}
},
{
"term": {
"data.value": "ddd"
}
}
]
}
}
}
}
]
}
}
}
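Since the query repeats one nested clause per tag/value pair, it can be generated for any number of pairs. A small sketch that only builds the request body; `nested_pair_clause` and `build_and_query` are hypothetical helper names, and the `data` path matches the mapping above.

```python
import json

def nested_pair_clause(tag, value, path="data"):
    """One nested clause: tag and value must match inside the same nested object."""
    return {
        "nested": {
            "path": path,
            "query": {
                "bool": {
                    "filter": [
                        {"term": {f"{path}.tag": tag}},
                        {"term": {f"{path}.value": value}},
                    ]
                }
            },
        }
    }

def build_and_query(pairs, path="data"):
    """AND together one nested clause per (tag, value) pair."""
    clauses = [nested_pair_clause(tag, value, path) for tag, value in pairs]
    return {"query": {"bool": {"filter": clauses}}}

body = build_and_query([("t1", "aaa"), ("t2", "ddd")])
print(json.dumps(body, indent=2))
```

Each pair gets its own nested clause so that tag and value are matched within the same nested object, while the outer filter requires all pairs to be present in the same document.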
There might be one way, which is a little ugly: adding terms aggregations to your query body.
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{ "term": { "tag": "t1" } },
{ "term": { "value": "aaa" } }
]
}
},
{
"bool": {
"must": [
{ "term": { "tag": "t2" } },
{ "term": { "value": "ddd" } }
]
}
}
],
"minimum_should_match": "1"
}
},
"size": 0,
"aggs": {
"find_joined_oid": {
"terms": {
"field": "oid.keyword"
}
}
}
}
If everything goes right, this will output something like
{
"took": 123,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 123,
"max_score": 0,
"hits": []
},
"aggregations": {
"find_joined_oid": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 1
},
{
"key": "2",
"doc_count": 2
}
]
}
}
}
Here, in the "aggregations" part,
"key": "1"
means your "oid":"1", and
"doc_count": 1
means there is 1 hit in query with "oid":"1".
Since you know how many tag/value pairs you are querying for, say N, only those "key"s in the aggregation result whose "doc_count" equals N are the results you are after. In this example, you are querying tag:t1 (with value aaa) and tag:t2 (with value ddd), thus N=2. You can iterate over the bucket list to find the "key"s whose "doc_count" equals 2.
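The client-side filtering step can be sketched in a few lines of Python. `oids_matching_all` is a hypothetical helper operating on an already parsed response; it assumes each tag/value row is stored at most once per oid, and note that a terms aggregation returns only the top 10 buckets unless a larger size is requested.

```python
def oids_matching_all(response, pair_count, agg_name="find_joined_oid"):
    """Keep the bucket keys whose doc_count equals the number of queried pairs."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets if bucket["doc_count"] == pair_count]

# Trimmed-down version of the example response shown above.
response = {
    "aggregations": {
        "find_joined_oid": {
            "buckets": [
                {"key": "1", "doc_count": 1},
                {"key": "2", "doc_count": 2},
            ]
        }
    }
}

print(oids_matching_all(response, pair_count=2))
```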
However, there should be a better way. If you change your mapping to a document-like style, i.e. store all fields of one oid in one doc, life will be much easier.
{
"properties": {
"oid": { "type": "integer" },
"tag-1": { "type": "text" },
"value-1": { "type": "text" },
"tag-2": { "type": "text" },
"value-2": { "type": "text" }
}
}
When you want to add new tag-value pairs, just get the original doc for the oid concerned, put the new tag-value pair into the doc, and put the whole new doc back into Elasticsearch with the same _id that you got from the original one. Most of the time dynamic mapping will work properly in your case, which means you don't need to declare the mapping for new fields explicitly.
NoSQL databases like Elasticsearch are not designed to handle the kind of SQL-style query you are asking about.

hierarchical faceting with Elasticsearch

I'm using Elasticsearch and need to implement faceted search for hierarchical objects as follows:
category 1 (10)
subcategory 1 (4)
subcategory 2 (6)
category 2 (X)
...
So I need to get facets for two related objects. The documentation says that it's possible to get this kind of facet for numeric values, but I need it for strings: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-stats-facet.html
Here is another interesting topic; unfortunately, it's old: http://elasticsearch-users.115913.n3.nabble.com/Pivot-facets-td2981519.html
Is it possible with Elasticsearch?
If so, how can I do that?
The previous solution works really well as long as each document has no more than one multi-level tag. When a single document has several, a simple aggregation doesn't work, because the flat structure of the Lucene fields mixes the results in the inner aggregation.
See the example below:
DELETE /test_category
POST /test_category
# Insert a doc with 2 hierarchical tags
POST /test_category/test/1
{
"categories": [
{
"cat_1": "1",
"cat_2": "1.1"
},
{
"cat_1": "2",
"cat_2": "2.2"
}
]
}
# Simple two-levels aggregations query
GET /test_category/test/_search?search_type=count
{
"aggs": {
"main_category": {
"terms": {
"field": "categories.cat_1"
},
"aggs": {
"sub_category": {
"terms": {
"field": "categories.cat_2"
}
}
}
}
}
}
That's the WRONG response that I got on ES 1.4, where the fields in the inner aggregation are mixed at the document level:
{
...
"aggregations": {
"main_category": {
"buckets": [
{
"key": "1",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "1.1",
"doc_count": 1
},
{
"key": "2.2", <= WRONG
"doc_count": 1
}
]
}
},
{
"key": "2",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "1.1", <= WRONG
"doc_count": 1
},
{
"key": "2.2",
"doc_count": 1
}
]
}
}
]
}
}
}
A solution can be to use nested objects. These are the steps:
1) Define a new type in the schema with nested objects
POST /test_category/test2/_mapping
{
"test2": {
"properties": {
"categories": {
"type": "nested",
"properties": {
"cat_1": {
"type": "string"
},
"cat_2": {
"type": "string"
}
}
}
}
}
}
# Insert a single document
POST /test_category/test2/1
{"categories":[{"cat_1":"1","cat_2":"1.1"},{"cat_1":"2","cat_2":"2.2"}]}
2) Run a nested aggregation query:
GET /test_category/test2/_search?search_type=count
{
"aggs": {
"categories": {
"nested": {
"path": "categories"
},
"aggs": {
"main_category": {
"terms": {
"field": "categories.cat_1"
},
"aggs": {
"sub_category": {
"terms": {
"field": "categories.cat_2"
}
}
}
}
}
}
}
}
That's the response, now correct, that I got:
{
...
"aggregations": {
"categories": {
"doc_count": 2,
"main_category": {
"buckets": [
{
"key": "1",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "1.1",
"doc_count": 1
}
]
}
},
{
"key": "2",
"doc_count": 1,
"sub_category": {
"buckets": [
{
"key": "2.2",
"doc_count": 1
}
]
}
}
]
}
}
}
}
The same solution can be extended to a facet hierarchy with more than two levels.
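Why the plain object mapping mixes sub-categories while the nested mapping does not can be imitated in plain Python. This is only a simplified model of the behavior described above, not Elasticsearch code: `flattened_buckets` treats each field as one bag of values per document, while `nested_buckets` keeps each cat_1/cat_2 pair together.

```python
from collections import defaultdict

def flattened_buckets(docs):
    """Model of object arrays: each field becomes one bag of values per document."""
    result = defaultdict(set)
    for doc in docs:
        cat_1_values = {c["cat_1"] for c in doc["categories"]}
        cat_2_values = {c["cat_2"] for c in doc["categories"]}
        for cat_1 in cat_1_values:
            # Every cat_2 of the document lands under every cat_1 of the document.
            result[cat_1] |= cat_2_values
    return {key: sorted(values) for key, values in result.items()}

def nested_buckets(docs):
    """Model of nested objects: each cat_1/cat_2 pair stays together."""
    result = defaultdict(set)
    for doc in docs:
        for c in doc["categories"]:
            result[c["cat_1"]].add(c["cat_2"])
    return {key: sorted(values) for key, values in result.items()}

docs = [
    {"categories": [{"cat_1": "1", "cat_2": "1.1"}, {"cat_1": "2", "cat_2": "2.2"}]}
]
print(flattened_buckets(docs))
print(nested_buckets(docs))
```

The first result reproduces the mixed buckets of the WRONG response, and the second matches the corrected one.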
Currently, Elasticsearch does not support hierarchical faceting out of the box. But the upcoming 1.0 release features a new aggregations module that can be used to get this kind of facet (which are more like pivot facets than hierarchical facets). Version 1.0 is currently in beta; you can download the second beta and test out aggregations yourself. Your example might look like
curl -XPOST 'localhost:9200/_search?pretty' -d '
{
"aggregations": {
"main category": {
"terms": {
"field": "cat_1",
"order": {"_term": "asc"}
},
"aggregations": {
"sub category": {
"terms": {
"field": "cat_2",
"order": {"_term": "asc"}
}
}
}
}
}
}'
The idea is to have a different field for each level of faceting and bucket your facets based on the terms of the first level (cat_1). These aggregations would then have sub-buckets, based on the terms of the second level (cat_2). The result may look like
{
"aggregations" : {
"main category" : {
"buckets" : [ {
"key" : "category 1",
"doc_count" : 10,
"sub category" : {
"buckets" : [ {
"key" : "subcategory 1",
"doc_count" : 4
}, {
"key" : "subcategory 2",
"doc_count" : 6
} ]
}
}, {
"key" : "category 2",
"doc_count" : 7,
"sub category" : {
"buckets" : [ {
"key" : "subcategory 1",
"doc_count" : 3
}, {
"key" : "subcategory 2",
"doc_count" : 4
} ]
}
} ]
}
}
}
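Turning a response of this shape into the indented facet list from the question is a short loop. A minimal sketch over an already parsed response; `render_facets` is a hypothetical helper and the aggregation names are the ones used in the example above.

```python
def render_facets(response, main="main category", sub="sub category"):
    """Render two-level aggregation buckets as an indented 'name (count)' list."""
    lines = []
    for bucket in response["aggregations"][main]["buckets"]:
        lines.append(f"{bucket['key']} ({bucket['doc_count']})")
        for child in bucket[sub]["buckets"]:
            lines.append(f"  {child['key']} ({child['doc_count']})")
    return "\n".join(lines)

# First category of the example result shown above.
response = {
    "aggregations": {
        "main category": {
            "buckets": [
                {
                    "key": "category 1",
                    "doc_count": 10,
                    "sub category": {
                        "buckets": [
                            {"key": "subcategory 1", "doc_count": 4},
                            {"key": "subcategory 2", "doc_count": 6},
                        ]
                    },
                }
            ]
        }
    }
}

print(render_facets(response))
```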

ElasticSearch: filtering documents based on field length?

Is there a way to filter ElasticSearch documents based on the length of a specific field?
For instance, I have a bunch of documents with the field "body", and I only want to return results where the number of characters in body is > 1000. Is there a way to do this in ES without having to add an extra column with the length in the index?
Use the script filter, like this:
"filtered" : {
"query" : {
...
},
"filter" : {
"script" : {
"script" : "doc['body'].length > 1000"
}
}
}
EDIT
Sorry, meant to reference the query DSL guide on script filters
You can also create a custom tokenizer and use it in a multi-fields property, as in the following:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"character_analyzer": {
"type": "custom",
"tokenizer": "character_tokenizer"
}
},
"tokenizer": {
"character_tokenizer": {
"type": "nGram",
"min_gram": 1,
"max_gram": 1
}
}
}
},
"mappings": {
"person": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"words_count": {
"type": "token_count",
"analyzer": "standard"
},
"length": {
"type": "token_count",
"analyzer": "character_analyzer"
}
}
}
}
}
}
}
PUT test_index/person/1
{
"name": "John Smith"
}
PUT test_index/person/2
{
"name": "Rachel Alice Williams"
}
GET test_index/person/_search
{
"query": {
"term": {
"name.length": 10
}
}
}
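The original question asked for fields longer than a threshold rather than an exact length. Assuming the `token_count` sub-field can be queried with a range query like other numeric fields, the request body could be built as in the sketch below. `length_range_query` is a hypothetical helper that only constructs the JSON; it does not contact a cluster.

```python
import json

def length_range_query(length_field, more_than):
    """Build a search body matching documents whose token count exceeds more_than."""
    return {"query": {"range": {length_field: {"gt": more_than}}}}

# With one token per character, the token count of "John Smith" equals its
# character count (space included), consistent with the term query for 10 above.
assert len("John Smith") == 10

body = length_range_query("name.length", 1000)
print(json.dumps(body, indent=2))
```

For the question's case the field would be the corresponding sub-field of body, for example `body.length` if the mapping is set up the same way.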