import.io JSON API: get the list of columns, with subfields

I'm using the import.io API and have noticed that some field types return several columns in the generated JSON. For instance, a field foo of type Money will return three columns: foo, foo/_currency and foo/_source.
Is there a reference somewhere? I found some documentation at http://blog.import.io/post/11-columns-of-importio, though it only gives an incomplete example:
{
  "whole_number_field": 123,
  "whole_number_field/_source": "123",
  "language_field": "ben",
  "language_field/_source": "bn",
  "country_field": "CHN",
  "country_field/_source": "China",
  "boolean_field": false,
  "boolean_field/_source": "false",
  "currency_field/_currency": "GBP",
  "currency_field/_source": "£123.45",
  "link_field": "http://chris-alexander.co.uk",
  "link_field/_text": "Blog",
  "link_field/_title": "linktitle",
  "datetime_field": 611368440000,
  "datetime_field/_source": "17/05/89 12:34",
  "datetime_field/_utc": "Wed May 17 00:34:00 GMT 1989",
  "image_field": "http://io.chris-alexander.co.uk/gif2.gif",
  "image_field/_alt": "imgalt",
  "image_field/_title": "imgtitle",
  "image_field/_source": "gif2.gif"
}

The columns are documented in the API docs:
http://api.docs.import.io/
For example, for currency, the columns are:
myvar <== Extracted value
myvar/_currency <== ISO currency code
myvar/_source <== Original value
The ISO currency code is returned in myvar/_currency, and the numeric value in myvar.
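As a quick illustration, such a row could be regrouped like this in Python (the row dict is a made-up example shaped like the columns above, not actual API output):

# Hypothetical flattened record using the three currency columns
row = {"myvar": 123.45, "myvar/_currency": "GBP", "myvar/_source": "£123.45"}

money = {
    "value": row["myvar"],               # extracted numeric value
    "currency": row["myvar/_currency"],  # ISO currency code
    "source": row["myvar/_source"],      # original scraped text
}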

I established this through several tests; I'd like to know if I'm missing something:
{
    'DATE': ['_source', '_utc'],
    # please tell me if you have an example of an import.io API with a date!
    'BOOLEAN': ['_source'],
    'LANG': ['_source'],
    'COUNTRY': ['_source'],
    'HTML': [],
    'STRING': [],
    'URL': ['_text', '_source', '_title'],
    'IMAGE': ['_alt', '_title', '_source'],
    'DOUBLE': ['_source'],
    'CURRENCY': ['_currency', '_source'],
}
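Building on that mapping, here is a small sketch that regroups any flattened record by splitting the column names on '/'. It assumes each record is a plain dict like the JSON example above (the helper name is mine, not part of the API):

def group_subfields(record):
    """Regroup {'foo': 1, 'foo/_source': '1'} into {'foo': {'value': 1, '_source': '1'}}."""
    grouped = {}
    for key, value in record.items():
        base, sep, sub = key.partition("/")
        slot = grouped.setdefault(base, {})
        if sep:
            slot[sub] = value      # subfield such as _source, _currency, _utc
        else:
            slot["value"] = value  # the extracted value itself
    return grouped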

Related

AWS Glue / Hive struct with undetermined struct

I'm adding data to an AWS Glue table where one of the columns is a struct in which one of the values has an undetermined form.
More specifically, there's a known key called 'name' that is a string, and another called 'metadata' that can be a dict with any structure.
Ex:
# Row 1
{
  "name": "Jane",
  "metadata": {
    "foo": 123,
    "bar": "something"
  }
}
# Row 2
{
  "name": "Bill",
  "metadata": {
    "baz": "something else"
  }
}
Note how metadata is a different dictionary in the two entries.
How can this be specified as a struct?
struct<
  name:string,
  metadata:?
>
I ended up doing what I mentioned in the comment: make the column a string and serialize the JSON blob to a string.
SQL queries will then need to deserialize the JSON blob, which is supported in several different implementations, including AWS Athena (the one I'm using).
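A minimal sketch of that write path in Python, assuming the rows arrive as plain dicts (the rows variable and column names come from the example above):

import json

# Rows with a fixed 'name' and free-form 'metadata'
rows = [
    {"name": "Jane", "metadata": {"foo": 123, "bar": "something"}},
    {"name": "Bill", "metadata": {"baz": "something else"}},
]

# Serialize the free-form part so the Glue column can be declared as string
for row in rows:
    row["metadata"] = json.dumps(row["metadata"])

In Athena the serialized column can then be queried with the JSON functions, for example json_extract_scalar(metadata, '$.foo').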

How to remove a value of multi-valued SCIM 2.0 sub-attribute?

I have a complex SCIM attribute that looks as follows:
"myattr1": {
  "subattr1": 5,
  "subattr2": [1, 2, 3]
}
I want to modify this to become:
"myattr1": {
  "subattr1": 5,
  "subattr2": [1, 3]
}
How can I do this using PATCH? Should I replace the entire sub-attribute, or can I just remove the value 2 from it using PATCH?
I know how to do this with multi-valued attributes, but I don't know how to do it for sub-attributes.
[EDIT: This is wrong..]
I believe this will work:
PATCH /resource/id
{
  "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
  "Operations": [
    {
      "op": "remove",
      "path": "myattr1[subattr2 eq \"2\"]"
    }
  ]
}
The path example was taken from https://datatracker.ietf.org/doc/html/rfc7644#page-33, which mentions "path":"members[value eq \"2819c223-7f76-453a-919d-413861904646\"].displayName" as a way to target the displayName sub-attribute of the complex multi-valued group attribute "members".
The escaped quotes around 2 are necessary if the value is a string; if it is in fact an integer, they are not needed.
As per the PingIdentity documentation (https://github.com/pingidentity/scim2/wiki/Working-with-SCIM-paths#the-value-sub-attribute), simple multivalued attributes have a special implicit sub-attribute called "value".
If that holds, your PATCH request payload should be as follows:
{
  "schemas": [
    "urn:ietf:params:scim:api:messages:2.0:PatchOp"
  ],
  "Operations": [
    {
      "op": "remove",
      "path": "myattr1.subattr2[value eq \"2\"]"
    }
  ]
}
However, this type of patch operation is not clearly defined in RFC 7644 (https://datatracker.ietf.org/doc/html/rfc7644#section-3.5.2)
You could confirm this by raising the question on the SCIM mailing list: https://mailarchive.ietf.org/arch/browse/scim/
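For completeness, here is a hedged sketch of sending that PATCH from Python with the requests library. The endpoint URL and bearer token are placeholders, and since the values in subattr2 are integers in the example, the filter drops the escaped quotes:

import requests

# Placeholder endpoint and credentials; substitute your SCIM provider's values
url = "https://scim.example.com/v2/MyResources/1234"
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/scim+json",
}
payload = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [
        {
            "op": "remove",
            # Target the value 2 via the implicit "value" sub-attribute
            "path": "myattr1.subattr2[value eq 2]",
        }
    ],
}

resp = requests.patch(url, json=payload, headers=headers)
resp.raise_for_status()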

Pymongo: Best way to remove $oid in Response

I have started using Pymongo recently and now I want to find the best way to remove $oid from the response.
When I use find:
result = db.nodes.find_one({"name": "Archer"})
And convert the result with (dumps here is bson.json_util.dumps):
json.loads(dumps(result))
The result would be:
{
  "_id": {
    "$oid": "5e7511c45cb29ef48b8cfcff"
  },
  "about": "A jazz pianist falls for an aspiring actress in Los Angeles."
}
What I expect:
{
  "_id": "5e7511c45cb29ef48b8cfcff",
  "about": "A jazz pianist falls for an aspiring actress in Los Angeles."
}
As you can see, we could use:
resp = json.loads(dumps(result))
resp['_id'] = resp['_id']['$oid']
But I don't think this is the best way. I hope you have a better solution.
You can take advantage of aggregation:
result = db.nodes.aggregate([
    {'$match': {'name': 'Archer'}},
    {'$addFields': {'Id': {'$toString': '$_id'}}},  # $toString requires MongoDB 4.0+
    {'$project': {'_id': 0}}
])
data = json.dumps(list(result))
Here, with $addFields I add a new field Id containing the string form of the ObjectId. Then I make a projection that eliminates the _id field from the result. After that, as I get a cursor, I turn it into a list.
It may not work exactly as you hope, but the general idea is there.
First of all, there's no $oid in the response. What you are seeing is the Python driver representing the _id field as an ObjectId instance, and the dumps() method then rendering that ObjectId in its extended-JSON string form. The $oid bit is just to let you know the field is an ObjectId, should you need to use it for some purpose later.
The next part of the answer depends on what exactly you are trying to achieve. Almost certainly you can achieve it using the result object without converting it to JSON.
If you just want to get rid of it altogether, you can do:
result = db.nodes.find_one({ "name": "Archer" }, {'_id': 0})
print(result)
which gives:
{"name": "Archer"}
import re
from bson.json_util import dumps

# Strip {"$oid": "..."} wrappers from a JSON string, keeping just the quoted id
def remove_oid(string):
    pattern = re.compile(r'{\s*"\$oid":\s*("[a-z0-9]+")\s*}')
    while True:
        match = re.search(pattern, string)
        if match:
            string = string.replace(match.group(0), match.group(1))
        else:
            return string

string = dumps(mongo_query_result)
string = remove_oid(string)
I am using some form of custom handler. I managed to remove $oid and replace it with just the id string:
# Custom handler: convert non-JSON-serializable values to strings
import datetime
import json
import bson

def my_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    elif isinstance(x, bson.objectid.ObjectId):
        return str(x)
    else:
        raise TypeError(x)

# Parsing
def parse_json(data):
    return json.loads(json.dumps(data, default=my_handler))

result = db.nodes.aggregate([
    {'$match': {'name': 'Archer'}},
    {'$addFields': {'Id': '$_id'}},  # copy the ObjectId; my_handler stringifies it
    {'$project': {'_id': 0}}
])
data = parse_json(list(result))
In the second argument of find_one, you can define which fields to exclude, in the following way:
site_information = mongo.db.sites.find_one({'username': username}, {'_id': False})
This statement excludes the '_id' field from the returned document.
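And if you need the id as a plain string rather than dropping it, a minimal alternative that avoids the JSON round-trip entirely is to convert the ObjectId directly:

result = db.nodes.find_one({"name": "Archer"})
if result is not None:
    result["_id"] = str(result["_id"])  # ObjectId -> "5e7511c45cb29ef48b8cfcff"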

Mulesoft 4 Dataweave for loop and key value pair inside EXCEL TO JSON transformation

I have a DataWeave transformation converting an Excel file to JSON, and I need to change the value of one element (a column of the file) according to the key-value pairs stored in a variable.
Please let me know how to achieve this.
Below is my DataWeave, which converts the 600 rows in the file to JSON. However, I need to change the value for Brand as per the key-value mapping I stored in a variable.
%dw 2.0
output application/json
---
payload map(payload01, index01) -> {
    city: payload01.City,
    province: payload01.Province,
    phone: payload01.Phone,
    fax: payload01.FAX,
    email: payload01.EMAIL,
    Brand: payload01.'Fuel Brand'
}
I understand that you want to use the value of the 'Fuel Brand' attribute of the input payload as the key to look up in a variable:
%dw 2.0
output application/json
---
payload."Sheet Name" map(payload01, index01) -> {
    city: payload01.City,
    ...
    Brand: vars.brandsMapping[payload01.'Fuel Brand']
}
For example, if the input is (note: I removed the other attributes to simplify the example):
[{City=City A, Fuel Brand=brand1}, {City=City B, Fuel Brand=brand3}]
And the variable vars.brandsMapping contains:
{brand1=The brand1, brand2=The brand2, brand3=The brand3}
The output will be:
[
  {
    "city": "City A",
    "Brand": "The brand1"
  },
  {
    "city": "City B",
    "Brand": "The brand3"
  }
]
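For comparison, here is the same lookup sketched in Python (the data shapes are taken from the example above; this illustrates the logic only and is not Mule code):

brands_mapping = {"brand1": "The brand1", "brand2": "The brand2", "brand3": "The brand3"}
rows = [
    {"City": "City A", "Fuel Brand": "brand1"},
    {"City": "City B", "Fuel Brand": "brand3"},
]

# Look up each row's 'Fuel Brand' in the mapping, mirroring vars.brandsMapping[...]
output = [
    {"city": row["City"], "Brand": brands_mapping[row["Fuel Brand"]]}
    for row in rows
]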
UPDATE: Since you clarified that you want a dynamic mapping, a method for that is described in the documentation: https://docs.mulesoft.com/mule-runtime/4.1/dataweave-cookbook-map-based-on-an-external-definition

Get Most Recent Column Value With Nested And Repeated Fields

I have a table with nested, repeated fields (a repeated addresses record containing a city field), and the following data in it:
[
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:25 UTC"
  },
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:16 UTC"
  }
]
What I'd like to do is recreate the same table (same structure) but with only the latest version of each row. In this example, let's say I'd like to group everything by name and take the row with the most recent createdAt.
I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value, but I couldn't get it to work with record and repeated fields.
I really hoped someone from the Google team would answer this question, as it is a very frequent topic/problem asked here on SO. BigQuery is definitely not friendly enough with writing nested/repeated stuff back to BQ off of a BQ query.
So, I will provide the workaround I found a relatively long time ago. I DO NOT like it, but (and that is why I hoped for an answer from the Google team) it works. I hope you will be able to adapt it to your particular scenario.
So, based on your example, assume you have the table shown in the question, and you expect to get the most recent records based on the createdAt column.
Below code does this:
SELECT name, age, createdAt, addresses.city
FROM JS(
  ( // input table
    SELECT name, age, createdAt, NEST(city) AS addresses
    FROM (
      SELECT name, age, createdAt, addresses.city
      FROM (
        SELECT
          name, age, createdAt, addresses.city,
          MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
        FROM yourTable
      )
      WHERE createdAt = lastAt
    )
    GROUP BY name, age, createdAt
  ),
  name, age, createdAt, addresses, // input columns
  "[ // output schema
    {'name': 'name', 'type': 'STRING'},
    {'name': 'age', 'type': 'INTEGER'},
    {'name': 'createdAt', 'type': 'INTEGER'},
    {'name': 'addresses', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'city', 'type': 'STRING'}
     ]
    }
  ]",
  "function(row, emit) { // function
    var c = [];
    for (var i = 0; i < row.addresses.length; i++) {
      c.push({city: row.addresses[i]});
    };
    emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
  }"
)
The way the above code works is: it implicitly flattens the original records; finds the rows that belong to the most recent record (partitioned by name and age); and assembles those rows back into the respective records. The final step is processing with a JS UDF to build the proper schema, so the result can actually be written back to a BigQuery table as nested/repeated rather than flattened.
That last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
Please note: in this example there is only one nested field inside the addresses record, so the NEST() function worked. In scenarios where you have more than one field inside, the above approach still works, but you need to concatenate those fields to put them inside NEST(), and then do extra splitting of those fields inside the JS function, etc.
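If you run this from Python, note that writing nested/repeated results back to a table requires disabling result flattening. A hedged sketch with the google-cloud-bigquery client (the project, dataset, and table names are placeholders, and legacy_sql_query is assumed to hold the query text above):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True        # the JS() / NEST() query above is legacy SQL
job_config.allow_large_results = True   # required when flatten_results is False
job_config.flatten_results = False      # keep the nested/repeated structure in the output
job_config.destination = client.dataset("my_dataset").table("most_recent_rows")

client.query(legacy_sql_query, job_config=job_config).result()  # waits for completion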
You can see examples in these answers:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
I hope this is a good foundation for you to experiment with and make your case work!