How to convert text annotated in INCEpTION into NER training data for spaCy? (CoNLL-U to JSON) - spacy

I'm using INCEpTION to annotate Named Entities which I want to use to train a model with spaCy. There are several options (e.g. CoNLL 2000, CoNLL CoreNLP, CoNLL-U) in INCEpTION to export the annotated text. I have exported the file as CoNLL-U and I want to convert it to json since this file format is required to train spaCy's NER module.
Someone has asked a similar question but the answer doesn't help me (here).
This is the annotated test text that I am using:
spaCy's convert script is:
python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
[--n-sents] [--morphology] [--lang]
My first problem is that I can't convert the file to .json. When I use the code below, I only get an output without any Named Entities (see the last output):
!python -m spacy convert Teest.conllu
I also tried to add an output path and the json file type
!python -m spacy convert Teest.conllu C:\Users json
But then I get the following error:
usage: spacy convert [-h] [-t json] [-n 1] [-s] [-b None] [-m] [-c auto]
[-l None]
input_file [output_dir]
spacy convert: error: unrecognized arguments: Users json
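(Judging from the usage line above, the file type is a flag rather than a positional argument, so the call presumably has to look something like the following, with the output directory as the second positional argument and the file type passed via --file-type/-t:)
!python -m spacy convert Teest.conllu C:\Users --file-type json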
My second problem is that the output contains neither Named Entities nor start and end indices:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Hallo",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":1,
"orth":",",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":2,
"orth":"dies",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":3,
"orth":"ist",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":4,
"orth":"ein",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":5,
"orth":"Test",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":6,
"orth":"um",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":7,
"orth":"zu",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":8,
"orth":"schauen",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":9,
"orth":",",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":10,
"orth":"wie",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":11,
"orth":"in",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":12,
"orth":"Inception",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":13,
"orth":"annotiert",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":14,
"orth":"wird",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":15,
"orth":".",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
},
{
"id":1,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Funktioniert",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":1,
"orth":"es",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":2,
"orth":"?",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
},
{
"id":2,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Simon",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
}
]
I am using spaCy version 2.3.0 and Python version 3.8.3.
UPDATE:
I have used a new file because I wanted to find out whether there are any issues with the language.
When I export the file as CoNLL CoreNLP, the file contains Named Entities:
1 I'm _ _ O _ _
2 trying _ _ Verb _ _
3 to _ _ O _ _
4 fix _ _ Verb _ _
5 some _ _ O _ _
6 problems _ _ O _ _
7 . _ _ O _ _
1 But _ _ O _ _
2 why _ _ O _ _
3 it _ _ O _ _
4 this _ _ O _ _
5 not _ _ O _ _
6 working _ _ Verb _ _
7 ? _ _ O _ _
1 Simon _ _ Name _ _
However, when I try to convert the CoNLL CoreNLP file with
!python -m spacy convert Teest.conll
the error
line 68, in read_conllx
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: not enough values to unpack (expected 10, got 7)
shows up.
UPDATE:
By adding three more tab-separated "_" columns before the NER column, the conversion works. The output is:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"I'm",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":1,
"orth":"trying",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":2,
"orth":"to",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":3,
"orth":"fix",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":4,
"orth":"some",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":5,
"orth":"problems",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":6,
"orth":".",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
}
]
}
]
}
]
},
{
"id":1,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"But",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":1,
"orth":"why",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":2,
"orth":"it",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":3,
"orth":"this",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":4,
"orth":"not",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":5,
"orth":"working",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":6,
"orth":"?",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
}
]
}
]
}
]
},
{
"id":2,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Simon",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-me"
}
]
}
]
}
]
}
]
Still, I can't convert this directly to a .json file, and as far as I know, tuples are required to train spaCy's NER module. E.g.:
[('Berlin is a city.', {'entities': [(0, 5, 'LOC'), (7, 8, 'VERB'), (12, 15, 'NOUN')]})]

I understand your pain.
You need to manually write a script to convert your last output into spaCy's training format.
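For illustration, such a script could look roughly like the sketch below. It assumes the token-level JSON from the last update (BILUO tags in the "ner" field), rebuilds the sentence text by joining tokens with single spaces (so the character offsets refer to that reconstructed text, not the original), and the file name train_from_conll.json is made up:
import json

def conll_json_to_tuples(path):
    # Convert spaCy's CoNLL-style JSON (tokens with BILUO "ner" tags) into
    # (text, {"entities": [(start, end, label)]}) training tuples.
    with open(path, encoding="utf8") as f:
        docs = json.load(f)

    examples = []
    for doc in docs:
        for para in doc["paragraphs"]:
            for sent in para["sentences"]:
                words = [t["orth"] for t in sent["tokens"]]
                tags = [t.get("ner", "O") for t in sent["tokens"]]

                text = ""
                offsets = []                      # (start, end) per token
                for word in words:
                    if text:
                        text += " "
                    offsets.append((len(text), len(text) + len(word)))
                    text += word

                entities = []
                start, label = None, None
                for (tok_start, tok_end), tag in zip(offsets, tags):
                    if tag in ("O", "-"):
                        continue
                    prefix, _, ent_label = tag.partition("-")
                    if prefix in ("B", "U"):      # entity starts on this token
                        start, label = tok_start, ent_label
                    if prefix in ("L", "U"):      # entity ends on this token
                        entities.append((start, tok_end, label))
                examples.append((text, {"entities": entities}))
    return examples

print(conll_json_to_tuples("train_from_conll.json"))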
A better solution would be to use the spacy-annotator, which allows you to annotate entities and get an output in a format that spaCy likes. Here is what it looks like:

I have found a way to use INCEpTION as an annotation tool for training spaCy's NER module. I have tried various file formats, but in my opinion it is only possible with CoNLL 2002 and spaCy's command line interface.
Annotate the text in INCEpTION
Export the annotated text as a CoNLL 2002 file
Set up the command line interface (Windows is used here)
python -m venv .venv
.venv\Scripts\activate.bat
Install spaCy if necessary and download the required language model (I'm using the large English model)
pip install spacy
python -m spacy download en_core_web_lg
Convert the CoNLL 2002 file to spaCy's required input format
python -m spacy convert --converter ner file_name.conll [output file directory]
In theory, this step shouldn't work, since CoNLL 2002 uses IOB2 while spaCy's converter expects IOB. However, I didn't have any problems and the resulting .json file is correct.
Debug-Data tool, training and evaluation
Here is a pretty good example of how you can proceed with the converted file.
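For example, with spaCy 2.x the debug-data, training and evaluation steps can be run from the command line roughly like this (train.json, dev.json and the ner_model output folder are placeholder names; check python -m spacy train --help for the exact options of your version):
python -m spacy debug-data en train.json dev.json --pipeline ner
python -m spacy train en ner_model train.json dev.json --pipeline ner
python -m spacy evaluate ner_model/model-best dev.json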

Related

How to issue ticket in Amadeus after flight-order request?

{
"data":{
"type":"flight-order",
"id":"eJzTd9cPijL1Cg8FAAuUAn0%3D",
"associatedRecords":[
{
"reference":"RZ5JWU",
"creationDate":"2022-01-13T05:40:00.000",
"originSystemCode":"GDS",
"flightOfferId":"1"
}
],
"flightOffers":[
{
"type":"flight-offer",
"id":"1",
"source":"GDS",
"nonHomogeneous":false,
"lastTicketingDate":"2022-03-31",
"itineraries":[
{
"segments":[
{
"departure":{
"iataCode":"ISB",
"at":"2022-03-30T01:40:00"
},
"arrival":{
"iataCode":"DXB",
"terminal":"1",
"at":"2022-03-31T03:55:00"
},
"carrierCode":"PK",
"number":"233",
"aircraft":{
"code":"320"
},
"operating":{
},
"id":"1",
"numberOfStops":0,
"co2Emissions":[
{
"weight":141,
"weightUnit":"KG",
"cabin":"ECONOMY"
}
]
}
]
}
],
"price":{
"currency":"PKR",
"total":"25235.00",
"base":"15190.00",
"fees":[
{
"amount":"0.00",
"type":"TICKETING"
},
{
"amount":"0.00",
"type":"SUPPLIER"
},
{
"amount":"0.00",
"type":"FORM_OF_PAYMENT"
}
],
"grandTotal":"25234.00",
"billingCurrency":"PKR"
},
"pricingOptions":{
"fareType":[
"PUBLISHED"
],
"includedCheckedBagsOnly":true
},
"validatingAirlineCodes":[
"PK"
],
"travelerPricings":[
{
"travelerId":"1",
"fareOption":"STANDARD",
"travelerType":"ADULT",
"price":{
"currency":"PKR",
"total":"25234.00",
"base":"15190.00",
"taxes":[
{
"amount":"5000.00",
"code":"RG"
},
{
"amount":"2000.00",
"code":"SP"
},
{
"amount":"2800.00",
"code":"YD"
},
{
"amount":"244.00",
"code":"ZR"
}
],
"refundableTaxes":"10044.00"
},
"fareDetailsBySegment":[
{
"segmentId":"1",
"cabin":"ECONOMY",
"fareBasis":"VLOWPK",
"class":"V",
"includedCheckedBags":{
"weight":30,
"weightUnit":"KG"
}
}
]
}
]
}
],
"travelers":[
{
"id":"1",
"dateOfBirth":"2003-01-03",
"gender":"FEMALE",
"name":{
"firstName":"Fakhar",
"lastName":"Khan"
},
"documents":[
{
"number":"AG324234234",
"issuanceDate":"2015-01-17",
"expiryDate":"2025-01-17",
"issuanceCountry":"PK",
"issuanceLocation":"Pakistan",
"nationality":"PK",
"documentType":"PASSPORT",
"holder":true
}
],
"contact":{
"purpose":"STANDARD",
"phones":[
{
"deviceType":"MOBILE",
"countryCallingCode":"92",
"number":"3452345678"
}
],
"emailAddress":"hamidafridi.droidor#gmail.com"
}
}
],
"remarks":{
"general":[
{
"subType":"GENERAL_MISCELLANEOUS",
"text":"ONLINE BOOKING FROM INCREIBLE VIAJES"
}
]
},
"ticketingAgreement":{
"option":"DELAY_TO_CANCEL",
"delay":"6D"
},
"contacts":[
{
"addresseeName":{
"firstName":"PABLO RODRIGUEZ"
},
"address":{
"lines":[
"Calle Prado, 16"
],
"postalCode":"28014",
"countryCode":"ES",
"cityName":"Madrid"
},
"purpose":"STANDARD",
"phones":[
{
"deviceType":"LANDLINE",
"countryCallingCode":"34",
"number":"480080071"
},
{
"deviceType":"MOBILE",
"countryCallingCode":"33",
"number":"480080072"
}
],
"companyName":"INCREIBLE VIAJES",
"emailAddress":"support#increibleviajes.es"
}
]
},
"dictionaries":{
"locations":{
"ISB":{
"cityCode":"ISB",
"countryCode":"PK"
},
"DXB":{
"cityCode":"DXB",
"countryCode":"AE"
}
}
}
}
I made a flight create-order request, which returned a PNR, and now I want to issue the ticket. #Amadeus
As of now, this Flight Create Orders API allows you to book a flight and generate a PNR, but it does not allow for ticketing. Therefore, one of the requirements in order to use the API in production is to sign a contract with an airline consolidator to issue tickets.
Please check the requirements on the API page. If you want help finding a consolidator, get in touch with us via the support channel and we can recommend one.

python: create directory structure in Json format from s3 bucket objects

I am getting objects in an S3 bucket using the following:
import boto3

s3 = boto3.resource(
    service_name='s3',
    aws_access_key_id=key_id,
    aws_secret_access_key=secret
)

for summary_obj in s3.Bucket(bucket_name).objects.all():
    print(summary_obj.key)
It's giving me all the objects like this:
'sub1/sub1_1/file1.zip',
'sub1/sub1_2/file2.zip',
'sub2/sub2_1/file3.zip',
'sub3/file4.zip',
'sub4/sub4_1/file5.zip',
'sub5/sub5_1/file6.zip',
'sub5/sub5_2/file7.zip',
'sub5/sub5_3/file8.zip',
'sub6/'
But I want to have a list of JSON objects with the proper directory structure, like this, to show in my app:
[
{'sub1': [
{
'sub1_1': ['file1.zip'] // All files in sub1_1 folder
},
{
'sub1_2': ['file2.zip'] // All files in sub1_2 folder
},
]},
{'sub2': [
{
'sub2_1': [
'file3.zip'
]
}
]},
{'sub3': [
'file4.zip'
]},
{'sub4': [
{
'sub4_1': [
'file5.zip'
]
}
]},
{'sub5': [
{
'sub5_1': [
'file6.zip'
]
},
{
'sub5_2': [
'file7.zip'
]
},
{
'sub5_3': [
'file8.zip'
]
}
]},
{'sub6': []}
]
What is the best way to do this in Python 3.8?
I gave it a try, and the closest I could get to your JSON was through recursion, which works with any number of levels of folders and sub-folders:
import json
from collections import defaultdict

objects = ['sub1/sub1_1/file1.zip',
           'sub1/sub1_2/file2.zip',
           'sub2/sub2_1/file3.zip',
           'sub3/file4.zip',
           'sub4/sub4_1/file5.zip',
           'sub5/sub5_1/file6.zip',
           'sub5/sub5_2/file7.zip',
           'sub5/sub5_3/file8.zip',
           'sub5/sub5_3/file9.zip',
           'sub5/sub5_3/sub5_4/file1.zip',
           'sub5/sub5_3/sub5_4/file2.zip',
           'sub6/']

def construct_dict(in_list, accumulator):
    # Walk the path segments recursively, creating a nested dict per folder/file.
    if not in_list:
        return
    else:
        if in_list[0] not in accumulator:
            accumulator[in_list[0]] = defaultdict(list)
        return construct_dict(in_list[1::], accumulator[in_list[0]])

accumulator = defaultdict(list)
for obj in objects:
    construct_dict(obj.split('/'), accumulator)

print(json.dumps(accumulator))
Which gives the following (the content is the same, but the structure is a bit different):
{
"sub1": {
"sub1_1": {
"file1.zip": {}
},
"sub1_2": {
"file2.zip": {}
}
},
"sub2": {
"sub2_1": {
"file3.zip": {}
}
},
"sub3": {
"file4.zip": {}
},
"sub4": {
"sub4_1": {
"file5.zip": {}
}
},
"sub5": {
"sub5_1": {
"file6.zip": {}
},
"sub5_2": {
"file7.zip": {}
},
"sub5_3": {
"file8.zip": {},
"file9.zip": {},
"sub5_4": {
"file1.zip": {},
"file2.zip": {}
}
}
},
"sub6": {
"": {}
}
}
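If you need exactly the list-of-dicts shape from the question, one option (just a sketch that reuses the accumulator built above) is to post-process that nested dict, treating any key without children as a file and the empty key produced by a trailing '/' (as in 'sub6/') as an empty folder:
def to_listing(tree):
    # Convert the nested dict from construct_dict() into the
    # [{'folder': [...]}, 'file', ...] structure asked for in the question.
    result = []
    for name, children in tree.items():
        if name == "":            # entry created by a trailing '/', e.g. 'sub6/'
            continue
        if not children:          # no children -> treat the key as a file name
            result.append(name)
        else:                     # otherwise it is a folder -> recurse
            result.append({name: to_listing(children)})
    return result

print(json.dumps(to_listing(accumulator), indent=2))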

How to implement group by in Dataweave based on first column in CSV

I have an incoming CSV file that looks like this (notice that the first field is common to several rows - it is the order number):
36319602,100,12458,HARVEY NORMAN,
36319602,101,12459,HARVEY NORMAN,
36319602,102,12457,HARVEY NORMAN,
36319601,110,12458,HARVEY NORMAN,
36319601,111,12459,HARVEY NORMAN,
36319601,112,12457,HARVEY NORMAN,
36319603,110,12458,HARVEY NORMAN,
36319603,121,12459,HARVEY NORMAN,
36319603,132,12457,HARVEY NORMAN,
This is my current DataWeave code:
list_of_orders: {
    order: payload map ((payload01 , indexOfPayload01) -> {
        order_dtl:
        [{
            seq_nbr: payload01[1],
            route_nbr: payload01[2]
        }],
        order_hdr: {
            ord_nbr: payload01[0],
            company: payload01[3],
            city: payload01[4],
        }
    })
}
An example of the desired output would be something like this (this is just mocked up). Notice how I would like a single header grouped by the first column, which is the order number, but with multiple detail lines:
"list_of_orders": {
"order": [
{
"order_dtl": [
{
seq_nbr: 100,
route_nbr: 12458
},
{
seq_nbr: 101,
route_nbr: 12459
},
{
seq_nbr: 102,
route_nbr: 12457
}
],
"order_hdr":
{
ord_nbr: 36319602,
company: HARVEY NORMAN
}
}
]
}
It works fine except that it is repeating the order_hdr key.
What they would like is a single header key with multiple details beneath.
The grouping is to be based on "ord_nbr: payload01[0]"
Any help appreciated
Thanks
I think you're using DataWeave 1. In DW1, this groupBy gets the desired output (note you can change the field pointers [0], [1], etc. to field name mappings if you have them set up as metadata):
%dw 1.0
%output application/json
---
list_of_orders: {
    order: (payload groupBy ($[0])) map {
        order_dtl: $ map {
            seq_nbr: $[1],
            route_nbr: $[2]
        },
        order_hdr:
        {
            ord_nbr: $[0][0],
            company: $[0][3]
        }
    }}
UPDATE
Here is the output for the new input sample with multiple orders:
{
"list_of_orders": {
"order": [
{
"order_dtl": [
{
"seq_nbr": "110",
"route_nbr": "12458"
},
{
"seq_nbr": "121",
"route_nbr": "12459"
},
{
"seq_nbr": "132",
"route_nbr": "12457"
}
],
"order_hdr": {
"ord_nbr": "36319603",
"company": "HARVEY NORMAN"
}
},
{
"order_dtl": [
{
"seq_nbr": "100",
"route_nbr": "12458"
},
{
"seq_nbr": "101",
"route_nbr": "12459"
},
{
"seq_nbr": "102",
"route_nbr": "12457"
}
],
"order_hdr": {
"ord_nbr": "36319602",
"company": "HARVEY NORMAN"
}
},
{
"order_dtl": [
{
"seq_nbr": "110",
"route_nbr": "12458"
},
{
"seq_nbr": "111",
"route_nbr": "12459"
},
{
"seq_nbr": "112",
"route_nbr": "12457"
}
],
"order_hdr": {
"ord_nbr": "36319601",
"company": "HARVEY NORMAN"
}
}
]
}
}

opendaylight bgp-linkstate not making "loc-rib"

ODL version: Carbon
I'm having a problem getting BGP-LS into the Network Topology. As you can see from the REST output below, I set up "bgp-example" and homed it to an external eBGP linkstate peer. "effective-rib-in", "adj-rib-in", and "adj-rib-out" all populate - but "loc-rib" does not. For some reason, it is not inheriting the linkstate afi/safi.
I tried debugs for bgp & karaf but saw nothing out of the ordinary (that I could see) - any help would be much appreciated.
thanks
Erik
*bgp configuration
http://192.168.3.42:8181/restconf/config/openconfig-network-instance:network-instances/network-instance/global-bgp/protocols/protocol/openconfig-policy-types:BGP/bgp-example
{
"protocol": [
{
"name": "bgp-example",
"identifier": "openconfig-policy-types:BGP",
"bgp-openconfig-extensions:bgp": {
"global": {
"config": {
"router-id": "192.168.3.42",
"as": 65000
}
},
"neighbors": {
"neighbor": [
{
"neighbor-address": "192.168.3.41",
"config": {
"peer-type": "EXTERNAL",
"peer-as": 65111
},
"afi-safis": {
"afi-safi": [
{
"afi-safi-name": "bgp-openconfig-extensions:LINKSTATE"
}
]
}
}
]
}
}
}
]
}
*loc-rib empty
http://192.168.3.42:8181/restconf/operational/bgp-rib:bgp-rib/rib/bgp-example/loc-rib
{
"loc-rib": {
"tables": [
{
"afi": "bgp-types:ipv4-address-family",
"safi": "bgp-types:unicast-subsequent-address-family",
"bgp-inet:ipv4-routes": {}
}
]
}
}
As you can see, linkstate is making it into every RIB except loc-rib:
http://192.168.3.42:8181/restconf/operational/bgp-rib:bgp-rib/rib/bgp-example
{
"rib": [
{
"id": "bgp-example",
"peer": [
{
"peer-id": "bgp://x.x.x.x",
"supported-tables": [
{
"afi": "bgp-types:ipv4-address-family",
"safi": "bgp-types:unicast-subsequent-address-family"
},
{
"afi": "bgp-linkstate:linkstate-address-family",
"safi": "bgp-linkstate:linkstate-subsequent-address-family"
}
],
"effective-rib-in": {
"tables": [
{
"afi": "bgp-linkstate:linkstate-address-family",
"safi": "bgp-linkstate:linkstate-subsequent-address-family",
"bgp-linkstate:linkstate-routes": {
"linkstate-route": [
{
"route-key": "AAMAMAIAAAAAAAAFMgEAABoCAAAEAAD+VwIBAAQAAAAAAgMABgEAFQmQAAEJAAUgCv0YAQ==",
"identifier": 1330,
"advertising-node-descriptors": {
"as-number": 65111,
"domain-id": 0,
"isis-node": {
"iso-system-id": "AQAVCZAA"
}
},
"prefix-descriptors": {
"ip-reachability-information": "x.x.x.x/32"
},
"attributes": {
"origin": {
"value": "igp"
},
"ipv4-next-hop": {
"global": "x.x.x.x"
},
"as-path": {
"segments": [
{
"as-sequence": [
65111
]
}
]
}
},
"protocol-id": "isis-level2"
}
}
rest of output truncated for brevity/readability
OK, I figured this out... it turns out I had not enabled the LINKSTATE afi/safi in the global config for ODL BGP. I had to DELETE my existing global config, then POST it again and re-add neighbors, peers, etc. Now I have the linkstate DB in loc-rib, AND it has made it to the network topology - BUT I have no idea how to view this topology via DLUX...

index.cache.field.max_size unable to limit field data cache in elasticsearch

I am trying to limit the field data caching (resident) by setting index.cache.field.max_size: NUMBER in the config/elasticsearch.yml file. I have around 1 million records, and a faceting operation is performed on 7 fields (all of them containing lots of text data) to construct a "word cloud".
curl -X POST 'http://localhost:9200/monitoring/mention_reports/_search?&pretty=true' -d '
{
"size":"0",
"query": {
"filtered":{
"query":{
"text": {
"positive_keyword": {
"query": "quora"
}
}
},
"filter":{
. . .
}
}
},
"facets": {
"tagcloud": {
"terms": {
"fields":["field1","field2","field3","field4","field5","field6","field7"],
"size":"300"
}
}
}
}
'
The heap memory (15 GB allocated) gets eaten up every time, irrespective of the value (1000 or 100000) specified for index.cache.field.max_size. What am I doing wrong?
Also, is there a better way to build a word cloud than performing faceting on such a huge amount of text data?
Mapping:
curl -XPOST http://localhost:9200/monitoring/ -d '
{
"settings":{
"index":{
"number_of_shards":5,
"number_of_replicas":1
},
"analysis":{
"filter":{
"myCustomShingle":{
"type":"shingle",
"max_shingle_size":3,
"output_unigrams":true
},
"myCustomStop":{
"type":"stop",
"stopwords":["a","about","abov ... ]
}
},
"analyzer":{
"myAnalyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase",
"myCustomShingle",
"stop",
"myCustomStop"
]
}
}
}
},
"mappings":{
"mention_reports":{
"_source":{
"enabled":true
},
"_all":{
"enabled":false
},
"index.query.default_field":"post_message",
"properties":{
"id":{
"type":"string",
"index":"not_analyzed",
"include_in_all" : "false",
"null_value" : "null"
},
"creation_time":{
"type":"date"
},
"field1":{
"type":"string",
"analyzer":"standard",
"include_in_all":"false",
"null_value":0
},
"field2":{
"type":"string",
"index":"not_analyzed",
"include_in_all":"false",
"null_value":"null"
},
. . .
"field7":{
"type":"string",
"analyzer":"myAnalyzer",
"term_vector":"with_positions_offsets",
"null_value" : "null"
}
}
}
}
}
'