I'm trying to create a simple program with Rasa which extracts a (French) street address from a text input.
Following the advice in Rasa-NLU doc (http://rasa-nlu.readthedocs.io/en/latest/entities.html), I want to use spaCy to do the address detection.
I saw (https://spacy.io/usage/training) that the corresponding spaCy prebuilt entity would be LOC.
However, I don't understand how to create a training dataset with this entity.
Here is an excerpt from my current JSON training dataset:
{
  "text" : "je vis au 2 Rue des Platanes",
  "intent" : "donner_adresse",
  "entities" : [
    {
      "start" : 10,
      "end" : 28,
      "value" : "2 Rue des Platanes",
      "entity" : "adresse"
    }
  ]
}
If I train the program and run it with the text input "je vis au 2 Rue des Hetres", I get this output:
{
  "entities": [
    {
      "end": 26,
      "entity": "adresse",
      "extractor": "ner_crf",
      "start": 10,
      "value": "2 rue des hetres"
    }
  ],
  "intent": null,
  "intent_ranking": [],
  "text": "je vis au 2 Rue des Hetres"
}
This is fine given my training dataset, but I would like to use spaCy's LOC entity.
How can I achieve that? (What am I doing wrong?)
Here is a relevant summary of my config file, if needed:
{
  "pipeline" : "spacy_sklearn",
  "language" : "fr",
  "spacy_model_name" : "fr_core_news_md"
}
If you want to use spaCy's pre-trained NER, you just need to add it to your pipeline, e.g.
pipeline = ["nlp_spacy", "tokenizer_spacy", "ner_spacy"]
But depending on what you need, you might want to just copy one of the preconfigured pipelines and add "ner_spacy" at the end.
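For example, starting from your config above, a pipeline along these lines should work (a sketch only — the extra component names are taken from the spacy_sklearn preset in the Rasa NLU docs and may differ between versions, so double-check against your release):

{
  "language" : "fr",
  "spacy_model_name" : "fr_core_news_md",
  "pipeline" : ["nlp_spacy", "tokenizer_spacy", "intent_featurizer_spacy",
                "intent_classifier_sklearn", "ner_crf", "ner_spacy"]
}

With "ner_spacy" in the pipeline, the pre-trained entities of the French spaCy model (including LOC) are returned alongside your own "adresse" entity from "ner_crf", and they need no annotation in your training data.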
I'd like to import BigQuery data into Bigtable using Google Cloud Composer.
Exporting the BigQuery rows to GCS in Avro format was successful. However, importing the Avro data into Bigtable was not.
The error says:
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
I guess the BigQuery and Bigtable schemas need to match each other, but I have no idea how to do this.
For every record read from the Avro files:
Attributes present in the files and in the table are loaded into the table.
Attributes present in the file but not in the table are subject to the ignore_unknown_fields option.
Attributes that exist in the table but not in the file will use their default value, if there is one set.
The links below are helpful:
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-storage-avro-to-bigtable
[2] https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/resources/schema/avro/bigtable.avsc
[3] Avro to BigTable - Schema issue?
For those of you who, like me, still have problems because you are not familiar with Avro, here is one working schema transformation that I found after some tinkering.
For example, suppose you have a BigQuery table with a user_id column plus feature columns (channel, zip_code, history), and you want to use user_id as the Bigtable row key and ingest all the other columns. Here is example code to encode the rows as an Avro file.
import json
from datetime import datetime

from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

# Avro schema matching the BigtableRow record expected by the Dataflow template [2]
bigtable_schema = {
    "name": "BigtableRow",
    "type": "record",
    "namespace": "com.google.cloud.teleport.bigtable",
    "fields": [
        {"name": "key", "type": "bytes"},
        {"name": "cells",
         "type": {
             "type": "array",
             "items": {
                 "name": "BigtableCell",
                 "type": "record",
                 "fields": [
                     {"name": "family", "type": "string"},
                     {"name": "qualifier", "type": "bytes"},
                     {"name": "timestamp", "type": "long", "logicalType": "timestamp-micros"},
                     {"name": "value", "type": "bytes"}
                 ]
             }
         }}
    ]
}

parsed_schema = Parse(json.dumps(bigtable_schema))

row_key = 'user_id'
family_name = 'feature_name'
feature_list = ['channel', 'zip_code', 'history']

# df is a pandas DataFrame holding the rows exported from BigQuery
with open('features.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), parsed_schema)
    for item in df.iterrows():
        row = item[1]
        ts = int(datetime.now().timestamp()) * 1000 * 1000  # timestamp in microseconds
        for feat in feature_list:
            writer.append({
                "key": row[row_key].encode('utf-8'),
                "cells": [{"family": family_name,
                           "qualifier": feat.encode('utf-8'),
                           "timestamp": ts,
                           "value": str(row[feat]).encode('utf-8')}]
            })
    writer.close()
Then you can use the Dataflow template job from [1] to run the ingestion.
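For reference, that template job can be launched with gcloud roughly like this (a sketch only — the project, instance, table, and bucket names are placeholders, and the template path and parameter names should be double-checked against [1]):

gcloud dataflow jobs run avro-to-bigtable \
    --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable \
    --region us-central1 \
    --parameters bigtableProjectId=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-table,inputFilePattern=gs://my-bucket/features.avro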
Complete code can be found here: https://github.com/mitbal/sidu/blob/master/bigquery_to_bigtable.ipynb
I trained a logistic regression model on BigQuery and downloaded it locally.
Afterwards I wanted to load it and make a prediction, but it gives me an error which I cannot solve.
This is the simple code I've written. In particular, query contains the predictors (I've read that they should be submitted as JSON, which is why I've encoded them like this), and I added .signatures['serving_default'] because otherwise I get the error 'AutoTrackable' object is not callable.
import tensorflow as tf
model = tf.saved_model.load('./log_reg')
query = [{
    "pcoordinate_x": "11.191853",
    "pcoordinate_y": "45.892605",
    "mvalue": "0",
    "pcode": "IT*TNK*ETN046",
    "porigin": "route220",
    "scode": "IT*TNK*ETN046-IT*TNK*ETN046",
    "pmetadata_provider": "Route220",
    "pmetadata_accessType": "PRIVATE_WITHPUBLICACCESS",
    "pmetadata_capacity": "1",
    "pmetadata_categories": "['EAT&CHARGE']",
    "smetadata_outlets_outletTypeCode": "Schuko",
    "smetadata_outlets_maxPower": "3.7",
    "smetadata_outlets_maxCurrent": "0.0",
    "smetadata_outlets_minCurrent": "0.0",
    "mvalue_p": "0.0",
    "mvalue_t": "13.3",
    "season": "2",
    "altitude": "1440.0",
    "hour": "09",
    "day": "12",
    "month": "09"
}]
model.signatures['serving_default'](query)
Running the code now gives me this error: [error screenshot]
Has anyone been able to make a prediction with their model and can help me? Thanks in advance!
Python expects this to be called as:
model.signatures['serving_default'](pcoordinate_x = 11.191853, pcoordinate_y = 45.892605, mvalue=0, ...)
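A minimal sketch of that call, building on the question's snippet (it reuses the query list defined above; the serving signature of an exported model takes one named tensor per feature, and the dtypes must match whatever the signature reports, so inspect it first and cast string values to floats where needed):

import tensorflow as tf

model = tf.saved_model.load('./log_reg')
infer = model.signatures['serving_default']

# See which argument names and dtypes the signature actually expects
print(infer.structured_input_signature)

# Pass each predictor as a keyword argument wrapped in a batch-of-one tensor
example = query[0]
prediction = infer(**{name: tf.constant([value]) for name, value in example.items()})
print(prediction)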
I am trying to take output from Salesforce and transform it to JSON. Here is my code:
%dw 1.0
%output application/json
---
payload map {
    headerandlines: {
        id : $.Id,
        agreementLineID : $.LineItems__r.Id,
        netPrice : $.LineItems__r.Price__c,
        volume : $.Volume__c,
        name : $.Name,
        StartDate : $.Start_Date__c,
        EndDate : $.End_Date__c,
        poField : $.PO_Field__c,
        ConsoleNumber : $.Console_Number__c,
        Term : $.Term__c,
        ownerID : $.OwnerId,
        Unit : $.Unit__c,
        siteNumber : $.Site_Num__c,
        customerNumber : $.Customer_Num__c
    }
}
The input payload looks like this; it is a collection of objects. Somehow, after the transformation, only the first object is sent and the rest is clobbered.
[
  {
    "id": "DA0YAAW",
    "LineID": [
      "jGEAU",
      "jBEAU",
      "j6EAE"
    ],
    "Price": [
      "50000.0",
      "12000.0",
      "45000.0"
    ],
    "netPrice": null,
    "volume": null,
    "name": " Test 2.24",
    "StartDate": "2017-02-17",
    "EndDate": "2018-02-17",
    "poField": "123456",
    "ConsoleNumber": "8888888",
    "PaymentTerm": "thirty (30)",
    "ownerID": "abcd",
    "OperatingUnit": " International Company",
    "siteNumber": null,
    "customerNumber": null
  },
  {
    "id": "a37n0000000DAMAAA4",
    "LineID": [
      "JunEAE",
      "JuiEAE",
      "KdMEAU",
      "JuYEAU"
    ],
    "Price": [
      "5000.0",
      "8000.0",
      "5000.0",
      "5000.0"
    ],
    "netPrice": null,
    "volume": null,
    "name": " Test 3.6",
    "StartDate": "2017-03-06",
    "EndDate": "2018-03-16",
    "poField": "12345",
    "ConsoleNumber": "123456-",
    "PaymentTerm": "30 NET",
    "ownerID": "dfgh",
    "OperatingUnit": ", inc.",
    "siteNumber": null,
    "customerNumber": null
  },
  ….
]
When I call this code from the browser (using API testing) I get the complete payload with multiple objects. When I call it from another API I get only one object, which indicates it is not looping through. I can confirm that the payload has multiple objects. Is there anything I am missing in terms of looping through this code to extract multiple objects? I assume that the '$' notation is good enough for iteration.
@insaneyogi, either your input or your DataWeave is incorrect.
In the input you have specified id in lower case, but in the DataWeave it is referenced with a capital (Id).
I think the problem here is with your LineItem and Price elements: they are collections within an element. In your mapping, $. takes care of the outer object; however, I think a mapping like LineItems__r.Price__c is not correct. It should have a proper index, probably LineItems__r.Price__c[0]. Please try that and it should work. First change the input to a single element for the price or line item and test.
It looks like the agreementLineID and netPrice are arrays and you need to loop through them with a map operator within the bigger outer map to get all the line items. That should work.
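A sketch of that approach in DataWeave 1.0 (assuming LineItems__r is the collection behind agreementLineID and netPrice; the field names are copied from the question and the "lines" key is just an illustrative name, so adjust both to your actual structure):

%dw 1.0
%output application/json
---
payload map ((agreement, indexOfAgreement) -> {
    headerandlines: {
        id: agreement.Id,
        name: agreement.Name,
        volume: agreement.Volume__c,
        lines: agreement.LineItems__r map ((line, indexOfLine) -> {
            agreementLineID: line.Id,
            netPrice: line.Price__c
        })
    }
})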
Here is my problem:
I have a field called product_id that is in a format similar to:
A+B-12321412
If I use the standard analyzer, it splits the value into tokens like so:
/_analyze/?analyzer=standard&pretty=true" -d '
A+B-1232412
'
{
  "tokens" : [ {
    "token" : "a",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "b",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "1232412",
    "start_offset" : 5,
    "end_offset" : 12,
    "type" : "<NUM>",
    "position" : 3
  } ]
}
Ideally, I would sometimes like to search for an exact product id, and other times use a substring or just query for part of the product id.
My understanding of mappings and analyzers is that I can only specify one analyzer per field.
Is there a way to store a field as both analyzed and exact match?
Yes, you can use the fields parameter. In your case:
"product_id": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
This allows you to index the same data twice, using two different definitions. In this case it will be indexed both with the default analyzer and as not_analyzed, which will only pick up exact matches. This is also useful for sorting returned results:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-fields.html
However, you will need to spend some time thinking about how you want to search. In particular, given part numbers with a mix of alphanumeric, punctuation, and special characters, you may need to get creative to tune your queries and matches.
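For example (a sketch against the mapping above), an exact lookup can use a term query on the raw sub-field:

{ "query": { "term": { "product_id.raw": "A+B-12321412" } } }

while analyzed, partial matching goes through the main field:

{ "query": { "match": { "product_id": "1232412" } } }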
I am new to Pig scripting and to working with JSON. I need to parse multi-level JSON files in Pig. Say:
{
  "firstName": "John",
  "lastName" : "Smith",
  "age" : 25,
  "address" :
  {
    "streetAddress": "21 2nd Street",
    "city" : "New York",
    "state" : "NY",
    "postalCode" : "10021"
  },
  "phoneNumber":
  [
    {
      "type" : "home",
      "number": "212 555-1234"
    },
    {
      "type" : "fax",
      "number": "646 555-4567"
    }
  ]
}
I am able to parse a single-level JSON with JsonLoader(), do joins and other operations, and get the desired results, using JsonLoader('name:chararray,field1:int .....');
Is it possible to parse the above JSON file using the built-in JsonLoader() function of Pig 0.10.0? If it is, please explain how it is done and how to access the fields of that particular JSON.
You can handle nested JSON loading with Twitter's Elephant Bird: https://github.com/kevinweil/elephant-bird
a = LOAD 'file3.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
This will parse the JSON into a map (http://pig.apache.org/docs/r0.11.1/basic.html#map-schema); a JSON array gets parsed into a DataBag of maps.
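A sketch of how the loaded map can then be accessed (the field names are taken from the sample JSON in the question; untested, so treat it as a starting point):

a = LOAD 'file3.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (json:map[]);

-- nested objects come back as maps, so they can be dereferenced with #
b = FOREACH a GENERATE
        json#'firstName'      AS first_name,
        json#'address'#'city' AS city,
        json#'phoneNumber'    AS phone_numbers;

DUMP b;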
It is possible by creating your own UDF. A simple UDF example is shown in the link below:
http://pig.apache.org/docs/r0.9.1/udf.html#udf-java
C = LOAD 'path' USING JsonLoader('firstName:chararray,lastName:chararray,age:int,address:(streetAddress:chararray,city:chararray,state:chararray,postalCode:chararray),
    phoneNumber:{(type:chararray,number:chararray)}');
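With that schema in place, the nested fields can be reached with the usual tuple and bag syntax, for example (an untested sketch):

D = FOREACH C GENERATE firstName, address.city, FLATTEN(phoneNumber);
DUMP D;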