removing a pattern from lines in a file using sed or awk

I am trying to remove a pattern from all lines in a file. The pattern is "id": null, and the two sed attempts I have made execute, but the file is unchanged. Thank you :).
file
{
"objects": [
{
"version": "1a",
"chromosome": "chr1",
"id": null,
"peptide": "123",
},
{
"version": "1a",
"chromosome": "chr1",
"id": "This line has text and is printed.",
"peptide": null,
},
{
"version": '1a',
"chromosome": "chr17",
"id": null,
"peptide": null},
"id": 'This has text in it and this line is printed as well',
"end": 460
}
]
}
desired
{
"objects": [
{
"version": "1a",
"chromosome": "chr1",
"peptide": "123",
},
{
"version": "1a",
"chromosome": "chr1",
"id": "This line has text and is printed.",
"peptide": null,
},
{
"version": '1a',
"chrmosome": "chr17",
"id": null,
"peptide": null},
"id": 'This has text in it and this line is printed as well',
"end": 460
}
]
}
sed
sed '/"id": *null/s/, [^,]*/ /' file --- if "id": null found substitute with blank up to the ending ,
sed -E "s/"id": *null, *//" file

You may use this GNU sed:
sed '0,/"id": null,/{//d}' file
This will remove the first instance of "id": null, from the file. The 0,/regex/ address is a GNU extension that spans from the start of input through the first line matching regex, and the empty regex // inside the block reuses that same regex, so d deletes exactly that line.
Original answer based on original question:
sed -E "s/'id': *None, *//" file
{'version': '1a', 'chr': '17', 'xref_json': None}, 'id': 'This has text in it and this line is printed', 'end': 460}
{'version': '1a', 'chr': '17', 'xref_json': None}
s/'id': *None, *// searches for the pattern 'id':<0 or more spaces>None,<0 or more spaces> and replaces it with an empty string.
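Since the question is also tagged awk, here is an equivalent awk one-liner (a sketch; like the sed above, it deletes only the first line containing "id": null, matching the desired output):
awk '!done && /"id": *null,/{done=1; next} 1' file
The flag done is set on the first match, so that line is skipped via next; every other line, including later "id": null, lines, falls through to the 1 and is printed.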


TextMate grammar - clashing multi-line captures

I have the following syntax:
test: name {
param_name: value
another_param: value2
test: [12, "asd"]
test2: [
"test__",
"test3"
]
}
My logic here is as follows:
Detect scopes as multi-line match
"begin": "([a-z_]+)\\s?:\\s?([a-z_\\+]+)?\\s?(\\{)",
"end": "(\\})",
In the patterns section of the above, add parameters with multiline matching
"begin": "(?!sql)([a-z\\_]*)\\s?:",
"end": "(?<=\\n)",
And then in the patterns of that I have array
"begin": "\\[",
"end": "\\]",
The problem is that test: [12, "asd"] is correctly defined as
test - parameter name
[12, "asd"] - parameter value + array
but I can't get it to work on the multi-line value. It only recognises the opening [ as array.
At first I thought I understood the reason why: the parameters rule finishes when it sees a newline, hence the second line of an array will not be matched. So I added array to the main scope pattern, and that's where my understanding ends.
Full file:
{
"$schema": "https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json",
"name": "QQQL",
"patterns": [
{"include": "#scopes"},
{"include": "#parameters"}
],
"repository": {
"scopes": {
"name": "source.qqql.scope",
"begin": "([a-z_]+)\\s?:\\s?([a-z_\\+]+)?\\s?(\\{)",
"end": "(\\})",
"patterns": [
{"include": "#scopes"},
{"include": "#array"},
{"include": "#parameters"}
]
},
"parameters": {
"name": "source.qqql.parameter",
"begin": "(?!sql)([a-z\\_]*)\\s?:",
"end": "(?<=\\n)",
"beginCaptures": {
"1": {
"name": "source.qqql.parameter.name"
}
},
"patterns": [
{"include": "#array"},
{
"name": "source.qqql.parameter.value",
"match": "(.*)",
"captures": {
"1": {
"patterns": [
{"include": "#array"}
]
}
}
}
]
},
"array": {
"name": "source.qqql.array",
"begin": "\\[",
"end": "\\]",
"patterns": [
{"include": "#strings"},
{
"name": "source.qqql.array.delimiter",
"match": "\\,"
}
]
}
},
"scopeName": "source.qqql"
}
What I expected is that the inclusion of array in scopes would solve the problem but somehow it doesn't.

azure search exact match of file name not returning exact results

I am indexing all the file names into the index, but when I search with the exact file name in the search query it returns other file names as well. Below is my index definition.
{
"fields": [
{
"name": "id",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": true,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "FileName",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "keyword-analyzer",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
],
"scoringProfiles": [],
"defaultScoringProfile": null,
"corsOptions": null,
"analyzers": [
{
"name": "keyword-analyzer",
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [],
"tokenizer": "keyword_v2",
"tokenFilters": ["lowercase", "my_asciifolding", "my_word_delimiter"]
}
],
"tokenFilters": [
{
"#odata.type": "#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
"name": "my_asciifolding",
"preserveOriginal": true
},
{
"#odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
"name": "my_word_delimiter",
"generateWordParts": true,
"generateNumberParts": false,
"catenateWords": false,
"catenateNumbers": false,
"catenateAll": false,
"splitOnCaseChange": true,
"preserveOriginal": true,
"splitOnNumerics": true,
"stemEnglishPossessive": false,
"protectedWords": []
}
],
"#odata.etag": "\"0x8D6FB2F498F9AD2\""
}
Below is my sample data
{
"value": [
{
"id": "1",
"FileName": "SamplePSDFile_1psd2680.psd"
},
{
"id": "2",
"FileName": "SamplePSDFile-1psd260.psd"
},
{
"id": "3",
"FileName": "SamplePSDFile_1psd2689.psd"
},
{
"id": "4",
"FileName": "SamplePSDFile-1psdxx2680.psd"
}
]
}
Below are the Analyze API results
{
"tokens": [
{
"token": "samplepsdfile_1psd2689.psd",
"startOffset": 0,
"endOffset": 26,
"position": 0
},
{
"token": "samplepsdfile",
"startOffset": 0,
"endOffset": 13,
"position": 0
},
{
"token": "psd",
"startOffset": 15,
"endOffset": 18,
"position": 1
},
{
"token": "psd",
"startOffset": 23,
"endOffset": 26,
"position": 2
}
]
}
When I search with the keyword "SamplePSDFile_1psd2689.psd", Azure Search returns three records in the results instead of only document 3. Below is my search query and the results.
?search="SamplePSDFile_1psd2689.psd"&api-version=2019-05-06&$count=true&queryType=full&searchMode=All
{
"#odata.count": 3,
"value": [
{
"#search.score": 2.3387241,
"id": "2",
"FileName": "SamplePSDFile-1psd260.psd"
},
{
"#search.score": 2.2493405,
"id": "3",
"FileName": "SamplePSDFile_1psd2689.psd"
},
{
"#search.score": 2.2493405,
"id": "1",
"FileName": "SamplePSDFile_1psd2680.psd"
}
]
}
How can I achieve my expected results? I tried with and without double quotes around the keyword, and all the other options, but no luck. What am I doing wrong here?
Somebody suggested using $filter, but that field isn't filterable in our case.
Please help me with this.
If you are looking for an exact match, then you probably don't want any analyzer involved. Give it a try with this line
"analyzer": "keyword-analyzer"
changed to
"analyzer": null
If you need to be able to do an exact match on the field and also support partial keyword searches, then you need to index the field twice under different names. Maybe append "Exact" to the exact-match field name, and don't use an analyzer for that one. The name without "Exact" can have an analyzer. Then search against the right field name depending on the type of search.
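For illustration, the two field definitions might look like this (a sketch only; "FileNameExact" is a hypothetical name):
{
"name": "FileName",
"type": "Edm.String",
"searchable": true,
"retrievable": true,
"analyzer": "keyword-analyzer"
},
{
"name": "FileNameExact",
"type": "Edm.String",
"searchable": true,
"retrievable": true,
"analyzer": null
}
An exact-match query can then be scoped to the un-analyzed copy, e.g. by adding &searchFields=FileNameExact to the search request, while partial-match queries keep using FileName.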

Traversing JSON objects with tSQL - OPENJSON

I have the following JSON code. I pull the values into a SQL database using OPENJSON, but I am having trouble with the path for the Refund object.
I am trying to pull the "amount" value in the "transactions" object (so the expected value should be 298.47).
SQL code (currently returns only null values)
OPENJSON(@json)
WITH (
OtherJSONstuff varchar(100) '$.otherjsonstuff',
Refund int '$.refund[0].transactions.amount' -- what should this be?
)
JSON Code
"otherjsonstuff": othervalues
"otherjsonstuff": othervalues
"object": [
{
"id": 212,
"items": [
{
"id": 151,
"quantity": 3,
"item_id": 926,
"subtotal": 30.0,
"tax": 0.0,
"item": {
"id": 926,
"quantity": 3,
"price": "10.00",
"product_id": 934,
"properties": [],
"discount": "0.00",
"tax": []
}
}
],
"action": [
{
"id": 537,
"amount": "298.47", --this is the line I need
"kind": "refund",
"created": "2016-12-13",
"location_id": null,
"parent_id": 537,
}
],
}
],
Having reformatted the JSON code, it should be:
$.refund[0].transactions[0].amount
Depending on which array element you want to access, just increment or decrement the index values. The root cause is understanding the JSON hierarchy, and JNevil has provided you some good resources.
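As a minimal runnable sketch of the same idea against the sample JSON above (note the sample's keys are actually object and action; the @json variable name is just for illustration):
DECLARE @json nvarchar(max) = N'{"object":[{"id":212,"action":[{"id":537,"amount":"298.47","kind":"refund"}]}]}';
-- each [n] steps into the n-th (zero-based) element of the corresponding array
SELECT Refund
FROM OPENJSON(@json)
WITH (
Refund decimal(10, 2) '$.object[0].action[0].amount'
);
-- returns 298.47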

Replace newlines between two words

I have output from a text file as below. I want to put all the contents of each someItems array on one line, so every line would have the contents of one someItems array. For example:
"someItems": [
{
"someId": "MountSomers-showtime.com-ETTI0000000000000003-1452005472058",
"source": "MountSomers",
"sourceAssetId": "9",
"title": "Pk_3",
"ppp": "12",
"expirationDate": "2016-01-06T14:51:12Z"
}, {
"someId": "MountSomers-ericsson.com- ETTI0000000000000005-1452005472058",
"source": "MountSomers",
"sourceAssetId": "12",
"title": "Pk_5",
"ppp": "12",
"expirationDate": "2016-01-06T14:51:12Z"
} ]
"someItems": [
{
"someId": "MountSomers-hbo.com-ETTI0000000000000002-1452005472058",
"source": "MountSomers",
"sourceAssetId": "7",
"title": "Pk_2",
"ppp": "12",
"expirationDate": "2016-01-06T14:51:12Z"
}, {
"someId": "MountSomers-showtime.com-ETTI0000000000000003-1452005472058",
"source": "MountSomers",
"sourceAssetId": "9",
"title": "Pk_3",
"ppp": "12",
"expirationDate": "2016-01-06T14:51:12Z"
}, {
"someId": "MountSomers-ericsson.com-ETTI0000000000000005-1452005472058",
"source": "MountSomers",
"sourceAssetId": "12",
"title": "Pk_5",
"ppp": "12",
"expirationDate": "2016-01-06T14:51:12Z"
} ]
would become
"someItems": [ ..... ]
"someItems": [ ..... ]
I have the below
cat file | awk '/^"someItems": \[/{p=1}/^\]/{p=0} {if(p)printf "%s",$0;else printf "%s%s\n",(NR==1?"":RS),$0}'
but it does not do what I wanted...
Since the input contains the brackets [] only at the outer level, the solution can be pretty simple:
awk '{gsub("\n","", $0)}1' RS=']\n' ORS=']\n' file
I'm using ]\n as the input record separator, which gives you the whole portion from "someItems": ... up to the closing ] as $0. gsub() simply removes the newlines, and 1 prints the (modified) record; setting ORS to the same string puts the ] and the line break back on output.
You can also use sed:
sed '/\[/{:a;N;/]/!ba;s/\n//g}' file
I'll explain it in a multiline version:
script.sed:
# Address. Matches a line containing the opening [
/\[/ { # Start of block
# Define a label 'a'
:a
# Read a new line and append it to the pattern buffer
N
# If the pattern buffer doesn't contain the closing ]
# jump back to label 'a'
/]/!ba
# Replace all newlines once the closing bracket appeared
# Since we don't jump back to 'a' in this case, this means we'll
# leave the block and start a new cycle.
s/\n//g
} # End of block
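Assuming the multiline version is saved as script.sed, it can be run with:
sed -f script.sed file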
$ awk '/^"someItems":/ && f { printf "\n" } { printf $0; f=1 } END { printf "\n" }' file.txt
"someItems": [{ "someId": "MountSomers-showtime.com-ETTI0000000000000003-1452005472058", "source": "MountSomers", "sourceAssetId": "9", "title": "Pk_3", "ppp": "12", "expirationDate": "2016-01-06T14:51:12Z"}, { "someId": "MountSomers-ericsson.com- ETTI0000000000000005-1452005472058", "source": "MountSomers", "sourceAssetId": "12", "title": "Pk_5", "ppp": "12", "expirationDate": "2016-01-06T14:51:12Z"} ]
"someItems": [{ "someId": "MountSomers-hbo.com-ETTI0000000000000002-1452005472058", "source": "MountSomers", "sourceAssetId": "7", "title": "Pk_2", "ppp": "12", "expirationDate": "2016-01-06T14:51:12Z"}, { "someId": "MountSomers-showtime.com-ETTI0000000000000003-1452005472058", "source": "MountSomers", "sourceAssetId": "9", "title": "Pk_3", "ppp": "12", "expirationDate": "2016-01-06T14:51:12Z"}, { "someId": "MountSomers-ericsson.com-ETTI0000000000000005-1452005472058", "source": "MountSomers", "sourceAssetId": "12", "title": "Pk_5", "ppp": "12", "expirationDate": "2016-01-06T14:51:12Z"} ]
$
Print each line without a trailing newline. Starting with the second occurrence, put a leading newline before each "someItems". Print a newline at the end to keep it classy.

ElasticSearch - return the complete value of a facet for a query

I've recently started using ElasticSearch and am trying to implement some use cases. I have a problem with one of them.
I have indexed some users with their full name (e.g. "Jean-Paul Gautier", "Jean De La Fontaine").
I am trying to get all the full names matching some query. For example, I want the 100 most frequent full names beginning with "J":
{
"query": {
"query_string" : { "query": "full_name:J*" } }
},
"facets":{
"name":{
"terms":{
"field": "full_name",
"size":100
}
}
}
}
The result I get is all the words of the full names: "Jean", "Paul", "Gautier", "De", "La", "Fontaine".
How do I get "Jean-Paul Gautier" and "Jean De La Fontaine" (all the full_name values beginning with 'J')? The "post_filter" option does not do this; it only restricts the above subset.
Do I have to configure how this full_name facet works?
Do I have to add some options to the current query?
Do I have to do some "mapping" (very obscure to me for the moment)?
Thanks
You just need to set "index": "not_analyzed" on the field, and you will be able to get back the full, unmodified field values in your facet.
Typically, it's nice to have one version of the field that isn't analyzed (for faceting) and another that is (for searching). The "multi_field" field type is useful for this.
So in this case, I can define a mapping as follows:
curl -XPUT "http://localhost:9200/test_index/" -d'
{
"mappings": {
"people": {
"properties": {
"full_name": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed"
},
"full_name": {
"type": "string"
}
}
}
}
}
}
}'
Here we have two sub-fields. The one with the same name as the parent will be the default, so if you search against the "full_name" field, Elasticsearch will actually use "full_name.full_name". "full_name.untouched" will give you the facet results you want.
So next I add two documents:
curl -XPUT "http://localhost:9200/test_index/people/1" -d'
{
"full_name": "Jean-Paul Gautier"
}'
curl -XPUT "http://localhost:9200/test_index/people/2" -d'
{
"full_name": "Jean De La Fontaine"
}'
And then I can facet on each field to see what is returned:
curl -XPOST "http://localhost:9200/test_index/_search" -d'
{
"size": 0,
"facets": {
"name_terms": {
"terms": {
"field": "full_name"
}
},
"name_untouched": {
"terms": {
"field": "full_name.untouched",
"size": 100
}
}
}
}'
and I get back the following:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"facets": {
"name_terms": {
"_type": "terms",
"missing": 0,
"total": 7,
"other": 0,
"terms": [
{
"term": "jean",
"count": 2
},
{
"term": "paul",
"count": 1
},
{
"term": "la",
"count": 1
},
{
"term": "gautier",
"count": 1
},
{
"term": "fontaine",
"count": 1
},
{
"term": "de",
"count": 1
}
]
},
"name_untouched": {
"_type": "terms",
"missing": 0,
"total": 2,
"other": 0,
"terms": [
{
"term": "Jean-Paul Gautier",
"count": 1
},
{
"term": "Jean De La Fontaine",
"count": 1
}
]
}
}
}
As you can see, the analyzed field returns single-word, lower-cased tokens (when you don't specify an analyzer, the standard analyzer is used), and the un-analyzed sub-field returns the unmodified original text.
Here is a runnable example you can play with:
http://sense.qbox.io/gist/7abc063e2611846011dd874648fd1b77450b19a5
Try altering the mapping for "full_name":
"properties": {
"full_name": {
"type": "string",
"index": "not_analyzed"
}
...
}
not_analyzed means that the value is kept as is (capitals, spaces, dashes, etc.), so that "Jean De La Fontaine" stays findable and is not tokenized into "Jean", "De", "La", "Fontaine".
You can experiment with different analyzers using the _analyze API.
Notice what the standard one does to a multi-part name:
GET /_analyze?analyzer=standard
{'Jean Claude Van Dame'}
{
"tokens": [
{
"token": "jean",
"start_offset": 2,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "claude",
"start_offset": 7,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "van",
"start_offset": 14,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "dame",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 4
}
]
}