merging s3 manifest files using jq - amazon-s3

I have multiple s3 manifest files each corresponding to a date for a given date range. I am looking to merge all of the manifest files to generate a single manifest file, thus allowing me to perform a single Redshift copy.
manifest file 1:
{
  "entries": [
    {
      "url": "DFA/20161001/394007-OMD-Coles/dcm_account394007_activity_20160930_20161001_050403_294198927.csv.gz"
    }
  ]
}
manifest file 2:
{
  "entries": [
    {
      "url": "DFA/20161002/394007-OMD-Coles/dcm_account394007_activity_20161001_20161002_054043_294865863.csv.gz"
    }
  ]
}
I am looking for output like this:
{
  "entries": [
    {
      "url": "DFA/20161001/394007-OMD-Coles/dcm_account394007_activity_20160930_20161001_050403_294198927.csv.gz"
    },
    {
      "url": "DFA/20161002/394007-OMD-Coles/dcm_account394007_activity_20161001_20161002_054043_294865863.csv.gz"
    }
  ]
}
I did try
jq -s '.[]' "manifest_file1.json" "manifest_file2.json"
and other suggestions posted on Stack Overflow, but couldn't make it work.

One way is to read all the input files with inputs and rebuild a single entries array:
$ jq -n '{entries: [inputs.entries[]]}' manifest_file{1,2}.json
{
  "entries": [
    {
      "url": "DFA/20161001/394007-OMD-Coles/dcm_account394007_activity_20160930_20161001_050403_294198927.csv.gz"
    },
    {
      "url": "DFA/20161002/394007-OMD-Coles/dcm_account394007_activity_20161001_20161002_054043_294865863.csv.gz"
    }
  ]
}
Note that inputs was introduced in jq version 1.5. If your jq does not have inputs, you can use jq -s as follows:
$ jq -s '{entries: [.[].entries[]]}' manifest_file{1,2}.json
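With either variant, a whole date range can be merged in one go by passing a shell glob instead of listing each file and redirecting the result to a new manifest (the file names below are placeholders based on the ones in the question):
$ jq -n '{entries: [inputs.entries[]]}' manifest_file*.json > combined_manifest.json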

Alternatively, if by "merge" you mean combining the "entries" arrays into a single array by concatenating them, you can use reduce:
$ jq 'reduce inputs as $i (.; .entries += $i.entries)' manifest_file{1,2}.json
Which yields:
{
  "entries": [
    {
      "url": "DFA/20161001/394007-OMD-Coles/dcm_account394007_activity_20160930_20161001_050403_294198927.csv.gz"
    },
    {
      "url": "DFA/20161002/394007-OMD-Coles/dcm_account394007_activity_20161001_20161002_054043_294865863.csv.gz"
    }
  ]
}
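Once the merged manifest has been uploaded back to S3, the single Redshift COPY the question is aiming for would look roughly like the sketch below. The bucket, schema, table, and IAM role names are placeholders, and note that Redshift manifests expect each "url" entry to be a full s3://bucket/key path:
COPY my_schema.dcm_activity
FROM 's3://my-bucket/manifests/combined_manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
GZIP
CSV;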

Related

Replace text in specific position in line with system date

I have a file with a single line. I want to replace the text from position 188 through 197 (inclusive) with the system date (YYYY-MM-DD).
I tried this but it doesn't work:
sed 's/\(.\{188\}\)\([0-9-]\{10\}\)\(.*\)/\1$(date '+%Y-%m-%d')\188/g'
I want to use sed or anything else that works in a shell script.
The input file is:
{ "agent": { "run_as_user": "root" }, "logs": { "logs_collected": { "files": { "collect_list": [ { "file_path": "/home/ec2-user/logs/**", "log_group_name": "Staging", "log_stream_name": "2020-10-24", "timestamp_format": "[%Y-%m-%d %H:%M:%S]" } ] } } } }
...and in the output, I want to change only the date, as shown below.
{ "agent": { "run_as_user": "root" }, "logs": { "logs_collected": { "files": { "collect_list": [ { "file_path": "/home/ec2-user/logs/**", "log_group_name": "Staging", "log_stream_name": "2020-10-25", "timestamp_format": "[%Y-%m-%d %H:%M:%S]" } ] } } } }
Could you please try the following, written in GNU awk as per the attempt shown by the OP:
awk -v date="$(date +%Y-%m-%d)" '{print substr($0,1,187) date substr($0,198)}' Input_file
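If you prefer sed, a rough equivalent (a sketch, assuming the date really occupies columns 188-197 of the single line, as in the awk version) would be:
sed -E "s/^(.{187}).{10}/\1$(date +%Y-%m-%d)/" Input_file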

Is there a way to get Step Functions input values into EMR step Args

We are running batch spark jobs using AWS EMR clusters. Those jobs run periodically and we would like to orchestrate those via AWS Step Functions.
As of November 2019 Step Functions has support for EMR natively. When adding a Step to the cluster we can use the following config:
"Some Step": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
"Parameters": {
"ClusterId.$": "$.cluster.ClusterId",
"Step": {
"Name": "FirstStep",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"--class",
"com.some.package.Class",
"JarUri",
"--startDate",
"$.time",
"--daysToLookBack",
"$.daysToLookBack"
]
}
}
},
"Retry" : [
{
"ErrorEquals": [ "States.ALL" ],
"IntervalSeconds": 1,
"MaxAttempts": 1,
"BackoffRate": 2.0
}
],
"ResultPath": "$.firstStep",
"End": true
}
Within the Args list of the HadoopJarStep we would like to set arguments dynamically. For example, if the input of the state machine execution is:
{
"time": "2020-01-08",
"daysToLookBack": 2
}
The strings in the config starting with "$." should be replaced accordingly when executing the State Machine, and the step on the EMR cluster should run command-runner.jar spark-submit --class com.some.package.Class JarUri --startDate 2020-01-08 --daysToLookBack 2. But instead it runs command-runner.jar spark-submit --class com.some.package.Class JarUri --startDate $.time --daysToLookBack $.daysToLookBack.
Does anyone know if there is a way to do this?
Parameters let you define key-value pairs. Since the value of the "Args" key is an array, you won't be able to dynamically reference a specific element in the array; you would need to reference the whole array instead, for example "Args.$": "$.Input.ArgsArray".
So for your use case, the best way to achieve this would be to add a pre-processing state before calling this state. In the pre-processing state you can either call a Lambda function and format your input/output in code, or, for something as simple as adding a dynamic value to an array, use a Pass state to reformat the data; then, inside your Task state Parameters, you can use JSONPath to get the array you defined in the pre-processor. Here's an example:
{
  "Comment": "A Hello World example of the Amazon States Language using Pass states",
  "StartAt": "HardCodedInputs",
  "States": {
    "HardCodedInputs": {
      "Type": "Pass",
      "Parameters": {
        "cluster": {
          "ClusterId": "ValueForClusterIdVariable"
        },
        "time": "ValueForTimeVariable",
        "daysToLookBack": "ValueFordaysToLookBackVariable"
      },
      "Next": "Pre-Process"
    },
    "Pre-Process": {
      "Type": "Pass",
      "Parameters": {
        "FormattedInputsForEmr": {
          "ClusterId.$": "$.cluster.ClusterId",
          "Args": [
            { "Arg1": "spark-submit" },
            { "Arg2": "--class" },
            { "Arg3": "com.some.package.Class" },
            { "Arg4": "JarUri" },
            { "Arg5": "--startDate" },
            { "Arg6.$": "$.time" },
            { "Arg7": "--daysToLookBack" },
            { "Arg8.$": "$.daysToLookBack" }
          ]
        }
      },
      "Next": "Some Step"
    },
    "Some Step": {
      "Type": "Pass",
      "Parameters": {
        "ClusterId.$": "$.FormattedInputsForEmr.ClusterId",
        "Step": {
          "Name": "FirstStep",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args.$": "$.FormattedInputsForEmr.Args[*][*]"
          }
        }
      },
      "End": true
    }
  }
}
You can use the States.Array() intrinsic function. Your Parameters block becomes:
"Parameters": {
  "ClusterId.$": "$.cluster.ClusterId",
  "Step": {
    "Name": "FirstStep",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
      "Jar": "command-runner.jar",
      "Args.$": "States.Array('spark-submit', '--class', 'com.some.package.Class', 'JarUri', '--startDate', $.time, '--daysToLookBack', $.daysToLookBack)"
    }
  }
}
Intrinsic functions are documented in the Amazon States Language reference, but I don't think the documentation explains the usage very well. The code snippets provided in the Step Functions console are more useful.
Note that you can also do string formatting on the args using States.Format(). For example, you could construct a path using an input variable as the final path segment:
"Args.$": "States.Array('mycommand', '--path', States.Format('my/base/path/{}', $.someInputVariable))"

nested select query in elasticsearch

I have to convert the following query to Elasticsearch:
select * from index where observable not in (select observable from index where tags = 'whitelist')
I read that I should use a Filter in a Not Filter, but I don't understand how to do it.
Can anyone help me?
Thanks
EDIT:
I have to get all documents except those that have the 'whitelist' tag, but I also need to check that none of the blacklist elements are contained in the whitelist.
Your SQL query can be simplified to this:
select * from index where tags not in ('whitelist')
As a result the "corresponding" ES query would be
curl -XPOST localhost:9200/index/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must_not": {
            "terms": {
              "tags": [
                "whitelist"
              ]
            }
          }
        }
      }
    }
  }
}'
or another using the not filter instead of bool/must_not:
curl -XPOST localhost:9200/index/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "not": {
          "terms": {
            "tags": [
              "whitelist"
            ]
          }
        }
      }
    }
  }
}'
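Note that the filtered query and the not filter only exist in older Elasticsearch releases (filtered was deprecated in 2.0 and removed in 5.0). On current versions the equivalent should be a plain top-level bool query, something like:
curl -XPOST localhost:9200/index/_search -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "tags": ["whitelist"]
        }
      }
    }
  }
}'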

ElasticSearch: filtering documents based on field length

I read a couple of similar problems on SO and the suggested solutions did not work.
I want to find all documents where the word field is shorter than 8 characters.
I tried to do this using this query
{
  "query": {
    "match_all": {}
  },
  "filter": {
    "script": {
      "script": "doc['word'].length < 5"
    }
  }
}
What am I doing wrong? Am I missing something?
Any field used in a script is loaded entirely into memory (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html#_document_fields), so you may want to consider an alternative approach.
You can e.g. use the regexp-filter to just find terms of a certain length, with a pattern like .{0,4}.
Here's a runnable example you can play with: https://www.found.no/play/gist/2dcac474797b0b2b952a
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"word":"bar"}
{"index":{"_index":"play","_type":"type"}}
{"word":"barf"}
{"index":{"_index":"play","_type":"type"}}
{"word":"zip"}
'
# Do searches
# This will not match barf
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "query": {
    "filtered": {
      "filter": {
        "regexp": {
          "word": {
            "value": ".{0,3}"
          }
        }
      }
    }
  }
}
'
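On Elasticsearch 5.x and later the filtered query no longer exists; a rough modern equivalent (a sketch, assuming word is indexed with the default keyword sub-field word.keyword, and using .{0,7} to match the "shorter than 8" requirement) would be:
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "regexp": {
      "word.keyword": ".{0,7}"
    }
  }
}
'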

Define a dynamic not_analyzed field for a nested document

I have a document like the one below. The "tags" field is a nested document, and I want every child field of the tags document to be indexed as not_analyzed. The problem is that the fields in tags are dynamic; any tag is possible.
So how can I define a dynamic mapping for this?
{
  'level': 'info',
  'tags': {
    'content': u'Nov 6 11:07:10 ja10 Keepalived_healthcheckers: Adding service [172.16.08.105:80] to VS [172.16.1.21:80]',
    'id': 1755360087,
    'kid': '2012121316',
    'mailto': 'yanping3,chunying,pengjie',
    'route': 15,
    'service': 'LVS',
    'subject': 'LVS_RS',
    'upgrade': 'no upgrade configuration for this alert'
  },
  'timestamp': 1383707282.500464
}
I think you can use dynamic templates for this. For example, the following shell script creates a dynamic_mapping_test index with a dynamic template: any field indexed under tags.* gets a mapping of type: string and index: not_analyzed.
echo "Delete dynamic_mapping_test"
curl -s -X DELETE http://localhost:9200/dynamic_mapping_test?pretty ; echo ""
echo "Create dynamic_mapping_test with nested tags and dynamic_template"
curl -s -X POST http://localhost:9200/dynamic_mapping_test?pretty -d '{
  "mappings": {
    "document": {
      "dynamic_templates": [
        {
          "string_template": {
            "path_match": "tags.*",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ],
      "properties": {
        "tags": {
          "type": "nested"
        }
      }
    }
  }
}' ; echo ""
echo "Display mapping"
curl -s "http://localhost:9200/dynamic_mapping_test/_mapping?pretty" ; echo ""
echo "Index document with new property tags.content"
curl -s -X POST "http://localhost:9200/dynamic_mapping_test/document?pretty" -d '{
  "tags": {
    "content": "this CONTENT should not be analyzed"
  }
}' ; echo ""
echo "Refresh index"
curl -s -X POST "http://localhost:9200/dynamic_mapping_test/_refresh"
echo "Display mapping again"
curl -s "http://localhost:9200/dynamic_mapping_test/_mapping?pretty" ; echo ""
echo "Index document with new property tags.title"
curl -s -X POST "http://localhost:9200/dynamic_mapping_test/document?pretty" -d '{
  "tags": {
    "title": "this TITLE should not be analyzed"
  }
}' ; echo ""
echo "Refresh index"
curl -s -X POST "http://localhost:9200/dynamic_mapping_test/_refresh"; echo ""
echo "Display mapping again"
curl -s "http://localhost:9200/dynamic_mapping_test/_mapping?pretty" ; echo ""
I suggest mapping all strings as not_analyzed, and all numbers as long (also not_analyzed), because analyzed string fields (the default) take more memory and disk space. With the mapping below I reduced the index size, can still search fields by the full word, and can do range searches on the long type:
{
  "mappings": {
    "_default_": {
      "_source": {
        "enabled": true
      },
      "_all": {
        "enabled": false
      },
      "_type": {
        "index": "no",
        "store": false
      },
      "dynamic_templates": [
        {
          "el": {
            "match": "*",
            "match_mapping_type": "long",
            "mapping": {
              "type": "long",
              "index": "not_analyzed"
            }
          }
        },
        {
          "es": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ]
    }
  }
}
I don't think there is any way to specify a mapping while indexing the data. So, as an alternative, you can modify your tags document to use the following mapping:
{
  "tags": {
    "properties": {
      "tag_type": { "type": "string", "index": "not_analyzed" },
      "tag_value": { "type": "string", "index": "not_analyzed" }
    }
  }
}
Here, tag_type can contain any of the tag names (content, id, kid, mailto, etc.), and tag_value contains the actual value of the field named in tag_type.
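As an illustration (my own sketch, not from the OP), the tags object from the question would then be restructured as an array of tag_type/tag_value pairs, for example:
{
  "level": "info",
  "tags": [
    { "tag_type": "service", "tag_value": "LVS" },
    { "tag_type": "subject", "tag_value": "LVS_RS" },
    { "tag_type": "route", "tag_value": "15" }
  ],
  "timestamp": 1383707282.500464
}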