I am using AWS Data Pipeline to save a text file to my S3 bucket from RDS. I would like the file name to include the date and the hour, like:
myfile-YYYYMMDD-HH.txt
myfile-20140813-12.txt
I have specified my S3DataNode FilePath as:
s3://mybucketname/out/myfile-#{format(myDateTime,'YYYY-MM-dd-HH')}.txt
When I try to save my pipeline I get the following error:
ERROR: Unable to resolve myDateTime for object:DataNodeId_xOQxz
According to the AWS Data Pipeline documentation for date and time functions, this is the proper syntax for using the format function.
When I save the pipeline with a hard-coded date and time, I don't get this error and my file ends up in my S3 bucket and folder as expected.
My thinking is that I need to define "myDateTime" somewhere or use a NOW() function.
Can somebody tell me how to set "myDateTime" to the current time (e.g. NOW), or give me a workaround so I can format the current time to be used in my FilePath?
I am not aware of an exact equivalent of NOW() in Data Pipeline. I tried using makeDate with no arguments (just for fun) to see if that worked... it did not.
The closest are the runtime variables @scheduledStartTime, @actualStartTime, and @reportProgressTime.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-s3datanode.html
The following, for example, should work:
s3://mybucketname/out/myfile-#{format(@scheduledStartTime,'YYYY-MM-dd-HH')}.txt
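For reference, a minimal sketch of what the S3DataNode could look like in the pipeline JSON using that expression (the node and schedule ids here are made up):

{
  "id": "MyOutputDataNode",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "filePath": "s3://mybucketname/out/myfile-#{format(@scheduledStartTime,'YYYY-MM-dd-HH')}.txt"
}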
Just for fun, here is some more info on Parameters.
At the end of your pipeline JSON (click List Pipelines, select one, click Edit Pipeline, then click Export), you need to add a Parameters and/or Values object.
I use a myStartDate for backfill processes, which you can manipulate once it is passed in for ad hoc runs. You can give this a static default, but you can't set it to a dynamic value, so it is limited for regularly scheduled tasks. For real-time/scheduled dates, you need to use @scheduledStartTime, etc., as suggested. Here is a sample of setting up some Parameters and/or Values. Both show up under Parameters in the UI. These values can be used throughout your pipeline activities (shell, hive, etc.) with the #{myVariableToUse} notation.
"parameters": [
{
"helpText": "Put help text here",
"watermark": "This shows if no default or value set",
"description": "Label/Desc",
"id": "myVariableToUse",
"type": "string"
}
]
And for Values:
"values": {
"myS3OutLocation": "s3://some-bucket/path",
"myThreshold": "30000",
}
You cannot add these directly in the UI (yet) but once they are there you can change and save the values.
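As an illustration only, a value defined this way could then be referenced from an activity roughly like this (the activity id, resource ref and command below are made up, not from the original pipeline):

{
  "id": "MyCopyToS3Activity",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEc2Resource" },
  "command": "aws s3 cp /tmp/report.txt #{myS3OutLocation}/report-#{format(@scheduledStartTime,'YYYY-MM-dd-HH')}.txt"
}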
I have a requirement where blob storage has multiple files with the names file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only the files numbered 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Pass the Get Metadata output child items to a ForEach activity:
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach, with an expression that is true only for the required files:
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition is True, add a Copy data activity to copy the current item (file) to the sink, roughly as sketched below.
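Inside the True branch, the Copy activity can hand the current file to a parameterized source dataset, for example (the dataset parameter name fileName is an assumption):

Source dataset parameter fileName: @item().name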
I took a slightly different approach, using a Filter activity and the endsWith function.
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results; it depends on what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters for the lower and upper limits of the range.
Create a ForEach loop and pass the parameters to create a range [lowerlimit, upperlimit].
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to create a dynamic expression like
@concat('file_', string(item()), '.csv')
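For the ForEach items themselves, a range expression along these lines should work (the parameter names lowerlimit and upperlimit are assumptions):

@range(pipeline().parameters.lowerlimit, add(sub(pipeline().parameters.upperlimit, pipeline().parameters.lowerlimit), 1))

range() takes a start index and a count, hence the add/sub arithmetic to turn the inclusive upper limit into a count.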
So here is what I want as a module, in pseudocode:
IF UseCustom, Create AWS Launch Config With One Custom EBS Device and One Generic EBS Device
ELSE Create AWS Launch Config With One Generic EBS Device
I am aware that I can use the 'count' meta-argument within a resource to decide whether it is created or not... So I currently have:
resource "aws_launch_configuration" "basic_launch_config" {
  count = var.boolean ? 0 : 1
  # ...
}

resource "aws_launch_configuration" "custom_launch_config" {
  count = var.boolean ? 1 : 0
  # ...
}
Which is great, now it creates the right Launch configuration based on my 'boolean' variable... But in order to then create the AutoScalingGroup using that Launch Configuration, I need the Launch Configuration Name. I know what you're thinking, just output it and grab it, you moron! Well of course I'm outputting it:
output "name" {
description = "The Name of the Default Launch Configuration"
value = aws_launch_configuration.basic_launch_config.*.name
}
output "name" {
description = "The Name of the Custom Launch Configuration"
value = aws_launch_configuration.custom_launch_config.*.name
}
But from the higher level, where I'm calling the module that creates the Launch Configuration and then the Auto Scaling Group, how the heck do I know which output to use for passing into the ASG???
Is there a different way to grab the value I want that I'm overlooking? I'm new to Terraform and the whole no real conditional thing is really throwing me for a loop.
This seemed to be the cleanest way I could find, using a ternary operator:
output "name {
description = "The Name of the Launch Configuration"
value = "${(var.booleanVar) == 0 ? aws_launch_configuration.default_launch_config.*.name : aws_launch_configuration.custom_launch_config.*.name}
}
Let me know if there is a better way!
You can use the same variable you used to decide which resource to enable to select the appropriate result:
output "name" {
value = var.boolean ? aws_launch_configuration.custom_launch_config[0].name : aws_launch_configuration.basic_launch_config[0].name
}
Another option, which is a little more terse but arguably also a little less clear to a future reader, is to exploit the fact that you will always have one list of zero elements and one list with one element, like this:
output "name" {
value = concat(
aws_launch_configuration.basic_launch_config[*].name,
aws_launch_configuration.custom_launch_config[*].name,
)[0]
}
Concatenating these two lists will always produce a single-item list due to how the count expressions are written, and so we can use [0] to take that single item and return it.
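Either way the calling module only ever sees a single name output, so wiring it into the Auto Scaling Group is straightforward. A minimal sketch, assuming the launch configurations live in a child module called launch_config (the module path, variable names and ASG settings below are made up):

# Parent module: the child module exposes exactly one "name" output
module "launch_config" {
  source  = "./modules/launch_config"
  boolean = var.use_custom_ebs
}

resource "aws_autoscaling_group" "this" {
  name                 = "example-asg"
  launch_configuration = module.launch_config.name
  min_size             = 1
  max_size             = 3
  vpc_zone_identifier  = var.subnet_ids
}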
I am saving a file to blob storage in Data Factory V2. When I specify the location to save to, I am calling the file (for example) file1 and it saves in blob as file1, no problem. But can I use the dynamic content feature to append the datetime to the filename so it's something like file1_01-07-2019_14-30-00? (7th Jan 14:30:00, just in case it's awkward to read.) Alternatively, can I output the result (the filename) of the webhook activity to the next activity (the function)?
Thank you.
I couldn't get this to work without editing the copy pipeline JSON file directly (late 2018 - may not be needed anymore). You need dynamic code in the copy pipeline JSON and settings defined in the dataset for setting filename parameters.
In the dataset, define 'Parameters' for the folder path and/or filename (click '+ New' and give them any name you like), e.g. sourceFolderPath, sourceFileName.
Then in the dataset, under 'Connection', include the following in the 'File path' definition:
@dataset().sourceFolderPath and @dataset().sourceFileName on either side of the '/'
In the copy pipeline, click on 'Code' in the upper right corner of the pipeline window and look for the following code under the 'blob' object you want defined by a dynamic filename. If the 'parameters' code isn't included, add it to the JSON and click the 'Finish' button. This code may be needed in 'inputs', 'outputs' or both, depending on the dynamic files you are referencing in your flow. Below is an example where the output includes the date parameter in both the folder path and the file name (the date is set by a Trigger parameter):
"inputs": [
{
"referenceName": "tmpDataForImportParticipants",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "StgParticipants",
"type": "DatasetReference",
"parameters": {
"sourceFolderPath": {
"value": <derived value of folder path>,
"type": "Expression"
},
"sourceFileName": {
"value": <derived file name>,
"type": "Expression"
}
}
}
]
Derived value of folder path may be something like the following - this results in a folder path of yyyy/mm/dd within specified blobContainer:
"blobContainer/#{formatDateTime(pipeline().parameters.windowStart,'yyyy')}/#{formatDateTime(pipeline().parameters.windowStart,'MM')}/#{formatDateTime(pipeline().parameters.windowStart,'dd')}"
or it could be hardcoded e.g. "blobContainer/directoryPath" - don't include '/' at start or end of definition
Derived file name could be something like the following:
"#concat(string(pipeline().parameters.'_',formatDateTime(dataset().WindowStartTime, 'MM-dd-yyyy_hh-mm-ss'))>,'.txt')"
You can include any parameter set by the Trigger e.g. an ID value, account name, etc. by including pipeline().parameters.
Once you set up the copy activity and select your blob dataset as the sink, you need to put in a value for WindowStartTime; this can either be a literal timestamp, e.g. 1900-01-01T13:00:00Z, or a pipeline parameter.
Having a parameter is more helpful if you're setting up a schedule trigger, as you will be able to supply this WindowStartTime timestamp from when the trigger runs. For this you would use @trigger().scheduledTime as the value for the trigger's WindowStartTime parameter.
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers#trigger-type-comparison
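In the schedule trigger definition the mapping looks roughly like this (the pipeline and parameter names are placeholders):

"pipelines": [
    {
        "pipelineReference": {
            "referenceName": "CopyToBlobPipeline",
            "type": "PipelineReference"
        },
        "parameters": {
            "WindowStartTime": "@trigger().scheduledTime"
        }
    }
]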
You can add a dataset parameter such as WindowStartTime, which is in the format 2019-01-10T13:50:04.279Z. Then you would have something like the following for the dynamic filename:
@concat('file1_', formatDateTime(dataset().WindowStartTime, 'MM-dd-yyyy_hh-mm-ss'))
To use this in the copy activity, you will also need to add a pipeline parameter.
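So the chain ends up roughly as follows (the parameter names are just examples):

Pipeline parameter:         windowStart (value supplied by the trigger)
Copy activity sink dataset: WindowStartTime = @pipeline().parameters.windowStart
Dataset fileName (dynamic): @concat('file1_', formatDateTime(dataset().WindowStartTime, 'MM-dd-yyyy_hh-mm-ss'))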
I tried to load logs from Google Cloud Storage into BigQuery with the bq command, and I got this error: "Could not convert value to string".
My example data:
{"ids":"1234,5678"}
{"ids":1234}
My example schema:
[
{ "name":"ids", "type":"string" }
]
It seems the value can't be converted when a single ID appears without quotes.
The data is produced by fluent-plugin-s3; when more than one ID is joined by a comma the value is quoted as a string, but a single ID is written as a bare number.
How can I load this data into BigQuery?
Thanks in advance
Well, check out the various fluentd plugins that can help you, maybe:
https://github.com/lob/fluent-plugin-json-transform
https://github.com/tarom/fluent-plugin-typecast
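For example, with fluentd's built-in record_transformer filter you could force the field to a string before the logs are shipped. A minimal sketch, assuming the events are tagged s3.access (the tag and field name are assumptions):

<filter s3.access>
  @type record_transformer
  enable_ruby true
  <record>
    # always emit "ids" as a string, even when it is a single numeric id
    ids ${record["ids"].to_s}
  </record>
</filter>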
While reading in the knowledge center, the following is mentioned:
The TTL properties are not applied to data that already exists in the
Analytics Platform. You must set the TTL properties before you add
data.
So how can I remove existing logs before setting those properties?
You must use the Elastic Search delete APIs to remove existing documents from Worklight Analytics.
Before using any of the Elastic Search delete APIs it is advised to back up your data first, as misuse of the APIs or an undesired query will result in permanent data loss.
Below is an example of how to delete client logs in a specified date range, assuming your instance of Elastic Search is running on http://localhost:9500. This specific example deletes all client logs between October 1st and October 15th 2014.
curl -XDELETE 'http://localhost:9500/worklight/client_logs/_query' -d '
{
    "query": {
        "range": {
            "timestamp": {
                "gt": 1412121600000,
                "lt": 1413331200000
            }
        }
    }
}'
You can delete any type of document using the path http://localhost:9500/worklight/{document_type}. The types of documents are app_activities, network_activities, notification_activities, client_logs and server_logs.
When deleting documents, you can filter on two properties: "timestamp" or "daystamp", which are both represented in epoch time in milliseconds. Please note, "daystamp" is simply the first timestamp for the given day (i.e. 12:00AM). The range query also accepts the following parameters:
gte - greater than or equal to
gt - greater than
lte - less than or equal to
lt - less than
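For example, to delete all app_activities for a single day you could filter on "daystamp" with gte/lte (the epoch value below is a placeholder for the day you want to remove):

curl -XDELETE 'http://localhost:9500/worklight/app_activities/_query' -d '
{
    "query": {
        "range": {
            "daystamp": {
                "gte": 1412121600000,
                "lte": 1412121600000
            }
        }
    }
}'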
For more information refer to the Elastic Search delete and query APIs:
Delete by Query API
Queries
Range Query