I am getting schema validation warnings ("Value should be one of") when I have a YAML CloudFormation template open. It seems that IntelliJ/WebStorm validate YAML against remote JSON schemas when available, in this case apparently https://www.schemastore.org/json/ (as stated here: https://www.jetbrains.com/help/phpstorm/yaml.html#remote_json).
But for some reason a type as simple as a CloudFront distribution does not validate:
Type: AWS::CloudFront::Distribution, while for example Type: AWS::ECS::TaskDefinition is accepted fine. As far as I can tell, https://www.schemastore.org/json/ should be up to date. Is anyone else experiencing similar issues?
I also tried this plugin https://plugins.jetbrains.com/plugin/7371-aws-cloudformation, but that doesn't seem to work even when YAML validation by JSON schema is disabled:
I have a pipeline I need to cancel if it runs for too long. It could look something like this:
So in case the work takes longer than 10000 seconds, the pipeline will fail and cancel itself. The thing is, I can't get the web activity to work. I've tried something like this:
https://learn.microsoft.com/es-es/rest/api/synapse/data-plane/pipeline-run/cancel-pipeline-run
But it doesn't even work using the 'Try it' thing. I get this error:
{"code": "InvalidTokenAuthenticationAudience", "message": "Token Authentication failed with SecurityTokenInvalidAudienceException - IDX10214: Audience validation failed. Audiences: '[PII is hidden]'. Did not match: validationParameters.ValidAudience: '[PII is hidden]' or validationParameters.ValidAudiences: '[PII is hidden]'."}
Using this URL:
POST
https://{workspacename}.dev.azuresynapse.net/pipelineruns/729345a-fh67-2344-908b-345dkd725668d/cancel?api-version=2020-12-01
Also, using ADF it seemed quite easy to do this:
https://cloudsafari.ca/2020/09/data-engineering/Azure-DataFactory-Cancel-Pipeline-Run
Including authentication using a Managed Identity, though in the case of Synapse I'm not sure which resource I should use. Any idea on how to achieve what I want, or whether I'm doing something wrong?
Your URL is correct. Just check the following and then it should work:
Add the MSI of the workspace to the workspace resource itself with Role = Contributor
In the web activity, set the Resource to "https://dev.azuresynapse.net/" (without the quotes, obviously)
This was a bit buried in the docs, see last bullet of this section here: https://learn.microsoft.com/en-us/rest/api/synapse/#common-parameters-and-headers
NOTE: the REST API is unable to cancel pipelines run in DEBUG in Synapse (you'll get an error response saying pipeline with that ID is not found). This means for it to work, you have to first publish the pipelines and then trigger them.
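For reference, this is essentially the call the web activity ends up making. Here is a minimal Python sketch of the same request, assuming the azure-identity and requests packages; the workspace name and run ID are placeholders, and inside Synapse the workspace's managed identity plays the role that DefaultAzureCredential plays here:
import requests
from azure.identity import DefaultAzureCredential

# Placeholders: replace with your workspace name and the pipeline run ID to cancel.
workspace = "myworkspace"
run_id = "00000000-0000-0000-0000-000000000000"

# The token audience matches the Resource value set in the web activity.
credential = DefaultAzureCredential()
token = credential.get_token("https://dev.azuresynapse.net/.default")

url = ("https://{}.dev.azuresynapse.net/pipelineruns/{}/cancel"
       "?api-version=2020-12-01").format(workspace, run_id)
resp = requests.post(url, headers={"Authorization": "Bearer " + token.token})
resp.raise_for_status()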
I see there's no Cloud Scheduler type provided by GCP's Deployment Manager. I'd like to know the steps to create a template, a composite type or similar, to provide a Cloud Scheduler type. I know Google already provides an example of this.
If it's possible to do so in code, it could make use of the Python client library, although the documentation says that library is not available; I could inline it in the code.
I cannot think of a way to authenticate against the Google API to make such requests.
In short, my question is: how can I make a Deployment Manager type for Cloud Scheduler? I know it is somewhat vague; I just want to know whether it would be doable.
On the other hand, where can I find the official development status for this GCP service?
For completeness, here's the related GitHub issue too.
A Cloud Scheduler type is not supported yet, according to GCP's documentation.
I am not aware of any official development for this GCP service other than the one I linked above. That being said, I will create a feature request for your use case. Please add any additional details that I may have missed, and you may use the same thread to communicate with the Deployment Manager team.
I was looking for this functionality and thought I should give an up-to-date answer on the topic.
Thanks to https://stackoverflow.com/users/9253778/dany-l for the feature request which led me to this answer.
It looks like this functionality is indeed provided, just that the documentation has yet to be updated to reflect it.
Here's the snippet from https://issuetracker.google.com/issues/123013878:
- type: gcp-types/cloudscheduler-v1:projects.locations.jobs
  name: <YOUR_JOB_NAME_HERE>
  properties:
    parent: projects/<YOUR_PROJECT_ID_HERE>/locations/<YOUR_REGION_HERE>
    name: <YOUR_JOB_NAME_HERE>
    description: <YOUR_JOB_DESCRIPTION_HERE>
    schedule: "0 2 * * *" # daily at 2 am
    timeZone: "Europe/Amsterdam"
    pubsubTarget:
      topicName: projects/<YOUR_PROJECT_ID_HERE>/topics/<YOUR_EXPECTED_TOPIC_HERE>
      data: aGVsbG8hCg== # base64 encoded "hello!"
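If you would rather create the job from code, as the question mentions, here is a rough sketch using the google-cloud-scheduler Python client; the project, region, job and topic names are placeholders, and it assumes the package is installed and default credentials are available:
from google.cloud import scheduler_v1

# Placeholders: adjust project, region, job and topic names to your setup.
client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/europe-west1"

job = scheduler_v1.Job(
    name=parent + "/jobs/my-job",
    description="Daily job",
    schedule="0 2 * * *",  # daily at 2 am
    time_zone="Europe/Amsterdam",
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name="projects/my-project/topics/my-topic",
        data=b"hello!",
    ),
)

created = client.create_job(parent=parent, job=job)
print(created.name)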
You can use a general YAML file with Deployment Manager:
config.yaml:
resources:
- name: <<YOUR_JOB_NAME>>
  type: gcp-types/cloudscheduler-v1:projects.locations.jobs # Cloud Scheduler
  properties:
    parent: "projects/<<YOUR_PROJECT_NAME>>/locations/<<YOUR_LOCATION_ID>>"
    description: "<<JOB_DESCRIPTION_OPTIONAL>>"
    schedule: "* */2 * * *" # accepts 'cron' format
    http_target:
      http_method: "GET"
      uri: "<<URI_TO_YOUR_FUNCTION>>" # trigger link in cloud functions
You can even create a Pub/Sub topic and subscription with Deployment Manager as well; just add:
- name: <<TOPIC_NAME>>
  type: pubsub.v1.topic
  properties:
    topic: <<TOPIC_NAME>>
- name: <<NAME>>
  type: pubsub.v1.subscription
  properties:
    subscription: <<SUBSCRIPTION_NAME>>
    topic: $(ref.<<TOPIC_NAME>>.name)
    ackDeadlineSeconds: 600
NOTE: to get <<YOUR_LOCATION_ID>> use gcloud app describe.
To deploy use:
gcloud deployment-manager deployments create <<DEPLOYMENT_NAME>> --config=<<PATH_TO_YOUR_YAML_FILE>>
To delete use:
gcloud deployment-manager deployments delete <<DEPLOYMENT_NAME>> -q
For more properties on Cloud Scheduler read the documentation:
https://cloud.google.com/scheduler/docs/reference/rpc/google.cloud.scheduler.v1#google.cloud.scheduler.v1.HttpTarget
We have a .yml file defining the REST API, with many entries like this
/projects/{projectId}/jobs/{jobId}:
  parameters:
    - $ref: '#/parameters/projectId'
    - $ref: '#/parameters/jobId'
  get:
    summary: Get Job
    responses:
      200:
        description: Information retrieved successfully.
        schema:
          $ref: '#/definitions/Job'
The $ref items are not Ctrl-clickable in IDEA, although they could be.
The YAML and YAML/Ansible support plugins are installed and enabled.
For me they are Ctrl-clickable.
But maybe as a workaround: use Ctrl+B (Go to Declaration).
I've discovered, with the help of @Frederik here, that the Swagger plugin does the trick. I chose the one by "Zalando" for editing and not for code generation, as it has the highest rating.
I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3 the benefits of which are unclear to me.
You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',  # input format
                    'org.apache.hadoop.io.LongWritable',         # key class
                    'org.apache.hadoop.io.Text',                 # value class
                    conf={
                        'fs.s3n.awsAccessKeyId': '...',
                        'fs.s3n.awsSecretAccessKey': '...',
                    })
The problem was actually a bug in Amazon's boto Python module. The problem was related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was correctly read.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For the first problem, I would recommend first updating the AWS command line interface, boto and Spark every time something strange happens: this has "magically" solved a few issues already for me.
Here is a solution on how to read the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file which can be parsed with Python's configparser.
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'default' # your AWS profile to use
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
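One way to hand these values to Spark, as a rough sketch (it goes through the internal sc._jsc handle to reach the Hadoop configuration, and the bucket/key below are placeholders):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("s3-read")
sc = SparkContext(conf=conf)

# Pass the parsed credentials to Hadoop's s3n filesystem settings.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)

rdd = sc.textFile("s3n://my-bucket/some/key.log")  # placeholder bucket/key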
Setting environment variables could help.
In the Spark FAQ, under the question "How can I access data in S3?", they suggest setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems deprecated in favor of some "newAPIHadoopFile". The documentation on this is quite sketchy, and I feel like you need to know Scala/Java to really get to the bottom of what everything means.
In the meantime, I figured out how to actually get some S3 data into pyspark and I thought I would share my findings.
The Spark API documentation says that hadoopFile takes a dict that gets converted into a Java configuration (XML). I found the configuration for Java, which should reflect the values you should put into the dict: How to access S3/S3n from a local Hadoop installation.
bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)
config_dict = {"fs.s3n.awsAccessKeyId": "FOOBAR",
               "fs.s3n.awsSecretAccessKey": "BARFOO"}

rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',  # key class produced by TextInputFormat
                    'org.apache.hadoop.io.Text',          # value class (the line contents)
                    conf=config_dict)
This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.
One of our APIs accepts certificates from our users. With the current design, users dump raw certificate data into the payload and make a POST request with the content type set to application/x-pkcs12.
So essentially, our API is accepting raw bytes of a file in the body of the request.
If I try to define this API via Swagger, then I can't do so. Because, correct me if I'm wrong, the parameter of this operation will have to be 'in' body and the 'type' of this parameter would have to be file.
Swagger requires all body parameters to have the Schema object necessarily, and all parameters of type file should have 'in' value set to formData. Both of these requirements are contradictory to our case.
So my question is, is this Swagger's limitation? Or is this just bad API design, and should we be structuring/designing our API in some other way?
I'm fairly new to the world of APIs so I'm not sure which of the cases it is.
Thanks in advance.
I believe this can still be done. Your body parameter schema should have type []byte. When you call the API, your parameter value should be a base-64 encoded string of the file contents. This is similar to how you would send the contents of a binary .jpg file in the body of a request.
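If you take this approach, here is a small Python sketch of producing such a base64-encoded body; the file name and endpoint are placeholders:
import base64
import requests

# Placeholder file: read the certificate and base64-encode its raw bytes.
with open("client-cert.p12", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Placeholder endpoint: the base64 string is sent as the request body.
resp = requests.post("https://api.example.com/cert", data=encoded)
resp.raise_for_status()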
Swagger 2.0 allows parameters of type file. This seems to fit your use case.
parameters:
  - name: cert
    in: formData
    description: The certificate
    required: true
    type: file
Your scenario is supported in OpenAPI 3.0. The previous version, OpenAPI/Swagger 2.0, allowed file uploads using multipart/form-data requests only, but 3.0 supports uploading raw files as well.
paths:
  /cert:
    post:
      requestBody:
        required: true
        content:
          application/x-pkcs12:
            schema:
              type: string
              format: binary
      responses:
        ...
More info: File Upload
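As a client-side illustration of the definition above, here is a rough Python sketch that posts the raw certificate bytes; the file name and URL are placeholders:
import requests

# Placeholders: certificate file and API endpoint.
with open("client-cert.p12", "rb") as f:
    cert_bytes = f.read()

resp = requests.post(
    "https://api.example.com/cert",
    data=cert_bytes,
    headers={"Content-Type": "application/x-pkcs12"},
)
resp.raise_for_status()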