aws opensearch restore manual snapshot to different cluster - amazon-s3

Is there a way to copy an AWS OpenSearch manual snapshot to a different S3 bucket and restore it to a different cluster?
I have a Terraform deployment that installs an AWS OpenSearch cluster with manual snapshots to S3 configured. These work, and I can restore to the same cluster the snapshot was taken from. However, I can't find a solution for restoring to a different cluster. Do you have to share the original S3 bucket, or is it possible to copy the files to a new bucket (which would be ideal)?
When I copy the snapshots from the source bucket to the destination bucket and then run a restore, I get an error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "no permissions for [] and User [name=admin, backend_roles=[], requestedTenant=]"
      }
    ],
    "type" : "security_exception",
    "reason" : "no permissions for [] and User [name=admin, backend_roles=[], requestedTenant=]"
  },
  "status" : 403
}
This is the process I followed:
1: Run the following on the source cluster and find the latest snapshot:
GET /_snapshot/snapshot_repo/_all
2: Copy the snapshot files from the source cluster's S3 bucket to the destination cluster's S3 bucket.
3: Run a restore of the snapshot on the destination cluster:
POST _snapshot/snapshot_repo/snapshot-2022-11-15-11-01-59/_restore
Note: when I run a GET on the snapshots in the destination cluster, I do not get the list of copied snapshots (which I was hoping I would!).
Any ideas for a solution?
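For reference, my understanding is that the copied bucket has to be registered as a snapshot repository on the destination domain before its snapshots show up under GET _snapshot. A rough sketch of that registration via a signed request (the domain endpoint, bucket, region, and role ARN below are placeholders, not values from this setup) might look like:
# Sketch only: register the copied S3 bucket as a snapshot repository on the
# destination OpenSearch domain. Endpoint, bucket, region and role ARN are placeholders.
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "eu-west-1"                                                   # placeholder
host = "https://search-destination-domain.eu-west-1.es.amazonaws.com"  # placeholder
repo = "snapshot_repo"

creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, "es",
                   session_token=creds.token)

payload = {
    "type": "s3",
    "settings": {
        "bucket": "destination-snapshot-bucket",  # bucket the snapshot files were copied into
        "region": region,
        "role_arn": "arn:aws:iam::111122223333:role/OpenSearchSnapshotRole",  # role the domain assumes
    },
}

# Register the repository, then list the snapshots it contains.
print(requests.put(f"{host}/_snapshot/{repo}", auth=awsauth, json=payload).text)
print(requests.get(f"{host}/_snapshot/{repo}/_all", auth=awsauth).text)
With fine-grained access control enabled, the user signing this request also typically needs to be mapped to the manage_snapshots role and allowed iam:PassRole on the snapshot role, otherwise the registration itself can return a security_exception like the one above.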

Related

QuickSight dataset suddenly can't refresh from S3 anymore

I have a QuickSight dataset that has been pulling data from S3 via a manifest file for months, but since yesterday every refresh fails with the error below: FAILURE_TO_PROCESS_JSON_FILE= Error details: S3 Iterator: S3 problem reading data
I've double-checked the manifest file format and the S3 bucket permissions for QuickSight; everything seems fine, and nothing has changed on our end that would explain it suddenly stopping out of the blue.
Manifest file:
{
  "fileLocations": [
    {
      "URIPrefixes": [
        "https://s3.amazonaws.com/solar-dash-live/"
      ]
    }
  ],
  "globalUploadSettings": {
    "format": "JSON"
  }
}
The error in the email alert is different and says "Amazon QuickSight couldn't parse a manifest file as valid JSON." However, I have verified that the above JSON is valid.
Also, if I create a new dataset with the same manifest file, it shows the data in the preview tool; it is only the refresh that fails. So the manifest must be formatted correctly if QuickSight can pull data from S3 initially and only fails later.
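Since QuickSight reports a JSON parsing error even though the manifest looks valid, one thing worth ruling out is the copy of the file actually stored in S3 (encoding, a stray BOM, or a stale object). A rough check, assuming the manifest lives in the same bucket (the bucket and key below are guesses):
# Rough sanity check: fetch the manifest from S3 and confirm it parses as plain UTF-8 JSON.
# Bucket and key are guesses -- point them at wherever the manifest actually lives.
import json
import boto3

s3 = boto3.client("s3")
raw = s3.get_object(Bucket="solar-dash-live", Key="manifest.json")["Body"].read()

print("first bytes:", raw[:8])                   # a UTF-8 BOM shows up as b'\xef\xbb\xbf'
manifest = json.loads(raw.decode("utf-8-sig"))   # tolerates a BOM while checking validity
print(json.dumps(manifest, indent=2))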

JupyterHub server is unable to start in Terraformed EMR cluster running in private subnet

I'm creating an EMR cluster (emr-5.24.0) with Terraform, deployed into a private subnet, that includes Spark, Hive and JupyterHub.
I've added an additional configuration JSON to the deployment, which should add persistence for the Jupyter notebooks to S3 (instead of the local disk).
The overall architecture includes a VPC endpoint to S3 and I'm able to access the bucket I'm trying to write the notebooks to.
When the cluster is provisioned, the JupyterHub server is unable to start.
Logging into the master node and trying to start/restart the Docker container for JupyterHub does not help.
The configuration for this persistence looks like this:
[
  {
    "Classification": "jupyter-s3-conf",
    "Properties": {
      "s3.persistence.enabled": "true",
      "s3.persistence.bucket": "${project}-${suffix}"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
In the terraform EMR resource definition, this is then referenced:
configurations = "${data.template_file.configuration.rendered}"
This is read from:
data "template_file" "configuration" {
  template = "${file("${path.module}/templates/cluster_configuration.json.tpl")}"

  vars = {
    project = "${var.project_name}"
    suffix  = "bucket"
  }
}
When I don't use persistence for the notebooks, everything works fine and I am able to log into JupyterHub.
I'm fairly certain it's not an IAM policy issue, since the EMR cluster role policy Allow action is defined as "s3:*".
Are there any additional steps that need to be taken in order for this to function?
/K
It seems that Jupyter on EMR uses S3ContentsManager to connect to S3:
https://github.com/danielfrg/s3contents
I dug into S3ContentsManager a bit and found that the S3 endpoints it uses are the public ones (as expected). Since the S3 endpoint is public, Jupyter needs internet access, but you are running EMR in a private subnet, so I guess it cannot reach the endpoint.
You might need to use a NAT gateway in a public subnet or create an S3 endpoint for your VPC.
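Since the question mentions an S3 VPC endpoint already exists, it may also be worth confirming it is a gateway endpoint associated with the route table of the private subnet the cluster runs in. A rough boto3 check (the region and VPC ID below are placeholders):
# Rough check: list S3 endpoints in the VPC and the route tables they are attached to.
# Region and VPC ID are placeholders.
import boto3

region = "eu-west-1"                  # placeholder
vpc_id = "vpc-0123456789abcdef0"      # placeholder

ec2 = boto3.client("ec2", region_name=region)
resp = ec2.describe_vpc_endpoints(
    Filters=[
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": "service-name", "Values": [f"com.amazonaws.{region}.s3"]},
    ]
)
for ep in resp["VpcEndpoints"]:
    print(ep["VpcEndpointId"], ep.get("VpcEndpointType"), ep.get("RouteTableIds"))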
Yup, we ran into this too. Add an S3 VPC endpoint, then, per AWS support,
add a JupyterHub notebook config:
{
  "Classification": "jupyter-notebook-conf",
  "Properties": {
    "config.S3ContentsManager.endpoint_url": "\"https://s3.${aws_region}.amazonaws.com\"",
    "config.S3ContentsManager.region_name": "\"${aws_region}\""
  }
},
hth

Configuring Google cloud bucket as Airflow Log folder

We just started using Apache Airflow in our project for our data pipelines. While exploring its features, we came to know about configuring a remote folder as the log destination in Airflow. For that we:
Created a Google Cloud Storage bucket.
Created a new GS connection from the Airflow UI.
I am not able to understand all the fields. I just created a sample GCS bucket under my project from the Google console and gave that project ID to this connection, leaving the key file path and scopes blank.
Then I edited the airflow.cfg file as follows:
remote_base_log_folder = gs://my_test_bucket/
remote_log_conn_id = test_gs
After these changes I restarted the web server and scheduler, but my DAGs are still not writing logs to the GCS bucket. I can see the logs being created in base_log_folder, but nothing is created in my bucket.
Is there any extra configuration needed on my side to get this working?
Note: I am using Airflow 1.8. (I faced the same issue with Amazon S3 as well.)
Updated on 20/09/2017
I tried the GS method (screenshot attached).
I am still not getting logs in the bucket.
Thanks
Anoop R
I advise you to use a DAG to connect Airflow to GCP instead of the UI.
First, create a service account on GCP and download the JSON key.
Then execute this DAG (you can modify the scopes of your access):
import json
from datetime import datetime

from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator


def add_gcp_connection(ds, **kwargs):
    """Add an Airflow connection for GCP."""
    new_conn = Connection(
        conn_id='gcp_connection_id',
        conn_type='google_cloud_platform',
    )
    scopes = [
        "https://www.googleapis.com/auth/pubsub",
        "https://www.googleapis.com/auth/datastore",
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/devstorage.read_write",
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/cloud-platform",
    ]
    conn_extra = {
        "extra__google_cloud_platform__scope": ",".join(scopes),
        "extra__google_cloud_platform__project": "<name_of_your_project>",
        "extra__google_cloud_platform__key_path": '<path_to_your_json_key>'
    }
    conn_extra_json = json.dumps(conn_extra)
    new_conn.set_extra(conn_extra_json)

    session = settings.Session()
    if not (session.query(Connection)
                   .filter(Connection.conn_id == new_conn.conn_id)
                   .first()):
        session.add(new_conn)
        session.commit()
    else:
        msg = '\n\tA connection with `conn_id`={conn_id} already exists\n'
        msg = msg.format(conn_id=new_conn.conn_id)
        print(msg)


dag = DAG('add_gcp_connection', start_date=datetime(2016, 1, 1), schedule_interval='@once')

# Task to add a connection
AddGCPCreds = PythonOperator(
    dag=dag,
    task_id='add_gcp_connection_python',
    python_callable=add_gcp_connection,
    provide_context=True)
Thanks to Yu Ishikawa for this code.
Yes, you need to provide additional information for both the S3 and the GCP connection.
S3
Configuration is passed via the extra field as JSON. You can provide only a profile
{"profile": "xxx"}
or credentials
{"profile": "xxx", "aws_access_key_id": "xxx", "aws_secret_access_key": "xxx"}
or a path to a config file
{"profile": "xxx", "s3_config_file": "xxx", "s3_config_format": "xxx"}
In case of the first option, boto will try to detect your credentials.
Source code - airflow/hooks/S3_hook.py:107
GCP
You can either provide key_path and scope (see Service account credentials) or credentials will be extracted from your environment in this order:
Environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to a file with stored credentials information.
Stored "well known" file associated with gcloud command line tool.
Google App Engine (production and testing)
Google Compute Engine production environment.
Source code - airflow/contrib/hooks/gcp_api_base_hook.py:68
The reason for logs not being written to your bucket could be related to the service account rather than the Airflow configuration itself. Make sure it has access to the bucket in question; I had the same problems in the past.
Try adding more generous permissions to the service account, e.g. even project-wide Editor, and then narrowing them down. You could also try using the gs client with that key to see if you can write to the bucket.
For me personally, this scope works fine for writing logs: "https://www.googleapis.com/auth/cloud-platform"

Where does ElasticSearch store persistent settings?

When I get my ElasticSearch server settings via
curl -XGET localhost:9200/_cluster/settings
I see persistent and transient settings.
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "0",
    "threadpool.index.size": "20",
    "threadpool.search.size": "30",
    "cluster.routing.allocation.disable_allocation": "false",
    "threadpool.bulk.size": "40"
  },
  "transient": {}
}
If I set a persistent setting, it isn't saved to my config/elasticsearch.yml file. So my question is: when my server restarts, how does it know what my persistent settings are?
Don't tell me not to worry about it, because I almost lost my entire cluster's worth of data when it picked up all the settings in my config file after it restarted, NOT the persistent settings shown above :)
Persistent settings are stored on each master-eligible node in the global cluster state file, which can be found in the Elasticsearch data directory: data/CLUSTER_NAME/nodes/N/_state, where CLUSTER_NAME is the name of the cluster and N is the node number (0 if this is the only node on this machine). The file name has the following format: global-NNN where NNN is the version of the cluster state.
Besides persistent settings this file may contain other global metadata such as index templates. By default the global cluster state file is stored in the binary SMILE format. For debugging purposes, if you want to see what's actually stored in this file, you can change the format of this file to JSON by adding the following line to the elasticsearch.yml file:
format: json
Every time cluster state changes, all master-eligible nodes store the new version of the file, so during cluster restart the node that starts first and elects itself as a master will have the newest version of the cluster state. What you are describing could be possible if you updated the settings when one of your master-eligible nodes was not part of the cluster (and therefore couldn't store the latest version with your settings) and after the restart this node became the cluster master and propagated its obsolete settings to all other nodes.
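For completeness, persistent settings are applied through the cluster settings API rather than elasticsearch.yml, which is why they end up in the cluster state file described above. A minimal illustration (the host and setting value are just examples):
# Minimal illustration: apply a persistent setting via the cluster settings API
# and read it back. Host and value are examples only.
import requests

requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.cluster_concurrent_rebalance": "2"}},
)

# The setting now lives in the cluster state on the master-eligible nodes, not in
# elasticsearch.yml, and survives a full cluster restart.
print(requests.get("http://localhost:9200/_cluster/settings").json())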

Elasticsearch Writes Into S3 Bucket for Metadata But Doesn't Write Into S3 Bucket For Scheduled Snapshots?

According to the documentation at http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html, I have included the respective AWS access key ID with its secret access key, and Elasticsearch is able to write to the S3 bucket as follows:
[2012-08-02 04:21:38,793][DEBUG][gateway.s3] [Schultz, Herman] writing to gateway org.elasticsearch.gateway.shared.SharedStorageGateway$2#4e64f6fe ...
[2012-08-02 04:21:39,337][DEBUG][gateway.s3] [Schultz, Herman] wrote to gateway org.elasticsearch.gateway.shared.SharedStorageGateway$2#4e64f6fe, took 543ms
However, when it comes to writing snapshots to the S3 bucket, the following error appears:
[2012-08-02 04:25:37,303][WARN ][index.gateway.s3] [Schultz, Herman] [plumbline_2012.08.02][3] failed to read commit point [commit-i] java.io.IOException: Failed to get [commit-i]
Caused by: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: E084E2ED1E68E710, AWS Error Code: InvalidAccessKeyId, AWS Error Message: The AWS Access Key Id you provided does not exist in our records.
[2012-08-02 04:36:06,696][WARN ][index.gateway] [Schultz, Herman] [plumbline_2012.08.02][0] failed to snapshot (scheduled) org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [plumbline_2012.08.02][0] Failed to perform snapshot (index files)
Is there a reason why this is happening, given that the access keys I have provided are able to write metadata but not to create snapshots?
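One way to narrow this down is to verify, outside Elasticsearch, that the exact key pair configured for the gateway can write to that bucket; InvalidAccessKeyId on the snapshot path while metadata writes succeed suggests the snapshot code path is picking up a different (or stale) key. A rough check (key values and bucket name below are placeholders):
# Rough credentials check: confirm the configured key pair can write to the gateway bucket.
# Key values and bucket name are placeholders -- use the same values as in elasticsearch.yml.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",      # placeholder
    aws_secret_access_key="...",      # placeholder
)
s3.put_object(Bucket="my-es-gateway-bucket", Key="diagnostic/test.txt", Body=b"ok")
print(s3.list_objects_v2(Bucket="my-es-gateway-bucket", Prefix="diagnostic/")["KeyCount"])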