Kubeflow installation on existing EKS cluster with cert-manager already installed - amazon-eks

I have an EKS cluster (version 1.18) in the pre-production environment.
Now I would like to use this cluster to install Kubeflow (version 1.4).
Unfortunately, when I try to install it with the kfctl apply -V -f kfctl_aws.yaml
I get this error:
WARN[0024] Encountered error applying application cert-manager-crds: (kubeflow.error): Code 500 with message: Apply.Run : [error when applying patch:
{"metadata":{"annotations":{"cert-manager.io/inject-ca-from-secret":null,"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"apiextensions.k8s.io/v1beta1\",\"kind\":\"CustomResourceDefinition\",\"metadata\":{\"annotations\":{},\"name\":\"certificaterequests.cert-manager.io\"},\"spec\":{\"additionalPrinterColumns\":[{\"JSONPath\":\".status.conditions[?(#.type==\\\"Ready\\\")].status\",\"name\":\"Ready\",\"type\":\"string\"},{\"JSONPath\":\".spec.issuerRef.name\",\"name\":\"Issuer\",\"priority\":1,\"type\":\"string\"},{\"JSONPath\":\".status.conditions[?(#.type==\\\"Ready\\\")].message\",\"name\":\"Status\",\"priority\":1,\"type\":\"string\"},{\"JSONPath\":\".metadata.creationTimestamp\",\"description\":\"CreationTimestamp is a timestamp representing the server time when this object was created. It is not guaranteed to be set in happens-before order across separate operations. Clients may not set this value. It is represented in RFC3339 form and is in UTC.\",\"name\":\"Age\",\"type\":\"date\"}],\"group\":\"cert-manager.io\",\"names\":{\"kind\":\"CertificateRequest\",\"listKind\":\"CertificateRequestList\",\"plural\":\"certificaterequests\",\"shortNames\":[\"cr\",\"crs\"],\"singular\":\"certificaterequest\"},\"scope\":\"Namespaced\",\"subresources\":{\"status\":{}},\"validation\":{\"openAPIV3Schema\":{\"description\":\"CertificateRequest is a type to represent a Certificate Signing Request\",\"properties\":{\"apiVersion\":{\"description\":\"APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources\",\"type\":\"string\"},\"kind\":{\"description\":\"Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated.....
Could this be because I already have cert-manager installed? If so, how can I skip this step in the installation process?
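If the failure does come from the cert-manager that is already on the cluster, one thing to try is removing the cert-manager entries from the KfDef before running kfctl. The sketch below is only an illustration, not a tested fix. It assumes the KfDef has already been loaded into a Python dict (for example with a YAML parser), that applications are listed under spec.applications with a name field, and that every cert-manager entry has a name starting with cert-manager, as cert-manager-crds does in the error above. The function name and the trimmed example dict are hypothetical. Whether Kubeflow then works with the cert-manager version already installed is a separate question this does not answer.

```python
from copy import deepcopy


def drop_cert_manager_apps(kfdef: dict, prefix: str = "cert-manager") -> dict:
    """Return a copy of a KfDef-like dict without applications whose name starts with prefix."""
    result = deepcopy(kfdef)
    apps = result.get("spec", {}).get("applications", [])
    kept = [app for app in apps if not app.get("name", "").startswith(prefix)]
    if "spec" in result:
        result["spec"]["applications"] = kept
    return result


if __name__ == "__main__":
    # Hypothetical, heavily trimmed KfDef; a real kfctl_aws.yaml has many more entries.
    example = {
        "kind": "KfDef",
        "spec": {
            "applications": [
                {"name": "cert-manager-crds"},
                {"name": "cert-manager"},
                {"name": "istio"},
            ]
        },
    }
    trimmed = drop_cert_manager_apps(example)
    print([app["name"] for app in trimmed["spec"]["applications"]])
```

The input dict is left untouched, so the original and the trimmed definitions can be compared before anything is applied to the cluster.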

Related

Hyperledger Fabric - backup and restore

I'm using Hyperledger Fabric and now I'm trying to make a backup of the current situation and restore it on a different computer.
I'm following the procedure found in hyperledger-fabric-backup-and-restore.
The main steps being:
Copy the crypto-config and the channel-artifacts directory
Copy the content of all peers and orderer containers
Modify the docker-compose.yaml to link containers volumes to the local directory where I have the backup copy.
Yet it's not working properly in my case: when I restart the network with ./byfn.sh up, all the containers first come up and run correctly, but then whatever operation I try to execute on the channel (peer channel create, peer channel join, peer channel update) fails with this error:
Error: got unexpected status: BAD_REQUEST -- error applying config update to existing channel 'mychannel': error authorizing update: error validating ReadSet: proposed update requires that key [Group] /Channel/Application be at version 0, but it is currently at version 1
Is there anything I should do that is not mentioned in hyperledger-fabric-backup-and-restore?
I got the same error while trying to create a channel. Bringing the network down and then up again solved my problem.
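The error text itself describes a version check: the config update expects /Channel/Application to be at version 0, while the restored channel already has it at version 1. That would be consistent with replaying a channel operation that the backed-up ledger has already recorded. The following is a toy model of such a check, not Fabric code; the function name and the data shapes are invented for illustration.

```python
def validate_read_set(read_set: dict, current: dict) -> list:
    """Return one error string per key whose expected version differs from the current one."""
    errors = []
    for key, expected in read_set.items():
        actual = current.get(key)
        if actual != expected:
            errors.append(
                f"proposed update requires that key {key} be at version {expected}, "
                f"but it is currently at version {actual}"
            )
    return errors


if __name__ == "__main__":
    # The update was computed against version 0; the restored channel is at version 1.
    read_set = {"[Group] /Channel/Application": 0}
    current = {"[Group] /Channel/Application": 1}
    for message in validate_read_set(read_set, current):
        print(message)
```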

How to overwrite the api proxy deployment using apigeetool

I am using the command below in Jenkins to deploy API proxies to Apigee Edge.
apigeetool deployproxy -u abc -o nonprod -e dev -n poc-jenkins1 -p xyz
But I am getting the error below.
Error: Path /poc-deployment-automation conflicts with existing deployment path for revision 1 of the APIProxy poc-deploy-automation in organization nonprod, environment dev
Here is my requirement; please help me find the right command to use.
If the API doesn't exist in the target environment, create the API in the new environment with version 1.
If the API already exists in the target environment, create the API in the new environment with a new version (previous version + 1).
So what command should I use to fix the above error, and what should I use for the two tasks above?
Help appreciated.
The apigeetool deployproxy command supports your requirements by default. It deploys revision 1 if there is no proxy with that name, and increments the revision if the proxy already exists.
However, based on the error you mentioned, it seems that you have a path conflict between two proxies. You are trying to deploy a proxy to the /poc-deployment-automation basepath, but another proxy called poc-deploy-automation is already listening on the same basepath. This is not possible, even though the proxy names differ, because the basepath is what Apigee uses to route traffic to your proxy.
Check the XML file at the root of your proxy and change the basepath attribute.
Also, the basepath of an API proxy can be anything, but two proxies deployed at the same time cannot share it: only one proxy can be deployed on a given basepath at a time. The revision numbers are irrelevant in this situation.
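The rule described above can be modelled in a few lines. This is a sketch of the rule only, not an Apigee API call: the function name is made up, and the mapping of deployed proxy names to basepaths is assumed to come from whatever listing of the environment you have available.

```python
def find_basepath_conflict(deployed: dict, name: str, basepath: str):
    """Return the name of another deployed proxy already using basepath, or None."""
    wanted = basepath.rstrip("/")
    for other_name, other_path in deployed.items():
        # A new revision of the same proxy may reuse its own basepath.
        if other_name != name and other_path.rstrip("/") == wanted:
            return other_name
    return None


if __name__ == "__main__":
    # Names and paths taken from the error message in the question.
    deployed = {"poc-deploy-automation": "/poc-deployment-automation"}
    print(find_basepath_conflict(deployed, "poc-jenkins1", "/poc-deployment-automation"))
    print(find_basepath_conflict(deployed, "poc-deploy-automation", "/poc-deployment-automation"))
```

The first call reports the conflicting proxy; the second returns None because redeploying the same proxy on its own basepath only creates a new revision.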

ERROR: The overall deployment failed because too many individual instances failed deployment

I'm trying to deploy using CircleCI -> S3 -> CodeDeploy -> EC2.
I was able to upload the deploy image to S3 from CircleCI, but I am unable to deploy it from S3 to the EC2 instance. Here's the error.
The overall deployment failed because too many individual instances
failed deployment, too few healthy instances are available for
deployment, or some instances in your deployment group are
experiencing problems. (Error code: HEALTH_CONSTRAINTS)
The error came from CodeDeploy. I can't figure out why it happens or how to fix it.
I'd appreciate any advice.
If you are running on Ubuntu, there can be many reasons. Here is a checklist you can go through:
Check that the CodeDeploy agent is installed on your EC2 instance. Please refer to this document to install the CodeDeploy agent.
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent-operations-install-ubuntu.html
$ sudo service codedeploy-agent status
If you are running Ubuntu 20.x and you get this error
./install:22:in block in method_missing': undefined method path' for
#<IO:> (NoMethodError)
try running the install file like this
sudo ./install auto > /tmp/logfile
Check that the EC2 instance has a CodeDeploy role: create a CodeDeploy role and assign it to the instance, https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-service-role.html.
If you assign the EC2 role after the instance has been launched, restart the server.
Check your appspec.yml file placement as per the top answer, and try to avoid any long timeouts in it.
Log into your instance and check the error log
$ tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log
You should be able to figure out what caused the individual instances to fail by digging into the deployment instance details:
http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-view-instance-details.html
These should contain more detailed information about why your application was unable to be deployed.
This error is commonly due to problems in the configuration of the appspec.yml or appspec.json file (depending on the format you are using).
If you have any hooks, I recommend removing them and checking whether the deployment works; then add the hooks back one by one so you can identify the one that fails.
The appspec.yml file should be located at the root of your project:
│-- appspec.yml
│-- index.html
└-- scripts
    │-- install_dependencies
    │-- start_server
    └-- stop_server
In the scripts folder, place the scripts that you want to be executed for each hook.
Here is an example of the appspec.yml file:
version: 0.0
os: linux
files:
  - source: /index.html
    destination: /var/www/html/
hooks:
  BeforeInstall:
    - location: scripts/install_dependencies
      timeout: 300
      runas: root
    - location: scripts/start_server
      timeout: 300
      runas: root
  ApplicationStop:
    - location: scripts/stop_server
      timeout: 300
      runas: root
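Before uploading a revision, the layout described above can be checked locally. The following is a minimal sketch under a few assumptions: the helper name is hypothetical, the list of hook script paths is passed in by hand rather than parsed from appspec.yml, and it only checks that files exist, not that CodeDeploy will accept them.

```python
import tempfile
from pathlib import Path


def check_bundle(root: Path, hook_scripts: list) -> list:
    """Return a list of layout problems found in a revision directory."""
    problems = []
    # The agent looks for appspec.yml at the root of the unpacked revision.
    if not (root / "appspec.yml").is_file():
        problems.append("appspec.yml is missing from the bundle root")
    # Every hook location is a path relative to the bundle root.
    for script in hook_scripts:
        if not (root / script).is_file():
            problems.append(f"hook script not found: {script}")
    return problems


if __name__ == "__main__":
    hooks = [
        "scripts/install_dependencies",
        "scripts/start_server",
        "scripts/stop_server",
    ]
    # An empty temporary directory stands in for a broken bundle.
    with tempfile.TemporaryDirectory() as tmp:
        for problem in check_bundle(Path(tmp), hooks):
            print(problem)
```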
I hope I can help you 😃👻🕺🏾
Make sure the CodeDeploy host agent service is running on your target EC2 instance.
The error you are facing is a generic message raised when any lifecycle event fails, for example BeforeBlockTraffic, BlockTraffic, or ApplicationStop.
If the first event (BeforeBlockTraffic) failed, the first step is to check whether the CodeDeploy agent is running.
As you can see in the screenshot below, the event failure message would tell you the exact error behind.
From the failed deployments, I can see all lifecycle events were skipped. Instance i-0bcc36e73851297f2 is currently in the Stopped state, and I can see that the IAM instance profile is missing. Your Amazon EC2 instances need permission to access the Amazon S3 buckets or GitHub repositories where the applications that will be deployed by AWS CodeDeploy are stored. To launch Amazon EC2 instances that are compatible with AWS CodeDeploy, you must create an additional IAM role, an instance profile [1].
For such failures, you can always begin with the general troubleshooting checklist for a failed deployment [2] and then look at the troubleshooting guides on deployment issues and instance issues [3].
[1] http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-iam-instance-profile.html
[2] http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting-general.html
[3] http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html
Check the status of the Code Deploy Agent. In my case, the agent wasn't up.
Please check the role given to the EC2 machine (where the agent is running). It should have S3 access as well. This resolved my issue.
"The CodeDeploy agent did not find an AppSpec file within the unpacked revision directory at revision-relative path 'appspec.yml'"
To solve this error, place your appspec.yml file in the root folder, so that the agent can find it and access your before and after scripts.

SCOPES_WARNING in BigQuery when accessed from a Cloud Compute instance

Every time I use bq on a Cloud Compute instance, I get this:
/usr/local/share/google/google-cloud-sdk/platform/bq/third_party/oauth2client/contrib/gce.py:73: UserWarning: You have requested explicit scopes to be used with a GCE service account.
Using this argument will have no effect on the actual scopes for tokens
requested. These scopes are set at VM instance creation time and
can't be overridden in the request.
warnings.warn(_SCOPES_WARNING)
This is a default f1-micro instance with Debian 8. I gave this instance access to all Cloud APIs, and its service account is also an owner of a project. I ran gcloud init, but the warning persists.
Is there something wrong?
I noticed that this warning did not appear on an older instance running SDK version 0.9.85; however, I now get it when creating a new instance or upgrading to the latest gcloud SDK.
The scopes warning can be safely ignored, as it's just telling you that the only scopes that will be used are the ones specified at instance creation time, which is the expected behavior of the default GCE service account.
It seems the 'bq' tool doesn't distinguish between the default service account on GCE and a regular service account and always tries to set the scopes explicitly. The warning comes from oauth2client, and it looks like it didn't display this warning in versions prior to v2.0.0.
I've created a public issue to track this, which you can star to get updates:
https://code.google.com/p/google-bigquery/issues/detail?id=557
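Since the message is an ordinary Python UserWarning, it can be filtered in Python code that you control and that triggers the same oauth2client warning. The sketch below uses only the standard warnings module. The helper name is made up, the noisy function is a stand-in that emits the same message text, and none of this changes how the bq command-line tool itself behaves.

```python
import warnings

# Start of the warning text quoted in the question.
SCOPES_MESSAGE = "You have requested explicit scopes"


def call_quietly(func, *args, **kwargs):
    """Call func while ignoring UserWarnings whose text starts with SCOPES_MESSAGE."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=SCOPES_MESSAGE, category=UserWarning)
        return func(*args, **kwargs)


def noisy():
    """Stand-in for a library call that emits the scopes warning."""
    warnings.warn(
        "You have requested explicit scopes to be used with a GCE service account."
    )
    return "done"


if __name__ == "__main__":
    print(call_quietly(noisy))
```

The filter is scoped to the with block, so other warnings and later calls are reported as usual.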

How to do kerberos authentication on a flink standalone installation?

I have a standalone Flink installation on top of which I want to run a streaming job that writes data into an HDFS installation. The HDFS installation is part of a Cloudera deployment and requires Kerberos authentication in order to read from and write to HDFS. Since I found no documentation on how to make Flink connect to a Kerberos-protected HDFS, I had to make some educated guesses about the procedure. Here is what I have done so far:
I created a keytab file for my user.
In my Flink job, I added the following code:
UserGroupInformation.loginUserFromKeytab("myusername", "/path/to/keytab");
Finally, I am using a TextOutputFormat to write data to the HDFS.
When I run the job, I'm getting the following error:
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1730)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1668)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1593)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397)
at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:405)
For some odd reason, Flink seems to try SIMPLE authentication, even though I called loginUserFromKeytab. I found another similar issue on Stack Overflow (Error with Kerberos authentication when executing Flink example code on YARN cluster (Cloudera)) which had an answer explaining that:
Standalone Flink currently only supports accessing Kerberos secured HDFS if the user is authenticated on all worker nodes.
That may mean that I have to do some authentication at the OS level, e.g. with kinit. Since my knowledge of Kerberos is very limited, I have no idea how to do that. I would also like to understand how a program running after kinit actually knows which Kerberos ticket to pick from the local cache when there is no configuration whatsoever regarding this.
I'm not a Flink user, but based on what I've seen with Spark & friends, my guess is that "Authenticated on all worker nodes" means that each worker process has
a core-site.xml config available on local fs with hadoop.security.authentication set to kerberos (among other things)
the local dir containing core-site.xml added to the CLASSPATH so that it is found automatically by the Hadoop Configuration object [it will revert silently to default hard-coded values otherwise, duh]
implicit authentication via kinit and the default cache [TGT set globally for the Linux account, impacts all processes, duh]
## or ##
implicit authentication via kinit and a "private" cache set thru KRB5CCNAME env variable (Hadoop supports only "FILE:" type)
## or ##
explicit authentication via UserGroupInformation.loginUserFromKeytab() and a keytab available on the local fs
That UGI "login" method is incredibly verbose, so if it was indeed called before Flink tries to initiate the HDFS client from the Configuration, you will notice. On the other hand, if you don't see the verbose stuff, then your attempt to create a private Kerberos TGT is bypassed by Flink, and you have to find a way to bypass Flink :-/
You can also configure your standalone cluster to handle authentication for you without additional code in your jobs.
Export HADOOP_CONF_DIR and point it to the directory where core-site.xml and hdfs-site.xml are located
Add to flink-conf.yaml
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: <path to keytab>
security.kerberos.login.principal: <principal>
env.java.opts: -Djava.security.krb5.conf=<path to krb5 conf>
Add the pre-bundled Hadoop to the lib directory of your cluster: https://flink.apache.org/downloads.html
The only dependencies you should need in your jobs are:
compile "org.apache.flink:flink-java:$flinkVersion"
compile "org.apache.flink:flink-clients_2.11:$flinkVersion"
compile "org.apache.hadoop:hadoop-hdfs:$hadoopVersion"
compile "org.apache.hadoop:hadoop-client:$hadoopVersion"
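A quick way to catch a half-filled configuration is to check that the keytab and the principal both have values. The sketch below is a hedged illustration: the helper names are invented, the parser only understands flat key: value lines like the ones listed above (it is not a YAML parser), and the rule that both keys must be set together is an assumption about how this setup is meant to be used.

```python
REQUIRED_WITH_KEYTAB = (
    "security.kerberos.login.keytab",
    "security.kerberos.login.principal",
)


def parse_flat_conf(text: str) -> dict:
    """Parse flat 'key: value' lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def missing_kerberos_keys(conf: dict) -> list:
    """Return the keytab-related keys that are absent or empty."""
    return [key for key in REQUIRED_WITH_KEYTAB if not conf.get(key)]


if __name__ == "__main__":
    # Hypothetical config text with the principal left out.
    sample = (
        "security.kerberos.login.use-ticket-cache: false\n"
        "security.kerberos.login.keytab: /path/to/keytab\n"
    )
    print(missing_kerberos_keys(parse_flat_conf(sample)))
```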
In order to access a secured HDFS or HBase installation from a standalone Flink installation, you have to do the following:
Log into the server running the JobManager, authenticate against Kerberos using kinit and start the JobManager (without logging out or switching the user in between).
Log into each server running a TaskManager, authenticate again using kinit and start the TaskManager (again, with the same user).
Log into the server from which you want to start your streaming job (often, it's the same machine running the JobManager), log into Kerberos (with kinit) and start your job with /bin/flink run.
In my understanding, kinit logs in the current user and creates a file somewhere in /tmp with some login data. The mostly static class UserGroupInformation looks up that file with the login data when it's loaded for the first time. If the current user is authenticated with Kerberos, the information is used to authenticate against HDFS.
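As for how a program knows which ticket cache to use without any configuration: clients typically honour the KRB5CCNAME environment variable and otherwise fall back to a per-user default file. The sketch below models that lookup. It assumes the common Linux default of FILE:/tmp/krb5cc_<uid>, which a site can change in krb5.conf, and the function name is hypothetical; it does not read or validate any ticket.

```python
import os


def ticket_cache_location(env=None, uid=None) -> str:
    """Return the credential cache a Kerberos client would most likely consult."""
    env = os.environ if env is None else env
    explicit = env.get("KRB5CCNAME")
    if explicit:
        # A "private" cache chosen by the caller wins over the default.
        return explicit
    uid = os.getuid() if uid is None else uid
    # Assumed default: one cache file per Unix user id under /tmp.
    return f"FILE:/tmp/krb5cc_{uid}"


if __name__ == "__main__":
    print(ticket_cache_location({}, uid=1000))
    print(ticket_cache_location({"KRB5CCNAME": "FILE:/tmp/flink_cache"}, uid=1000))
```

Because the default depends only on the user id, a JobManager or TaskManager started by the same user after kinit sees the same cache, which matches the steps above.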