MLflow Artifacts Storing But Not Listing In UI - amazon-s3

I've run into an issue using MLflow server. When I first ran the command to start an mlflow server on an ec2 instance, everything worked fine. Now, although logs and artifacts are being stored to postgres and s3, the UI is not listing the artifacts. Instead, the artifact section of the UI shows:
Loading Artifacts Failed
Unable to list artifacts stored under <s3-location> for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory.
But when I check in s3, I see the artifact in the s3 location that the error shows. What could possibly have started causing this as it used to work not too long ago and nothing was changed on the ec2 that is hosting mlflow?

I found the answer. The error was that mlflow could not find boto3, so a conda installation of that worked. The logs for this were buried and hard to find in stdout.

Related

GitLab Runner fails to upload artifacts with "invalid argument" error

I'm completely new to trying to implement GitLab's CI/CD pipelines, but it's been going quite well. In fact, for my ASP.NET project, if I specify a Publish Profile in the msbuild command that uses Web Deploy, it actually deploys the code successfully to the web server.
However, I'm now wanting to have the "build" job create artifacts which are uploaded to GitLab that I can then subsequently deploy. We're using a self-hosted instance of GitLab, for which I'm not an admin, but I can speak to the admin if I know what I'm asking for!
So I've configured my gitlab-ci.yml file like this:
variables:
NUGET_PATH: 'C:\Program Files\Nuget\Nuget.exe'
NUGET_SOURCES: 'https://api.nuget.org/v3/index.json'
MSBUILD_PATH: 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\Bin\msbuild.exe'
stages:
- build
build-job:
variables:
CI_DEBUG_TRACE: "true"
stage: build
script:
- '& "$env:NUGET_PATH" restore ApplicationTemplate.sln -Source "$env:NUGET_SOURCES"'
- '& "$env:MSBUILD_PATH" ApplicationTemplate\ApplicationTemplate.csproj /p:DeployOnBuild=true /p:Configuration=Release /p:PublishProfile=FolderPublish.pubxml'
artifacts:
paths:
- '.\ApplicationTemplate\bin\Release\Publish\'
The output shows that this builds the code just fine, and it also seems to successfully find the artifacts for upload. However, when it uploads the artifacts, even though the request gets a 200 OK response, the process fails. Here is the log output:
So, it finds the artifacts, it attempts to upload them and even gets a 200 OK response (in contrast to the handful of similar reports of this error I've been able to find online), but it still fails due to an invalid argument.
I've already enabled verbose debugging, as you can see from the output, but I'm none the wiser. Looking at the GitLab Runner entries in the Windows Event Log on the box where the runner is hosted doesn't shed any light on things either. The total size of the artifacts is 61.1MB, so I don't think my issue is related to that.
Can anyone see from this output what's invalid? Can I identify which argument is invalid and/or why it's invalid?
Edit: Things I've tried
Specifying a value for artifacts:expire_in.
Setting artifacts:public to FALSE, since I'm using a self-hosted GitLab environment and the default value for this setting (TRUE) is not valid in such an environment.
Trying every format I can think of for the value of the artifacts:paths setting (this seems to be incredibly robust - regardless of the format I use, the Runner seems to have no problem parsing it and finding the files to upload).
Taking a cue from this question I created a new project with a very simple build job to upload a single file:
stages:
- build
build-job:
variables:
CI_DEBUG_TRACE: "true"
stage: build
script:
- echo "Test" > test.txt
artifacts:
paths:
- test.txt
About 50% of the time this job hangs on the uploading of the artifacts and I have to cancel it. The other half of the time it fails in exactly the same way as the my previous project:
After countless hours working on this, it seems that ultimately the issue was that our internal Web Application Firewall was blocking some part of the transfer of artefacts to the server, or the response back from it. With the WAF reconfigured not to block traffic from the machine running the GitLab Runner, the artefacts are successfully uploaded and the job succeeds.
This would have been significantly easier to diagnose if the logging from GitLab was better. As per my comment on this issue, it should be possible to see the content of the response from the GitLab server after uploading artefacts, even when the response code is 200.
What's strange - and made diagnosing the issue even harder - is that when I worked through the issue with the admin of our GitLab instance, digging through logs and running it in debug mode, the artefact upload process was uploading something successfully. We could see, for example, the GitLab Runner's log had been uploaded to the server. Clearly the WAF's blocking was selective and didn't block everything in both directions.

Gitlab-CI: AWS S3 deploy is failing

I am trying to create a deployment pipeline for Gitlab-CI on a react project. The build is working fine and I use artifacts to store the dist folder from my yarn build command. This is working fine as well.
The issue is regarding my deployment with command: aws s3 sync dist/'bucket-name'.
Expected: "Done in x seconds"
Actual:
error Command failed with exit code 2. info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command. Running after_script 00:01 Uploading artifacts for failed job 00:01 ERROR: Job failed: exit code 1
The files seem to have been uploaded correctly to the S3 bucket, however I do not know why I get an error on the deployment job.
When I run the aws s3 sync dist/'bucket-name' locally everything works correctly.
Check out AWS CLI Return Codes
2 -- The meaning of this return code depends on the command being run.
The primary meaning is that the command entered on the command line failed to be parsed. Parsing failures can be caused by, but are not limited to, missing any required subcommands or arguments or using any unknown commands or arguments. Note that this return code meaning is applicable to all CLI commands.
The other meaning is only applicable to s3 commands. It can mean at least one or more files marked for transfer were skipped during the transfer process. However, all other files marked for transfer were successfully transferred. Files that are skipped during the transfer process include: files that do not exist, files that are character special devices, block special device, FIFO's, or sockets, and files that the user cannot read from.
The second paragraph might explain what's happening.
There is no yarn build command. See https://classic.yarnpkg.com/en/docs/cli/run
As Anton mentioned, the second paragraph of his answer was the problem. The solution to the problem was removing special characters from a couple SVGs. I suspect uploading the dist folder as an artifact(zip) might have changed some of the file names altogether which was confusing to S3. By removing ® and + from the filename the issue was resolved.

yarn usercache dir not resolved properly when running an example application

I am using Hadoop 3.2.0 and trying to run a simple application in a docker container and I have made the required configuration changes both in yarn-site.xml and container-executor.cfg to choose LinuxContainerExecutor and docker runtime.
I use the example of distributed shell in one of the hortonworks blog. https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
The problem I face here is when the application is submitted to YARN it fails with a reason related to directory creation issue with the below error
2019-02-14 20:51:16,450 INFO distributedshell.Client: Got application
report from ASM for, appId=2, clientToAMToken=null,
appDiagnostics=Application application_1550156488785_0002 failed 2
times due to AM Container for appattempt_1550156488785_0002_000002
exited with exitCode: -1000 Failing this attempt.Diagnostics:
[2019-02-14 20:51:16.282]Application application_1550156488785_0002
initialization failed (exitCode=20) with output: main : command
provided 0 main : user is myuser main : requested yarn user is
myuser Failed to create directory
/data/yarn/local/nmPrivate/container_1550156488785_0002_02_000001.tokens/usercache/myuser
- Not a directory
I have configured yarn.nodemanager.local-dirs in yarn-site.xml and I can see the same reflected in YARN web ui localhost:8088/conf
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/yarn/local</value>
<final>false</final>
<source>yarn-site.xml</source>
</property>
I do not understand why is it trying to create usercache dir inside the nmPrivate directory.
Note : I have verified the permissions for myuser to the directories and also have tried clearing the directories manually as suggested in a related post. But no fruit. I do not see any additional information about container launch failure in any other logs.
How do I debug why the usercache dir is not resolved properly??
Really appreciate any help on this.
Realized that this is all because of the users the services were started with and the permissions to the directories the services work on.
After making sure the required changes are done, I am able to seamlessly run the examples and other applications..
Thanks Hadoop user community for the direction. Adding the link here for more details.
http://mail-archives.apache.org/mod_mbox/hadoop-user/201902.mbox/browser

Can't deploy using serverless framework from Windows 10

Attempt to deploy via serverless framework using Windows 10 fails:
C:\Users\xxxxxx>sls deploy --verbose Serverless: Packaging
service... Serverless: Excluding development dependencies...
Error --------------------------------------------------
EPERM: operation not permitted, scandir
'C:\Users\xxxxxx\AppData\Local\ElevatedDiagnostics' For debugging
logs, run again after setting the "SLS_DEBUG=*" environment variable.
Your Environment Information ----------------------------- OS: win32
Node Version: 6.11.2 Serverless Version: 1.19.0
Tried again with command prompt under elevated privileges:
EBUSY: resource busy or locked, scandir
'C:\Users\xxxxxx\AppData\Local\Microsoft\InputPersonalization\TextHarvester\WaitList.dat'
I assumed there was a permissions issue at first so I retried with the command prompt at full admin mode but just ran into the the second error. My research suggested an issue with windows search so I turned it off (and also all background apps). Trying again (and again) I just ran into more similar issues and am unable to deploy anything. Anyone had similar issues and found a way around them?
I worked it out finally, so in case anyone else encounters this issue here is a summary. There seem to be 2 issues:
Don't create functions in your root folder. Create a specific folder for your serverless function i.e. not in C:\Users\nnnnnn> but within your regular document storage. In Windows 10 it works nicely if you use a OneDrive folder, with the benefit that your function(s) are also then replicated to other dev machines that you might use (and are automatically backed up offsite).
More importantly, the serverless framework seems to have an issue if you attempt to deploy to a region other than the default region set in your aws CLI configuration. I've no idea why this should be since the credentials I use with the AWS CLI are authorised for all regions. I also have no idea why the issue should result in serverless attempting to access a whole series of windows files for which it has no authority but nevertheless...
In my case, I primarily use region ap-southeast-2. By default, SLS CREATE generates a serverless.yml using a default US region. If this is left as-is, there is then a mismatch between the deployment region and your AWS CLI region. Not good. To avoid the minor pain of having to specify a deployment region in the SLS deploy command, just update the deployment region in the serverless.yml file to match the CLI region.
Now works a treat...

Atlassian Bamboo: First plan with a simple job of downloading a local git repo

I just downloaded the free trial of Bamboo continuous integration server, and created the first plan with nothing but downloading the source code from the git. I have a local git repository on the bamboo machine so the git URL is pointing to a local path.
The problem is that when I run the job, it never finishes even after waiting for an hour. This is the last lines of the activity log:
07-Apr-2011 20:03:23 Checking out revision f9dc82500914333ed4bbdae5ed038771fd658c3c.
07-Apr-2011 20:03:23 Creating local git repository in '/home/bob/bamboo-home/xml-data/build-dir/DEV-DEV-1/.git'.
From the shell I can go to the directory shown in the log and see that the source code were cloned correctly to the bamboo working directory. But the job will never finish and the log will not have any more update from here. I have to manually terminate the job. Any ideas? Do I miss something?
Just a guess, since the Bamboo instance we have at work pulls from Accurev and not Git, and I've never run into this problem myself - but it may be hung because there isn't a builder defined for that plan. You might try defining a builder (even if it's one that you know will fail) just to see if it makes it to that next step.
I had very similar problem.
It's not very original solution but I just uninstalled bamboo and installed it again.. Now it works now