How can I make a SageMaker processing job run with a multi-record manifest file?

I am trying to run a simple SageMaker job, just to make sure I can run multiple steps from a manifest file.
The Dockerfile is very simple:
FROM python:latest
COPY ./main.py /main.py
ENTRYPOINT [ "python", "/main.py"]
main.py
print(1)
The manifest file is named input.json and is located at s3://bucket/input.json.
This is the file:
[{"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}]
When running the job from the SageMaker processing job console I used these parameters:
Input mode: File
S3 data type: ManifestFile
URI: s3://bucket/input.json
local_path: /opt/ml/processing/sdklofjdslkfj
I would expect it to run 10 times, once for each record in the file, but when I go to the logs I only see it print 1 once.
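For reference, I believe the console settings above correspond roughly to this SageMaker Python SDK call (just a sketch; the image URI, role and instance type are placeholders rather than my real values):
from sagemaker.processing import Processor, ProcessingInput

# Placeholder image URI, role and instance type.
processor = Processor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-image:latest",
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    inputs=[
        ProcessingInput(
            source="s3://bucket/input.json",                 # the manifest URI
            destination="/opt/ml/processing/sdklofjdslkfj",  # the local_path from the console
            s3_data_type="ManifestFile",
            s3_input_mode="File",
        )
    ]
)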
Questions:
1. How do I make it run once for every record, and pass parameters for each run?
2. Why do I need a local path for the manifest file?
Thanks!

Related

Filepulse Connector error with S3 provider (Source Connector)

I am trying to poll CSV files from S3 buckets using the FilePulse source connector. When the task starts I get the following error. What additional libraries do I need to add to make this work from an S3 bucket? Config file below.
Where did I go wrong?
Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:208)
java.nio.file.FileSystemNotFoundException: Provider "s3" not installed
at java.base/java.nio.file.Path.of(Path.java:212)
at java.base/java.nio.file.Paths.get(Paths.java:98)
at io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalFileStorage.exists(LocalFileStorage.java:62)
Config file:
{
    "name": "FilePulseConnector_3",
    "config": {
        "connector.class": "io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector",
        "filters": "ParseCSVLine, Drop",
        "filters.Drop.if": "{{ equals($value.artist, 'U2') }}",
        "filters.Drop.invert": "true",
        "filters.Drop.type": "io.streamthoughts.kafka.connect.filepulse.filter.DropFilter",
        "filters.ParseCSVLine.extract.column.name": "headers",
        "filters.ParseCSVLine.trim.column": "true",
        "filters.ParseCSVLine.seperator": ";",
        "filters.ParseCSVLine.type": "io.streamthoughts.kafka.connect.filepulse.filter.DelimitedRowFilter",
        "fs.cleanup.policy.class": "io.streamthoughts.kafka.connect.filepulse.fs.clean.LogCleanupPolicy",
        "fs.cleanup.policy.triggered.on": "COMMITTED",
        "fs.listing.class": "io.streamthoughts.kafka.connect.filepulse.fs.AmazonS3FileSystemListing",
        "fs.listing.filters": "io.streamthoughts.kafka.connect.filepulse.fs.filter.RegexFileListFilter",
        "fs.listing.interval.ms": "10000",
        "file.filter.regex.pattern": ".*\\.csv$",
        "offset.policy.class": "io.streamthoughts.kafka.connect.filepulse.offset.DefaultSourceOffsetPolicy",
        "offset.attributes.string": "name",
        "skip.headers": "1",
        "topic": "connect-file-pulse-quickstart-csv",
        "tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalRowFileInputReader",
        "tasks.file.status.storage.class": "io.streamthoughts.kafka.connect.filepulse.state.KafkaFileObjectStateBackingStore",
        "tasks.file.status.storage.bootstrap.servers": "172.27.157.66:9092",
        "tasks.file.status.storage.topic": "connect-file-pulse-status",
        "tasks.file.status.storage.topic.partitions": 10,
        "tasks.file.status.storage.topic.replication.factor": 1,
        "tasks.max": 1,
        "aws.access.key.id": "<<>>",
        "aws.secret.access.key": "<<>>",
        "aws.s3.bucket.name": "mytestbucketamtrak",
        "aws.s3.region": "us-east-1"
    }
}
What should I put in the libraries to make this work? Note: the Lenses connector sources from the S3 bucket without issues, so it's not a credentials issue.
As mentioned in the comments by @OneCricketeer, github.com/streamthoughts/kafka-connect-file-pulse/issues/382 points to the root cause.
Modifying the config file to use this property sourced the file:
"tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.AmazonS3RowFileInputReader"

Bazel extension that allows loading a file from S3 into a BUILD file

Currently I have a list of dictionaries in a .bzl file:
test_data = [
    { "name": "test", "data": "test_data"}
]
I load that in a BUILD file and perform some magic with a list comprehension:
[
    foo(name = data["name"], data = data["data"])
    for data in test_data
]
I need to be able to pull this file in from S3 and provide its contents to the BUILD file the same way I do with the static .bzl file.
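One direction I have been considering (just a sketch, untested; the rule name and attribute here are made up): a repository rule that downloads the object over plain HTTPS, e.g. from a public bucket or via a presigned URL, and exposes the .bzl file so a BUILD file can load() it.
# s3_data.bzl -- hypothetical sketch: fetch the data .bzl from S3 over HTTPS
# (public object or presigned URL) into an external repository.
def _s3_bzl_repo_impl(rctx):
    rctx.download(
        url = rctx.attr.url,  # e.g. "https://my-bucket.s3.amazonaws.com/test_data.bzl"
        output = "test_data.bzl",
    )
    # An (empty) BUILD file makes the repository a valid package.
    rctx.file("BUILD", "")

s3_bzl_repo = repository_rule(
    implementation = _s3_bzl_repo_impl,
    attrs = {
        "url": attr.string(mandatory = True),
    },
)
In WORKSPACE this would be instantiated as s3_bzl_repo(name = "test_data_repo", url = "..."), and a BUILD file could then load("@test_data_repo//:test_data.bzl", "test_data"). The caveat is that rctx.download does no AWS request signing, so a private bucket would need a presigned URL or some other fetch mechanism.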

Using CDSAPI in Google Colab

I installed a Python library called cdsapi in Google Colab.
To use it I need to locate its config file (which in a general Linux system is $HOME/.cdsapirc) and add my account key to it.
More details can be found here (https://cds.climate.copernicus.eu/api-how-to).
I am having a problem with this step
Copy the code displayed beside, in the file $HOME/.cdsapirc (in your
Unix/Linux environment): url: {api-url} key: {uid}:{api-key}
I tried using !cd /home/ in the Colab notebook, but it doesn't contain this file.
I have also tried !cat /home/.cdsapirc, it gave error:
cat: /home/.cdsapirc: No such file or directory
I achieved this successfully. My code in Colab is as follows:
First, create '.cdsapirc' in the root dir and write your key to it:
url = 'url: https://cds.climate.copernicus.eu/api/v2'
key = 'key: your uid and key'
with open('/root/.cdsapirc', 'w') as f:
    f.write('\n'.join([url, key]))
with open('/root/.cdsapirc') as f:
    print(f.read())
Then, install cdsapi:
!pip install cdsapi
Run example:
import cdsapi

c = cdsapi.Client()
c.retrieve("reanalysis-era5-pressure-levels",
           {
               "variable": "temperature",
               "pressure_level": "1000",
               "product_type": "reanalysis",
               "year": "2008",
               "month": "01",
               "day": "01",
               "time": "12:00",
               "format": "grib"
           },
           "/target/dir/download.grib")
The target dir could be your Google Drive folder.
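(If you go the Google Drive route, mount Drive in Colab first; a quick sketch, with the usual /content/drive mount point:)
from google.colab import drive

# Mount Google Drive so the retrieve() target path can point into it,
# e.g. '/content/drive/MyDrive/download.grib'.
drive.mount('/content/drive')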
You can specify your UID, API key and CDS API endpoint directly as arguments to the constructor:
import cdsapi

uid = <YOUR UID HERE>
apikey = <YOUR APIKEY>
c = cdsapi.Client(key=f"{uid}:{apikey}", url="https://cds.climate.copernicus.eu/api/v2")

successful snapshot fails to load some shards, RepositoryMissingException in elasticsearch

I had a backup complete successfully to my S3 bucket in Elasticsearch:
{
    "state": "SUCCESS",
    "start_time": "2014-12-06T00:12:39.362Z",
    "start_time_in_millis": 1417824759362,
    "end_time": "2014-12-06T00:33:34.352Z",
    "end_time_in_millis": 1417826014352,
    "duration_in_millis": 1254990,
    "failures": [],
    "shards": {
        "total": 345,
        "failed": 0,
        "successful": 345
    }
}
But when I restore from the snapshot, I have a few failed shards, with the following message:
[2014-12-08 00:00:05,580][WARN ][cluster.action.shard] [Sunder] [kibana-int][4] received shard failed for [kibana-int][4],
node[_QG8dkDaRD-H1uPL_p57lw], [P], restoring[elasticsearch:snapshot_1], s[INITIALIZING], indexUUID [SAuv_EU3TBGZ71NhkC7WOA],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[kibana-int][4] failed recovery];
nested: IndexShardRestoreFailedException[[kibana-int][4] restore failed];
nested: RepositoryMissingException[[elasticsearch] missing]; ]]
How do I reconcile the data, or, if necessary, remove the failed shards from my cluster to complete the recovery?
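The RepositoryMissingException makes me suspect the repository named elasticsearch is not registered on the cluster doing the restore. This is roughly how I have been checking (a sketch; the host, bucket and region below are placeholders, and I am not certain that re-registering it is the right fix):
import requests

ES = "http://localhost:9200"  # placeholder host

# List the snapshot repositories the cluster actually knows about.
print(requests.get(ES + "/_snapshot/_all").json())

# If "elasticsearch" is missing, re-registering it under the same name
# (placeholder bucket/region) is what I would try before restoring again.
body = '{"type": "s3", "settings": {"bucket": "my-backup-bucket", "region": "us-east-1"}}'
print(requests.put(ES + "/_snapshot/elasticsearch", data=body,
                   headers={"Content-Type": "application/json"}).json())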

Projects were not shown in scrapyd

I am new to scrapyd. I have inserted the configuration below into my scrapy.cfg file:
[settings]
default = uk.settings
[deploy:scrapyd]
url = http://localhost:6800/
project=ukmall
[deploy:scrapyd2]
url = http://scrapyd.mydomain.com/api/scrapyd/
username = john
password = secret
If I run the command below:
$ scrapyd-deploy -l
I get:
scrapyd2 http://scrapyd.mydomain.com/api/scrapyd/
scrapyd http://localhost:6800/
To see all available projects:
scrapyd-deploy -L scrapyd
But it shows nothing on my machine.
Ref: http://scrapyd.readthedocs.org/en/latest/deploy.html#deploying-a-project
If I do:
$ scrapy deploy scrapyd2
anandhakumar@MMTPC104:~/ScrapyProject/mall_uk$ scrapy deploy scrapyd2
Packing version 1412322816
Traceback (most recent call last):
File "/usr/bin/scrapy", line 4, in <module>
execute()
File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 142, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 88, in _run_print_help
func(*a, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 149, in _run_command
cmd.run(args, opts)
File "/usr/lib/pymodules/python2.7/scrapy/commands/deploy.py", line 103, in run
egg, tmpdir = _build_egg()
File "/usr/lib/pymodules/python2.7/scrapy/commands/deploy.py", line 228, in _build_egg
retry_on_eintr(check_call, [sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d], stdout=o, stderr=e)
File "/usr/lib/pymodules/python2.7/scrapy/utils/python.py", line 276, in retry_on_eintr
return function(*args, **kw)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'setup.py', 'clean', '-a', 'bdist_egg', '-d', '/tmp/scrapydeploy-VLM6W7']' returned non-zero exit status 1
anandhakumar@MMTPC104:~/ScrapyProject/mall_uk$
If I do this for another project, it shows this:
$ scrapy deploy scrapyd
Packing version 1412325181
Deploying to project "project2" in http://localhost:6800/addversion.json
Server response (200):
{"status": "error", "message": "[Errno 13] Permission denied: 'eggs'"}
You'll only be able to list the spiders that have been deployed. If you haven't deployed anything yet, then to deploy your spider you simply use scrapy deploy:
scrapy deploy [ <target:project> | -l <target> | -L ]
vagrant@portia:~/takeovertheworld$ scrapy deploy scrapyd2
Packing version 1410145736
Deploying to project "takeovertheworld" in http://ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:6800/addversion.json
Server response (200):
{"status": "ok", "project": "takeovertheworld", "version": "1410145736", "spiders": 1}
Verify that the project was installed correctly by accessing the scrapyd API:
vagrant@portia:~/takeovertheworld$ curl http://ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:6800/listprojects.json
{"status": "ok", "projects": ["takeovertheworld"]}
I had the same error too. As @hugsbrugs said, it was because a folder inside the scrapy project had root rights. So I did this:
sudo scrapy deploy scrapyd2