How can I make a SageMaker processing job run with a multi-record manifest file?

I am trying to run a simple SageMaker job, just to make sure I can run multiple steps from a manifest file.
The Dockerfile is very simple:
FROM python:latest
COPY ./main.py /main.py
ENTRYPOINT [ "python", "/main.py"]
main.py
print(1)
The manifest file is named input.json and is located at s3://bucket/input.json.
This is the file:
[{"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}, {"Environment": {"path": 1}}]
When running the job from the SageMaker processing job console I used these parameters:
Input mode: File
S3 data type: ManifestFile
URI: s3://bucket/input.json
local_path: /opt/ml/processing/sdklofjdslkfj
I would expect it to run 10 times, once for each record in the file, but when I go to the logs I only see it print 1 once.
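For reference, I believe the console settings above correspond roughly to this SageMaker Python SDK call (just a sketch; the image URI, role and instance type are placeholders rather than my real values):
from sagemaker.processing import Processor, ProcessingInput

# Placeholder image URI, role and instance type.
processor = Processor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-image:latest",
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    inputs=[
        ProcessingInput(
            source="s3://bucket/input.json",                 # the manifest URI
            destination="/opt/ml/processing/sdklofjdslkfj",  # the local_path from the console
            s3_data_type="ManifestFile",
            s3_input_mode="File",
        )
    ]
)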
Questions:
1. How do I make it run once for every record, and pass parameters for each run?
2. Why do I need a local path for the manifest file?
Thanks!

Related

Filepulse Connector error with S3 provider (Source Connector)

I am trying to poll CSV files from S3 buckets using the FilePulse source connector. When the task starts I get the following error. What additional libraries do I need to add to make this work from an S3 bucket? Config file below.
Where did I go wrong?
Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:208)
java.nio.file.FileSystemNotFoundException: Provider "s3" not installed
at java.base/java.nio.file.Path.of(Path.java:212)
at java.base/java.nio.file.Paths.get(Paths.java:98)
at io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalFileStorage.exists(LocalFileStorage.java:62)
Config file:
{
    "name": "FilePulseConnector_3",
    "config": {
        "connector.class": "io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector",
        "filters": "ParseCSVLine, Drop",
        "filters.Drop.if": "{{ equals($value.artist, 'U2') }}",
        "filters.Drop.invert": "true",
        "filters.Drop.type": "io.streamthoughts.kafka.connect.filepulse.filter.DropFilter",
        "filters.ParseCSVLine.extract.column.name": "headers",
        "filters.ParseCSVLine.trim.column": "true",
        "filters.ParseCSVLine.seperator": ";",
        "filters.ParseCSVLine.type": "io.streamthoughts.kafka.connect.filepulse.filter.DelimitedRowFilter",
        "fs.cleanup.policy.class": "io.streamthoughts.kafka.connect.filepulse.fs.clean.LogCleanupPolicy",
        "fs.cleanup.policy.triggered.on": "COMMITTED",
        "fs.listing.class": "io.streamthoughts.kafka.connect.filepulse.fs.AmazonS3FileSystemListing",
        "fs.listing.filters": "io.streamthoughts.kafka.connect.filepulse.fs.filter.RegexFileListFilter",
        "fs.listing.interval.ms": "10000",
        "file.filter.regex.pattern": ".*\\.csv$",
        "offset.policy.class": "io.streamthoughts.kafka.connect.filepulse.offset.DefaultSourceOffsetPolicy",
        "offset.attributes.string": "name",
        "skip.headers": "1",
        "topic": "connect-file-pulse-quickstart-csv",
        "tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalRowFileInputReader",
        "tasks.file.status.storage.class": "io.streamthoughts.kafka.connect.filepulse.state.KafkaFileObjectStateBackingStore",
        "tasks.file.status.storage.bootstrap.servers": "172.27.157.66:9092",
        "tasks.file.status.storage.topic": "connect-file-pulse-status",
        "tasks.file.status.storage.topic.partitions": 10,
        "tasks.file.status.storage.topic.replication.factor": 1,
        "tasks.max": 1,
        "aws.access.key.id": "<<>>",
        "aws.secret.access.key": "<<>>",
        "aws.s3.bucket.name": "mytestbucketamtrak",
        "aws.s3.region": "us-east-1"
    }
}
What should I put in the libraries to make this work? Note: the Lenses connector sources from the S3 bucket without issues, so it's not a credentials issue.
As mentioned in the comments by @OneCricketeer, github.com/streamthoughts/kafka-connect-file-pulse/issues/382 points to the root cause.
Modifying the config file to use this property sourced the file:
"tasks.reader.class": "io.streamthoughts.kafka.connect.filepulse.fs.reader.AmazonS3RowFileInputReader"

Bazel extension that allows loading a file from S3 into a BUILD file

Currently I have a list of dictionaries in a .bzl file:
test_data = [
    { "name": "test", "data": "test_data"}
]
I load that in a BUILD file and perform some magic with a list comprehension:
[
    foo(name = data["name"], data = data["data"])
    for data in test_data
]
I need to be able to pull this file in from S3 and provide its contents to the BUILD file the same way I do with the static .bzl file.
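One direction I have been considering (just a sketch, untested; the rule name and attribute here are made up): a repository rule that downloads the object over plain HTTPS, e.g. from a public bucket or via a presigned URL, and exposes the .bzl file so a BUILD file can load() it.
# s3_data.bzl -- hypothetical sketch: fetch the data .bzl from S3 over HTTPS
# (public object or presigned URL) into an external repository.
def _s3_bzl_repo_impl(rctx):
    rctx.download(
        url = rctx.attr.url,  # e.g. "https://my-bucket.s3.amazonaws.com/test_data.bzl"
        output = "test_data.bzl",
    )
    # An (empty) BUILD file makes the repository a valid package.
    rctx.file("BUILD", "")

s3_bzl_repo = repository_rule(
    implementation = _s3_bzl_repo_impl,
    attrs = {
        "url": attr.string(mandatory = True),
    },
)
In WORKSPACE this would be instantiated as s3_bzl_repo(name = "test_data_repo", url = "..."), and a BUILD file could then load("@test_data_repo//:test_data.bzl", "test_data"). The caveat is that rctx.download does no AWS request signing, so a private bucket would need a presigned URL or some other fetch mechanism.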

Using CDSAPI in Google Colab

I installed a Python library called cdsapi in Google Colab.
To use it I need to locate its config file (which in a general Linux system is $HOME/.cdsapirc) and add my account key to it.
More details can be found here (https://cds.climate.copernicus.eu/api-how-to).
I am having a problem with this step
Copy the code displayed beside, in the file $HOME/.cdsapirc (in your
Unix/Linux environment): url: {api-url} key: {uid}:{api-key}
I tried using !cd /home/ in the Colab notebook, but it doesn't contain this file.
I have also tried !cat /home/.cdsapirc, it gave error:
cat: /home/.cdsapirc: No such file or directory
I achieved this successfully. My code in Colab is as follows:
First, create '.cdsapirc' in the root dir and write your key to it:
url = 'url: https://cds.climate.copernicus.eu/api/v2'
key = 'key: your uid and key'
with open('/root/.cdsapirc', 'w') as f:
    f.write('\n'.join([url, key]))
with open('/root/.cdsapirc') as f:
    print(f.read())
Then, install cdsapi:
!pip install cdsapi
Run example:
import cdsapi

c = cdsapi.Client()
c.retrieve("reanalysis-era5-pressure-levels",
           {
               "variable": "temperature",
               "pressure_level": "1000",
               "product_type": "reanalysis",
               "year": "2008",
               "month": "01",
               "day": "01",
               "time": "12:00",
               "format": "grib"
           },
           "/target/dir/download.grib")
The target dir could be your Google Drive folder.
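(If you go the Google Drive route, mount Drive in Colab first; a quick sketch, with the usual /content/drive mount point:)
from google.colab import drive

# Mount Google Drive so the retrieve() target path can point into it,
# e.g. '/content/drive/MyDrive/download.grib'.
drive.mount('/content/drive')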
You can specify your UID, API key and CDS API endpoint directly as arguments to the constructor:
import cdsapi

uid = <YOUR UID HERE>
apikey = <YOUR APIKEY>
c = cdsapi.Client(key=f"{uid}:{apikey}", url="https://cds.climate.copernicus.eu/api/v2")

successful snapshot fails to load some shards, RepositoryMissingException in elasticsearch

I had a backup complete successfully to my S3 bucket in Elasticsearch:
{
    "state": "SUCCESS",
    "start_time": "2014-12-06T00:12:39.362Z",
    "start_time_in_millis": 1417824759362,
    "end_time": "2014-12-06T00:33:34.352Z",
    "end_time_in_millis": 1417826014352,
    "duration_in_millis": 1254990,
    "failures": [],
    "shards": {
        "total": 345,
        "failed": 0,
        "successful": 345
    }
}
But when I restore from the snapshot, I have a few failed shards, with the following message:
[2014-12-08 00:00:05,580][WARN ][cluster.action.shard] [Sunder] [kibana-int][4] received shard failed for [kibana-int][4],
node[_QG8dkDaRD-H1uPL_p57lw], [P], restoring[elasticsearch:snapshot_1], s[INITIALIZING], indexUUID [SAuv_EU3TBGZ71NhkC7WOA],
reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[kibana-int][4] failed recovery];
nested: IndexShardRestoreFailedException[[kibana-int][4] restore failed];
nested: RepositoryMissingException[[elasticsearch] missing]; ]]
How do I reconcile the data, or, if necessary, remove the failed shards from my cluster to complete the recovery?
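The RepositoryMissingException makes me suspect the repository named elasticsearch is not registered on the cluster doing the restore. This is roughly how I have been checking (a sketch; the host, bucket and region below are placeholders, and I am not certain that re-registering it is the right fix):
import requests

ES = "http://localhost:9200"  # placeholder host

# List the snapshot repositories the cluster actually knows about.
print(requests.get(ES + "/_snapshot/_all").json())

# If "elasticsearch" is missing, re-registering it under the same name
# (placeholder bucket/region) is what I would try before restoring again.
body = '{"type": "s3", "settings": {"bucket": "my-backup-bucket", "region": "us-east-1"}}'
print(requests.put(ES + "/_snapshot/elasticsearch", data=body,
                   headers={"Content-Type": "application/json"}).json())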

Projects were not shown in scrapyd

I am new to scrapyd. I have inserted the configuration below into my scrapy.cfg file:
[settings]
default = uk.settings
[deploy:scrapyd]
url = http://localhost:6800/
project=ukmall
[deploy:scrapyd2]
url = http://scrapyd.mydomain.com/api/scrapyd/
username = john
password = secret
If I run the command below:
$ scrapyd-deploy -l
I get:
scrapyd2 http://scrapyd.mydomain.com/api/scrapyd/
scrapyd http://localhost:6800/
To see all available projects:
scrapyd-deploy -L scrapyd
But it shows nothing on my machine.
Ref: http://scrapyd.readthedocs.org/en/latest/deploy.html#deploying-a-project
If I do:
$ scrapy deploy scrapyd2
anandhakumar@MMTPC104:~/ScrapyProject/mall_uk$ scrapy deploy scrapyd2
Packing version 1412322816
Traceback (most recent call last):
File "/usr/bin/scrapy", line 4, in <module>
execute()
File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 142, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 88, in _run_print_help
func(*a, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 149, in _run_command
cmd.run(args, opts)
File "/usr/lib/pymodules/python2.7/scrapy/commands/deploy.py", line 103, in run
egg, tmpdir = _build_egg()
File "/usr/lib/pymodules/python2.7/scrapy/commands/deploy.py", line 228, in _build_egg
retry_on_eintr(check_call, [sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d], stdout=o, stderr=e)
File "/usr/lib/pymodules/python2.7/scrapy/utils/python.py", line 276, in retry_on_eintr
return function(*args, **kw)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'setup.py', 'clean', '-a', 'bdist_egg', '-d', '/tmp/scrapydeploy-VLM6W7']' returned non-zero exit status 1
anandhakumar@MMTPC104:~/ScrapyProject/mall_uk$
If I do this for another project, it shows this:
$ scrapy deploy scrapyd
Packing version 1412325181
Deploying to project "project2" in http://localhost:6800/addversion.json
Server response (200):
{"status": "error", "message": "[Errno 13] Permission denied: 'eggs'"}
You'll only be able to list the spiders that have been deployed. If you haven't deployed anything yet, then to deploy your spider you simply use scrapy deploy:
scrapy deploy [ <target:project> | -l <target> | -L ]
vagrant@portia:~/takeovertheworld$ scrapy deploy scrapyd2
Packing version 1410145736
Deploying to project "takeovertheworld" in http://ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:6800/addversion.json
Server response (200):
{"status": "ok", "project": "takeovertheworld", "version": "1410145736", "spiders": 1}
Verify that the project was installed correctly by accessing the scrapyd API:
vagrant@portia:~/takeovertheworld$ curl http://ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:6800/listprojects.json
{"status": "ok", "projects": ["takeovertheworld"]}
I had the same error too. As @hugsbrugs said, it was because a folder inside the scrapy project had root rights. So I did this:
sudo scrapy deploy scrapyd2