Are there any APIs for Amazon Web Services pricing? [closed]

Are there any APIs that have up-to-date pricing on Amazon Web Services? Something that can be queried, for example, for the latest S3 price for a given region, or EC2, etc.
Thanks.

UPDATE:
AWS now has a pricing API: https://aws.amazon.com/blogs/aws/new-aws-price-list-api/
Original answer:
This is something I have asked for previously (via AWS evangelists and surveys), but it hasn't been forthcoming. I guess the AWS folks have more interesting innovations on their horizon.
As pointed out by @brokenbeatnik, there is an API for spot-price history. API docs here: http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/ApiReference-query-DescribeSpotPriceHistory.html
I find it odd that spot-price history has an official API but on-demand pricing didn't get one at the same time. Anyway, to answer the question: yes, you can query the advertised AWS pricing...
The best I can come up with is from examining the (client-side) source of the various services' pricing pages. Therein you'll find that the tables are built in JS and populated with JSON data, data that you can GET yourself. E.g.:
http://aws.amazon.com/ec2/pricing/pricing-on-demand-instances.json
http://aws.amazon.com/s3/pricing/pricing-storage.json
That's only half the battle, though; next you have to pick apart the object format to get at the values you want. For example, in Python this gets the Hi-CPU On-Demand Extra-Large Linux instance pricing for Virginia:
>>> import json
>>> import urllib2
>>> response = urllib2.urlopen('http://aws.amazon.com/ec2/pricing/pricing-on-demand-instances.json')
>>> pricejson = response.read()
>>> pricing = json.loads(pricejson)
>>> pricing['config']['regions'][0]['instanceTypes'][3]['sizes'][1]['valueColumns'][0]['prices']['USD']
u'0.68'
Disclaimer: Obviously this is not an AWS sanctioned API and as such I wouldn't recommend expecting stability of the data format or even continued existence of the source. But it's there, and it beats transcribing the pricing data into static config/source files!
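If you would rather not rely on the hard-coded list indices above, here is a minimal sketch that walks the same JSON by the names it uses (the lookup_price helper is my own, not part of any AWS API; region 'us-east', type 'hiCPUODI', size 'xl' correspond to the example above, and treating the first valueColumn as the Linux price is an assumption based on the table layout):
def lookup_price(pricing, region, instance_type, size, currency='USD'):
    """Walk the on-demand pricing JSON by name instead of by index."""
    for r in pricing['config']['regions']:
        if r['region'] != region:
            continue
        for t in r['instanceTypes']:
            if t['type'] != instance_type:
                continue
            for s in t['sizes']:
                if s['size'] == size:
                    # assumption: the first valueColumn is the Linux price
                    return s['valueColumns'][0]['prices'][currency]
    return None

print(lookup_price(pricing, 'us-east', 'hiCPUODI', 'xl'))  # e.g. '0.68'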

For people who want to use the data from the Amazon API, which uses names like "t1.micro", here are translation arrays (a short lookup sketch follows them below):
type_translation = {
    'm1.small'    : ['stdODI', 'sm'],
    'm1.medium'   : ['stdODI', 'med'],
    'm1.large'    : ['stdODI', 'lg'],
    'm1.xlarge'   : ['stdODI', 'xl'],
    't1.micro'    : ['uODI', 'u'],
    'm2.xlarge'   : ['hiMemODI', 'xl'],
    'm2.2xlarge'  : ['hiMemODI', 'xxl'],
    'm2.4xlarge'  : ['hiMemODI', 'xxxxl'],
    'c1.medium'   : ['hiCPUODI', 'med'],
    'c1.xlarge'   : ['hiCPUODI', 'xl'],
    'cc1.4xlarge' : ['clusterComputeI', 'xxxxl'],
    'cc2.8xlarge' : ['clusterComputeI', 'xxxxxxxxl'],
    'cg1.4xlarge' : ['clusterGPUI', 'xxxxl'],
    'hi1.4xlarge' : ['hiIoODI', 'xxxxl']
}
region_translation = {
    'us-east-1'      : 'us-east',
    'us-west-2'      : 'us-west-2',
    'us-west-1'      : 'us-west',
    'eu-west-1'      : 'eu-ireland',
    'ap-southeast-1' : 'apac-sin',
    'ap-northeast-1' : 'apac-tokyo',
    'sa-east-1'      : 'sa-east-1'
}
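A short sketch of how these arrays might be used together with the on-demand JSON (lookup_price stands for whatever helper walks the JSON, e.g. the sketch in the first answer above; this is illustrative, not an AWS API):
# translate API-style names into the keys the pricing JSON uses
api_region, api_type = 'us-east-1', 'c1.xlarge'
json_type, json_size = type_translation[api_type]   # ['hiCPUODI', 'xl']
json_region = region_translation[api_region]        # 'us-east'
print(lookup_price(pricing, json_region, json_type, json_size))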

I've created a quick & dirty API in Python for accessing the pricing data in those JSON files and converting it to the relevant values (the right translations and the right instance types).
You can get the code here: https://github.com/erans/ec2instancespricing
And read a bit more about it here: http://forecastcloudy.net/2012/04/03/quick-dirty-api-for-accessing-amazon-web-services-aws-ec2-pricing-data/
You can use this file as a module and call its functions to get a Python dictionary with the results, or you can use it as a command-line tool to get the output as a human-readable table, JSON, or CSV for use in combination with other command-line tools.

There is a nice API available via the link below which you can query for AWS pricing.
http://info.awsstream.com
If you play around a bit with the filters, you can see how to construct a query to return the specific information you are after, e.g. region, instance type, etc. For example, to return JSON containing the EC2 pricing for Linux instances in the eu-west-1 region, you can format your query as below.
http://info.awsstream.com/instances.json?region=eu-west-1&os=linux
Just replace json with xml in the query above to return the information in an xml format.
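As a minimal sketch (assuming the site is still reachable; the shape of the returned JSON is not documented here, so inspect it before relying on any particular field):
import json
import urllib2  # urllib.request on Python 3

url = 'http://info.awsstream.com/instances.json?region=eu-west-1&os=linux'
data = json.loads(urllib2.urlopen(url).read())
print(data)  # inspect the structure before picking out fields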
Note: similar to the URLs posted by other contributors above, I don't believe this is an officially sanctioned AWS API. However, based on a number of spot checks I've made over the last couple of days, I can confirm that at the time of posting the pricing information seems to be correct.

I don't believe there's an API that covers general current prices for the standard services. However, for EC2 in particular, you can see spot price history so that you don't have to guess what the market price for a spot instance is. More on this is available at:
http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/using-spot-instances-history.html

I too needed an API to retrieve AWS pricing. I was surprised to find nothing, especially given the large number of APIs available for AWS resources.
My preferred language is Ruby, so I wrote a gem called AWSCosts that provides programmatic access to AWS pricing.
Here is an example of how to find the on-demand price for an m1.medium Linux instance:
AWSCosts.region('us-east-1').ec2.on_demand(:linux).price('m1.medium')

For those who need comprehensive AWS instance pricing data (EC2, RDS, ElastiCache and Redshift), here is a Python module grown from the one suggested above by Eran Sandler:
https://github.com/ilia-semenov/awspricingfull
It contains previous-generation instances as well as current-generation ones (including the newest d2 family), with reserved and on-demand pricing. JSON, table and CSV formats are available.

I made a Gist of forward and reverse names in YAML, should anyone need them for Rails, etc.

Another quick & dirty approach, but with a conversion to a more convenient final data format:
# json and urllib2 (Python 2) are needed by __init__ below
import json
import urllib2

class CostsAmazon(object):
    '''Class for general info on the Amazon EC2 compute cloud.
    '''
    def __init__(self):
        '''Fetch a bunch of instance cost data from Amazon and convert it
        into the following form (as self.table):
        table['us-east']['linux']['m1']['small']['light']['ondemand']['USD']
        '''
        #
        # tables_raw['ondemand']['config']['regions'
        #     ][0]['instanceTypes'][0]['sizes'][0]['valueColumns'][0
        #     ]['prices']['USD']
        #
        # structure of tables_raw:
        #  ┃
        #  ┗━━[key]
        #      ┣━━['use']         # an input 3 x ∈ { 'light', 'medium', ... }
        #      ┣━━['os']          # an input 2 x ∈ { 'linux', 'mswin' }
        #      ┣━━['scheduling']  # an input
        #      ┣━━['uri']         # an input (see dict above)
        #      ┃                  # the core output from Amazon follows
        #      ┣━━['vers'] == 0.01
        #      ┗━━['config']:
        # *        ┣━━['regions']: 7 x
        #          ┃   ┣━━['region'] == ∈ { 'us-east', ... }
        # *        ┃   ┗━━['instanceTypes']: 7 x
        #          ┃       ┣━━['type']: 'stdODI'
        # *        ┃       ┗━━['sizes']: 4 x
        #          ┃           ┗━━['valueColumns']
        #          ┃               ┣━━['size']: 'sm'
        # *        ┃               ┗━━['valueColumns']: 2 x
        #          ┃                   ┣━━['name']: ~ 'linux'
        #          ┃                   ┗━━['prices']
        #          ┃                       ┗━━['USD']: ~ '0.080'
        #          ┣━━['rate']: ~ 'perhr'
        #          ┣━━['currencies']: ∈ { 'USD', ... }
        #          ┗━━['valueColumns']: [ 'linux', 'mswin' ]
        #
        # The valueColumns thing is weird, it looks like they're trying
        # to constrain actual data to leaf nodes only, which is a little
        # bit of a conceit since they have lists in several levels. So
        # we can obtain the *much* more readable:
        #
        #     tables['regions']['us-east']['m1']['linux']['ondemand'
        #         ]['small']['light']['USD']
        #
        # structure of the reworked tables:
        #  ┃
        #  ┗━━[<region>]: 7 x ∈ { 'us-east', ... }
        #      ┗━━[<os>]: 2 x ∈ { 'linux', 'mswin' }  # oses
        #          ┗━━[<type>]: 7 x ∈ { 'm1', ... }
        #              ┗━━[<scheduling>]: 2 x ∈ { 'ondemand', 'reserved' }
        #                  ┗━━[<size>]: 4 x ∈ { 'small', ... }
        #                      ┗━━[<use>]: 3 x ∈ { 'light', 'medium', ... }
        #                          ┗━━[<currency>]: ∈ { 'USD', ... }
        #                              ┗━━> ~ '0.080' or None
        uri_base = 'http://aws.amazon.com/ec2/pricing'
        # the reserved tables are tagged 'reserved' so that they don't
        # overwrite the on-demand entries in self.table
        tables_raw = {
            'ondemand': {'scheduling': 'ondemand',
                         'uri': '/pricing-on-demand-instances.json',
                         'os': 'linux', 'use': 'light'},
            'reserved-light-linux': {
                'scheduling': 'reserved',
                'uri': 'ri-light-linux.json', 'os': 'linux', 'use': 'light'},
            'reserved-light-mswin': {
                'scheduling': 'reserved',
                'uri': 'ri-light-mswin.json', 'os': 'mswin', 'use': 'light'},
            'reserved-medium-linux': {
                'scheduling': 'reserved',
                'uri': 'ri-medium-linux.json', 'os': 'linux', 'use': 'medium'},
            'reserved-medium-mswin': {
                'scheduling': 'reserved',
                'uri': 'ri-medium-mswin.json', 'os': 'mswin', 'use': 'medium'},
            'reserved-heavy-linux': {
                'scheduling': 'reserved',
                'uri': 'ri-heavy-linux.json', 'os': 'linux', 'use': 'heavy'},
            'reserved-heavy-mswin': {
                'scheduling': 'reserved',
                'uri': 'ri-heavy-mswin.json', 'os': 'mswin', 'use': 'heavy'},
        }
        for key in tables_raw:
            # expand to full URIs
            tables_raw[key]['uri'] = (
                '%s/%s' % (uri_base, tables_raw[key]['uri']))
            # fetch the data from Amazon
            link = urllib2.urlopen(tables_raw[key]['uri'])
            # adds keys: 'vers' 'config'
            tables_raw[key].update(json.loads(link.read()))
            link.close()
        # canonicalize the types - the default is pretty annoying.
        self.currencies = set()
        self.regions = set()
        self.types = set()
        self.intervals = set()
        self.oses = set()
        self.sizes = set()
        self.schedulings = set()
        self.uses = set()
        self.footnotes = {}
        self.typesizes = {}  # self.typesizes['m1.small'] = [<region>...]
        self.table = {}
        # grovel through Amazon's tables_raw and convert to something orderly:
        for key in tables_raw:
            scheduling = tables_raw[key]['scheduling']
            self.schedulings.update([scheduling])
            use = tables_raw[key]['use']
            self.uses.update([use])
            os = tables_raw[key]['os']
            self.oses.update([os])
            config_data = tables_raw[key]['config']
            self.currencies.update(config_data['currencies'])
            for region_data in config_data['regions']:
                region = self.instance_region_from_raw(region_data['region'])
                self.regions.update([region])
                if 'footnotes' in region_data:
                    self.footnotes[region] = region_data['footnotes']
                for instance_type_data in region_data['instanceTypes']:
                    instance_type = self.instance_types_from_raw(
                        instance_type_data['type'])
                    self.types.update([instance_type])
                    for size_data in instance_type_data['sizes']:
                        size = self.instance_size_from_raw(size_data['size'])
                        typesize = '%s.%s' % (instance_type, size)
                        if typesize not in self.typesizes:
                            self.typesizes[typesize] = set()
                        self.typesizes[typesize].update([region])
                        self.sizes.update([size])
                        for size_values in size_data['valueColumns']:
                            interval = size_values['name']
                            self.intervals.update([interval])
                            for currency in size_values['prices']:
                                cost = size_values['prices'][currency]
                                self.table_add_row(region, os, instance_type,
                                                   size, use, scheduling,
                                                   currency, cost)

    def table_add_row(self, region, os, instance_type, size, use, scheduling,
                      currency, cost):
        if cost == 'N/A*':
            return
        table = self.table
        for key in [region, os, instance_type, size, use, scheduling]:
            if key not in table:
                table[key] = {}
            table = table[key]
        table[currency] = cost

    def instance_region_from_raw(self, raw_region):
        '''Map a region name from the EC2 pricing JSON to the
        corresponding API region name.
        '''
        regions = {
            'apac-tokyo' : 'ap-northeast-1',
            'apac-sin'   : 'ap-southeast-1',
            'eu-ireland' : 'eu-west-1',
            'sa-east-1'  : 'sa-east-1',
            'us-east'    : 'us-east-1',
            'us-west'    : 'us-west-1',
            'us-west-2'  : 'us-west-2',
        }
        return regions[raw_region] if raw_region in regions else raw_region

    def instance_types_from_raw(self, raw_type):
        types = {
            # ondemand                 reserved
            'stdODI'          : 'm1',  'stdResI'         : 'm1',
            'uODI'            : 't1',  'uResI'           : 't1',
            'hiMemODI'        : 'm2',  'hiMemResI'       : 'm2',
            'hiCPUODI'        : 'c1',  'hiCPUResI'       : 'c1',
            'clusterComputeI' : 'cc1', 'clusterCompResI' : 'cc1',
            'clusterGPUI'     : 'cc2', 'clusterGPUResI'  : 'cc2',
            'hiIoODI'         : 'hi1', 'hiIoResI'        : 'hi1',
        }
        return types[raw_type]

    def instance_size_from_raw(self, raw_size):
        sizes = {
            'u'         : 'micro',
            'sm'        : 'small',
            'med'       : 'medium',
            'lg'        : 'large',
            'xl'        : 'xlarge',
            'xxl'       : '2xlarge',
            'xxxxl'     : '4xlarge',
            'xxxxxxxxl' : '8xlarge',
        }
        return sizes[raw_size]

    def cost(self, region, os, instance_type, size, use, scheduling,
             currency):
        try:
            return self.table[region][os][instance_type][
                size][use][scheduling][currency]
        except KeyError:
            return None
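A quick usage sketch for the class above (the argument order follows the cost() accessor, and regions use the converted names like 'us-east-1'):
costs = CostsAmazon()
# hourly USD price for an on-demand m1.small Linux instance (light utilization)
print(costs.cost('us-east-1', 'linux', 'm1', 'small', 'light', 'ondemand', 'USD'))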

Here is another unsanctioned "api" which covers reserved instances: http://aws.amazon.com/ec2/pricing/pricing-reserved-instances.json

There is no official pricing API, but the price rippers mentioned above do a very nice job.
In addition to the EC2 price ripper, I'd like to share my RDS and ElastiCache price rippers:
https://github.com/evgeny-gridasov/rdsinstancespricing
https://github.com/evgeny-gridasov/elasticachepricing

There is a reply to a similar question which lists all the .js files containing the prices; these are essentially JSON files (with only a callback(...); wrapper to remove).
Here is an example for Linux On-Demand prices: http://aws-assets-pricing-prod.s3.amazonaws.com/pricing/ec2/linux-od.js
(Get the full list directly on that reply)
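A minimal sketch of stripping that wrapper (the exact callback name/format is an assumption, and the payload may use JS-style unquoted keys rather than strict JSON, in which case a lenient parser such as demjson helps):
import re
import json
import urllib2  # urllib.request on Python 3

raw = urllib2.urlopen(
    'http://aws-assets-pricing-prod.s3.amazonaws.com/pricing/ec2/linux-od.js').read()
# strip the callback(...); wrapper to leave the bare object literal
payload = re.search(r'callback\((.*)\)', raw, re.DOTALL).group(1)
prices = json.loads(payload)  # or demjson.decode(payload) if keys are unquoted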

Related

How to get the uri of the current pipeline's artifact

Consider the following pipeline:
# (assumes the usual imports, e.g. import os and from tfx import v1 as tfx)
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)

statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])

schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('preprocessing_fn.py'))

_trainer_module_file = 'run_fn.py'
trainer = tfx.components.Trainer(
    module_file=os.path.abspath(_trainer_module_file),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=tfx.proto.TrainArgs(num_steps=10),
    eval_args=tfx.proto.EvalArgs(num_steps=6),
)

pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=_serving_model_dir)
    )
)

components = [
    example_gen,
    statistics_gen,
    schema_gen,
    transform,
    trainer,
    pusher,
]

_pipeline_data_folder = './simple_pipeline_data'
pipeline = tfx.dsl.Pipeline(
    pipeline_name='simple_pipeline',
    pipeline_root=_pipeline_data_folder,
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        f'{_pipeline_data_folder}/metadata.db'),
    components=components)

tfx.orchestration.LocalDagRunner().run(pipeline)
Now, let's assume that once the pipeline is done, I would like to do something with the artifacts. I know I can query the ML Metadata like this:
import ml_metadata as mlmd
connection_config = pipeline.metadata_connection_config
store = mlmd.MetadataStore(connection_config)
print(store.get_artifact_types())
But this way, I have no idea which IDs belong to the current pipeline. Sure, I can assume that the largest IDs represent the current pipeline artifacts but that's not going to be a practical approach in production when multiple executions might try to work with the same metadata store concurrently.
So, the question is how can I figure out the artifact IDs that were just created by the current execution?
[UPDATE]
To clarify the problem, consider the following partial solution:
from tfx import types
from tfx.orchestration.metadata import Metadata

def get_latest_artifact(metadata_connection_config, pipeline_name: str, component_name: str, type_name: str):
    with Metadata(metadata_connection_config) as metadata:
        context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
        artifacts = metadata.store.get_artifacts_by_context(context.id)
        artifact_type = metadata.store.get_artifact_type(type_name)
        latest_artifact = max([a for a in artifacts if a.type_id == artifact_type.id],
                              key=lambda a: a.last_update_time_since_epoch)
        artifact = types.Artifact(artifact_type)
        artifact.set_mlmd_artifact(latest_artifact)
        return artifact

sqlite_path = './pipeline_data/metadata.db'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)
examples_artifact = get_latest_artifact(metadata_connection_config, 'simple_pipeline',
                                        'SchemaGen', 'Schema')
Using the get_latest_artifact function, I can get the latest artifact of a specific type from a specific pipeline. This will work even if two pipelines (with different names) create new artifacts concurrently. But it will fail when I try to extract the artifact of the "just finished" pipeline if multiple instances of the same pipeline are making changes to the store concurrently. That's because the function takes the pipeline name as an input argument (as opposed to some unique pipeline ID).
I'm looking for a solution that works no matter how many different (or identical) pipelines work with the same store concurrently. At this point, I'm not sure if this can be done with MLMD. And if it cannot be done at the moment, I consider that a missing feature, and a very crucial one.
OK, this is the solution I found. When defining the pipeline's components, you should use the .with_id() method and give the component a custom ID. That way you can find it later on.
Here's an example. Let's say that I want to find the schema generated as part of the recently executed pipeline.
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True).with_id('some_unique_id')
Then, the same function I defined above can be used like this:
def get_latest_artifact(metadata_connection_config, pipeline_name: str, component_name: str, type_name: str):
    with Metadata(metadata_connection_config) as metadata:
        context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
        artifacts = metadata.store.get_artifacts_by_context(context.id)
        artifact_type = metadata.store.get_artifact_type(type_name)
        latest_artifact = max([a for a in artifacts if a.type_id == artifact_type.id],
                              key=lambda a: a.last_update_time_since_epoch)
        artifact = types.Artifact(artifact_type)
        artifact.set_mlmd_artifact(latest_artifact)
        return artifact

sqlite_path = './pipeline_data/metadata.db'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)
examples_artifact = get_latest_artifact(metadata_connection_config, 'simple_pipeline',
                                        'some_unique_id', 'Schema')
Once your TFX pipeline completes the run, you can query the ML metadata using the code below.
connection_config = interactive_context.metadata_connection_config
store = mlmd.MetadataStore(connection_config)
# All TFX artifacts are stored in the base directory
base_dir = connection_config.sqlite.filename_uri.split('metadata.sqlite')[0]
Once the metadata is fetched, you can use the helper functions below to view data from the MD store. The display_types() function queries the list of all stored ArtifactTypes. The display_artifacts() function lists all artifacts of a given artifact type and their URIs. The display_properties() function gives the execution properties for a given artifact.
Please refer to the MLMD tutorial for the detailed implementation of the functions below.
import pandas as pd  # the helpers below render their results as DataFrames

def display_types(types):
    # Helper function to render dataframes for the artifact and execution types
    table = {'id': [], 'name': []}
    for a_type in types:
        table['id'].append(a_type.id)
        table['name'].append(a_type.name)
    return pd.DataFrame(data=table)

def display_artifacts(store, artifacts):
    # Helper function to render dataframes for the input artifacts
    table = {'artifact id': [], 'type': [], 'uri': []}
    for a in artifacts:
        table['artifact id'].append(a.id)
        artifact_type = store.get_artifact_types_by_id([a.type_id])[0]
        table['type'].append(artifact_type.name)
        table['uri'].append(a.uri.replace(base_dir, './'))
    return pd.DataFrame(data=table)

def display_properties(store, node):
    # Helper function to render dataframes for artifact and execution properties
    table = {'property': [], 'value': []}
    for k, v in node.properties.items():
        table['property'].append(k)
        table['value'].append(
            v.string_value if v.HasField('string_value') else v.int_value)
    for k, v in node.custom_properties.items():
        table['property'].append(k)
        table['value'].append(
            v.string_value if v.HasField('string_value') else v.int_value)
    return pd.DataFrame(data=table)
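For example, a small usage sketch with the store from above ('Schema' is just one artifact type name; substitute whichever type you are after):
# list every artifact type registered in the metadata store
display_types(store.get_artifact_types())

# list all Schema artifacts and their URIs
schema_artifacts = store.get_artifacts_by_type('Schema')
display_artifacts(store, schema_artifacts)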
Example code to get the latest pushed model execution properties.
# get all artifacts with ArtifactType PushedModel
pushed_models = store.get_artifacts_by_type("PushedModel")
# get the latest pushed model
pushed_model = pushed_models[-1]
# get execution properties for latest pushed model
display_properties(store, pushed_model)

How to switch between policies in multiagent rllib?

For my thesis, I want to train a PPO model in a collaborative multi-agent setting. The 'teammates' are already-trained Behavior Cloning models, each representing a different kind of behavior. These models will not be trained simultaneously with the PPO. I want the PPO agent to encounter all these different models during training, meaning: I want to switch out the 'teammate' of the PPO every x amount of epochs. How do I do this?
I am using the overcooked-ai environment. All code would be too much to include. But the generation of the trainer ends up returning this:
trainer = PPOTrainer(env="overcooked_multi_agent", config={
    "multiagent": multi_agent_config,
    "callbacks": TrainingCallbacks,
    "custom_eval_function": get_rllib_eval_function(evaluation_params, environment_params['eval_mdp_params'], environment_params['env_params'],
                                                    environment_params["outer_shape"], 'ppo', 'ppo' if self_play else 'bc',
                                                    verbose=params['verbose']),
    "env_config": environment_params,
    "eager": False,
    **training_params
}, logger_creator=custom_logger_creator)
and the multi-agent config is created like this:
def gen_policy(policy_type="ppo"):
    # supported policy types thus far
    assert policy_type in ["ppo", "bc"]
    if policy_type == "ppo":
        config = {
            "model": {
                "custom_options": model_params,
                "custom_model": "MyPPOModel"
            }
        }
        return (None, env.ppo_observation_space, env.action_space, config)
    elif policy_type == "bc":
        bc_cls = bc_params['bc_policy_cls']
        bc_config = bc_params['bc_config']
        return (bc_cls, env.bc_observation_space, env.action_space, bc_config)

# Rllib compatible way of setting the directory we store agent checkpoints in
logdir_prefix = "{0}_{1}_{2}".format(params["experiment_name"], params['training_params']['seed'], timestr)

def custom_logger_creator(config):
    """Creates a Unified logger that stores results in <params['results_dir']>/<params["experiment_name"]>_<seed>_<timestamp>
    """
    results_dir = params['results_dir']
    if not os.path.exists(results_dir):
        try:
            os.makedirs(results_dir)
        except Exception as e:
            print("error creating custom logging dir. Falling back to default logdir {}".format(DEFAULT_RESULTS_DIR))
            results_dir = DEFAULT_RESULTS_DIR
    logdir = tempfile.mkdtemp(
        prefix=logdir_prefix, dir=results_dir)
    logger = UnifiedLogger(config, logdir, loggers=None)
    return logger

# Create rllib compatible multi-agent config based on params
multi_agent_config = {}
all_policies = ['ppo']

# Whether both agents should be learned
self_play = iterable_equal(multi_agent_params['bc_schedule'], OvercookedMultiAgent.self_play_bc_schedule)
if not self_play:
    all_policies.append('bc')

multi_agent_config['policies'] = {policy: gen_policy(policy) for policy in all_policies}

def select_policy(agent_id):
    if agent_id.startswith('ppo'):
        return 'ppo'
    if agent_id.startswith('bc'):
        return 'bc'

multi_agent_config['policy_mapping_fn'] = select_policy
multi_agent_config['policies_to_train'] = 'ppo'
Do I already change my configuration to include multiple BC models (the code above is built to handle 1), or can I somehow change the models in the training loop?
Any ideas would be super helpful!

Vertex AI Model Batch prediction, issue with referencing existing model and input file on Cloud Storage

I'm struggling to correctly set up a Vertex AI pipeline which does the following:
1. read data from an API, store it to GCS and use it as input for batch prediction.
2. get an existing model (video classification on Vertex AI)
3. create a batch prediction job with the input from point 1.
As will be seen, I don't have much experience with Vertex Pipelines/Kubeflow, thus I'm asking for help/advice; hope it's just some beginner mistake.
This is the gist of the code I'm using as the pipeline:
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl
from kfp.v2.dsl import component
from kfp.v2.dsl import (
    Output,
    Artifact,
    Model,
)

PROJECT_ID = 'my-gcp-project'
BUCKET_NAME = "mybucket"
PIPELINE_ROOT = "{}/pipeline_root".format(BUCKET_NAME)


@component
def get_input_data() -> str:
    # getting data from API, save to Cloud Storage
    # return GS URI
    gcs_batch_input_path = 'gs://somebucket/file'
    return gcs_batch_input_path


@component(
    base_image="python:3.9",
    packages_to_install=['google-cloud-aiplatform==1.8.0']
)
def load_ml_model(project_id: str, model: Output[Artifact]):
    """Load existing Vertex model"""
    import google.cloud.aiplatform as aip

    model_id = '1234'
    model = aip.Model(model_name=model_id, project=project_id, location='us-central1')


@dsl.pipeline(
    name="batch-pipeline", pipeline_root=PIPELINE_ROOT,
)
def pipeline(gcp_project: str):
    input_data = get_input_data()
    ml_model = load_ml_model(gcp_project)

    gcc_aip.ModelBatchPredictOp(
        project=PROJECT_ID,
        job_display_name=f'test-prediction',
        model=ml_model.output,
        gcs_source_uris=[input_data.output],  # this doesn't work
        # gcs_source_uris=['gs://mybucket/output/'],  # hardcoded gs uri works
        gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
    )


if __name__ == '__main__':
    from kfp.v2 import compiler
    import google.cloud.aiplatform as aip

    pipeline_export_filepath = 'test-pipeline.json'
    compiler.Compiler().compile(pipeline_func=pipeline,
                                package_path=pipeline_export_filepath)
    # pipeline_params = {
    #     'gcp_project': PROJECT_ID,
    # }
    # job = aip.PipelineJob(
    #     display_name='test-pipeline',
    #     template_path=pipeline_export_filepath,
    #     pipeline_root=f'gs://{PIPELINE_ROOT}',
    #     project=PROJECT_ID,
    #     parameter_values=pipeline_params,
    # )
    # job.run()
When running the pipeline it throws this exception when running Batch prediction:
details = "List of found errors: 1.Field: batch_prediction_job.model; Message: Invalid Model resource name.
so I'm not sure what could be wrong. I tried to load the model in a notebook (outside of the component) and it returns correctly.
The second issue I'm having is referencing a GCS URI output from a component as the batch job input.
input_data = get_input_data2()
gcc_aip.ModelBatchPredictOp(
    project=PROJECT_ID,
    job_display_name=f'test-prediction',
    model=ml_model.output,
    gcs_source_uris=[input_data.output],  # this doesn't work
    # gcs_source_uris=['gs://mybucket/output/'],  # hardcoded gs uri works
    gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
During compilation, I get the following exception: TypeError: Object of type PipelineParam is not JSON serializable, though I think this could be an issue with the ModelBatchPredictOp component.
Again, any help/advice is appreciated. I've been dealing with this since yesterday, so maybe I missed something obvious.
libraries I'm using:
google-cloud-aiplatform==1.8.0
google-cloud-pipeline-components==0.2.0
kfp==1.8.10
kfp-pipeline-spec==0.1.13
kfp-server-api==1.7.1
UPDATE
After the comments, some research, and tuning, this works for referencing the model:
@component
def load_ml_model(project_id: str, model: Output[Artifact]):
    region = 'us-central1'
    model_id = '1234'
    model_uid = f'projects/{project_id}/locations/{region}/models/{model_id}'
    model.uri = model_uid
    model.metadata['resourceName'] = model_uid
and then I can use it as intended:
batch_predict_op = gcc_aip.ModelBatchPredictOp(
    project=gcp_project,
    job_display_name=f'batch-prediction-test',
    model=ml_model.outputs['model'],
    gcs_source_uris=[input_batch_gcs_path],
    gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/test'
)
UPDATE 2
Regarding the GCS path, a workaround is to define the path outside of the component and pass it in as an input parameter, for example (abbreviated):
@dsl.pipeline(
    name="my-pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def pipeline(
        gcp_project: str,
        region: str,
        bucket: str
):
    ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    gcs_prediction_input_path = f'gs://{BUCKET_NAME}/prediction_input/video_batch_prediction_input_{ts}.jsonl'

    batch_input_data_op = get_input_data(gcs_prediction_input_path)  # this loads input data to the GCS path

    batch_predict_op = gcc_aip.ModelBatchPredictOp(
        project=gcp_project,
        model=training_job_run_op.outputs["model"],
        job_display_name='batch-prediction',
        # gcs_source_uris=[batch_input_data_op.output],
        gcs_source_uris=[gcs_prediction_input_path],
        gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/',
    ).after(batch_input_data_op)  # 'after' is needed so this runs once the input data is prepared, since get_input_data doesn't return anything
I'm still not sure why it doesn't work/compile when I return the GCS path from the get_input_data component.
I'm glad you solved most of your main issues and found a workaround for model declaration.
For your input.output observation on gcs_source_uris, the reason behind it is the way the function/class returns the value. If you dig inside the class/methods of google_cloud_pipeline_components, you will find that it implements a structure that allows you to use .outputs on the returned value of the function called.
If you go to the implementation of one of the components of the pipeline, you will find that it returns an output array from the convert_method_to_component function. So, in order to have that implemented in your custom class/function, your function should return a value which can be called as an attribute. Below is a basic implementation of it.
class CustomClass():
    def __init__(self):
        self.return_val = {'path': 'custompath', 'desc': 'a desc'}

    @property
    def output(self):
        return self.return_val

hello = CustomClass()
print(hello.output['path'])
If you want to dig into this more, you can go to the following pages:
convert_method_to_component, which is the implementation of convert_method_to_component
Properties, the basics of property in Python.

Can't change the language in microsoft cognitive services spellchecker

Here is Microsoft's Python example of using the Spell Check API:
import http.client, urllib.parse, json
text = 'Hollo, wrld!'
data = {'text': text}
# NOTE: Replace this example key with a valid subscription key.
key = 'MY_API_KEY'
host = 'api.cognitive.microsoft.com'
path = '/bing/v7.0/spellcheck?'
params = 'mkt=en-us&mode=proof'
headers = {'Ocp-Apim-Subscription-Key': key,
'Content-Type': 'application/x-www-form-urlencoded'}
# The headers in the following example
# are optional but should be considered as required:
#
# X-MSEdge-ClientIP: 999.999.999.999
# X-Search-Location: lat: +90.0000000000000;long: 00.0000000000000;re:100.000000000000
# X-MSEdge-ClientID: <Client ID from Previous Response Goes Here>
conn = http.client.HTTPSConnection(host)
body = urllib.parse.urlencode(data)
conn.request ("POST", path + params, body, headers)
response = conn.getresponse()
output = json.dumps(json.loads(response.read()), indent=4)
print (output)
And it works well for mkt=en-us. But if I try to change it, for example to 'fr-FR', it always answers with a blank response for any input text:
{
"_type": "SpellCheck",
"flaggedTokens": []
}
Has anybody encountered a similar problem? Could it be connected with my trial API key (though they do not mention that the trial supports only English)?
Well, I've found out what the problem was: 'mode=proof', the advanced spell checker, is currently available only with 'mkt=en-us' (for some Microsoft reason it is not available even with 'mkt=en-uk'). For all other languages, you should use 'mode=spell'.
The main difference between 'proof' and 'spell' is described like this:
The Spell mode finds most spelling mistakes but doesn't find some of the grammar errors that Proof catches (for example, capitalization and repeated words).
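So, in the example above, the only change needed for French is the parameter string:
params = 'mkt=fr-fr&mode=spell'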

Advice needed on setting up an (Objective C?) Mac-based web service

I have developed numerous iOS apps over the years so know Objective C reasonably well.
I'd like to build my first web service to offload some of the most processor intensive functions.
I'm leaning towards using my Mac as the server, which comes with Apache. I have configured this and it appears to be working as it should (I can type the Mac's IP address and receive a confirmation).
Now I'm trying to decide on how to build the server-side web service, which is totally new to me. I'd like to leverage my Objective C knowledge if possible. I think I'm looking for an Objective C-compatible web service engine and some examples how to connect it to browsers and mobile interfaces. I was leaning towards using Amazon's SimpleDB as the database.
BTW: I see Apple have Lion Server, but I cannot work out if this is an option.
Any thoughts/recommendations are appreciated.
There are examples of simple web servers out there written in ObjC such as this and this.
That said, there are probably "better" ways of doing this if you don't mind using other technologies. This is a matter of preference, but I've used Python, MySQL, and the excellent web.py framework for these sorts of backends.
For example, here's an example web service (some redundancies omitted...) using the combination of technologies described. I just run this on my server, and it takes care of URL redirection and serves JSON from the db.
import web
import json
import MySQLdb

urls = (
    "/equip/gruppo", "gruppo",  # GET = get all gruppos, POST = save gruppo
    "/equip/frame", "frame"
)

class StatusCode:
    (Success, SuccessNoRows, FailConnect, FailQuery, FailMissingParam, FailOther) = range(6)

# top-level class that handles db interaction
class APIObject:
    def __init__(self):
        self.object_dict = {}  # top-level dictionary to be turned into JSON
        self.rows = []
        self.cursor = ""
        self.conn = ""

    def dbConnect(self):
        try:
            self.conn = MySQLdb.connect(host='localhost', user='my_api_user', passwd='api_user_pw', db='my_db')
            self.cursor = self.conn.cursor(MySQLdb.cursors.DictCursor)
        except:
            self.object_dict['api_status'] = StatusCode.FailConnect
            return False
        else:
            return True

    def queryExecute(self, query):
        try:
            self.cursor.execute(query)
            self.rows = self.cursor.fetchall()
        except:
            self.object_dict['api_status'] = StatusCode.FailQuery
            return False
        else:
            return True

class gruppo(APIObject):
    def GET(self):
        web.header('Content-Type', 'application/json')
        if self.dbConnect() == False:
            return json.dumps(self.object_dict, sort_keys=True, indent=4)
        else:
            if self.queryExecute("SELECT * FROM gruppos") == False:
                return json.dumps(self.object_dict, sort_keys=True, indent=4)
            else:
                self.object_dict['api_status'] = StatusCode.SuccessNoRows if len(self.rows) == 0 else StatusCode.Success
                data_list = []
                for row in self.rows:
                    # create a dictionary with the required elements
                    d = {}
                    d['id'] = row['id']
                    d['maker'] = row['maker_name']
                    d['type'] = row['type_name']
                    # append to the object list
                    data_list.append(d)
                self.object_dict['data'] = data_list
                # return to the client
                return json.dumps(self.object_dict, sort_keys=True, indent=4)
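To actually serve these handlers you need the standard web.py bootstrap, presumably among the omitted parts; a minimal sketch (assuming the urls tuple and handler classes above):
if __name__ == "__main__":
    # serve the handlers defined above (web.py listens on port 8080 by default)
    app = web.application(urls, globals())
    app.run()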