SageMaker Studio Lab - XGBoost Algortim - Am I Reading the Right Documentation? - xgboost

Google has answers for SageMaker Studio, but I am at a loss for an answer on SageMaker Studio LAB...
I am reading the following on XGBoost - Am I in the right place for SM Studio LAB?
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
Is there a better way for me to use XGBoost in LAB, vs what I am doing below by reading the doc above?
In SageMaker Studio I would do the following to get the ECR container for the XGBoost algorithm:
from sagemaker import image_uris
container = image_uris.retrieve('xgboost', boto3.Session().region_name, '1')
I made it a bit farther using the github example:
https://github.com/aws/studio-lab-examples/blob/main/connect-to-aws/Access_AWS_from_Studio_Lab.ipynb
This works:
from sagemaker import image_uris
from sagemaker.xgboost import XGBoost
# Create a training job name
job_name = 'ufo-xgboost-job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
print('Here is the job name {}'.format(job_name))
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
But this is giving me trouble:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(container,
role,
instance_count=1,
instance_type='ml.m4.xlarge',
output_path='model.tar.gz',
sagemaker_session=sess)
xgb.set_hyperparameters(objective='multi:softmax',
num_class=3,
num_round=100)
data_channels = {
'train': s3_input_train,
'validation': s3_input_validation
}
xgb.fit(data_channels, job_name=job_name)
With the following errors:
ParsingError Traceback (most recent call last)
~/.conda/envs/default/lib/python3.9/site-packages/botocore/configloader.py in raw_config_parse(config_filename, parse_subsections)
148 try:
--> 149 cp.read([path])
150 except (six.moves.configparser.Error, UnicodeDecodeError):
~/.conda/envs/default/lib/python3.9/configparser.py in read(self, filenames, encoding)
696 with open(filename, encoding=encoding) as fp:
--> 697 self._read(fp, filename)
698 except OSError:
~/.conda/envs/default/lib/python3.9/configparser.py in _read(self, fp, fpname)
1115 if e:
-> 1116 raise e
1117
ParsingError: Source contains parsing errors: '/home/studio-lab-user/.aws/config'
[line 5]: 'from sagemaker import image_uris\n'
[line 6]: 'import boto3\n'
During handling of the above exception, another exception occurred:
truncated
error at bottom:
ConfigParseError: Unable to parse config file: /home/studio-lab-user/.aws/config

You're trying to run an external training job that would run outside of your Studio Lab Environment, on an ephemeral EC2 instance as part of a SageMaker training job.
With the error being: ConfigParseError: Unable to parse config file: /home/studio-lab-user/.aws/config it sounds like you didn't configure your AWS credentials correctly.
Did you complete successfully step 0 and step 1, in this guide for connecting Studio Lab with your AWS account?
Alternatively: Run XGBoost locally in your Studio Lab - OR - as you already have an AWS account use SageMaker Studio (isn't free), which will already be configured with a certain IAM role your admin will configure when creating the Studio domain.

Related

Writing pandas dataframe to excel in dbfs azure databricks: OSError: [Errno 95] Operation not supported

I am trying to write a pandas dataframe to the local file system in azure databricks:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
data.to_excel(writer)
Then I get the following error message:
OSError: [Errno 95] Operation not supported
--------------------------------------------------------------------------- OSError Traceback (most recent call
last) in
3 data = pd.read_csv(url)
4 with pd.ExcelWriter(r'/dbfs/tmp/export.xlsx', engine="openpyxl") as writer:
----> 5 data.to_excel(writer)
/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py
in exit(self, exc_type, exc_value, traceback)
892
893 def exit(self, exc_type, exc_value, traceback):
--> 894 self.close()
895
896 def close(self):
/databricks/python/lib/python3.8/site-packages/pandas/io/excel/_base.py
in close(self)
896 def close(self):
897 """synonym for save, to make it more file-like"""
--> 898 content = self.save()
899 self.handles.close()
900 return content
I read in this post some limitations for mounted file systems: Pandas: Write to Excel not working in Databricks
But if I got it right, the solution is to write to the local workspace file system, which is exactly what is not working for me.
My user is workspace admin and I am using a standard cluster with 10.4 Runtime.
I also verified I can write csv file to the same location using pd.to_csv
What could be missing.
Databricks has a drawback that does not allow random write operations into DBFS which is indicated in the SO thread you are referring to.
So, a workaround for this would be to write the file to local file system (file:/) and then move to the required location inside DBFS. You can use the following code:
import pandas as pd
url = 'https://www.stats.govt.nz/assets/Uploads/Business-price-indexes/Business-price-indexes-March-2019-quarter/Download-data/business-price-indexes-march-2019-quarter-csv.csv'
data = pd.read_csv(url)
with pd.ExcelWriter(r'export.xlsx', engine="openpyxl") as writer:
#file will be written to /databricks/driver/ i.e., local file system
data.to_excel(writer)
dbutils.fs.ls("/databricks/driver/") indicates that the path you want to use to list the files is dbfs:/databricks/driver/ (absolute path) which does not exist.
/databricks/driver/ belongs to the local file system (DBFS is a part of this). The absolute path of /databricks/driver/ is file:/databricks/driver/. You can list the contents of this path by using either of the following:
import os
print(os.listdir("/databricks/driver/")
#OR
dbutils.fs.ls("file:/databricks/driver/")
So, use the file located in this path and move (or copy) it to your destination using shutil library as the following:
from shutil import move
move('/databricks/driver/export.xlsx','/dbfs/tmp/export.xlsx')

Streaming output truncated to the last 5000 lines

The Google Colab output is being truncated. I've looked through the settings and I didn't see a limitation there. What is the best option to solve the problem?
I had the same problem and managed it by writing the output on a file on drive:
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir("/content/drive/")
with open('/content/drive/output.txt','w') as out:
out.write(' abcd \n')
I have the same issue currently, I found this link on medium, check the part "How do I use Colab for long training times/runs?"
So basically according to this article you need to store checkpoints on your drive and by using callbacks from Keras, you will be able to run it nonstop.
from keras.callbacks import *
filepath = "/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
Other solution to solve this problem is according to my researches, you should put this code to your console but make sure that you save your progress to drive, because it will be terminated in 12 hours.
function ClickConnect() {
console.log("Working");
document
.querySelector('#top-toolbar > colab-connect-button')
.shadowRoot.querySelector('#connect')
.click()
}
setInterval(ClickConnect, 60000)

Keras: Error when downloading Fashion_MNIST Data

I am trying to download data from Fashion MNIST, but it produces an error. Originally, it was downloading and working properly, but I had to terminate it because I had to turn off my computer. Once I opened the file up again, it gives me an error. I'm not sure what the problem is, but is it because I already downloaded some parts of the data once, and keras doesn't recognize that? I am using Jupyter notebook in a conda environment
Here is the link to the image:
https://i.stack.imgur.com/wLGDm.png
You have missed adding tf. to the line
fashion_mnist = keras.datasets.fashion_mnist
The below code works perfectly for me. Importing the fashion_mnist dataset has been outlined in tensorflow documention here.
Change your code to:
import tensorflow as tf
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
or, use the better way to do it below. This avoids creating an extra variable fashion_mnist:
import tensorflow as tf
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
I am using tensorflow 1.9.0, keras 2.2.2 and python 3.6.6 on Windows 10 x64 OS.
I know my pc well, I can't download anything larger than 2.7 MB (in terminal), due to WinError 8.
So I manually downloaded all packs from storage.google (since some packs are 25 MB).
Check the packs:
then I paste all packs to \datasets\fashion-mnist
The next time u run your code, it should be fixed.
Note : If u have VScode then just CTRL and click the link, then you can download it easily.
I had an error regarding the cURL connection, and by looking into the error message I was able to track the file where the URL was declared. In my case it was:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow_core/python/keras/datasets/fashion_mnist.py
At line 44 I have commented out the line:
# base = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
And declared a different base URL, which I had found looking into the documentation of the original dataset:
base = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/'
The download started immediately and gave no errors. Hope this helps.
This is because for some reason you have an incomplete download for the MNIST dataset.
You will have to manually delete the downloaded folder which usually resides in ~/.keras/datasets or any path specified by you relative to this path, in your case MNIST_data.
Go to : C:\Users\Username.keras\datasets
and then Delete the Dataset that you want to redownload or has the error
You should be good to go!
You can also manually add print for the path from which it is taking dataset ..
Ex: print(paths) in file fashion_mnist.py
with gzip.open(paths[3], 'rb') as imgpath:
print(paths) #debug print in fashion_mnist.py
x_test = np.frombuffer(
imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
& from this path, remove the files & this will start to download fresh data ..
Change The base address with 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/' as described previously. It works for me.
I was getting error of Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Traceback (most recent call last):
File "C:\Users\AsadA\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\npyio.py", line 448, in load
return pickle.load(fid, **pickle_kwargs)
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\AsadA\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\npyio.py", line 450, in load
raise IOError(
OSError: Failed to interpret file 'C:\\Users\\AsadA\\.keras\\datasets\\mnist.npz' as a pickle"**
GO TO FILE C:\Users\AsadA\AppData\Local\Programs\Python\Python38\Lib\site-packages\tensorflow\python\keras\datasets (In my Case) and follow the instructions:

Google Storage (gs) wrapper file input/out for Cloud ML?

Google recently announced the Clould ML, https://cloud.google.com/ml/ and it's very useful. However, one limitation is that the input/out of a Tensorflow program should support gs://.
If we use all tensorflow APIS to read/write files, it should OK, since these APIs support gs://.
However, if we use native file IO APIs such as open, it does not work, because they don't understand gs://
For example:
with open(vocab_file, 'wb') as f:
cPickle.dump(self.words, f)
This code won't work in Google Cloud ML.
However, modifying all native file IO APIs to tensorflow APIs or Google Storage Python APIs is really tedious. Is there any simple way to do this? Any wrappers to support google storage systems, gs:// on top of the native file IO?
As suggested here Pickled scipy sparse matrix as input data?, perhaps we can use file_io.read_file_to_string('gs://...'), but still this requrements significant code modifcation.
Do it like this:
from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://.....', mode='w+') as f:
cPickle.dump(self.words, f)
Or you can read pickle file in like this:
file_stream = file_io.FileIO(train_file, mode='r')
x_train, y_train, x_test, y_test = pickle.load(file_stream)
One solution is to copy all of the data to local disk when the program starts up. You can do that using gsutil inside the Python script that gets run, something like:
vocab_file = 'vocab.pickled'
subprocess.check_call(['gsutil', '-m' , 'cp', '-r',
os.path.join('gs://path/to/', vocab_file), '/tmp'])
with open(os.path.join('/tmp', vocab_file), 'wb') as f:
cPickle.dump(self.words, f)
And if you have any outputs, you can write them to local disk and gsutil rsync them. (But, be careful to handle restarts correctly, because you may be put on a different machine).
The other solution is to monkey patch open (Note: untested):
import __builtin__
# NB: not all modes are compatible; should handle more carefully.
# Probably should be reported on
# https://github.com/tensorflow/tensorflow/issues/4357
def new_open(name, mode='r', buffering=-1):
return file_io.FileIO(name, mode)
__builtin__.open = new_open
Just be sure to do that before any module actually tries to read from GCS.
apache_beam has the gcsio module which can be used to return a standard Python file object to read/write GCS objects. You can use this object with any method that works with Python file objects. For example
def open_local_or_gcs(path, mode):
"""Opens the given path."""
if path.startswith('gs://'):
try:
return gcsio.GcsIO().open(path, mode)
except Exception as e: # pylint: disable=broad-except
# Currently we retry exactly once, to work around flaky gcs calls.
logging.error('Retrying after exception reading gcs file: %s', e)
time.sleep(10)
return gcsio.GcsIO().open(path, mode)
else:
return open(path, mode)
with open_local_or_gcs(vocab_file, 'wb') as f:
cPickle.dump(self.words, f)

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In distribution tensorflow environment. the master worker saves checkpoint fail.
saver.save has return ok*(not raise exception and return the store checkpoint file path) but, the return checkpoint file is not exist.
this is not same as the description of the tensorflow api
Why? How to Fix it?
=============
the related code is below:
def def_ps(self):
self.saver = tf.train.Saver(max_to_keep=100,keep_checkpoint_every_n_hours=3)
def save(self,idx):
ret = self.saver.save(self.sess,self.save_model_path,global_step=None,write_meta_graph=False)
if not os.path.exists(ret):
msg = "save model for %u path %s not exists."%(idx,ret)
lg.error(msg)
raise Exception(msg);
=============
the log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
not meets the tensorflow api which define Saver.save as below:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.
The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.