How to introduce xgboost in pyscript - xgboost

I found that the xgboost package exists in Pyodide, but I can't import it in PyScript:
<link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
<script defer src="https://pyscript.net/alpha/pyscript.js"></script>
<py-env>
- numpy
- matplotlib
- pandas
- scikit-learn
- xgboost
</py-env>
error:
ValueError: Couldn't find a pure Python 3 wheel for 'xgboost'. You can use micropip.install(..., keep_going=True) to get a list of all packages with missing wheels.

As for "I found that the xgboost package exists in pyodide", I think you are referring to the announcement of Pyodide 0.21.0. However, although PyScript is built on top of Pyodide, a PyScript user does not immediately reap the benefits of a Pyodide update: the PyScript release in use must first adopt the new Pyodide version.
I believe this discussion relates to your question, and the code in the linked Twitter thread shows how you could use the 'unstable' release now.

Related

Plot a subset of data from a grib file on google colab

I'm trying to plot a subset of a field from a grib file on Google Colab. The problem is that because Google Colab uses an older version of Python, I can't get enough libraries to work together to 1.) get a field from the grib file, 2.) extract a subset of that field by lat/lon, and 3.) plot it with matplotlib/cartopy.
I've been able to do each of the above steps on my own PC and there are numerous answers on this forum already that work away from colab, so the issue is related to making it work on the colab environment, which uses python 3.7.
For simplicity, here are some assumptions that could be made for anybody who wants to help.
1.) Use this file, since it's what I have been trying to use:
https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20221113/conus/hrrr.t18z.wrfnatf00.grib2
2.) You could use any field, but I've been extracting this one (output from pygrib):
14:Temperature:K (instant):lambert:hybrid:level 1:fcst time 0 hrs:from 202211131800
3.) You can get this data in zarr format from AWS, but the grib format uploads to the AWS database faster so I need to use it.
Here are some notes on what I've tried:
Downloading the data isn't an issue; the main issue is extracting the data by lat/lon. I've tried using condacolab or pip to install pygrib, pupygrib, PyNIO, or cfgrib, so that I can then use them to read the data above.
I could never get pupygrib or PyNIO to even install correctly. I was able to get cfgrib to work with conda, but then xarray fails when trying to extract fields due to a library conflict. Pygrib worked the best: I was able to extract fields from the grib file. However, the function grb.data(lat1=30,lat2=40,lon1=-100,lon2=-90) fails. It dumps the data into 1d arrays instead of 2d as it is supposed to per the documentation found here: https://jswhit.github.io/pygrib/api.html#example-usage
Here is some code I used for the pygrib set up in case that is useful:
!pip install pyproj
!pip install pygrib
# Uninstall existing shapely
!pip uninstall --yes shapely
!apt-get install -qq libgdal-dev libgeos-dev
!pip install shapely --no-binary shapely
!pip install cartopy==0.19.0.post1
!pip install metpy
!pip install wget
!pip install s3fs
import time
from matplotlib import pyplot as plt
import numpy as np
import scipy
import pygrib
import fsspec
import xarray as xr
import metpy.calc as mpcalc
from metpy.interpolate import cross_section
from metpy.units import units
from metpy.plots import USCOUNTIES
import cartopy.crs as ccrs
import cartopy.feature as cfeature
!wget https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20221113/conus/hrrr.t18z.wrfnatf00.grib2
grbs = pygrib.open('/content/hrrr.t18z.wrfnatf00.grib2')
grb2 = grbs.message(1)
data, lats, lons = grb2.data(lat1=30,lat2=40,lon1=-100,lon2=-90)
data.shape
This will output 1d arrays for data, lats, and lons. That is as far as I can get here because existing options like meshgrid don't work on big datasets (I tried it).
The other option is to get data this way:
grb_t = grbs.select(name='Temperature')[0]
This is plottable, but I don't know of a way to extract a subset of the data from here using lat/lons.
If you can help, feel free to ask me anything and I can add more details, but since I've tried about 10 different ways there is probably no sense in listing every failure. Really, I am open to any way to accomplish this task. Thank you.
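One way to keep the subset 2d is to crop by index rather than by a boolean mask. The following is a minimal sketch, assuming 2d arrays of values, latitudes and longitudes are already available (with pygrib these would presumably come from grb.values and grb.latlons()). The helper subset_by_bbox and the synthetic grid are hypothetical stand-ins, not part of pygrib. On a curvilinear grid such as Lambert, the cropped rectangle will also contain some points just outside the requested box.

```python
import numpy as np

def subset_by_bbox(data, lats, lons, lat1, lat2, lon1, lon2):
    """Crop 2-D data/lats/lons to the smallest index rectangle covering the box."""
    # Boolean mask of grid points inside the lat/lon box
    inside = (lats >= lat1) & (lats <= lat2) & (lons >= lon1) & (lons <= lon2)
    if not inside.any():
        raise ValueError("no grid points fall inside the requested box")
    # Rows and columns that contain at least one point inside the box
    rows = np.where(inside.any(axis=1))[0]
    cols = np.where(inside.any(axis=0))[0]
    window = (slice(rows[0], rows[-1] + 1), slice(cols[0], cols[-1] + 1))
    return data[window], lats[window], lons[window]

# Synthetic stand-in for a model grid (hypothetical values, 1-degree spacing)
lat_1d = np.linspace(20.0, 50.0, 31)
lon_1d = np.linspace(-130.0, -60.0, 71)
lons, lats = np.meshgrid(lon_1d, lat_1d)
data = lats + lons

sub, sub_lats, sub_lons = subset_by_bbox(data, lats, lons, 30, 40, -100, -90)
print(sub.shape)
```

Because the result stays 2d, it can be passed to plotting calls such as pcolormesh together with the cropped lats and lons.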

Is the default data structure of sklearn datasets pandas or NumPy?

I'm working through an exercise in https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ and am finding unexpected behavior on my computer when I fetch a dataset. The following code returns
numpy.ndarray
on the author's Google Colab page, but returns
pandas.core.frame.DataFrame
on my local Jupyter notebook. As far as I know, my environment is using the exact same versions of libraries as the author. I can easily convert the data to a NumPy array, but since I'm using this book as a guide for novices, I'd like to know what could be causing this discrepancy.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()
type(mnist['data'])
The author's Google Colab is at the following link, scrolling down to the "MNIST" heading. Thanks!
https://colab.research.google.com/github/ageron/handson-ml2/blob/master/03_classification.ipynb#scrollTo=LjZxzwOs2Q2P.
Just to close off this question, the comment by Ben Reiniger, namely to add as_frame=False, is correct. For example:
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
The OP has already made this change to the Colab code in the link.
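A likely cause of the discrepancy, as far as I know, is that scikit-learn 0.24 changed the default as_frame of fetch_openml from False to 'auto', which returns a DataFrame for dense data when pandas is available. The effect of the switch can be tried offline. The sketch below uses load_iris, a bundled dataset chosen here only for illustration, and assumes scikit-learn 0.23 or newer with pandas installed.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Same loader, two container types depending on as_frame
iris_frame = load_iris(as_frame=True)
iris_array = load_iris(as_frame=False)

print(type(iris_frame["data"]))   # pandas DataFrame
print(type(iris_array["data"]))   # NumPy array

# Converting afterwards is also possible
converted = iris_frame["data"].to_numpy()
```

Passing as_frame explicitly makes the notebook behave the same regardless of the installed scikit-learn version.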

An MKL version of mxnet does not seem to provide ndarray

When I use mxnet-cu101mkl = {version = "==1.5.0",sys_platform = "== 'linux'"}, I get an error that I can no longer import ndarray or nd:
ImportError: cannot import name 'ndarray'
I have no problem with this when using the same code with mxnet-cu101 (no mkl).
Is this just a bug or is this subpackage no longer supported?
I can confirm that mxnet-cu100mkl works fine (version 1.5.0). The CUDA version differs slightly from yours, but that shouldn't change the package contents. I think you might be importing a different mxnet here, for example a local folder called mxnet. Check the following:
import mxnet as mx
print(mx.__file__)
It should show the path to mxnet within site-packages for your Python environment, e.g.
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/__init__.py
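If the import itself fails, the same check can be done without importing the package. This is a minimal sketch using the standard library; describe_module is a hypothetical helper name, and "json" is used only as a stand-in for "mxnet". A local folder without an __init__.py shows up as a namespace package, with no origin and the folder listed under its search locations.

```python
import importlib.util

def describe_module(name):
    """Report where `name` would be imported from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # nothing importable under that name
    return {
        "origin": spec.origin,  # path to __init__.py, or None for a namespace package
        "locations": list(spec.submodule_search_locations or []),
    }

# Replace "json" with "mxnet" in the affected environment
info = describe_module("json")
print(info)
```

If the reported locations point at the project directory rather than site-packages, a local folder is shadowing the installed package.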

from pyramid.arima import auto_arima not working

I am doing some time series forecasting and am trying to import auto_arima from pyramid, but it throws a module-not-found error: "No module named 'pyramid.arima'"
from pyramid.arima import auto_arima
I also tried importing auto_arima from pmdarima :
from pmdarima.arima import auto_arima
but this throws an error:
"type object 'pmdarima.arima._arima.array' has no attribute '__reduce_cython__'"
What am I doing wrong?...
I'm using the pmdarima package without any issues, but your error is most likely related to your numpy version. I would recommend upgrading it (in case you use pip):
pip install --upgrade numpy
You can also try importing numpy before importing auto_arima (some people experience strange behavior).
You can follow the discussion in the GitHub issues - https://github.com/tgsmith61591/pmdarima/issues/91 (similar here or here). You're definitely not the first one with that issue.
If that doesn't help, please post your pmdarima and numpy versions.
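To collect those versions without importing the packages (which matters when the import itself fails), something like the following could be used. It is a minimal sketch assuming Python 3.8 or newer; installed_versions is a hypothetical helper name.

```python
from importlib import metadata

def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print(installed_versions(["pmdarima", "numpy"]))
```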

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from Tensorflow core to the Tensorboard Backend. You can still use it to extract data from Tensorboard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
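The scalar events can then be written out for use elsewhere. The sketch below writes them to CSV with the standard library. It assumes each event exposes wall_time, step and value, as the unpacking above suggests; the namedtuple and the sample values are stand-ins so that the example runs without a log file, and scalars_to_csv is a hypothetical helper.

```python
import csv
import io
from collections import namedtuple

# Stand-in with the same fields as the events returned by Scalars()
ScalarEvent = namedtuple("ScalarEvent", ["wall_time", "step", "value"])

def scalars_to_csv(events, out):
    """Write scalar events to an open text file object as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(["wall_time", "step", "value"])
    for event in events:
        writer.writerow([event.wall_time, event.step, event.value])

# Made-up events; with a real log, pass event_acc.Scalars('Accuracy') instead
fake_events = [ScalarEvent(1.0, 0, 0.5), ScalarEvent(2.0, 1, 0.75)]
buffer = io.StringIO()
scalars_to_csv(fake_events, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with open(path, 'w', newline='') instead of the in-memory buffer.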
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are protobufs encoded in RecordReader format. To get serialized strings out of the files you can use py_record_reader or build a graph with the TFRecordReader op, and to deserialize those strings to protobufs use the Event schema. If you get a working example, please update this question, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is TensorFlow's event accumulator:
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
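# Retrieve images, e. g. the first one stored under the tag 'generator/image/0'
# (Images() returns a list of image events, so take the first element)
img = acc.Images('generator/image/0')[0]
with open('img_{}.png'.format(img.step), 'wb') as f:
    f.write(img.encoded_image_string)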
You can also use tf.train.summary_iterator. To extract events from a ./logs folder where only the classic scalars lr, acc, loss, val_acc and val_loss are present, you can use this gist: tensorboard_to_csv.py
Chris Cundy's answer works well when you have fewer than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, TensorBoard automatically samples them and gives you at most 10000 points. This is quite an annoying underlying behavior, as it is not well documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and get all data points, a slightly hacky way is:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
class FalseDict(object):
    def __getitem__(self, key):
        return 0
    def __contains__(self, key):
        return True
event_acc = EventAccumulator('path/to/your/tfevents', size_guidance=FalseDict())
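Why this works, as far as I can tell from the linked source: the accumulator looks up each category in size_guidance and falls back to a default otherwise, and a size of 0 means that all events are kept. The sketch below is a simplified imitation of that lookup, not the library code; resolve_sizes and the two-entry defaults dict are illustrative assumptions.

```python
# Illustrative per-category limits; the real defaults live in event_accumulator.py
DEFAULT_SIZE_GUIDANCE = {"scalars": 10000, "images": 4}

class FalseDict(object):
    def __getitem__(self, key):
        return 0      # every category gets size 0, i.e. "keep everything"
    def __contains__(self, key):
        return True   # claims to have an entry for every category

def resolve_sizes(size_guidance):
    """Simplified imitation of how per-category limits are picked."""
    sizes = {}
    for key, default in DEFAULT_SIZE_GUIDANCE.items():
        sizes[key] = size_guidance[key] if key in size_guidance else default
    return sizes

print(resolve_sizes({}))           # falls back to the defaults
print(resolve_sizes(FalseDict()))  # every limit becomes 0
```

Under the same assumption, a plain dict such as {'scalars': 0} should lift the limit for scalars only and leave the other categories at their defaults.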
It looks like for tb version >=2.3 you can streamline the process of converting your tb events to a pandas dataframe using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
You can also use the EventFileLoader to iterate through a TensorBoard file:
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
    print(event)
Surprisingly, the Python package tbparse has not been mentioned yet.
From the documentation:
Installation:
pip install tensorflow # or tensorflow-cpu
pip install -U tbparse # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)