Downloading a TensorFlow Dataset (MOVi) from GCS

I want to download the MOVi datasets that are implemented by Kubric (https://github.com/google-research/kubric/tree/main/challenges/movi) to disk.
According to the documentation of Kubric, we can access the data simply via
import tensorflow_datasets as tfds

ds, info = tfds.load("movi_b", data_dir="gs://kubric-public/tfds", with_info=True)
This works, but it appears that the data is being streamed from GCS instead of being downloaded to disk. The tfds documentation makes it look very straightforward to download the dataset, but that's not the case. I tried to follow the documentation of tfds.load (https://www.tensorflow.org/datasets/api_docs/python/tfds/load):
I tried passing download=True
I tried passing download=True, together with download_and_prepare_kwargs={"download_dir": "data"}
I tried try_gcs=False
I tried setting the environment variable TFDS_DATA_DIR
I tried calling ds["train"].save("data"). This actually downloads the data, but it quickly fills up 32 GB of memory and then crashes. I looked into sharding, but that doesn't seem to solve the problem, and the documentation doesn't help much.
I am using tensorflow_datasets==4.7.0.
Neither the tensorflow_datasets nor the Kubric repo seems to actively review issues, so I'm out of ideas. How can I actually download this dataset to disk?

"movi_b" isn't a dataset provided by tensorflow_datasets.
They're assuming that you've imported the movi_b.py file in that directory.
Importing the file registers the dataset.
So in a notebook I ran:
# get the movi_b.py file
!curl -O https://raw.githubusercontent.com/google-research/kubric/main/challenges/movi/movi_b.py
#install pypng
!pip install pypng
# register the dataset
import movi_b
import tensorflow_datasets as tfds
# data_dir is where to download to.
ds, info = tfds.load("movi_b", data_dir="./dataset/downloads", with_info=True)
That should work. But the code in the file is broken.
I can't find a version of tfds that can run that code. In the newest tfds, some tfds.core.ReadWritePath annotations fail. If you delete those, it fails when it tries to access gs://research-brain-kubric-xgcp/jobs/movi_b_regen_10k.
Can you report this bug on the repo?
So if you want a local copy, tfds will not work.
But the gs://kubric-public/tfds bucket is public access.
You can list files:
gsutil ls gs://kubric-public/tfds
gs://kubric-public/tfds/
gs://kubric-public/tfds/kubric_frames/
gs://kubric-public/tfds/movi_a/
gs://kubric-public/tfds/movi_b/
gs://kubric-public/tfds/movi_c/
gs://kubric-public/tfds/movi_d/
gs://kubric-public/tfds/movi_e/
gs://kubric-public/tfds/movi_f/
gs://kubric-public/tfds/msn_easy_frames/
gs://kubric-public/tfds/multi_shapenet_frames/
gs://kubric-public/tfds/nerf_synthetic_frames/
gs://kubric-public/tfds/nerf_synthetic_scenes/
gs://kubric-public/tfds/shapenet_pretraining/
And you can download them with:
mkdir tfds/
gsutil -m cp -r gs://kubric-public/tfds/movi_b tfds/movi_b
Note that it's 260GB.
Once that's done I think this should work:
ds, info = tfds.load("movi_b", data_dir="tfds/", with_info=True)
What a mess!
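If tfds.load still complains that it cannot find the builder code for the locally copied files, tfds.builder_from_directory can load a prepared dataset directly from its directory. A minimal sketch, assuming the gsutil copy above has finished and that the config/version subdirectory is 128x128/1.0.0 (an assumption; check the actual layout with ls tfds/movi_b):
import tensorflow_datasets as tfds

# Load the prepared dataset straight from the copied files; the builder code
# (movi_b.py) is not needed for this path. The subdirectory below is an assumption.
builder = tfds.builder_from_directory("tfds/movi_b/128x128/1.0.0")
info = builder.info
ds = builder.as_dataset(split="train")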

Related

!cp Executes only once when copying a file from Google Colab to Google Drive

I am running a long process that will take hours and hours to finish; therefore I want to dynamically save the results I receive from an API to my Google Drive. The !cp command works, but only once. It copies the file to my drive, but refuses to overwrite or update it later on.
I tried
Changing the file from private to public.
Deleting the file after !cp has been executed once to see if it will create a new one.
Played around with dynamic file names, such as file_name = f"FinishedCheckpoint_{index}"; however, this did not work either. After creating the file with the 0th index, it just stops any further updates. The files are still generated under the Colab notebooks directory, but they are not uploaded to Google Drive, which is essential to avoid losing progress.
Code cell below, any ideas?
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')

answers = []
for index, row in df.iterrows():
    answer = prompt_to_an_api(...)
    answers.append(answer)
    # checkpoint after every answer
    pd.DataFrame(answers).to_csv('FinishedCheckpoint.csv')
    !cp FinishedCheckpoint.csv "gdrive/My Drive/Colab Notebooks/Runtime Results"

pd.DataFrame(answers).to_csv('Finished.csv')
!cp Finished.csv "gdrive/My Drive/Colab Notebooks/Runtime Results"
drive.flush_and_unmount()
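One approach that usually sidesteps the !cp behaviour is to skip the shell copy entirely and write (or copy) the checkpoint straight to the mounted Drive path from Python; a minimal sketch, assuming the same mount point and target folder as above (Drive sync may lag a little, but the file does get updated):
import os
import shutil
import pandas as pd

drive_dir = "/content/gdrive/My Drive/Colab Notebooks/Runtime Results"
os.makedirs(drive_dir, exist_ok=True)

answers = []  # built up inside the API loop, as in the cell above

# Option 1: write the checkpoint directly into the mounted Drive folder,
# so each iteration simply overwrites the previous checkpoint.
pd.DataFrame(answers).to_csv(os.path.join(drive_dir, "FinishedCheckpoint.csv"))

# Option 2: keep writing locally and copy with shutil, which overwrites an
# existing destination file.
pd.DataFrame(answers).to_csv("FinishedCheckpoint.csv")
shutil.copy("FinishedCheckpoint.csv", os.path.join(drive_dir, "FinishedCheckpoint.csv"))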

Setting environment variables before the execution of the pyiron wrapper on remote cluster

I use a jobfile for SLURM in ~/pyiron/resources/queues/, which looks roughly like this:
#!/bin/bash
#SBATCH --output=time.out
#SBATCH --job-name={{job_name}}
#SBATCH --workdir={{working_directory}}
#SBATCH --get-user-env=L
#SBATCH --partition=cpu
module load some_python_module
export PYTHONPATH=path/to/lib:$PYTHONPATH
echo {{command}}
As you can see, I need to load a module to access the correct python version before calling "python -m pyiron.base.job.wrappercmd ..." and I also want to set the PYTHONPATH variable.
Setting the environment directly in the SLURM jobfile works, of course, but it seems very inconvenient, because I need a new jobfile under ~/pyiron/resources/queues/ whenever I want to run a calculation with a slightly different environment. Ideally, I would like to be able to adjust the environment directly in the Jupyter notebook. Something like an {{environment}} block in the above jobfile, which could be configured via Jupyter, seems like a nice solution.
As far as I can tell, this is impossible with the current version of pyiron and pysqa. Is there a similar solution available?
As an alternative, I could also imagine storing the above jobfile close to the Jupyter notebook. This would also make reproducibility easier for my colleagues. Is there an option to define a specific file to be used as a jinja2 template for the jobfile?
I could achieve my intended setup by writing a temporary jobfile under ~/pyiron/resources/queues/ via Jupyter before running the pyiron job, but this feels like quite a hacky solution.
Thank you very much,
Florian
To explain the example in a bit more detail:
I create a notebook named readenv.ipynb with the following content:
import subprocess
subprocess.check_output("echo ${My_SPECIAL_VAR}", shell=True)
This reads the environment variable My_SPECIAL_VAR.
I can now submit this job using a second jupyter notebook:
import os
os.environ["My_SPECIAL_VAR"] = "SoSpecial"
from pyiron import Project
pr = Project("envjob")
job = pr.create_job(pr.job_type.ScriptJob, "script")
job.script_path = "readenv.ipynb"
job.server.queue = "cm"
job.run()
In this case I first set the environment variable and then submit a script job. The script job is able to read the corresponding environment variable, as it is forwarded using the --get-user-env=L option. So you should be able to define the environment in the Jupyter notebook which you use to submit the calculation.
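Applied to the original question, the same mechanism might be used to pass PYTHONPATH from the submitting notebook; a sketch building on the answer's example (whether your cluster forwards the variable this way depends on the SLURM configuration and the --get-user-env=L line in the queue template):
import os
from pyiron import Project

# Assumption: variables set here are forwarded to the compute node,
# as demonstrated with My_SPECIAL_VAR above.
os.environ["PYTHONPATH"] = "path/to/lib:" + os.environ.get("PYTHONPATH", "")

pr = Project("envjob")
job = pr.create_job(pr.job_type.ScriptJob, "script")
job.script_path = "readenv.ipynb"
job.server.queue = "cm"
job.run()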

How to load images in Google Colab notebook using Tensorflow from mounted Google drive

In a Google Colab notebook, I have my Google drive mounted and can see my files.
I'm trying to load a zipped directory that has two folders with several picture files in each.
I followed an example from the TensorFlow site that shows how to load pictures, but it uses a remote location.
Here's the site - https://www.tensorflow.org/tutorials/load_data/images
Here's the code from the example that works:
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
Here's the revised code where I tried to reference the zipped directory from the mounted Google drive:
data_root_orig = tf.keras.utils.get_file(origin='/content/gdrive/My Drive/TrainingPictures/',
                                         fname='TrainingPictures_Car', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
I get this error:
ValueError: unknown url type: '/content/gdrive/My Drive/TrainingPictures/'
It's obviously expecting a URL instead of the path as I've provided.
I would like to know how I can load the zipped directory as provided from the Google drive.
In this case, there is no need to use tf.keras.utils.get_file(); a path is enough.
Here are two ways to do that.
First: !unzip -q '/content/gdrive/My Drive/TrainingPictures/TrainingPictures_Car.zip'
It will be unzipped into '/content/':
import pathlib
data = pathlib.Path('/content/folders_inside_zip')
count = len(list(data.glob('*/*.jpg')))
count
Second:
If the archive is already unzipped in Google Drive:
import pathlib
data = pathlib.Path('/content/gdrive/My Drive/TrainingPictures/')
count = len(list(data.glob('*.jpg')))
count
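If the end goal is training a model, the unzipped folder can also be turned into a batched tf.data dataset in one call; a minimal sketch, assuming a recent TensorFlow version and one subfolder per class (the path and image size below are placeholders):
import tensorflow as tf

# Builds a labeled dataset from a directory that has one subfolder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    '/content/folders_inside_zip',  # placeholder path from the unzip step above
    image_size=(224, 224),
    batch_size=32)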
In my case it actually worked by removing all imports and libraries and just setting the path as a string. The file has to be uploaded into Google Colab.
content_path = "cat.jpg"
For me it worked with file:///content/(filename)

Blender Command line importing files

I run the script from the Blender command line. All I want to do is run the same script for several files. I have completed the steps to open a .blend file in the background and run a script in Blender, but since I have only loaded one file, I cannot run the script on another file.
I looked up the Blender manual, but I could not find the command to import the file.
I proceeded to create a .blend file and run the script:
blender -b background.blend -P pythonfile.py
In addition, if possible, I would appreciate it if you could tell me how to script making the camera track an object with a Track To constraint (Ctrl + T -> Track To Constraint).
Thank you for reading my question.
Blender can only have one blend file open at a time, and any open scripts are cleared out when a new file is opened. What you want is a loop that starts Blender for each blend file using the same script file.
On *nix systems you can use a simple shell script
#!/bin/sh
for BF in $(ls *.blend)
do
    blender -b ${BF} -P pythonfile.py
done
A more cross platform solution is to use python -
from glob import glob
from subprocess import call
for blendFile in glob('*.blend'):
    call(['blender',
          '-b', blendFile,
          '--python', 'pythonfile.py'])
To add a Track-to constraint to Camera pointing it at Cube -
import bpy

camera = bpy.data.objects['Camera']
c = camera.constraints.new('TRACK_TO')
c.target = bpy.data.objects['Cube']
c.track_axis = 'TRACK_NEGATIVE_Z'
c.up_axis = 'UP_Y'
This is taken from my answer here which also animates the camera going around the object.
bpy.context.view_layer.objects.active = CameraObject
bpy.ops.object.constraint_add(type='TRACK_TO')
CameraObject.constraints["Track To"].target = bpy.data.objects['ObjectToTrack']
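Putting the two parts together, the pythonfile.py passed to each Blender invocation might, for example, add the constraint and then save the file; a sketch assuming objects named 'Camera' and 'Cube' exist in every blend file:
# pythonfile.py
import bpy

# Add a Track To constraint so the camera points at the target object.
camera = bpy.data.objects['Camera']
constraint = camera.constraints.new('TRACK_TO')
constraint.target = bpy.data.objects['Cube']
constraint.track_axis = 'TRACK_NEGATIVE_Z'
constraint.up_axis = 'UP_Y'

# Save the modified blend file in place before Blender exits.
bpy.ops.wm.save_mainfile()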

Persistent Python Command-Line History

I'd like to be able to "up-arrow" to commands that I input in a previous Python interpreter. I have found the readline module which offers functions like: read_history_file, write_history_file, and set_startup_hook. I'm not quite savvy enough to put this into practice though, so could someone please help? My thoughts on the solution are:
(1) Modify .login so that PYTHONSTARTUP runs a python script.
(2) In that python script file do something like:
def command_history_hook():
    import readline
    readline.read_history_file('.python_history')

command_history_hook()
(3) Whenever the interpreter exits, write the history to the file. I guess the best way to do this is to define a function in your startup script and exit using that function:
def ex():
    import readline
    readline.write_history_file('.python_history')
    exit()
It's very annoying to have to exit using parentheses, though: ex(). Is there some python sugar that would allow ex (without the parens) to run the ex function?
Is there a better way to cause the history file to write each time? Thanks in advance for all solutions/suggestions.
Also, there are two architectural choices as I can see. One choice is to have a unified command history. The benefit is simplicity (the alternative that follows litters your home directory with a lot of files.) The disadvantage is that interpreters you run in separate terminals will be populated with each other's command histories, and they will overwrite one another's histories. (this is okay for me since I'm usually interested in closing an interpreter and reopening one immediately to reload modules, and in that case that interpreter's commands will have been written to the file.) One possible solution to maintain separate history files per terminal is to write an environment variable for each new terminal you create:
import string
from random import choice

def random_key():
    return ''.join([choice(string.ascii_uppercase + string.digits) for i in range(16)])

def command_history_hook():
    import readline
    key = get_env_variable('command_history_key')
    if key:
        readline.read_history_file('.python_history_{0}'.format(key))
    else:
        set_env_variable('command_history_key', random_key())

def ex():
    import readline
    key = get_env_variable('command_history_key')
    if not key:
        key = random_key()
        set_env_variable('command_history_key', key)
    readline.write_history_file('.python_history_{0}'.format(key))
    exit()
By decreasing the random key length from 16 to, say, 1, you could decrease the number of files littering your directories to 36, at the expense of a possible (2.8% chance) overlap.
I think the suggestions in the Python documentation pretty much cover what you want. Look at the example pystartup file toward the end of section 13.3:
http://docs.python.org/tutorial/interactive.html
or see this page:
http://rc98.net/pystartup
But, for an out of the box interactive shell that provides all this and more, take a look at using IPython:
http://ipython.scipy.org/moin/
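For reference, a minimal startup file in the same spirit as the pystartup example linked above; it uses atexit so the history is written on every interpreter exit without a custom ex() function (the file name is just a convention):
# Save as e.g. ~/.pystartup and point PYTHONSTARTUP at it,
# for example: export PYTHONSTARTUP=~/.pystartup
import atexit
import os
import readline

history_path = os.path.expanduser("~/.python_history")

# Load the history from previous sessions, if any.
if os.path.exists(history_path):
    readline.read_history_file(history_path)

# Write the history back automatically whenever the interpreter exits.
atexit.register(readline.write_history_file, history_path)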
Try using IPython as a python shell. It already has everything you ask for. They have packages for most popular distros, so installation should be very easy.
Persistent history has been supported out of the box since Python 3.4. See this bug report.
Use pip to install the pyreadline package:
pip install pyreadline
If all you want is to use interactive history substitution without all the file stuff, all you need to do is import readline:
import readline
Then you can use the up/down arrow keys to navigate past commands. The same works for Python 2 or 3.
This wasn't clear to me from the docs, but maybe I missed it.