Remove downloaded TensorFlow and PyTorch (Hugging Face) models

I would like to remove TensorFlow and Hugging Face models from my laptop.
I did find one link, https://github.com/huggingface/transformers/issues/861,
but is there no command that can remove them? As mentioned in the link, deleting them manually can cause problems, because we don't know which other files are linked to those models or expect a model to be present in that location, or it may simply cause some error.

The transformers library stores the downloaded files in your cache. As far as I know, there is no built-in method to remove certain models from the cache, but you can code something yourself. The files are stored under a cryptic name alongside two additional files that have .json (.h5.json in the case of TensorFlow models) and .lock appended to that name. The json file contains some metadata that can be used to identify the file. The following is an example of such a file:
{"url": "https://cdn.huggingface.co/roberta-base-pytorch_model.bin", "etag": "\"8a60a65d5096de71f572516af7f5a0c4-30\""}
We can now use this information to create a list of your cached files as shown below:
import glob
import json
import re
from collections import OrderedDict

from transformers import TRANSFORMERS_CACHE

# Every cached file has a companion .json metadata file containing its URL.
metaFiles = glob.glob(TRANSFORMERS_CACHE + '/*.json')
modelRegex = r"huggingface\.co\/(.*)(pytorch_model\.bin$|resolve\/main\/tf_model\.h5$)"

cachedModels = {}
cachedTokenizers = {}
for file in metaFiles:
    with open(file) as j:
        data = json.load(j)
    isM = re.search(modelRegex, data['url'])
    if isM:
        # Model weights: key is the model name, value is the metadata file path.
        cachedModels[isM.group(1)[:-1]] = file
    else:
        # Everything else (tokenizer/config files), keyed by the URL path.
        cachedTokenizers[data['url'].partition('huggingface.co/')[2]] = file

cachedTokenizers = OrderedDict(sorted(cachedTokenizers.items(), key=lambda k: k[0]))
Now all you have to do is check the keys of cachedModels and cachedTokenizers and decide whether you want to keep them. If you want to delete one, take the value from the dictionary and delete that file from the cache. Don't forget to also delete the corresponding *.json and *.lock files.
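As a rough sketch (assuming 'roberta-base' shows up as a key in cachedModels; adapt to your own keys and to the .h5 naming for TensorFlow models), the deletion could look like this:
import os

metaFile = cachedModels.get('roberta-base')   # hypothetical key, pick one from your dict
if metaFile is not None:
    blobFile = metaFile[:-len('.json')]       # the actual cached weights file
    for path in (blobFile, metaFile, blobFile + '.lock'):
        if os.path.exists(path):
            os.remove(path)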

Use
pip install "huggingface_hub[cli]"
Then
huggingface-cli delete-cache
You should now see a list of revisions that you can select/deselect.
See this link for details.
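If you prefer to do the same from Python, the huggingface_hub package exposes the cache-scanning API that the CLI uses; a minimal sketch (filtering on "roberta-base" is only an example):
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()

# Collect the revision hashes of every cached repo we want to drop.
to_delete = [
    revision.commit_hash
    for repo in cache_info.repos
    if repo.repo_id == "roberta-base"   # example filter
    for revision in repo.revisions
]

strategy = cache_info.delete_revisions(*to_delete)
print(f"Will free {strategy.expected_freed_size_str}")
strategy.execute()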

pip uninstall tensorflow
pip uninstall tensorflow-gpu
pip uninstall transformers
Then find where you have saved GPT-2, e.g. if you called
model.save_pretrained("./english-gpt2")
then english-gpt2 is your downloaded model's directory name, and you can delete it manually from that path.
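For example, a quick sketch of removing such a locally saved model directory (using the ./english-gpt2 path from above):
import shutil
from pathlib import Path

model_dir = Path("./english-gpt2")   # directory created by save_pretrained()
if model_dir.is_dir():
    shutil.rmtree(model_dir)         # removes the directory and everything inside it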

Related

How to convert multiple LCI ecospold files to a custom Excel format / how to use parse_file from pyecospold / how to read ecospold into Brightway

I have multiple ecospold (version 1) files with LCI data that I want to convert to a custom Excel format. I need all the data given in the ecospold files. For my own convenience, I want to use Python to complete this task.
My research so far has led me to the following conclusions:
There are at least two converters (by GLAD and openLCA) that convert the ecospold formats (1 and 2) to e.g. the ILCD format. But those converters do not get me anywhere, since I need all the data accessible in Python in order to then write it into my custom Excel format.
To get the data in python, the package pyecospold (https://github.com/sami-m-g/pyecospold) seems to be a suitable choice.
According to the README that can be found at the pyecospold github repository,
ecoSpold = parse_file("data/v1/v1_1.xml") # Replace with your own XML file
should do the job. So I implemented the following lines:
import os
from pyecospold import parse_file, save_file, Defaults
from lxml import etree
cd = os.getcwd()
path_input = cd + r'\inputs\ecospold_test.xml'
# Parse the required XML file to EcoSpold class.
es = parse_file('inputs/ecospold_test.xml')
Now I run into the error:
TypeError: parse_file() missing 2 required positional arguments: 'schema_path' and 'ecospold_lookup'
I understood that a schema in XSD format is needed, so I got the schema files from the GitHub repository and amended my last line of code:
es = parse_file('inputs/ecospold_test.xml', 'inputs/schemas/v1/EcoSpold01Dataset.xsd')
Now there is still one argument missing:
TypeError: parse_file() missing 1 required positional argument: 'ecospold_lookup'
Since I have no experience in parsing XML files in Python, I have no idea what to do with this. I am also confused as to why the README does not mention these additional required arguments.
My second idea was to use Brightway to get the data into Python. But since Brightway itself is quite an extensive package, I could not find a simple (or any) way to do this. (Sadly, the notebooks linked in the answer to this question, Import Ecoinvent 2.2 Ecospold files into Brightway, do not exist anymore.)
Another option would of course be to write my own parser. But because I am lacking experience and pyecospold does exactly this (at least in my understanding), I would like to avoid this option.
Additionally, in openLCA it is possible to read ecospold files and then export them to an Excel format. From that Excel format I could of course create my custom Excel format. The problem here is that I have no idea how to automate this, because I do not want to read in and export each file individually and manually in openLCA.
If anyone has an idea on how to solve one of my subproblems or a good alternative on how to solve my general problem, I would be very thankful. :)
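In case a small hand-rolled parser does turn out to be the pragmatic route after all, here is a minimal, schema-agnostic sketch that dumps every element attribute of an ecospold XML into a flat Excel sheet with lxml and pandas (file names are placeholders; to_excel needs openpyxl installed):
import pandas as pd
from lxml import etree

tree = etree.parse("inputs/ecospold_test.xml")   # placeholder path

rows = []
for elem in tree.iter():
    if not isinstance(elem.tag, str):            # skip comments / processing instructions
        continue
    tag = etree.QName(elem).localname            # element name without namespace
    for attr, value in elem.attrib.items():
        rows.append({"element": tag, "attribute": attr, "value": value})

# One row per (element, attribute, value); reshape later into your custom layout.
pd.DataFrame(rows).to_excel("ecospold_dump.xlsx", index=False)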

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
This snippet works for me (replace the YOUR-... placeholders with your own values):
from google.cloud import bigquery_datatransfer
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="YOUR-TRANSFER-NAME",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://PATH-TO-YOUR-DATA/*.csv",
        "destination_table_name_template": "YOUR-TABLE-NAME",
        "file_format": "CSV",
        "skip_leading_rows": "1",
        "delete_source_files": True,
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
In this example, table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will crash with error "Not found: Table YOUR-TABLE-NAME".
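If the table does not exist yet, one way to create it up front is with the regular google-cloud-bigquery client; a rough sketch (the schema fields are purely illustrative):
from google.cloud import bigquery

bq_client = bigquery.Client(project=destination_project_id)

# Illustrative schema -- replace with the columns of your CSV files.
schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("value", "FLOAT"),
]

table_id = f"{destination_project_id}.{destination_dataset_id}.YOUR-TABLE-NAME"
bq_client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)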
I used these packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the attribute delete_source_files in params. From docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.
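To trigger an on-demand run of that config from Python (which is what the question asks about), the same client exposes start_manual_transfer_runs; a sketch reusing transfer_config from above:
from datetime import datetime, timezone
from google.protobuf.timestamp_pb2 import Timestamp

# Request a single manual run of the transfer config created above.
now = Timestamp()
now.FromDatetime(datetime.now(timezone.utc))

response = transfer_client.start_manual_transfer_runs(
    bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config.name,
        requested_run_time=now,
    )
)
for run in response.runs:
    print(f"Started run: {run.name}")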

boto3 load custom models

For example:
session = boto3.Session()
client = session.client('custom-service')
I know that I can create a JSON file with API definitions under ~/.aws/models and botocore will load it from there. The problem is that I need to do this in an AWS Lambda function, where that looks impossible.
I am looking for a way to tell boto3 where the custom JSON API definitions are, so it can load them from that path.
Thanks
I have only a partial answer. There's a bit of documentation about botocore's loader module, which is what reads the model files. In a discussion about loading models from ZIP archives, a monkey patch was offered that extracts the ZIP to a temporary filesystem location and then extends the loader search path to include that location. It doesn't seem like you can load model data directly from memory based on the API, but Lambda does give you some scratch space in /tmp.
Here are the important bits:
import boto3
session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
The directory structure of /tmp/boto needs to follow the resource loader documentation. The main model file needs to be at /tmp/boto/custom-service/yyyy-mm-dd/service-2.json.
The issue also mentions that alternative loaders can be swapped in using Session.register_component so if you wanted to write a scrappy loader which returned a model straight from memory you could try that too. I don't have any info about how to go about doing that.
Just adding more details:
import boto3
import zipfile
import os

s3_client = boto3.client('s3')
s3_client.download_file('your-bucket', 'model.zip', '/tmp/model.zip')

# Extract the archive into /tmp so the loader can find /tmp/boto/...
os.chdir('/tmp')
with zipfile.ZipFile('model.zip', 'r') as archive:
    archive.extractall()

session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
model.zip is just a compressed file that contains:
Archive: model.zip
Length Date Time Name
--------- ---------- ----- ----
0 11-04-2020 16:44 boto/
0 11-04-2020 16:44 boto/custom-service/
0 11-04-2020 16:44 boto/custom-service/2018-04-23/
21440 11-04-2020 16:44 boto/custom-service/2018-04-23/service-2.json
Just remember to have the proper lambda role to access S3 and your custom-service.
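If you need to build such a model.zip yourself, a small sketch using Python's zipfile (the service name and API-version folder follow the listing above; adjust them to your own definition file):
import zipfile

# Pack a local service-2.json into the directory layout boto3's loader expects.
with zipfile.ZipFile("model.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(
        "service-2.json",
        arcname="boto/custom-service/2018-04-23/service-2.json",
    )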
boto3 also allows setting the AWS_DATA_PATH environment variable, which can point to a directory path of your choice (see the boto3 docs).
Everything packaged in a Lambda layer is mounted under /opt/.
Let's assume all your custom models live under a models/ folder in that layer. When the layer is mounted into the Lambda environment, the folder will live at /opt/models/.
Simply specify AWS_DATA_PATH=/opt/models/ in the Lambda configuration and boto3 will pick up the models in that directory.
This is better than fetching the models from S3 at runtime, unpacking them, and then modifying session parameters.
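For completeness, a tiny sketch of doing the same from code rather than the Lambda configuration; the variable just has to be set before the session is created:
import os
import boto3

# AWS_DATA_PATH is read when the session builds its loader,
# so set it before creating the session/client.
os.environ["AWS_DATA_PATH"] = "/opt/models"

client = boto3.Session().client("custom-service")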

Android App via buildozer: requirements vs. recipes

I am trying to deploy an Android app. I work with the Kivy framework and buildozer in Python. My issue is including the pandas library. This is my simple and working test code:
from kivy.app import App
from kivy.uix.label import Label
import kivy
kivy.require('1.11.1')

import pandas as pd


class TestLibraries(App):
    def build(self):
        df = pd.DataFrame()
        df.loc[0, 'text'] = 'this is pandas'
        return Label(text=df.loc[0, 'text'])


if __name__ == '__main__':
    TestLibraries().run()
The next step is to define the buildozer .spec file. Here I see two options:
1. Via requirements: I modify the .spec file like this:
# (list) Application requirements
# comma separated e.g. requirements = sqlite3,kivy
requirements = python3,kivy==1.11.1,pandas
This works very well.
2. Via recipe: I take the recipe from GitHub and put it into my folder called recipe. After that I modify the .spec file like this:
# (str) The directory in which python-for-android should look for your own build recipes (if any)
p4a.local_recipes = /PATH_TO_FOLDER/recipe/
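For reference, the folder I point p4a.local_recipes at has the layout python-for-android expects, i.e. recipe/pandas/__init__.py roughly like the trimmed sketch below (version, url and depends values are only illustrative, not the exact upstream recipe):
# recipe/pandas/__init__.py -- trimmed sketch of a python-for-android recipe
from pythonforandroid.recipe import CompiledComponentsPythonRecipe


class PandasRecipe(CompiledComponentsPythonRecipe):
    version = '1.0.3'                                   # illustrative
    url = 'https://github.com/pandas-dev/pandas/archive/v{version}.tar.gz'
    depends = ['cython', 'numpy']                       # built before this recipe
    python_depends = ['python-dateutil', 'pytz']        # installed via pip on the target


recipe = PandasRecipe()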
In the buildozer logfile i can read:
Listing '/PATH_TO_FOLDER/.buildozer/android/app/recipe/pandas'...
Compiling 'PATH_TO_FOLDER/.buildozer/android/app/recipe/pandas/__init__.py'...
So buildozer found the recipe, but the library is not installed and the app doesn't work.
And the question is: why not?
You might ask why I use the second option when the first one works very well: in the next step I want to write a new recipe, so I first have to learn how to include an existing recipe correctly.
I hope you understand my problem and have some advice.
Thanks, Capa

AWS Lambda - dynamically import python module from S3 at runtime

I have several dozen Python modules, each of which has one common method (e.g. run(params)) but with a different implementation. I also have an AWS Lambda function that needs to call that method from within one of those modules, choosing the module depending on the Lambda's input.
It seems that I can achieve that by using Layers in Lambda.
However, if I use one single layer for all those modules, I can see problems with versioning it: if I need to update one module, I'll need to re-deploy the whole layer, which could bring unexpected changes to the other modules.
If I use one layer for each module, then there will be too many layers to manage.
I thought of putting each module into an individual zip file and putting those zip files in an S3 location. My Lambda will then dynamically read the required zip file from S3 and execute it.
Is that approach viable?
=====================
My current solution is to have something like this:
import io
import zipfile
from types import ModuleType

import boto3


def read_python_script_from_zip(bucket: str, key: str, script_name: str) -> ModuleType:
    # Download the zip from S3 into memory and open it.
    s3 = boto3.resource('s3')
    raw = s3.Object(bucket, key).get()['Body'].read()
    zf = zipfile.ZipFile(io.BytesIO(raw), "r")

    # Find exactly one file matching <anything>/<script_name>.py inside the archive.
    scripts = list(filter(lambda f: f.endswith(f"/{script_name}.py"), zf.namelist()))
    if len(scripts) == 0:
        raise ModuleNotFoundError(f"{script_name} not found.")
    if len(scripts) > 1:
        raise ModuleNotFoundError(f"{script_name} is ambiguous.")

    # Execute the module source into a fresh module object.
    source = zf.read(scripts[0])
    mod = ModuleType(script_name, '')
    exec(source, mod.__dict__)
    return mod


read_python_script_from_zip(source_bucket, source_key, module_name).run(params)
Looks complicated to me though; I would expect an easier way.
You could try packaging each module as a separate distribution package, which would let you version them separately. However, creating a Python distribution package is not as simple as you might hope, especially if you want to publish it to a private repository hosted on S3.
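As a rough illustration of that idea (bucket, key and package names below are all hypothetical), a Lambda could download a pre-built pure-Python wheel from S3, extract it onto sys.path and import it:
import importlib
import sys
import zipfile

import boto3

# Hypothetical locations: a pre-built, pure-Python wheel stored in S3.
bucket, key = "my-module-bucket", "wheels/my_module-1.0-py3-none-any.whl"
wheel_path = "/tmp/my_module.whl"
target_dir = "/tmp/pkgs"

boto3.client("s3").download_file(bucket, key, wheel_path)

# A pure-Python wheel is just a zip archive, so extracting it onto sys.path
# is enough -- no pip needed inside the Lambda runtime.
with zipfile.ZipFile(wheel_path) as whl:
    whl.extractall(target_dir)

sys.path.insert(0, target_dir)
module = importlib.import_module("my_module")   # hypothetical package name
module.run({"example": 1})                      # call the common entry point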