How to handle punctuation and symbol in rasa? - supervised-learning

Rasa version - 1.3.7
pipeline: “supervised_embeddings”
I have trained the bot with no punctuation in intent like.
intent: ask_holiday_in_a_year
How many holidays do we have in a year?
If i ask below question to bot
How many holidays do we have in a year? - ( NLU is able to recognize
it correctly).
How, many ()? Holidays!!,do!##we have$%^ in a %^& year. - (NLU is
able to recognize it correctly.)
How many ###################### holidays do we have in a year? .(NLU
is not able to recognize it correctly.)
How many ####### holidays %%^&*$$% do we have in a year? .(NLU is
not able to recognize it correctly.)
For cases 1 and 2, it worked but for case 3 and 4, it didn’t work? Is there any way(adding some settings in the pipeline)i can handle these symbols and punctuation and give the expected result?

i'm not sure weather this is the correct approach. but, you can try creating random data using a tool like Chatito while including symbols in your train data. again, i'm not sure if this is correct or not

First of all rasa recognize examples based on what you give to it. If you have examples as you 3 and 4 sentences rasa will recognize it. If you think outside of the box there can be multiple issues like this and there is no way rasa can recognize wthat is this and so on. So you want to give examples which are somewhat related to probable questions that can be asked from bot.

This can be handled using a custom nlu component.
In the third parameter of the maketrans function, add the symbols you want to delete.
This custom pipeline will remove all defined keywords and send the filtered text to nlu.
from rasa.nlu.components import Component
import typing
from typing import Any, Optional, Text, Dict
if typing.TYPE_CHECKING:
from rasa.nlu.model import Metadata
class DeleteSymbols(Component):
provides = ["text"]
#requires = []
defaults = {}
language_list = None
def __init__(self, component_config=None):
super(DeleteSymbols, self).__init__(component_config)
def train(self, training_data, cfg, **kwargs):
pass
def process(self, message, **kwargs):
mt = message.text
message.text = mt.translate(mt.maketrans('', '', '$%&(){}^'))
def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
pass
#classmethod
def load(
cls,
meta: Dict[Text, Any],
model_dir: Optional[Text] = None,
model_metadata: Optional["Metadata"] = None,
cached_component: Optional["Component"] = None,
**kwargs: Any
) -> "Component":
"""Load this component from file."""
if cached_component:
return cached_component
else:
return cls(meta)
Add the pipeline in config.yml
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline:
- name: "Pipelines.TextParsing.TextParsingPipeline"
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"
Source - https://forum.rasa.com/t/how-to-handle-punctuation-and-symbol-in-rasa/19454

Related

Dask process hangs after warning about "full garbage collection took x% cpu time recently (threshold: y%)"

I'm using Dask to process a massive dataset and eventually build a model for a classification task and I'm running into problems. I hope I can get some help.
Main Task
I'm working with clinical notes. Each clinical note has a note type associated with it. However, over 60% of the notes are of type *Missing*. I'm trying to train a classifier on the notes that are labeled and run inference on the notes that have the missing type.
Data
I'm working with 3 years worth of clinical notes. The total data size is ~1.3TB. These were pulled from a database using PySpark (I have no control over this process) and are organized as year/month/partitions.parquet. The root directory is raw_data. The number of partitions within each month varies (e.g, one of the months has 2620 partitions). The total number of partitions is over 50,000.
Machine
Cores: 64
Memory: 1TB
Machine is shared with others so I won't be able to access the entire hardware resources at a given time.
Code
As a first step towards building the model, I want to preprocess the data and do some EDA. I'm using the package Textdescriptives which uses SpaCy to get some basic information about the text.
def replace_empty(text, replace=np.nan):
"""
Replace empty notes with nan's which can be removed later
"""
if pd.isnull(text):
return text
elif text.isspace() or text == '':
return replace
return text
def fix_ws(text):
"""
Replace multiple carriage returns with a single newline
and multiple new lines with a single new line
"""
text = re.sub('\r', '\n', text)
text = re.sub('\n+', '\n', text)
return text
def replace_empty_part(df, **kwargs):
return df.apply(replace_empty)
def fix_ws_part(df, **kwargs):
return df.apply(fix_ws)
def fix_missing_part(df, **kwargs):
return df.apply(lambda t: *Missing* if t == 'Unknown at this time' else t)
def extract_td_metrics(text, spacy_model):
try:
doc = spacy_model(text)
metrics_df = td.extract_df(doc)[cols]
return metrics_df.squeeze()
except:
return pd.Series([np.nan for _ in range(len(cols))], index=cols)
def extract_metrics_part(df, **kwargs):
spacy_model = spacy.load('en_core_web_sm', disable=['tok2vec', 'parser', 'ner', 'attribute_ruler', 'lemmantizer'])
spacy_model.add_pipe('textdescriptives')
return df.apply(extract_td_metrics, spacy_model=spacy_model)
client = Client(n_workers=32)
notes_df = dd.read_parquet(single_month)
notes_df['Text'] = notes_df['Text'].map_partitions(replace_empty_part, meta='string')
notes_df = notes_df.dropna()
notes_df['Text'] = notes_df['Text'].map_partitions(fix_ws_part, meta='string')
notes_df['NoteType'] = notes_df['NoteType'].map_partitions(fix_missing_part, meta='string')
metrics_df = notes_df['Text'].map_partitions(extract_metrics_part)
notes_df = dd.concat([notes_df, metrics_df], axis=1)
notes_df = notes_df.dropna()
notes_df = notes_df.repartition(npartitions=4)
notes_df.to_parquet(processed_notes, schema={'NoteType': pa.string(), 'Text': pa.string(), write_index=False)
All of this code was tested on a small sample with Pandas to make sure it works and on Dask (on the same sample) to make sure the results matched. When I run this code on only a single month worth of data, after running for a few seconds, the process just hangs outputing a stream of warnings of this type:
timestamp - distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
The machine is in a secure enclave so I don't have copy/paste facility so I'm typing out everything here. After some research I came across two threads here and here. While there wasn't a direct solution in either one of them, suggestions included disabling Python garbage collection using gc.disable and starting a clean environment with dask freshly installed. Both of these didn't help me. I'm wondering if I can maybe modify my code so that this problem doesn't happen. There is no way to load all this data in memory and use Pandas directly.
Thanks.

How to load customized NER model from disk with SpaCy?

I have customized NER pipeline with following procedure
doc = nlp("I am going to Vallila. I am going to Sörnäinen.")
for ent in doc.ents:
print(ent.text, ent.label_)
LABEL = 'DISTRICT'
TRAIN_DATA = [
(
'We need to deliver it to Vallila', {
'entities': [(25, 32, 'DISTRICT')]
}),
(
'We need to deliver it to somewhere', {
'entities': []
}),
]
ner = nlp.get_pipe("ner")
ner.add_label(LABEL)
nlp.disable_pipes("tagger")
nlp.disable_pipes("parser")
nlp.disable_pipes("attribute_ruler")
nlp.disable_pipes("lemmatizer")
nlp.disable_pipes("tok2vec")
optimizer = nlp.get_pipe("ner").create_optimizer()
import random
from spacy.training import Example
for i in range(25):
random.shuffle(TRAIN_DATA)
for text, annotation in TRAIN_DATA:
example = Example.from_dict(nlp.make_doc(text), annotation)
nlp.update([example], sgd=optimizer)
I tried to save that customized NER to disk and load it again with following code
ner.to_disk('/home/feru/ner')
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load("en_core_web_lg", disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('/home/feru/ner')
nlp.add_pipe(ner)
I got however following error:
---> 10 ner = EntityRecognizer(nlp.vocab)
11 ner.from_disk('/home/feru/ner')
12 nlp.add_pipe(ner)
~/.local/lib/python3.8/site-packages/spacy/pipeline/ner.pyx in
spacy.pipeline.ner.EntityRecognizer.init()
TypeError: init() takes at least 2 positional arguments (1 given)
This method to save and load custom component from disk seems to be from some erly SpaCy version. What's the second argument EntityRecognizer needs?
The general process you are following of serializing a single component and reloading it is not the recommended way to do this in spaCy. You can do it - it has to be done internally, of course - but you generally want to save and load pipelines using high-level wrappers. In this case this means that you would save like this:
nlp.to_disk("my_model") # NOT ner.to_disk
And then load it with spacy.load("my_model").
You can find more detail about this in the saving and loading docs. Since it seems you're just getting started with spaCy, you might want to go through the course too. It covers the new config-based training in v3, which is much easier than using your own custom training loop like in your code sample.
If you want to mix and match components from different pipelines, you still will generally want to save entire pipelines, and you can then combine components from them using the "sourcing" feature.

How not to get "datum" as the lemma for "data" when using Spacy?

I've run into a quite common word "data" which gets assigned a lemma "datum" from lookups exceptions table spacy uses. I understand that the lemma is technically correct, but in today's english, "data" in its basic form is just "data".
I am using the lemmas to get a sort of keywords from text and if I have a text about data, I can't possibly tag it with "datum".
I was wondering if there is another way to arrive at plain "data" then constructing another "my_exceptions" dictionary used for overriding post-processing.
Thanks for any suggestions.
You could use Lemminflect which works as an add-in pipeline component for SpaCy. It should give you better results.
To use it with SpaCy, just import lemminflect and call the new ._.lemma() function on the Token, ie.. token._.lemma(). Here's an example..
import lemminflect
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I got the data')
for token in doc:
print('%-6s %-6s %s' % (token.text, token.lemma_, token._.lemma()))
I -PRON- I
got get get
the the the
data datum data
Lemminflect has a prioritized list of lemmas, based on occurrence in corpus data. You can see all lemmas with...
print(lemminflect.getAllLemmas('data'))
{'NOUN': ('data', 'datum')}
It's relatively easy to customize the lemmatizer once you know where to look. The original lemmatizer tables are from the package spacy-lookups-data and are loaded in the model under nlp.vocab.lookups. You can use a local install of spacy-lookups-data to customize the tables for new/blank models, but if you just want to make a few modifications to the entries for an existing model, you can modify the lemmatizer tables on the fly.
Depending on whether your pipeline includes a tagger, the lemmatizer may be referring to rules+exceptions (with a tagger) or to a simple lookup table (without a tagger), both of which include an exception that lemmatizes data to datum by default. If you remove this exception, you should get data as the lemma for data.
For a pipeline that includes a tagger (rule-based lemmatizer)
# tested with spaCy v2.2.4
import spacy
nlp = spacy.load("en_core_web_sm")
# remove exception from rule-based exceptions
lemma_exc = nlp.vocab.lookups.get_table("lemma_exc")
del lemma_exc[nlp.vocab.strings["noun"]]["data"]
assert nlp.vocab.morphology.lemmatizer("data", "NOUN") == ["data"]
# "data" with the POS "NOUN" has the lemma "data"
doc = nlp("data")
doc[0].pos_ = "NOUN" # just to make sure the POS is correct
assert doc[0].lemma_ == "data"
For a pipeline without a tagger (simple lookup lemmatizer)
import spacy
nlp = spacy.blank("en")
# remove exception from lookups
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
del lemma_lookup[nlp.vocab.strings["data"]]
assert nlp.vocab.morphology.lemmatizer("data", "") == ["data"]
doc = nlp("data")
assert doc[0].lemma_ == "data"
For both: save model for future use with these modifications included in the lemmatizer tables
nlp.to_disk("/path/to/model")
Also be aware that the lemmatizer uses a cache, so make any changes before running your model on any texts or you may run into problems where it returns lemmas from the cache rather than the updated tables.

SpaCy use Lemmatizer as stand-alone component

I want to use SpaCy's lemmatizer as a standalone component (because I have pre-tokenized text, and I don't want to re-concatenate it and run the full pipeline because SpaCy will most likely tokenize differently in some cases).
I found the lemmatizer in the package but I somehow needs to load the dictionaries with the rules to initialize this Lemmatizer.
These files must be somewhere in the model of the English or German model, right? I couldn't find them there.
from spacy.lemmatizer import Lemmatizer
where do the LEMMA_INDEX, etc. files are comming from?
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
I found a similar question here: Spacy lemmatizer issue/consistency
but this one did not entirely answer how to get these dictionary files from the model. The spacy.lang.* parameter seems to no longer exist in newer versions.
Here's an extracted bit of code I had, that used the SpaCy lemmatizer by itself. I'm not somewhere I can run it so it might have a small bug or two if I made an editing mistake.
Note that in general, you need to know the upos for the word in order to lemmatize correctly. This code will return all the possible lemmas but I would advise modifying it to pass in the correct upos for your word.
class SpacyLemmatizer(object):
def __init__(self, smodel):
import spacy
self.lemmatizer = spacy.load(smodel).vocab.morphology.lemmatizer
# get the lemmas for every upos
def getLemmas(self, entry):
possible_lemmas = set()
for upos in ('NOUN', 'VERB', 'ADJ', 'ADV'):
lemmas = self.lemmatizer(entry, upos, morphology=None)
lemma = lemmas[0] # See morphology.pyx::lemmatize
possible_lemmas.add( lemma )
return possible_lemmas

How to get chosen class images from Imagenet?

Background
I have been playing around with Deep Dream and Inceptionism, using the Caffe framework to visualize layers of GoogLeNet, an architecture built for the Imagenet project, a large visual database designed for use in visual object recognition.
You can find Imagenet here: Imagenet 1000 Classes.
To probe into the architecture and generate 'dreams', I am using three notebooks:
https://github.com/google/deepdream/blob/master/dream.ipynb
https://github.com/kylemcdonald/deepdream/blob/master/dream.ipynb
https://github.com/auduno/deepdraw/blob/master/deepdraw.ipynb
The basic idea here is to extract some features from each channel in a specified layer from the model or a 'guide' image.
Then we input an image we wish to modify into the model and extract the features in the same layer specified (for each octave),
enhancing the best matching features, i.e., the largest dot product of the two feature vectors.
So far I've managed to modify input images and control dreams using the following approaches:
(a) applying layers as 'end' objectives for the input image optimization. (see Feature Visualization)
(b) using a second image to guide de optimization objective on the input image.
(c) visualize Googlenet model classes generated from noise.
However, the effect I want to achieve sits in-between these techniques, of which I haven't found any documentation, paper, or code.
Desired result (not part of the question to be answered)
To have one single class or unit belonging to a given 'end' layer (a) guide the optimization objective (b) and have this class visualized (c) on the input image:
An example where class = 'face' and input_image = 'clouds.jpg':
please note: the image above was generated using a model for face recognition, which was not trained on the Imagenet dataset. For demonstration purposes only.
Working code
Approach (a)
from cStringIO import StringIO
import numpy as np
import scipy.ndimage as nd
import PIL.Image
from IPython.display import clear_output, Image, display
from google.protobuf import text_format
import matplotlib as plt
import caffe
model_name = 'GoogLeNet'
model_path = 'models/dream/bvlc_googlenet/' # substitute your path here
net_fn = model_path + 'deploy.prototxt'
param_fn = model_path + 'bvlc_googlenet.caffemodel'
model = caffe.io.caffe_pb2.NetParameter()
text_format.Merge(open(net_fn).read(), model)
model.force_backward = True
open('models/dream/bvlc_googlenet/tmp.prototxt', 'w').write(str(model))
net = caffe.Classifier('models/dream/bvlc_googlenet/tmp.prototxt', param_fn,
mean = np.float32([104.0, 116.0, 122.0]), # ImageNet mean, training set dependent
channel_swap = (2,1,0)) # the reference model has channels in BGR order instead of RGB
def showarray(a, fmt='jpeg'):
a = np.uint8(np.clip(a, 0, 255))
f = StringIO()
PIL.Image.fromarray(a).save(f, fmt)
display(Image(data=f.getvalue()))
# a couple of utility functions for converting to and from Caffe's input image layout
def preprocess(net, img):
return np.float32(np.rollaxis(img, 2)[::-1]) - net.transformer.mean['data']
def deprocess(net, img):
return np.dstack((img + net.transformer.mean['data'])[::-1])
def objective_L2(dst):
dst.diff[:] = dst.data
def make_step(net, step_size=1.5, end='inception_4c/output',
jitter=32, clip=True, objective=objective_L2):
'''Basic gradient ascent step.'''
src = net.blobs['data'] # input image is stored in Net's 'data' blob
dst = net.blobs[end]
ox, oy = np.random.randint(-jitter, jitter+1, 2)
src.data[0] = np.roll(np.roll(src.data[0], ox, -1), oy, -2) # apply jitter shift
net.forward(end=end)
objective(dst) # specify the optimization objective
net.backward(start=end)
g = src.diff[0]
# apply normalized ascent step to the input image
src.data[:] += step_size/np.abs(g).mean() * g
src.data[0] = np.roll(np.roll(src.data[0], -ox, -1), -oy, -2) # unshift image
if clip:
bias = net.transformer.mean['data']
src.data[:] = np.clip(src.data, -bias, 255-bias)
def deepdream(net, base_img, iter_n=20, octave_n=4, octave_scale=1.4,
end='inception_4c/output', clip=True, **step_params):
# prepare base images for all octaves
octaves = [preprocess(net, base_img)]
for i in xrange(octave_n-1):
octaves.append(nd.zoom(octaves[-1], (1, 1.0/octave_scale,1.0/octave_scale), order=1))
src = net.blobs['data']
detail = np.zeros_like(octaves[-1]) # allocate image for network-produced details
for octave, octave_base in enumerate(octaves[::-1]):
h, w = octave_base.shape[-2:]
if octave > 0:
# upscale details from the previous octave
h1, w1 = detail.shape[-2:]
detail = nd.zoom(detail, (1, 1.0*h/h1,1.0*w/w1), order=1)
src.reshape(1,3,h,w) # resize the network's input image size
src.data[0] = octave_base+detail
for i in xrange(iter_n):
make_step(net, end=end, clip=clip, **step_params)
# visualization
vis = deprocess(net, src.data[0])
if not clip: # adjust image contrast if clipping is disabled
vis = vis*(255.0/np.percentile(vis, 99.98))
showarray(vis)
print octave, i, end, vis.shape
clear_output(wait=True)
# extract details produced on the current octave
detail = src.data[0]-octave_base
# returning the resulting image
return deprocess(net, src.data[0])
I run the code above with:
end = 'inception_4c/output'
img = np.float32(PIL.Image.open('clouds.jpg'))
_=deepdream(net, img)
Approach (b)
"""
Use one single image to guide
the optimization process.
This affects the style of generated images
without using a different training set.
"""
def dream_control_by_image(optimization_objective, end):
# this image will shape input img
guide = np.float32(PIL.Image.open(optimization_objective))
showarray(guide)
h, w = guide.shape[:2]
src, dst = net.blobs['data'], net.blobs[end]
src.reshape(1,3,h,w)
src.data[0] = preprocess(net, guide)
net.forward(end=end)
guide_features = dst.data[0].copy()
def objective_guide(dst):
x = dst.data[0].copy()
y = guide_features
ch = x.shape[0]
x = x.reshape(ch,-1)
y = y.reshape(ch,-1)
A = x.T.dot(y) # compute the matrix of dot-products with guide features
dst.diff[0].reshape(ch,-1)[:] = y[:,A.argmax(1)] # select ones that match best
_=deepdream(net, img, end=end, objective=objective_guide)
and I run the code above with:
end = 'inception_4c/output'
# image to be modified
img = np.float32(PIL.Image.open('img/clouds.jpg'))
guide_image = 'img/guide.jpg'
dream_control_by_image(guide_image, end)
Question
Now the failed approach how I tried to access individual classes, hot encoding the matrix of classes and focusing on one (so far to no avail):
def objective_class(dst, class=50):
# according to imagenet classes
#50: 'American alligator, Alligator mississipiensis',
one_hot = np.zeros_like(dst.data)
one_hot.flat[class] = 1.
dst.diff[:] = one_hot.flat[class]
To make this clear: the question is not about the dream code, which is the interesting background and which is already working code, but it is about this last paragraph's question only: Could someone please guide me on how to get images of a chosen class (take class #50: 'American alligator, Alligator mississipiensis') from ImageNet (so that I can use them as input - together with the cloud image - to create a dream image)?
The question is how to get images of the chosen class #50: 'American alligator, Alligator mississipiensis' from ImageNet.
Go to image-net.org.
Go to "Download".
Follow the instructions for "Download Image URLs":
How to download the URLs of a synset from your Brower?
1. Type a query in the Search box and click "Search" button
The alligator is not shown. ImageNet is under maintenance. Only ILSVRC synsets are included in the search results. No problem, we are fine with the similar animal "alligator lizard", since this search is about getting to the right branch of the WordNet treemap. I do not know whether you will get the direct ImageNet images here even if there were no maintenance.
2. Open a synset papge
Scrolling down:
Scrolling down:
Searching for the American alligator, which happens to be a saurian diapsid reptile as well, as a near neighbour:
3. You will find the "Download URLs" button under the left-bottom corner of the image browsing window.
You will get all of the URLs with the chosen class. A text file pops up in the browser:
http://image-net.org/api/text/imagenet.synset.geturls?wnid=n01698640
We see here that it is just about knowing the right WordNet id that needs to be put at the end of the URL.
Manual image download
The text file looks as follows:
http://farm1.static.flickr.com/136/326907154_d975d0c944.jpg
http://weeksbay.org/photo_gallery/reptiles/American20Alligator.jpg
...
till image number 1261.
As an example, the first URL links to:
And the second is a dead link:
The third link is dead, but the fourth is working.
The images of these URLs are publicly available, but many links are dead, and the pictures are of lower resolution.
Automated image download
From the ImageNet guide again:
How to download by HTTP protocol? To download a synset by HTTP
request, you need to obtain the "WordNet ID" (wnid) of a synset first.
When you use the explorer to browse a synset, you can find the WordNet
ID below the image window.(Click Here and search "Synset WordNet ID"
to find out the wnid of "Dog, domestic dog, Canis familiaris" synset).
To learn more about the "WordNet ID", please refer to
Mapping between ImageNet and WordNet
Given the wnid of a synset, the URLs of its images can be obtained at
http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=[wnid]
You can also get the hyponym synsets given wnid, please refer to API
documentation to learn more.
So what is in that API documentation?
There is everything needed to get all of the WordNet IDs (so called "synset IDs") and their words for all synsets, that is, it has any class name and its WordNet ID at hand, for free.
Obtain the words of a synset
Given the wnid of a synset, the words of
the synset can be obtained at
http://www.image-net.org/api/text/wordnet.synset.getwords?wnid=[wnid]
You can also Click Here to
download the mapping between WordNet ID and words for all synsets,
Click Here to download the
mapping between WordNet ID and glosses for all synsets.
If you know the WordNet ids of choice and their class names, you can use the nltk.corpus.wordnet of "nltk" (natural language toolkit), see the WordNet interface.
In our case, we just need the images of class #50: 'American alligator, Alligator mississipiensis', we already know what we need, thus we can leave the nltk.corpus.wordnet aside (see tutorials or Stack Exchange questions for more). We can automate the download of all alligator images by looping through the URLs that are still alive. We could also widen this to the full WordNet with a loop over all WordNet IDs, of course, though this would take far too much time for the whole treemap - and is also not recommended since the images will stop being there if 1000s of people download them daily.
I am afraid I will not take the time to write this Python code that accepts the ImageNet class number "#50" as the argument, though that should be possible as well, using mapping tables from WordNet to ImageNet. Class name and WordNet ID should be enough.
For a single WordNet ID, the code could be as follows:
import urllib.request
import csv
wnid = "n01698640"
url = "http://image-net.org/api/text/imagenet.synset.geturls?wnid=" + str(wnid)
# From https://stackoverflow.com/a/45358832/6064933
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with open(wnid + ".csv", "wb") as f:
with urllib.request.urlopen(req) as r:
f.write(r.read())
with open(wnid + ".csv", "r") as f:
counter = 1
for line in f.readlines():
print(line.strip("\n"))
failed = []
try:
with urllib.request.urlopen(line) as r2:
with open(f'''{wnid}_{counter:05}.jpg''', "wb") as f2:
f2.write(r2.read())
except:
failed.append(f'''{counter:05}, {line}'''.strip("\n"))
counter += 1
if counter == 10:
break
with open(wnid + "_failed.csv", "w", newline="") as f3:
writer = csv.writer(f3)
writer.writerow(failed)
Result:
If you need the images even behind the dead links and in original quality, and if your project is non-commercial, you can sign in, see "How do I get a copy of the images?" at the Download FAQ.
In the URL above, you see the wnid=n01698640 at the end of the URL which is the WordNet id that is mapped to ImageNet.
Or in the "Images of the Synset" tab, just click on "Wordnet IDs".
To get to:
or right-click -- save as:
You can use the WordNet id to get the original images.
If you are commercial, I would say contact the ImageNet team.
Add-on
Taking up the idea of a comment: If you do not want many images, but just the "one single class image" that represents the class as much as possible, have a look at Visualizing GoogLeNet Classes and try to use this method with the images of ImageNet instead. Which is using the deepdream code as well.
Visualizing GoogLeNet Classes
July 2015
Ever wondered what a deep neural network thinks a Dalmatian should
look like? Well, wonder no more.
Recently Google published a post describing how they managed to use
deep neural networks to generate class visualizations and modify
images through the so called “inceptionism” method. They later
published the code to modify images via the inceptionism method
yourself, however, they didn’t publish code to generate the class
visualizations they show in the same post.
While I never figured out exactly how Google generated their class
visualizations, after butchering the deepdream code and this ipython
notebook from Kyle McDonald, I managed to coach GoogLeNet into drawing
these:
... [with many other example images to follow]