How to handle image filename duplication in scrapy image download - scrapy

Scrapy uses a SHA1 hash to generate image filenames. When a duplicate filename occurs, it overwrites the file, causing the loss of an existing image.
Is it possible to write extra code (e.g. an overriding class) to handle duplication, for instance by generating a new filename until no duplicate is found?
If so, could you provide a code example?
--- old question:
Does Scrapy check that filenames are unique across all image files under the IMAGES_STORE folder?
Scrapy uses SHA1 to generate filenames while downloading images. SHA1 provides a good level of uniqueness, but there is still a small probability of a collision.

Not sure this is the best solution, but what if you make a custom pipeline based on the ImagesPipeline and override its image_key method like this (though I haven't tested it):
import hashlib
import os
import random
import string

from scrapy.contrib.pipeline.images import ImagesPipeline

class CustomImagesPipeline(ImagesPipeline):
    def image_key(self, url):
        image_guid = hashlib.sha1(url).hexdigest()
        path_format = 'full/%s.jpg'
        # If the file already exists, append a random character to the
        # name until the collision goes away.
        # Note: os.path.exists() is checked relative to the working
        # directory here; you probably want to join the path with your
        # IMAGES_STORE setting first.
        while True:
            path = path_format % image_guid
            if os.path.exists(path):
                image_guid = image_guid + random.choice(string.letters)
            else:
                break
        return path
This is just an example - you may want to improve the filename-change logic. Additionally, you should do the same for the thumb_key method, as sketched below.
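For completeness, here is a rough, untested sketch of a matching thumb_key override. It assumes the old thumb_key(self, url, thumb_id) hook and the thumbs/<thumb_id>/<hash>.jpg layout; adjust it to whatever your Scrapy version actually uses:
    def thumb_key(self, url, thumb_id):
        image_guid = hashlib.sha1(url).hexdigest()
        # Same collision handling as in image_key above
        while True:
            path = 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)
            if os.path.exists(path):
                image_guid = image_guid + random.choice(string.letters)
            else:
                break
        return path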
Hope that helps.

You shouldn't worry about it!
Scrapy uses the SHA1 hash of the image URL, and to reach a 50% probability of a SHA1 collision you need about 2^80 items. So unless you are going to crawl on the order of 2^80 images, the chance of a filename collision is far below 50%. In fact, you can crawl well over a trillion images and simply ignore filename duplication, because the probability is insignificant.
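To put a number on it, here is a quick back-of-the-envelope check using the standard birthday-bound approximation (a sketch for illustration, not Scrapy code):
import math

def sha1_collision_probability(n_images):
    """Approximate chance of at least one collision among n_images
    random 160-bit SHA1 digests (birthday-bound approximation)."""
    pairs = n_images * (n_images - 1) / 2
    return -math.expm1(-pairs / 2**160)

print(sha1_collision_probability(10**12))  # roughly 3.4e-25
Even a trillion downloaded images gives a collision probability of around 3.4e-25, which is negligible.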


Is there file size limit for the imported .dat file?

I have a problem with large image stacks of several GB. I can directly open a 9 GB stacked image (dm4 format, 1000x1000x1000), but if I try to rotate it using a volume operation such as "rotate about x", GMS/DM exits automatically. I wrote a simple script that performs the operation with the slice3 function and displays the result correctly, but I cannot save it! If I try to save the resulting stacked image, the software says "sorry" and forces me to close it.
I assume the file exceeds the software's capacity, so I saved the original data in .dat format and wrote Fortran code to rotate it, then saved the result as a .dat file. But when I use the import function of GMS/DM, it only imports the first few hundred frames, not all of them.
How can I deal with this?
There certainly are size restrictions in both total size and maximum length along one dimension, but I don't think 1000 x 1000 x 1000 should be a limiting factor.
I just ran the following two scripts in order and saved the data in my GMS 3.4.3 without a problem.
image big := RealImage("Big First",4,1000,1000,1000)
big = icol*sin(irow/iheight*100*pi())*10000+iplane
big.showimage()

// A refers to the displayed image created by the first script
image bigIn := A
image bigOut := bigIn.Slice3(0,0,0, 1,1000,1,0,1000,1,2,1000,1)
bigOut.ShowImage()
Can you edit your question to include the script code that's failing and any other useful information?

DEAP evolutionary module, always evaluate entire population

I'm trying to solve a non-deterministic problem with DEAP. The problem is that the module only evaluates new chromosomes and reuses the fitness kept in memory for old ones.
How can I set up the module so that at each generation the ENTIRE population is evaluated, not just the new individuals?
Thanks
I don't think the DEAP package can do that.
You can implement the algorithm on your own or look for other packages.
Anyway, check out my library; it contains most of the state-of-the-art meta-heuristic algorithms and also evaluates the entire population in each generation.
https://github.com/thieunguyen5991/mealpy
You can modify two or three lines in your chosen algorithm to force evaluation of all individuals: copy the function from the DEAP source into your local script and edit the check for individuals flagged with an invalid fitness before evaluation. Make sure that in your main script you call the local eaSimple and not algorithms.eaSimple, so the modified code is actually used.
If you are using eaSimple or eaMuPlusLambda, for example, you can find those functions in this file:
https://github.com/DEAP/deap/blob/master/deap/algorithms.py#L85
The 0th-generation case here may not need to change (though you can change it anyway, unless your individuals already come with a fitness and you want to skip their evaluation):
#(line 149 in above URL)
invalid_ind = [ind for ind in population if not ind.fitness.valid]
And then inside the generational process loop:
#(line 171 in url above):
invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
Removing the invalid check will result in all items being passed to evaluation:
invalid_ind = [ind for ind in population] #149
...
invalid_ind = [ind for ind in offspring] #171
But keep the algorithms import! Note that you also need to change varAnd (in the eaSimple case) to algorithms.varAnd, otherwise the copied code will fail with a NameError:
offspring = algorithms.varAnd(offspring, toolbox, cxpb, mutpb)
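Putting it together, a minimal sketch of a local eaSimple-style loop that re-evaluates every individual each generation might look like this (adapted loosely from deap/algorithms.py; the stats/logbook bookkeeping is omitted, so adjust it to your setup):
from deap import algorithms

def ea_simple_reevaluate(population, toolbox, cxpb, mutpb, ngen, halloffame=None):
    # Generation 0: evaluate everyone, no invalid-fitness filter
    fitnesses = toolbox.map(toolbox.evaluate, population)
    for ind, fit in zip(population, fitnesses):
        ind.fitness.values = fit

    for gen in range(1, ngen + 1):
        offspring = toolbox.select(population, len(population))
        # Keep the algorithms. prefix since this code was copied out of the module
        offspring = algorithms.varAnd(offspring, toolbox, cxpb, mutpb)

        # Re-evaluate ALL offspring, not just those with an invalid fitness
        fitnesses = toolbox.map(toolbox.evaluate, offspring)
        for ind, fit in zip(offspring, fitnesses):
            ind.fitness.values = fit

        if halloffame is not None:
            halloffame.update(offspring)
        population[:] = offspring

    return population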

How to create a lazy-evaluated range from a file?

The File I/O API in Phobos is relatively easy to use, but right now I feel like it's not very well integrated with D's range interface.
I could create a range delimiting the full contents by reading the entire file into an array:
import std.file;
auto mydata = cast(ubyte[]) read("filename");
processData(mydata); // takes a range of ubytes
But this eager evaluation of the data might be undesired if I only want to retrieve a file's header, for example. The upTo parameter doesn't solve this issue if the file's format assumes a variable-length header or any other element we wish to retrieve. It could even be in the middle of the file, and read forces me to read all of the file up to that point.
But indeed, there are alternatives. readf, readln, byLine and most particularly byChunk let me retrieve pieces of data until I reach the end of the file, or just when I want to stop reading the file.
import std.stdio;
File file("filename");
auto chunkRange = file.byChunk(1000); // a range of ubyte[]s
processData(chunkRange); // oops! not expecting chunks!
But now I have introduced the complexity of dealing with fixed size chunks of data, rather than a continuous range of bytes.
So how can I create a simple input range of bytes from a file that is lazy evaluated, either by characters or by small chunks (to reduce the number of reads)? Can the range in the second example be seamlessly encapsulated in a way that the data can be processed like in the first example?
You can use std.algorithm.joiner:
auto r = File("test.txt").byChunk(4096).joiner();
Note that byChunk reuses the same buffer for each chunk, so you may need to add .map!(chunk => chunk.idup) to lazily copy the chunks to the heap.

How to find large objects in ZODB

I'm trying to analyze my ZODB because it grew really large (it's also large after packing).
The package zodbbrowser has a feature that displays the size of an object in bytes. It does so by taking the length of the object's pickled state (that's the variable name used in its code), but it also does a bit of magic that I don't fully understand.
How would I go to find the largest objects in my ZODB?
I've written a method which should do exactly this. Feel free to use it, but be aware that it is very memory-consuming. The package zodbbrowser must be installed.
def zodb_objects_by_size(self):
    """
    Recurse over the ZODB tree starting from self.aq_parent. For
    each object, use zodbbrowser's implementation to get the raw
    object state. Put each length into a Counter object and
    return a list of the biggest objects, specified by path and
    size.
    """
    from zodbbrowser.history import ZodbObjectHistory
    from collections import Counter

    def recurse(obj, results):
        # Retrieve the state pickle from the ZODB and take its length
        history = ZodbObjectHistory(obj)
        pstate = history.loadStatePickle()
        length = len(pstate)

        # Record the length under the object's path
        path = '/'.join(obj.getPhysicalPath())
        results[path] = length

        # Recursion
        for child in obj.contentValues():
            # Work around portal tools and other weird objects which
            # seem to contain themselves
            if child.contentValues() == obj.contentValues():
                continue
            # Rolling in the deep
            try:
                recurse(child, results)
            except (RuntimeError, AttributeError) as err:
                import pdb; pdb.set_trace()  # go debug

    results = Counter()
    recurse(self.aq_parent, results)
    return results.most_common()
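As a hypothetical usage example (assuming the method above is defined on a tool or view whose aq_parent is the site root, here called tool), you could print the twenty largest objects like this:
for path, size in tool.zodb_objects_by_size()[:20]:
    print(path, size)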

Can mrjob tasks output sets?

I tried outputting a Python set from a mapper in mrjob and changed the function signatures of my combiners and reducers accordingly.
However, I get this error:
Counters From Step 1
Unencodable output:
TypeError: 172804
When I change the sets to lists, the error disappears. Are there certain Python types that cannot be output by mappers in mrjob?
Values are moved between stages of the MapReduce using Protocols, generally Raw, JSON or Pickle.
You must make sure that the values being moved around can be properly handled by the Protocol you pick. I would imagine that there's no default JSON representation of a set, and perhaps there's no raw representation either?
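For example, the standard json module refuses a set outright (a quick illustration, independent of mrjob):
import json
json.dumps({1, 2, 3})  # raises TypeError: sets are not JSON serializable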
Try setting INTERNAL_PROTOCOL to PickleProtocol, like so:
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class YourMR(MRJob):
    INTERNAL_PROTOCOL = PickleProtocol

    def mapper(self, key, value):
        ...  # your mapper
    def reducer(self, key, values):
        ...  # your reducer
Note: mrjob will handle pickling and unpickling for you, so don't worry about that aspect. You can also set INPUT_PROTOCOL and OUTPUT_PROTOCOL if necessary (for multi-step jobs, or to control the output of the final reducer).
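As a rough end-to-end sketch (a hypothetical job that collects the unique tokens per input; the class and key names are made up for illustration):
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class TokenSetJob(MRJob):
    # Sets only survive the mapper -> reducer hop with a pickle-based internal protocol
    INTERNAL_PROTOCOL = PickleProtocol

    def mapper(self, _, line):
        # Emit the set of unique tokens found on this line
        yield "tokens", set(line.split())

    def reducer(self, key, values):
        # Merge the per-line sets; emit a list so the default JSON output protocol can encode it
        merged = set()
        for value in values:
            merged |= value
        yield key, sorted(merged)

if __name__ == "__main__":
    TokenSetJob.run()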