Scrapy Downloading Files Error - error-handling

I'm using the files pipeline in Scrapy to download subtitle files off of http://opensubtitles.org.
I've got a list of all the http://dl.opensubtitles.org links, and my spider follows these links and sends the urls to the pipeline.
It works to start, and I can download the first ~100 files without any issue.
However, around then the links seem to create the error:
2016-06-09 11:44:02 [scrapy] WARNING: File (code: 301): Error downloading file from http://dl.opensubtitles.org/en/download/vrf-108d030f/sub/24617> referred in
Does it have something to do with my code?
These are in my settings:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'C:/Users/Rohan/Documents/Fitroom/subtitles/subFiles'
This is my pipeline:
class SubtitlesPipeline(object):
def process_item(self, item, spider):
return item
Thanks!

This error maybe occurred due to download time out, because file maybe bigger in size. Increase download time out.
Try this in setting.py file
DOWNLOAD_TIMEOUT = 500

Related

Issue in taking screenshot in selenium while using chromedriver 73 for chrome version 73

When i tried to take screenshot of a webpage using selenium in python, i get error message selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 10.000.
Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
indi_url = 'http://www.google.com'
options = Options()
options.add_argument("disable-infobars")
options.add_argument("--start-maximized")
options.add_argument("--disable-popup-blocking")
options.add_argument("disable-popup-blocking")
options.add_argument("--disable")
driver = webdriver.Chrome(options=options)
driver.get(indi_url)
driver.implicitly_wait(30)
driver.save_screenshot("new.png")
Error message:
I'm using Chrome version 73, chromedriver version 73.
Note: code was working fine (ie.screenshot)in lower version of chrome and chrome driver.
Help me out in fixing this issue for new version of chrome driver.
Thanks in advance
As the error shows, your filename for screenshot does not match the template extensions .png
Here is an example how to make a screenshot.
Java:
File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(scrFile, new File(".\\Screenshots\\example_screenshot.png"));
Python:
driver.save_screenshot("screenshot.png")
This error message...
UserWarning: name used for saved screenshot does not match file type. It should end with a .png extension
"type. It should end with a .png extension", UserWarning)
...implies that the Selenium-Python client encountered an issue while invoking get_screenshot_as_file() method.
get_screenshot_as_file()
get_screenshot_as_file() saves a screenshot of the current window to a PNG image file. Returns False if there is any IOError, else returns True. Use full paths in your filename.
Args:
filename: The full path you wish to save your screenshot to. This should end with a .png extension.
Usage:
driver.get_screenshot_as_file('/Screenshots/foo.png')
Defination:
if not filename.lower().endswith('.png'):
warnings.warn("name used for saved screenshot does not match file "
"type. It should end with a `.png` extension", UserWarning)
png = self.get_screenshot_as_png()
try:
with open(filename, 'wb') as f:
f.write(png)
except IOError:
return False
finally:
del png
return True
Analysis
As per the snapshot of the error stack trace:
You have used the command as:
driver.get_screenshot_as_file('new.jpeg')
The issues were:
The filename didn't end with .png
The desired full path of your filename wasn't provided.
Even if you desire to use save_screenshot() this method in-turn invokes get_screenshot_as_file(filename)
Solution
Create a directory within your project as Screenshots and provide the absolute path of the filename you desire for the screenshot while invoking either of the methods as follows:
driver.get_screenshot_as_file("./Screenshots/YakeshrajM.png")
driver.save_screenshot("./Screenshots/YakeshrajM.png")
Update
Currently GAed Chrome v73 have some issues and you may like to downgrade to Chrome v72. You can find a couple of relevant discussions in:
Getting Timed out receiving message from renderer: 600.000 When we execute selenium scripts using Jenkins windows service mode
Timed out receiving message from renderer: 10.000 while capturing screenshot using chromedriver and chrome through Jenkins on Windows

Failed- Path too long error while downloading file in Chrome using selenium

I want to download file in my current working directory using selenium automation. But I am getting 'Path too long' error. The code I have written so far is:
os.chdir(os.path.dirname(__file__))
current_directory = os.getcwd()
windows_cwd = current_directory.replace('\\','\\\\')+'\\\\'
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory': windows_cwd,
'download.directory_upgrade': True,
'safebrowsing.enabled': False,
'safebrowsing.disable_download_protection': True
}
chrome_options.add_experimental_option('prefs',prefs)
browser = webdriver.Chrome(options=chrome_options)
My current working directory is:
C:\Users\US177\PycharmProjects\Plugin
where the path is too long.
But it successfully downloads to
C:\Users\US177\Desktop
failed-long path
I'm not exactly sure what your question is based on the information provided, but I'm guessing it's along the lines of "Why is this happening?", so I will address that question.
The maximum length of a file name in Windows is 260 characters. The file is able to download to your desktop because the name of the file (when appended to your path) does not exceed this limit. When trying to download to PycharmProjects\Plugin\ folder, the path has become too long.
While setting your download path, try using double backslash (ie. path\\to\\directory).
See this Github issue about programatically downloading from chrome

Scrapy - Issue with setting FILES_STORE?

So I have a custom pipeline that extends Scrapy's current FilesPipeline. However, I'm having trouble with setting the FILES_STORE variable. My current file structure is:
my_scraper.py
files/
#this is where I want the files to download to
so, I set FILES_STORE=/files/ and run the spider. But when I do that I get the following error:
PermissionError: [Errno 13] Permission denied: '/files/'
Why does this happen? Is there anything that I am doing wrong?
If it's useful to anyone else, it was simple error - FILES_STORE requires the full path, not just the relative path from the folder.

Can download but file will not unzip as expected

I'm attempting to access the Geometadb database which first involves download of the SQL library. I did that and then I got the Geometadb library.
library(GEOmetadb)
Next I need the Geometadb file which is where things start to go wrong. I issue this command as seen exactly in the tutorial: https://bioconductor.riken.jp/packages/3.0/bioc/vignettes/GEOmetadb/inst/doc/GEOmetadb.html
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
It should proceed to not only download a .gz zip file but also unzip the file. It downloads it but never unzips it. Instead I get the following error.
trying URL 'http://dl.dropbox.com/u/51653511/GEOmetadb.sqlite.gz'
Error in download.file(url_geo, destfile = localfile, mode = "wb") :
cannot open URL 'http://dl.dropbox.com/u/51653511/GEOmetadb.sqlite.gz'
In addition: Warning message:
In url(url_geo_2, open = "rb") :
cannot open: HTTP status was '403 Forbidden'
Just not sure what's going on here. Considering these are just the early tutorial steps I'm probably missing something really obvious but I'm hoping someone can help me out. Thanks!

Program using selenium fails after building with cx_freeze

I'm developing an automatic web tester using Selenium (v2.37.2). Program works properly until I run the test built with cxfreeze (there is also tkinter gui).
there is the init function
def initDriver(self):
if self.browser == FIREFOX:
profile = webdriver.FirefoxProfile(profile_directory=self.profile);
self._driver = webdriver.Firefox(firefox_profile=profile)
elif self.browser == CHROME:
self._driver = webdriver.Chrome(self.executable, chrome_options=profile)
elif self.browser == IEXPLORER:
self._driver = webdriver.Ie(self.executable)
Now when I build it using Cx_freeze I get this error
method redirectToBlank(...) calls initDriver(..) as the first thingSo how I pack the .xpi file to the library.zip file - which option in setup.py I have to use? And do I even have to this?
And the second strange thing is, that the other browsers work fine, when I execute the .exe file in by clicking on its icon, but when I run it from command line, I get errors even for chrome and IE. (Sorry that the traceback isn't complete)
All paths are relative from the executed file (no matter from where you run it),
Thank you for any ideas to solve this problem.
(method redirectToBlank(...) calls initDriver(..) as the first thing)
First issue solved
It's problem with selenium - FirefoxProfile - class, which tries to load webdriver.xpi as a normal file, but selenium pack all libraries to a zip file, so selenium can't find it.
Even forcing cx_freeze in setup file to add webdriver.xpi to a proper directory in zip won't help.
It is necessary to edit FirefoxProfile (in firefox_profile module) class for example like this
def _install_extension(self, addon, unpack=True):
"""
Installs addon from a filepath, url
or directory of addons in the profile.
- path: url, path to .xpi, or directory of addons
- unpack: whether to unpack unless specified otherwise in the install.rdf
"""
if addon == WEBDRIVER_EXT:
# altered lines
import sdi.env
WEBDRIVER_SUBSTITUTE = "path/to/unpacked/webdrive.xpi"
addon = os.path.join(os.path.dirname(__file__), WEBDRIVER_SUBSTITUTE)
# Original lines:
# addon = os.path.join(os.path.dirname(__file__), WEBDRIVER_EXT)
< the rest of the method >
Issue 2
OSError: win error 6: the handle is invalid problem wasn't caused by either cxfreeze or selenium. I run the final exe file from git bash. There's the problem. For some reason git bash doesn't open stdin for the program and that's why it fails. When I run it in standard windows command line, everything is ok or if i run it from git bash like program.exe < empty_file
what i did was remove selenium form packages list.
and put it inside includefiles, then it works.
like this :
includefiles = [(seleniumPackage,'')]
...
options = {'build_exe': {'includes':includes,
'excludes':excludes,
'optimize':2,
'packages':packages,
'include_files':includefiles,
...