Scrapy - change settings of spider during run - scrapy

Is it possible to change the settings of a spider during the spider run? I tried to change it but received an error, e.g.:
In [4]: settings.set('SPIDER_MODULES', ['a'])
TypeError: Trying to modify an immutable Settings object
In [5]: settings.update('SPIDER_MODULES', ['a'])
TypeError: Trying to modify an immutable Settings object
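The error means the Settings object has already been frozen; Scrapy freezes settings once the crawler starts. One common pattern, sketched below on the assumption that the overrides are known before the crawl begins, is to declare per-spider overrides with the documented custom_settings class attribute and only read settings at runtime:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    # Applied with spider-level priority before the Settings object is frozen.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 4,
    }

    def start_requests(self):
        # Read-only access to the frozen settings is still fine at runtime.
        delay = self.settings.getfloat('DOWNLOAD_DELAY')
        self.logger.info('Running with DOWNLOAD_DELAY=%s', delay)
        yield scrapy.Request('https://example.com', callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}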

Related

How can I save a telegram audio file directly to S3 from Telegram?

I am trying to save a user-sent Telegram voice message directly to S3. This happens inside AWS Lambda so saving to disk and using s3.upload_file(filename,...) will not work. This fails:
def audio_handler(update, context):
    message = update.effective_message
    file = message.voice.get_file()
    s3 = boto3.client('s3')
    s3.upload_file(file, Bucket='mybucket', Key='onelove.ogg')
ValueError: Filename must be a string
If I attempt to use
s3.upload_fileobj(BytesIO(file).getbuffer(), Bucket='mybucket', Key='onelove.ogg')
TypeError: a bytes-like object is required, not 'File'
Voice.get_file returns an object of type File. To download the voice to memory, you can e.g. pass an empty BytesIO object to the out argument of File.download. Please also have a look at the wiki section on working with files and media.
Disclaimer: I'm currently the maintainer of python-telegram-bot.
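Putting the pieces together, a rough sketch of that approach (assuming the v13-style File.download(out=...) API the answer refers to; the bucket and key names are placeholders) could look like this:

from io import BytesIO
import boto3

def audio_handler(update, context):
    message = update.effective_message
    tg_file = message.voice.get_file()

    # Download the voice message into memory instead of onto disk.
    buf = BytesIO()
    tg_file.download(out=buf)
    buf.seek(0)  # rewind so boto3 reads from the start

    # upload_fileobj accepts any file-like object.
    s3 = boto3.client('s3')
    s3.upload_fileobj(buf, Bucket='mybucket', Key='onelove.ogg')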

How to use resource id in Appium

I am testing a mobile app which doesn't have the right locators set, so I can only use "resource-id".
from appium.webdriver.common.appiumby import AppiumBy
# Locators
profile_btn = (AppiumBy.ID, 'io.dzain.dzain.uat:id/navItemIV')
profile_btn.click()
When I run this code, the following error message is displayed:
AttributeError: 'tuple' object has no attribute 'click'
How can I use the resource-id to handle this problem?
You call clicks on the appium/webdriver elements returned from the driver, something like so:
profile_btn = self.driver.find_element(AppiumBy.ID, 'io.dzain.dzain.uat:id/navItemIV')
profile_btn.click()
You pass your locator strategy and locator value to the find_element function and call click() on the element object it returns.
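For completeness, a small sketch of the same pattern with an explicit wait; the driver setup, timeout, and helper function are assumptions and not part of the original answer:

from appium.webdriver.common.appiumby import AppiumBy
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PROFILE_BTN = (AppiumBy.ID, 'io.dzain.dzain.uat:id/navItemIV')

def open_profile(driver):
    # Wait until the element is present, then click the element object
    # returned by the driver (not the locator tuple itself).
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(PROFILE_BTN)
    )
    element.click()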

Missing results in export when running scrapy spider with multiple start_urls

I am running a Scrapy spider to export some football data, using the scrapy-splash plugin.
For development I am running the spider against cached results, so as not to hit the website too much. The strange thing is that I am consistently missing some items in the export when running the spider with multiple start_urls. The number of missing items differs slightly each time.
However, when I comment out all but one of the start_urls and run the spider for each one separately, I get all the results. I am not sure if this is a bug in Scrapy or if I am missing something about the framework, as this is my first project with it.
Here is my caching configuration:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [403]
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
These are my start_urls:
start_urls = [
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2019',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2020',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2017',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2018',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2019',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2020'
]
I have a standard setup with an export pipeline; from each parse method the spider yields splash requests for every relevant URL on the page. Each parse method either fills the same item passed in via cb_kwargs or creates a new one with data from the passed item.
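For reference, the structure described above looks roughly like this; the spider name, selectors, and item fields are hypothetical and not taken from the actual project:

import scrapy
from scrapy_splash import SplashRequest

class ClubSpider(scrapy.Spider):
    name = 'clubs'
    start_urls = [
        'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017',
        # ... remaining season overview pages ...
    ]

    def parse(self, response):
        # One item per season page, shared by every club request below.
        season_item = {'season_url': response.url}
        for href in response.css('a.club::attr(href)').getall():
            yield SplashRequest(
                response.urljoin(href),
                callback=self.parse_club,
                cb_kwargs={'item': season_item},
            )

    def parse_club(self, response, item):
        # The same dict is shared by every request above, so copy it before
        # filling it; mutating a shared item across callbacks can silently
        # drop or mix up exported rows.
        item = dict(item)
        item['club_url'] = response.url
        yield item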
Please let me know if further code from my project, like the spider, pipelines or item loaders might be relevant to the issue and I will edit my question here.

Pytest fixture finalizer TypeError 'NoneType' object is not callable

I have a simple pytest fixture to ensure that the test data file is present (and deleted at the end of the test), but it gives me the error described in the title.
@pytest.fixture
def ensure_test_data_file(request):
    data_file = server.DATA_FILE_NAME
    with open(data_file, 'w') as text_file:
        text_file.write(json.dumps(TEST_DATA))
        text_file.close()
    print(os.path.abspath(data_file))
    request.addfinalizer(os.remove(data_file))
If I remove the finalizer, it works (except that the file is not deleted). Am I doing something wrong?
You need to pass a function object to request.addfinalizer - what you're doing is actually calling os.remove(data_file), which returns None, and thus you're doing request.addfinalizer(None).
Here you'd use request.addfinalizer(lambda: os.remove(data_file)) or request.addfinalizer(functools.partial(os.remove, data_file)) to get a callable with the argument already "applied", but which isn't actually called.
However, I'd recommend using yield in the fixture instead (docs), which makes this much cleaner by letting you "pause" your fixture and run the test in between:
@pytest.fixture
def ensure_test_data_file(request):
    data_file = server.DATA_FILE_NAME
    with open(data_file, 'w') as text_file:
        text_file.write(json.dumps(TEST_DATA))
        text_file.close()
    print(os.path.abspath(data_file))
    yield
    os.remove(data_file)
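A test that needs the file then simply names the fixture as a parameter; the assertion below is only an illustrative example:

def test_data_file_exists(ensure_test_data_file):
    assert os.path.exists(server.DATA_FILE_NAME)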

Nonetype Error on Python Google Search Script - Is this a spam prevention tactic?

Fairly new to Python so apologies if this is a simple ask. I have browsed other answered questions but can't seem to get it functioning consistently.
I found the script below, which prints the top result from Google for a set of defined terms. It works the first few times I run it, but displays the following error once I have searched 20 or so terms:
Traceback (most recent call last):
File "term2url.py", line 28, in <module>
results = json['responseData']['results']
TypeError: 'NoneType' object has no attribute '__getitem__'
From what I can gather, this indicates that one of the attributes does not have a defined value (potentially a result of Google blocking me?). I attempted to solve the issue by adding the else clause, though I still run into the same problem.
Any help would be greatly appreciated; I have pasted the full code below.
Thanks!
#
# This is a quick and dirty script to pull the most likely url and description
# for a list of terms. Here's how you use it:
#
# python term2url.py < {a txt file with a list of terms} > {a tab delimited file of results}
#
# You must install the simplejson module to use it
#
import urllib
import urllib2
import simplejson
import sys

# Read the terms we want to convert into URL from info redirected from the command line
terms = sys.stdin.readlines()

for term in terms:
    # Define the query to pass to Google Search API
    query = urllib.urlencode({'q' : term.rstrip("\n")})
    url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s" % (query)

    # Fetch the results and convert to JSON format
    search_results = urllib2.urlopen(url)
    json = simplejson.loads(search_results.read())

    # Process the results by pulling the first record, which has the best match
    results = json['responseData']['results']
    for r in results[:1]:
        if results is not None:
            url = r['url']
            desc = r['content'].encode('ascii', 'replace')
        else:
            url = "none"
            desc = "none"

    # Print the results to stdout. Use redirect to capture the output
    print "%s\t%s" % (term.rstrip("\n"), url)

    import time
    time.sleep(1)
Here are some Python details for you first:
None is a valid object in Python, of the type NoneType:
print(type(None))
Produces:
<class 'NoneType'>
And the no attribute error you got is normal when you try to access some method or attribute of an object that doesn't have that attribute. In this case, you were attempting to use the __getitem__ syntax (object[item_index]), which NoneType objects don't support because it doesn't have the __getitem__ method.
The point of the previous explanation is that your assumption about what your error means is correct: your results object is essentially empty.
As for why you're hitting this in the first place, I believe you are running up against Google's API limits. It looks like you're using the old API that is now deprecated. The number of search results (not queries) used to be limited to around 64 per query, and there used to be no rate or per-day limit. However, since it's been deprecated for over 5 years now, there may be new undocumented limits.
I don't think it necessarily has anything to do with spam prevention, but I do believe it is an undocumented limit.
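If you want the script to degrade gracefully instead of crashing, a defensive sketch (not part of the original script; the function name is made up) is to check whether responseData is present before indexing into it:

def first_result(payload):
    # Return (url, content) from a Google AJAX API payload, or (None, None)
    # when the response carries no usable data.
    response_data = payload.get('responseData')
    if response_data is None:
        # Typical when the deprecated API throttles or blocks the client.
        return None, None
    results = response_data.get('results') or []
    if not results:
        return None, None
    return results[0].get('url'), results[0].get('content')

# Example: a throttled response where responseData is null.
print(first_result({'responseData': None}))  # -> (None, None)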