Enabling HttpProxyMiddleware in scrapyd - scrapy

After reading the Scrapy documentation, I thought that HttpProxyMiddleware was enabled by default. But when I start a spider via scrapyd's webservice interface, HttpProxyMiddleware is not enabled. I receive the following output:
2013-02-18 23:51:01+1300 [scrapy] INFO: Scrapy 0.17.0-120-gf293d08 started (bot: pde)
2013-02-18 23:51:02+1300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, CloseSpider, WebService, CoreStats, SpiderState
2013-02-18 23:51:02+1300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-18 23:51:02+1300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-18 23:51:02+1300 [scrapy] DEBUG: Enabled item pipelines: PdePipeline
2013-02-18 23:51:02+1300 [shotgunsupplements] INFO: Spider opened
Note that HttpProxyMiddleware is not enabled. How can I enable it for scrapyd? Any help will be greatly appreciated.
My scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/topics/scrapyd.html
[settings]
default = pd.settings
[deploy]
url = http://localhost:6800/
project = pd
I have the following settings.py
BOT_NAME = 'pd' #this gets replaced with a function
BOT_VERSION = '1.0'
SPIDER_MODULES = ['pd.spiders']
NEWSPIDER_MODULE = 'pd.spiders'
DEFAULT_ITEM_CLASS = 'pd.items.Product'
ITEM_PIPELINES = 'pd.pipelines.PdPipeline'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
TELNETCONSOLE_HOST = '127.0.0.1' # defaults to 0.0.0.0 set so
TELNETCONSOLE_PORT = '6073' # only we can see it.
TELNETCONSOLE_ENABLED = False
WEBSERVICE_ENABLED = True
LOG_ENABLED = True
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = [
    'pd.pipelines.PdPipeline',
]
DATA_DIR = '/home/pd/scraped_data' # directory to store export files to.
DOWNLOAD_DELAY = 2.0
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
}
Regards,
Pranshu

After spending forever trying to debug, it turns out that HttpProxyMiddleware actually expects the http_proxy environment variable to be set. The middleware will not be loaded if http_proxy is not set. Therefore, I set http_proxy and Bob's your uncle, everything works!
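For completeness, here is a minimal sketch of the two usual ways to supply the proxy. The proxy URL and the example spider are placeholders, not part of the original setup, and on older Scrapy versions the middleware is only instantiated when a proxy environment variable is present when the crawler starts:

import os
import scrapy

# Option 1: make sure http_proxy is set in the environment that launches scrapyd,
# e.g. `http_proxy=http://127.0.0.1:8123 scrapyd`, or set it early in Python
# before the downloader middlewares are initialised (placeholder proxy URL):
os.environ.setdefault("http_proxy", "http://127.0.0.1:8123")

# Option 2: once HttpProxyMiddleware is enabled, a proxy can also be chosen
# per request via request.meta["proxy"].
class ProxiedSpider(scrapy.Spider):
    name = "proxied_example"  # hypothetical spider, for illustration only

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/",
            meta={"proxy": "http://127.0.0.1:8123"},  # placeholder proxy URL
        )

    def parse(self, response):
        self.logger.info("Fetched %s via proxy", response.url)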

Related

Connect Scrapy crawler with S3

My crawler downloads a Request.body from a URL, which I save to a file locally. Now I would like to connect to my AWS S3.
I read the documentation but face two issues:
1. The config as well as the credentials file are apparently not of a dict type? My files are unmodified aws-credentials and aws-config files:
The s3 config key is not a dictionary type, ignoring its value of: None
2. The Response is a 'bytes' one and cannot be processed by the feed exporter as such. I tried Response.text and then got the same error raised, but with 'str'.
Any help is highly appreciated. Thank you.
Additional information:
config file (path ~/.aws/config):
[default]
Region=eu-west-2
output=csv
and
credentials file (path ~/.aws/credentials):
[default]
aws_access_key_id=aws_access_key_id=foo
aws_secret_access_key=aws_secret_access_key=bar
the link to Scrapy documentation:
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=s3
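For reference, the settings page linked above also documents AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; a minimal sketch (with placeholder values) of supplying the credentials directly in settings.py instead of relying on the ~/.aws files would be:

# settings.py (sketch, placeholder values)
AWS_ACCESS_KEY_ID = 'foo'
AWS_SECRET_ACCESS_KEY = 'bar'

FEED_URI = 's3://flightlists/lists_v1/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'
FEED_STORE_EMPTY = True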
MacBook-Pro:aircraftPositions frederic$ scrapy crawl aircraftData_2018
2019-03-13 15:35:04 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: aircraftPositions)
2019-03-13 15:35:04 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-13 15:35:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'aircraftPositions', 'CONCURRENT_REQUESTS': 32, 'CONCURRENT_REQUESTS_PER_DOMAIN': 32, 'DOWNLOAD_DELAY': 10, 'FEED_FORMAT': 'json', 'FEED_STORE_EMPTY': True, 'FEED_URI': 's3://flightlists/lists_v1/%(name)s/%(time)s.json', 'NEWSPIDER_MODULE': 'aircraftPositions.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['aircraftPositions.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2019-03-13 15:35:04 [scrapy.extensions.telnet] INFO: Telnet Password: 2f2c11f3300481ed
2019-03-13 15:35:04 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/misc.py:144: ScrapyDeprecationWarning: Initialising `scrapy.extensions.feedexport.S3FeedStorage` without AWS keys is deprecated. Please supply credentials or use the `from_crawler()` constructor.
return objcls(*args, **kwargs)
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-call.apigateway to before-call.api-gateway
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event choose-service-name: calling handler <function handle_service_name_alias at 0x103ca4b70>
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/s3/2006-03-01/service-2.json
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x103c688c8>
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x103c686a8>
2019-03-13 15:35:04 [botocore.args] DEBUG: The s3 config key is not a dictionary type, ignoring its value of: None
2019-03-13 15:35:04 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/_retry.json
2019-03-13 15:35:04 [botocore.client] DEBUG: Registering retry handlers for service: s3
2019-03-13 15:35:04 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
2019-03-13 15:35:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-13 15:35:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
... after that it enables the remaining downloader middlewares and the spider middlewares.
This is the spider:
import scrapy

from aircraftPositions.items import AircraftpositionsItem  # assuming the project's items module


class QuotesSpider(scrapy.Spider):
    name = "aircraftData_2018"

    def url_values(self):
        # one timestamp every 60 seconds over the window of interest
        time = list(range(1538140980, 1538140780, -60))
        return time

    def start_requests(self):
        allowed_domains = ["https://domaine.net"]
        list_urls = []
        for n in self.url_values():
            list_urls.append("https://domaine.net/.../.../all/{}".format(n))
        for url in list_urls:
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # instantiate the item
        i = AircraftpositionsItem()
        i['url'] = response.url
        # response.body is bytes, which the JSON feed exporter cannot serialise as-is
        i['body'] = response.body
        yield i
This is the pipeline.py
class AircraftpositionsPipeline(object):

    def process_item(self, item, spider):
        return item

    def return_body(self, response):
        # write the raw response body to a per-page CSV file
        page = response.url.split("/")[-1]
        filename = 'aircraftList-{}.csv'.format(page)
        with open(filename, 'wb') as f:
            f.write(response.body)

Katalon Studio: Unable to set text (Root cause: Element is not currently interactable and may not be manipulated)

Hi guys, I am using Katalon Studio, and while passing the email text via setText to the email input field I am getting this issue:
Test Cases/Valid_Login FAILED because (of) Unable to set text 'usman#myfake.tk' of object 'Object Repository/Page_Webtalk Communicate Better/input_email' (Root cause: org.openqa.selenium.InvalidElementStateException: Element is not currently interactable and may not be manipulated
Build info: version: '3.7.1', revision: '8a0099a', time: '2017-11-06T21:07:36.161Z'
System info: host: 'USMAN', ip: '192.168.11.206', os.name: 'Windows 8.1', os.arch: 'amd64', os.version: '6.3', java.version: '1.8.0_102'
Driver info: com.kms.katalon.core.webui.driver.firefox.CGeckoDriver
Capabilities {acceptInsecureCerts: true, browserName: firefox, browserVersion: 58.0.2, javascriptEnabled: true, moz:accessibilityChecks: false, moz:headless: false, moz:processID: 5508, moz:profile: C:\Users\usmanPC\AppData\Lo..., moz:webdriverClick: true, pageLoadStrategy: normal, platform: XP, platformName: XP, platformVersion: 6.3, proxy: Proxy(direct), rotatable: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}}
Session ID: d28ee92a-cfee-47ad-b6a2-46c705493a4e)
Test Cases/Valid_Login.run:31
Generated code is below:
import static com.kms.katalon.core.checkpoint.CheckpointFactory.findCheckpoint
import static com.kms.katalon.core.testcase.TestCaseFactory.findTestCase
import static com.kms.katalon.core.testdata.TestDataFactory.findTestData
import static com.kms.katalon.core.testobject.ObjectRepository.findTestObject
import com.kms.katalon.core.checkpoint.Checkpoint as Checkpoint
import com.kms.katalon.core.checkpoint.CheckpointFactory as CheckpointFactory
import com.kms.katalon.core.mobile.keyword.MobileBuiltInKeywords as MobileBuiltInKeywords
import com.kms.katalon.core.mobile.keyword.MobileBuiltInKeywords as Mobile
import com.kms.katalon.core.model.FailureHandling as FailureHandling
import com.kms.katalon.core.testcase.TestCase as TestCase
import com.kms.katalon.core.testcase.TestCaseFactory as TestCaseFactory
import com.kms.katalon.core.testdata.TestData as TestData
import com.kms.katalon.core.testdata.TestDataFactory as TestDataFactory
import com.kms.katalon.core.testobject.ObjectRepository as ObjectRepository
import com.kms.katalon.core.testobject.TestObject as TestObject
import com.kms.katalon.core.webservice.keyword.WSBuiltInKeywords as WSBuiltInKeywords
import com.kms.katalon.core.webservice.keyword.WSBuiltInKeywords as WS
import com.kms.katalon.core.webui.keyword.WebUiBuiltInKeywords as WebUiBuiltInKeywords
import com.kms.katalon.core.webui.keyword.WebUiBuiltInKeywords as WebUI
import internal.GlobalVariable as GlobalVariable
import org.openqa.selenium.Keys as Keys
WebUI.openBrowser('')
WebUI.navigateToUrl('https://dev.webtalk.co/')
WebUI.click(findTestObject('Page_Webtalk Communicate Better/a_Login'))
WebUI.waitForElementPresent(findTestObject('Page_Webtalk Communicate Better/input_email'), 5)
WebUI.setText(findTestObject('Page_Webtalk Communicate Better/input_email'), 'usman#myfake.tk')
WebUI.setText(findTestObject('Page_Webtalk Communicate Better/input_login_password'), 'Pmasuaar1##$')
WebUI.click(findTestObject('Page_Webtalk Communicate Better/input_login_button'))
WebUI.click(findTestObject('Page_Webtalk Communicate Better/button_Talk'))
WebUI.closeBrowser()
I had a similar issue and resolved it by adding a delay line (WebUI.delay(3)) before the call to the element that shows the error.

Selenium + Mink + Chrome: "Could not open connection" error

I'm not new to setting up Selenium and Mink, but it always seems to be a hassle. This time I'm trying to get it set up in an Ubuntu Docker container and I am running into the following error:
Could not open connection: Unable to create new service: ChromeDriverService
Build info: version: '3.8.1', revision: '6e95a6684b', time: '2017-12-01T19:05:32.194Z'
System info: host: 'a75b4026b8e5', ip: '172.20.0.6', os.name: 'Linux', os.arch: 'amd64', os.version: '4.9.60-linuxkit-aufs', java.version: '1.8.0_161'
Driver info: driver.version: unknown (Behat\Mink\Exception\DriverException)
I can tell that Mink is reaching the Selenium server to some degree, since the Selenium server outputs the following immediately before Behat reports the above error:
2018-01-30 16:13:49.870:INFO:osjshC.ROOT:qtp1156060786-12: org.openqa.selenium.remote.server.WebDriverServlet-10bbd20a: Initialising WebDriverServlet
16:13:49.988 INFO - Found handler: org.openqa.selenium.remote.server.commandhandler.BeginSession#4b4945b7
16:13:50.006 INFO - /session: Executing POST on /session (handler: BeginSession)
16:13:50.168 INFO - Capabilities are: Capabilities {browser: chrome, browserName: chrome, ignoreZoomSetting: false, marionette: true, name: Behat feature suite, tags: [a75b4026b8e5, PHP 5.6.31]}
16:13:50.171 INFO - Capabilities {browser: chrome, browserName: chrome, ignoreZoomSetting: false, marionette: true, name: Behat feature suite, tags: [a75b4026b8e5, PHP 5.6.31]} matched class org.openqa.selenium.remote.server.ServicedSession$Factory (provider: org.openqa.selenium.chrome.ChromeDriverService)
16:13:50.302 INFO - Found handler: org.openqa.selenium.remote.server.commandhandler.BeginSession#5a608e4b
16:13:50.303 INFO - /session: Executing POST on /session (handler: BeginSession)
16:13:50.306 INFO - Capabilities are: Capabilities {browser: chrome, browserName: chrome, ignoreZoomSetting: false, marionette: true, name: Behat feature suite, tags: [a75b4026b8e5, PHP 5.6.31]}
16:13:50.307 INFO - Capabilities {browser: chrome, browserName: chrome, ignoreZoomSetting: false, marionette: true, name: Behat feature suite, tags: [a75b4026b8e5, PHP 5.6.31]} matched class org.openqa.selenium.remote.server.ServicedSession$Factory (provider: org.openqa.selenium.chrome.ChromeDriverService)
Any idea what setting I have wrong?
Here's my setup:
Version of the chromedriver I installed:
https://chromedriver.storage.googleapis.com/2.35/chromedriver_linux64.zip
Version of Chrome I have installed via the apt-get install google-chrome-stable command:
Google Chrome 64.0.3282.119
Java version that's installed:
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
How I started selenium:
xvfb-run --server-args="-screen 0, 1366x768x24" java -Dwebdriver.chrome.driver="usr/bin/chromedriver" -jar selenium-server-standalone-3.8.1.jar &
composer.json:
...
"require-dev": {
"behat/behat": "^3.1",
"behat/mink": "^1.7",
"behat/mink-extension": "^2.2",
"behat/mink-goutte-driver": "^1.2",
"behat/mink-selenium2-driver": "^1.3",
...
},
behat.yml:
extensions:
  Behat\MinkExtension:
    goutte: ~
    base_url: http://localhost/myapp/
    browser_name: chrome
    javascript_session: selenium2
    default_session: goutte
    selenium2:
      wd_host: "http://127.0.0.1:4444/wd/hub"
      capabilities:
        marionette: true
        browser: chrome
        version: 2.9

Behat + Mink - Selenium2 issue: 'requiredCapabilities' parameter is not an object

I am running tests using Behat, Mink and Selenium2.
I am running the example search-with-autocompletion test that employs the #javascript tag in a scenario.
This is the exception I get:
Behat\MinkExtension\Context\MinkContext::visit()
'requiredCapabilities' parameter is not an object
Build info: version: '3.5.3', revision: 'a88d25fe6b', time: '2017-08-29T12:54:15.039Z'
System info: host: '73b4ecff4d3d', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '3.19.0-32-generic', java.version: '1.8.0_131'
Driver info: driver.version: unknown
remote stacktrace: stack backtrace:
0: 0x5787ed - backtrace::backtrace::trace::h59229d13f6a8837d
1: 0x578942 - backtrace::capture::Backtrace::new::h23089c033eded8f0
2: 0x5068af - <webdriver::capabilities::LegacyNewSessionParameters as webdriver::command::Parameters>::from_json::hd98a6246b0731ef9
3: 0x506e44 - <webdriver::command::NewSessionParameters as webdriver::command::Parameters>::from_json::ha19e8e984af08954
4: 0x41f249 - <webdriver::command::WebDriverMessage<U>>::from_http::h239258e4ad67ac76
5: 0x43a64e - <webdriver::server::HttpHandler<U> as hyper::server::Handler>::handle::hd20f6e9e0a69e2b4
6: 0x42c9af - hyper::server::listener::spawn_with::{{closure}}::h8fa3cf343f537777
7: 0x4092d7 - std::panicking::try::do_call::h649be53a713433eb
8: 0x5dc20a - panic_unwind::__rust_maybe_catch_panic
at /checkout/src/libpanic_unwind/lib.rs:98
9: 0x41c43e - <F as alloc::boxed::FnBox<A>>::call_box::hf41feb3b2b67541e
10: 0x5d48a4 - alloc::boxed::{{impl}}::call_once<(),()>
at /checkout/src/liballoc/boxed.rs:650
- std::sys_common::thread::start_thread
at /checkout/src/libstd/sys_common/thread.rs:21
- std::sys::imp::thread::{{impl}}::new::thread_start
at /checkout/src/libstd/sys/unix/thread.rs:84
I am running Selenium2 standalone server v3.5.3 using docker-selenium and geckodriver v0.18.
My configuration for Behat:
default:
context:
class: 'FeatureContext'
extensions:
Behat\MinkExtension\Extension:
base_url: '<skipped>'
goutte:
guzzle_parameters:
curl.options:
CURLOPT_SSL_VERIFYPEER: false
CURLOPT_CERTINFO: false
CURLOPT_TIMEOUT: 120
ssl.certificate_authority: false
selenium2:
wd_host: "http://127.0.0.1:4444/wd/hub"
capabilities:
browser: firefox
# acceptSslCerts: true
# marionette: true
# No context:
no_context:
paths:
bootstrap: 'NON_EXISTING_FOLDER'
filters:
tags: '~#javascript'
# Context based on inheritance:
inheritance:
context:
class: 'InheritedFeatureContext'
# Context based on traits:
traits:
paths:
bootstrap: 'features/php54_bootstrap'
context:
class: 'TraitedFeatureContext'
# Context based on subcontexting:
subcontexts:
context:
class: 'SubcontextedFeatureContext'
If I connect to localhost:4444 by hand I am able to create a session. Using python-selenium I am able to run this code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.FIREFOX
)
#driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
Thank you in advance folks.
Selenium 2 and Selenium 3 are not the same; they are not 100% compatible. Selenium 3 is a step towards the W3C WebDriver standard.
From the exception it seems that Mink may not be compatible with Selenium 3, so you could downgrade your Selenium version to 2 (though that again may not work, because the latest browsers may only support 3).
Also please look at https://github.com/minkphp/MinkSelenium2Driver/issues/254
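To make the incompatibility concrete, here is a hedged sketch (using the requests library against the wd_host above) of the legacy JSON Wire new-session payload that Selenium2-era clients such as Mink send, versus the W3C-style payload that Selenium 3 / geckodriver expect; the exact payload your stack sends may differ:

import requests

WD_HUB = 'http://127.0.0.1:4444/wd/hub'

# Legacy JSON Wire protocol body. If requiredCapabilities arrives as null or a
# non-object, geckodriver's webdriver crate rejects the request with
# "'requiredCapabilities' parameter is not an object".
legacy_payload = {
    'desiredCapabilities': {'browserName': 'firefox', 'marionette': True},
    'requiredCapabilities': {},  # must be a JSON object, not null
}

# W3C WebDriver body, which Selenium 3 / geckodriver standardise on.
w3c_payload = {
    'capabilities': {
        'alwaysMatch': {'browserName': 'firefox'},
        'firstMatch': [{}],
    }
}

for payload in (legacy_payload, w3c_payload):
    resp = requests.post(WD_HUB + '/session', json=payload)
    print(resp.status_code, resp.text[:200])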

Riak CS returns "s3.amazonaws.com" no matter what is in cs_root_host

cs_root_host is set up right:
grep root_host /var/lib/riak-cs/generated.configs/app.2015.07.09.13.59.07.config
{cs_root_host,"s3.example.com"},
But when I upload a file:
s3cmd put test.jpg s3://images --acl-public
I get in return:
Public URL of the object is: http://images.s3.amazonaws.com/test.jpg
Where is the issue?
Added:
Here is output - everything looks fine, except the last line:
(example.com is just a replacement for the real domain, which I don't want to make public)
s3cmd -d -c .s3cfg put newfile.jpg s3://images --acl-public
DEBUG: ConfigParser: Reading file '.s3cfg'
DEBUG: ConfigParser: access_key->YD...17_chars...U
DEBUG: ConfigParser: bucket_location->RU
DEBUG: ConfigParser: cloudfront_host->cloudfront.amazonaws.com
DEBUG: ConfigParser: cloudfront_resource->/2010-07-15/distribution
DEBUG: ConfigParser: default_mime_type->binary/octet-stream
DEBUG: ConfigParser: delete_removed->False
DEBUG: ConfigParser: dry_run->False
DEBUG: ConfigParser: encoding->UTF-8
DEBUG: ConfigParser: encrypt->False
DEBUG: ConfigParser: follow_symlinks->False
DEBUG: ConfigParser: force->False
DEBUG: ConfigParser: get_continue->False
DEBUG: ConfigParser: gpg_command->/usr/bin/gpg
DEBUG: ConfigParser: gpg_decrypt->%(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
DEBUG: ConfigParser: gpg_encrypt->%(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
DEBUG: ConfigParser: gpg_passphrase->...-3_chars...
DEBUG: ConfigParser: guess_mime_type->True
DEBUG: ConfigParser: host_base->s3.example.com
DEBUG: ConfigParser: host_bucket->%(bucket)s.s3.example.com
DEBUG: ConfigParser: human_readable_sizes->False
DEBUG: ConfigParser: list_md5->False
DEBUG: ConfigParser: log_target_prefix->
DEBUG: ConfigParser: preserve_attrs->True
DEBUG: ConfigParser: progress_meter->True
DEBUG: ConfigParser: proxy_host->127.0.0.1
DEBUG: ConfigParser: proxy_port->80
DEBUG: ConfigParser: recursive->False
DEBUG: ConfigParser: recv_chunk->4096
DEBUG: ConfigParser: reduced_redundancy->False
DEBUG: ConfigParser: secret_key->kG...37_chars...=
DEBUG: ConfigParser: send_chunk->4096
DEBUG: ConfigParser: simpledb_host->sdb.amazonaws.com
DEBUG: ConfigParser: skip_existing->False
DEBUG: ConfigParser: socket_timeout->10
DEBUG: ConfigParser: urlencoding_mode->normal
DEBUG: ConfigParser: use_https->False
DEBUG: ConfigParser: verbosity->WARNING
DEBUG: Updating Config.Config encoding -> UTF-8
DEBUG: Updating Config.Config follow_symlinks -> False
DEBUG: Updating Config.Config verbosity -> 10
DEBUG: Unicodising 'put' using UTF-8
DEBUG: Unicodising 'newfile.jpg' using UTF-8
DEBUG: Unicodising 's3://images' using UTF-8
DEBUG: Command: put
INFO: Compiling list of local files...
DEBUG: DeUnicodising u'' using UTF-8
DEBUG: DeUnicodising u'newfile.jpg' using UTF-8
DEBUG: Unicodising 'newfile.jpg' using UTF-8
DEBUG: Unicodising 'newfile.jpg' using UTF-8
INFO: Applying --exclude/--include
DEBUG: CHECK: newfile.jpg
DEBUG: PASS: newfile.jpg
INFO: Summary: 1 local files to upload
DEBUG: Content-Type set to 'image/jpeg'
DEBUG: String 'newfile.jpg' encoded to 'newfile.jpg'
DEBUG: SignHeaders: 'PUT\n\nimage/jpeg\n\nx-amz-acl:public-read\nx-amz-date:Fri, 10 Jul 2015 09:55:37 +0000\n/images/newfile.jpg'
DEBUG: CreateRequest: resource[uri]=/newfile.jpg
DEBUG: Unicodising 'newfile.jpg' using UTF-8
DEBUG: SignHeaders: 'PUT\n\nimage/jpeg\n\nx-amz-acl:public-read\nx-amz-date:Fri, 10 Jul 2015 09:55:37 +0000\n/images/newfile.jpg'
newfile.jpg -> s3://images/newfile.jpg [1 of 1]
DEBUG: get_hostname(images): images.s3.example.com
DEBUG: format_uri(): http://images.s3.example.com/newfile.jpg
32600 of 32600 100% in 0s 14.49 MB/sDEBUG: Response: {'status': 200, 'headers': {'content-length': '0', 'server': 'nginx', 'connection': 'keep-alive', 'etag': '"89e39f454c69a1ce1fadec3a222fc292"', 'date': 'Fri, 10 Jul 2015 09:55:37 GMT', 'content-type': 'text/plain'}, 'reason': 'OK', 'data': '', 'size': 32600}
32600 of 32600 100% in 0s 391.54 kB/s done
DEBUG: MD5 sums: computed=89e39f454c69a1ce1fadec3a222fc292, received="89e39f454c69a1ce1fadec3a222fc292"
Public URL of the object is: http://images.s3.amazonaws.com/newfile.jpg
This is not a Riak CS issue. s3cmd itself produces the public URL string and prints it.
For my environment, with s3cmd from the master branch at commit 7bdefc81823699069706ea3680bfa65ec8ad3db5 (just fetched today, 2015-07-14), it shows the (seemingly) correct URL:
% ~/g/s3cmd/build/scripts-2.7/s3cmd -c .s3cfg.15018.alice put rebar.config -P s3://test/a
rebar.config -> s3://test/a [1 of 1]
2791 of 2791 100% in 0s 196.88 kB/s done
Public URL of the object is: http://test.s3.example.com/a
From the source code of s3cmd, it seems it uses the host_bucket or host_base configuration depending on the bucket name (or maybe other settings), as sketched below.
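As a rough illustration (this is not s3cmd's actual code, just a sketch of the behaviour described above), the newer builds derive the public host from the host_bucket / host_base settings shown in the debug output:

# Sketch of bucket-aware public URL construction from s3cmd-style settings.
def public_url(bucket, key,
               host_base='s3.example.com',
               host_bucket='%(bucket)s.s3.example.com'):
    # Use the per-bucket host template when a bucket is given,
    # otherwise fall back to the bare host_base.
    host = host_bucket % {'bucket': bucket} if bucket else host_base
    return 'http://%s/%s' % (host, key)

print(public_url('images', 'newfile.jpg'))
# -> http://images.s3.example.com/newfile.jpg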
Some other details on my environment:
s3cmd configuration: host_base = s3.example.com and host_bucket = %(bucket)s.s3.example.com
Server is Riak CS from the develop branch (commit 1f954aaae45429923f65fdad40c7916a55ab79f3)
Riak CS configuration: cs_root_host = s3.example.com