How to customize the Scrapy feed URI with non-built-in storage URI parameters

I want to customize the Scrapy feed URI to S3 so that it includes the dimensions of the uploaded file. Currently I have the following in my settings.py file:
FEEDS = {
    's3://path-to-file/file_to_have_dimensions.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'indent': 4,
    }
}
But I would like to have something like the following:
NUMBER_OF_ROWS_IN_CSV = file.height()

FEEDS = {
    f's3://path-to-files/file_to_have_dimensions_{NUMBER_OF_ROWS_IN_CSV}.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'indent': 4,
    }
}
Note that I would like the number of rows to be inserted automatically.
Is it possible to do this solely by changing settings.py, or is it necessary to change other parts of the Scrapy code?

The feed file is created when the spider starts running, at which point the number of items is not yet known. However, when the spider finishes, it calls a method named closed, from which you can access the spider stats and settings and perform any other tasks you want to run after the spider has finished scraping and saving items.
In the example below I rename the feed file from initial_file.csv to final_file_{item_count}.csv.
Since you cannot rename files in S3, I use the boto3 library to copy initial_file.csv to a new object whose name includes the item_count value, and then delete the initial file.
import scrapy
import boto3


class SampleSpider(scrapy.Spider):
    name = 'sample'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]
    custom_settings = {
        'FEEDS': {
            's3://path-to-file/initial_file.csv': {
                'format': 'csv',
                'encoding': 'utf8',
                'store_empty': False,
                'indent': 4,
            }
        }
    }

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

    def closed(self, reason):
        # called after the spider finishes and the feed has been written
        item_count = self.crawler.stats.get_value('item_scraped_count')
        try:
            session = boto3.Session(aws_access_key_id='awsAccessKey',
                                    aws_secret_access_key='awsSecretAccessKey')
            s3 = session.resource('s3')
            # S3 has no rename, so copy to the final name and delete the original
            s3.Object('my_bucket', f'path-to-file/final_file_{item_count}.csv').copy_from(
                CopySource='my_bucket/path-to-file/initial_file.csv')
            s3.Object('my_bucket', 'path-to-file/initial_file.csv').delete()
        except Exception:
            self.logger.info("unable to rename s3 file")

Related

How to read files from an S3 bucket in Glue by using a partial file name

I am trying to read files from an S3 bucket in Glue based on a keyword search on file names. For example, read a file if the file name contains "file". This is the code I am currently using to read a given file from the S3 bucket.
File1_node = glueContext.create_dynamic_frame.from_options(
    format_options={"quoteChar": '"', "withHeader": True, "separator": ","},
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": [
            "s3://env-files/data/material/filename1.csv"
        ],
        "recurse": True,
    },
    transformation_ctx="File1_node",
)
File1 = File1_node.toDF()
I want to read files dynamically by using a keyword search. For example, if the keyword is "file" and there is a file named "filename1", then that file should be read. If there are multiple files that contain the same keyword, then append them all. Please let me know if there is any way to do so. Thanks!
You could do that using the boto3 S3 list_objects_v2() call.
import boto3
from typing import List

s3_client = boto3.client('s3')

def get_all_filepaths(filename_filter: str, bucket: str, prefix: str) -> List[str]:
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [key['Key'] for key in response['Contents'] if filename_filter in key['Key']]

File1_node = glueContext.create_dynamic_frame.from_options(
    format_options={"quoteChar": '"', "withHeader": True, "separator": ","},
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": get_all_filepaths(filename_filter, bucket, prefix),
        "recurse": True,
    },
    transformation_ctx="File1_node",
)
File1 = File1_node.toDF()
Using this, you can get a list of keys that match your criteria. I haven't run this, but I think you will also need to prepend the s3:// scheme to each key, so please check that. Also, list_objects_v2() returns at most 1000 objects per call, so if there are more you will have to keep fetching pages using the NextContinuationToken from the response.
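If pagination does become necessary, a boto3 paginator handles the continuation token for you. Here is an untested sketch of the same helper (same name and parameters as above) that also prepends the s3://bucket/ prefix to each key:
import boto3
from typing import List

s3_client = boto3.client('s3')

def get_all_filepaths(filename_filter: str, bucket: str, prefix: str) -> List[str]:
    # list_objects_v2 returns at most 1000 keys per call;
    # the paginator follows NextContinuationToken automatically
    paginator = s3_client.get_paginator('list_objects_v2')
    paths = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if filename_filter in obj['Key']:
                paths.append(f"s3://{bucket}/{obj['Key']}")
    return paths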
Hope this helps!

Storing cache results

I have activated the HTTP cache storage backend scrapy.extensions.httpcache.FilesystemCacheStorage to store cached results in a folder (gzipped) while scraping. However, I get the following error:
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'\x80\x04')
I think the issue is related to how the cache files are saved.
My settings.py:
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_DBM_MODULE = 'dbm'
HTTPCACHE_GZIP = True
How do I correctly activate the extension and store the cache as files in my working directory?
Example scraper:
import scrapy


class email_spider(scrapy.Spider):
    name = 'email'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        content = response.xpath("(//div[@class='col-md-8'])[2]//div")
        for stuff in content:
            yield {
                'stuff': stuff.xpath(".//a//@href").get(),
            }

HubSpot API - Search contact by phone

I understand that the HubSpot API provides an endpoint to search contacts by name or email.
Is there a way to search a contact by phone number with the API?
For an online tool with a free trial, you can try Acho. Our company uses it to fetch data from HubSpot and uses its built-in search bar to look up a specific contact that we need.
The Python code below works on my side.
This page looks very useful.
Cheers!
import json
import requests

api_key = "your api key"
test_search_url = f'https://api.hubapi.com/crm/v3/objects/contact/search?hapikey={api_key}'
headers = {"Content-Type": "application/json"}
payload = {
    "filterGroups": [
        {
            "filters": [
                {
                    "propertyName": "phone",
                    "operator": "EQ",
                    "value": "1234567"
                }
            ]
        }
    ],
    "properties": [
        "firstname", "lastname", "email", "hs_object_id"
    ]
}
res = requests.post(test_search_url, headers=headers, data=json.dumps(payload))
res.json()
####Output####
# {'total': 1,
# 'results': [{'id': '13701',
# 'properties': {'createdate': '2022-02-10T07:27:18.733Z',
# 'email': 'adi@XXX.com',
# 'firstname': 'XXX K.',
# 'hs_object_id': '13701',
# 'lastmodifieddate': '2022-02-10T07:28:01.033Z',
# 'lastname': None},
# 'createdAt': '2022-02-10T07:27:18.733Z',
# 'updatedAt': '2022-02-10T07:28:01.033Z',
# 'archived': False}]
# }
I have a function for exactly that written in Python.
The function returns a list of contact IDs, but you can retrieve other properties if you want; you just have to add them to properties=["id"].
from hubspot import HubSpot
from hubspot.crm.contacts import ApiException
from hubspot.crm.contacts import PublicObjectSearchRequest


def get_contacts_by_phone(phone):
    api_client = HubSpot(access_token=YOUR_TOKEN)  # YOUR_TOKEN is your private app access token
    public_object_search_request = PublicObjectSearchRequest(
        filter_groups=[],
        sorts=[{"propertyName": "phone", "direction": "DESCENDING"}],
        properties=["id"],
        query=phone,
        limit=100,
    )
    try:
        api_response = api_client.crm.contacts.search_api.do_search(
            public_object_search_request=public_object_search_request
        )
        print(api_response)
        contacts_ids = []
        for contact in api_response.results:
            contacts_ids.append(contact.id)
        return contacts_ids
    except ApiException as e:
        print("Exception when calling search_api->do_search: %s\n" % e)

Rails - Uploading large files directly to S3 with jQuery File Upload (hosted on Heroku)

I'm using Heroku, which means I have to upload multiple large files to S3 directly. I'm using Rails 3.2.11 and Ruby 1.9.3. I do not wish to use the carrierwave or paperclip gems, or really change much at this point; I just need to get what I have working.
Before trying to move to S3, if I ran my app locally I could upload multiple large files to the local file system. When I ran it on Heroku, small files uploaded but large ones failed. Hence the switch to S3.
I tried several tweaks, and also the link below, but it's just too much of a change from what I already have working with the local server's file system (and on Heroku as well, except that Heroku just can't handle large files).
Tried: https://devcenter.heroku.com/articles/direct-to-s3-image-uploads-in-rails
I've tried some of the other examples here on Stack Overflow but they are too much of a change for what works locally, and well, I don't grasp everything they are doing.
Now, what happens when I do try to upload images?
It's as if the file upload works: the preview images are successfully created, but nothing is ever uploaded to Amazon S3, and I don't receive any kind of error message (no S3 authentication failure, nothing).
What do I need to change in order to get the files over to my S3 storage, and what can I write to the console to detect problems, if any, connecting to S3?
My form:
<%= form_for @status do |f| %>
  {A FEW HTML FIELDS USED FOR A DESCRIPTION OF THE FILES - NOT IMPORTANT FOR THE QUESTION}
  File: <input id="fileupload" multiple="multiple" name="image"
               type="file" data-form-data="<%= @s3_direct_post.fields %>"
               data-url="<%= @s3_direct_post.url %>"
               data-host="<%= URI.parse(@s3_direct_post.url).host %>" >
  <%= link_to 'submit', "#", :id => 'submit', :remote => true %>
<% end %>
My jQuery is:
....
$('#fileupload').fileupload({
    formData: {
        batch: createUUID(),
        authenticity_token: $('meta[name="csrf-token"]').attr('content')
    },
    dataType: 'json',
    acceptFileTypes: /(\.|\/)(gif|jpe?g|png)$/i,
    maxFileSize: 5000000, // 5 MB
    previewMaxWidth: 400,
    previewMaxHeight: 400,
    previewCrop: true,
    add: function (e, data) {
        tmpImg.src = URL.createObjectURL(data.files[0]); // create image preview
        $('#' + fn + '_inner').append(tmpImg);
...
My controller:
def index
  # it's in the index just to simplify getting it working
  @s3_direct_post = S3_BUCKET.presigned_post(key: "uploads/#{SecureRandom.uuid}/${filename}", success_action_status: '201', acl: 'public-read')
end
The element that is generated for the form is (via Inspect Element):
<input id="fileupload" multiple="multiple" name="image"
data-form-data="{"key"=>"uploads/34a64607-8d1b-4704-806b-159ecc47745e/${filename}"," "success_action_status"="
>"201"," "acl"=">"public-read"," "policy"=">"[encryped stuff - no need to post]","
"x-amz-credential"=">"
[AWS access key]/[some number]/us-east-1/s3/aws4_request"
," "x-amz-algorithm"=">"AWS4-HMAC-SHA256"
," "x-amz-date"=">"20150924T234656Z"
," "x-amz-signature"=">"[some encrypted stuff]"}"
data-url="https://nunyabizness.s3.amazonaws.com" data-host="nunyabizness.s3.amazonaws.com" type="file">
Help!
With S3 there actually is no easy out-of-the-box solution for uploading files, because Amazon is a rather complex instrument.
I had a similar issue back in the day and spent two weeks trying to figure out how S3 works; I now have a working solution for uploading files to S3. I can tell you the solution that works for me; I never tried the one proposed by Heroku.
The plugin I use is Plupload, since it is the only component I actually managed to get working, apart from simple direct S3 uploads via XHR. It offers percentage indicators and in-browser image resizing, which I find mandatory for production applications, where some users have 20 MB images that they want to upload as their avatar.
Some basics in S3:
Step 1
The Amazon bucket needs the correct configuration in its CORS file to allow external uploads in the first place. The Heroku tutorial already told you how to put the configuration in the right place.
http://docs.aws.amazon.com/AmazonS3/latest/dev/cors.html
Step 2
Policy data is needed, otherwise your client will not be able to access the corresponding bucket file. I find that generating policies is better done via Ajax calls, so that, for example, an admin gets the ability to upload files into the folders of different users.
In my example, cancan is used to manage security for the given user and figaro is used to manage ENV variables.
def aws_policy_image
  user = User.find_by_id(params[:user_id])
  authorize! :upload_image, current_user

  options = {}
  bucket = Rails.configuration.bucket
  access_key_id = ENV["AWS_ACCESS_KEY_ID"]
  secret_access_key = ENV["AWS_SECRET_ACCESS_KEY"]
  options[:key] ||= "users/" + params[:user_id] # folder on AWS to store file in
  options[:acl] ||= 'private'
  options[:expiration_date] ||= 10.hours.from_now.utc.iso8601
  options[:max_filesize] ||= 10.megabytes
  options[:content_type] ||= 'image/' # Videos would be binary/octet-stream
  options[:filter_title] ||= 'Images'
  options[:filter_extentions] ||= 'jpg,jpeg,gif,png,bmp'

  policy = Base64.encode64(
    "{'expiration': '#{options[:expiration_date]}',
      'conditions': [
        {'x-amz-server-side-encryption': 'AES256'},
        {'bucket': '#{bucket}'},
        {'acl': '#{options[:acl]}'},
        {'success_action_status': '201'},
        ['content-length-range', 0, #{options[:max_filesize]}],
        ['starts-with', '$key', '#{options[:key]}'],
        ['starts-with', '$Content-Type', ''],
        ['starts-with', '$name', ''],
        ['starts-with', '$Filename', '']
      ]
    }").gsub(/\n|\r/, '')

  signature = Base64.encode64(
    OpenSSL::HMAC.digest(
      OpenSSL::Digest::Digest.new('sha1'),
      secret_access_key, policy)).gsub("\n", "")

  render :json => {:access_key_id => access_key_id, :policy => policy, :signature => signature, :bucket => bucket}
end
I went as far as putting this method into the application controller, although you could find a better place for it.
The path to this action should be added to the routes, of course.
Step 3
Frontend: get Plupload (http://www.plupload.com/) and make a link to act as the upload button:
<a id="upload_button" href="#">Upload</a>
Make a script that configures the Plupload initialization.
function Plupload(config_x, access_key_id, policy, signature, bucket) {
    var $this = this;

    $this.config = $.extend({
        key: 'error',
        acl: 'private',
        content_type: '',
        filter_title: 'Images',
        filter_extentions: 'jpg,jpeg,gif,png,bmp',
        select_button: "upload_button",
        multi_selection: true,
        callback: function (params) {
        },
        add_files_callback: function (up, files) {
        },
        complete_callback: function (params) {
        }
    }, config_x);

    $this.params = {
        runtimes: 'html5',
        browse_button: $this.config.select_button,
        max_file_size: $this.config.max_file_size,
        url: 'https://' + bucket + '.s3.amazonaws.com/',
        flash_swf_url: '/assets/plupload/js/Moxie.swf',
        silverlight_xap_url: '/assets/plupload/js/Moxie.xap',
        init: {
            FilesRemoved: function (up, files) {
                /*if (up.files.length < 1) {
                    $('#' + config.select_button).fadeIn('slow');
                }*/
            }
        },
        multi_selection: $this.config.multi_selection,
        multipart: true,
        // resize: {width: 1000, height: 1000}, // currently causes "blob" problem
        multipart_params: {
            'acl': $this.config.acl,
            'Content-Type': $this.config.content_type,
            'success_action_status': '201',
            'AWSAccessKeyId': access_key_id,
            'x-amz-server-side-encryption': "AES256",
            'policy': policy,
            'signature': signature
        },
        // Resize images on clientside if we can
        resize: {
            preserve_headers: false, // (!)
            width: 1200,
            height: 1200,
            quality: 70
        },
        filters: [
            {
                title: $this.config.filter_title,
                extensions: $this.config.filter_extentions
            }
        ],
        file_data_name: 'file'
    };

    $this.uploader = new plupload.Uploader($this.params);
    $this.uploader.init();

    $this.uploader.bind('UploadProgress', function (up, file) {
        $('#' + file.id + ' .percent').text(file.percent + '%');
    });

    // before upload
    $this.uploader.bind('BeforeUpload', function (up, file) {
        // optional: regen the filename, otherwise the user will upload image.jpg that will overwrite each other
        var extension = file.name.split('.').pop();
        var file_name = extension + "_" + (+new Date);
        up.settings.multipart_params.key = $this.config.key + '/' + file_name + '.' + extension;
        up.settings.multipart_params.Filename = $this.config.key + '/' + file_name + '.' + extension;
        file.name = file_name + '.' + extension;
    });

    // shows error object in the browser console (for now)
    $this.uploader.bind('Error', function (up, error) {
        console.log('Expand the error object below to see the error. Use WireShark to debug.');
        alert_x(".validation-error", error.message);
    });

    // files added
    $this.uploader.bind('FilesAdded', function (up, files) {
        $this.config.add_files_callback(up, files, $this.uploader);
        // p(uploader);
        // uploader.start();
    });

    // when file gets uploaded
    $this.uploader.bind('FileUploaded', function (up, file) {
        $this.config.callback(file);
        up.refresh();
    });

    // when all files are uploaded
    $this.uploader.bind('UploadComplete', function (up, file) {
        $this.config.complete_callback(file);
        up.refresh();
    });
}

Plupload.prototype.init = function () {
    //
}
Step 4
The implementation of the general multi-purpose file uploader function:
ImageUploader = {
    init: function (user_id, config, callback) {
        $.ajax({
            type: "get",
            url: "/aws_policy_image",
            data: {user_id: user_id},
            error: function (request, status, error) {
                alert(request.responseText);
            },
            success: function (msg) {
                // set aws credentials
                callback(config, msg);
            }
        });
    },

    // local functions
    photo_uploader: function (user_id) {
        var container = "#photos .unverified_images"; // for example
        var can_render = false;
        this.init(user_id,
            {
                select_button: "upload_photos",
                callback: function (file) {
                    file.aws_id = file.id;
                    file.id = "0";
                    file.album_title = "userpics"; // I use this param to manage photo directory
                    file.user_id = user_id;
                    //console.log(file);
                    // [** your ajax code here that saves the image object in the database via the file variable you get here **]
                },
                add_files_callback: function (up, files, uploader) {
                    $.each(files, function (index, value) {
                        // do something like adding a progress bar html
                    });
                    uploader.start();
                },
                complete_callback: function (files) {
                    can_render = true;
                }
            }, function (config, msg) {
                config.key = "users/" + user_id;
                // Most important part:
                window.photo_uploader = new Plupload(config, msg.access_key_id, msg.policy, msg.signature, msg.bucket);
            });
    }
};
The can_render variable is useful so that you can make the application re-render the page only when the uploader is actually done.
And to make the button work from somewhere else call:
ImageUploader.photo_uploader(user_id);
And the button will act as a Plupload uploader button.
What is important is that the policy is made in such a way that no one can upload a photo into someone else's directory.
It would be great to have a version that does the same not via Ajax callbacks but with webhooks; this is something I want to do in the future.
Again, this is not a perfect solution, but in my experience it works well enough for the purpose of uploading images and videos to Amazon.
In case someone asks why I have this complex object-oriented structure of uploader objects: my application has many different kinds of uploaders that behave differently, and they need an initializer with common behavior. The way I did it, I can write an initializer for, say, videos with a minimal amount of code that does similar things to the existing image uploader.

How to set different scrapy-settings for different spiders?

I want to enable some http-proxy for some spiders, and disable them for other spiders.
Can I do something like this?
# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders:  # how to get the spider name ???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
If the code above doesn't work, is there any other suggestion?
A bit late, but since release 1.0.0 there is a new feature in Scrapy where you can override settings per spider, like this:
class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None}}

class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None}}
There is a new and easier way to do this.
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'SOME_SETTING': 'some value',
    }
I use Scrapy 1.3.1
You can add settings.overrides within the spider.py file.
Example that works:
from scrapy.conf import settings
settings.overrides['DOWNLOAD_TIMEOUT'] = 300
For you, something like this should also work:
from scrapy.conf import settings
settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
You can define your own proxy middleware, something straightforward like this:
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)
Then define the attribute use_proxy = True in the spiders for which you want the proxy enabled. Don't forget to disable the default proxy middleware and enable your modified one, as sketched below.
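A minimal sketch of that wiring, assuming the middleware above lives in myproject.middlewares (the module path and the 750 priority are illustrative, not taken from the question):
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in proxy middleware
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    # enable the conditional version defined above
    'myproject.middlewares.ConditionalProxyMiddleware': 750,
}

# myspider.py
import scrapy

class ProxiedSpider(scrapy.Spider):
    name = 'proxied'
    use_proxy = True  # ConditionalProxyMiddleware checks this attribute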
Why not use two projects rather than only one?
Let's name these two projects proj1 and proj2. In proj1's settings.py, put these settings:
HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
In proj2's settings.py, put these settings:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}