How to send a list of numbers / strings with scrapyd to a spider when using scrapyd.schedule - scrapy

I'm trying to start my scrapy bot from a Django application and I need to pass in a list of strings and also a list of numbers that the bot requires to function. This is my code in the views.py of my Django application:
task = scrapyd.schedule('default', 'arc',
                        settings=settings, numIds=numIds, formOp=formOp)
numIds is the list of numbers I need to pass and formOp is the list of strings I need to pass. I tried extracting them in the scrapy spider file using:
kwargs.get('numIds')
kwargs.get('formOp')
For some reason, it only gives me the first number/string in the list, even when I try to send a JSON object. Is there a way to pass the entire list through scrapyd?
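One commonly suggested workaround (a sketch, assuming scrapyd forwards each spider argument as a single string; the ArcSpider class name is illustrative) is to serialize the lists to JSON before scheduling and parse them back inside the spider:
import json
import scrapy

# In the Django view (scrapyd, settings, numIds and formOp as in the question):
task = scrapyd.schedule('default', 'arc',
                        settings=settings,
                        numIds=json.dumps(numIds),   # list -> one JSON string argument
                        formOp=json.dumps(formOp))

# In the spider, parse the JSON strings back into Python lists:
class ArcSpider(scrapy.Spider):
    name = 'arc'

    def __init__(self, numIds=None, formOp=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.numIds = json.loads(numIds) if numIds else []
        self.formOp = json.loads(formOp) if formOp else []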

Related

How to extract from an array by name that includes a specific string?

I am building a Rails 5.2 app.
In this app I handle incoming emails. I parse the body of the email and get an array with links from the body. How can I extract a specific link/URL and remove the others?
This is my array:
["code:", "request:", "https://mail.google.com/mail/vf-%5BANGjdJ9Ql923ILcJJ6A_uyjeKXxfYXWYew3ao8Hg1sDGPECPFhhZ7LPRP2-vIfW09o5MIoRmcrdxNDnD9LPYUTwTTx-LaJLIlfZ1xoGL7vyjLVlDyU8PVlv1T9Rf3p_I%5D-DWNS6pShF_p-zPxZNCLRE7RI-hI", "verification:", "https://mail.google.com/mail/uf-%5BANGjdJ-tvNwZ7q-MwbYcKwoXRJQ9XN0bTmUKs5ktJIP2gmeHTvnbBEj_ogXKy6uTz2dF6a-W6Qbz-2oJ5olVf4m5QPvpAi9nKS_6wHIcFY1WzYkQh3d_OS0o2r0dDB50%5D-DWNS6pShF_p-zPxZNCLRE7RI-hI", "visit:", "http://support.google.com/mail/bin/answer.py?answer=184973."]
I extract links with this:
URI.extract(body)
Unfortunately, it extracts not only links but also other strings.
I want to extract the link starting with https://mail.google.com/mail/.
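One straightforward approach (a sketch, not from the original post) is to filter the extracted strings by that prefix:
link = URI.extract(body).find { |s| s.start_with?("https://mail.google.com/mail/") }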

Is there any way to store array/list as variable/parameter in constructing an API in Postman?

I'm trying to parameterize a url in Postman and loop over that list to call multiple APIs one after another and append the JSON bodies.
For example:
If the URL is GET https://location/store/{{user_id}}/date
and
user_id = ['1','3','5','2','6','8']
then how do I store user_id as a variable so that the request can loop over each user_id in the URL and generate an appended JSON body?
You can use data files. Store the user_id values in a CSV or JSON file and load that file into your Collection Runner. The headers of the CSV file can be used as variable names. Please see the details of this approach at the link below:
https://learning.postman.com/docs/postman/collection-runs/working-with-data-files/
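For illustration (the exact layout is an assumption, not part of the original answer), a JSON data file for the Collection Runner could look like this; each object supplies the {{user_id}} for one iteration of the GET https://location/store/{{user_id}}/date request:
[
  {"user_id": "1"},
  {"user_id": "3"},
  {"user_id": "5"},
  {"user_id": "2"},
  {"user_id": "6"},
  {"user_id": "8"}
]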

How to get JSON data from an API in Robot Framework

I am trying to get JSON data from an API in Robot Framework; the data contains ids, and I have to get the count of the ids present in the data obtained from the API.
I have tried the below code:
${result} = get ${API_JSON_PATH}
Should Be Equal ${result.status_code} ${200}
${json_data} = Set Variable ${result.content}
Log ${json_data}
I am getting the below mentioned error:
No keyword with name '${result} = get' found.
Is the approach correct, or are there better ways of getting the JSON data?
I'm using the RequestsLibrary and it's slightly different from what you are doing.
The credentials are not needed in your case, but this is the example:
@{credential}=    Create List    Your_Username    Your_Password
Create Session    YOUR_API_ALIAS    YOUR_URL    auth=@{credential}
${api}=    Get Request    YOUR_API_ALIAS    YOUR_URI
If you want to get the content of the JSON:
${api.json()}
Documentation: https://bulkan.github.io/robotframework-requests/
You need to have two or more spaces after the =. Robot looks for two or more spaces to find keywords and arguments, so it thinks your first statement begins with the keyword ${result} = get. Since that's not a valid keyword, you get that error.
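For example, with two or more spaces separating each cell, the original snippet would become something like this (a sketch assuming the session-less GET keyword of a recent RequestsLibrary and the BuiltIn Get Length keyword for counting the ids):
${result}=    GET    ${API_JSON_PATH}
Should Be Equal    ${result.status_code}    ${200}
${json_data}=    Set Variable    ${result.json()}
${count}=    Get Length    ${json_data}
Log    ${count}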

Setting a custom long list of starting URLS in Scrapy

The crawling starts from the list included in start_urls = []
I need a long list of these starting URLs, and I have tried two methods of solving this problem:
Method 1: Using pandas to define the starting_urls array
#Array of Keywords
keywords = pandas.Keyword
urls = {}
count = 0
while(count < 100):
    urls[count] = 'google.com?q=' + keywords[count]
    count = count + 1
# Now I have the starting urls in the urls array.
However, it doesn't seem to define starting_urls = urls because when I run:
scrapy crawl SPIDER
I get the error:
Error: Request url must be str or unicode, got int:
Method 2:
Each starting URL contains paginated content and in the def parse method I have the following code to crawl all linked pages.
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
yield response.follow(next_page, callback=self.parse)
I want to add additional pages to crawl from the urls array defined above.
count = 0
while(count < 100):
    yield response.follow(urls[count], callback=self.parse)
    count = count + 1
But it seems that neither of these two methods works. Maybe I can't add this code to the spider.py file?
First note: obviously I can't say I've run your entire script, since it's incomplete, but the first thing I noticed is that your base URL does need to be in the proper format, i.e. "http://etc.etc", for Scrapy to make a proper request.
Also, not to question your skills, but in case you weren't aware: with strip, split and join (plus str/int conversion) you can turn lists, strings, dictionaries and integers back and forth into each other to achieve the desired effect.
WHAT'S HAPPENING TO YOU:
I'll use range instead of your count loop, but this mimics your issue:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + i)
----------
TypeError: Can't convert 'int' object to str implicitly
#TURNING MY INT INTO STR:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + str(i))
--------------------
site.com/page=0
site.com/page=1
site.com/page=2
site.com/page=3
site.com/page=4
site.com/page=5
site.com/page=6
site.com/page=7
site.com/page=8
site.com/page=9
site.com/page=10
As to the error: when you increment the count with "+ 1" and then build the entire URL from it, you are trying to make a string out of an integer. Simply turn the integer into a string before constructing your URL, while keeping the counter itself an integer so it can still be incremented.
My go-to way to keep my code as clean as possible is much cleaner: add an extra file in the root or current working folder from which you start the crawl, containing all the URLs you wish to scrape, and then use Python's file-reading functions to open it inside your spider script, like this:
class xSpider(BaseSpider):
    name = "w.e"
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
What really bothers me about your error is that it says you're concatenating a string with an integer. If you need further help, ask again with a complete snippet of your spider and, in the spirit of coders' kinship, your settings.py as well, because I'll tell you right now that, despite any adjustments to the settings.py file, you won't be able to scrape Google search pages (or at least not the full number of result pages). For that I would recommend Scrapy in conjunction with Beautiful Soup.
The immediate problem I see is that you are making a dict when it expects a list. :) Change it to a list.
There are also all kinds of interactions depending on which underlying spider you inherited from (if you did at all). Try switching to a list, then update the question with more data if you are still having problems.
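Putting both answers together, a minimal sketch (the spider class name and the literal keyword list are illustrative stand-ins for the pandas column from Method 1) that builds start_urls as a list of string URLs with a scheme:
import scrapy

# In practice this would be the pandas column from Method 1;
# a small literal list keeps the sketch self-contained.
keywords = ['apples', 'oranges', 'pears']

class KeywordSpider(scrapy.Spider):
    name = "SPIDER"
    # A LIST of string URLs (not a dict keyed by ints), each with a scheme
    start_urls = ['http://google.com?q=' + str(kw) for kw in keywords[:100]]

    def parse(self, response):
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)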

How to pass some information between parse_item calls?

OK, imagine a website with some list. The items of this list have one piece of the information we need. The second piece is located at some other URL, which is unique from item to item.
Currently our crawler opens a list page, scrapes each item, and for each item it opens that 2nd URL and gets the 2nd piece of the info from there. We use the requests lib, which is excellent in almost all cases, but now it seems slow and inefficient. It looks like the whole Twisted reactor is being blocked until one 'requests' request ends.
pseudo-code:
def parse_item():
    for item in item_list:
        content2 = requests.get(item['url'])
We can't just let Scrapy parse these 2nd URLs, because we need to 'connect' the first and the second URL somehow. Something like Redis would work, but hey, is there any better (simpler, faster) way to do that in Scrapy? I can't believe things must be so complicated.
You can do this by passing a variable in meta.
For example:
req = Request(url='http://somedomain.com/path', callback=myfunc)
req.meta['var1'] = 'some value'
yield req
And in your myfunc, you read the passed variable as:
myval = response.request.meta['var1']
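Applied to the question's scenario, a sketch of carrying the half-built item through meta so Scrapy handles the second request asynchronously (the spider name, URLs, and CSS selectors here are illustrative, not from the original post):
import scrapy

class ListSpider(scrapy.Spider):
    name = 'list_spider'
    start_urls = ['http://somedomain.com/list']

    def parse(self, response):
        # List page: build each item with the first piece of information
        for row in response.css('li.item'):
            item = {'first_piece': row.css('span.name::text').extract_first()}
            detail_url = row.css('a::attr(href)').extract_first()
            # 'Connect' the half-built item to the request for its 2nd URL
            req = response.follow(detail_url, callback=self.parse_detail)
            req.meta['item'] = item
            yield req

    def parse_detail(self, response):
        # Detail page: fill in the second piece and emit the completed item
        item = response.meta['item']
        item['second_piece'] = response.css('h1::text').extract_first()
        yield item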