function with loop for urls won't return all of them - beautifulsoup

I am working on a project to pull details from state reports on restaurant inspections. Each inspection has its own URL. I am able to gather the values into a dictionary, but can only return one at a time. In the call to the function, if I don't specify a specific list entry, I get an error: 'list' object has no attribute 'timeout'. If I ask for a specific entry, I get a good return. How can I get them all?
# loop through the url list to gather inspection details
detailsLib = {}

def get_inspect_detail(urlList):
    html = urlopen(urlList)
    soup = bs4.BeautifulSoup(html.read(), 'lxml')
    details = soup.find_all('font', {'face': 'verdana'})[10:]
    result = []
    for detail in details:
        siteName = details[0].text
        licNum = details[2].text
        siteRank = details[4].text
        detailsLib = {
            'Restaurant': siteName,
            'License': licNum,
            'Rank': siteRank,
        }
        result.append(detailsLib)
    return result

get_inspect_detail(urlList[21])
So I can get the 21st restaurant on the list, repeating 36 times, but not all of them.
Another question for another day is where to do the clean-up. The details will need some regex work but I'm unsure whether to do that inside the function (one at a time), or outside the function by calling all values from a specific key in the library.

Call get_inspect_detail() once per item in urlList, and save all the results.
all_results = []
for url in urlList:
    details = get_inspect_detail(url)
    all_results.extend(details)
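If some of the report pages time out or fail to load, you can keep the loop going and record the failures instead of crashing. A minimal sketch of the same idea, assuming get_inspect_detail() and urlList are defined as above:

from urllib.error import HTTPError, URLError

all_results = []
failed_urls = []
for url in urlList:
    try:
        # get_inspect_detail() returns a list of dicts for one inspection page
        all_results.extend(get_inspect_detail(url))
    except (HTTPError, URLError) as exc:
        # skip unreachable pages but keep track of them for a later retry
        failed_urls.append((url, exc))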

Related

Shopify API Paginate Through Orders to Get Accurate Count

In the Shopify Orders API, I use
/admin/api/2021-01/orders/count.json
to get the order count, because I wanted to get all the orders. Following the REST API documentation, I used two endpoints to do this.
/admin/api/2021-01/orders.json?status=any
/admin/api/2021-01/orders.json?limit=250&status=any; rel=next
First I would request the orders using the first endpoint where I get up to 50 orders/items in a list.
Then I use that count as a limit. Let's say I have 550 orders according to the response of orders/count.json; I do:
accumulated = []
iter = 0
while True:
    if len(accumulated) > count:
        break
    if iter != 1:
        url = # use the first url
    else:
        url = # use the second url that has rel=next
    items = # make a request here for that url and save each order item
    accumulated += items  # this saves each list to the accumulated list so we know when we hit the count
But for some reason I'm only getting a fraction of the count. Let's say out of a count of 550, I only get 350 that are not duplicates of each other. I'm thinking that maybe the second URL only requests the second page and doesn't proceed to the third page. Hence I was doing
first iteration = first page
second iteration = second page
third iteration = second page
All of those get into the accumulated list, and the loop stops because of the condition that when accumulated exceeds count the loop breaks.
How can I make it so that when I request the Orders endpoint in Shopify, I move through the next pages properly?
I tried following Shopify's tutorial on making paginated requests, but it's unclear to me how to use it. There's this page_info variable that's hard for me to understand: where to find it and how to use it.
Hi! In the Shopify REST API you can get a maximum of 250 orders per API call, and if there are more orders you get a Link response header which contains the URL for your next page request.
In my response headers I have a link variable; you just need to get this link and check for the rel="next" flag.
But keep in mind that when you hit that new URL and still have more orders to fetch, the header contains a Link with two URLs: one for the previous page and one for the next.
Run this snippet to get the link from the headers:
var flag = false;
var next_url = order.headers.link;
if (next_url) {
    flag = true;
    next_url = next_url.replace("<", "");
    next_url = next_url.replace(">", "");
    var next_url_array = next_url.split('; ');
    var link_counter_start = next_url_array[0].indexOf("page_info=") + 10;
    var link_counter_length = next_url_array[0].length;
    var next_cursor = "";
    var link_counter;
    for (link_counter = link_counter_start; link_counter < link_counter_length; link_counter++) {
        next_cursor += next_url_array[0][link_counter];
    }
}
That works for the very first API call. But if you have more than two pages, use the following code to separate the next link from the previous one and to check the next flag:
next_url = order.headers.link;
var next_url_array,
    link_counter_start, link_counter_length,
    link_counter;
if (next_url.includes(',')) {
    next_url = next_url.split(',');
    next_url = next_url[1];
}
next_url = next_url.replace("<", "");
next_url = next_url.replace(">", "");
next_url_array = next_url.split('; ');
link_counter_start = next_url_array[0].indexOf("page_info=") + 10;
link_counter_length = next_url_array[0].length;
next_cursor = "";
for (link_counter = link_counter_start; link_counter < link_counter_length; link_counter++) {
    next_cursor += next_url_array[0][link_counter];
}
if (next_url_array[1] != 'rel="next"') {
    flag = false;
}
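For comparison, here is a rough Python sketch of the same Link-header pagination using the requests library, which parses the Link header into resp.links for you. The shop URL and access token are placeholders, and the exact auth header depends on how your app authenticates:

import requests

# placeholder shop URL and token; replace with your own credentials
BASE_URL = "https://your-shop.myshopify.com/admin/api/2021-01/orders.json"
HEADERS = {"X-Shopify-Access-Token": "your-access-token"}

def fetch_all_orders():
    orders = []
    url = BASE_URL
    params = {"limit": 250, "status": "any"}
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        orders.extend(resp.json()["orders"])
        # requests exposes the parsed Link header; follow rel="next" until it is gone
        next_link = resp.links.get("next")
        url = next_link["url"] if next_link else None
        params = None  # the page_info URL already carries its own query string
    return orders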

Loop on scrapy FormRequest but only one item created

So I've tried to loop on a FormRequest that calls my function that creates, fills and yields the item. The only problem: one and only one item is created, no matter how many times it loops, and I can't figure out why.
def access_data(self, res):
    # receive all IDs and request the infos
    res_json = (res.body).decode("utf-8")
    res_json = json.loads(res_json)
    for a in res_json['data']:
        logging.warning(a['id'])
        req = FormRequest(
            url='https://my_url',
            cookies={my_cookies},
            method='POST',
            callback=self.fill_details,
            formdata={'valeur': str(a['id'])},
            headers={'X-Requested-With': 'XMLHttpRequest'}
        )
        yield req

def fill_details(self, res):
    logging.warning("annonce")
    item = MyItem()
    item['html'] = res.xpath('//body//text()')
    item['existe'] = True
    item['ip_proxy'] = None
    item['launch_time'] = str(mySpider.init_time)
    yield item
To be sure everything is clear: when I run this, the log "annonce" is printed only one time, while my logging of a['id'] in my request loop is printed many times, and I can't find a way to fix this.
I found the way!
If anyone has the same problem: as my URL is always the same (only the formdata changes), Scrapy's duplicate filter takes over and drops the requests as duplicates.
Set dont_filter to True in the FormRequest to make it work.
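A minimal sketch of what the changed request loop might look like, reusing the names from the question (my_url, fill_details and the 'valeur' field are the asker's own; the cookies are omitted here):

import json
from scrapy import FormRequest

def access_data(self, res):
    res_json = json.loads(res.body.decode("utf-8"))
    for a in res_json['data']:
        yield FormRequest(
            url='https://my_url',
            method='POST',
            callback=self.fill_details,
            formdata={'valeur': str(a['id'])},
            headers={'X-Requested-With': 'XMLHttpRequest'},
            dont_filter=True,  # keep requests that share the URL but differ in formdata
        )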

Scrapy only show the first result of each page

I need to scrape the items of the first page and then go to the next button to go to the second page and scrape and so on.
This is my code, but it only scrapes the first item of each page; if there are 20 pages, it enters every page and scrapes only the first item.
Could anyone please help me.
Thank you
Apologies for my English.
class CcceSpider(CrawlSpider):
    name = 'ccce'
    item_count = 0
    allowed_domain = ['www.example.com']
    start_urls = ['https://www.example.com./afiliados value=&categoria=444&letter=']

    rules = {
        # rules for each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]/a')), callback='parse_item', follow=True),
    }

    def parse_item(self, response):
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//div[@class="news-col2"]/h2/text())').extract()
        ml_item['url'] = response.xpath('normalize-space(//div[@class="website"]/a/text())').extract()
        ml_item['correo'] = response.xpath('normalize-space(//div[@class="email"]/a/text())').extract()
        ml_item['descripcion'] = response.xpath('normalize-space(//div[@class="news-col4"]/text())').extract()
        self.item_count += 1
        if self.item_count > 5:
            # insert_table(ml_item)
            raise CloseSpider('item_exceeded')
        yield ml_item
As you haven't given a working target URL, I'm guessing a bit here, but most probably this is the problem:
parse_item should be a parse_page (and act accordingly)
Scrapy is downloading a full page which has - according to your description - multiple items and then passes this as a response object to your parse method.
It's your parse method's responsibility to process the whole page by iterating over the items displayed on the page and creating multiple scraped items accordingly.
The scrapy documentation has several good examples for this, one is here: https://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
Basically your code structure in def parse_XYZ should look like this:
def parse_page(self, response):
    items_on_page = response.xpath('//...')
    for sel_item in items_on_page:
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = # ...
        # ...
        yield ml_item
Insert the right xpaths for getting all items on the page and adjust your item xpaths and you're ready to go.
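Following the relative-XPath advice from the linked docs, a rough sketch of what parse_page could look like. The container XPath and the use of extract_first() are assumptions; the real selectors depend on the page structure:

def parse_page(self, response):
    # hypothetical container selector; adjust it to whatever wraps one listing
    for sel_item in response.xpath('//li[contains(@class, "views-row")]'):
        ml_item = CcceItem()
        # the leading "." keeps each XPath relative to sel_item rather than the
        # whole document, otherwise every item would repeat the first match
        ml_item['nombre'] = sel_item.xpath('normalize-space(.//div[@class="news-col2"]/h2/text())').extract_first()
        ml_item['url'] = sel_item.xpath('normalize-space(.//div[@class="website"]/a/text())').extract_first()
        ml_item['correo'] = sel_item.xpath('normalize-space(.//div[@class="email"]/a/text())').extract_first()
        yield ml_item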

Web page is showing weird unicode(?) letters: \u200e

How can I remove that? I tried so many things and I am exhausted from trying to defeat this error by myself. I spent the last 3 hours looking at this and trying to get through it, and I surrender to this code. Please help.
The first "for" statement grabs article titles from news.google.com.
The second "for" statement grabs the time of submission of each article on news.google.com.
This is on Django, by the way, and this page shows the list of article titles and their time of submission, going down. The weird unicode letters are popping up from the second "for" statement, which is the time submissions. Here is my views.py:
def articles(request):
    """ Grabs the most recent articles from the main news page """
    import bs4, requests
    list = []
    list2 = []
    url = 'https://news.google.com/'
    r = requests.get(url)
    try:
        r.raise_for_status() == True
    except ValueError:
        print('Something went wrong.')
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    for (listarticles) in soup.find_all('h2', 'esc-lead-article-title'):
        if listarticles is not None:
            a = listarticles.text
            list.append(a)
    for articles_times in soup.find_all('span', 'al-attribution-timestamp'):
        if articles_times is not None:
            b = articles_times.text
            list2.append(b)
    list = zip(list, list2)
    context = {'list': list}
    return render(request, 'newz/articles.html', context)
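No answer is included above, but \u200e is the invisible LEFT-TO-RIGHT MARK character (U+200E). One common fix, sketched here against the second loop from the question, is simply to strip it from the text before storing it:

for articles_times in soup.find_all('span', 'al-attribution-timestamp'):
    if articles_times is not None:
        # U+200E is the invisible LEFT-TO-RIGHT MARK; remove it and trim whitespace
        b = articles_times.text.replace('\u200e', '').strip()
        list2.append(b)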

Paypal ipn/cart variables

Wondered if anyone could give me any pointers?
The website I'm currently building sells 'event' tickets... holidays.
What I'm trying to do is decrease the number of tickets in a database field by the number purchased, BUT I can't find a PayPal cart variable which will pull the information back. The custom variable, it seems, can only be used once, and the item_name and item_number variables are already being used: the item_number to identify the 'event_id' field in the database and the item_name, obviously, to identify the event name.
I can pass the correct number of tickets to be updated over to PayPal by decreasing the amount and echoing that out in a hidden form field prior to submitting to PayPal, BUT I can't get the results back. I'm looking for a PayPal form 'name' field that can be adapted to my needs, if one exists.
Below is the database loop. The 'custom' variable doesn't work as it can't be custom1, custom2, etc.
mysql_connect('xxxxxxxx', 'xxxxxx', 'xxxxxxxx') or exit(0);
mysql_select_db('xxxxxx') or exit(0);

$num_cart_items = $_POST['num_cart_items'];
$i = 1;
while (isset($_POST['item_number'.$i])) // read the item details
{
    $item_ID[$i] = $_POST['item_number'.$i];
    $custom[$i] = $_POST['custom'.$i];
    $i++;
}
$item_count = $i - 1;
for ($j = 1; $j <= $item_count; $j++)
{
    $struery = "UPDATE events SET event_tickets = '".$custom[$j]."' WHERE event_id = '".$item_ID[$j]."'";
    $result = mysql_query($struery) or die("Cart - paypal_cart_info, Query failed:<br>" . mysql_error() . "<br>" . mysql_errno());
    $i++;
} // end database loop
Thanks for any info.
Os
You can pass the custom variable as a JSON-encoded string and then decode it:
$i = 1;
$custom = json_decode($_POST['custom'], TRUE); // decode the JSON-encoded custom field into an array
while (isset($_POST['item_number'.$i])) // read the item details
{
    $item_ID[$i] = $_POST['item_number'.$i];
    // $custom[$i] now holds the per-item value decoded from the JSON string
    $i++;
}
Or you can even base64_encode it when sending it to PayPal and then decode it back, if you want the variable not to be readable when you redirect the user to PayPal.