Shopify API: Paginate Through Orders to Get an Accurate Count

In the Shopify Orders API, I use
/admin/api/2021-01/orders/count.json
to get the order count, and from there I want to fetch all of the orders. Following the REST API documentation, I used two endpoints to do this:
/admin/api/2021-01/orders.json?status=any
/admin/api/2021-01/orders.json?limit=250&status=any; rel=next
First I request the orders using the first endpoint, which gives me up to 50 orders/items in a list. Then, using the count as a limit (let's say I have 550 orders, which I got from the response of orders/count.json), I do:
accumulated = []
iteration = 0
while True:
    if len(accumulated) > count:
        break
    if iteration == 0:
        url = ...  # use the first URL
    else:
        url = ...  # use the second URL that has rel=next
    items = ...  # make a request for that URL and save each order item
    accumulated += items  # extend the accumulated list so we know when we hit the count
    iteration += 1
But for some reason I'm only getting a fraction of the count. Let's say out of a count of 550, I only get 350 that are not duplicates of each other. I'm thinking that maybe the second URL only requests the second page and doesn't proceed to the third page, so I was getting:
first iteration = first page
second iteration = second page
third iteration = second page
All of those go into the accumulated list, and the loop stops because of the condition that when accumulated exceeds count the loop ends.
How can I make it so that when I request the Orders endpoint in Shopify, I advance through the next pages properly?
I tried following Shopify's tutorial on making paginated requests, but it's unclear to me how to use it. There's this page_info variable that's hard for me to understand: where to find it and how to use it.

Hi! In the Shopify REST API you can get at most 250 orders per API call; if there are more orders, you get a Link response header containing the URL for your next page request.
The Link value appears in your response headers; you just need to read it and check for the rel="next" flag.
But keep in mind: when you hit that new URL and still have more orders to fetch, the Link header contains two URLs, one for the previous page and one for the next.
Run this snippet to get the cursor out of the Link header:
var flag = false;
var next_url = order.headers.link;
if (next_url) {
    flag = true;
    next_url = next_url.replace("<", "");
    next_url = next_url.replace(">", "");
    var next_url_array = next_url.split('; ');
    var link_counter_start = next_url_array[0].indexOf("page_info=") + 10;
    var link_counter_length = next_url_array[0].length;
    var next_cursor = "";
    var link_counter;
    for (link_counter = link_counter_start; link_counter < link_counter_length; link_counter++) {
        next_cursor += next_url_array[0][link_counter];
    }
}
That covers the very first API call. But if you have more than two pages, use the following code to separate the next link from the previous one and to set the next flag:
next_url = order.headers.link;
var next_url_array,
    link_counter_start, link_counter_length,
    link_counter;
if (next_url.includes(',')) {
    next_url = next_url.split(',');
    next_url = next_url[1];
}
next_url = next_url.replace("<", "");
next_url = next_url.replace(">", "");
next_url_array = next_url.split('; ');
link_counter_start = next_url_array[0].indexOf("page_info=") + 10;
link_counter_length = next_url_array[0].length;
next_cursor = "";
for (link_counter = link_counter_start; link_counter < link_counter_length; link_counter++) {
    next_cursor += next_url_array[0][link_counter];
}
if (next_url_array[1] != 'rel="next"') {
    flag = false;
}
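Since the question's pseudocode is Python, here is a minimal sketch of the same Link-header loop using the requests library; the shop domain and access token are placeholders, and requests conveniently exposes the parsed Link header as resp.links:

import requests

# Placeholders: substitute your shop's domain and API credentials
BASE = "https://YOUR-SHOP.myshopify.com/admin/api/2021-01"
HEADERS = {"X-Shopify-Access-Token": "YOUR-TOKEN"}

def fetch_all_orders():
    orders = []
    url = BASE + "/orders.json"
    params = {"status": "any", "limit": 250}
    while url:
        resp = requests.get(url, params=params, headers=HEADERS)
        resp.raise_for_status()
        orders.extend(resp.json()["orders"])
        # requests parses the Link header into resp.links; the "next"
        # entry, when present, already contains the page_info cursor
        next_link = resp.links.get("next")
        url = next_link["url"] if next_link else None
        params = None  # the next URL carries its own query string
    return orders

The loop simply ends when no rel="next" link remains, so there is no need to compare against orders/count.json at all.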

Related

How to scrape a data table that has multiple pages using selenium?

I'm extracting NBA stats from my Yahoo fantasy account. Below is the code I wrote in a Jupyter notebook using Selenium. Each page shows 25 players, for a total of 720 players. I wrote a for loop that scrapes players in increments of 25 instead of one by one.
for k in range(0, 725, 25):
    Players = driver.find_elements_by_xpath('//tbody/tr/td[2]/div/div/div/div/a')
    Team_Position = driver.find_elements_by_xpath('//span[@class="Fz-xxs"]')
    Games_Played = driver.find_elements_by_xpath('//tbody/tr/td[7]/div')
    Minutes_Played = driver.find_elements_by_xpath('//tbody/tr/td[11]/div')
    FGM_A = driver.find_elements_by_xpath('//tbody/tr/td[12]/div')
    FTM_A = driver.find_elements_by_xpath('//tbody/tr/td[14]/div')
    Three_Points = driver.find_elements_by_xpath('//tbody/tr/td[16]/div')
    PTS = driver.find_elements_by_xpath('//tbody/tr/td[17]/div')
    REB = driver.find_elements_by_xpath('//tbody/tr/td[18]/div')
    AST = driver.find_elements_by_xpath('//tbody/tr/td[19]/div')
    ST = driver.find_elements_by_xpath('//tbody/tr/td[20]/div')
    BLK = driver.find_elements_by_xpath('//tbody/tr/td[21]/div')
    TO = driver.find_elements_by_xpath('//tbody/tr/td[22]/div')
    NBA_Stats = []
    for i in range(len(Players)):
        players_stats = {'Name': Players[i].text,
                         'Position': Team_Position[i].text,
                         'GP': Games_Played[i].text,
                         'MP': Minutes_Played[i].text,
                         'FGM/A': FGM_A[i].text,
                         'FTM/A': FTM_A[i].text,
                         '3PTS': Three_Points[i].text,
                         'PTS': PTS[i].text,
                         'REB': REB[i].text,
                         'AST': AST[i].text,
                         'ST': ST[i].text,
                         'BLK': BLK[i].text,
                         'TO': TO[i].text}
    driver.get('https://basketball.fantasysports.yahoo.com/nba/28951/players?status=ALL&pos=P&cut_type=33&stat1=S_AS_2021&myteam=0&sort=AR&sdir=1&count=' + str(k))
The browser goes page by page, and after it's done I print out the results, but it only scrapes 1 player. What did I do wrong?
A picture of my code and the printed results
It's hard to see what the issue is without looking at the original page (can you provide a URL?), but looking at this:
next = driver.find_element_by_xpath('//a[@id = "yui_3_18_1_1_1636840807382_2187"]')
"1636840807382" looks like a JavaScript timestamp, so I would guess that the reference you've hardcoded there is dynamically generated, and the element "yui_3_18_1_1_1636840807382_2187" no longer exists.

Loop on scrapy FormRequest but only one item created

So I've tried to loop on a FormRequest that calls a function which creates, fills, and yields the item. Only problem: exactly one item is created, no matter how many times it loops, and I can't figure out why.
def access_data(self, res):
    # receive all IDs and request the infos
    res_json = res.body.decode("utf-8")
    res_json = json.loads(res_json)
    for a in res_json['data']:
        logging.warning(a['id'])
        req = FormRequest(
            url='https://my_url',
            cookies={my_cookies},
            method='POST',
            callback=self.fill_details,
            formdata={'valeur': str(a['id'])},
            headers={'X-Requested-With': 'XMLHttpRequest'}
        )
        yield req

def fill_details(self, res):
    logging.warning("annonce")
    item = MyItem()
    item['html'] = res.xpath('//body//text()')
    item['existe'] = True
    item['ip_proxy'] = None
    item['launch_time'] = str(mySpider.init_time)
    yield item
To make everything clear: when I run this, the log "annonce" is printed only once, while the logging of a['id'] in my request loop is printed many times, and I can't find a way to fix this.
I found the fix!
If anyone has the same problem: since my URL is always the same (only the formdata changes), Scrapy's duplicate filter takes over and drops the requests as duplicates.
Set dont_filter=True on the FormRequest to make it work.
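Applied to the loop in the question, that is a one-line change (a sketch; the URL and cookies placeholders are unchanged from the question):

req = FormRequest(
    url='https://my_url',
    cookies={my_cookies},
    method='POST',
    callback=self.fill_details,
    formdata={'valeur': str(a['id'])},
    headers={'X-Requested-With': 'XMLHttpRequest'},
    dont_filter=True,  # bypass Scrapy's duplicate filter for repeated URLs
)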

Function with loop for URLs won't return all of them

I am working on a project to pull details from state reports on restaurant inspections. Each inspection has its own URL. I am able to gather the values into a dictionary, but I can only return one at a time. In the call to the function, if I don't specify a single list entry I get an error: "'list' object has no attribute 'timeout'". If I ask for a specific entry, I get a good return. How can I get them all?
# loop through the url list to gather inspection details
detailsLib = {}

def get_inspect_detail(urlList):
    html = urlopen(urlList)
    soup = bs4.BeautifulSoup(html.read(), 'lxml')
    details = soup.find_all('font', {'face': 'verdana'})[10:]
    result = []
    for detail in details:
        siteName = details[0].text
        licNum = details[2].text
        siteRank = details[4].text
        detailsLib = {
            'Restaurant': siteName,
            'License': licNum,
            'Rank': siteRank,
        }
        result.append(detailsLib)
    return result

get_inspect_detail(urlList[21])
So I can get the 21st restaurant on the list, and I could repeat that 36 times, but that doesn't get all of them.
Another question for another day is where to do the clean-up. The details will need some regex work, but I'm unsure whether to do that inside the function (one at a time) or outside the function by pulling all values for a specific key from the dictionary.
Call get_inspect_detail() once per item in urlList, and save all the results.
all_results = []
for url in urlList:
    details = get_inspect_detail(url)
    all_results.extend(details)

Scrapy only show the first result of each page

I need to scrape the items on the first page, then click the next button to go to the second page and scrape it, and so on.
This is my code, but it only scrapes the first item of each page; if there are 20 pages, it visits every page and scrapes only the first item.
Could anyone please help me? Thank you, and apologies for my English.
class CcceSpider(CrawlSpider):
    name = 'ccce'
    item_count = 0
    allowed_domain = ['www.example.com']
    start_urls = ['https://www.example.com./afiliados value=&categoria=444&letter=']
    rules = {
        # rule for each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]/a')), callback='parse_item', follow=True),
    }

    def parse_item(self, response):
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//div[@class="news-col2"]/h2/text())').extract()
        ml_item['url'] = response.xpath('normalize-space(//div[@class="website"]/a/text())').extract()
        ml_item['correo'] = response.xpath('normalize-space(//div[@class="email"]/a/text())').extract()
        ml_item['descripcion'] = response.xpath('normalize-space(//div[@class="news-col4"]/text())').extract()
        self.item_count += 1
        if self.item_count > 5:
            # insert_table(ml_item)
            raise CloseSpider('item_exceeded')
        yield ml_item
As you haven't given a working target URL, I'm guessing a bit here, but most probably this is the problem:
parse_item should be a parse_page (and act accordingly).
Scrapy downloads a full page, which according to your description contains multiple items, and then passes it as one response object to your parse method.
It's your parse method's responsibility to process the whole page by iterating over the items displayed on it and creating multiple scraped items accordingly.
The Scrapy documentation has several good examples of this; one is here: https://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
Basically, your code structure in def parse_XYZ should look like this:
def parse_page(self, response):
    items_on_page = response.xpath('//...')
    for sel_item in items_on_page:
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = sel_item.xpath('...')  # relative XPath for this field
        # ... remaining fields ...
        yield ml_item
Insert the right XPaths for getting all items on the page, adjust your item XPaths to be relative, and you're ready to go.
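As a concrete sketch, assuming each affiliate is wrapped in a container like div.affiliate-item (a hypothetical class; inspect the real page for the element that wraps one item), note the leading "." that scopes each field's XPath to the current item:

def parse_page(self, response):
    for sel_item in response.xpath('//div[@class="affiliate-item"]'):  # hypothetical container
        ml_item = CcceItem()
        # the leading "." makes each XPath relative to the current item
        ml_item['nombre'] = sel_item.xpath('normalize-space(.//h2/text())').extract()
        ml_item['url'] = sel_item.xpath('normalize-space(.//div[@class="website"]/a/text())').extract()
        ml_item['correo'] = sel_item.xpath('normalize-space(.//div[@class="email"]/a/text())').extract()
        yield ml_item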

How to display playback_count for a user using the SoundCloud API

I am trying to display the total track views ("playback_count") for a specific SoundCloud user using the SoundCloud API.
According to the API documentation, I can get the info using the call below:
http://api.soundcloud.com/tracks/13158665.json?client_id=YOUR_CLIENT_ID
This works fine, but what is the number "13158665"? Is it the track id?
I need to get the "playback_count" for a user using the user's username, so I tried getting the user id from the username like this:
$soundcloud_playsAPI = "MY_SOUNDCLOUD_API_KEY";
/* Get the SoundCloud UserId from the username */
$json = wp_remote_get("http://api.soundcloud.com/users/jwagener.json?client_id=".$soundcloud_playsAPI);
$soundcloudData = json_decode($json['body'], true);
$soundcloud_userid = $soundcloudData['id'];
This returns the user id: 3207181.
Then I tried substituting that id into the previous URL to get the "playback_count", but it failed:
$json = wp_remote_get("http://api.soundcloud.com/tracks/3207181.json?client_id=".$soundcloud_playsAPI);
$soundcloudPlaysData = json_decode($json['body'], true);
echo $soundcloudPlaysData['playback_count'];
Any guidance would be greatly appreciated.
Thanks.
The first number is the id of a track; the second number is the id of a user.
Now that you have the user id, you will need to fetch each of the user's tracks and tally how many times they have been played.
First, get the id numbers for all tracks made by the user:
GET /users/{id}/tracks: list of the user's tracks
$json = wp_remote_get("http://api.soundcloud.com/users/3207181/tracks.json?client_id=".$soundcloud_playsAPI);
Now you have a list of track objects, so get each of those tracks and read the playback_count of each:
$json = wp_remote_get("http://api.soundcloud.com/tracks/track-id-here.json?client_id=".$soundcloud_playsAPI);
$soundcloudPlaysData = json_decode($json['body'], true);
echo $soundcloudPlaysData['playback_count'];
Here is the full solution, in JavaScript using the SoundCloud SDK:
function listPlays() {
    SC.initialize({ client_id: 'YOUR ID HERE' });

    // Get the SoundCloud user id from the username
    var userName = "jwagener";
    SC.get("/users/" + userName, function (users) {
        console.log(users.id);
        var myId = users.id;
        getTracks(myId);
    });

    var getTracks = function (myId) {
        var totalPlays = 0;
        SC.get("/users/" + myId + "/tracks", function (getTracks) {
            // get each track and look at its playback_count
            for (var key in getTracks) {
                console.log(getTracks[key].title + " " + getTracks[key].playback_count);
                totalPlays += getTracks[key].playback_count; // add this track's plays to the total
            }
            console.log("Total Plays for all tracks: " + totalPlays);
        });
    };
}
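For readers following the other threads in Python, the same tally can be sketched with requests against the endpoints used above; the client_id is a placeholder, and this assumes (as the PHP snippets show) that each track object in the list already carries playback_count:

import requests

CLIENT_ID = "YOUR_CLIENT_ID"  # placeholder, as elsewhere in the thread

def total_plays(user_id):
    # /users/{id}/tracks returns full track objects, each of which
    # already includes its playback_count
    url = "http://api.soundcloud.com/users/{}/tracks.json".format(user_id)
    tracks = requests.get(url, params={"client_id": CLIENT_ID}).json()
    return sum(track["playback_count"] for track in tracks)

print(total_plays(3207181))  # the user id resolved from the username earlier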