Retrieving data from a multi-page API using Talend

I have an API with 59 pages, each containing 1000 rows of data. I would like to retrieve all that data and store it in Microsoft SQL Server.
When I use a tLoop with the condition i < 59, it returns the first 1000 rows of data 59 times, which is clearly not what I need.
I have tried to create a global variable next_page, but I do not know how to connect it to the next_page element in the API response, so that when "next_page" = "" the program knows to break the loop.

I had a similar case (the difference being that I didn't have a "nextPage" element but a "nextLink", which gave me the complete URL of the next page).
I created a global variable "endJob" with the value "false" at the beginning (in a tJava right before the tLoop).
My tLoop starts from int i=1, the iteration step is i++, and the condition is !endJob (so it loops as long as the job is not marked as ended).
In a tJava right after the tLoop, build the URL for your API request using your page number, which is tLoop_1_CURRENT_ITERATION.
Then, after my tRestClient, I put a tReplicate: the first flow is for your needed transformations, the other one retrieves only the "nextPage" item. If nextPage is empty, you update the "endJob" variable to "true": you want to stop the loop. The loop logic is sketched below.
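For clarity, here is the same control flow sketched in Python (the endpoint, the "rows" field, and the "next_page" field are hypothetical stand-ins; in Talend the flag would live in globalMap rather than a local variable):

import requests

page = 1
end_job = False
rows = []
while not end_job:  # mirrors tLoop's !endJob condition
    # build the request URL from the page number (tLoop_1_CURRENT_ITERATION)
    body = requests.get(f"https://api.example.com/data?page={page}").json()
    rows.extend(body["rows"])        # your transformations would go here
    if not body.get("next_page"):    # empty next_page: mark the job as ended
        end_job = True
    page += 1                        # tLoop's i++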

Retry until with dynamic path or params

I have a use-case where I need to first call an API to get a list of IDs. From this response a random ID is chosen. Next I call a 2nd API which uses the random ID as a component of the path.
It's possible the 2nd API call returns an empty response, so I want to use retry until but with a different random ID in the path on each retry iteration.
I've tried a couple of things:
First "in-lining" the JS function in the path to get a random ID:
Given path firstPart, myGetRandomId(idList), lastPart
And retry until response.length > 1
Second, I tried putting the JS function in an Examples: table as part of a Scenario Outline:
Given path firstPart, <ID>, lastPart
And retry until response.length > 1
Examples:
| ID |
| myGetRandomId(idList) |
The general issue I can't figure out is how to get the JS function to evaluate in either of these "in-line" approaches.
Any ideas/suggestions appreciated.
Karate's retry until re-plays the request as-is; you can't modify it between retries.
So you have to take a different approach: use a JS loop. Look at this example in the demos:
https://github.com/intuit/karate/blob/master/karate-demo/src/test/java/demo/polling/polling.feature
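The demo drives the retries from a loop instead of retry until. Purely to illustrate the control flow (this is not Karate syntax), here is the same idea in Python, with hypothetical URL parts and helper names:

import random
import requests

def call_until_nonempty(id_list, first_part, last_part, max_attempts=10):
    # try a different randomly chosen ID on each attempt,
    # until the 2nd API returns a non-empty response
    candidates = list(id_list)
    for _ in range(min(max_attempts, len(candidates))):
        chosen = random.choice(candidates)
        candidates.remove(chosen)  # never retry the same ID
        resp = requests.get(f"{first_part}/{chosen}/{last_part}")
        if resp.json():            # non-empty: we are done
            return chosen, resp.json()
    return None, None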

Extract portion of HTML from website?

I'm trying to use VBA in Excel to navigate a site with Internet Explorer and download an Excel file for each day.
After looking through the HTML code of the site, it looks like each day's page has a similar structure, but one portion of the page's link seems completely random. This seemingly random part stays constant for a given page, though: it does not change each time you load that page.
The following portion of the HTML code contains the unique string:
<a href="#" onClick="showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;
The part starting with "b1a" is what is used in the website link. Is there any way to extract this part of the page and assign it as a variable that I then can use to build my website link?
Since you don't show your code, I will also answer in general terms:
1) You get all the elements of type link (<a>) with Set allLinks = ie.document.getElementsByTagName("a"). This gives you a collection of length n containing all the anchor elements in the document.
2) You locate the precise link containing the information you want. Let's imagine it's the 4th one (you can parse the properties to check which one it is, in case it's dynamic):
Set myLink = allLinks(3) '<- 4th : index = 3 (starts from zero)
3) You get your token with a simple Split: the onClick text is cut at each apostrophe, and the token is the 4th piece (index 3):
myToken = Split(myLink.onClick, "'")(3)
Of course you can be more concise if the position of the link containing the token is always the same, e.g. always the 4th link:
myToken = Split(ie.document.getElementsByTagName("a")(3).onClick, "'")(3)
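As a quick sanity check of that index, here is the same split-on-apostrophe logic in Python, applied to the onClick text from the question:

onclick = "showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;"
token = onclick.split("'")[3]
print(token)  # b1a9134c02c5db3c79e649b7adf8982d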

Token for the next page in a pagination API

I have a list of records on the server, sorted by a key, and a pagination API that returns the list segment by segment. Since items can be inserted in the middle of the list, I return the first key of the next page as a pagination token, which has to be passed back to get the next page.
However, I've found that DynamoDB's query API instead uses the last key of the current page (LastEvaluatedKey), which is null if the next page does not exist.
Question:
What are pros and cons between using the last item of the current page and the first item of the next page as a pagination token?
N.B.:
To me, returning the first item is more intuitive, since it's null only if the next page does not exist.
Using the "last item of the current page" (LICP) is better than using the "first item of the next page" (FINP) because it deals better with the possibility that, in the meantime, some item is inserted between these two items.
For example, suppose the first page contains 3 alphabetically ordered names: Adam/Basil/Claude, and suppose the next page is Elon/Francis/Gilbert.
Then with LICP the token is Claude, while with FINP the token is Elon. If no new names are inserted, the result is the same when we get the next page.
However, suppose we insert the name Daniel after getting the first page but before getting the second page. In this case, when we get the second page with LICP we get Daniel/Elon/Francis, while with FINP we get Elon/Francis/Gilbert. That is to say, FINP will miss Daniel, while LICP will not.
Also, FINP may consume more computing resources than LICP, since you must retrieve one extra item (4 items, in the above example, instead of only 3).
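A minimal Python sketch of that example, modelling the two token conventions (page size 3, names as in the text):

def next_page_licp(items, token, size=3):
    # LICP: the token is the last item already seen; page starts strictly after it
    return [x for x in sorted(items) if x > token][:size]

def next_page_finp(items, token, size=3):
    # FINP: the token is the first item of the next page; page starts at it
    return [x for x in sorted(items) if x >= token][:size]

names = ["Adam", "Basil", "Claude", "Elon", "Francis", "Gilbert"]
names.append("Daniel")  # inserted between the two page fetches

print(next_page_licp(names, "Claude"))  # ['Daniel', 'Elon', 'Francis']
print(next_page_finp(names, "Elon"))    # ['Elon', 'Francis', 'Gilbert'] - Daniel is missed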

Linkedin REST API - How to return more job bookmarks / records each call

I'm trying to get all my job bookmarks (30+) via the LinkedIn REST API, but every call returns the exact same 10 records, 10 max.
GET https://api.linkedin.com/v1/people/~/job-bookmarks
Then I found the docs: https://developer.linkedin.com/docs/rest-api
It seems that I can pass the parameter count: the maximum number of items you want included in the result set. So I thought maybe I could just add that at the end of the GET URL...
New query: GET https://api.linkedin.com/v1/people/~/job-bookmarks&count=30
then I got an error - 400 Bad Request
Does someone know how to solve this problem? Many thanks!
You need to start the query string with a '?' instead of '&'.
https://api.linkedin.com/v1/people/~/job-bookmarks?count=30
You use '&' to separate query parameters. For example, if you wanted to page through all the job bookmarks, you could use the 'start' parameter to do offset paging. Since 'start' is a zero-based offset, the path to get the next page if you have more than 30 bookmarks would look like this:
https://api.linkedin.com/v1/people/~/job-bookmarks?count=30&start=30
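Putting the two parameters together, a short Python sketch of the paging loop (the v1 API is long deprecated; the Bearer token and the "values" envelope field are assumptions here):

import requests

BASE = "https://api.linkedin.com/v1/people/~/job-bookmarks"
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN",  # hypothetical token
           "x-li-format": "json"}

start, count = 0, 30
bookmarks = []
while True:
    resp = requests.get(BASE, headers=headers,
                        params={"count": count, "start": start})
    resp.raise_for_status()
    values = resp.json().get("values", [])  # assumes the v1 collection envelope
    bookmarks.extend(values)
    if len(values) < count:                 # short page: nothing left to fetch
        break
    start += count                          # zero-based offset of the next page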

How to pass some information between parse_item calls?

OK, imagine a website with a list. Each item of this list carries one piece of the information we need. The second piece is located at some other URL, unique from item to item.
Currently our crawler opens the list page, scrapes each item, and for each item it opens that 2nd URL and gets the 2nd piece of the info from there. We use the requests lib, which is excellent in almost all cases, but here it's slow and ineffective: the whole Twisted reactor appears to be blocked until one 'requests' call ends.
pseudo-code:
def parse_item(self, response):
    for item in item_list:
        # blocking call: the Twisted reactor stalls until requests.get() returns
        content2 = requests.get(item['url'])
We can't just let Scrapy parse these 2nd URLs on its own, because we need to 'connect' the first and the second URL somehow. Something like Redis would work, but hey, is there any better (simpler, faster) way to do that in Scrapy? I can't believe things must be so complicated.
You can do this by passing a variable in meta.
For example:
req = Request(url='http://somedomain.com/path', callback=self.myfunc)
req.meta['var1'] = 'some value'
yield req
And in your myfunc, you read the passed variable as:
myval = response.meta['var1']
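Tying it back to the question, here is a minimal spider sketch (hypothetical URLs, selectors, and field names) that connects the two pieces through meta, so Scrapy schedules the 2nd requests concurrently instead of blocking on them:

import scrapy

class ItemSpider(scrapy.Spider):
    name = 'items'
    start_urls = ['http://somedomain.com/list']  # hypothetical list page

    def parse(self, response):
        for row in response.css('li.item'):      # hypothetical selector
            item = {'first_piece': row.css('::text').get()}
            detail_url = row.css('a::attr(href)').get()
            # carry the half-filled item to the detail callback via meta
            yield response.follow(detail_url, callback=self.parse_detail,
                                  meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']             # same dict, first piece intact
        item['second_piece'] = response.css('div.info::text').get()
        yield item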