How to get an XPath query working in the Google Sheets IMPORTXML function? - google-sheets-api

I'm trying to grab a value from a page (https://www.valueresearchonline.com/stocks/1764/infosys-ltd?utm_source=direct-click&utm_medium=stocks&utm_term=&utm_content=Infosys&utm_campaign=vro-search#snapshot). This is the relevant HTML:
I've made the following query to try and work with the subsequent HTML:
Essential Checks
Altman Z-Score
=IMPORTXML($A$2,"//*[@id='z-score']/div/div[2]/div/div")
A2 contains the relevant URL.
I think the XPath is correct, but I'm not sure why it doesn't give me the result.

According to the IMPORTXML documentation:
IMPORTXML imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.
In other words, the =IMPORTXML() command you are using reads the HTML source of the page as the server sends it and does not execute any JavaScript associated with it.
Because the website you are trying to import the data from is dynamic (its content is filled in by JavaScript), the results you are getting are not the expected ones. In this case, unfortunately, IMPORTXML() cannot be used.
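To see why the formula comes back empty, it can help to run the same kind of XPath against the markup as served and against the markup after scripts have run. The sketch below uses Python's standard xml.etree.ElementTree on two made-up, well-formed snippets. STATIC_HTML, RENDERED_HTML and the text "VALUE" are placeholders, not the real page, and PATH is the ElementTree spelling of an XPath shaped like the one in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as the server might send it: the container is empty
# because a script is expected to fill it in later, inside the browser.
STATIC_HTML = """
<html><body>
  <div id="z-score"></div>
  <script>/* would populate z-score in a browser */</script>
</body></html>
"""

# Hypothetical markup after the script has run ("VALUE" is a placeholder).
RENDERED_HTML = """
<html><body>
  <div id="z-score"><div><div>Altman Z-Score</div><div><div><div>VALUE</div></div></div></div></div>
</body></html>
"""

# ElementTree paths are relative, hence the leading "." before "//".
PATH = ".//*[@id='z-score']/div/div[2]/div/div"


def xpath_texts(markup, path):
    """Return the stripped text of every element matched by path."""
    root = ET.fromstring(markup.strip())
    return [(el.text or "").strip() for el in root.findall(path)]


print(xpath_texts(STATIC_HTML, PATH))    # nothing to match before scripts run
print(xpath_texts(RENDERED_HTML, PATH))  # matches once the content exists
```

A tool that never runs JavaScript only ever sees something like the first snippet, so even a correct path has nothing to match.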

Related

How to access all text from a website, including the a tag?

I'm trying to extract all the article text from the following site:
https://www.phonearena.com/reviews/Samsung-Galaxy-S9-Plus-Review_id4494
I tried findAll(text=True), but it extracts a lot of useless information.
So I did findAll(text=True, recursive=False), but it ignores text data in certain tags like <a>. What's the most effective way of extracting the text in this case?
The website appears to load its body content with JavaScript after requests has already retrieved the HTTP response, so the response does not contain the article text. You need to simulate a real page request in a browser. The Python module Selenium WebDriver makes that possible.
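Once the rendered HTML is available (for instance from Selenium's driver.page_source), the text still has to be collected from every tag, including links. The following is a minimal sketch that uses the standard library's html.parser rather than BeautifulSoup. TextExtractor and visible_text are illustrative names, and the sample markup is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from all tags (including <a>), skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = '<p>The <a href="#">Galaxy S9+</a> review</p><script>var x = 1;</script>'
print(visible_text(sample))
```

Restricting the input to the article container first would cut down on the navigation and footer text that a whole-page extraction picks up.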

regular expression does nothing in import.io

I'm trying to figure out how to use regular expressions on import.io. I have an HTML column that successfully pulls data from a link on the web page. I want to extract just part of the querystring on the link, so I go to the regexp field and enter a regular expression that tests successfully on regex101.com. The problem is, the extracted data does not change at all. In fact, I can type complete gibberish in the regexp field and it has absolutely no effect on the extracted data. I'm a bit mystified.
If my regular expression is wrong, shouldn't the extracted data change to nothing? Is there some trick to using the regexp field? Do I have to enter something in the xpath field? I clicked on View JSON button and copied the xpath for this column there and pasted that into the manual xpath box, but that didn't change anything either.
Is there a tutorial somewhere for how to use the regexp field? And I'm not asking about how to use regular expressions, just the interface for it on import.io.
Grant,
You are correct. At the moment it is not possible to apply regexp to HTML columns. There is a post in the idea forum capturing this as a feature request. You may want to upvote it; that way you'd also be notified if the idea gets built:
http://support.import.io/forums/199278-ideas-forum/suggestions/6328279-apply-regular-expressions-to-html

Get output of a template call in a page from MediaWiki API

I am trying to parse a page on a wikia to get additional information for an Infobox Book template that is on the page. The problem is that I can only get the template's source instead of the transformed template on the page.
I'm using the following url as a base:
http://starwars.wikia.com/api.php?format=xml&action=expandtemplates&text={{Infobox%20Book}}&generatexml=1
The documentation doesn't really tell me how to point it to a specific page and parse the transformed template from the page. Is this even possible or do I need to parse it all myself?
To expand a template with the parameters from a given page, you will have to provide those parameters. There is no way for the API to know how the template is used in different pages (it could even be used twice!).
This works:
action=expandtemplates&text={{Infobox Book|book name=Lost Tribe of the Sith: Skyborn}}
You will, of course, have to keep adding all the parameters you want to parse (there are 14 in your example).
If you have templates that change automatically depending on which page they are on (that is not the case here), e.g. by making use of magic words such as {{PAGENAME}}, you can add &page=Lost_Tribe_of_the_Sith:_Skyborn to your API call to set the context the template should be expanded in.
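Building the expandtemplates call with explicit parameters by hand gets tedious, mostly because of URL encoding. Here is a small sketch, assuming the api.php base URL from the question. expandtemplates_url is a hypothetical helper and only covers the text parameter.

```python
from urllib.parse import urlencode

API = "http://starwars.wikia.com/api.php"  # base URL taken from the question


def expandtemplates_url(template, params, api=API):
    """Build an expandtemplates URL for {{template|key=value|...}}."""
    parts = [template] + [f"{key}={value}" for key, value in params.items()]
    wikitext = "{{" + "|".join(parts) + "}}"
    query = urlencode({
        "format": "xml",
        "action": "expandtemplates",
        "text": wikitext,  # urlencode escapes braces, pipes and spaces
    })
    return f"{api}?{query}"


url = expandtemplates_url(
    "Infobox Book",
    {"book name": "Lost Tribe of the Sith: Skyborn"},
)
print(url)
```

Further parameters go into the same dictionary, one entry per template argument.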
If you do not know the parameters given, you can either:
1. Render the whole page with index.php?action=render&title=Lost_Tribe_of_the_Sith:_Skyborn, and parse the returned HTML to carve out the actual infobox
2. Fetch (action=query&prop=revisions) and parse the wikicode to get the parameters to the template, and supply them to the expandtemplates call
3. Start using an extension like Semantic MediaWiki, which allows you to treat your wiki more like a database
1 and 2 can go wrong in any number of ways, of course, as with a wiki you have, by definition, no way of knowing that the content is always entered in a consistent way.
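For the second option (fetch the wikicode and pull the parameters out of it), a rough sketch of the parsing step might look like this. It is a simplification, not a full wikitext parser. template_params and _split_top_level are hypothetical helpers, the sample wikicode is invented, and a name that is a prefix of another template name would match too.

```python
def _split_top_level(body):
    """Split on '|' characters that are not inside nested {{..}} or [[..]]."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            i += 2
        else:
            if body[i] == "|" and depth == 0:
                parts.append(body[start:i])
                start = i + 1
            i += 1
    parts.append(body[start:])
    return parts


def template_params(wikitext, name):
    """Return named parameters of the first {{name ...}} call, or {}."""
    begin = wikitext.find("{{" + name)
    if begin == -1:
        return {}
    depth, i = 0, begin
    while i < len(wikitext):  # walk to the matching closing braces
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                break
        else:
            i += 1
    if depth != 0:
        return {}  # unbalanced braces: give up rather than guess
    body = wikitext[begin + 2:i - 2]
    params = {}
    for part in _split_top_level(body)[1:]:  # first part is the template name
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return params


sample = (
    "Intro text\n{{Infobox Book\n"
    "|book name=Lost Tribe of the Sith: Skyborn\n"
    "|example link=[[Some Page|some label]]\n}}\nMore text"
)
print(template_params(sample, "Infobox Book"))
```

The resulting dictionary can then be supplied as the template parameters of an expandtemplates call.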

How does the archive.org API work?

As you surely know, web.archive.org lets you inspect the history of a domain, e.g. http://web.archive.org/web/*/besttatoo.com
It also has an API: http://archive.org/help/json.php
I need to get data from the API, but I can't find much information on how to use it. Has anyone used it who can paste some usage examples?
This link provides details about the item LovingU on archive.org:
http://archive.org/details/LovingU&output=json
To create an API query to your liking, use this page:
https://archive.org/advancedsearch.php#raw
That page allows you to choose your output format (JSON, XML, HTML, CSV or RSS) and also the parameters you want to see. You can limit the number of results, too.
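A query of the kind that page generates can also be assembled in code. The sketch below is an assumption about the URL shape: the parameter names q, fl[], rows and output are what I believe the form produces, so check them against a URL generated by the page itself. advancedsearch_url is a hypothetical helper, and identifier:LovingU is just an example query built from the item named above.

```python
from urllib.parse import urlencode

SEARCH = "https://archive.org/advancedsearch.php"


def advancedsearch_url(query, fields, rows=10, output="json"):
    """Build a search URL; parameter names are assumed, see the note above."""
    pairs = [("q", query)]
    pairs += [("fl[]", field) for field in fields]  # one fl[] per field
    pairs += [("rows", rows), ("output", output)]
    return SEARCH + "?" + urlencode(pairs)


url = advancedsearch_url("identifier:LovingU", ["identifier", "title"], rows=5)
print(url)
```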

Programmatic access to On-Line Encyclopedia of Integer Sequences

Is there a way to search and retrieve the results from On-Line Encyclopedia of Integer Sequences (http://oeis.org) programmatically?
I have searched their site and the results are always returned in HTML. They do not seem to provide an API, but in the policy statement they say it's acceptable to access the database programmatically. But how to do it without screen scraping?
Thanks a lot for your help.
The OEIS now provides several points of access, not just ones using their internal format. These seem largely undocumented, so here are all of the endpoints that I have found:
https://oeis.org/search?fmt=json&q=<sequenceTerm>&start=<itemToStartAt>
Returns a JSON formatted response of the results found from the sequenceTerm given. If too many results were returned, count will be > 0 whilst results will be null. If no results were returned, count will be 0. itemToStartAt is used for pagination of results, as only a maximum of 10 are ever returned. This starts at 0. If you wanted to return a second page of results, this would equal 10. Information about what each of the entries means can be found here.
https://oeis.org/search?fmt=text&q=<sequenceTerm>&start=<itemToStartAt>
Exactly the same arguments as before; however, this returns the results in the OEIS internal format, which is largely written about here. Unless your project requires it, I'd highly recommend using the JSON format over this.
https://oeis.org/search?fmt=<json|text>&q=id:A<sequenceNumber>
Will return a single result if the sequenceNumber is found. This is the suggested method for obtaining single sequences, as it appears to be far more optimised than some of the alternative methods that can be used as queries. Requests often take under a second. Alternative search query methods can be found on this page.
https://oeis.org/A<sequenceNumber>/graph?png=1
This endpoint can be used to grab the images used to graph the data points. Alternatively, setting png to zero returns the HTML page containing a graph of it.
https://oeis.org/recent.txt
This returns a list of recently updated entries in the OEIS internal format. There are no parameters available, and no JSON format, as this seems to be a static text file that is simply served to the client. Because replies from the OEIS database can be slow (for some sequences, replies can take more than five seconds), I'd highly recommend caching requests heavily and using the above endpoint to update them when they change.
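The search endpoints above can be wrapped in a couple of small helpers. This is only a sketch based on the behaviour described in this answer (a count field, a results field that may be null, and pages of at most 10). The reply format may have changed since, and search_url, sequence_url and summarize are illustrative names.

```python
import json
from urllib.parse import urlencode

BASE = "https://oeis.org/search"


def search_url(query, start=None, fmt="json"):
    """URL for a search; start is the 0-based offset used for pagination."""
    params = {"fmt": fmt, "q": query}
    if start is not None:
        params["start"] = start
    return BASE + "?" + urlencode(params)


def sequence_url(number, fmt="json"):
    """URL for a single sequence, e.g. number=32 for A000032."""
    return search_url(f"id:A{number:06d}", fmt=fmt)


def summarize(payload):
    """Interpret a decoded JSON reply using the fields described above."""
    count = payload.get("count", 0)
    results = payload.get("results")
    if count == 0:
        return "no results"
    if results is None:
        return "too many results"
    return f"{len(results)} of {count} results on this page"


print(search_url("2,5,14,50,233", start=10))  # second page of a search
print(sequence_url(32))
print(summarize(json.loads('{"count": 0, "results": null}')))
```

Fetching these URLs is left out on purpose. Given the response times mentioned above, any real client would want a cache in front of the request.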
A URL of the form http://oeis.org/search?fmt=text&q=2,5,14,50,233 gives a nicely formatted text output.
But it seems there is no way to get a single sequence in text form.
If you happen to use Mathematica, it sounds like the following notebook might help. It allows you to specify a sequence and automatically import a detailed list of matching entries from the OEIS:
http://www.brotherstechnology.com/math/oeis_mathematica.html
It looks like direct use of their CGI program is the only API they provide.
URL for searching the database:
https://oeis.org/search?q=id:A000032&fmt=text
This gives the plain text form of an entry in their internal format:
https://oeis.org/eishelp1.html