regular expression does nothing in import.io

I'm trying to figure out how to use regular expressions in import.io. I have an HTML column that successfully pulls data from a link on the web page. I want to extract just part of the query string on the link, so I go to the regexp field and enter a regular expression that tests successfully on regex101.com. The problem is, the extracted data does not change at all. In fact, I can type complete gibberish into the regexp field and it has absolutely no effect on the extracted data. I'm a bit mystified.
If my regular expression is wrong, shouldn't the extracted data change to nothing? Is there some trick to using the regexp field? Do I have to enter something in the xpath field? I clicked on View JSON button and copied the xpath for this column there and pasted that into the manual xpath box, but that didn't change anything either.
Is there a tutorial somewhere for how to use the regexp field? And I'm not asking about how to use regular expressions, just the interface for it on import.io.

Grant,
You are correct. At the moment it is not possible to apply a regexp to HTML columns. There is a post in the ideas forum capturing this as a feature request; you may want to upvote it, so that you'll also be notified if the idea gets built:
http://support.import.io/forums/199278-ideas-forum/suggestions/6328279-apply-regular-expressions-to-html

Related

How to grab an XPath query in the Google Sheets IMPORTXML function?

Trying to grab from a link (https://www.valueresearchonline.com/stocks/1764/infosys-ltd?utm_source=direct-click&utm_medium=stocks&utm_term=&utm_content=Infosys&utm_campaign=vro-search#snapshot) - the relevant HTML is the element whose text reads:
Essential Checks
Altman Z-Score
I've made the following query to try and work with that HTML:
=IMPORTXML($A$2,"//*[@id='z-score']/div/div[2]/div/div")
A2 contains the relevant URL.
I think the XPath is correct there, but I'm not sure why it won't give me the result.
According to the IMPORTXML documentation:
IMPORTXML imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.
Therefore, the =IMPORTXML() command you are using reads the page's HTML source as the server returns it, without executing any JavaScript associated with it.
So, since the website you are trying to import data from is dynamic, the results you get are not the ones you see in the browser. In this case, unfortunately, IMPORTXML() cannot be used.
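A quick way to confirm this diagnosis (a minimal Python sketch, not part of the original answer) is to check whether the target id appears in the raw HTML source at all; if it doesn't, the element is injected by JavaScript and no XPath in an IMPORTXML formula can ever match it:

```python
import urllib.request

def visible_to_importxml(html_source, element_id):
    """Return True if the id attribute appears in the static HTML.

    IMPORTXML only sees the HTML the server returns, with no
    JavaScript executed, so an id that is absent from the raw
    source cannot be matched by the formula's XPath."""
    return ('id="%s"' % element_id) in html_source or \
           ("id='%s'" % element_id) in html_source

# Live usage (fetches the page; the result depends on the site):
# html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
# print(visible_to_importxml(html, "z-score"))
```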

Use category name in Sitefinity blog URL

I followed the instructions here on establishing a new provider and generating custom URLs, and it works as expected. There doesn't seem to be a clear reference for what parameters can be utilized in the settings as the example given is very basic.
I want to use the category name of the post in the URL. I tried:
/[Category]/[UrlName]
but what I got in the frontend was:
http://localhost:60327/my-page/Telerik.OpenAccess.TrackedList%601[System.Guid]/my-post-name
I also tried
/[Category.Title]/[UrlName]
which just threw errors.
Anyone know how to do this, or better yet, a good reference for the parameters?
I don't think this is possible since the Category property is actually a collection (TrackedList).
In theory you would need one of the collection items, let's say the first one, and your URL expression would be /[Category[0].Title]/[UrlName], but this is currently not supported by the expression parser.
Also, making the URL depend on a complex (related) field is not a good idea: if someone deletes that category, all your blog post URLs will break.
I would suggest creating a custom text field on the blog post item (e.g. CategoryUrl); then you should be able to set the URL format to /[CategoryUrl]/[UrlName]. Make sure the CategoryUrl field is required.

Google Custom Search automatic spell checking

We're having a problem with the automatic spell checking on queries in the XML results of the Google Custom Search.
Queries which are spelled incorrectly return results for the corrected spelling, e.g. socer becomes soccer and returns results for soccer. On Google.com there is the option to then search on the original query by adding nfpr=1 to the query string. However, this doesn't work in Google Custom Search, and I've been unable to find any other way to search for the incorrect spelling.
For a standard google search this behavior can be avoided by adding the argument &nfpr=1 to the query url.
For a custom search based on the AJAX API, this unfortunately isn't possible. The only workaround I've found is to use JavaScript to parse the user's query, then use a regular expression to put quotes around each single word that is not yet quoted. So for example, if the keywords received are
"bmw z4" manual
you would change that to
"bmw z4" "manual"
which has the same effect, except that it disables the auto-correction. Unfortunately if you want to deal with all the special cases of advanced logical syntax (AND, OR, |, -, etc.), your regexp gets a bit complex.
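As a rough sketch of that approach (Python here for illustration; the function name is mine, and the operator handling is deliberately simpler than the full logical syntax mentioned above):

```python
import re

def quote_bare_words(query):
    """Wrap each unquoted word in double quotes so the search engine
    does not auto-correct its spelling; already-quoted phrases and
    the basic operators (AND, OR, |, -prefixed exclusions) pass
    through untouched."""
    # Match either a quoted phrase or a run of non-whitespace.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    quoted = []
    for tok in tokens:
        if tok.startswith('"') or tok in ("AND", "OR", "|") or tok.startswith("-"):
            quoted.append(tok)
        else:
            quoted.append('"%s"' % tok)
    return " ".join(quoted)

print(quote_bare_words('"bmw z4" manual'))  # → "bmw z4" "manual"
```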
Myself, I just parse the response from Google to see if this is happening, and if so notify the user how to prevent it (by putting quotes around the offending word(s)).

Programmatic access to On-Line Encyclopedia of Integer Sequences

Is there a way to search and retrieve the results from On-Line Encyclopedia of Integer Sequences (http://oeis.org) programmatically?
I have searched their site and the results are always returned as HTML. They do not seem to provide an API, but their policy statement says it's acceptable to access the database programmatically. But how do I do that without screen scraping?
Thanks a lot for your help.
The OEIS now provides several points of access, not just ones using their internal format. These seem largely undocumented, so here are all of the endpoints that I have found:
https://oeis.org/search?fmt=json&q=<sequenceTerm>&start=<itemToStartAt>
Returns a JSON formatted response of the results found from the sequenceTerm given. If too many results were returned, count will be > 0 whilst results will be null. If no results were returned, count will be 0. itemToStartAt is used for pagination of results, as only a maximum of 10 are ever returned. This starts at 0. If you wanted to return a second page of results, this would equal 10. Information about what each of the entries means can be found here.
https://oeis.org/search?fmt=text&q=<sequenceTerm>&start=<itemToStartAt>
Exactly the same arguments as before, but this returns results in the OEIS internal format, which is largely documented here. Unless your project requires it, I'd highly recommend using the JSON format over this.
https://oeis.org/search?fmt=<json|text>&q=id:A<sequenceNumber>
Will return a single result if the sequenceNumber is found. This is the suggested method for obtaining single sequences, as it appears to be far more optimised than some of the alternative methods that can be used as queries. Requests often take under a second. Alternative search query methods can be found on this page.
https://oeis.org/A<sequenceNumber>/graph?png=1
This endpoint can be used to grab the images that graph a sequence's data points. Alternatively, setting png to zero returns the HTML page containing the graph.
https://oeis.org/recent.txt
This returns a list of recently updated entries in the OEIS internal format. There are no parameters available, or JSON format, as this seems like a static text file that is simply being served to the client. Due to the length of replies from the OEIS database (for some sequences replies can take above five seconds), I'd highly recommend heavily caching requests and using the above endpoint to update them when they change.
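To illustrate the JSON endpoint above, here is a small Python sketch (the helper names are mine; the result fields follow the format documented on the OEIS site, so treat the exact field names as assumptions):

```python
import json
import urllib.request

def oeis_search_url(terms, start=0, fmt="json"):
    """Build a search URL for the endpoint described above."""
    query = ",".join(str(t) for t in terms)
    return "https://oeis.org/search?fmt=%s&q=%s&start=%d" % (fmt, query, start)

def parse_search_results(body):
    """Extract (A-number, name) pairs from a JSON search response.

    'results' is null when nothing matched or when too many
    sequences matched; check 'count' to tell the cases apart."""
    data = json.loads(body)
    if not data.get("results"):
        return []
    return [("A%06d" % entry["number"], entry["name"])
            for entry in data["results"]]

# Live usage (network access required):
# with urllib.request.urlopen(oeis_search_url([2, 5, 14, 50, 233])) as resp:
#     for number, name in parse_search_results(resp.read()):
#         print(number, name)
```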
A URL of the form http://oeis.org/search?fmt=text&q=2,5,14,50,233 gives a nicely formatted text output.
But it seems there is no way to get a single sequence in text form.
If you happen to use Mathematica, it sounds like the following notebook might help. It allows you to specify a sequence and automatically import a detailed list of matching entries from the OEIS:
http://www.brotherstechnology.com/math/oeis_mathematica.html
It looks like direct use of their CGI program is the only API they provide.
URL for searching the database:
https://oeis.org/search?q=id:A000032&fmt=text
This gives the plain-text form of an entry in their internal format, which is described at:
https://oeis.org/eishelp1.html

yql and firebug xpath copy/paste returning no result

I'm trying to do a little bit of screen scraping of a third-party vendor's bug-tracking system (Jira), so I can scrape the count/category of all the unresolved bugs. I want to put this info on our intranet so management can see it without going to the third-party site (which they don't have login credentials for).
I'm having problems getting xpath results back, though. Here's what I'm doing. Using Firebug, I select the DOM element I'm interested in and right-click "copy as xpath". Then I paste that into the YQL console, so I have something that looks like:
select *
from html
where url='http://username:password@jira.3rdparty.com/path/to/page_i_want.aspx'
and xpath='//*[@id="primary"]'
My JSON results come back null. If I remove the xpath in my query, I get back results. If I select other elements on the page, my JSON results come back null. If I start tweaking the xpath, say remove the last div in the path, I can sometimes get results, it just depends on what I've selected and what I've tweaked in the xpath.
Anyone know why I'm not getting any results doing the Firebug copy as xpath? I can't really say I'm an xpath pro :)
Edit: Actually, looking at the results I'm getting back with no xpath, it looks like I'm not authenticating. My username has an @ and a domain in it, so I log in via a browser with something like:
username@domain
password
YQL doesn't seem to like the @domain, and escaping the @ with \ doesn't seem to work. Anyone have any ideas?
This will work as long as following criteria are met:
The module will only fetch HTML pages under 1.5MB, and the page must also be indexable (e.g. allowed by the site's robots.txt file).
Since it is behind a login, it's probably not indexable. The robots.txt is public, such as:
http://internet.com/robots.txt
For future reference, use double quotes to escape the commercial-at symbol:
'http://"username@domain:password"@jira.3rdparty.com/path/to/page_i_want.aspx'
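A standard alternative (a Python sketch, not from the original answer) is to percent-encode the reserved characters in the userinfo part, so that only one unambiguous @ separates the credentials from the host:

```python
from urllib.parse import quote

def url_with_credentials(user, password, host, path):
    """Percent-encode the username and password so characters such
    as '@' (which becomes %40) cannot be confused with the
    userinfo/host separator in the URL."""
    return "http://%s:%s@%s%s" % (
        quote(user, safe=""), quote(password, safe=""), host, path)

print(url_with_credentials("username@domain", "password",
                           "jira.3rdparty.com", "/path/to/page_i_want.aspx"))
# → http://username%40domain:password@jira.3rdparty.com/path/to/page_i_want.aspx
```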
Here are some resources:
Pipes XPath fetch page
Commercial-at Unicode