Extract portion of HTML from website? - vba

I'm trying to use VBA in Excel to navigate a site with Internet Explorer and download an Excel file for each day.
Looking through the site's HTML, each day's page has a similar structure, but one portion of the page's link appears completely random. That random-looking part is constant for a given page, though: it does not change each time you load the page.
The following portion of the HTML code contains the unique string:
<a href="#" onClick="showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;
The part starting with "b1a" is what is used in the website link. Is there any way to extract this part of the page and assign it to a variable that I can then use to build my website link?

Since you don't show your code, I will answer in general terms:
1) You get all the elements of type link (<a>) with Set allLinks = ie.document.getElementsByTagName("a"). This is a collection of length n containing all the links in the document.
2) You detect the precise link containing the information you want. Let's imagine it's the 4th one (you can parse the properties to check which one it is, in case it's dynamic):
Set myLink = allLinks(3) '<- 4th : index = 3 (starts from zero)
3) You get your token with a simple split function:
myToken = Split(myLink.onClick, "'")(3)
Of course you can be more concise if the position of the link containing the token is always the same, e.g. always the 4th link:
myToken = Split(ie.document.getElementsByTagName("a")(3).onClick,"'")(3)
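Putting the steps together, a minimal sketch (the day-page URL is a placeholder, and it assumes the target link is the only one whose onClick calls showZoomIn; splitting the link's outerHTML avoids any ambiguity over what the onClick property returns):
Sub ExtractToken()
    Dim ie As Object, lnk As Object
    Dim html As String, myToken As String

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate "http://www.example.com/day-page"   ' placeholder: the day's page
    Do While ie.Busy Or ie.readyState <> 4          ' wait for the page to load
        DoEvents
    Loop

    For Each lnk In ie.Document.getElementsByTagName("a")
        html = lnk.outerHTML                        ' includes the onClick attribute text
        If InStr(html, "showZoomIn") > 0 Then
            myToken = Split(html, "'")(3)           ' 4th piece between single quotes
            Exit For
        End If
    Next lnk

    Debug.Print myToken                             ' e.g. b1a9134c02c5db3c79e649b7adf8982d
    ie.Quit
End Sub
You can then concatenate myToken into the download URL you are building.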


SSRS - How to show external image based on URL inside column

I am trying to show images for products inside a basic report. The image needs to be dynamic, meaning it should change based on the SKU value.
Right now I am inserting an image into a table, setting the source to External, and I've tried:
=Fields!URL.Value
=http://externalwebservername/sku= & Fields!SKU.Value
="http://externalwebservername/sku=" & Fields!SKU.Value
I do not get any images in my table.
My stored proc has all the data, including a URL for the image I want to show. Here is a sample of what the URL looks like:
http://externalwebservername/sku=123456
If I enter the URL in the field directly (hard-coded, without the "="), it will show that ONE image only.
How should I set up the expression to properly show the external image based on a dynamic URL? Running SQL 2016
Alan's answer should work, but in our environment we have strict proxy/firewall rules, so the two servers could not contact each other.
Instead we are navigating to the file stored on our storage system.
We altered the URL column in the stored procedure to point to a file path. Insert the image, set Source to External, and set the Value to [URL].
URL= file://server\imagepath.jpg
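For illustration, a hedged sketch of what the altered column in the stored procedure might look like (the server, share, table, and column names are all assumptions):
-- build a file:// path instead of an http:// URL
SELECT p.SKU,
       'file://fileserver\images\' + CAST(p.SKU AS varchar(20)) + '.jpg' AS URL
FROM dbo.Products AS p;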
As long as the account executing the report has permission to access the URLs, your 3rd expression should have worked.
I put together a simple example as follows.
I created a new blank report then added a Data Source. It doesn't matter where this points, we won't use it directly.
Then I created a dataset (DataSet1) with the following SQL to give me a list of image names.
SELECT '350x120' AS suffix
UNION SELECT '200x100'
UNION SELECT '500x500'
Actually, these are just parameters for the website http://placehold.it/ which will generate images based on the size you request, but that's not relevant for this exercise.
We'll be showing three images from the following URLs
http://placehold.it/350x120
http://placehold.it/200x100
http://placehold.it/500x500
Next, create a table; I used 3 columns to give me more testing options. Set the DataSetName to DataSet1 if it isn't already.
In the first column the expression is just =Fields!suffix.Value
In the second column I added an image, set its Source property to External and the Value to ="http://placehold.it/" & Fields!suffix.Value
I then added a 3rd column with the same expression as the image Value so I could see what was being used as the image URL. I also added an action that goes to the same URL, just to check the URL did not have any unprintable characters in it that might cause a problem.
The basic report design looks like this.
The rendered result looks like this.

Netsuite PDF Templating: get number of pages as attribute

I am templating PDFs in NetSuite using FreeMarker, and I want to display the footer only on the last page. I have done some research but couldn't find a solution (the environment does not seem to allow me to include or import libs), so I thought that comparing the page number with the total pages in an if tag would be a nice and easy workaround. I already know how to display the numbers using the <pagenumber/> and <totalpages/> tags, but I still cannot get them as values so I can use them like this:
<#if (pagenumber == totalpages) >
... footer html...
</#if>
Any ideas of how or where can I get those values from?
The approach you are trying won't work, because you are mixing BFO and FreeMarker syntax. NetSuite uses two different "engines" to process PDF templates. The first step is FreeMarker, which merges the record fields with your template and produces an XML file, which is then converted by BFO into a PDF file. The <totalpages/> element is meaningless to FreeMarker, as it is only converted into a number by BFO later.
Unfortunately, adding a footer to only the last page of a document is currently beyond BFO's capabilities, as per the BFO FAQ:
At the moment we do not have a facility for explicitly assigning a footer or header to the last page in a document when the number of pages is unknown.
You CAN add it after a page break - and put the page break at the end of the body
<pbr footer="nlfooter" footer-height="25%"></pbr>
</body>
The issue here is that on a one-page output you will get 2 pages minimum... it will always ADD a page for the disclaimer / footer...
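For clarity, a minimal sketch of that workaround (the nlfooter macro id and the pbr attributes come from the snippet above; the surrounding skeleton is the usual advanced-PDF-template structure and may need adjusting for your template):
<pdf>
  <head>
    <macrolist>
      <macro id="nlfooter">
        ... footer html ...
      </macro>
    </macrolist>
  </head>
  <body>
    ... main content ...
    <!-- the page break at the end of the body; the footer macro
         applies only to the page that follows it -->
    <pbr footer="nlfooter" footer-height="25%"></pbr>
    ... last-page-only content (e.g. the disclaimer) ...
  </body>
</pdf>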

Google Search Image "no longer available: 403"

I am using the Google image search API. Until yesterday it was working, but this morning it says "This API is no longer available".
Is it officially closed, or is this an error on my side?
Request
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&rsz=8&q=cute+kittens
Response
{"responseData": null, "responseDetails": "This API is no longer available.", "responseStatus": 403}
The answer I found was using Google's Custom Search Engine (CSE) API. Note that this is limited to 100 free requests per day.
Creating cx and modifying it to search for images
Create custom search engine at https://cse.google.com/cse/create/new based on your search criteria.
Choose sites to search (leave this blank if you want to search the entire web; otherwise you can enter a site to restrict the search to one particular site).
Enter a name and a language for your search engine.
Click "create." You can now find cx in your browser URL.
Under "Modify your search engine," click the "Control Panel" button. In the "edit" section you will find an "Image Search" label with an ON/OFF button, change it to ON. Click "update" to save your changes.
Conducting a search with the API
The API endpoint url is https://www.googleapis.com/customsearch/v1
The following query parameters are used with this API:
q: specifies search text
num: specifies number of results. Requires an integer value between 1 and 10 (inclusive)
start: the "offset" for the results, which result the search should start at. Requires an integer value between 1 and 101.
imgSize: the size of the image. I used "medium"
searchType: must be set to "image"
fileType: specifies the file type for the image. I used "jpg", but you can leave this out if the file extension doesn't matter to you.
key: an API key, obtained from https://console.developers.google.com/
cx: the custom search engine ID from the previous section
Simply make a GET request to the API endpoint (listed above), passing the above values as query-string parameters; the response comes back as JSON.
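For example, a minimal sketch in Python with the requests library (the key and cx values are placeholders you obtain as described above):
import requests

params = {
    "q": "cute kittens",            # search text
    "num": 10,                      # number of results (1-10)
    "searchType": "image",          # required for image results
    "imgSize": "medium",
    "fileType": "jpg",
    "key": "YOUR_API_KEY",          # placeholder
    "cx": "YOUR_SEARCH_ENGINE_ID",  # placeholder
}
r = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in r.json().get("items", []):
    print(item["link"])             # direct URL of each image result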
Note: If you set a list of referrers in the search engine settings, visiting the URL directly in your browser will likely not work. You will need to make an AJAX call (or the equivalent from another language) from a server on that list; it will work only for the referrers specified in the configuration settings.
Reference:
https://developers.google.com/custom-search/json-api/v1/reference/cse/list
Now you can search images with the Custom Search API.
You can do this in two steps:
Get CUSTOM_SEARCH_ID
Go to https://cse.google.ru/cse/all
Here you must create a new search engine; do this and enable Image Search there.
(Screenshot in Russian... sorry.)
Then get this search engine's ID. To do this, press the Get Code button:
There, find the line with cx = "here will be your CUSTOM_SEARCH_ID":
OK, that's done. Now the second step:
Get SERVER_KEY
Go to the Google Console: https://console.developers.google.com/project
Press the Create project button and enter the name and other required information.
Pick this project and go to Enable APIs.
Now find Custom Search Engine and enable it.
Now go to Credentials and create a new Server Key:
OK. Now we can use image search.
Query:
https://www.googleapis.com/customsearch/v1?key=SERVER_KEY&cx=CUSTOM_SEARCH_ID&q=flower&searchType=image&fileType=jpg&imgSize=xlarge&alt=json
Replace the SERVER_KEY and CUSTOM_SEARCH_ID and call this request.
Limit: for free, you can make only 100 search queries per day.
If this is just for your own purposes (not for production) and you're not planning to abuse Google Image Search, you can simply extract the first image URL from Google search results using Jsoup.
For example:
Code to retrieve the image URL of the first thumbnail:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static String FindImage(String question, String ua) {
    String finRes = "";
    try {
        String googleUrl = "https://www.google.com/search?tbm=isch&q=" + question.replace(",", "");
        Document doc1 = Jsoup.connect(googleUrl).userAgent(ua).timeout(10 * 1000).get();
        Element media = doc1.select("[data-src]").first();   // first thumbnail carrying a data-src attribute
        String finUrl = media.attr("abs:data-src");          // resolve to an absolute URL
        finRes = "<img src=\"" + finUrl.replace("&quot", "") + "\" border=1/>";
    } catch (Exception e) {
        System.out.println(e);
    }
    return finRes;
}
Guide:
question - image search term
ua - user agent of the browser
After reading several responses, I compiled an answer with images:
Access the website https://developers.google.com/custom-search/v1/introduction; on that page you will find this part, so click on the button Get a Key:
Create or select a project, and then click NEXT:
Copy the API KEY:
Access the website to create your CX: https://cse.google.com/cse/create/new, enter some random domain like “www.anypage.com” (we will delete it later), select a language, and define a name for your search engine. Click on the button CREATE.
You will see this page; then click on Control Panel:
Copy the Search engine ID for later (this is your CX). Afterwards you can set it to search all websites (activate Search the entire web, select the random website www.anypage.com, then click on the button Delete) and activate Image search. It will look like this:
Using REST, you can get the results with this example code (searching for flower):
<html lang="pt">
<head>
    <title>JSON Custom Search API Example</title>
</head>
<body>
    <div id="content"></div>
    <script>
        function hndlr(response) {
            console.log(response);
            for (var i = 0; i < response.items.length; i++) {
                var item = response.items[i];
                // in production code, item.htmlTitle should have the HTML entities escaped.
                document.getElementById("content").innerHTML += "<br>" + item.htmlTitle;
            }
        }
    </script>
    <script src="https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=SEARCH_ENGINE_KEY&q=flower&searchType=image&callback=hndlr"></script>
</body>
</html>
The base code is found here: https://developers.google.com/custom-search/v1/using_rest
After setting your API_KEY (key) and your SEARCH ENGINE KEY (cx), the result will look like this:
Thanks to #Vijay Shegokar, #aftamat4ik and #Alladinian
This is the full URL template to be used; unnecessary parameters can be eliminated:
https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json
I am using
https://www.googleapis.com/customsearch/v1?key=ap_key&cx=cx&q=hello&searchType=image&imgSize=xlarge&alt=json&num=10&start=1
Change the API URL to the Google Custom Image Search endpoint and provide the same parameters along with your API KEY and CX.
More info and API Explorer: https://developers.google.com/custom-search/
The Yahoo Boss API is a reasonable substitute, although it's not free and the results are not quite as good.
UPDATE: The Yahoo BOSS JSON Search API will be discontinued on March 31, 2016.
SerpAPI lets you search through Google Images and returns clean JSON. It integrates with most programming languages: Python, PHP, Java, Golang, Node.js...
https://serpapi.com/images-results
Google limits the number of searches per day, but this service provides unlimited searches...
It looks like we need to use the Google Custom Search API:
https://developers.google.com/custom-search/
It says so at the top of the page you provided yourself.

How to pass some information between parse_item calls?

OK, imagine a website with some list. The items of this list contain one piece of the information needed. The second piece is located at some other URL, which is unique from item to item.
Currently our crawler opens the list page, scrapes each item, and for each item it opens that 2nd URL and gets the 2nd piece of info from there. We use the requests lib, which is excellent in almost all cases, but here it seems slow and inefficient: it looks like the whole Twisted reactor is blocked until one 'requests' call finishes.
pseudo-code:
def parse_item():
    for item in item_list:
        content2 = requests.get(item['url'])
We can't just let Scrapy parse these 2nd URLs independently, because we need to 'connect' the first and the second URL somehow. Something like Redis would work, but hey, is there any better (simpler, faster) way to do that in Scrapy? I can't believe things must be so complicated.
You can do this by passing a variable in meta.
For example:
req = Request(url='http://somedomain.com/path', callback=self.myfunc)
req.meta['var1'] = 'some value'
yield req
And in your myfunc, you read the passed variable as:
myval = response.request.meta['var1']
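Putting it together, a minimal sketch of a spider that carries the half-filled item from the list page to the detail page via meta (the URLs and CSS selectors are placeholders):
import scrapy

class ItemSpider(scrapy.Spider):
    name = "items"
    start_urls = ["http://example.com/list"]   # placeholder list page

    def parse(self, response):
        for row in response.css("li.item"):    # placeholder selector
            item = {
                "title": row.css("a::text").get(),
                "url": response.urljoin(row.css("a::attr(href)").get()),
            }
            # hand the item to the next callback instead of making a
            # blocking requests.get() call here
            yield scrapy.Request(item["url"], callback=self.parse_detail,
                                 meta={"item": item})

    def parse_detail(self, response):
        item = response.meta["item"]           # the item from the list page
        item["detail"] = response.css("div.detail::text").get()
        yield item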

What is the maximum number of URL parameters that can be added to the exclusion list for Google Analytics?

I set up a profile for Google Analytics. I have several dozen URL parameters that various pages use and that I want to exclude. Luckily, Google has a field you can modify under the general profile settings [Exclude URL Query Parameters:]. Of the several dozen items I have, all are working and not being considered part of the URL, except for the parameter propid.
I added propid to the comma-separated list on Monday. But every day when I check GA, sure enough they are coming through with that parameter still attached.
So, am I trying to exclude too many parameters? I couldn't find any documentation on GA's site saying there is a limit.
Here is the exact content of the Exclude URL Query Parameters field.
The reason there are so many is the bh before me didn't know the difference between GET/POST.
propid,account,pp,kw1,kw2,kw3,sortby,page,msg,sd,ed,ea,ec,sc,subname,subcode,sa,qc,type,code,propid,acct,minbr,maxbr,minfb,maxfb,minhb,maxhb,minrm,maxrm,minst,maxst,minun,maxun,minyb,maxyb,minla,maxla,minba,maxba,minuc,maxuc,card,print,year,type
Update
I thought after more time had passed the "bad data" would fall off of GA. But as of yesterday it is still reporting on the propid querystring value, despite adding it as well as other variables to the exclude list.
Update 2
I found this post on Google: https://www.google.com/support/forum/p/Google+Analytics/thread?tid=72de4afc7b734c4e&hl=en
It says that the field only allows 255 characters. OK, problem solved... except my field of values is only 247 characters. ARGGGHH!
Update 3
So here is the code I've added to the googleAnalytics.asp include page that goes at the top of every one of my classic ASP pages. Can anyone see a flaw in the design? I don't care about ANY query string info. (It could have been named *.inc, but I like having IntelliSense working.)
<script type="text/javascript">
    <% GAPageDisplayName = REQUEST.ServerVariables("PATH_INFO") %>
    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-20842347-1']);
    _gaq.push(['_setDomainName', '.sc-pa.com']);
    <% if GAPageDisplayName <> "" then %>
    _gaq.push(['_trackPageview', '<%=GAPageDisplayName %>']);
    <% else %>
    _gaq.push(['_trackPageview']);
    <% end if %>
    (function () {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
    })();
</script>
Update 4
I'll only accept an answer that speaks to the original question. My question was very specific: I wanted to know exactly the number of characters Google allows. Everything I included in my original question body was simply backfill to put everything in context.
Might I suggest an alternate solution to the reliance on manually excluding all of these (and feasibly any string ever used)?
I'd suggest passing a parameter to the trackPageView function to 'force' the recording of a manually/programmatically set 'page name' value.
Whereas by default, GA records/defines a page based on a unique URL, the inclusion of a pagename parameter would associate all pageviews of a page with that parameter as pageviews to a single page.
For example, standard GA pageview code looks like this: _gaq.push(['_trackPageview']);, whereas the inclusion of a specific page name looks like this: _gaq.push(['_trackPageview', 'Homepage']);. With the latter, presuming that the homepage is at www.site.com, regardless of how that page is accessed GA will always consolidate all pageview stats for it as 'Homepage'. So, www.site.com/index.php, www.site.com/?a=b and www.site.com/?1=2&x=y will always report as 'Homepage' as if it was one page.
The only drawback here is that you need to be incredibly careful around any occurrences of pagination, nested pages, content swapping, site search, or any functionality which may in fact rely on the use of query strings; you may need some logic for how the page name values are output (as in the sketch below), rather than attempting to define them on a per-page basis, depending on the size of your site(s).
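For example, a hedged one-line variant of the asker's Update 3 logic done client-side (window.location.pathname never includes the query string, so every URL variant of a page reports as one page):
<script type="text/javascript">
    // pathname drops everything from '?' onward, so no exclusion list is needed
    _gaq.push(['_trackPageview', window.location.pathname]);
</script>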
Hope that's helpful!
Do you realize that you have propid listed twice in the exclusion field? Once at the beginning and then again about one-third of the way through. That's the only thing that stands out to me. See what happens if you remove either of these.
You also have type duplicated, so if the above fixes the problem for propid, also consider removing the second type.
Google limits the characters in the "Exclude URL Query Parameters" field (2048 characters max), not the number of queries. I had the same issue you're having, and what I discovered was that I had populated my query string parameter list based on the page names in my Pages report. Those page names first pass through a view-level lowercase filter that I have set up, and since the "Exclude URL Query Parameters" field is case-sensitive, some of the parameters were getting through. Hopefully this helps.