Is there a way to get complete html content of pages in bulk in sitefinity? - sitefinity

I need to extract the HTML content from different pages and put them in XML file. Is there a way to get complete HTML content from child pages of a group page in Sitefinity?

Get the page nodes you need with the Sitefinity API, loop through them, and issue a GET request for each page URL, for example with WebClient and its DownloadString method.
Then do whatever you wish with the resulting HTML string.
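The answer above targets .NET, but the loop itself is language-neutral. As a rough sketch only, here is the same idea in Python: it assumes you already have the child page URLs (for example, exported from the Sitefinity API), and the names pages_to_xml, fake_fetch, and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(urls, fetch):
    """Build an XML tree with one <page> element per URL, holding its HTML."""
    root = ET.Element("pages")
    for url in urls:
        page = ET.SubElement(root, "page", {"url": url})
        # The HTML is stored as text; ElementTree escapes the markup on output.
        page.text = fetch(url)
    return ET.ElementTree(root)


def fake_fetch(url):
    """Offline stand-in for a real GET request (hypothetical helper)."""
    return f"<html><body>{url}</body></html>"


# Hypothetical child page URLs of a group page.
urls = [
    "https://example.com/group/child-1",
    "https://example.com/group/child-2",
]
tree = pages_to_xml(urls, fake_fetch)
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
print(xml_text)
```

For real pages, fetch could be replaced by a function that calls urllib.request.urlopen(url).read().decode("utf-8"), and tree.write("pages.xml", encoding="utf-8") would save the result to a file.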

Related

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given there looks pretty easy: the response is an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information from every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
The XHR URL is:
https://www.bahiablancapropiedades.com/buscar/resultados/0
While you scroll the page, each request returns 8 records.
So get the total number of records with an XPath and divide it by 8; that gives you the number of XHR requests to make.
I had the same issue, and the logic below resolved it.
import math
pagination_count = ...  # total number of results shown on the page, read via XPath
pages = math.ceil(int(pagination_count) / 8)  # 8 records per request, rounded up
for pagination_value in range(pages):
    url = f"https://www.bahiablancapropiedades.com/buscar/resultados/{pagination_value}"
    # pass this URL to your Scrapy callback
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
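To see why the response is not corrupted, here is a small offline sketch. It swaps parsel for the standard library's html.parser so it runs without Scrapy installed, and the sample response body is made up; only the "view" key comes from the answer above.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_from_json(body):
    """Decode a JSON body and return the hrefs found in its 'view' HTML."""
    data = json.loads(body)  # unescapes \" and \/ back into plain HTML
    collector = LinkCollector()
    collector.feed(data["view"])
    collector.close()
    return collector.links


# Made-up sample response: the HTML is escaped inside the JSON string.
sample_body = r'{"view": "<div class=\"item\"><a href=\"\/propiedad\/1\">Lote 1<\/a><\/div>"}'
print(links_from_json(sample_body))
```

Once json.loads has decoded the body, the value of "view" is ordinary HTML, which is why it can be handed to parsel.Selector in the shell session above.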

How to follow lazy loading with scrapy?

I am trying to crawl a page that is using lazy loading to get the next set of items. My crawler follows normal links, but this one seems to be different:
The page:
https://www.omegawatches.com/de/vintage-watches
is followed by https://www.omegawatches.com/de/vintage-watches?p=2
But only if you load it within the browser. Scrapy will not follow the link.
Is there a way to make Scrapy follow pages 1, 2, 3, 4 automatically?
The page uses virtual scrolling, and the API it gets its data from is
https://www.omegawatches.com/de/vintage-watches?p=1&ajax=1
It returns JSON data containing various details, including the products in HTML format and, in an a tag with the classes link next, whether a next page exists.
Increase the page number until there is no a tag with the link next classes.
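The stopping rule above could be sketched roughly as follows. This is an offline illustration, not a Scrapy spider: the JSON key names are not given in the answer, so the sketch searches every string in the response for the anchor, and fake_fetch, the "products" and "pager" keys, and the three-page limit are all made up.

```python
import json
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Sets found=True when an <a> tag carries both classes 'link' and 'next'."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "link" in classes and "next" in classes:
            self.found = True


def iter_strings(value):
    """Yield every string nested anywhere inside decoded JSON data."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def has_next_page(data):
    """Return True if any HTML fragment in the data contains the next link."""
    for text in iter_strings(data):
        finder = NextLinkFinder()
        finder.feed(text)
        finder.close()
        if finder.found:
            return True
    return False


def crawl(fetch, max_pages=1000):
    """Fetch pages 1, 2, 3, ... until a response has no 'link next' anchor."""
    pages = []
    for page in range(1, max_pages + 1):
        url = f"https://www.omegawatches.com/de/vintage-watches?p={page}&ajax=1"
        data = json.loads(fetch(url))
        pages.append(data)
        if not has_next_page(data):
            break
    return pages


def fake_fetch(url):
    """Offline stand-in for the HTTP call; pretends the site has 3 pages."""
    page = int(url.split("p=")[1].split("&")[0])
    pager = '<a class="link next" href="#">next</a>' if page < 3 else ""
    return json.dumps({"products": f"<li>watch {page}</li>", "pager": pager})


print(len(crawl(fake_fetch)))
```

In a Scrapy spider the same check would sit in the callback: parse the JSON, and only yield a request for page + 1 when the next link is present.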

Apache Wicket how to render a (non-wicket) response page

I'm using Apache Wicket and I have following problem:
Inside an onSubmit() method I am sending a POST request to an external web address with Apache HttpClient. As a response I get HTML (inside my response object).
How can I get Wicket to render this html in browser?
So basically what I'm trying to do here is what would normally happen if I submitted an HTML form to this web address. However, for security reasons I don't want to give the user pages with forms that contain the data I'm trying to send.
You can get the response via getResponse() in any component. (I assume the onSubmit() is on a form).
How about something like:
getResponse().reset();
getResponse().write(htmlPage);
htmlPage should be a CharSequence containing the html page to be rendered.

Get the HTML output data from a wicket component

I'm currently writing a web widget, and I would like to fill the content of this widget with some HTML data generated by a wicket component on my server.
To do that, the server will output the HTML data via JSONP. So far so good.
However I need to get this HTML data. How can I get on the server the HTML output from some wicket component?
I don't know if this can be applied to your configuration, but I am using a few lines of code to retrieve rendered HTML. I wrote them some time ago for building HTML-based emails, so that I could use Wicket components in them:
protected final String renderPage(Component page) {
    // Remember the response of the current request so it can be restored.
    final Response oldResponse = RequestCycle.get().getResponse();
    BufferedWebResponse tempResponse = new BufferedWebResponse((WebResponse) RequestCycle.get().getOriginalResponse());
    try {
        // Render the component into the temporary buffered response.
        RequestCycle.get().setResponse(tempResponse);
        page.render();
    }
    finally {
        RequestCycle.get().setResponse(oldResponse);
    }
    return tempResponse.toString();
}
Because this rendering happens within a running web application cycle but independently of the actual request, it is recommended to preserve the original response of the request cycle, which the finally block restores. The page is rendered into your temporary web response, from which you can retrieve the rendered HTML output.
I hope this is what you are looking for.
You might find everything you need in this Wicket Wiki article and the linked source code: Use wicket as template engine
I must admit, though, that I have never tried it; I just read it and remembered it for future reference.

iText header for html

I am generating a PDF using iText 5.0.5. I read content of different MIME types (images, PDFs, HTML content, etc.) from a database and generate a PDF from it.
There are two kinds of output: the user can view an individual document, or a collection of documents in one single generated PDF.
I have one problem with the header of the PDF generated from HTML content. This HTML content comes from a text area on a form: the user gets the header information prepopulated in the text area and can then type to create the document. At PDF generation time I use a page event to generate the header on each page for every MIME type.
For HTML content the header therefore appears twice. What I want is for the header not to be generated on the first page of an HTML document. I have a solution that works when I read an individual document, but it does not work when I read the final PDF that contains all the documents of different MIME types. Is there a way to skip the header on the first page of HTML content only, while still generating it with the page event on the remaining pages?
Please help.
Perhaps you could use two different page events when dealing with HTML: one that adds headers (the current one), and a new one that only sets the page event handler back to the original one.
You start off with the new one. When the first page event comes along, the new event handler changes the current page event handler to the original one, so the remaining pages are stamped with headers as usual.