Scrapy: Scraping JS rendered page with Splash

My goal is to scrape the following page: https://www.festicket.com/de/festivals/?country=DE
From what I read and learned so far, the best approach is either to find the API (which I wasn't able to) or use Splash as an extension. Is that the approach you would take, or did I miss anything?
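
For reference, the Splash route mentioned in the question can be sketched with the standard library only. This is a minimal sketch, not a full Scrapy spider: it assumes a Splash instance is already running locally on port 8050 (for example via its Docker image), and the helper names `build_splash_url` and `fetch_rendered_html` are made up for illustration. It uses Splash's `render.html` HTTP endpoint with its `url` and `wait` arguments.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is running locally on port 8050.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def build_splash_url(page_url: str, wait: float = 2.0,
                     endpoint: str = SPLASH_ENDPOINT) -> str:
    """Build a render.html URL asking Splash to load page_url, wait
    `wait` seconds so scripts can run, and return the resulting HTML."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{endpoint}?{query}"


def fetch_rendered_html(page_url: str, wait: float = 2.0) -> str:
    """Fetch the JavaScript-rendered HTML through Splash.
    Only works while a Splash instance is reachable at SPLASH_ENDPOINT."""
    with urlopen(build_splash_url(page_url, wait), timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html("https://www.festicket.com/de/festivals/?country=DE")` would return the HTML after scripts have run, which can then be parsed as usual. Inside a Scrapy project, the scrapy-splash plugin's `SplashRequest` plays the same role; the `wait` value of 2 seconds here is a guess and may need tuning for this page.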

Related

(Karate) How to intercept the XHR request response code?

I am testing login functionality on a third-party website at example.com/login. When I paste this URL into the browser (Chrome), the page sometimes loads and sometimes does not (I get a blank white page).
The problem is that I have to run a script on this page to click one of the elements (all the elements are embedded inside #shadow-root). If the page loads, there is no problem and the script is evaluated successfully. But sometimes the page does not load and a 404 is returned in response to an XHR request, and as a result my * eval(scrip("script") step returns "js eval failed...".
The solution I came up with is to refresh the page, and to do that I am considering capturing the XHR response. If the status code is 404, refresh the page; if not, continue with the following steps.
I think this may work, but I do not know how to implement Karate's Intercepting HTTP Requests. First of all, is that doable?
I have looked into the documentation here, but could not understand the examples:
https://github.com/karatelabs/karate/tree/master/karate-netty
Meanwhile, if there is another way of refreshing the page conditionally, I would be happy to hear about it. Thanks in advance.

First, using JavaScript you should be able to handle shadow roots: https://stackoverflow.com/a/60618233/143475
The above answer also links to advanced examples of executing JS in the context of the current page. I suggest you research that and get help from someone who knows JS, the DOM and HTML well. You should then be able to find a way to tell whether the XHR succeeded, e.g. based on whether some element on the page has changed.
Finally here is how you can do interception: https://stackoverflow.com/a/61372471/143475
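
The conditional refresh itself is independent of Karate. The control flow can be sketched in Python as follows; this is only an illustration of the logic, not Karate syntax, and `page_is_ready` and `reload_page` are hypothetical callables standing in for "check that the expected element exists" and "refresh the browser".

```python
from typing import Callable


def reload_until_ready(page_is_ready: Callable[[], bool],
                       reload_page: Callable[[], None],
                       max_reloads: int = 3) -> bool:
    """Check the page, reloading it after each failed check, until it is
    ready or max_reloads reloads have been spent. Returns True if ready."""
    for attempt in range(max_reloads + 1):
        if page_is_ready():
            return True
        # Only reload if there is budget left for another check afterwards.
        if attempt < max_reloads:
            reload_page()
    return False
```

Capping the number of reloads keeps the test from looping forever when the 404 is persistent rather than intermittent.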

Reason SPA pages are refreshing?

I just finished learning Vue.js and visited a few websites that use it, such as:
a) https://coderstape.com
b) https://www.thenetninja.co.uk
c) https://laracasts.com
I noticed that when navigating around these websites by clicking navbar links and some other links, the pages refresh, and I haven't been able to find the reason online. Could someone kindly explain what is happening? Doesn't it go against the purpose of an SPA?

Take the last site you mentioned, https://laracasts.com, as an example.
On its main page there is a white "BROWSE COURSES" button. If you open the Chrome DevTools panel (see the annotated screenshot), go to the "Network" tab (1) and then click this button, you can see a GET request to "series?curated" (2). If you open its details, you can see that the response is a new page in the form of HTML (3), not JSON, for example, as is usually the case in an SPA.
Also, if you check which technology the site is built with, for example using the service https://whatcms.org/?s=laracasts.com, you can see that it is PHP, namely Laravel.
From all this, I assume that they use Vue.js only partially, maybe in a few components, while the site navigation itself is built from traditional server-rendered pages, which is why the page reloads.
By contrast, if you look at https://www.spendesk.com/, whatcms.org indicates that they use Vue.js with Nuxt.js, as well as Node.js, and if you navigate between pages on this site you will see no page reload. I would say this site is a true SPA in the sense you mean.
I have heard that you can build an SPA with a Laravel backend, but I think that's another story.
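
The DevTools check described in this answer (does a navigation click return a whole HTML page or just data?) can be approximated in code. The following is a rough heuristic sketch; the function name `classify_response` and the labels it returns are made up for illustration, and real sites can blur the line, for example by fetching HTML fragments.

```python
import json


def classify_response(content_type: str, body: str) -> str:
    """Guess whether a navigation response is a full HTML page
    ("full-page"), JSON data ("data"), or something else ("unknown")."""
    # Drop parameters such as "; charset=UTF-8" from the header value.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "full-page"
    if media_type == "application/json" or media_type.endswith("+json"):
        return "data"
    # Header missing or unhelpful: fall back to sniffing the body.
    text = body.lstrip()
    if text.lower().startswith(("<!doctype html", "<html")):
        return "full-page"
    try:
        json.loads(text)
        return "data"
    except ValueError:
        return "unknown"
```

A "full-page" result on ordinary link clicks suggests traditional server-rendered navigation, which matches what the answer observed on laracasts.com.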

Nested Routes in Gridsome?

In Gridsome, I am looking for Vue's nested route functionality (or Nuxt's child-view) to achieve something like /:userId/profile and /:userId/posts, for example. Since Gridsome uses Vue Router, I believe there should be a way to achieve this.
Let me try to explain what I am trying to build with Gridsome:
At mywebsite.com/ I want to show a Grid of images showing thumbnails of my video portfolio. When you then click on a thumbnail I want a modal to pop-up showing the video. The modal is semi transparent showing the portfolio in the background. So far so good.
But for people to be able to share the URL of the respective video, I need the path changing to mywebsite.com/video-1 and so on. When I then close the modal the path should be mywebsite.com/ again. This is something I already achieved within Nuxt with <child-view>.
Is there some similar functionality in Gridsome? I appreciate your help.

From the feedback you got here:
Gridsome doesn't support child routes yet. But you can kind of achieve what you want if you create a new content type called User and add each user as a node. Then generate pages for them with the Pages API. The pages you create can share a layout component.
In the same way, you can also generate pages for each video for having direct URLs to them. And use the $fetch() method to load a video in a pop-up. Or just query the videos in the front-page query instead of using $fetch().

Google Search API with Joomla

I would like to implement the Google Search API in a Joomla site. What would be the best way to pass the search query and show the results inside my template's content area? Should I make a custom component, or is there a lighter workaround?

You could always try RokAjaxSearch as I believe this has the ability to display results from Google. And above all, it's Ajax, therefore doesn't refresh the page.
Hope this helps

Does changing the order of HTML with Javascript help SEO

On my website, I have a booking widget at the top of each page to allow visitors to enter our booking engine. The code behind it uses quite a bit of HTML, pushing down the content on each page in the source. In an attempt to improve my SEO, I decided to place the code in a DIV tag at the bottom of the page and, when the DOM is ready, use jQuery to physically move the DIV from the bottom of the DOM to the top, where it needs to be to render correctly.
My question is whether this really helps SEO. Does Google look at the DOM/source after all JavaScript has run, or before? Does moving these few hundred lines of HTML to the bottom of the HTML source gain me any advantage?

Spiders do not process JavaScript. So any content that appears, moves or is created by JavaScript will look to them as if it had not been moved or created at all.

I'd be really surprised if web crawlers execute the scripts on the page. They probably scan the raw response.

That does not have any effect on SEO.
But placing the JavaScript at the bottom will definitely help your pages load faster.
There is no harm to SEO either, so you can definitely proceed with your approach.

There is a distinction between JavaScript executed on load and JavaScript executed during the user session. Changes made by on-load JavaScript are indexed by Google more often than not. Dynamic content or alterations made on the client side later are not well indexed.
So, it can't be ignored.