I'm new to Python and Scrapy. I had hoped I could combine the two to scrape some gambling websites. This is an example:
https://www.oddschecker.com/football/scottish/premiership/kilmarnock-v-aberdeen/winner
If I simply view the HTML source for that page, the main table of odds isn't in it, which confuses me greatly. I have tried using Scrapy on it, but it has the same issue.
What's going on at that page that nothing can see the data in the table? And what would be an easy way to scrape it?
Thanks!
It looks like the initial response is all JavaScript. Generally, gambling and stats sites don't want to be scraped, and this looks like a means to thwart bots. I would think this one sits at the deep, difficult end of the pool, and you would likely be better off with something that drives a real browser so the page can be rendered before you read it.
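A quick way to confirm this before reaching for a browser is to count what the raw response actually contains. The following is only a minimal sketch using the standard library; the `sample` string is a made-up stand-in for what "view source" might show on such a page, not the real markup of the site.

```python
from html.parser import HTMLParser


class MarkupCounter(HTMLParser):
    """Counts <script> tags and table rows (<tr>) in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.table_rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        elif tag == "tr":
            self.table_rows += 1


def summarize(html):
    """Return (script_count, table_row_count) for the given HTML text."""
    counter = MarkupCounter()
    counter.feed(html)
    counter.close()
    return counter.scripts, counter.table_rows


# Hypothetical stand-in for an initial response that is mostly JavaScript.
sample = """
<html><body>
<div id="app"></div>
<script src="/bundle.js"></script>
<script>window.boot();</script>
</body></html>
"""

scripts, rows = summarize(sample)
print(f"scripts={scripts}, table rows={rows}")  # prints "scripts=2, table rows=0"
```

If the HTML your spider downloads shows scripts but no table rows, the odds table is being built in the browser after the page loads, and a plain HTTP fetch will never see it.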
Our client's website loads really slowly on the first load (the TTFB on the page document can be 10-20s). If I reload the page, the site loads a lot faster.
Is this because a lot of the files are cached by then?
Website is here: https://www.mortels.com.au/
This happens for a lot of the pages.
I have tried merging some of the .css files, and will attempt the same with the .js files if I cannot find anything else. (I didn't build the original theme, so I'm finding it hard to figure out what is done where, and I don't have much experience with developing in Shopify.)
I also tried adding a lazy loader; however, it doesn't look like it is working.
Would anyone have any solutions to make the website load quicker? Could it be just the apps we have running on the website causing the initial response to be so slow?
One of the things that can hinder your site's load speed is having too much logic happening through Liquid tags. Shopify has to parse all of the page's Liquid code before it can serve the page, and that has a direct effect on the TTFB.
For the files that have unacceptable TTFB ratings, some things you can try to do to help make Shopify's servers serve your content faster include:
Reducing the number of lookups (e.g. through all_products[handle]) on the page
Avoiding nested for loops whenever possible
Replacing loops with map whenever you need to make an array of values
Rewriting logic-heavy sections to run in JavaScript instead of Liquid (and using the | json filter to drop your Liquid variables in a JavaScript-friendly version)
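To see whether changes like these actually move the TTFB, it helps to measure it the same way before and after. Here is a rough sketch using only Python's standard library. The function name `measure_ttfb` is my own, and the timing it reports includes connection setup (DNS, TCP, TLS), so treat it as an approximation rather than the exact figure a browser's dev tools would show.

```python
import http.client
import time
from urllib.parse import urlsplit


def measure_ttfb(url, timeout=30.0):
    """Return (status, seconds) from sending a GET until the response headers arrive."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)

    # Rebuild the request target, keeping any query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    try:
        start = time.perf_counter()
        conn.request("GET", target)     # connects, then sends the request
        response = conn.getresponse()   # returns once status line and headers are read
        elapsed = time.perf_counter() - start
        response.read()                 # drain the body so the connection closes cleanly
        return response.status, elapsed
    finally:
        conn.close()
```

Calling `measure_ttfb("https://www.mortels.com.au/")` a few times in a row and comparing the first number with the later ones should show whether the slow first load is really server-side.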
Hope this helps!
I'm attempting a Scrapy-with-Splash project to get a few fields off the website "https://sailing-channels.com/by-subscribers". This site uses JavaScript to retrieve and delete listings as you scroll.
I've not had any luck getting the Splash server to give me the whole set of data, or any of the detailed listings for that matter.
My first question is can splash even do this?
I really don't care how I get this data. I would prefer doing it with a program, but any tool that can get me fields from this site into a .csv file would do the job. Does anyone have any suggestions?
Thanks for any advice
Why do you want to render it? They have a pretty good API; check https://sailing-channels.com/api/channels/get?sort=subscribers&skip=0&take=5&_=1548520116425. So you can iterate, increasing the skip argument and parsing the JSON each time.
That looks like a very promising approach.
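As a rough sketch of that loop in plain Python (standard library only): the `sort`, `skip` and `take` parameters come from the URL above, but I am assuming the response body is a JSON list of channel objects and that an empty list marks the end. The field names passed to `export` are hypothetical, so check one real response first. The `_=...` parameter looks like a cache-buster and is left out here.

```python
import csv
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://sailing-channels.com/api/channels/get"


def build_url(skip, take, sort="subscribers"):
    """Build the paged API URL from the sort/skip/take parameters."""
    return f"{API_URL}?{urlencode({'sort': sort, 'skip': skip, 'take': take})}"


def fetch_page(skip, take):
    """Fetch one page. ASSUMPTION: the body is a JSON list of channel objects."""
    with urlopen(build_url(skip, take), timeout=30) as response:
        return json.load(response)


def iter_channels(fetch=fetch_page, take=50, max_pages=1000):
    """Yield channels page by page until an empty page comes back."""
    for page in range(max_pages):
        batch = fetch(page * take, take)
        if not batch:
            break
        yield from batch


def write_csv(channels, out, fields):
    """Write the chosen fields of each channel dict as CSV rows."""
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for channel in channels:
        writer.writerow({name: channel.get(name, "") for name in fields})


def export(path, fields):
    """Download every page and save the chosen fields to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        write_csv(iter_channels(), out, fields)
```

Something like `export("channels.csv", ["title", "subscribers"])` would then produce the file, once the field names are swapped for whatever the API really returns. The `max_pages` cap is just a safety net in case the end-of-data assumption is wrong.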
I've searched all over the place and I can't figure out what I'm doing wrong. No matter what, I still get a "Page does not contain authorship markup" message in the structured data testing tool.
I have two sites with almost identical pages. The rel=author tags are inserted the same way.
Here is an example of one page that works: http://bit.ly/18odGef
Here is an example of one page that doesn't: http://bit.ly/12vXdAm
I tried adding ?rel=author to the end of the Google+ profile URL, which doesn't seem to work on either site. I am not blocking anything via nofollow or robots.txt. The tool is not being blocked by a firewall or anything. Can anyone see what I'm doing wrong here and why it works for one site, but not the other?
FYI, the site that does not work used to work without a problem. I hadn't changed anything with how the author markup was organized until I realized it wasn't working anymore.
When I test both of those pages in Google's structured data test tool, it shows that authorship is working correctly for both pages.
Here are the results for the page you said was working: https://www.google.com/webmasters/tools/richsnippets?q=http%3A%2F%2Fnikonites.com%2Fd5100%2F2507-d5100-vs-d90.html%23axzz2rFFm1eVv
Here are the results for the page you said wasn't working: https://www.google.com/webmasters/tools/richsnippets?q=http%3A%2F%2Fcellphoneforums.net%2Fsamsung-galaxy%2Ft359099-enable-auto-correct-galaxy-note-ii.html%23axzz2rFFlwz3W
I'm having a problem when trying to A/B test certain nodes in my node-tree in Umbraco.
What I want to do is to copy a node in the node-tree to a specific spot and use that B-structure to see which of the structures works best, using Google analytics.
For example we have two node structures, let's call them "Private" and "Sweden".
Their structures, with child nodes and properties, are exactly the same. The only difference between them is the property values (content). The "Private"-URL is www.mysite.com/Private and the "Sweden"-URL is www.mysite.com/Sweden.
What I would like to do is to change every link in the B-structure so that it points to its match in the A-structure. The problem is that, since they are two different structures, they have two different sets of links.
In other words, it should be a coincidence that a visitor enters the B-structure, and they should be moved back to the A-structure on the next click.
We manage what page it should load (either the A-node or the B-node) with scripts, so that it has a 50% chance for each node, and if it lands on the B-node, Google analytics will save data. What we can't manage is that every link on that page will be to the A-node.
I'd appreciate any help I can get.
Regards,
David
There's a couple of ways that seem likely to give you a start at least.
The /config/urlrewriting.config file allows you to set up multiple redirect rules within Umbraco, so a section like the following might work in sending all requests (whether /sweden/pagename/ or /private/pagename/) back to the private structure. Not sure how GA will handle it:
rewriteUrlParameter="ExcludeFromClientQueryString" destinationUrl="http://www.mysite.com/private/$1" redirect="Domain" redirectMode="Permanent" ignoreCase="true" />
Secondly a simple httpmodule (http://support.microsoft.com/kb/307996) can process all page requests and redirect as required - you could do a gaq_push here directly or indirectly.
I'd be interested to know how you get on - it seems a good area for extension to Umbraco.
I'm not sure I have understood perfectly what you need to do, so please excuse any assumptions that may prove mistaken. Here's what I think:
Since A & B nodes should share the same html content (besides the links of course), why don't you make the link href attribute dynamic by using a bit of razor in the template or macro:
@{ var isANode = CurrentPage.Parent.Name == "Sweden"; }
A similar approach would work if you are using web forms.
We finally decided to use the alternative-template solution. Since there seems to be no generic solution for my case, we had to create an alternative template with specific macros to render the different information for every document type we're using.
Creating dynamic links for every page is a hell of a job at this stage of the project, since there are so many pages and links. Also, some links are made in JavaScript, so there's another problem.
I copied the A-structure to another node, purely to be able to change the property values. There might be a problem logging and tracking the information with Google Analytics though, so that's the next step for us in this project. In our alternative templates we're getting the property values from the B-structure.
Still, if anyone has a better solution, I'd highly appreciate it!
Regards,
David
I'm maintaining an existing website that wants a site search. I implemented the search using the Yahoo API. The problem is that the API is returning irrelevant results. For example, there is a sidebar with a list of places, and if a user searches for "New York" the top results will be for pages that do not have "New York" in the main content section. I have tried adding Yahoo's class="robots-nocontent" to the sidebar; however, that was two weeks ago and the results have not changed.
I also tried out Google's Search API but am having the same problem.
This site has mostly static content and about 50 pages total so it is very small.
How can I implement a simple search that only searches the main content portions of the page?
At the risk of sounding completely self-promoting as well as pushing yet another API on you, I wrote a blog post about implementing Bing for your site using jQuery.
The advantage in using the jQuery approach is that you can tune the results quite specifically based on filters passed to the API and playing around with the JSON (or XML / SOAP if you prefer) result Bing returns, as well as having the ability to be more selective about what data you actually have jQuery display.
The other thing you should probably be aware of is how to effectively use rel attributes on your content (esp. links) so that search engines are aware of the relationship between the content they're crawling and the destination content it links to.
First, post a link to your website... we can probably help you more if we can see the problem.
It sounds like you're doing it wrong. Google Search should work on your website, unless your content is hidden behind JavaScript or forms or something, or your site isn't properly interlinked. Google solved crawling static pages, so if that's what you have, it will work.
So, tell me... does your site say New York anywhere? If it does, have a look at the page and see how the word is used... maybe your site isn't as static as you think. Also, are people really going to search your site for New York? Why don't you input some search terms that are likely on your site.
Another thing to consider: if your site is really just 50 pages, is it realistic that people will want to search it? Maybe you don't need search... maybe you just need something like a section of commonly used links.
The BOSS Site Search Widget is pretty slick.
I use the bookmarklet thing, but set as the "home" page in my browser. So whatever site I'm on, I can hit my "home" button (which I never used anyway) and it pops up that handy site search thing.