I'm using Scrapy to get data from a website. The problem is that I don't know how to fetch only the incremental data after the website has been updated on the server, or how to tell that the website has been updated at all.
The table on the web page is what I want to crawl.
The table has a column named "Add Date", so when the data is updated I only want to fetch the rows that were added recently. The difficulty is that the URL of the website does not change after an update. It's still https://gold.jgi.doe.gov/projects.
I've read the Q&A "Strategy for how to crawl/index frequently updated webpages?" and I understand a little of the theory, but I still don't know how to implement this with Scrapy. Can anybody give an example or some more detailed information?
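One way to approach the "only the newly added rows" part is to remember the newest "Add Date" seen in the previous crawl and keep only rows that are newer. The following is a minimal sketch, not a Scrapy spider: it assumes the rows have already been scraped into dictionaries, and the key name `add_date`, the `%Y-%m-%d` date format, and the helper names are all hypothetical (the real table may format dates differently).

```python
from datetime import date, datetime


def parse_add_date(text, fmt="%Y-%m-%d"):
    """Parse the text of an 'Add Date' cell. The format is an assumption."""
    return datetime.strptime(text.strip(), fmt).date()


def select_new_rows(rows, last_seen, date_key="add_date"):
    """Return the rows added after last_seen and the new high-water mark."""
    fresh = [r for r in rows if parse_add_date(r[date_key]) > last_seen]
    newest = max((parse_add_date(r[date_key]) for r in fresh), default=last_seen)
    return fresh, newest


# Hypothetical scraped rows; in practice these would come from the spider.
rows = [
    {"project": "A", "add_date": "2016-03-01"},
    {"project": "B", "add_date": "2016-03-05"},
    {"project": "C", "add_date": "2016-03-09"},
]

# last_seen would normally be loaded from a file or database between runs.
fresh, newest = select_new_rows(rows, date(2016, 3, 4))
print([r["project"] for r in fresh], newest)
```

Because the comparison is strict, rows added later on the same day as the stored date would be skipped. If that matters, comparing with `>=` and de-duplicating on a project ID is probably the safer variant.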
Related
I work with a company that outsources its website. I'm trying to retrieve data from the site without having to contact those who run it directly. The table data I'm trying to retrieve can be found here:
http://pointstreak.com/prostats/scoringleaders.html?leagueid=49&seasonid=5967
My approach so far has been to use Google Chrome's Developer Tools to find the source of the data, but when I filter the Network tab by XHR, only the info for the current games can be found. Is there any way to scrape this data (I have no idea how to do that; any resources or direction would be appreciated), or another way to get it? Am I missing it in the Developer Tools?
If I had to contact those who run the website, what exactly should I ask for? I'm trying to get JSON data that I can easily turn into my own UITableViewController.
Thank you.
Just load the page source and parse the html.
Depending on your usage, there may well be a copyright issue. The page has an explicit copyright notice, so you will need to obtain explicit permission for your use.
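As a rough illustration of "parse the html", here is a sketch that pulls table cells out of an HTML string using only Python's standard library. The class and function names and the sample markup are made up for the example, and it assumes the table is present in the page source with explicit closing tags. If the table is filled in by JavaScript after the page loads, the raw source will not contain it.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup standing in for the downloaded page source.
sample = """
<table>
  <tr><th>Player</th><th>Points</th></tr>
  <tr><td>Example Player</td><td>42</td></tr>
</table>
"""
rows = extract_rows(sample)
print(rows)
```

From there, turning each row into a dictionary and serialising it with the `json` module would give the JSON the question asks for.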
It must be the full URL, not just history or type. On statcounter.com I have seen stats showing where visitors to our site were referred from. I want to show the exact link, like StatCounter shows under a tab.
Does anyone know how to do this?
You will want to use Google Analytics.
Google will generate a unique JavaScript snippet for you; you then paste it into your HTML, which activates the service. It's very easy to get started, and the dashboard is very robust considering it's free. To get started, you can check this out: http://www.google.com/analytics/learn/setupchecklist.html
I am working on enhancing the search functionality of a website.
The current search works as follows:
1. Read all the rows from the database.
2. Find the keywords in each row and return the result.
The problem is that it is too slow: the backend has to prepare all the data, which means reading everything from several databases and putting it into HTML.
The solution that comes to my mind is:
Show partial search results (say 10), meaning that as soon as enough results have been found in the database, it stops reading and searching rows.
Once the user scrolls down the page, use Ajax to trigger another round of searching.
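The idea above can be sketched roughly as follows. This is only an illustration over an in-memory list: the function name `search_page` and its parameters are hypothetical, and a real backend would more likely push the filtering into the database (for example with an indexed or full-text query plus a limit) rather than scanning rows in application code.

```python
from itertools import islice


def search_page(rows, keyword, offset=0, page_size=10):
    """Scan rows from offset and stop as soon as page_size matches are found.

    Returns (matches, next_offset). next_offset is where the next request
    should resume, or None once the scan reached the end without filling
    a page.
    """
    needle = keyword.lower()
    matches = []
    position = offset
    for row in islice(rows, offset, None):
        position += 1
        if needle in row.lower():
            matches.append(row)
            if len(matches) == page_size:
                return matches, position
    return matches, None


# Made-up data: every even-numbered row mentions the keyword.
rows = [f"audi model {i}" if i % 2 == 0 else f"other {i}" for i in range(10)]

first, cursor = search_page(rows, "audi", offset=0, page_size=2)
second, cursor2 = search_page(rows, "audi", offset=cursor, page_size=2)
print(first, cursor)
print(second, cursor2)
```

The Ajax call triggered on scroll would send the returned offset back to the server, so each request only scans as far as it needs to fill one page.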
My questions are:
Is this a good (or even feasible) way to do it?
Are there any tutorials or sources I should look up?
I know it is a rather abstract question, but I need advice on this.
Thanks in advance.
Update from my research:
https://github.com/webcreate/infinite-ajax-scroll
This jQuery library can do the front-end job.
I have a website that contains, for example, a list of Audi models. Using Google Webmaster Tools, I saw that my website appears in Google search results for the word "audi", but the target page was the 22nd page of my result set, not the first. I need my first page to appear, not my last (or a middle) one, but I cannot tell Google that the page number is a parameter, because my URLs are rewritten using mod_rewrite. Any ideas?
BTW, I have read in an SEO forum that it's a bad idea to use a canonical tag. So is it really a bad idea in my case?
You can't force Google to do anything; however, they have made it easier to deal with pagination issues with a recent post on rel="next" and rel="prev".
But the primary problem you face is signalling to Google that your first (main) page is the starting point - this is achieved using internal link and back-link "juice" focussed on that page. You need to ensure that the first page of results is linked to properly from higher-value pages (like the home-page).
Google recently announced that you can use View All which will allow them to find and index entire articles that are normally broken up using pagination and display them all as one result.
I have a personal website, and I want to find out when the last post was made to it. Is there a way to find the last posted date on my blog?
In my application, I have a notification that I want to fire if we've made a 'News' post on our site, so that our users are aware of any issues. I figured the best way would be to check when the last post was made.
Anyone have any ideas?
Thanks!
Since WordPress supports the metaWeblog API, you could use the XML-RPC.NET library to create a client that communicates with your blog. You would use the metaWeblog.getRecentPosts method to get the most recent posts. You can find an example here.
http://www.pluralsight-training.net/community/blogs/aaron/archive/2008/08/19/programming-the-metaweblog-api-in-net-c.aspx
You might even be able to automate the login process, and scrape the post titles, comparing the first one to the one that was stored last. If they're different, it would indicate an update has been made.
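The comparison step itself is small. As a sketch (written in Python for brevity rather than .NET, with a hypothetical function name, and assuming the scraped titles are ordered newest first):

```python
def detect_new_post(titles, last_seen_title):
    """Compare the newest scraped title with the one stored last time.

    titles is assumed to be ordered newest first.
    Returns (is_new, title_to_store).
    """
    if not titles:
        # Nothing scraped (e.g. the request failed): report no change.
        return False, last_seen_title
    newest = titles[0]
    return newest != last_seen_title, newest


# Made-up titles standing in for what the scraper returned.
is_new, stored = detect_new_post(["Outage resolved", "Welcome"], "Welcome")
print(is_new, stored)
```

Note that comparing titles alone would miss a new post that reuses the previous title; comparing a post ID or publication date, if available, avoids that.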
Here's a method I came up with to automate the login part:
http://stateofidleness.com/2011/01/vbnet-automated-login-wordpress-site/
You could even connect to the MySQL database and query for the last entry date (probably easier).