Bing - how to index more than 10 URLs per day? - seo

I've started working with Bing, and it appears to only allow 10 URLs to be submitted for indexing per day, entered in newline-delimited format in a web form once logged in to an account.
I'm thinking of using a Google sitemap.xml file as a starting point and a cron job to periodically submit all the URLs over time, but it seems like there has to be a better way.

It looks like Bing supports sitemap files, so you might not need to submit the URLs manually.
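For example, if you already have the full list of URLs, a small script run from cron could regenerate a standard sitemap.xml that Bing can pick up. This is a minimal sketch (the file names are arbitrary, and it is not Bing-specific code):

    # Rebuild sitemap.xml from a plain-text list of URLs (one per line).
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def write_sitemap(url_file="urls.txt", out_file="sitemap.xml"):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        with open(url_file) as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
                loc.text = url
        ET.ElementTree(urlset).write(out_file, encoding="utf-8", xml_declaration=True)

    if __name__ == "__main__":
        write_sitemap()

You would then reference the generated file from your robots.txt (a Sitemap: line) or submit it once through the Bing webmaster interface, and let the crawler re-read it on its own schedule.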

Related

How does Google index web chats that load messages dynamically via XHR or WebSocket?

Why am I able to Google messages in (for example) gitter.im? How did Google index all of this: https://gitter.im/neoclide/coc.nvim?at=5ea00cdda3612210839689f1 ?
Does gitter.im return its content to Google in another format, or via some specific interface/protocol declared in a special section for web crawlers somewhere? Did Google spend resources on development to build a gitter.im-specific crawler that is able to make specific XHR requests?
Simple:
Google requests https://gitter.im/gitter/developers.
There are N recent messages embedded in the HTML already, say 50. Google then just extracts all the links from that HTML (from the time tags like "18:15", for example). Each time tag yields a URL of the form https://gitter.im/gitter/developers?at=610011abc9f8852a970e808e, and Google doesn't care why. It just remembers the URLs.
Google then fetches those 50 grabbed URLs of the form https://gitter.im/gitter/developers?at=610011abc9f8852a970e808e.
Each such URL returns ~50 messages around that exact message. So the search engine thinks: "OK, this URL gives you THIS text."
So when you search for THIS text, it just gives you the URL closest to that text, or maybe just any URL containing that text...
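A minimal sketch of that link-extraction step (assuming the permalinks appear as ordinary <a href="...?at=..."> anchors in the served HTML; the URL pattern here is illustrative, not Gitter's actual markup):

    import re
    import urllib.request
    from urllib.parse import urljoin

    def extract_permalinks(page_url):
        # Fetch the room page and pull out every "?at=<message id>" permalink.
        html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
        hrefs = re.findall(r'href="([^"]*\?at=[0-9a-fA-F]+)"', html)
        # Resolve relative links against the page URL.
        return {urljoin(page_url, h) for h in hrefs}

    permalinks = extract_permalinks("https://gitter.im/gitter/developers")

Each permalink would then be fetched in turn, and the ~50 messages of context it serves are what the search engine ends up associating with that URL.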

Apache Nutch 2.3.1 Website home page handling

I have configured Nutch 2.3.1 to crawl some news websites. Since the websites' homepages change from day to day, I want to handle the homepage differently, so that for the homepage only the main categories are crawled instead of the text, as the text will change after some time (I have observed similar behaviour in Google).
For the rest of the pages it's working fine (crawling text etc.).
At the moment Nutch doesn't offer any special treatment for homepages; a homepage is just one more URL to crawl. If you want to do this you'll probably need to customise some portions of Nutch.
If you're collecting a fixed set of URLs (the ones you usually put in the seed file), you can attach some metadata to these URLs and use a different strategy for them, for instance setting a really high score and a short fetch interval (https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/crawl/InjectorJob.java#L56-L59).
Since the generator job will sort the URLs by score, this should work as long as all other URLs have a score lower than the value that you use for the seed URLs. Keep in mind that this will cause Nutch to crawl these URLs every time a new cycle starts (since the seed URLs are going to be at the top all the time).
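For instance, a seed file using that metadata might look like the following sketch (the URLs and values are illustrative; the fields after each URL are tab-separated key=value pairs that the injector reads, using the nutch.score and nutch.fetchInterval keys defined in the InjectorJob linked above):

    http://example.com/	nutch.score=100	nutch.fetchInterval=3600
    http://news.example.org/	nutch.score=100	nutch.fetchInterval=3600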
If you discover new homepages during your normal crawl cycle, it gets trickier, because Nutch doesn't have any way of detecting whether a given URL is a homepage or not. In this case you'll need to check whether the current URL is a homepage, and if it is, modify its score/fetch interval to ensure that it ends up among the top-ranked URLs.
This workaround could potentially cause some issues: Nutch could end up always crawling only the homepages and never the rest of the URLs, which is not a good outcome.
You could also write your own generator; that way you have more control and don't rely on the score and fetch interval alone.
Full disclosure: although I've used a similar approach in the past, we ended up changing this system to use StormCrawler (we were building a news search engine), since we needed more control over when pages were fetched (the batch nature of Nutch is not a great fit for this use case), along with some other business cases that needed a more NRT approach.

Incrementally crawl a website with Scrapy

I am new to crawling and would like to know whether it's possible to use Scrapy to crawl a site, like CNBC.com, incrementally. For example, if today I crawled all pages from a site, then from tomorrow on I only want to collect pages that are newly posted to the site, to avoid crawling all the old pages again.
Thank you for any info or input on this.
Yes you can, and it's actually quite easy. Every news website has a few very important index pages, like the homepage and the category pages (e.g. politics, entertainment, etc.). Every article passes through these pages and stays there for at least a few minutes. Scan those pages every minute or so and save just the links. Then do a diff against what you already have in your database, and a few times a day issue a crawl to scrape all the missing links. Very standard practice.
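A minimal Scrapy sketch of that approach (the start URLs, the link selector, and the in-memory seen set are placeholders; in practice you would load and persist the seen links in your database):

    import scrapy

    class IndexDiffSpider(scrapy.Spider):
        """Scan a few index pages and follow only article links not seen before."""
        name = "index_diff"
        start_urls = [
            "https://www.cnbc.com/",           # homepage
            "https://www.cnbc.com/politics/",  # a category page
        ]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.seen = set()  # replace with a lookup against your database

        def parse(self, response):
            # The selector is generic; adapt it to the site's markup.
            for href in response.css("a::attr(href)").extract():
                url = response.urljoin(href)
                if url not in self.seen:
                    self.seen.add(url)
                    yield scrapy.Request(url, callback=self.parse_article)

        def parse_article(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").extract_first(),
            }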
Please try the Scrapy plugin scrapy-deltafetch, which should make your life easier.
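If I recall its setup correctly, it is enabled with a couple of settings in settings.py (a sketch; check the plugin's documentation for the exact middleware order and options):

    # settings.py -- skip requests for pages that already produced items
    # in a previous run.
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True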
Short answer: no.
Longer answer: what you could do is write the article ID or the article URL to a file, and during scraping match the ID or URL against the records in that file.
Remember to load your file only once and assign it to a variable; don't reload it on every iteration while scraping.
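A small sketch of that bookkeeping (the file name is arbitrary): the file of already-scraped URLs is loaded once at start-up, checked during scraping, and appended to as new articles are collected.

    SEEN_FILE = "seen_urls.txt"

    def load_seen(path=SEEN_FILE):
        # Load the file once at start-up; returns an empty set on the first run.
        try:
            with open(path) as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    seen = load_seen()  # kept in memory for the whole run

    def is_new(url):
        return url not in seen

    def mark_seen(url, path=SEEN_FILE):
        seen.add(url)
        with open(path, "a") as f:
            f.write(url + "\n")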

How does Google see content that is updated daily?

I have a website with content that is updated daily. I have two questions:
How does Google see this content? Do I need SEO in this case?
Does a 404 error page have an influence on ranking in the search engine? (I do not have a static page.)
So that Google can "know" the content is supposed to be updated daily, it may be useful (if you don't already do this) to implement a sitemap (and update it dynamically, if necessary). In this sitemap, you can specify the update period for each page.
This is not a constraint for Google, but it can help adjust how frequently the indexing robots visit.
If you do this, you must be "honest" with Google about update times. If Google realizes that the frequency defined in the sitemap does not correspond to the actual frequency, it can be bad for your rankings.
404 errors (and other HTTP errors) can indeed have an indirect adverse effect on the ranking of the site. Of course, if the robot cannot access content at a given moment, that content cannot be indexed. Moreover, if too many problems are encountered when web crawlers visit your site, Google will adjust the frequency of visits downward.
You can get some personalized advice and monitor the indexing of your site using Google Webmaster Tools (and, to a lesser extent, Analytics or any other tool that can monitor web crawler visits).
You can see the date and time when Google last visited your page, so you can tell whether Google has picked up your updated content or not. If you have a website with content updated daily, you can ping your website to many search engines and also submit your site to Google.
You can make a sitemap containing only the URLs whose content is updated daily and submit it to Google Webmaster Tools. You can give the date and time a URL was last modified in the <lastmod> tag, hint at how frequently the page is likely to change in the <changefreq> tag, and set a high priority for the pages that are modified daily in the <priority> tag.
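A sketch of such a sitemap entry (the URL, date and values are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/news/</loc>
        <lastmod>2016-05-10</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>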
If you have 404 (file not found) error pages, put them in one directory and exclude that directory in your robots.txt file. Then Google will not crawl those pages, they will automatically not be indexed, and they will not influence your SERP ranking.
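For example, assuming those pages live under a hypothetical /errors/ directory:

    User-agent: *
    Disallow: /errors/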

Count the number of pages in a site

I'd like to know how many public pages there are on a site, say for example smashingmagzine.com. Is there a way to count the number of pages?
You can query Google's index using the site operator, e.g.:
site:domain-to-query.com
This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality but I don't know the syntax off hand.
Of course not all pages may be indexed, and the index may contain pages which no longer exist.
You basically need to crawl the site. Your process would be something like:
Start at root domain / homepage
Look for all links that point within the same domain
For each of those links, repeat the steps
Your loop terminates when there are no more uncrawled links pointing within the same domain. Remember to stay within the site, otherwise you'll start crawling external sites.
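A rough sketch of that loop, using only the standard library (the start URL is an example, and a real crawler would also need politeness delays, robots.txt handling, and better HTML parsing than a regex):

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def count_pages(start_url):
        domain = urlparse(start_url).netloc
        seen, queue = {start_url}, deque([start_url])
        while queue:
            url = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip pages that fail to download
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)
                # Stay on the same domain and avoid revisiting pages.
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return len(seen)

    print(count_pages("https://www.smashingmagazine.com/"))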
You can also try parsing the sitemap if they provide one.
One tool that might prove useful is JSpider if you're using Java, or Sphider in PHP.
You'll need to recursively scan the markup of each page, starting with your top-level page, looking for any kind of links to other pages, and recursively crawl through them. You'll also need to keep track of what has already been scanned so as not to get caught in an infinite loop.