Does anyone know of a tool or script that will crawl my website and count the number of headings on every page? I would like to know how many pages on my website have more than 4 headings (h1). I have Screaming Frog, but it only counts the first two H1 elements. Any help is appreciated.
My Xidel can do that, e.g.:
xidel http://stackoverflow.com/questions/14608312/seo-web-crawling-tool-to-count-number-of-headings-h1-h2-h3 -e 'concat($url, ": ", count(//h1))' -f '//a[matches(@href, "http://[^/]*stackoverflow.com/")]'
The XPath expression in the -e argument tells it to count the h1 tags, and the -f option tells it which pages to follow.
This is such a specific task that I would just recommend you write it yourself. The simplest thing you need is an XPath selector to give you the h1/h2/h3 tags.
Counting the headings:
1. Pick any one of your favorite programming languages.
2. Issue a web request for a page on your website (Ruby, Perl, PHP).
3. Parse the HTML.
4. Invoke the XPath heading selector and count the number of elements that it returns (a short sketch follows below).
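For example, steps 2 through 4 in Python might look like this (a minimal sketch, assuming the third-party requests and lxml packages; the URL is a placeholder):

    # Steps 2-4: fetch a page, parse it, count the h1 tags.
    import requests
    from lxml import html

    page = requests.get("http://www.example.com/")  # step 2: web request
    tree = html.fromstring(page.content)            # step 3: parse the HTML
    h1_count = int(tree.xpath("count(//h1)"))       # step 4: XPath count
    print(h1_count)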
Crawling your site:
Do steps 2 through 4 for all of your pages (you'll probably need a queue of pages you want to crawl). If you want to crawl all of the pages, it is just a little more complicated:
1. Crawl your home page.
2. Select all anchor tags.
3. Extract the URL from each href and discard any URLs that don't point to your website.
4. Perform a URL-seen test: if you have seen it before, then discard, otherwise queue for crawling.
URL-Seen test:
The URL-seen test is pretty simple: just add all the URLs you've seen so far to a hash map. If you run into a URL that is in your hash map, then you can ignore it. If it's not in the hash map, then add it to the crawl queue. The key for the hash map should be the URL and the value should be some kind of a structure that allows you to keep statistics for the headings:
Key = URL
Value = struct{ h1Count, h2Count, h3Count...}
That should be about it. I know it seems like a lot, but it shouldn't be more than a few hundred lines of code!
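Putting it all together, a rough sketch of the whole crawler in Python (again assuming requests and lxml; START_URL is a placeholder for your home page, and a plain dict stands in for the hash map):

    # Crawl one site, counting h1/h2/h3 headings on every page.
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from lxml import html

    START_URL = "http://www.example.com/"   # placeholder home page
    SITE_HOST = urlparse(START_URL).netloc

    stats = {}                  # key = URL, value = {"h1": n, "h2": n, "h3": n}
    queue = deque([START_URL])  # pages waiting to be crawled

    while queue:
        url = queue.popleft()
        if url in stats:        # URL-seen test: already crawled, ignore
            continue
        stats[url] = {"h1": 0, "h2": 0, "h3": 0}  # mark as seen

        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        tree = html.fromstring(page.content)

        # Count the headings on this page with the XPath selector.
        for tag in ("h1", "h2", "h3"):
            stats[url][tag] = int(tree.xpath("count(//%s)" % tag))

        # Select all anchors, keep same-site URLs, queue the unseen ones.
        for href in tree.xpath("//a/@href"):
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == SITE_HOST and link not in stats:
                queue.append(link)

    # For example: report every page with more than 4 h1 headings.
    for url, counts in stats.items():
        if counts["h1"] > 4:
            print(url, counts)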
I found a tool on CodeCanyon: Scrap(e) Website Analyser: http://codecanyon.net/item/scrap-website-analyzer/3789481.
As you will see from some of my comments, there was a small amount of configuration, but it is working well so far.
Thanks BeniBela, I will also look at your solution and report back.
You might use the XPather Chrome extension or similar, and the XPath query:
count(//*[self::h1 or self::h2 or self::h3])
Thanks to:
SEO/Web Crawling Tool to Count Number of Headings (H1, H2, H3...)
https://devhints.io/xpath
I am calling the Wikipedia API from Java using the following search query to generate redirected pages:
https://en.wikipedia.org//w/api.php?format=json&action=query&generator=allpages&gapfilterredir=redirects&prop=links&continue=&gapfrom=D
where the final 'D' is just an example continue-from value.
I am interested in only iterating over items in namespace:0. In fact, if I don't, the continue return value includes category pages, which break the next query iteration.
Thank you in advance.
The parameter you need from the Allpages API is
…&gapnamespace=0&…
but note that when you omit it, 0 is the default anyway.
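For example, the query from the question restricted to namespace 0 becomes:

    https://en.wikipedia.org/w/api.php?format=json&action=query&generator=allpages&gapfilterredir=redirects&gapnamespace=0&prop=links&continue=&gapfrom=D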
I have a Drupal 7 website that is running apachesolr search and is using faceting through the facetapi module.
When I use the facets to narrow my searches, everything works perfectly and I can see the filters being added to the search URL, so I can copy them as links (ready-made narrowed searches) elsewhere on the site.
Here is an example of how the apachesolr URL looks after I select several facets/filters:
search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100
where the 'search_keyword' portion is the text I'm searching for and '%3A' is just the URL-encoded ':' (colon).
Knowing this format, I can create any number of ready-made searches by creating the correct format for the URL. Perfect!
However, these filters are always ANDed, the same way they are when using the facet interface. Does anyone know if there is a syntax I can use, specifically in the search URL, to OR my filters/facets? Meaning, to make the result include all entries that contain EITHER of the two filters?
New edit:
I do know how to OR terms within one facet through the URL, im_field_tag_term1:(x OR y), but I need to know how to apply an OR condition between two facets.
Thanks in advance.
Many of the Wiktionary pages for Chinese Characters (Hanzi) include links at the top of the page to other similar-looking characters. I'd like to use the Wiktionary API to send a single character in the query and receive a list of similar characters as the response. Unfortunately, I can't seem to find any query that includes the "See Also" field. Is this kind of query possible?
The “see also” field is just a line of wiki code in the page source, and there is no way for the API to know that it's different from any other piece of text on the page.
If you are happy with using only the English version of Wiktionary, you can fetch the wikicode, index.php?title=太&action=raw, and then parse the result for the also template. In this case, the line you are looking for is {{also|大|犬}}.
To check if the template is used on the page at all, query the API for titles=太&prop=templates&tltemplates=Template:also
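For example, a minimal sketch of the raw-wikicode approach in Python (assuming the requests package; the see_also helper and the regex are my own, not part of any API):

    # Fetch the raw wikicode of a page and pull the characters
    # out of the {{also|...}} template, if it is present.
    import re
    import requests

    def see_also(title):
        resp = requests.get(
            "https://en.wiktionary.org/w/index.php",
            params={"title": title, "action": "raw"},
            timeout=10,
        )
        resp.raise_for_status()
        match = re.search(r"\{\{also\|([^}]+)\}\}", resp.text)
        return match.group(1).split("|") if match else []

    print(see_also("太"))  # expected: ['大', '犬']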
Similar templates are available in other language editions of Wiktionary, in case you want to use sources other than the English one. The current list is:
br:Patrom:gwelet
ca:Plantilla:vegeu
cs:Šablona:Viz
de:Vorlage:Siehe auch
el:Πρότυπο:δείτε
es:Plantilla:desambiguación
eu:Txantiloi:Esanahi desberdina
fi:Malline:katso
fr:Modèle:voir
gl:Modelo:homo
id:Templat:lihat
is:Snið:sjá einnig
it:Template:Vedi
ja:テンプレート:see
no:Mal:se også
oc:Modèl:veire
pl:Szablon:podobne
pt:Predefinição:ver também
ru:Шаблон:Cf
sk:Šablóna:See
sv:Mall:se även
It has been suggested that the WikiData project be expanded to cover Wiktionary. If and when that happens, you might be able to query the WikiData API for that kind of stuff!
I followed Tip 1 from this blog and created a crawl rule http://.*forms/allitems.aspx, then ran a full crawl. I no longer get the results with AllItems.aspx. However, if there is any document named Something.doc in a document library, it no longer gets pulled into the search results either.
I think what I want is pretty basic functionality: the user should not see AllItems.aspx in the search results, but should still get the items/documents whose names they enter in the search box.
Please let me know if I am missing anything. I have already put in 24 hours and googled as much as I could.
It seems that an Index Reset is required. Here are the steps I took:
1. Add the following crawl rule to exclude: *://*allitems.aspx.
2. Index Reset.
3. Full Crawl.
I could not find a good way to do this using crawl rules. Instead, I opted to set up a restriction on the search results web part.
In the search results web part properties, select "Change Query".
Add a property filter to exclude anything with "AllItems" (and any other exclusions you want in place).
Used Steve Mann's blog as a reference and for the images: http://stevemannspath.blogspot.com/2013/04/sharepoint-2013-search-removing-junk.html
On my website there is a recently uploaded images section.
In this section, all recently uploaded images are displayed in random order.
Using FirePath, I traced the XPath of that location:
//div[@id='udtkbdf50']/a/div[2]/div
On each page refresh the @id='udtkbdf50' value changes; the only thing in common is that the value always starts with 'u'.
So I want to use a pattern-matching technique (regular expressions or globbing patterns) for this value, while the rest of the path, i.e. /a/div[2]/div, remains the same.
//div[contains(@id,'u')]/a/div[2]/div will work.
UPDATE:
//div[starts-with(@id,'u')]/a/div[2]/div will be more specific.
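If you are driving this from Selenium, for example, using the starts-with expression might look like this (a sketch in Python; the driver setup and page URL are assumptions):

    # Locate the randomly-id'd div via the starts-with XPath.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://www.example.com/")  # placeholder for your page

    element = driver.find_element(
        By.XPATH, "//div[starts-with(@id,'u')]/a/div[2]/div")
    print(element.text)

    driver.quit()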
All the best.