SharePoint crawl rule to exclude AllItems.aspx, but get an item/document in search results if queried in the search box - sharepoint-2010

I followed this blog (Tips 1) and created a crawl rule http://.*forms/allitems.aspx and ran a full crawl. I no longer get the results with AllItems.aspx. However, if there is any document named Something.doc in a Document Library, it no longer gets pulled into the search results.
I think what I want is basic functionality: the user should not see AllItems.aspx in the search results, but should still get the item/document whose name they entered in the search box.
Please let me know if I am missing anything. I have already put in 24 hours and googled as much as I could.

It seems that an Index Reset is required. Here are the steps I followed:
1. Add the following crawl rule to exclude: *://*allitems.aspx.
2. Index Reset.
3. Full Crawl.

I could not find a good way to do this using crawl rules. Instead, I opted to set up a restriction on the search results web part.
In the search results web part properties, select "Change Query".
Add a property filter to exclude anything with "AllItems" (and any other exclusions you want in place).
I used Steve Mann's blog as a reference and for the images: http://stevemannspath.blogspot.com/2013/04/sharepoint-2013-search-removing-junk.html
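For reference, the exclusion ends up as a simple negative term appended to the query text in that dialog, along these lines (a sketch only; the exact managed property to filter on depends on your search schema):
{searchTerms} -Title:"AllItems"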


Downloading all full-text articles in PMC and PubMed databases

According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central. However, can I use the NCBI E-utilities to download all full-text papers in the PMC database using Efetch, or at least find all corresponding PMCIDs using Esearch in the Entrez Programming Utilities? If yes, then how? If the E-utilities cannot be used, is there any other way to download all full-text articles?
First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.
If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.
Using Biopython, this gives us:
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address with every request

search_query = 'medline[sb] AND "open access"[filter]'
# getting search results for the query; usehistory="y" stores them on the Entrez history server
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))
You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code.
Now that you've done the search, you can actually download the files:
# fetch everything the search matched, using the WebEnv/QueryKey references from the search
handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0, retmax=int(search_results["Count"]), webenv=search_results["WebEnv"], query_key=search_results["QueryKey"])
You might want to download in batches, by turning retstart and retmax into loop variables, in order to avoid flooding the servers; a sketch of that loop follows.
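A minimal sketch of such a batched download, reusing search_results from above (the batch size, file names, and sleep interval are arbitrary choices):
import time

batch_size = 100
total = int(search_results["Count"])
for start in range(0, total, batch_size):
    # each request pulls one batch out of the stored result set
    handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml",
                           retstart=start, retmax=batch_size,
                           webenv=search_results["WebEnv"],
                           query_key=search_results["QueryKey"])
    with open("pmc_batch_%d.xml" % (start // batch_size), "w") as out:
        out.write(handle.read())
    handle.close()
    time.sleep(1)  # stay well under the E-utilities rate limits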
If handle contains only one file, handle.read() contains the whole XML file as a string. If it contains more, the articles are contained in <article></article> nodes.
The full text is only available in XML, and the default Entrez parser doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or another parser) to parse your XML.
Here, the articles are found thanks to the internal history of the E-utilities, which is accessed with the webenv argument and enabled by the usehistory="y" argument in Entrez.esearch().
A few tips about XML parsing with ElementTree: you can't delete a grandchild node directly, so you're probably going to want to delete some nodes recursively. Also, node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.
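Both tips in one short sketch (assuming handle holds the efetch result from above; table-wrap is just a hypothetical node to strip):
import xml.etree.ElementTree as ET

root = ET.fromstring(handle.read())  # the efetch result from above
for article in root.iter("article"):
    # itertext() collects the text of all descendants, not just up to the first child
    full_text = "".join(article.itertext())

    # ElementTree can only remove direct children, so walk every element
    # and remove matching children from their own parent
    for parent in list(article.iter()):
        for child in list(parent):
            if child.tag == "table-wrap":  # hypothetical: strip tables
                parent.remove(child)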
According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central.
The MEDLINE downloads (https://www.nlm.nih.gov/bsd/medline.html) plus the PMC Open Access subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) will get you a good portion of it (I don't know the percentage). It will indeed miss the PMC full-text articles whose license doesn't allow redistribution, as explained on https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.

Drupal 7 Apache solr faceted search with OR condition on two fields instead of drill down/AND

I have a Drupal 7 website that is running apachesolr search and is using faceting through the facetapi module.
When I use the facets to narrow my searches, everything works perfectly and I can see the filters being added to the search URL, so I can copy them as links (ready-made narrowed searches) elsewhere on the site.
Here is an example of how the apachesolr URL looks after I select several facets/filters:
search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100
Where the 'search_keyword' portion is the text I'm searching for and '%3A' is just the URL-encoded ':' (colon).
Knowing this format, I can create any number of ready-made searches by creating the correct format for the URL. Perfect!
However, these filters are always ANDed, the same way they are when using the facet interface. Does anyone know if there is a syntax I can use, specifically in the search URL, to OR my filters/facets? Meaning, to make it such that the result is all entries that match EITHER of the two filters?
Edit:
I do know how to OR terms within one facet through the URL, im_field_tag_term1:(x OR y), but I need to know how to apply an OR condition between two facets.
Thanks in advance.

How to specify multiple values on siteSearch in google custom search api?

I'm using the Google Custom Search API and want to create a search using siteSearch:
https://www.googleapis.com/customsearch/v1?key=k&cx=cx&q=cocos2d&siteSearch=www.cocos2d-iphone.org&siteSearchFilter=i
and it works fine (returns all the results only from the given site).
Then I wanted to specify TWO sites to search, so I tried changing:
siteSearch=www.cocos2d-iphone.org
to
siteSearch=www.cocos2d-iphone.org www.XXXXXXXX.org
siteSearch=www.cocos2d-iphone.org|www.XXXXXXXX.org
siteSearch=www.cocos2d-iphone.org||www.XXXXXXXX.org
but none of these works.
Hope someone can help here, thanks :)
Currently I don't believe you can specify more than one site through the query param siteSearch.
Nevertheless, you can configure your Custom Search Engine here: https://www.google.com/cse/manage/all
in the "Sites to search" area.
This also works for excluding, as you can read here: https://support.google.com/customsearch/bin/answer.py?hl=en&answer=2631038&topic=2601037&ctx=topic
You cannot do this with the as_sitesearch parameter as that only accepts a single value. But you can achieve what you want with the as_q parameter, setting it to some value like: "site:google.com OR site:microsoft.com" - that will work in a similar way to this search.
The as_q parameter is documented here as:
The as_q parameter provides search terms to check for in a document.
This parameter is also commonly used to allow users to specify
additional terms to search for within a set of search results.
Example: q=president&as_q=John+Adams
Use "space" as seperator
Below is sample PHP code which works for me
$url="https://www.googleapis.com/customsearch/v1?key=k&cx=cx&q=cocos2d&siteSearch=".urlencode("www.cocos2d-iphone.org www.XXXXXXXX.org")."&siteSearchFilter=i"
Thanks,
Ojal Suthar

SEO/Web Crawling Tool to Count Number of Headings (H1, H2, H3...)

Does anyone know of a tool or script that will crawl my website and count the number of headings on every page within my website? I would like to know how many pages in my website have more than 4 headings (h1). I have Screaming Frog, but it only counts the first two H1 elements. Any help is appreciated.
My Xidel can do that, e.g.:
xidel http://stackoverflow.com/questions/14608312/seo-web-crawling-tool-to-count-number-of-headings-h1-h2-h3 -e 'concat($url, ": ", count(//h1))' -f '//a[matches(@href, "http://[^/]*stackoverflow.com/")]'
The XPath expression in the -e argument tells it to count the h1 tags, and the -f option tells it which pages to follow.
This is such a specific task that I would just recommend you write it yourself. The simplest thing you need is an XPATH selector to give you the h1/h2/h3 tags.
Counting the headings:
Pick any one of your favorite programming languages.
Issue a web request for a page on your website (Ruby, Perl, PHP).
Parse the HTML.
Invoke the XPATH heading selector and count the number of elements that it returns.
Crawling your site:
Do steps 2 through 4 for all of your pages (you'll probably have to have a queue of pages that you want to crawl). If you want to crawl all of the pages, then it is just a little more complicated:
Crawl your home page.
Select all anchor tags.
Extract the URL from each href and discard any URLs that don't point to your website.
Perform a URL-seen test: if you have seen it before, then discard, otherwise queue for crawling.
URL-Seen test:
The URL-seen test is pretty simple: just add all the URLs you've seen so far to a hash map. If you run into a URL that is in your hash map, then you can ignore it. If it's not in the hash map, then add it to the crawl queue. The key for the hash map should be the URL and the value should be some kind of a structure that allows you to keep statistics for the headings:
Key = URL
Value = struct{ h1Count, h2Count, h3Count...}
That should be about it. I know it seems like a lot, but it shouldn't be more than a few hundred lines of code!
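As a starting point, here is a minimal sketch of those steps in Python (assuming the requests and lxml packages are installed; yoursite.example is a placeholder for your domain):
import requests
from lxml import html
from collections import deque
from urllib.parse import urljoin, urlparse

start = "https://yoursite.example/"  # placeholder: your home page
domain = urlparse(start).netloc

seen = {}  # URL-seen map: url -> heading counts (the hash map described above)
queue = deque([start])

while queue:
    url = queue.popleft()
    if url in seen:
        continue  # URL-seen test
    tree = html.fromstring(requests.get(url).content)
    # steps 3 and 4: parse the HTML and count headings via the XPath selector
    seen[url] = {tag: len(tree.xpath("//" + tag)) for tag in ("h1", "h2", "h3")}
    # crawl: select anchors, resolve them, discard off-site URLs, queue unseen ones
    for href in tree.xpath("//a/@href"):
        link = urljoin(url, href).split("#")[0]
        if urlparse(link).netloc == domain and link not in seen:
            queue.append(link)

# the original goal: pages with more than 4 h1 headings
for url, counts in seen.items():
    if counts["h1"] > 4:
        print(url, counts)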
I found a tool in Code Canyon: Scrap(e) Website Analyser: http://codecanyon.net/item/scrap-website-analyzer/3789481.
As you will see from some of my comments, there was a small amount of configuration, but it is working well so far.
Thanks BeniBela, I will also look at your solution and report back.
You might use the xPather Chrome extension or similar, with the XPath query:
count(//*[self::h1 or self::h2 or self::h3])
Thanks to:
SEO/Web Crawling Tool to Count Number of Headings (H1, H2, H3...)
https://devhints.io/xpath
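If you'd rather evaluate that expression from a script instead of a browser extension, lxml can run the same XPath (a sketch; page_source stands in for HTML you've already fetched):
from lxml import html

tree = html.fromstring(page_source)  # page_source: the page's HTML as a string
# XPath count() returns a float, so cast it
heading_count = int(tree.xpath("count(//*[self::h1 or self::h2 or self::h3])"))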

Google Custom Search API, Howto return country specific results only

I am making some PHP code which takes a given search phrase and URL and searches through the Google search results until it finds the URL (only the first 100 results). My problem is that this only works for the US. I have tried adding the "&cr=" option, but it still only returns US results.
The full URL I am using for the request is:
https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=CX_VALUE&q=KEYWORD&cr=COUNTRY&alt=JSON
Does anyone have any experience with this? I want to be able to see UK results. I tried inserting &cr=countryUK, but it still only returns US results.
Thanks :)
Regards,
Stian
Use the gl=<country code> param to limit it to your country of choice (so gl=gb for the UK).
More info here:
http://googleajaxsearchapi.blogspot.com/2009/10/web-search-in-your-country.html
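As a quick illustration in Python (a sketch with the requests library; API_KEY, CX_VALUE, and KEYWORD are the placeholders from the question):
import requests

params = {
    "key": "API_KEY",
    "cx": "CX_VALUE",
    "q": "KEYWORD",
    "gl": "gb",  # boost results for the UK
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
print(response.json())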