I am using Apache Solr for my search, and I am indexing a variety of resources such as PDF and MS Word documents.
If, say, the user gives a query like "PDF: java", I want to search only the PDF files.
Any ideas?
Thanks,
Dilip.
Well, like I commented: set up a filetype field (a string) in your schema and set it when you upload the file.
http://localhost:8983/solr/update/extract?literal.id=123&literal.filetype=pdf
and when you search
http://localhost:8983/solr/select?q=text:electrical design AND filetype:pdf
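If you are indexing and querying from a script, the same two steps look roughly like this. A minimal sketch in Python using the requests library; the core URL, the id, and the filetype field are taken from the URLs above, and I've put the filetype restriction in a filter query (fq) rather than ANDing it into q so Solr can cache it:

import requests

# Index a PDF through the extracting request handler, stamping the filetype
# (same parameters as the literal.id/literal.filetype URL above).
with open("manual.pdf", "rb") as f:  # hypothetical file
    requests.post(
        "http://localhost:8983/solr/update/extract",
        params={"literal.id": "123", "literal.filetype": "pdf", "commit": "true"},
        files={"file": f},
    )

# Search, restricting results to PDFs with a filter query.
resp = requests.get(
    "http://localhost:8983/solr/select",
    params={"q": "text:electrical design", "fq": "filetype:pdf", "wt": "json"},
)
print(resp.json()["response"]["numFound"])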
Quick hack: if your documents are identified by filename, you can tell Solr to limit results to those ending in *.pdf with
http://localhost:8983/solr/select?q=text:electrical design AND id:*.pdf
According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central. However, can I use the NCBI E-utilities to download all full-text papers in the PMC database using Efetch, or at least find all the corresponding PMCids using Esearch in the Entrez Programming Utilities? If yes, then how? If the E-utilities cannot be used, is there any other way to download all full-text articles?
First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.
If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.
Using Biopython, this gives us:
from Bio import Entrez

# NCBI asks you to identify yourself with an email address
Entrez.email = "your.email@example.com"

search_query = 'medline[sb] AND "open access"[filter]'
# getting search results for the query
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))
You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code.
Now that you've done the search, you can actually download the files:
handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0,
                       retmax=int(search_results["Count"]),
                       webenv=search_results["WebEnv"],
                       query_key=search_results["QueryKey"])
You might want to download in batches by varying retstart and retmax in a loop, as sketched below, in order to avoid flooding the servers.
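For example, something along these lines (a sketch continuing from search_results above; the batch size and the pause are arbitrary choices, not NCBI-mandated values):

import time

batch_size = 100  # arbitrary
count = int(search_results["Count"])

for retstart in range(0, count, batch_size):
    handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml",
                           retstart=retstart, retmax=batch_size,
                           webenv=search_results["WebEnv"],
                           query_key=search_results["QueryKey"])
    # write each batch to its own file
    with open(f"pmc_batch_{retstart}.xml", "w", encoding="utf-8") as out:
        out.write(handle.read())
    handle.close()
    time.sleep(1)  # be gentle with the servers, per the usage guidelines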
If handle contains only one file, handle.read() contains the whole XML file as a string. If it contains more, the articles are contained in <article></article> nodes.
The full text is only available in XML, and the default parser available in PubMed doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or another parser) to parse your XML.
Here, the articles are found thanks to the internal history of the E-utilities, which is accessed with the webenv and query_key arguments and enabled by the usehistory="y" argument in Entrez.esearch().
A few tips about XML parsing with ElementTree:
- You can't delete a grandchild node, so you're probably going to want to delete some nodes recursively.
- node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.
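In code, the two tips look something like this (a sketch, assuming the <article> nodes and batch files described above; the tags being deleted are made-up examples):

import xml.etree.ElementTree as ET

root = ET.parse("pmc_batch_0.xml").getroot()  # file from the download step

for article in root.iter("article"):
    # You can only remove direct children, so walk every element and
    # remove unwanted nodes from their actual parent.
    for parent in list(article.iter()):
        for child in list(parent):
            if child.tag in ("table-wrap", "fig"):  # hypothetical tags to drop
                parent.remove(child)
    # article.find("body").text would stop at the first child element;
    # itertext() walks the whole subtree instead.
    body = article.find("body")
    if body is not None:
        full_text = "".join(body.itertext())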
According to one of the answered questions by the NCBI Help Desk, we cannot "bulk-download" PubMed Central.
https://www.nlm.nih.gov/bsd/medline.html + https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ will get you a good portion of it (I don't know the percentage). It will indeed miss the PMC full-text articles whose license doesn't allow redistribution, as explained at https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.
I have a Drupal 7 website that is running apachesolr search and is using faceting through the facetapi module.
When I use the facets to narrow my searches, everything works perfectly and I can see the filters being added to the search URL, so I can copy them as links (ready-made narrowed searches) elsewhere on the site.
Here is an example of how the apachesolr URL looks after I select several facets/filters:
search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100
Where the 'search_keyword' portion is the text I'm searching for and '%3A' is just the URL-encoded ':' (colon).
Knowing this format, I can create any number of ready-made searches by creating the correct format for the URL. Perfect!
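As a throwaway illustration, this Python sketch rebuilds the example URL above (the base path and field names are just the placeholders from the example):

from urllib.parse import quote

base = "search_url/search_keyword"  # placeholder from the example
filters = ["im_field_tag_term1:1", "im_field_tag_term2:100"]
query = "&".join(f"f[{i}]={quote(f)}" for i, f in enumerate(filters))
print(f"{base}?{query}")
# search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100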
However, these filters are always ANDed, the same way they are when using the facet interface. Does anyone know if there is a syntax I can use, specifically in the search URL, to OR my filters/facets? Meaning, to make the result contain all entries that match EITHER of the two filters?
New edit:
I do know how to OR terms for one facet through the URL, im_field_tag_term1:(x OR y), but I need to know how to apply an OR condition between two facets.
Thanks in advance.
I followed this blog (Tips 1), created a crawl rule http://.*forms/allitems.aspx, and ran a full crawl. I no longer get the results with AllItems.aspx. However, if there is any document named Something.doc in a Document Library, it no longer gets pulled into the search results.
I think what I desire is basic functionality: the user should not see AllItems.aspx in the search results but should get the item/document whose name was entered in the search box.
Please let me know if I am missing anything. I have already put in 24 hours... googled the max I could.
It seems that an Index Reset is required. Here are the steps I took:
1. Add the following crawl rule to exclude: *://*allitems.aspx.
2. Index Reset.
3. Full Crawl.
I could not find a good way to do this using crawl rules. Instead, I opted to set up a restriction on the search results web part.
In the search results web part properties, select "Change Query".
Add a property filter to exclude anything with "AllItems" (and any other exclusions you want in place).
I used Steve Mann's blog as a reference and for the images: http://stevemannspath.blogspot.com/2013/04/sharepoint-2013-search-removing-junk.html
I would like to index files in Solr.
I have already made an "output script" with PHP, but my project leader has given me the task of displaying the page number of the found text.
So:
- I am searching for the Word "Foo".
- Solr returns the results and also the highlighted text.
- Now I would like to know on which page this highlighted text appears, so the user can find it.
The files are *.pdf files.
One solution I have thought of would be to import the text of the PDF files into different fields, or maybe into this one multivalued field named "content".
Maybe like this, as JSON:
"content": {
  "1": "page one text",
  "2": "page two text",
  ...
}
Is this possible? Or is there a better way to find this information out? Thanks for your help! :-)
You need to create a separate Solr document for every page of every PDF file. If you want to return only one result per file, then you can use FieldCollapsing to group all the results from the same PDF file.
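A rough sketch of both halves in Python (the book_id, page, and content field names are assumptions, and recent Solr versions accept a JSON array of documents on /update):

import requests

SOLR = "http://localhost:8983/solr"

# One Solr document per PDF page (page texts extracted beforehand,
# e.g. by the PHP output script).
pages = ["page one text", "page two text"]
docs = [{"id": f"mybook_p{n}", "book_id": "mybook", "page": n, "content": text}
        for n, text in enumerate(pages, start=1)]
requests.post(f"{SOLR}/update?commit=true", json=docs)

# Search with highlighting, grouping results so each PDF appears once.
resp = requests.get(f"{SOLR}/select", params={
    "q": "content:Foo",
    "hl": "true", "hl.fl": "content",
    "group": "true", "group.field": "book_id",  # FieldCollapsing
    "fl": "id,book_id,page",
    "wt": "json",
})
print(resp.json())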
In Alfresco I need to write a Lucene query in such a way that it excludes the XML tags from the content while searching.
For example, if a file try.xml is searched against its content, my search should not match the XML tags.
try.xml
<sample>This is an example</sample>
If I give the search text as "sample", it should not return the file named "try.xml".
So how can I achieve this?
Edit
I have tried the query below, with no change:
#cm\:name:"try*" -TEXT:"<*>" +TEXT:"sample"
What's wrong with the above query? I just tried to get file names starting with "try" while eliminating the text inside tags and searching for the text "sample".
By default Alfresco treats XML files as plain text and indexes the XML tags as words; that's why they can be found via full-text search. XML content is handled by the StringExtractingContentTransformer in Alfresco, which converts text/xml to text/plain before indexing it.
To check which transformers are registered in your Alfresco installation you can check
http://localhost:8080/alfresco/service/mimetypes?mimetype=text/xml#text/xml
To prevent the indexing of XML tags you have to write a special transformer which strips them out. See http://wiki.alfresco.com/wiki/Content_Transformations for an introduction to content transformation with Alfresco. The easiest way would be to integrate a command-line utility that converts the XML file into text (as sketched below), or you could implement a Java class which does the transformation.
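As an illustration of the command-line route, a tag-stripping converter could be as small as this (a sketch of the utility itself, not of the Alfresco transformer wiring):

import sys
import xml.etree.ElementTree as ET

def xml_to_text(src, dst):
    root = ET.parse(src).getroot()
    # itertext() yields only the text content, so tag names like <sample>
    # never make it into the output that gets indexed.
    text = " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())
    with open(dst, "w", encoding="utf-8") as out:
        out.write(text)

if __name__ == "__main__":
    xml_to_text(sys.argv[1], sys.argv[2])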
There's no standard way to do what you need; here's an excerpt of the official documentation:
Wildcard queries using * and ? are supported as terms and phrases. For tokenized fields the pattern match cannot be exact, as all the non-token characters (whitespace, punctuation, etc.) will have been lost and treated as equal.
Basically, angle brackets are stripped out by default. You need to hack the indexing and query-parsing processes in order to enable the behavior you want.
Could you not just exclude the XML mimetype? (See http://wiki.alfresco.com/wiki/Search#Finding_nodes_by_content_mimetype for the syntax.)
I guess you might want to exclude HTML too (so you'd exclude text/html and text/xml); that'd prevent you from getting any nodes in your results that contain XML tags.
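Something along these lines should do it (I haven't verified the exact property syntax, so check the wiki page above):
-@cm\:content.mimetype:"text/xml" -@cm\:content.mimetype:"text/html" +TEXT:"sample"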