I have tried going through the Azure Data Lake documentation on MSDN as well as a couple of slide decks on SlideShare to figure out an answer. From what I gathered, Azure Data Catalog is used for discoverability based on metadata and a few annotations the user can provide. Wouldn't having content-based search add more value to the lake?
Content search and full-text search on data in the Data Lake can be very useful indeed.
I would expect that you could use either HDInsight or U-SQL's extensibility mechanism to add content search (and indexing) with something like Lucene or Solr.
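Nothing like this ships out of the box today, but as a rough illustration of the idea, here is a minimal sketch that builds a content index over files already copied out of the lake, using Whoosh, a pure-Python, Lucene-style library; the file paths, field names, and query are purely illustrative:

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Hypothetical local copies of files pulled down from the Data Lake Store.
SOURCE_FILES = ["./staging/sales_2016.csv", "./staging/notes.txt"]

# One field for the file path, one for the full text content.
schema = Schema(path=ID(stored=True, unique=True), content=TEXT)

os.makedirs("content_index", exist_ok=True)
ix = create_in("content_index", schema)

writer = ix.writer()
for path in SOURCE_FILES:
    with open(path, encoding="utf-8", errors="ignore") as f:
        writer.add_document(path=path, content=f.read())
writer.commit()

# Full-text query over the indexed content.
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("quarterly revenue")
    for hit in searcher.search(query, limit=10):
        print(hit["path"])
```

In practice the indexing step would run as an HDInsight job or a U-SQL custom processor directly against files in the lake rather than over local copies.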
If you would like to see something out of the box, please file a feature request at http://aka.ms/adlfeedback. Thanks!
Is there any standard data source available for FM 'CRM_ORDER_READ'?
I want to pull CRM data into BW using a standard data source.
Thanks.
There are a bunch of CRM order-related extractors (search RSA5), but you need a clear idea of what type of order information and schema you expect.
If you actually want full control like you have with CRM_ORDER_READ, your best chance is to write your own generic extractor. Follow this guide.
I am a newbie to search engines and information retrieval. Can someone explain how the Lucene search engine differs from Azure Search?
I have read the Azure Search documentation and see that Azure Search supports Lucene queries as well, so is Azure Search built on top of Lucene, or does it inherit certain features from it?
There is no proper documentation on this as such; can someone point me in the right direction?
Thanks in advance.
According to this Microsoft page, full-text search in Azure Search is built on Lucene.
https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search
"The full text search engine in Azure Search is built on Apache Lucene, an industry standard in information retrieval."
Azure Search is not built on top of Apache Lucene as such, but it does support Lucene Query syntax.
https://learn.microsoft.com/en-us/rest/api/searchservice/lucene-query-syntax-in-azure-search
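To make the "supports Lucene query syntax" point concrete, here is a small sketch that sends a full Lucene-syntax query (queryType=full) to the Azure Search REST API using Python's requests; the service name, index name, field names, API key, and api-version are placeholders, not values from the question:

```python
import requests

# Placeholders - substitute your own service, index, key and API version.
SERVICE = "my-search-service"
INDEX = "hotels"
API_KEY = "<query-key>"
API_VERSION = "2019-05-06"

url = (
    f"https://{SERVICE}.search.windows.net/"
    f"indexes/{INDEX}/docs/search?api-version={API_VERSION}"
)

payload = {
    # "full" switches the service to the full Lucene query syntax
    # (fielded terms, fuzzy/proximity operators, wildcards, and so on).
    "queryType": "full",
    "search": 'category:Luxury AND "ocean view"~2',
    "select": "hotelName,category",
    "top": 5,
}

resp = requests.post(url, json=payload, headers={"api-key": API_KEY})
resp.raise_for_status()
for doc in resp.json()["value"]:
    print(doc)
```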
I am new to databases and have some data stored as entities in Google Cloud Datastore. I would like to be able to analyze and plot this data in a web interface, and it seems like Google Data Studio provides an easy-to-use way to do this. However, I'm a bit confused as to how I can actually use the two together; it seems like either Google Cloud Storage or Google BigQuery could act as a middleman in between, but I'm not sure how this would work.

Could anyone advise on whether Google Data Studio would be the best approach to plotting and analyzing data in Google Cloud Datastore, and if so, offer tips on how I could go about this? There are a large number of tutorials, but none that I've found explain how to load data from the Datastore into a usable form for Data Studio.
Thanks!
As Graham Polley says, the question is answered here. The workaround to connect Cloud Datastore to Google Data Studio is to first export Datastore entities to BigQuery, as explained in this guide.
Then see this in order to connect Data Studio to BigQuery tables.
Finally in this blog post, there's a tutorial for building a dashboard with Google Data Studio and BigQuery.
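As a sketch of that export-then-query path (the bucket, kind, project, dataset, and table names below are invented for illustration), the google-cloud-bigquery Python client can load a managed Datastore export from Cloud Storage into a BigQuery table that Data Studio can then connect to:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# URI of the export_metadata file for one kind, produced by a managed
# Datastore export such as `gcloud datastore export gs://my-bucket/exports`.
export_uri = (
    "gs://my-bucket/exports/all_namespaces/kind_Measurement/"
    "all_namespaces_kind_Measurement.export_metadata"
)

table_id = "my-project.analytics.measurements"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
)
load_job = client.load_table_from_uri(export_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

# Once the entities are in BigQuery, point Data Studio at this table,
# or explore it directly:
query = f"SELECT COUNT(*) AS n FROM `{table_id}`"
for row in client.query(query).result():
    print(row.n)
```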
I understand that Splunk does not need much of the functionality that a MySQL database would provide, and that a relational database might not be a good option for indexing and searching Big Data.
Does Splunk use Lucene as its search engine, or have they built their own on-disk data format?
I am sorry if there are any problems in the way I am asking the question. This is my first question on Stack Overflow.
Splunk uses its own search engine; it is not based on any third-party engine.
Its search engine is based on files only, with no database behind it.
It does not store fields, only the raw data. Fields are extracted at search time, which makes them very dynamic.
It's also very fast at finding keywords in the data (needle in a haystack).
To be more detailed, Splunk stores the data in the following way:
Breaking the data into time-based events and attaching a timestamp to each raw event
Marking every word found in the events and its location across the index
Storing the events in compressed format (tar.gz)

This structure allows:
Very fast searches for keywords inside the events
Looking at the original raw data
Creating new fields on the raw data at search time and using them with statistics commands (see the sketch below)
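To illustrate the search-time field extraction point, here is a small sketch using the splunk-sdk Python package (assuming a recent version that provides JSONResultsReader); the connection details, index, and rex pattern are placeholders. The status field is not stored in the index; it is pulled out of the raw events while the search runs:

```python
import splunklib.client as client
import splunklib.results as results

# Placeholder connection details for a Splunk instance.
service = client.connect(
    host="localhost", port=8089, username="admin", password="changeme"
)

# The "status" field is not stored anywhere; the rex command extracts it
# from the raw events at search time, and stats then aggregates on it.
spl = (
    "search index=web sourcetype=access_combined "
    '| rex "\\s(?<status>\\d{3})\\s" '
    "| stats count by status"
)

reader = results.JSONResultsReader(service.jobs.oneshot(spl, output_mode="json"))
for item in reader:
    if isinstance(item, dict):  # skip diagnostic messages
        print(item)
```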
Sources:
http://www.splunk.com/web_assets/pdfs/secure/Splunk_for_BigData.pdf
http://docs.splunk.com/Documentation/Splunk/6.5.1/Indexer/Howindexingworks

Background: 3+ years of experience as a Splunk architect.
Googling would have helped: http://answers.splunk.com/answers/43533/search-capabilities-of-splunk-how-powerful-is-it-really --> No Lucene
Splunk has a proprietary data format for its indexes. Lucene is not used, and Splunk has its own search language called SPL.
I am using BigQuery for SEO reasons. I am a search TC, and I am a little confused about why you are not using the Google forum, as I thought that was standard. What I want to use BigQuery for is to find out when my competitors change data on their website and which pages were changed. So I need the URL that was changed and when it was changed (the date), so I can also pull the page title and description to see what they are doing differently than I am.
Is there anyone that knows how to use BigQuery to pull:
Date the page was changed
URL
Title
Description
We've switched to using Stack Overflow for support for many of our developer products, such as BigQuery. There's a great community here on StackOverflow, and the interface for formatting technical questions and interacting with the community is fantastic.
BigQuery does not collect the data for you; it's a cloud service for performing ad hoc queries on massive datasets. Before you can run queries, you need to upload the data to the service (for example, in CSV format).
So, if you have a job that collects this data (URL, title, description, date, and perhaps a hash of the webpage), you could ingest a CSV file of it into BigQuery and use it to understand when webpages have changed, as sketched below.
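As a sketch of that approach (the project, dataset, table, and column names are made up for illustration), once daily crawl snapshots with a content hash are loaded into BigQuery, a window function can surface the pages that changed between crawls:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table of daily crawl snapshots with columns:
# url, crawl_date, title, description, content_hash.
query = """
SELECT url, crawl_date, title, description
FROM (
  SELECT
    url,
    crawl_date,
    title,
    description,
    content_hash,
    LAG(content_hash) OVER (PARTITION BY url ORDER BY crawl_date) AS prev_hash
  FROM `my-project.seo.competitor_crawls`
) AS snapshots
WHERE prev_hash IS NOT NULL AND content_hash != prev_hash
ORDER BY crawl_date DESC
"""

for row in client.query(query).result():
    print(row.crawl_date, row.url, row.title)
```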
Of course, there are also 3rd-party services (such as Changedetection.com) which may be easier to use for your purposes.