How to implement full text search using AWS Neptune as the source - amazon-neptune

I've seen on other questions that Gremlin & Neptune do not support full text search natively.
How can I provide this feature as part of my web-site?
Ideally it would not require standing up more infrastructure/software that I have to look after.
I'm thinking some options are using an external search service like Solr or ElasticSearch. What about another AWS service? CloudSearch?
Thanks

Your question is very timely. Integration between Amazon Neptune and ElasticSearch was just launched [1]. As you add data to a graph, Neptune will automatically keep an ElasticSearch index up to date using the Streams feature. This is similar in approach to what you were considering building, but with the added advantage that you can access the index directly from your graph queries rather than needing to write wrapper code that calls the index and then calls Neptune.

You can use the ElasticSearch index in your Gremlin and/or SPARQL queries by simply including some "magic" strings in the query that tell Neptune to use the ElasticSearch index rather than its own internal indices. You have control over things like which ElasticSearch query API a given Gremlin/SPARQL query should use. Instructions on how to set up the environment, which does not take long, are included at the same link [1].

Hopefully this will help with your use case. As a side note, another benefit of the feature is that, when working with Gremlin, you do not need a specially modified client library to take advantage of these new capabilities.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html
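To give a feel for the "magic string" mechanism, here is a hedged sketch of composing such a Gremlin query as a string. The endpoint URL, property name (`title`), and `queryType` value are illustrative placeholders; check the linked documentation [1] for the exact syntax your cluster expects:

```python
# Sketch of composing a Gremlin query that routes a predicate to the
# ElasticSearch index Neptune maintains. The endpoint, the property name
# ('title'), and the queryType value are illustrative placeholders.
def neptune_fts_query(es_endpoint: str, prop: str, term: str) -> str:
    # The 'Neptune#fts' prefixes are the "magic strings": the side effects
    # configure which ElasticSearch endpoint and query API to use, and the
    # has() predicate asks Neptune to evaluate the match against the
    # ElasticSearch index instead of its own internal indices.
    return (
        "g.withSideEffect('Neptune#fts.endpoint', '{0}')"
        ".withSideEffect('Neptune#fts.queryType', 'match')"
        ".V().has('{1}', 'Neptune#fts {2}').limit(10)"
    ).format(es_endpoint, prop, term)

print(neptune_fts_query("https://es.example.com", "title", "graph*"))
```

You would submit the resulting query through your normal Gremlin client; no modified client library is needed.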

CloudSearch is a good choice, but note that "looking after" it will be inevitable. You will at least need an error monitoring/logging mechanism that lets you see which queries have failed and track down why; maybe diacritics handling was not OK, for instance. Also note that some code for wiring Neptune to CloudSearch will be needed: I do not know of any out-of-the-box method to transfer data from Neptune into a CloudSearch index, so at minimum you will need a Lambda function. Lambda functions are worth considering.
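To illustrate the wiring, here is a minimal, hypothetical sketch of the transform step such a Lambda function would perform: turning change records into a CloudSearch document batch. The record shape here is invented (real Neptune Streams records look different), and the actual upload call (boto3's `cloudsearchdomain` `upload_documents`) is only indicated in a comment:

```python
import json

# Hypothetical change-record shape; real records from Neptune Streams look
# different -- adapt the field mapping to your own data model.
def to_cloudsearch_batch(records):
    # CloudSearch batch format: a JSON list of {"type": "add"/"delete"} docs.
    batch = []
    for rec in records:
        if rec.get("op") == "DELETE":
            batch.append({"type": "delete", "id": rec["id"]})
        else:
            batch.append({"type": "add", "id": rec["id"],
                          "fields": rec.get("fields", {})})
    return json.dumps(batch)

def handler(event, context):
    # In the real Lambda you would post the batch to your search domain:
    #   boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT) \
    #        .upload_documents(documents=payload,
    #                          contentType="application/json")
    payload = to_cloudsearch_batch(event.get("records", []))
    return {"batchBytes": len(payload)}
```

The failed-upload responses from CloudSearch are exactly what you would feed into the monitoring/logging mechanism mentioned above.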

Related

Is there a way to get Splunk Data to BigQuery?

I have some app data which is currently stored in Splunk, but I am looking for a way to feed the Splunk data directly to BigQuery. My goal is to analyze the app data in BigQuery and perhaps create Data Studio dashboards based on it.
I know there are a lot of third-party connectors that can help me with this, but I am looking for a solution where I can use features of Splunk or BigQuery to connect the two together and not rely on third-party connectors.
Based on your comment indicating that you're interested in resources to egress data from Splunk into BigQuery with custom software, I would suggest using either tool's REST API on either side.
You don't indicate whether this is a one-time or a recurring task - that may impact where you want the software that performs this operation to run. If it's a one-time thing and you've got a fair internet connection yourself, you may just want to write a console application on your own machine to perform the migration. If it's a recurring operation, you might instead look at any of the various "serverless" hosting options out there (e.g. Azure Functions, Google Cloud Functions, or AWS Lambda). In addition to development experience, note that you may have to pay an egress bandwidth cost on top of normal service charges.
Beyond that, you need to decide whether it makes more sense to do a bulk export from Splunk to some external file that you load into Google Drive and then import into BigQuery, or to download the records as paged data via HTTPS so you can perform an ETL operation on top of them (e.g. replace nulls with empty strings, update Datetime values to match Google's exacting standards, etc.). If you go the latter route, it looks as though this is the documentation you'd use from Splunk, and on the BigQuery side you can either use Google's newer, higher-performance Storage Write API or their legacy streaming API to ingest the data. Either option supports SDKs across varied languages (e.g. C#, Go, Ruby, Node.js, Python, etc.), though only the legacy streaming API supports plain HTTP REST calls.
Beyond that, don't forget your OAuth2 concerns to authenticate on either side of the operation, though this is typically abstracted away by the various SDKs offered by either party, and less of something you'd have to deal with the ins and outs of.
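For the paged-ETL route, the per-page cleanup step can be sketched like this. The field names are made up, and the surrounding REST calls (Splunk's results endpoint, BigQuery's `insert_rows_json` in the legacy streaming client) are only indicated in comments:

```python
# Sketch of the per-page ETL step: normalize Splunk result rows into the
# shape BigQuery's streaming insert expects. Field names are illustrative.
def clean_rows(splunk_rows):
    cleaned = []
    for row in splunk_rows:
        out = {}
        for key, value in row.items():
            # Replace nulls with empty strings so the load doesn't reject
            # rows; datetime reformatting would go here too.
            out[key] = "" if value is None else value
        cleaned.append(out)
    return cleaned

# In the real pipeline you would page through Splunk's REST results, e.g.
#   GET /services/search/jobs/{sid}/results?output_mode=json&offset=N
# then push each cleaned page with the BigQuery client, e.g.
#   bigquery.Client().insert_rows_json(table_id, clean_rows(page))
```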

What is the recommended way to provide an API for Apache Spark application results

We have a huge set of data stored on a Hadoop cluster. We need to do some analysis on this data using Apache Spark and provide the results of that analysis to other applications via an API.
I have two ideas but I cannot figure out which one is recommended.
The first option is to make Spark application(s) that perform the analysis and store the results in another datastore (a relational DB or even HDFS), then develop another application that reads the analysis results from that datastore and provides an API for querying.
The second option is to merge the two into one application. This way I remove the need for another datastore, but the application would have to be up and running all the time.
What is the recommended way to go in this situation? And if there are other options, kindly list them.
It depends on how frequently users are going to hit the GET API. If clients want real-time results, go for the in-line API (your second option); otherwise, use the first approach of storing the results in another datastore.
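The first option can be sketched as below. This is a hedged illustration, not a full implementation: the Spark write is shown only as a comment, and a plain JSON file stands in for the relational DB or HDFS datastore:

```python
import json

# The Spark job side would end with something like:
#   results_df.write.mode("overwrite").parquet("hdfs:///analysis/results")
# or a JDBC write to a relational DB. Below, a JSON file stands in.

def load_results(path):
    # The API process loads precomputed results once (or on a refresh
    # signal) -- no Spark runtime is needed to serve queries.
    with open(path) as f:
        return {row["key"]: row for row in json.load(f)}

def query_api(results, key):
    # Serving layer: a pure lookup against the precomputed results.
    return results.get(key, {"error": "not found"})
```

The point of the separation is that the heavyweight Spark job runs on a schedule and exits, while the lightweight API process stays up.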

Collect and Display Hadoop MapReduce resutls in ASP.NET MVC?

Beginner questions. I read this article about Hadoop/MapReduce
http://www.amazedsaint.com/2012/06/analyzing-some-big-data-using-c-azure.html
I get the idea of Hadoop, and what map and reduce are.
The thing for me is, if my application sits on top of a hadoop cluster
1) No need for database anymore?
2) How do I get my data into hadoop in the first place from my ASP.NET MVC application? Say it's Stackoverflow (which is coded in MVC). After I post this question, how can this question along with the title, body, tags get into hadoop?
3) In the above article, it collects data about "namespaces" used on Stackoverflow and how many times they were used.
If this site stackoverflow wants to display the result data from mapreducer in real time, how do you do that?
Sorry for the rookie questions. I'm just trying to get a clear picture here, one piece at a time.
1) That would depend on the application. Most likely you still need database for user management, etc.
2) If you are using Amazon EMR, you'd place the inputs into S3 using .NET API (or some other way) and get the results out the same way. You could also monitor your EMR account via API, fairly straight-forward.
3) Hadoop is not really a real-time environment; it's more of a batch system. You could simulate real-time by continuously processing incoming data, but it's still not true real-time.
I'd recommend taking a look at the Amazon EMR .NET docs and picking up a good book on Hadoop (such as Hadoop in Practice) to understand the stack and concepts, and on Hive (such as Programming Hive).
Also, you can, of course, mix the environments for what they are best at; for example, use Azure Websites and SQLAzure for your .NET app and Amazon EMR for hadoop/hive. No need to park everything in one place, considering cost models.
Hope this helps.
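For point 2, the staging step can be sketched in Python (the .NET S3 API is analogous): batch the posted items into newline-delimited JSON, which is the line-oriented input Hadoop jobs typically expect, and upload each batch to S3 for EMR. The record fields are made up, and the actual upload (e.g. boto3's `put_object`) is indicated only in a comment:

```python
import json

def to_ndjson_batch(questions):
    # Serialize each posted question (title, body, tags are illustrative
    # fields) as one JSON line, ready for S3/EMR input.
    return "\n".join(json.dumps(q, sort_keys=True) for q in questions)

# In the real app you would upload each batch as an S3 object, e.g.
#   boto3.client("s3").put_object(Bucket="my-input-bucket",
#                                 Key="questions/batch-0001.json",
#                                 Body=to_ndjson_batch(batch))
```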

riak backup solution for a single bucket

What are your recommendations for solutions that allow backing up [either by streaming or snapshot] a single riak bucket to a file?
Backing up just a single bucket is going to be a difficult operation in Riak.
All of the solutions will boil down to the following two steps:
List all of the objects in the bucket. This is the tricky part, since there is no "manifest" or a list of contents of any bucket, anywhere in the Riak cluster.
Issue a GET to each one of those objects from the list above, and write it to a backup file. This part is generally easy, though for maximum performance you want to make sure you're issuing those GETs in parallel, in a multithreaded fashion, and using some sort of connection pooling.
As far as listing all of the objects, you have one of three choices.
One is to do a Streaming List Keys operation on the bucket via HTTP (e.g. /buckets/bucket/keys?keys=stream) or Protocol Buffers -- see http://docs.basho.com/riak/latest/dev/references/http/list-keys/ and http://docs.basho.com/riak/latest/dev/references/protocol-buffers/list-keys/ for details. Under no circumstances should you do a non-streaming regular List Keys operation. (It will hang your whole cluster, and will eventually either time out or crash once the number of keys grows large enough).
Two is to issue a Secondary Index (2i) query to get that object list. See http://docs.basho.com/riak/latest/dev/using/2i/ for discussion and caveats.
And three would be if you're using Riak Search and can retrieve all of the objects via a single paginated search query. (However, Riak Search has a query result limit of 10,000 results, so this approach is far from ideal.)
For an example of a standalone app that can backup a single bucket, take a look at Riak Data Migrator, an experimental Java app that uses the Streaming List Keys approach combined with efficient parallel GETs.
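The two steps above can be sketched like this. Here `list_keys` and `fetch` stand in for the Riak client's streaming list-keys and GET operations (e.g. `bucket.stream_keys()` and `bucket.get(key)` in a client library), and the JSON-lines output format is just an assumption:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def backup_bucket(list_keys, fetch, out_path, workers=8):
    # Step 1: enumerate keys (in real Riak, via a *streaming* list-keys,
    # never the regular one).
    keys = list(list_keys())
    with open(out_path, "w") as out, ThreadPoolExecutor(workers) as pool:
        # Step 2: issue GETs in parallel via a thread pool and write one
        # JSON record per line to the backup file.
        for key, value in zip(keys, pool.map(fetch, keys)):
            out.write(json.dumps({"key": key, "value": value}) + "\n")
    return len(keys)
```

In a real client you would also reuse connections (connection pooling) rather than opening one per GET.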
The Basho function contrib has an erlang solution for backing up a single bucket. It is a custom function but it should do the trick.
http://contrib.basho.com/bucket_exporter.html
As far as I know, there's no automated solution to back up a single bucket in Riak. You'd have to use the riak-admin command-line tool to back up a single physical node. You could write something to retrieve all keys in a single bucket, using a low R value (r = 1) if you want it to be fast but not safe.
Buckets are a logical namespace; all of the keys are stored in the same Bitcask structure. That's why the only way to back up just a single bucket is to write a tool that streams the keys yourself.

Index replication and Load balancing

I am using the Lucene API in my web portal, which is going to have thousands of concurrent users.
Our web server will call the Lucene API, which will be sitting on an app server. We plan to use 2 app servers for load balancing.
Given this, what should our strategy be for replicating the Lucene indexes to the 2nd app server? Any tips, please?
You could use solr, which contains built in replication. This is possibly the best and easiest solution, since it probably would take quite a lot of work to implement your own replication scheme.
That said, I'm about to do exactly that myself for a project I'm working on. The difference is that since we're using PHP for the frontend, we've implemented Lucene in a socket server that accepts queries and returns a list of DB primary keys. My plan is to push changes to the server and store them in a queue, where I'll first store them in the memory index and then flush the memory index to disk when the load is low enough.
Still, it's a complex thing to do and I'm set on doing quite a lot of work before we have a stable final solution that's reliable enough.
From experience, Lucene should have no problem scaling to thousands of users. That said, if you're only using your second app server for load balancing and not for fail-over situations, you should be fine hosting Lucene on only one of those servers and accessing it via NFS (in a unix environment) or a shared directory (in a Windows environment) from the second server.
Again, this depends on your specific situation. If you're talking about having millions (5 or more) of documents in your index and needing your Lucene index to be fail-over capable, you may want to look into Solr or Katta.
We are working on a similar implementation to what you are describing as a proof of concept. What we see as an end-product for us consists of three separate servers to accomplish this.
There is a "publication" server, that is responsible for generating the indices that will be used. There is a service implementation that handles the workflows used to build these indices, as well as being able to signal completion (a custom management API exposed via WCF web services).
There are two "site-facing" Lucene.NET servers. Access to the API is provided via WCF services to the site. They sit behind a physical load balancer and periodically "ping" the publication server to see if there is a more current set of indices than what is currently running. If there is, the server requests a lock from the publication server and updates its local indices by initiating a transfer to a local "incoming" folder. Once the transfer completes, it is just a matter of suspending the searcher while the index is attached. The server then releases its lock, and the other server is available to do the same.
Like I said, we are only approaching the proof of concept stage with this, as a replacement for our current solution, which is a load balanced Endeca cluster. The size of the indices and the amount of time it will take to actually complete the tasks required are the larger questions that have yet to be proved out.
Just some random things that we are considering:
The downtime of a given server could be reduced if two local folders are used on each machine receiving data to achieve a "round-robin" approach.
We are looking to see if the load balancer allows programmatic access to have a node remove and add itself from the cluster. This would lessen the chance that a user experiences a hang if he/she accesses during an update.
We are looking at "request forwarding" in the event that cluster manipulation is not possible.
We looked at solr, too. While a lot of it just works out of the box, we have some bench time to explore this path as a learning exercise - learning things like Lucene.NET, improving our WF and WCF skills, and implementing ASP.NET MVC for a management front-end. Worst case scenario, we go with something like solr, but have gained experience in some skills we are looking to improve on.
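The suspend-and-attach step can be approximated on any POSIX platform with an atomic link swap. This sketch is an illustration of that general pattern, not the WCF implementation described above: a symlink is repointed at the newly transferred index directory, so a searcher reopening the "live" path never sees a half-swapped index:

```python
import os

def swap_index(new_index_dir, live_link):
    # Build the new link under a temporary name, then atomically rename it
    # over the live one (rename is atomic on POSIX), so readers only ever
    # see the old index or the new one, never a partial state.
    tmp_link = live_link + ".swap"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_index_dir, tmp_link)
    os.replace(tmp_link, live_link)
```

Keeping two local folders per machine and alternating between them, as suggested above, pairs naturally with this: the swap just flips the link between the two.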
I'm creating the indices on the publishing back-end machines into the filesystem and replicating those over to the load-balanced serving nodes.
That way every single load- and fail-balanced node has its own index without network latency.
The only drawback is that you shouldn't try to recreate the index within the replicated folder, as you'll have the lock file lying around on every node, blocking the IndexReader until your reindex has finished.