Amazon CloudSearch and Amazon Kendra

I was wondering: what is the main difference between Amazon CloudSearch and Kendra? Why are there two different tools from the same company that seemingly compete with each other? Both look the same to me, and I am not sure how their features differ or how one is distinguished from the other.
Amazon CloudSearch: Set up, manage, and scale a search solution for your website or application. Amazon CloudSearch enables you to search large collections of data such as web pages, document files, forum posts, or product information. With a few clicks in the AWS Management Console, you can create a search domain, upload the data you want to make searchable to Amazon CloudSearch, and the search service automatically provisions the required technology resources and deploys a highly tuned search index.
Amazon Kendra: Enterprise search service powered by machine learning. It is highly accurate and easy to use, and delivers powerful natural language search capabilities to your websites and applications so your end users can more easily find the information they need within the vast amount of content spread across your company.

The key difference between the two services is that Amazon CloudSearch is based on Apache Solr, a keyword engine, while Amazon Kendra is an ML-powered search engine designed to provide more accurate search results over unstructured data such as Word documents, PDFs, HTML, PPTs, and FAQs. Kendra was designed from the ground up to natively handle natural language queries and return specific answers, instead of just lists of documents the way keyword engines do.
Another key difference is ingestion: to upload data to a CloudSearch domain, it must be formatted as a valid JSON or XML batch. Kendra, on the other hand, provides out-of-the-box connectors that let customers automatically index content from popular repositories like SharePoint Online, Amazon S3, Salesforce, ServiceNow, etc., directly into the Kendra index. So, depending on your use case, Kendra may be the better choice, especially if you're considering the service for enterprise search applications, or even website search where deeper language understanding is important. Hope this helps; happy to address follow-up questions. You can also visit our Kendra FAQ page for more specific answers about the service: https://aws.amazon.com/kendra/faqs/
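To make the contrast concrete, here is a minimal sketch (Python with boto3; the index ID and question are placeholders, not values from this answer) of what a natural-language query against a Kendra index looks like:

```python
import boto3

kendra = boto3.client("kendra", region_name="us-east-1")

response = kendra.query(
    IndexId="00000000-0000-0000-0000-000000000000",  # placeholder index ID
    QueryText="How do I reset my VPN password?",     # natural-language question
)

# Kendra returns typed results: ANSWER items carry a specific extracted
# answer, while DOCUMENT items are the traditional ranked list a keyword
# engine would give you.
for item in response["ResultItems"]:
    print(item["Type"], "-", item["DocumentTitle"]["Text"])
```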

Related

Is there a way to get Splunk Data to BigQuery?

I have some app data which is currently stored in Splunk, but I am looking for a way to feed the Splunk data directly into BigQuery. My goal is to analyze the app data in BigQuery and perhaps build Data Studio dashboards on top of it.
I know there are a lot of third-party connectors that can help with this, but I am looking for a solution that uses features of Splunk or BigQuery to connect the two together, without relying on third-party connectors.
Based on your comment indicating that you're interested in resources to egress data from Splunk into BigQuery with custom software, I would suggest using each tool's REST API on its side of the transfer.
You don't indicate whether this is a one-time or a recurring task - that may affect where you want the software that performs this operation to run. If it's a one-time thing and you have a fair internet connection yourself, you may just want to write a console application and run the migration from your own machine. If it's a recurring operation, you might instead look at any of the various "serverless" hosting options out there (e.g. Azure Functions, Google Cloud Functions, or AWS Lambda). Whichever you choose, note that you may have to pay an egress bandwidth cost on top of the normal service charges.
Beyond that, you need to decide whether it makes more sense to do a bulk export from Splunk to an external file that you load into Google Drive and then import into BigQuery, or to download the records as paged data via HTTPS so you can perform some ETL on them along the way (e.g. replace nulls with empty strings, update datetime values to match Google's exacting standards, etc.). If you go the latter route, it looks as though this is the documentation you'd use on the Splunk side, and on the BigQuery side you can either use Google's newer, higher-performance Storage Write API or their legacy streaming API to ingest the data. Both options have SDKs across varied languages (e.g. C#, Go, Ruby, Node.js, Python), though only the legacy streaming API supports plain HTTP REST calls.
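To make that concrete, here is a rough sketch of the paged route in Python; the Splunk host, credentials, search string, and BigQuery table ID are all placeholders, and the rows are assumed to already match the destination table's schema:

```python
import json

import requests
from google.cloud import bigquery  # pip install google-cloud-bigquery

# 1) Export search results from Splunk's REST API as streamed JSON lines.
resp = requests.post(
    "https://splunk.example.com:8089/services/search/jobs/export",
    auth=("admin", "changeme"),  # placeholder credentials
    data={"search": "search index=app_logs", "output_mode": "json"},
    stream=True,
)

# 2) Light ETL: keep only result events and replace nulls with empty strings.
rows = []
for line in resp.iter_lines():
    if line:
        event = json.loads(line)
        if "result" in event:
            rows.append(
                {k: (v if v is not None else "") for k, v in event["result"].items()}
            )

# 3) Ingest into BigQuery via the legacy streaming API.
client = bigquery.Client()
errors = client.insert_rows_json("my-project.my_dataset.app_logs", rows)
if errors:
    print("BigQuery rejected some rows:", errors)
```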
Beyond that, don't forget your OAuth2 concerns when authenticating on either side of the operation, though this is typically abstracted away by the SDKs each party offers, so it's rarely something you have to handle directly.

How to create a CDN to store and serve images and videos?

We have a requirement to store and retrieve content (audio, video, images) quickly. We are not allowed to use commercial providers like AWS S3, etc.
Any suggestions on how to go about it? The challenges I foresee are:
a) Storage
b) Fast Retrieval
c) Caching
Would Cassandra help with the above?
This is a very typical use case for Cassandra for things like streaming services or media-sharing social apps.
The caveat is that the media files themselves are saved in an object store, and only the metadata (such as the URL of the media file) is stored in Cassandra, so you can retrieve information about where the media lives very quickly.
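A minimal sketch of that pattern with the DataStax Python driver (keyspace, table, and column names here are hypothetical):

```python
import uuid

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS media
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS media.assets (
        asset_id   uuid PRIMARY KEY,
        kind       text,    -- 'audio', 'video', or 'image'
        object_url text,    -- where the blob actually lives
        size_bytes bigint
    )
""")

# Store only the pointer; the file itself goes to the object store / CDN.
session.execute(
    "INSERT INTO media.assets (asset_id, kind, object_url, size_bytes) "
    "VALUES (%s, %s, %s, %s)",
    (uuid.uuid4(), "video", "https://cdn.example.com/v/abc123.mp4", 734003200),
)
```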
As a side note, I wanted to warn you that others will likely vote to close your question because it is soliciting opinions vs a specific software issue. Cheers!

Cloud-scale user management

I am building a service to handle a large number of devices for a large number of users.
We have a complex schema of access roles assigned to each entity. Some data entries can be written to by certain users, while some users can only read from some entities (but can write to others).
This is a cloud service: there are more devices and users than can be handled by a single server machine (we are using non-relational cloud databases for this).
I was wondering if there is an established cloud-scale user/role management backend system which I could integrate to enforce the access rules, instead of writing my own. This tech should preferably be cloud agnostic, so I would prefer not to use a SaaS solution but deploy my own.
I am looking for a system which can scale to millions of users and billions of data entities.
I think authentication is not going to be a big issue; there are very robust cloud-based solutions available for storing identities and authenticating millions of users. Authorization will be trickier and will depend a lot on how granular you want it to be. You could look at Apigee, for example, as a very scalable proxy that might help you implement this. So getting to the point where you have a token that verifies the user's identity and might contain some scopes is not going to be hard, in my opinion (see the sketch at the end of this answer). If that is enough for you, I would just look at Auth0, Okta, and the native IDM solution of whatever cloud platform you are using (Cognito, Cloud Identity, etc.).
I think you will find that more features come with a very hefty price tag. Auth0 is far superior to Cognito, but Cognito still has enough features for basic use cases and will end up costing a fraction of Auth0 in large deployments. So everything comes with pros and cons. If you have very complex requirements, such as a bunch of big legacy repositories that you need to integrate, then products like Auth0 rapidly start looking more attractive.
Personally I would look at Auth0, Cognito, and Apigee, and my decision would depend massively on parameters you haven't mentioned in your question. These are all SaaS solutions, which I think you should definitely be using anyway. I would not host this myself unless I had no other choice; going that route will radically limit your options and probably increase expenses. All the cool stuff is happening in the cloud.
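For what it's worth, here is a hedged sketch of the "verify a token and check its scopes" step described above, using PyJWT. The audience and public key are placeholders; in practice you'd fetch the signing key from your IdP's JWKS endpoint (Auth0, Cognito, and the rest all expose one):

```python
import jwt  # pip install pyjwt

# Placeholder: in a real service, load this from the IdP's JWKS endpoint.
PUBLIC_KEY = "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"

def authorize(token: str, required_scope: str) -> dict:
    # Verifies the signature, expiry, and audience in one call.
    claims = jwt.decode(
        token,
        PUBLIC_KEY,
        algorithms=["RS256"],
        audience="https://api.example.com",  # placeholder audience
    )
    # Coarse-grained authorization: reject if the required scope is missing.
    if required_scope not in claims.get("scope", "").split():
        raise PermissionError(f"token lacks scope '{required_scope}'")
    return claims
```

Anything more granular than scopes (per-entity read/write rules like yours) would still have to live in your own service or a dedicated policy engine.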

Storing Files (images, Microsoft Office documents)

What is the best way to store images and Microsoft Office documents:
Google Drive
Google Storage
You may want to consider checking this page to help you choose which storage option suits you best and also learn more.
To differentiate the two:
Google Drive
A collaborative space for storing, sharing, and editing files, including Google Docs. It is good for the following:
End-user interaction with docs and files
Collaborative creation and editing
Syncing files between cloud and local devices
Google Cloud Storage
A scalable, fully managed, highly reliable, and cost-efficient object/blob store. It is good for the following:
Images, pictures, and videos
Objects and blobs
Unstructured data
In addition to that, see Google Cloud Platform - FAQ for more insights.
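If you land on Cloud Storage, uploading is only a few lines with the official Python client; the bucket and file names below are placeholders:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("my-media-bucket")  # placeholder bucket name

# Upload a local Office document as an object (blob).
blob = bucket.blob("docs/report.docx")
blob.upload_from_filename("report.docx")

print("Stored at:", f"gs://{bucket.name}/{blob.name}")
```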
Different approaches can be taken into consideration. Google Docs is widely used for working with office documents online; it provides much the same layout as Microsoft Office. The advantage is that you can share a document with other people, and you can edit it online at any time.
Google Drive (a useful way to store your files)
Every Google Account starts with 15 GB of free storage that's shared across Google Drive, Gmail, and Google Photos. When you upgrade to Google One, your total storage increases to 100 GB or more depending on what plan you choose.
MediaFire (another useful way to store your files)
MediaFire's basic plan gives you 10 GB of cloud space for free, and files you store in MediaFire can be protected with password encryption. It offers a number of other features as well; a suggestion to explore.

Collect and Display Hadoop MapReduce results in ASP.NET MVC?

Beginner questions. I read this article about Hadoop/MapReduce
http://www.amazedsaint.com/2012/06/analyzing-some-big-data-using-c-azure.html
I get the idea of Hadoop and what map and reduce each do.
The thing for me is, if my application sits on top of a Hadoop cluster:
1) Is there no need for a database anymore?
2) How do I get my data into Hadoop in the first place from my ASP.NET MVC application? Say it's Stack Overflow (which is coded in MVC). After I post this question, how do the question, title, body, and tags get into Hadoop?
3) In the above article, it collects data about "namespaces" used on Stack Overflow and how many times they were used.
If Stack Overflow wanted to display the result data from the MapReduce job in real time, how would you do that?
Sorry for the rookie questions. I'm just trying to get a clear picture here, one piece at a time.
1) That would depend on the application. Most likely you still need a database for user management, etc.
2) If you are using Amazon EMR, you'd place the inputs into S3 using the .NET API (or some other way) and get the results out the same way (a Python sketch of this flow appears at the end of this answer). You can also monitor your EMR account via the API; it's fairly straightforward.
3) Hadoop is not really a real-time environment; it's more of a batch system. You could simulate real-time behavior by continuously processing incoming data, but it's still not true real-time.
I'd recommend taking a look at the Amazon EMR .NET docs and picking up a good book on Hadoop (such as Hadoop in Practice) to understand the stack and concepts, and on Hive (such as Programming Hive).
Also, you can of course mix the environments to use each for what it does best; for example, use Azure Websites and SQL Azure for your .NET app and Amazon EMR for Hadoop/Hive. There's no need to park everything in one place, considering the cost models.
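As a rough illustration of point 2, here is the same flow sketched in Python with boto3 rather than the .NET API (bucket names, keys, and the sample payload are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Push a new question (title, body, tags) into the cluster's input path.
s3.put_object(
    Bucket="my-emr-bucket",
    Key="input/questions/12345.json",
    Body=b'{"title": "...", "body": "...", "tags": ["hadoop"]}',
)

# After the EMR job finishes, pull the reducer output back out.
result = s3.get_object(Bucket="my-emr-bucket", Key="output/part-r-00000")
print(result["Body"].read().decode("utf-8"))
```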
Hope this helps.