Collect and Display Hadoop MapReduce results in ASP.NET MVC? - asp.net-mvc-4

Beginner questions. I read this article about Hadoop/MapReduce
http://www.amazedsaint.com/2012/06/analyzing-some-big-data-using-c-azure.html
I get the idea of Hadoop and what map and reduce are.
The thing for me is: if my application sits on top of a Hadoop cluster,
1) Is there no need for a database anymore?
2) How do I get my data into Hadoop in the first place from my ASP.NET MVC application? Say it's Stack Overflow (which is built on MVC). After I post this question, how do the question, title, body, and tags get into Hadoop?
3) The above article collects data about "namespaces" used on Stack Overflow and how many times they were used.
If Stack Overflow wanted to display the result data from the MapReduce job in real time, how would it do that?
Sorry for the rookie questions. I'm just trying to get a clear picture here, one piece at a time.

1) That would depend on the application. Most likely you still need a database for user management, etc.
2) If you are using Amazon EMR, you'd place the inputs into S3 using the .NET API (or some other way) and get the results out the same way (see the sketch after this answer). You could also monitor your EMR jobs via the API; it's fairly straightforward.
3) Hadoop is not really a real-time environment; it's more of a batch system. You could simulate real time by continuously processing incoming data, but it's still not true real time.
I'd recommend taking a look at the Amazon EMR .NET docs and picking up a good book on Hadoop (such as Hadoop in Practice) to understand the stack and concepts, and one on Hive (such as Programming Hive).
Also, you can, of course, mix the environments for what they are best at; for example, use Azure Websites and SQL Azure for your .NET app and Amazon EMR for Hadoop/Hive. No need to park everything in one place, considering the cost models.
Hope this helps.
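For point 2, here is a minimal sketch of pushing a posted question into S3 as a MapReduce input record. It uses Python and boto3 purely for brevity; the AWS SDK for .NET exposes the same PutObject operation. The bucket name and key layout are placeholders, not a prescribed design.

```python
import json

import boto3  # AWS SDK for Python; the .NET SDK's PutObject call is equivalent

# Hypothetical bucket and key layout -- substitute your own EMR input location.
BUCKET = "my-emr-input-bucket"

s3 = boto3.client("s3")

def store_question(question_id, title, body, tags):
    """Serialize a newly posted question and drop it into S3 as one input
    record for a later EMR/MapReduce run."""
    record = {"id": question_id, "title": title, "body": body, "tags": tags}
    s3.put_object(
        Bucket=BUCKET,
        Key=f"questions/{question_id}.json",
        Body=json.dumps(record).encode("utf-8"),
    )
```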

Related

Is there a way to get Splunk Data to BigQuery?

I have some app data which is currently stored in Splunk, but I am looking for a way to feed the Splunk data directly into BigQuery. My goal is to analyze the app data in BigQuery and perhaps create Data Studio dashboards based on it.
I know there are a lot of third-party connectors that can help me with this, but I am looking for a solution where I can use features of Splunk or BigQuery to connect the two, rather than relying on third-party connectors.
Based on your comment indicating that you're interested in moving data out of Splunk into BigQuery with custom software, I would suggest using the REST APIs on both sides.
You don't indicate whether this is a one-time or a recurring task; that may affect where you want the software that performs the operation to run. If it's a one-time thing and you have a decent internet connection, you may just want to write a console application and run the migration from your own machine. If it's a recurring operation, you might instead look at one of the various "serverless" hosting options (e.g. Azure Functions, Google Cloud Functions, or AWS Lambda). In addition to the development experience, note that you may have to pay egress bandwidth costs on top of the normal service charges.
Beyond that, you need to decide whether it makes more sense to do a bulk export from Splunk to an external file that you load into Google Drive and then import into BigQuery, or to download the records as paged data over HTTPS so you can perform some ETL on them in flight (e.g. replace nulls with empty strings, update datetime values to match BigQuery's exacting standards, etc.). If you go the second route, it looks as though this is the documentation you'd use from Splunk, and on the BigQuery side you can either use Google's newer, higher-performance Storage Write API or their legacy streaming API to ingest the data. Either option has SDKs for a range of languages (e.g. C#, Go, Ruby, Node.js, Python, etc.), though only the legacy streaming API supports plain HTTP REST calls. A rough sketch of this approach follows at the end of this answer.
Finally, don't forget the OAuth2 concerns for authenticating on either side of the operation, though these are typically abstracted away by the SDKs each party offers, so it's less of something you'll have to deal with directly.
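To make that concrete, here is a rough Python sketch of the recurring pull, assuming Splunk's search/jobs/export REST endpoint on the management port and the google-cloud-bigquery client's insert_rows_json (the legacy streaming path mentioned above). The host, credentials, table name, and 500-row batch size are all placeholders; treat this as an outline, not a definitive implementation.

```python
import json

import requests
from google.cloud import bigquery

# Placeholders: Splunk host/credentials and the destination BigQuery table.
SPLUNK_EXPORT = "https://splunk.example.com:8089/services/search/jobs/export"
TABLE_ID = "my-project.my_dataset.app_events"

def splunk_to_bigquery(search="search index=app_data earliest=-24h"):
    """Export a Splunk search as JSON and stream the rows into BigQuery
    via the legacy streaming (insertAll) API."""
    bq = bigquery.Client()
    resp = requests.post(
        SPLUNK_EXPORT,
        auth=("admin", "changeme"),      # or a Splunk auth token
        data={"search": search, "output_mode": "json"},
        stream=True,
        verify=False,                    # management port often has a self-signed cert
    )
    resp.raise_for_status()

    rows = []
    for line in resp.iter_lines():
        if not line:
            continue
        event = json.loads(line).get("result")
        if not event:
            continue
        # Minimal in-flight ETL: drop nulls so they don't fight the table schema.
        rows.append({k: v for k, v in event.items() if v is not None})
        if len(rows) >= 500:             # keep insertAll batches small
            bq.insert_rows_json(TABLE_ID, rows)
            rows = []
    if rows:
        bq.insert_rows_json(TABLE_ID, rows)
```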

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted around for how to issue a GET request from the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to make calls to external services from inside map-reduce jobs. The reason is that when running on the cluster your jobs are very scalable, whereas the external system might not be. Modern resource managers like YARN make the situation even worse: when you swamp the external system with requests, your tasks on the cluster spend most of their time sleeping, waiting for a reply from the server. The resource manager sees that the CPU is not being used and schedules more of your tasks to run, which makes even more requests to the external system, swamping it further. I've seen a modest 100-machine cluster put out 100K requests per second.
What you really want to do is either somehow get the bulk data from the web service, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
As for your original question, I don't think Pig Latin provides such a facility, but it can easily be done with a UDF in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about 6 lines of code (a sketch follows below). A Java UDF will be a little more verbose, but nothing terrible by Java standards.
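To illustrate the Python route, here is a minimal sketch of such a UDF, assuming the cluster allows CPython streaming UDFs (registered with USING streaming_python) so that requests is available. The API key, function name, and output schema are placeholders, and the rate-limiting warning above still applies.

```python
# metoffice_udf.py -- sketch of a CPython streaming UDF for Pig.
# Registered in the Pig script with something like:
#   REGISTER 'metoffice_udf.py' USING streaming_python AS metoffice;
#   forecasts = FOREACH locations GENERATE location_id,
#               metoffice.fetch_forecast(location_id);
import requests
from pig_util import outputSchema   # shipped with Pig's streaming_python support

API_KEY = "abc123"                  # placeholder DataPoint key

@outputSchema("forecast:chararray")
def fetch_forecast(location_id):
    """Pull the 3-hourly forecast XML for one location and return it as text.
    Remember the warning above: keep the number of concurrent tasks small."""
    url = ("http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/"
           "{0}?res=3hourly&key={1}".format(location_id, API_KEY))
    return requests.get(url, timeout=10).text
```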
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig load from HDFS storage, unless you write your own loader.
And i share same point with #Vlad, that this is not a good idea, you have many other other components used for data ingestion, but this not a use case of Pig !

How to use Apache Nifi to query a REST API?

For a project I need to develop an ETL process (extract, transform, load) that reads data from a (legacy) tool that exposes its data on a REST API. This data needs to be stored in Amazon S3.
I would really like to try this with Apache NiFi, but I honestly have no clue yet how I can connect to the REST API, or where/how I can implement some business logic to 'talk the right protocol' with the source system. For example, I'd like to keep track of what data has been written so far so the process can resume loading where it left off.
So far I have been reading the NiFi documentation and I'm getting a better picture of what the tool provides/entails. However, it's not clear to me how I could implement the task within the NiFi architecture.
Hopefully someone can give me some guidance?
Thanks,
Paul
The InvokeHTTP processor can be used to query a REST API.
Here is a simple flow that
Queries the REST API at https://api.exchangeratesapi.io/latest every 10 minutes
Sets the output-file name (exchangerates_<ID>.json)
Stores the query response in the output file on the local filesystem (under /tmp/data-out)
I exported the flow as a NiFi template and stored it in a gist. The template can be imported into a NiFi instance and run as is.
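NiFi handles the scheduling, retries, and provenance for you, but stripped of all that, the flow's logic is roughly the following sketch (the URL, file-name pattern, and /tmp/data-out path follow the description above; the polling loop is only a stand-in for NiFi's 10-minute run schedule).

```python
import time
import uuid

import requests

# Rough stand-in for the InvokeHTTP -> set-filename -> PutFile flow described
# above: poll the rates API on the 10-minute schedule and write each response
# to /tmp/data-out/exchangerates_<uuid>.json.
def poll_exchange_rates():
    while True:
        resp = requests.get("https://api.exchangeratesapi.io/latest", timeout=30)
        resp.raise_for_status()
        path = "/tmp/data-out/exchangerates_{0}.json".format(uuid.uuid4())
        with open(path, "w") as out:
            out.write(resp.text)
        time.sleep(600)  # NiFi's run schedule handles this far more gracefully
```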

Database for live mobile tracking

I'm developing an app that tracks a mobile device instantly (live)... I need some advice. The application must send the location to a web service, which in turn records the received data in a database.
What would be, in your opinion, the best way to store the location values?
I'm new to big data and I'm afraid that simple SQL requests won't be able to do the job properly... I imagine that if there are a lot of users and each user sends a request every second, I'll have issues with the database...
Any advice? Thank you very much.
I think you could have a look at the geospatial queries in Mongo, if you choose to go ahead with MongoDB (a sketch follows below).
Refer here
And here
The design of the database will depend on the nature of the queries (essentially the reads and writes).
Worth having a look into.
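If MongoDB is the route you take, here is a minimal sketch of the idea: store each location update as a GeoJSON point with a 2dsphere index so the geospatial operators stay fast. The connection string, database, collection names, and 500 m radius are placeholders.

```python
from datetime import datetime, timezone

import pymongo

# Sketch of the MongoDB route: store each location update as a GeoJSON point
# with a 2dsphere index so the geospatial operators stay fast.
client = pymongo.MongoClient("mongodb://localhost:27017")
pings = client.tracking.location_pings
pings.create_index([("location", pymongo.GEOSPHERE)])
pings.create_index([("device_id", pymongo.ASCENDING), ("ts", pymongo.DESCENDING)])

def record_ping(device_id, lng, lat):
    """One insert per location update sent by a device (GeoJSON is lng, lat order)."""
    pings.insert_one({
        "device_id": device_id,
        "ts": datetime.now(timezone.utc),
        "location": {"type": "Point", "coordinates": [lng, lat]},
    })

def pings_near(lng, lat, meters=500):
    """All stored pings within `meters` of a point, nearest first ($near uses the index)."""
    return pings.find({
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                "$maxDistance": meters,
            }
        }
    })
```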
Working at Cintric, we landed on using Elasticsearch. We process billions of location points in real time and provide advanced analytics to our users.
We started with MongoDB and ran into a lot of trouble, eventually leading to a painful migration.
In our current stack, mobile devices dump location updates into AWS Kinesis; these are then processed by AWS Lambda handlers and written into Elasticsearch. We're able to serve, process, and store 300 million requests/month for only a few hundred dollars a month. Analytics for our dashboard add additional cost, but for your needs I would highly recommend checking out your options on AWS.
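As a rough sketch of the ingest side of such a pipeline (not Cintric's actual code), the web service could drop each update onto a Kinesis stream with boto3; Lambda consumers downstream would then index the records into Elasticsearch. The stream name and record shape here are assumptions.

```python
import json

import boto3

# Sketch of the ingest side: the web service drops each update onto a Kinesis
# stream; Lambda consumers downstream index it into Elasticsearch.
# Stream name and record shape are assumptions.
kinesis = boto3.client("kinesis")

def send_location_update(device_id, lng, lat, timestamp):
    """Partitioning by device_id keeps one device's updates in order within a shard."""
    kinesis.put_record(
        StreamName="location-updates",
        Data=json.dumps({
            "device_id": device_id,
            "lng": lng,
            "lat": lat,
            "ts": timestamp,
        }),
        PartitionKey=device_id,
    )
```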

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run this once or twice a day. Is the purpose of Import.io as an API to let me build logic such as data storage and scheduled tasks (running queries multiple times a day) in my own application, or are there ways to schedule queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they provide the means to address this, then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, you can set up a cron job to run the query when you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. Each call will consume one of the requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
Unfortunately the script will not be very simple, since most websites return very different response structures to import.io, and, as you may already know, the premium version of the tool now provides scheduling capabilities.
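As a starting point, here is a minimal sketch of calling that extraction endpoint on a schedule (from cron, Jenkins, etc.) with Python's requests. The extractor ID, API key, and target URL are placeholders, and any parsing of the response will depend on the site being extracted.

```python
import requests

# Sketch of a scheduled job that runs an extractor live via the endpoint above;
# extractor ID, API key, and target URL are placeholders.
def run_extractor(extractor_id, api_key, target_url):
    resp = requests.get(
        "https://extraction.import.io/query/extractor/{0}".format(extractor_id),
        params={"_apikey": api_key, "url": target_url},
        timeout=60,
    )
    resp.raise_for_status()   # each successful call counts against your request limit
    return resp.json()
```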