PySpark: how to stream data from a given API URL

I was given an API URL and a method getUserPosts() which returns the data needed for my data processing function. I am able to get the data by using Client from suds.client as follows:
from suds.client import Client
from suds.xsd.doctor import ImportDoctor, Import
url = 'url'
imp = Import('http://schemas.xmlsoap.org/soap/encoding/')
imp.filter.add('filter')
d = ImportDoctor(imp)
client = Client(url, doctor=d)
tempResult = client.service.getUserPosts(user_ids='', date_from='2016-07-01 03:19:57',
                                         date_to='2016-08-01 03:19:57', limit=100, offset=0)
Now, each tempResult will contain 100 records. I want to stream the data from the given API URL into an RDD for parallelized processing. However, after reading the pyspark.streaming documentation, I can't find a streaming method for a custom data source. Could anyone give me an idea of how to do so?
Thank you.

After digging for a while, I found out how to solve the problem. I used Kafka streaming: basically you create a producer from the given API, publishing to a specified topic and port, and then a consumer that listens to that topic and port and streams the data into Spark.
Note that the producer and the consumer must run as different threads in order to achieve real-time streaming.
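A minimal sketch of that producer/consumer split, assuming kafka-python on the producer side and the Spark Streaming Kafka integration (spark-streaming-kafka, available up to roughly Spark 2.2) on the consumer side; the topic name user_posts, the broker address localhost:9092, the paging loop, and the process() function are illustrative placeholders, not part of the original setup:
# producer.py -- polls the SOAP API and publishes each page of results to Kafka
from kafka import KafkaProducer
from suds.client import Client
from suds.xsd.doctor import ImportDoctor, Import

imp = Import('http://schemas.xmlsoap.org/soap/encoding/')
imp.filter.add('filter')
client = Client('url', doctor=ImportDoctor(imp))

producer = KafkaProducer(bootstrap_servers='localhost:9092')

offset = 0
while True:
    posts = client.service.getUserPosts(user_ids='', date_from='2016-07-01 03:19:57',
                                        date_to='2016-08-01 03:19:57', limit=100, offset=offset)
    for post in posts:
        # serialize however fits your schema; str() is just a placeholder
        producer.send('user_posts', str(post).encode('utf-8'))
    producer.flush()
    offset += 100

# consumer.py -- a Spark Streaming job that turns the topic into micro-batch RDDs
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName='UserPostStream')
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(ssc, ['user_posts'],
                                       {'metadata.broker.list': 'localhost:9092'})

# each micro-batch arrives as an RDD of (key, value) pairs; process() is your own logic
stream.map(lambda kv: kv[1]).foreachRDD(lambda rdd: process(rdd))

ssc.start()
ssc.awaitTermination()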

Related

gRPC and C#: receive message bigger than maximum allowed

I am doing some tests to request data from a remote database from a client. For that, I have a gRPC client that calls a method on the gRPC server; the server uses EF to get the data and sends the result to the client.
In my case, I get about 3 MB of data, which is larger than the default maximum message size allowed for the channel.
I know that I can work around the problem when I create the channel in the client, for example by raising the limit to 60 MB:
var channel = GrpcChannel.ForAddress("http://localhost:5223",
    new GrpcChannelOptions
    {
        MaxReceiveMessageSize = 62914560,
        MaxSendMessageSize = 62914560,
    });
But although I can increase this when I create the channel, I can't rule out that some query will return more data than the maximum allowed.
So I would like to know how I can handle this.
In this case, the method is unary, it is not a stream.
Thanks.

Akka HTTP Source Streaming vs regular request handling

What is the advantage of using Source Streaming vs the regular way of handling requests? My understanding is that in both cases:
The TCP connection will be reused
Back-pressure will be applied between the client and the server
The only advantage of Source Streaming I can see is if there is a very large response and the client prefers to consume it in smaller chunks.
My use case is that I have a very long list of users (millions), and I need to call a service that performs some filtering on the users, and returns a subset.
Currently, on the server side I expose a batch API, and on the client, I just split the users into chunks of 1000, and make X batch calls in parallel using Akka HTTP Host API.
I am considering switching to HTTP streaming, but cannot quite figure out what the value would be.
You are missing one other huge benefit: memory efficiency. By having a streamed pipeline, client/server/client, all parties safely process data without running the risk of blowing up the memory allocation. This is particularly useful on the server side, where you always have to assume the clients may do something malicious...
Client Request Creation
Suppose the ultimate source of your millions of users is a file. You can create a stream source from this file:
val userFilePath : java.nio.file.Path = ???
val userFileSource = akka.stream.scaladsl.FileIO.fromPath(userFilePath)
This source can then be used to create your HTTP request, which will stream the users to the service:
import akka.http.scaladsl.model.HttpEntity.{Chunked, ChunkStreamPart}
import akka.http.scaladsl.model.{RequestEntity, ContentTypes, HttpRequest}
val httpRequest : HttpRequest =
  HttpRequest(uri = "http://filterService.io",
              entity = Chunked.fromData(ContentTypes.`text/plain(UTF-8)`, userFileSource))
This request will now stream the users to the service without consuming the entire file into memory. Only chunks of data will be buffered at a time; therefore, you can send a request with a potentially infinite number of users and your client will be fine.
Server Request Processing
Similarly, your server can be designed to accept a request with an entity that can potentially be of infinite length.
Your question says the service will filter the users; assuming we have a filtering function:
val isValidUser : (String) => Boolean = ???
This can be used to filter the incoming request entity and create a response entity which will feed the response:
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.model.HttpResponse
import akka.http.scaladsl.model.ContentTypes
import akka.http.scaladsl.model.HttpEntity.Chunked
import akka.stream.scaladsl.Source
import akka.util.ByteString

val route = extractDataBytes { userSource =>
  val responseSource : Source[ByteString, _] =
    userSource
      .map(_.utf8String)
      .filter(isValidUser)
      .map(ByteString.apply)

  complete(HttpResponse(entity = Chunked.fromData(ContentTypes.`text/plain(UTF-8)`,
                                                  responseSource)))
}
Client Response Processing
The client can similarly process the filtered users without reading them all into memory. We can, for example, dispatch the request and send all of the valid users to the console:
import akka.actor.ActorSystem
import akka.http.scaladsl.Http

// an implicit ActorSystem (and, on older Akka versions, a Materializer) is needed
// to run the request and the response stream
implicit val system : ActorSystem = ActorSystem()
import system.dispatcher

Http()
  .singleRequest(httpRequest)
  .map { response =>
    response
      .entity
      .dataBytes
      .map(_.utf8String)
      .runForeach(println)
  }

Twitter streaming API: track all tweets posted by users registered in my Twitter application

I am creating an application which needs to track all tweets from users who registered to my application. I tried to track those with the streaming API; there are the public API, the user API, and the site API.
In those APIs there is only an option to follow user IDs by adding them as a comma-separated list:
https://dev.twitter.com/streaming/overview/request-parameters#follow
But I think that is not flexible: if a new user registers, I need to rebuild the HTTP request, and if many users are tracked the query will be very long.
It will be
https://stream.twitter.com/1.1/statuses/filter.json?follow=[user1],[user2],[user3]........[userN],
I'm afraid the query won't fit. I just need a parameter to filter for all users who registered in my application, for example:
https://stream.twitter.com/1.1/statuses/filter.json?application=[applicationID]
but I think Twitter does not provide it.
So, is there any way to filter the stream by application ID?
I didn't see anything like tracking by application ID. If your query becomes too complex (too many follows/keywords), the public streaming API will reject it,
and you can't open more than two connections with the user stream. So the last solution is to use Site Streams: you can open as many user connections as you have users registered to your app.
BUT the docs says :
"Site Streams is currently in a closed beta. Applications are no
longer being accepted."
Contact Twitter to be sure.
Ars Technica has a very interesting article about it. Take a look at this code and the link at the end of this post.
If you are using Python, pycurl will do the job. It provides a way to execute a function for every little piece of data received.
import pycurl, json

STREAM_URL = "http://chirpstream.twitter.com/2b/user.json"
USER = "YOUR_USERNAME"
PASS = "XXXXXXXXX"
userlist = ['user1', ..., 'userN']  # screen names of your registered users

class StreamClient:
    def __init__(self):
        self.buffer = ""

    def on_receive(self, data):
        self.buffer += data
        # individual messages in the stream are delimited by "\r\n"
        if data.endswith("\r\n") and self.buffer.strip():
            content = json.loads(self.buffer)
            self.buffer = ""
            if "text" in content and content['user']['screen_name'] in userlist:
                pass  # do stuff

client = StreamClient()
conn = pycurl.Curl()
conn.setopt(pycurl.USERPWD, "%s:%s" % (USER, PASS))
conn.setopt(pycurl.URL, STREAM_URL)
conn.setopt(pycurl.WRITEFUNCTION, client.on_receive)
conn.perform()
You can find more information here Real time twitter stream api

How to write to different hbase tables in apache flume

I have configured Apache Flume to receive messages (JSON type) in HTTP source. My sinks are MongoDB and HBase.
How can I write the message to different collections and tables according to a specified field?
For example, let's assume we have T_1 and T_2. Now there is an incoming message that should be saved in T_1. How can I handle those messages and decide where they should be saved?
Try using the Multiplexing Channel Selector. The default one (the Replicating Channel Selector) copies the Flume event produced by the source to all of its configured channels. Nevertheless, the multiplexing one is able to put the event into a specific channel depending on the value of a header within the Flume event.
In order to create such a header according to your application logic, you will need to create a custom handler for the HTTP source. This can be done by implementing the HTTPSourceHandler interface of the API.
You can use a regex for tagging the message type plus multiplexing for sending it to the right destination.
For example, based on the message "TEST1":
Regex for a string/field:
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(TEST1)
Assign the interceptor to serializer SE1 (the serializer name becomes the header name):
agent.sources.s1.interceptors.i1.serializers=SE1
agent.sources.s1.interceptors.i1.serializers.SE1.name=Test
Send to the required channel; the channels (c1, c2) can be mapped to different sinks:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=Test
agent.sources.s1.selector.mapping.TEST1=c1
All events matching the regex will go to channel c1; others will be defaulted to c2:
agent.sources.s1.selector.default=c2
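For reference, a consolidated sketch of the source side of such an agent; the agent name, the HTTP source port, and the channel names are illustrative, and the c1/c2 channel and HBase/MongoDB sink definitions still have to be configured separately:
agent.sources=s1
agent.channels=c1 c2
agent.sources.s1.type=http
agent.sources.s1.port=8080
agent.sources.s1.channels=c1 c2
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(TEST1)
agent.sources.s1.interceptors.i1.serializers=SE1
agent.sources.s1.interceptors.i1.serializers.SE1.name=Test
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=Test
agent.sources.s1.selector.mapping.TEST1=c1
agent.sources.s1.selector.default=c2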

VEMap and a GeoRSS feed (hosted separately)

The scenario is as follows:
A WCF web service exists that outputs a valid GeoRSS feed. This lives in its own domain, as a number of different applications have access to it.
A web page (on a different site) has been created with an instance of a VEMap (Bing/Virtual Earth map object).
Now, VEMap can accept an input feed in this format via the following:
var layer = new VEShapeLayer();
var veLayerSpec = new VEShapeSourceSpecification(VEDataType.GeoRSS, "someurl", layer);
map.ImportShapeLayerData(veLayerSpec, onComplete, true);
onComplete is a callback function I'm using to replace the default pin graphic with something custom.
The question is in regard to "someurl", which is a path to a local XML file containing the geographic information (GeoRSS Simple format). I've realized this feed and the map must be hosted in the same domain, so I've created a generic handler that reads the remote feed and returns it in the same format.
var veLayerSpec = new VEShapeSourceSpecification(VEDataType.GeoRSS, "/somelocalhandler.ashx", layer);
When I do this, I get the VEMap error ("z is null"). This is the same error one would receive when trying to access a remote feed. When I copy the feed into a local XML file (i.e., "feed.xml") there is no error.
The order of operations is currently: remote feed -> local handler -> VEMap import
If I'm overcomplicating this procedure, let me know! I'm a bit new to the Bing Maps API and might have missed something. Any assistance is appreciated.
The format I have above is actually very close to what I needed. A similar solution was found by Mike McDougall. Although I was passing the RSS feed directly through the handler (writing the read stream directly), I just needed to specify the following from within the handler:
context.Response.ContentType = "text/xml";
context.Response.ContentEncoding = System.Text.Encoding.UTF8;
With the above fix, I'm able to have a remote GeoRSS feed successfully load a separately hosted Virtual Earth map instance.