Using a text file as Spark streaming source for testing purpose

I want to write a test for my Spark Streaming application that consumes a Flume source.
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock, but for the moment reading a file and verifying the output would be enough for me.
So I would like to use:
JavaStreamingContext streamingContext = ...
JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
stream.print();
streamingContext.awaitTermination();
streamingContext.start();
Unfortunately it does not print anything.
I tried:
dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
adding the text file to the directory BEFORE the program starts
adding the text file to the directory WHILE the program runs
Nothing works.
Any suggestion to read from text file?
Thanks,
Martin

The order of start() and awaitTermination() is indeed inverted: awaitTermination() blocks the calling thread, so a start() placed after it is never reached.
In addition to that, the easiest way to pass data to your Spark Streaming application for testing is a QueueDStream, which is backed by a mutable queue of RDDs of arbitrary data. This means that you can create the data programmatically, or load it from disk into an RDD, and pass that to your Spark Streaming code.
E.g., to avoid the timing issues of the file-based stream, you could try this:
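import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

// Load the test data into an RDD and hand it to the stream through a queue
val rdd = sparkContext.textFile(...)
val rddQueue: Queue[RDD[String]] = Queue()
rddQueue += rdd
val dstream = streamingContext.queueStream(rddQueue)
doMyStuffWithDstream(dstream)
// start() must be called before awaitTermination()
streamingContext.start()
streamingContext.awaitTermination()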

I am so stupid, I inverted the calls to start() and awaitTermination().
If you want to use textFileStream as I did, you should read from HDFS and add the file WHILE the program runs.

Related

What is the best practice for downloading large CSV files from S3 in Java?

I'm trying to get a large CSV file from S3 but the download fails with “java.net.SocketException: Connection reset”, which is probably due to the InputStream simply being open for too long (the download often takes more than an hour since I am doing multiple time-consuming processes on the streamed content). This is how I currently parse the file:
InputStream inputStream = new GZIPInputStream(s3Client.getObject("bucket", "key").getObjectContent());
Reader decoder = new InputStreamReader(inputStream, Charset.defaultCharset());
BufferedReader isr = new BufferedReader(decoder);
CSVParser csvParser = new CSVParser(isr, CSVFormat.DEFAULT);
CSVRecord nextRecord = csvParser.iterator().next();
...
I know I have to split the download into multiple short getObject-calls with a defined offset for the GetObjectRequest, but I'm wondering how to define this offset in case of a CSV, since I need complete lines.
Do I have to ditch the parser library and parse each line into an Object myself so I can keep a count of the read bytes and use it as an offset for the next batch? That doesn't seem very robust to me. Is there any best practice way to achieve "batch downloading" of CSV records?

I decided on simply using the dedicated getObject(GetObjectRequest getObjectRequest, File destinationFile) method to copy the entire CSV to a temporary file on disk. This closes the HTTP connection as soon as possible and allows me to get the InputStream from the local file with no problems. It doesn't resolve the question of the best way to download in batches, but it's a nice and simple workaround.
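The "copy to disk first, parse afterwards" workaround can be sketched without any S3 dependency. This is only an illustration under assumptions: the class and method names (SpoolThenParse, spoolToTempFile, readLines) are hypothetical, the in-memory stream in main stands in for the S3 object content, and plain line reading stands in for the CSV parser.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SpoolThenParse {

    // Copies the remote stream to a temp file and closes it right away,
    // so the network connection is only open for the duration of the copy.
    static Path spoolToTempFile(InputStream remote) throws IOException {
        Path tmp = Files.createTempFile("download-", ".csv.gz");
        try (InputStream in = remote) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    // Reads the gzipped local copy line by line; slow per-record work
    // can happen here without holding a remote connection open.
    static List<String> readLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the remote object: a small gzipped CSV held in memory.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8));
        }
        Path local = spoolToTempFile(new ByteArrayInputStream(buf.toByteArray()));
        try {
            System.out.println(readLines(local));
        } finally {
            Files.deleteIfExists(local); // remove the temp copy when done
        }
    }
}
```

With the real client, the getObject(GetObjectRequest, File) call mentioned above would take the place of spoolToTempFile, and the CSV parser would wrap the reader opened on the local file.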

JSR 352 : How do you write to a MVS Dataset from a Java Batch program?

I need to write to a non-VSAM dataset on the mainframe. I know that we need to use the ZFile library to do it and I found how to do it here
I am running my Java batch job in WebSphere Liberty on z/OS. How do I specify the dataset? Can I directly give the dataset a name like this?
dsnFile = new ZFile("X.Y.Z", "wb,type=record,noseek");
I am able to write to a text file on the server itself using Java's file writers, but I don't know how to access an MVS dataset.
I am relatively new to the world of z/OS and the mainframe.

It sounds like you might be asking more generally how to use the ZFile API on WebSphere Liberty on z/OS.
Have you tried something like:
String pdsName = ZFile.getSlashSlashQuotedDSN("X.Y.Z");
ZFile zfile = new ZFile(pdsName, ...options...);
As far as batch-specific use cases go, you may have to differentiate between writing to a new file that is created for the first time on an original execution and appending to an already-existing one on a restart.
You also might find some useful snippets in this doctorbatch.io repo, along with the original link you posted.
For reference, I'll copy/paste from the ZFile Javadoc:
ZFile dd = new ZFile("//DD:MYDD", "r");
Opens the DD name MYDD for reading
ZFile dsn = new ZFile("//'SYS1.HELP(ACCOUNT)'", "rt");
Opens the member ACCOUNT from the PDS SYS1.HELP for reading text records
ZFile dsn = new ZFile("//SEQ", "wb,type=record,recfm=fb,lrecl=80,noseek");
Opens the data set {MVS_USER}.SEQ for sequential binary writing. Note that ",noseek" should be specified with "type=record" if access is sequential, since performance is greatly improved.
One final note: a couple of other useful ZFile helper methods are bpxwdyn() and getFullyQualifiedDSN().

Spark issue with the class generated from avro schema

I have a piece of code written in Spark that loads data from HDFS into Java classes generated from Avro IDL. On an RDD created that way I execute a simple operation whose result depends on whether I cache the RDD beforehand.
I.e., if I run the code below
val loadedData = loadFromHDFS[Data](path,...)
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
the program prints 200000. On the other hand, executing the next code
val loadedData = loadFromHDFS[Data](path,...).cache()
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
results in 1 being printed to stdout.
When I inspect values of the fields after reading cached data it seems
I am pretty sure that the root cause of the described problem is an issue with serialization of the classes generated from Avro IDL, but I do not know how to resolve it. I tried using Kryo, registering the generated class (Data), and registering different serializers from chill_avro for the given class (SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of those ideas helped.
How can I solve this problem?
Link to minimal, complete, and verifiable example.

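Try the code below:
val loadedData = loadFromHDFS[Data](path,...)
// println returns Unit, so cache() cannot be chained after it; cache the mapped strings instead of the Avro objects
val keys = loadedData.map(x => x.getUserId + x.getDate).cache()
println(keys.distinct().count())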

Saving RDD to file results in _temporary path for parts

I have data in Spark which I want to save to S3. The recommended way to save it is the saveAsTextFile method on the RDD, which succeeds. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data, it has been saved in a folder named _temporary, with a subfolder 0, and each part or task is saved in its own folder.
For example,
data.saveAsTextFile("s3:/kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect and have seen something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?

I had the same problem with Spark Streaming. It was because my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark can't execute saveAsTextFile during the stream.

Try using coalesce() to reduce the rdd to 1 partition before you export.
Good luck!

Mimic file IO in j2me midlet using RMS

I want to be able to record audio and save it to persistent storage in my J2ME application. As I understand it, J2ME does not expose the handset's file system; instead it wants the developer to use the RMS system. I understand the idea behind RMS but cannot think of the best way to implement audio recording with it. I have a continuous stream of bytes from the audio input which must be saved.
1) Should I make a buffer and then periodically create a new record with the bytes in the buffer?
2) Should I put each sample in a new record?
3) Should I keep the entire recording in a byte array and only write it to the RMS when recording stops?
Is there a better way to achieve this than RMS?

Consider the code below and edit it as necessary. It should solve your problem by writing to the phone's file system directly.
// getRoots() is a helper from the poster's own code (not shown here)
getRoots();
FileConnection fc = null;
DataOutputStream dos = null;
// Make sure the target root exists
fc = (FileConnection) Connector.open("file:///E:/");
if (!fc.exists()) {
    fc.mkdir();
}
fc.close();
// Create the output file if needed, then write the recorded bytes
fc = (FileConnection) Connector.open("file:///E:/test.wav");
if (!fc.exists()) {
    fc.create();
}
dos = fc.openDataOutputStream();
dos.write(recordedSoundArray);
dos.close();
fc.close();
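
If the whole recording is too large to hold in one array, option 1 from the question (fill a buffer, then flush it periodically) might look roughly like the sketch below. It is written in plain Java SE for illustration only, so the lambda and generics in main would need adapting for a CLDC device. ChunkedRecorder and RecordSink are hypothetical names; RecordSink stands in for a call such as RecordStore.addRecord(data, offset, numBytes) or a write to an open output stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkedRecorder {

    /** Hypothetical stand-in for the storage call (e.g. RMS addRecord). */
    interface RecordSink {
        void addRecord(byte[] data, int offset, int numBytes);
    }

    private final byte[] buffer;
    private final RecordSink sink;
    private int used = 0;

    ChunkedRecorder(int chunkSize, RecordSink sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.buffer = new byte[chunkSize];
        this.sink = sink;
    }

    /** Appends audio bytes, emitting one record each time the buffer fills. */
    void write(byte[] samples, int offset, int length) {
        while (length > 0) {
            int n = Math.min(length, buffer.length - used);
            System.arraycopy(samples, offset, buffer, used, n);
            used += n;
            offset += n;
            length -= n;
            if (used == buffer.length) {
                flush();
            }
        }
    }

    /** Writes any buffered bytes as a final, possibly shorter, record. */
    void flush() {
        if (used > 0) {
            sink.addRecord(buffer, 0, used);
            used = 0;
        }
    }

    public static void main(String[] args) {
        // Collect the records in memory so the chunking can be inspected.
        List<byte[]> records = new ArrayList<>();
        ChunkedRecorder recorder = new ChunkedRecorder(4,
                (data, off, len) -> records.add(Arrays.copyOfRange(data, off, off + len)));
        recorder.write(new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 0, 10);
        recorder.flush(); // call this when recording stops
        System.out.println(records.size() + " records"); // prints "3 records"
    }
}
```

The chunk size trades memory against the number of records written: one record per sample (option 2) means very many small writes, while one array for the entire recording (option 3) needs the most memory.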
