Hadoop MapReduce testing - custom record reader

I have written a custom record reader and am looking for sample test code to test my custom reader using MRUnit or any other testing framework. It works fine functionally, but I would like to add test cases before I make an install. Any help would be appreciated.

In my opinion, a custom record reader is like any iterator. For testing my record reader I have been able to work without MRUnit or any other Hadoop JUnit framework. The test executes quickly and the footprint is small too. Initialize the record reader in your test case and keep iterating on it. Here is pseudocode from one of my tests. I can provide more details if you want to proceed in this direction.
// Uses org.apache.hadoop.mapreduce.* plus TaskAttemptContextImpl from
// org.apache.hadoop.mapreduce.task, and static imports from org.junit.Assert.
Configuration conf = new Configuration();
// configure the input format here (input path, any custom properties)
MyInputFormat myInputFormat = new MyInputFormat();
Job job = Job.getInstance(conf, "test");
conf = job.getConfiguration();
// verify split type and count here if you want to test the input format as well
List<InputSplit> splits = myInputFormat.getSplits(job);
TaskAttemptContext context = new TaskAttemptContextImpl(conf, new TaskAttemptID());
RecordReader<LongWritable, Text> reader = myInputFormat.createRecordReader(splits.get(0), context);
reader.initialize(splits.get(0), context);
for (int i = 0; i < expectedRecordCount; i++) {
    assertTrue(reader.nextKeyValue());
    // verify key and value, e.g. against expected keys captured in the test fixture
    assertEquals(expectedKeys.get(i), reader.getCurrentKey());
}
assertFalse(reader.nextKeyValue());

Related

Batch read from DBs

I'm a bit confused about how Go's database/sql package reads large datasets into memory. In this previous Stack Overflow question, How to set fetch size in golang?, there seem to be conflicting ideas about whether batching of large datasets on read happens or not.
I am writing a Go binary that connects to different remote DBs based on the input params given, fetches results, and subsequently converts them to a CSV file. Suppose I have a query that returns a lot of rows, say 20 million. Loading all of this into memory at once would be prohibitively expensive. Does the library batch the results automatically, loading the next batch into memory only on rows.Next()?
If the database/sql package does not handle it, are there options in the various driver packages?
https://github.com/golang/go/issues/13067 - From this issue and discussion, I understand that the general idea is to have the driver packages handle this. As mentioned in the issue, and also in this blog https://oralytics.com/2019/06/17/importance-of-setting-fetched-rows-size-for-database-query-using-golang/, I found out that Go's Oracle driver package has an option I can pass for batching, but I am not able to find an equivalent in the other driver packages.
To summarize:
1. Does database/sql batch read results automatically? If yes, then my 2nd and 3rd questions do not matter.
2. If no, are there options I can pass to the various driver packages to set the batch size, and where can I find what these options are? I have already looked at the pgx docs and cannot find anything there that sets a batch size.
3. Is there any other way to batch reads, such as a prepared statement with configuration specifying the batch size?
Some clarifications:
My question is: when a query returns a large dataset, is the entire dataset loaded into memory, or is it batched internally by some code called downstream of rows.Next()?
From what I can see, there is a chunk reader that gets created with a default 8 KB buffer size and is used to chunk the data. Are there cases where this does not happen, or are the results from the DB always chunked?
Is the 8 KB buffer size that the chunk reader uses configurable in any way?
For more clarity, here is the existing Java implementation that I am looking to rewrite in Go.
private static final int RESULT_SIZE = 10000;

private void generate() {
    ... // connection and other code ...
    Statement stmt = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(RESULT_SIZE);
    ResultSet resultset = stmt.executeQuery(dataQuery);
    String fileInHome = getFullFileName(filePath, manager, parentDir);
    rsToCSV(resultset, new BufferedWriter(new FileWriter(fileInHome)));
}

private void rsToCSV(ResultSet rs, BufferedWriter os) throws SQLException {
    ResultSetMetaData metaData = rs.getMetaData();
    int columnCount = metaData.getColumnCount();
    try (PrintWriter pw = new PrintWriter(os)) {
        readHeaders(metaData, columnCount, pw);
        if (rs.next()) {
            readRow(rs, metaData, columnCount, pw);
            while (rs.next()) {
                pw.println();
                readRow(rs, metaData, columnCount, pw);
            }
        }
    }
}
The stmt.setFetchSize(RESULT_SIZE) call tells the JDBC driver how many rows to fetch from the database per round trip; the result set is then processed row by row into a CSV.

Can anyone help me record a script using the Recording Controller in JMeter?

I have got as far as creating the proxy server, but I am facing a broken-socket issue when running the scripts in Firefox. When I perform some actions everything works, and then an error occurs.
Also, can someone explain what JMeterTreeModel and JMeterTreeNode are?
Scanner sc = new Scanner(System.in);
// recordingController recordingcontroller=new recordingController("testrecorder",RecordController.class);
// RecordingController rc= (RecordingController) recordingcontroller.buildTestElement();
RecordingController rc = new RecordingController();
GenericController gc = new GenericController();
rc.initialize();
gc.addTestElement(rc);
LoopController loopController = new LoopController();
loopController.setLoops(1);
loopController.setFirst(true);
loopController.setProperty(TestElement.TEST_CLASS, LoopController.class.getName());
loopController.setProperty(TestElement.GUI_CLASS, LoopControlPanel.class.getName());
loopController.initialize();
rc.addTestElement(loopController);
ThreadGroup threadGroup = new ThreadGroup();
threadGroup.setName("Thread-Group");
threadGroup.setSamplerController(loopController);
ProxyControl proxyController = new ProxyControl();
// proxyController.setProperty(TestElement.TEST_CLASS, ProxyControl.class.getName());
// proxyController.setProperty(TestElement.GUI_CLASS, ProxyControlGui.class.getName());
proxyController.setName("Proxy Recorder");
proxyController.setPort(4444);
// threadGroup.setSamplerController(rc);
// proxyController.setSamplerTypeName("SAMPLER_TYPE_JAVA_SAMPLER");
TestPlan testPlan = new TestPlan("My_Test_Plan");
testPlan.addTestElement(threadGroup);
testPlan.addTestElement(proxyController);
JMeterTreeModel jtm = new JMeterTreeModel();
proxyController.setNonGuiTreeModel(jtm);
JMeterTreeNode node = new JMeterTreeNode(proxyController,jtm);
// JMeterTreeNode node=new JMeterTreeNode();
proxyController.setTarget(node);
// proxyController.setCaptureHttpHeaders(true);
// proxyController.setUseKeepAlive(true);
// proxyController.setGroupingMode(4);
proxyController.setCaptureHttpHeaders(true);
proxyController.setProxyPauseHTTPSample("10000");
proxyController.setSamplerFollowRedirects(true);
proxyController.setSslDomains("www.geeksforgeeks.org");
proxyController.startProxy();
I don't think non-GUI proxy recording is something you can achieve with vanilla JMeter. If you have to automate the recording process, you will have to go for a desktop-application automation solution like Appium or LDTP.
If you need to record a JMeter script using Firefox on a system which doesn't have a GUI, I can think of the following approaches:
Use the Proxy2JMX Converter module of the Taurus tool
Use the BlazeMeter Proxy Recorder (by the way, it has a nice feature of exporting recorded scenarios in "SmartJMX" mode with automatic detection and correlation of dynamic parameters)

OutOfMemory on custom extractor

I have stitched a lot of small XML files into one file, and then made a custom extractor to return rows with one byte array corresponding to each original file.
Run on remote/master:
If I run it for one file (gzipped, 11 MB), it works fine.
If I run it for more than one file, I get a System.OutOfMemoryException.
Run on local/master:
Running it for one or more files (gzipped, 500+ MB) works fine.
The extractor looks like this:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();
        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);
        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;
        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml.ToString()));
            yield return output.AsReadOnly();
        }
        yield break;
    }
}
and the error message looks like this:
==== Caught exception System.OutOfMemoryException
at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924
So what am I doing wrong? And how do I debug this on remote?
Thanks!
Unfortunately, the local run does not enforce memory allocations, so you would have to check memory usage in local vertex debug yourself.
Looking at your code above, I see that you are loading the XML documents into a DOM. Please note that an XML DOM can explode the data size from the string representation by a factor of 10 or more (I have seen 2 to 12 in my time as the resident SQL XML guru).
Each UDO today only gets 1/2 GB of RAM to play with, so what I assume is that your XML DOM document(s) grow beyond that.
The usual recommendation is to use the XmlReader interface (there is a reader-based extractor in the samples at http://usql.io as well) and stream through the document(s) to find the information you are looking for.
If your documents are always small enough (e.g., <20 MB), you may want to make sure that you release the memory of the other documents and operate on one document at a time.
We do have plans to allow you to annotate your UDO with memory needs, but that is still a bit out.

JSR 352 : How do you write to a MVS Dataset from a Java Batch program?

I need to write to a non-VSAM dataset on the mainframe. I know that we need to use the ZFile library to do it, and I found how to do it here.
I am running my Java batch job on WebSphere Liberty on z/OS. How do I specify the dataset? Can I directly give the dataset a name like this?
dsnFile = new ZFile("X.Y.Z", "wb,type=record,noseek");
I am able to write to a text file on the server itself using Java's file writers, but I don't know how to access an MVS dataset.
I am relatively new to the world of z/OS and the mainframe.
It sounds like you might be asking more generally how to use the ZFile API on WebSphere Liberty on z/OS.
Have you tried something like:
String pdsName = ZFile.getSlashSlashQuotedDSN("X.Y.Z");
ZFile zfile = new ZFile(pdsName , ...options...)
As far as batch-specific use cases go, you might have to differentiate between writing to a new file that is created for the first time on an original execution, as opposed to appending to an already-existing one on a restart.
You also might find some useful snippets in this doctorbatch.io repo, along with the original link you posted.
For reference, I'll copy/paste from the ZFile Javadoc:
ZFile dd = new ZFile("//DD:MYDD", "r");
Opens the DD named MYDD for reading
ZFile dsn = new ZFile("//'SYS1.HELP(ACCOUNT)'", "rt");
Opens the member ACCOUNT from the PDS SYS1.HELP for reading text records
ZFile dsn = new ZFile("//SEQ", "wb,type=record,recfm=fb,lrecl=80,noseek");
Opens the data set {MVS_USER}.SEQ for sequential binary writing. Note that ",noseek" should be specified with "type=record" if access is sequential, since performance is greatly improved.
One final note: another couple of useful ZFile helper methods are bpxwdyn() and getFullyQualifiedDSN().
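Putting the pieces above together, here is a minimal sketch of writing a fixed-length record to a sequential data set; the data set name, recfm/lrecl options, and the IBM-1047 (EBCDIC) codepage are assumptions you would adjust for your environment:
import com.ibm.jzos.ZFile;

public class DatasetWriterSketch {
    public static void writeOneRecord() throws Exception {
        // Quote the DSN so it is treated as fully qualified rather than prefixed with {MVS_USER}
        String dsn = ZFile.getSlashSlashQuotedDSN("X.Y.Z");
        ZFile out = new ZFile(dsn, "wb,type=record,recfm=fb,lrecl=80,noseek");
        try {
            byte[] record = new byte[out.getLrecl()];                       // one 80-byte record
            byte[] text = "HELLO FROM JAVA BATCH".getBytes("IBM-1047");     // EBCDIC bytes
            System.arraycopy(text, 0, record, 0, text.length);
            out.write(record);                                              // record-mode write
        } finally {
            out.close();
        }
    }
}
On a restart you would presumably open the data set in an append mode instead of "wb", per the new-file-versus-restart distinction mentioned above.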

Integrating RFT Test framework to work with RQM

I designed a framework in RFT where the test cases are written in a spreadsheet specifying the data source, object and keyword, and a driver script processes all of this data and routes it to the appropriate method for each test step. Now I want to integrate this with RQM so that each of my test cases in the spreadsheet is shown as passed/failed in RQM. Any ideas?
You could implement an algorithm to read those test cases in the spreadsheet and pass them to RQM as attachments with logTestResult.
For example:
logTestResult( <your attachment> , true );
If you are already connected to RQM, the adapter will automatically attach the files you indicate to RQM. So at the end you will see the results step by step, and if the script ends correctly RQM will show the script as "passed".
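As an illustration only, the loop that reads the spreadsheet and reports each test case could look something like the sketch below; readTestCaseResults() and TestCaseResult are hypothetical placeholders for your spreadsheet layer:
// Hypothetical sketch inside a script that extends RationalTestScript
public void reportResults() throws Exception
{
    for (TestCaseResult tc : readTestCaseResults("testcases.xls")) {
        // one log entry per spreadsheet test case; RQM records each as passed/failed
        logTestResult("Test case: " + tc.getName(), tc.isPassed());
    }
}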
Thanks for the answer, Juan. I solved this by passing the test case name from the Script Argument part of RQM and fetching the arguments in my starter script as shown below:
public void testMain(Object[] args) throws Exception
{
    String n = args[0].toString();
    logInfo("Parameter from RQM: " + n);
    ModuleDriver d = new ModuleDriver();
    d.execute_main(n);
}
Since I have verification points set up for each of the steps in my test cases, the results get reported based on each of those verification points in RQM, which is what I needed.