NullPointerException when writing to Bigtable using Apache Beam's Dataflow SDK

I'm using Apache Beam SDK version 0.2.0-incubating-SNAPSHOT
and trying to write data to Bigtable with the Dataflow runner. Unfortunately I'm getting a NullPointerException when executing my Dataflow pipeline, where I'm using BigtableIO.Write as my sink. I've already checked my BigtableOptions and the parameters are fine for my needs.
Basically, I build the pipeline, and at some point I have the step that writes the PCollection<KV<ByteString, Iterable<Mutation>>> to my desired Bigtable table:
final BigtableOptions.Builder optionsBuilder =
    new BigtableOptions.Builder()
        .setProjectId(System.getProperty("PROJECT_ID"))
        .setInstanceId(System.getProperty("BT_INSTANCE_ID"));

// do intermediary steps and create PCollection<KV<ByteString, Iterable<Mutation>>>
// to write to bigtable
// modifiedHits is a PCollection<KV<ByteString, Iterable<Mutation>>>
modifiedHits.apply("writing to big table", BigtableIO.write()
    .withBigtableOptions(optionsBuilder)
    .withTableId(System.getProperty("BT_TABLENAME")));
p.run();
When executing the pipeline, I get a NullPointerException pointing exactly to the BigtableIO class, at the public void processElement(ProcessContext c) method:
(6e0ccd8407eed08b): java.lang.NullPointerException at org.apache.beam.sdk.io.gcp.bigtable.BigtableIO$Write$BigtableWriterFn.processElement(BigtableIO.java:532)
I checked that this method processes all elements before writing to Bigtable, but I'm not sure why I'm getting this exception every time I execute the pipeline. According to the code below, the method uses the bigtableWriter attribute to process each c.element(), but I can't even set a breakpoint to debug where exactly the null is. Any advice or suggestions on how to solve this issue?
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
    checkForFailures();
    Futures.addCallback(
        bigtableWriter.writeRecord(c.element()), new WriteExceptionCallback(c.element()));
    ++recordsWritten;
}
Thanks.

I looked up the job and its classpath, and if I'm not mistaken it looks like you're using version 0.3.0-incubating-SNAPSHOT of beam-sdks-java-{core,io}, but version 0.2.0-incubating-SNAPSHOT of google-cloud-dataflow-java.
I believe the issue is because of this - you have to use the same version for both (more details: BigtableIO in version 0.3.0 uses @Setup and @Teardown methods, but the 0.2.0 runner does not support them yet).

Related

Using Hibernate Reactive Panache with SDKs that switch thread

I'm using Reactive Panache for PostgreSQL. I need to take an application-level lock (Redis), inside which I need to perform certain operations. However, the Panache library throws the following error:
java.lang.IllegalStateException: HR000069: Detected use of the reactive Session from a different Thread than the one which was used to open the reactive Session - this suggests an invalid integration; original thread [222]: 'vert.x-eventloop-thread-3' current Thread [154]: 'vert.x-eventloop-thread-2'
My code looks something like this:
redissonClient.getLock("lock").lock(this.leaseTimeInMillis, TimeUnit.MILLISECONDS, this.lockId)
    .chain(() -> Panache.withTransaction(() -> Uni.createFrom().nullItem())
        .eventually(lock::release)
    );
Solutions such as the ones mentioned in this issue show the correct use with the AWS SDK, but not when used in conjunction with something like Redisson. Does anyone have this working with Redisson?
Update:
I tried the following on lock acquire and release:
.runSubscriptionOn(MutinyHelper.executor(Vertx.currentContext()))
This fails with the following error even though I have the quarkus-vertx dependency added:
Cannot invoke "io.vertx.core.Context.runOnContext(io.vertx.core.Handler)" because "context" is null
Panache might not be the best choice in this case.
I would try using Hibernate Reactive directly:
@Inject
Mutiny.SessionFactory factory;

...

redissonClient.getLock("lock")
    .lock(this.leaseTimeInMillis, TimeUnit.MILLISECONDS, this.lockId)
    .chain(() -> factory.withTransaction(session -> Uni.createFrom().nullItem())
        .eventually(lock::release))
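For reference, here is a minimal sketch of the injected Mutiny.SessionFactory used on its own, outside the locking flow. The AuditService class and AuditLog entity are hypothetical, just to show the shape of the call; depending on your Quarkus version the injection imports may be jakarta.* instead of javax.*.
import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;

import io.smallrye.mutiny.Uni;
import org.hibernate.reactive.mutiny.Mutiny;

@ApplicationScoped
public class AuditService {

    @Inject
    Mutiny.SessionFactory factory;

    // Persist a hypothetical AuditLog entity inside a reactive transaction.
    // withTransaction opens a session, begins a transaction, and commits
    // (or rolls back) when the returned Uni completes.
    public Uni<Void> record(AuditLog entry) {
        return factory.withTransaction(session -> session.persist(entry));
    }
}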

Byte Buddy: apply advice to interface methods - specifically Spring Data JPA

I am trying to log all methods that are invoked in my Spring Boot application using a Byte Buddy based Java agent.
I am able to log all layers except the Spring Data JPA repositories, which are actually interfaces. Below is the agent initialization:
new AgentBuilder.Default()
.type(ElementMatchers.hasSuperType(nameContains("com.soka.tracker.repository").and(ElementMatchers.isInterface())))
.transform(new AgentBuilder.Transformer.ForAdvice()
.include(TestAgent.class.getClassLoader())
.advice(ElementMatchers.any(), "com.testaware.MyAdvice"))
.installOn(instrumentation);
Any hints or workarounds I can use to log when my repository methods are invoked? Below is a sample repository in question:
package com.soka.tracker.repository;
.....
@Repository
public interface GeocodeRepository extends JpaRepository<Geocodes, Integer> {
Optional<Geocodes> findByaddress(String currAddress);
}
Modified agent:
new AgentBuilder.Default()
.ignore(new AgentBuilder.RawMatcher.ForElementMatchers(any(), isBootstrapClassLoader().or(isExtensionClassLoader())))
.ignore(new AgentBuilder.RawMatcher.ForElementMatchers(nameStartsWith("net.bytebuddy.")
.and(not(ElementMatchers.nameStartsWith(NamingStrategy.SuffixingRandom.BYTE_BUDDY_RENAME_PACKAGE + ".")))
.or(nameStartsWith("sun.reflect."))))
.type(ElementMatchers.nameContains("soka"))
.transform(new AgentBuilder.Transformer.ForAdvice()
.include(TestAgent.class.getClassLoader())
.advice(any(), "com.testaware.MyAdvice"))
//.with(AgentBuilder.Listener.StreamWriting.toSystemOut())
.with(AgentBuilder.TypeStrategy.Default.REDEFINE)
.installOn(instrumentation);
I see my advice applied around the controller and service layers, but the JPA repository layer is not getting logged.
By default, Byte Buddy ignores synthetic types in its agent. I assume that Spring's repository classes are marked as such and therefore not processed.
You can set a custom ignore matcher by using the AgentBuilder DSL. By default, the following ignore matcher is set to ignore system classes and Byte Buddy's own types:
new RawMatcher.Disjunction(
new RawMatcher.ForElementMatchers(any(), isBootstrapClassLoader().or(isExtensionClassLoader())),
new RawMatcher.ForElementMatchers(nameStartsWith("net.bytebuddy.")
.and(not(ElementMatchers.nameStartsWith(NamingStrategy.SuffixingRandom.BYTE_BUDDY_RENAME_PACKAGE + ".")))
.or(nameStartsWith("sun.reflect."))
.<TypeDescription>or(isSynthetic())))
You would probably need to remove the last condition.
For anybody visiting this question/problem: I was able to work around the original problem by logging the actual queries invoked during execution instead. Byte Buddy is awesome and very powerful - in my case I simply advise on my DB connection pool classes and gather all the required telemetry:
.or(ElementMatchers.nameContains("com.zaxxer.hikari.pool.HikariProxyConnection"))
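The advice class itself is never shown in this thread; here is a minimal sketch of what a logging com.testaware.MyAdvice could look like (the class name comes from the snippets above, the logging body is an assumption):
import java.lang.reflect.Method;

import net.bytebuddy.asm.Advice;

public class MyAdvice {

    // Inlined at the start of every matched method; logs the method being entered.
    @Advice.OnMethodEnter
    public static void enter(@Advice.Origin Method method) {
        System.out.println("Entering " + method.getDeclaringClass().getName()
                + "#" + method.getName());
    }

    // Inlined at method exit, including exceptional exits.
    @Advice.OnMethodExit(onThrowable = Throwable.class)
    public static void exit(@Advice.Origin Method method) {
        System.out.println("Leaving " + method.getName());
    }
}
Since advice code is inlined into the instrumented methods, keep it to static methods and plain JDK calls so the target classes do not pick up extra dependencies.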

Running Google Dataflow pipeline from a Google App Engine app?

I am creating a Dataflow job using DataflowPipelineRunner. I tried the following scenarios:
Without specifying any machineType
With the g1-small machine type
With n1-highmem-2
In all of the above scenarios, the input is a very small file (KB size) from GCS and the output is a BigQuery table.
I got an Out of memory error in all scenarios.
The size of my compiled code is 94 MB. I am only trying the word count example, and it did not read any input (it fails before the job starts). Please help me understand why I am getting this error.
Note: I am using App Engine to start the job.
Note: The same code works with beta version 0.4.150414.
EDIT 1
As per the suggestions in the answer, I tried the following:
Switched from automatic scaling to basic scaling.
Used machine type B2, which provides 256 MB of memory.
After these configuration changes, the Java heap memory problem was solved. But the job then tries to upload a jar of more than 10 MB to the staging location, so it fails.
It logs the following exception
com.google.api.client.http.HttpRequest execute: exception thrown while executing request
com.google.appengine.api.urlfetch.RequestPayloadTooLargeException: The request to https://www.googleapis.com/upload/storage/v1/b/pwccloudedw-stagging-bucket/o?name=appengine-api-L4wtoWwoElWmstI1Ia93cg.jar&uploadType=resumable&upload_id=AEnB2Uo6HCfw6Usa3aXlcOzg0g3RawrvuAxWuOUtQxwQdxoyA0cf22LKqno0Gu-hjKGLqXIo8MF2FHR63zTxrSmQ9Yk9HdCdZQ exceeded the 10 MiB limit.
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:157)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:45)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:543)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:422)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:275)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
at java.util.concurrent.FutureTask.run(FutureTask.java:260)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1168)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:605)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1$1.run(ApiProxyImpl.java:1152)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1.run(ApiProxyImpl.java:1146)
at java.lang.Thread.run(Thread.java:745)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$2$1.run(ApiProxyImpl.java:1195)
I tried directly uploading the jar file appengine-api-1.0-sdk-1.9.20.jar, but it still tries to upload the jar appengine-api-L4wtoWwoElWmstI1Ia93cg.jar,
which I don't know what it is. Any idea what this jar is would be appreciated.
Please help me fix this issue.
The short answer is that if you use App Engine on a Managed VM you will not encounter the App Engine sandbox limits (OOM when using an F1 or B1 instance class, execution time limit issues, whitelisted JRE classes). If you really want to run within the App Engine sandbox, then your use of the Dataflow SDK must conform to the limits of the App Engine sandbox. Below I explain common issues and what people have done to conform to those limits.
The Dataflow SDK requires an App Engine instance class with enough memory to run the user's application: constructing the pipeline, staging any resources, and sending the job description to the Dataflow service. Typically we have seen users need an instance class with more than 128 MB of memory to avoid OOM errors.
Constructing a pipeline and submitting it to the Dataflow service typically takes less than a couple of seconds if the required resources for your application are already staged. Uploading your JARs and any other resources to GCS can take longer than 60 seconds. This can be solved by pre-staging your JARs to GCS beforehand (the Dataflow SDK will skip staging them again if it detects they are already there) or by using a task queue to get a 10 minute limit (note that for large applications, 10 minutes may not be enough to stage all your resources).
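For example, here is a minimal sketch of pointing the job at a fixed staging location so that JARs uploaded once are reused on later submissions (the bucket path is a placeholder; package names assume the com.google.cloud.dataflow.sdk SDK used elsewhere in this thread):
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);

// Use a fixed staging location; JARs already present there with the same
// content hash are not uploaded again by the SDK.
options.setStagingLocation("gs://my-bucket/dataflow-staging"); // placeholder bucket

Pipeline p = Pipeline.create(options);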
Finally, within the AppEngine sandbox environment, you and all your dependencies are limited to using only whitelisted classes within the JRE or you'll get an exception like:
java.lang.SecurityException:
java.lang.IllegalAccessException: YYY is not allowed on ZZZ
...
EDIT 1
We compute a hash of the contents of the jars on the classpath and upload them to GCS with a modified filename. App Engine runs a sandboxed environment with its own JARs; appengine-api-L4wtoWwoElWmstI1Ia93cg.jar refers to appengine-api.jar, which is a jar that the sandboxed environment adds. You can see from our PackageUtil#getUniqueContentName(...) that we just append -$HASH before .jar.
We are working to resolve why you are seeing the RequestPayloadTooLarge exception; in the meantime it is recommended that you set the filesToStage option and filter out the jars not required to execute your Dataflow job to work around the issue you are facing. You can see how we build the files to stage in DataflowPipelineRunner#detectClassPathResourcesToStage(...).
I had the same problem with the 10 MB limit. What I did was filter out the JAR files bigger than that limit (instead of specific files), and then set the remaining files in the DataflowPipelineOptions with setFilesToStage.
So I just copied the method detectClassPathResourcesToStage from the Dataflow SDK and changed it slightly:
private static final long FILE_BYTES_THRESHOLD = 10 * 1024 * 1024; // 10 MB

protected static List<String> detectClassPathResourcesToStage(ClassLoader classLoader) {
    if (!(classLoader instanceof URLClassLoader)) {
        String message = String.format("Unable to use ClassLoader to detect classpath elements. "
            + "Current ClassLoader is %s, only URLClassLoaders are supported.", classLoader);
        throw new IllegalArgumentException(message);
    }

    List<String> files = new ArrayList<>();
    for (URL url : ((URLClassLoader) classLoader).getURLs()) {
        try {
            File file = new File(url.toURI());
            if (file.length() < FILE_BYTES_THRESHOLD) {
                files.add(file.getAbsolutePath());
            }
        } catch (IllegalArgumentException | URISyntaxException e) {
            String message = String.format("Unable to convert url (%s) to file.", url);
            throw new IllegalArgumentException(message, e);
        }
    }
    return files;
}
And then when I'm creating the DataflowPipelineOptions:
DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
dataflowOptions.setFilesToStage(detectClassPathResourcesToStage(DataflowPipelineRunner.class.getClassLoader()));
Here's a version of Helder's 10MB-filtering solution that will adapt to the default file-staging behavior of DataflowPipelineOptions even if it changes in a future version of the SDK.
Instead of duplicating the logic, it passes a throwaway copy of the DataflowPipelineOptions to DataflowPipelineRunner to see which files it would have staged, then removes any that are too big.
Note that this code assumes that you've defined a custom PipelineOptions class named MyOptions, along with a java.util.logging.Logger field named logger.
// The largest file size that can be staged to the dataflow service.
private static final long MAX_STAGED_FILE_SIZE_BYTES = 10 * 1024 * 1024;

/**
 * Returns the list of .jar/etc files to stage based on the
 * Options, filtering out any files that are too large for
 * DataflowPipelineRunner.
 *
 * <p>If this accidentally filters out a necessary file, it should
 * be obvious when the pipeline fails with a runtime link error.
 */
private static ImmutableList<String> getFilesToStage(MyOptions options) {
    // Construct a throw-away runner with a copy of the Options to see
    // which files it would have wanted to stage. This could be an
    // explicitly-specified list of files from the MyOptions param, or
    // the default list of files determined by DataflowPipelineRunner.
    List<String> baseFiles;
    {
        DataflowPipelineOptions tmpOptions =
            options.cloneAs(DataflowPipelineOptions.class);
        // Ignore the result; we only care about how fromOptions()
        // modifies its parameter.
        DataflowPipelineRunner.fromOptions(tmpOptions);
        baseFiles = tmpOptions.getFilesToStage();
        // Some value should have been set.
        Preconditions.checkNotNull(baseFiles);
    }

    // Filter out any files that are too large to stage.
    ImmutableList.Builder<String> filteredFiles = ImmutableList.builder();
    for (String file : baseFiles) {
        long size = new File(file).length();
        if (size < MAX_STAGED_FILE_SIZE_BYTES) {
            filteredFiles.add(file);
        } else {
            logger.info("Not staging large file " + file + ": length " + size
                + " >= max length " + MAX_STAGED_FILE_SIZE_BYTES);
        }
    }
    return filteredFiles.build();
}

/** Runs the processing pipeline with given options. */
public void runPipeline(MyOptions options)
        throws IOException, InterruptedException {
    // DataflowPipelineRunner can't stage large files;
    // remove any from the list.
    DataflowPipelineOptions dpOpts =
        options.as(DataflowPipelineOptions.class);
    dpOpts.setFilesToStage(getFilesToStage(options));

    // Run the pipeline as usual using "options".
    // ...
}

Mockolate Verify Error: Illegal override.. after Flex SDK 4.10 update

Since we upgraded the Flex SDK in our application to 4.10, we've been running into VerifyErrors while running unit tests that use Mockolate.
They seem to occur when mocking an interface where a ByteArray is used in a method signature.
Example interface:
public interface IFileSystemHelper {
function loadFileContents(path:String):ByteArray;
}
Example test class:
public class SomeTest {
[Rule]
public var mockolateRule:MockolateRule = new MockolateRule();
[Mock]
public var fileHelper:IFileSystemHelper;
public function SomeTest() {
}
[Test]
public function testMethod():void {
// ...
}
}
When compiling and running the test with flexmojos 6.0.1 the following error is thrown:
VerifyError: Error #1053: Illegal override of
IFileSystemHelper8F2B5D281827800A824B85B588C6F2A08AE814ED in
mockolate.generated.IFileSystemHelper8F2B5D281827800A824B85B588C6F2A08AE814ED
My initial suspicion was an SDK version problem with playerglobal (or airglobal in our case), so I recompiled Mockolate (and FlexUnit) with SDK 4.10, without any result.
The only thing that seems to work is removing the ByteArray type from the method signature... but that's not really an option :-) (and this has never been a problem before)
Is there anyone who has had a similar issue?
Thanks
This problem usually occurs when compiling different parts of your application with different versions of the SDK.
I would recommend having a look at the output of "mvn dependency:tree", as it lists all dependencies (direct and transitive ones). Perhaps this will help you find where the wrong version is coming from.

Hadoop: wrong classpath in map reduce job

I'm running a Cloudera cluster in 3 virtual machines and am trying to execute an HBase bulk load via a map reduce job. But I always get the error:
error: Class org.apache.hadoop.hbase.mapreduce.HFileOutputFormat not found
So it seems that the map process doesn't find the class. So I tried this:
1) add the hbase.jar to the HADOOP_CLASSPATH on every node
2) adding TableMapReduceUtil.addDependencyJars(job) / TableMapReduceUtil.addDependencyJars(myConf, HFileOutputFormat.class) to my source code
Nothing worked. I have absolutely no idea why the class is not found, because the jar/class is definitely available in the classpath.
If I take a look into the job.xml I see the following entry:
name=tmpjars value=file:/C:/Users/Thomas/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.3.0/zookeeper-3.4.5-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hbase/hbase/0.94.6-cdh4.3.0/hbase-0.94.6-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.3.0/hadoop-core-2.0.0-mr1-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,file:/C:/Users/Thomas/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar
This seems a little odd to me; these are my local jars on the Windows system. Maybe these should be the HDFS jars? If so, how can I change the values for "tmpjars"?
Here is the java code I try to execute:
configuration = new Configuration(false);
configuration.set("mapred.job.tracker", "192.168.2.41:8021");
configuration.set("fs.defaultFS", "hdfs://192.168.2.41:8020/");
configuration.set("hbase.zookeeper.quorum", "192.168.2.41");
configuration.set("hbase.zookeeper.property.clientPort", "2181");
Job job = new Job(configuration, "HBase Bulk Import for "
+ tablename);
job.setJarByClass(HBaseKVMapper.class);
job.setMapperClass(HBaseKVMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setOutputFormatClass(HFileOutputFormat.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setInputFormatClass(TextInputFormat.class);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileInputFormat.addInputPath(job, new Path("myfile1"));
FileOutputFormat.setOutputPath(job, new Path("myfile2"));
job.waitForCompletion(true);
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(
configuration);
loader.doBulkLoad(new Path("myFile3"), hTable);
EDIT:
I tried a little bit more and it's totally strange. I added the following line to the Java code:
job.setJarByClass(HFileOutputFormat.class);
After I executed this, the error was gone, but another ClassNotFoundException appeared:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mypackage.bulkLoad.HBaseKVMapper not found
HBaseKVMapper is the custom Mapper class I want to execute. So I tried to add it with "job.setJarByClass(HBaseKVMapper.class)", but it doesn't work since it's only a class file and not a jar. So I generated a jar file including HBaseKVMapper.class. After that, I executed it again and now got the HFileOutputFormat.class not found exception again.
After debugging a little bit, I found out that the setJarByClass method only copies the local jar file to .staging/job_#number/job.jar on HDFS. So this setJarByClass() method will only work for one jar file, because it overwrites job.jar when setJarByClass() is executed again with another jar.
While searching for the error I looked at the job staging directory and saw that the relevant jar files are inside the libjars directory.
So the hbase jar is inside the libjars directory, but the jobtracker doesn't use it for executing the job. Why?
I would try using Cloudera Manager (free version) as it takes care of these issues for you. Otherwise note the following:
Both your own classes and the HBase class HFileOutputFormat need to be available on the classpath locally and remotely.
Submitting the job
Meaning getting the classpath right locally for when your driver runs:
$ env HADOOP_CLASSPATH=$(hbase classpath) hadoop jar path/to/jar class....
On the server
In your hadoop-env.sh
export HADOOP_CLASSPATH=$(hbase classpath)
or use
TableMapReduceUtil.addDependencyJars
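A minimal sketch of that call in the driver, reusing the job setup from the question (the exact overloads may vary slightly between HBase versions):
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

// After configuring the job (mapper, output format, partitioner, ...), ship the
// HBase, ZooKeeper, Guava, etc. jars with the job instead of relying on the
// cluster classpath. "configuration" is the same object used in the question.
Job job = new Job(configuration, "HBase Bulk Import");
TableMapReduceUtil.addDependencyJars(job);
// Optionally add further classes whose containing jars must travel with the job:
TableMapReduceUtil.addDependencyJars(job.getConfiguration(), HFileOutputFormat.class);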
I found a "hacked" solution which worked for me, but I'm not happy with it because it's not really practical.
My "hacked" solution:
create one big jar with all necessary class files; I called it "big.jar" and added it to the local (Eclipse) classpath
add the line job.setJarByClass(MyMapperClass.class) ... MyMapperClass has to be in big.jar
When I execute this, big.jar is copied to the filesystem for every job. No errors anymore. The problem is that the jar is 80 MB in size and has to be copied every time.
If anyone knows a better way, I would be thankful if they could tell me how.
EDIT:
Now I'm trying to execute jobs with Apache Pig and have exactly the same problem. My hacked solution doesn't work in this case because Pig creates the jobs automatically. Here is the Pig error:
java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.mapreduce.TableSplit not found