Cloud Dataflow: Step to read csv file on AWS S3 (TextIO.read) sometimes get stuck - amazon-s3

example code is below.
// Java
// Apache Beam SDK verison: 2.16.0
final TupleTag<TableRow> successTag = new TupleTag<TableRow>() {};
final TupleTag<TableRow> deadLetterTag = new TupleTag<TableRow>() {};
Pipeline p = Pipeline.create(dataflowOptions)
PCollection<String> input = p.apply("ReadS3File", TextIO.read().from("s3://sourceBucket/sourceFilename.csv"));
PCollectionTuple outputTuple = input.apply("StringToBigQueryTableRow", ParDo.of(new DoFn<String, TableRow>() { /**/ }))).withOutputTags(successTag, TupleTagList.of(deadLetterTag)))
ReadS3File step gets stuck.
I'm reading the Dataflow documentation and examining the thread dump, it appears to be stuck at com.amazonaws.internal.SdkFilterInputStream.read
Common error guidance

I've tried to determine the root cause of the issue by analysing the thread dump, but I'm afraid it's not enough. I would recommend you to open a case on Google Cloud Platform, because this requires more information on your part which shouldn't be publicly shared in here.

Related

How to work with exported Stack Driver logs from Google Cloud Projects into BigQuery

I have created an "export" from my Stackdriver Logging page in my Google Cloud project. I configured the export to go to a BigQuery dataset.
When I go to BigQuery, I see the dataset.
There are no tables in my dataset, since Stackdriver export created the BigQuery dataset for me.
How do I see the data that was exported? Since there are no tables I cannot perform a "select * from X". I could create a table but I don't know what columns to add nor do I know how to tell Stackdriver logging to write to that table.
I must be missing a step.
Google has a short 1 minute video on exporting to Big Query but it stops exactly at the point where I am in the process.
When a new Stackdriver export is defined, it will then start to export newly written log records to the target sink (BQ in this case). As per the documentation found here:
https://cloud.google.com/logging/docs/export/
it states:
Since exporting happens for new log entries only, you cannot export
log entries that Logging received before your sink was created.
If one wants to export existing logs to a file, one can use gcloud (or API) as described here:
https://cloud.google.com/logging/docs/reference/tools/gcloud-logging#reading_log_entries
The output of this "dump" of existing log records can then used in whatever manner you see fit. For example, it could be imported into a BQ table.
To export logs in the bigquery from the stackdrive , you have to create Logger Sink using code or GCP logging UI
Then create Sink, add a filter.
https://cloud.google.com/logging/docs/export/configure_export_v2
Then add logs to stack driver using code
public static void writeLog(Severity severity, String logName, Map<String, String> jsonMap) {
List<Map<String, String>> maps = limitMap(jsonMap);
for (Map<String, String> map : maps) {
LogEntry logEntry = LogEntry.newBuilder(Payload.JsonPayload.of(map))
.setSeverity(severity)
.setLogName(logName)
.setResource(monitoredResource)
.build();
logging.write(Collections.singleton(logEntry));
}
}
private static MonitoredResource monitoredResource =
MonitoredResource.newBuilder("global")
.addLabel("project_id", logging.getOptions().getProjectId())
.build();
https://cloud.google.com/bigquery/docs/writing-results

Nexus 3 Repository Manager Create (Or Run Pre-generated) Task Without Using User Interface

This question arose when I was trying to reboot my Nexus3 container on a weekly schedule and connect to an S3 bucket I have. I have my container set up to connect to the S3 bucket just fine (it creates a new [A-Z,0-9]-metrics.properties file each time) but the previous artifacts are not found when looking though the UI.
I used the Repair - Reconcile component database from blob store task from the UI settings and it works great!
But... all the previous steps are done automatically through scripts and I would like the same for the final step of Reconciling the blob store.
Connecting to the S3 blob store is done with reference to examples from nexus-book-examples. As below:
Map<String, String> config = new HashMap<>()
config.put("bucket", "nexus-artifact-storage")
blobStore.createS3BlobStore('nexus-artifact-storage', config)
AWS credentials are provided during the docker run step so the above is all that is needed for the blob store set up. It is called by a modified version of provision.sh, which is a script from the nexus-book-examples git page.
Is there a way to either:
Create a task with a groovy script? or,
Reference one of the task types and run the task that way with a POST?
depending on the specific version of repository manager that you are using, there may be REST endpoints for listing and running scheduled tasks. This was introduced in 3.6.0 according to this ticket: https://issues.sonatype.org/browse/NEXUS-11935. For more information about the REST integration in 3.x, check out the following: https://help.sonatype.com/display/NXRM3/Tasks+API
For creating a scheduled task, you will have to add some groovy code. Perhaps the following would be a good start:
import org.sonatype.nexus.scheduling.TaskConfiguration
import org.sonatype.nexus.scheduling.TaskInfo
import org.sonatype.nexus.scheduling.TaskScheduler
import groovy.json.JsonOutput
import groovy.json.JsonSlurper
class TaskXO
{
String typeId
Boolean enabled
String name
String alertEmail
Map<String, String> properties
}
TaskXO task = new JsonSlurper().parseText(args)
TaskScheduler scheduler = container.lookup(TaskScheduler.class.name)
TaskConfiguration config = scheduler.createTaskConfigurationInstance(task.typeId)
config.enabled = task.enabled
config.name = task.name
config.alertEmail = task.alertEmail
task.properties?.each { key, value -> config.setString(key, value) }
TaskInfo taskInfo = scheduler.scheduleTask(config, scheduler.scheduleFactory.manual())
JsonOutput.toJson(taskInfo)

can we use Spark sql for reporting queries in REST web services

Some basic question regarding Spark. Can we use spark only in the context of processing jobs?In our use case we have stream of positon and motion data which we can refine and save it to cassandra tables.That is done with kafka and spark streaming.But for a web user who want to view some report with some search criteria can we use Spark(Spark SQL).Or for this purpose should we restrict to cql ? If we can use spark , how can we invoke spark-sql from a webservice deployed in tomcat server.
Well, you can do it by passing a SQL request via HTML address like:
http://yourwebsite.com/Requests?query=WOMAN
At the receiving point, the architecture will be something like:
Tomcat+Servlet --> Apache Kafka/Flume --> Spark Streaming --> Spark SQL inside a SS closure
In the servlet (if you don't know what a servlet is, better look it up) in the webapplication folder in your tomcat, you will have something like this:
public class QueryServlet extends HttpServlet{
#Override
public void doGet(ttpServletRequest request, HttpServletResponse response){
String requestChoice = request.getQueryString().split("=")[0];
String requestArgument = request.getQueryString().split("=")[1];
KafkaProducer<String, String> producer;
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("acks", "all");
properties.setProperty("retries", "0");
properties.setProperty("batch.size", "16384");
properties.setProperty("auto.commit.interval.ms", "1000");
properties.setProperty("linger.ms", "0");
properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.setProperty("block.on.buffer.full", "true");
producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(
requestChoice,
requestArgument));
In the Spark Streaming running application (which you need to be running in order to catch the queries, otherwise you know how long it takes Spark to start), You need to have a Kafka Receiver
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(batchInt*1000));
Map<String, Integer> topicMap = new HashMap<>();
topicMap.put("wearable", 1);
//FIrst Dstream is a couple made by the topic and the value written to the topic
JavaPairReceiverInputDStream<String, String> kafkaStream =
KafkaUtils.createStream(jssc, "localhost:2181", "test", topicMap);
After this, what happens is that
You do a GET setting either the GET body or giving the argument to the query
The GET is caught by your servlet, which immediately creates, send, close a Kafka Producer (it is possible to actually avoid the Kafka Step, simply sending your Spark Streaming app the information in any other way; see SparkStreaming receivers)
Spark Streaming operates your SparkSQL code as any other submitted Spark application, but it keeps running waiting for other queries to come.
Of course, in the servlet you should check the validity of the request, but this is the main idea. Or at least the architecture I've been using

Running Google Dataflow pipeline from a Google App Engine app?

I am creating a dataflow job using DataflowPipelineRunner. I tried the following scenarios.
Without specifying any machineType
With g1 small machine
with n1-highmem-2
In all the above scenarios, Input is a file from GCS which is very small file(KB size) and output is Big Query table.
I got Out of memory error in all the scenarios
The size of my compiled code is 94mb. I am trying only word count example and it did not read any input(It fails before the job starts). Please help me understand why i am getting this error.
Note: I am using appengine to start the job.
Note: The same code works with beta versoin 0.4.150414
EDIT 1
As per the suggestions in the answer tried the following,
Switched from Automatic scaling to Basic Scaling.
Used machine type B2 which provides 256MB memory
After these configuration, Java Heap Memory problem is solved. But it is trying to upload a jar into stagging location which is more than 10Mb, hence it fails.
It logs the following exception
com.google.api.client.http.HttpRequest execute: exception thrown while executing request
com.google.appengine.api.urlfetch.RequestPayloadTooLargeException: The request to https://www.googleapis.com/upload/storage/v1/b/pwccloudedw-stagging-bucket/o?name=appengine-api-L4wtoWwoElWmstI1Ia93cg.jar&uploadType=resumable&upload_id=AEnB2Uo6HCfw6Usa3aXlcOzg0g3RawrvuAxWuOUtQxwQdxoyA0cf22LKqno0Gu-hjKGLqXIo8MF2FHR63zTxrSmQ9Yk9HdCdZQ exceeded the 10 MiB limit.
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:157)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:45)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:543)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:422)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:275)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
at java.util.concurrent.FutureTask.run(FutureTask.java:260)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1168)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:605)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1$1.run(ApiProxyImpl.java:1152)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1.run(ApiProxyImpl.java:1146)
at java.lang.Thread.run(Thread.java:745)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$2$1.run(ApiProxyImpl.java:1195)
I tried directly uploading the jar file - appengine-api-1.0-sdk-1.9.20.jar, still it tries to upload this jar appengine-api-L4wtoWwoElWmstI1Ia93cg.jar.
which i dont know what jar it is. Any idea on what this jar is appreciated.
Please help me to fix this issue.
The short answer is that if you use AppEngine on a Managed VM you will not encounter the AppEngine sandbox limits (OOM when using a F1 or B1 instance class, execution time limit issues, whitelisted JRE classes). If you really want to run within the App Engine sandbox, then your use of the Dataflow SDK most conform to the limits of the AppEngine sandbox. Below I explain common issues and what people have done to conform to the AppEngine sandbox limits.
The Dataflow SDK requires an AppEngine instance class which has enough memory to execute the users application to construct the pipeline, stage any resources, and send the job description to the Dataflow service. Typically we have seen users require using an instance class with more than 128mb of memory to not see OOM errors.
Generally constructing a pipeline and submitting it to the Dataflow service typically takes less than a couple of seconds if the required resources for your application are already staged. Uploading your JARs and any other resources to GCS can take longer than 60 seconds. This can be solved manually by pre-staging your JARs to GCS beforehand (the Dataflow SDK will skip staging them again if it detects they are already there) or using a task queue to get a 10 minute limit (note that for large applications, 10 mins may not be enough to stage all your resources).
Finally, within the AppEngine sandbox environment, you and all your dependencies are limited to using only whitelisted classes within the JRE or you'll get an exception like:
java.lang.SecurityException:
java.lang.IllegalAccessException: YYY is not allowed on ZZZ
...
EDIT 1
We perform a hash of the contents of the jars on the classpath and upload them to GCS with a modified filename. AppEngine runs a sandboxed environment with its own JARs, appengine-api-L4wtoWwoElWmstI1Ia93cg.jar refers to appengine-api.jar which is a jar that the sandboxed environment adds. You can see from our PackageUtil#getUniqueContentName(...) that we just append -$HASH before .jar.
We are working to solve why you are seeing the RequestPayloadToLarge excepton and it is currently recommended that you set the filesToStage option and filter out the jars not required to execute your Dataflow to get around the issue that you face. You can see how we build the files to stage with DataflowPipelineRunner#detectClassPathResourcesToStage(...).
I had the same problem with the 10MB limit. What I did was filtering out the JAR files bigger than that limit (instead of specific files), and then set the renaming files in the DataflowPipelineOptions with setFilesToStage.
So I just copied the method detectClassPathResourcesToStage from the Dataflow SDK and changed it sightly:
private static final long FILE_BYTES_THRESHOLD = 10 * 1024 * 1024; // 10 MB
protected static List<String> detectClassPathResourcesToStage(ClassLoader classLoader) {
if (!(classLoader instanceof URLClassLoader)) {
String message = String.format("Unable to use ClassLoader to detect classpath elements. "
+ "Current ClassLoader is %s, only URLClassLoaders are supported.", classLoader);
throw new IllegalArgumentException(message);
}
List<String> files = new ArrayList<>();
for (URL url : ((URLClassLoader) classLoader).getURLs()) {
try {
File file = new File(url.toURI());
if (file.length() < FILE_BYTES_THRESHOLD) {
files.add(file.getAbsolutePath());
}
} catch (IllegalArgumentException | URISyntaxException e) {
String message = String.format("Unable to convert url (%s) to file.", url);
throw new IllegalArgumentException(message, e);
}
}
return files;
}
And then when I'm creating the DataflowPipelineOptions:
DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
dataflowOptions.setFilesToStage(detectClassPathResourcesToStage(DataflowPipelineRunner.class.getClassLoader()));
Here's a version of Helder's 10MB-filtering solution that will adapt to the default file-staging behavior of DataflowPipelineOptions even if it changes in a future version of the SDK.
Instead of duplicating the logic, it passes a throwaway copy of the DataflowPipelineOptions to DataflowPipelineRunner to see which files it would have staged, then removes any that are too big.
Note that this code assumes that you've defined a custom PipelineOptions class named MyOptions, along with a java.util.Logger field named logger.
// The largest file size that can be staged to the dataflow service.
private static final long MAX_STAGED_FILE_SIZE_BYTES = 10 * 1024 * 1024;
/**
* Returns the list of .jar/etc files to stage based on the
* Options, filtering out any files that are too large for
* DataflowPipelineRunner.
*
* <p>If this accidentally filters out a necessary file, it should
* be obvious when the pipeline fails with a runtime link error.
*/
private static ImmutableList<String> getFilesToStage(MyOptions options) {
// Construct a throw-away runner with a copy of the Options to see
// which files it would have wanted to stage. This could be an
// explicitly-specified list of files from the MyOptions param, or
// the default list of files determined by DataflowPipelineRunner.
List<String> baseFiles;
{
DataflowPipelineOptions tmpOptions =
options.cloneAs(DataflowPipelineOptions.class);
// Ignore the result; we only care about how fromOptions()
// modifies its parameter.
DataflowPipelineRunner.fromOptions(tmpOptions);
baseFiles = tmpOptions.getFilesToStage();
// Some value should have been set.
Preconditions.checkNotNull(baseFiles);
}
// Filter out any files that are too large to stage.
ImmutableList.Builder<String> filteredFiles = ImmutableList.builder();
for (String file : baseFiles) {
long size = new File(file).length();
if (size < MAX_STAGED_FILE_SIZE_BYTES) {
filteredFiles.add(file);
} else {
logger.info("Not staging large file " + file + ": length " + size
+ " >= max length " + MAX_STAGED_FILE_SIZE_BYTES);
}
}
return filteredFiles.build();
}
/** Runs the processing pipeline with given options. */
public void runPipeline(MyOptions options)
throws IOException, InterruptedException {
// DataflowPipelineRunner can't stage large files;
// remove any from the list.
DataflowPipelineOptions dpOpts =
options.as(DataflowPipelineOptions.class);
dpOpts.setFilesToStage(getFilesToStage(options));
// Run the pipeline as usual using "options".
// ...
}

Using a text file as Spark streaming source for testing purpose

I want to write a test for my spark streaming application that consume a flume source.
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock but for the moment reading a file and verifying outputs would be enough for me.
So I wish to use :
JavaStreamingContext streamingContext = ...
JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
stream.print();
streamingContext.awaitTermination();
streamingContext.start();
Unfortunately it does not print anything.
I tried:
dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
adding the text file in the directory BEFORE the program begins
adding the text file in the directory WHILE the program run
Nothing works.
Any suggestion to read from text file?
Thanks,
Martin
Order of start and await are indeed inversed.
In addition to that, the easiest way to pass data to your Spark Streaming application for testing is a QueueDStream. It's a mutable queue of RDD of arbitrary data. This means that you could create the data programmatically or load it from disk into an RDD and pass that to your Spark Streaming code.
Eg. to avoid the timing issues faced with the fileConsumer, you could try this:
val rdd = sparkContext.textFile(...)
val rddQueue: Queue[RDD[String]] = Queue()
rddQueue += rdd
val dstream = streamingContext.queueStream(rddQueue)
doMyStuffWithDstream(dstream)
streamingContext.start()
streamingContext.awaitTermination()
I am so stupid, I inverted calls to start() and awaitTermination()
If you want to do the same, you should read from HDFS, and add the file WHILE the program runs.