Running Google Dataflow pipeline from a Google App Engine app? - google-bigquery

I am creating a dataflow job using DataflowPipelineRunner. I tried the following scenarios.
Without specifying any machineType
With a g1-small machine
With n1-highmem-2
In all of the above scenarios, the input is a very small file (KB size) from GCS and the output is a BigQuery table.
I got an out-of-memory error in all of these scenarios.
The size of my compiled code is 94 MB. I am only trying the word count example, and it does not read any input (it fails before the job starts). Please help me understand why I am getting this error.
Note: I am using App Engine to start the job.
Note: The same code works with beta version 0.4.150414
EDIT 1
Following the suggestions in the answer, I tried the following:
Switched from automatic scaling to basic scaling.
Used instance class B2, which provides 256 MB of memory.
With this configuration the Java heap memory problem is solved. However, it now tries to upload a JAR larger than 10 MB to the staging location, and that fails.
It logs the following exception:
com.google.api.client.http.HttpRequest execute: exception thrown while executing request
com.google.appengine.api.urlfetch.RequestPayloadTooLargeException: The request to https://www.googleapis.com/upload/storage/v1/b/pwccloudedw-stagging-bucket/o?name=appengine-api-L4wtoWwoElWmstI1Ia93cg.jar&uploadType=resumable&upload_id=AEnB2Uo6HCfw6Usa3aXlcOzg0g3RawrvuAxWuOUtQxwQdxoyA0cf22LKqno0Gu-hjKGLqXIo8MF2FHR63zTxrSmQ9Yk9HdCdZQ exceeded the 10 MiB limit.
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:157)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:45)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:543)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:422)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:275)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
at java.util.concurrent.FutureTask.run(FutureTask.java:260)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1168)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:605)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1$1.run(ApiProxyImpl.java:1152)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1.run(ApiProxyImpl.java:1146)
at java.lang.Thread.run(Thread.java:745)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$2$1.run(ApiProxyImpl.java:1195)
I tried uploading the jar file appengine-api-1.0-sdk-1.9.20.jar directly, but it still tries to upload appengine-api-L4wtoWwoElWmstI1Ia93cg.jar,
and I don't know what that jar is. Any idea what it is would be appreciated.
Please help me fix this issue.

The short answer is that if you use AppEngine on a Managed VM you will not encounter the AppEngine sandbox limits (OOM when using an F1 or B1 instance class, execution time limit issues, whitelisted JRE classes). If you really want to run within the AppEngine sandbox, then your use of the Dataflow SDK must conform to the limits of the AppEngine sandbox. Below I explain common issues and what people have done to conform to those limits.
The Dataflow SDK requires an AppEngine instance class with enough memory to execute the user's application, which constructs the pipeline, stages any resources, and sends the job description to the Dataflow service. Typically, users have needed an instance class with more than 128 MB of memory to avoid OOM errors.
Constructing a pipeline and submitting it to the Dataflow service typically takes less than a couple of seconds if the resources your application requires are already staged. Uploading your JARs and any other resources to GCS can take longer than 60 seconds. You can solve this either by pre-staging your JARs to GCS manually beforehand (the Dataflow SDK skips staging them again if it detects they are already there) or by using a task queue to get a 10-minute limit (note that for large applications, 10 minutes may not be enough to stage all your resources).
Finally, within the AppEngine sandbox environment, you and all your dependencies are limited to using only whitelisted classes within the JRE or you'll get an exception like:
java.lang.SecurityException:
java.lang.IllegalAccessException: YYY is not allowed on ZZZ
...
EDIT 1
We perform a hash of the contents of the jars on the classpath and upload them to GCS with a modified filename. AppEngine runs a sandboxed environment with its own JARs, appengine-api-L4wtoWwoElWmstI1Ia93cg.jar refers to appengine-api.jar which is a jar that the sandboxed environment adds. You can see from our PackageUtil#getUniqueContentName(...) that we just append -$HASH before .jar.
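The renaming described above can be sketched in plain Java. This is only an illustration, not the SDK's PackageUtil: the class name StagedNameSketch is hypothetical, and the choice of MD5 with URL-safe Base64 is an assumption (it happens to give a 22-character suffix like the one in the question, but the answer does not say which hash the SDK uses).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Illustrative only: a hypothetical stand-in, not the Dataflow SDK's PackageUtil. */
final class StagedNameSketch {

    /** Inserts "-<hash of contents>" before the file extension. */
    static String uniqueContentName(String fileName, byte[] contents) {
        String hash;
        try {
            // Assumption: MD5 digest, URL-safe Base64, no padding (22 characters).
            byte[] digest = MessageDigest.getInstance("MD5").digest(contents);
            hash = Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            // No extension: just append the hash.
            return fileName + "-" + hash;
        }
        return fileName.substring(0, dot) + "-" + hash + fileName.substring(dot);
    }

    public static void main(String[] args) {
        byte[] fakeJar = "pretend these are jar bytes".getBytes(StandardCharsets.UTF_8);
        // Prints a name of the form appengine-api-<22 chars>.jar
        System.out.println(uniqueContentName("appengine-api.jar", fakeJar));
    }
}
```

Because the suffix depends only on the file contents, the same jar always maps to the same staged name, which is what lets a stager skip files that are already present.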
We are working out why you are seeing the RequestPayloadTooLargeException. For now, the recommended workaround is to set the filesToStage option and filter out the jars that are not required to execute your Dataflow pipeline. You can see how we build the list of files to stage in DataflowPipelineRunner#detectClassPathResourcesToStage(...).

I had the same problem with the 10 MB limit. What I did was filter out the JAR files bigger than that limit (instead of specific files), and then set the remaining files in the DataflowPipelineOptions with setFilesToStage.
So I just copied the method detectClassPathResourcesToStage from the Dataflow SDK and changed it slightly:
// Imports needed by this snippet:
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

private static final long FILE_BYTES_THRESHOLD = 10 * 1024 * 1024; // 10 MB

protected static List<String> detectClassPathResourcesToStage(ClassLoader classLoader) {
    if (!(classLoader instanceof URLClassLoader)) {
        String message = String.format("Unable to use ClassLoader to detect classpath elements. "
                + "Current ClassLoader is %s, only URLClassLoaders are supported.", classLoader);
        throw new IllegalArgumentException(message);
    }
    List<String> files = new ArrayList<>();
    for (URL url : ((URLClassLoader) classLoader).getURLs()) {
        try {
            File file = new File(url.toURI());
            // Only stage files that are below the 10 MB limit.
            if (file.length() < FILE_BYTES_THRESHOLD) {
                files.add(file.getAbsolutePath());
            }
        } catch (IllegalArgumentException | URISyntaxException e) {
            String message = String.format("Unable to convert url (%s) to file.", url);
            throw new IllegalArgumentException(message, e);
        }
    }
    return files;
}
And then when I'm creating the DataflowPipelineOptions:
DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
dataflowOptions.setFilesToStage(detectClassPathResourcesToStage(DataflowPipelineRunner.class.getClassLoader()));

Here's a version of Helder's 10MB-filtering solution that will adapt to the default file-staging behavior of DataflowPipelineOptions even if it changes in a future version of the SDK.
Instead of duplicating the logic, it passes a throwaway copy of the DataflowPipelineOptions to DataflowPipelineRunner to see which files it would have staged, then removes any that are too big.
Note that this code assumes that you've defined a custom PipelineOptions class named MyOptions, along with a java.util.logging.Logger field named logger.
// The largest file size that can be staged to the dataflow service.
private static final long MAX_STAGED_FILE_SIZE_BYTES = 10 * 1024 * 1024;

/**
 * Returns the list of .jar/etc files to stage based on the
 * Options, filtering out any files that are too large for
 * DataflowPipelineRunner.
 *
 * <p>If this accidentally filters out a necessary file, it should
 * be obvious when the pipeline fails with a runtime link error.
 */
private static ImmutableList<String> getFilesToStage(MyOptions options) {
    // Construct a throw-away runner with a copy of the Options to see
    // which files it would have wanted to stage. This could be an
    // explicitly-specified list of files from the MyOptions param, or
    // the default list of files determined by DataflowPipelineRunner.
    List<String> baseFiles;
    {
        DataflowPipelineOptions tmpOptions =
                options.cloneAs(DataflowPipelineOptions.class);
        // Ignore the result; we only care about how fromOptions()
        // modifies its parameter.
        DataflowPipelineRunner.fromOptions(tmpOptions);
        baseFiles = tmpOptions.getFilesToStage();
        // Some value should have been set.
        Preconditions.checkNotNull(baseFiles);
    }
    // Filter out any files that are too large to stage.
    ImmutableList.Builder<String> filteredFiles = ImmutableList.builder();
    for (String file : baseFiles) {
        long size = new File(file).length();
        if (size < MAX_STAGED_FILE_SIZE_BYTES) {
            filteredFiles.add(file);
        } else {
            logger.info("Not staging large file " + file + ": length " + size
                    + " >= max length " + MAX_STAGED_FILE_SIZE_BYTES);
        }
    }
    return filteredFiles.build();
}

/** Runs the processing pipeline with given options. */
public void runPipeline(MyOptions options)
        throws IOException, InterruptedException {
    // DataflowPipelineRunner can't stage large files;
    // remove any from the list.
    DataflowPipelineOptions dpOpts =
            options.as(DataflowPipelineOptions.class);
    dpOpts.setFilesToStage(getFilesToStage(options));
    // Run the pipeline as usual using "options".
    // ...
}


Migrating from Microsoft.Azure.Storage.Blob to Azure.Storage.Blobs - directory concepts missing

These are great guides for migrating between the different versions of NuGet package:
https://github.com/Azure/azure-sdk-for-net/blob/Azure.Storage.Blobs_12.6.0/sdk/storage/Azure.Storage.Blobs/README.md
https://elcamino.cloud/articles/2020-03-30-azure-storage-blobs-net-sdk-v12-upgrade-guide-and-tips.html
However I am struggling to migrate the following concepts in my code:
// Return if a directory exists:
container.GetDirectoryReference(path).ListBlobs().Any();
where GetDirectoryReference is not understood and there appears to be no direct translation.
Also, the concept of a CloudBlobDirectory does not appear to have made it into Azure.Storage.Blobs e.g.
private static long GetDirectorySize(CloudBlobDirectory directoryBlob) {
long size = 0;
foreach (var blobItem in directoryBlob.ListBlobs()) {
if (blobItem is BlobClient)
size += ((BlobClient) blobItem).GetProperties().Value.ContentLength;
if (blobItem is CloudBlobDirectory)
size += GetDirectorySize((CloudBlobDirectory) blobItem);
}
return size;
}
where CloudBlobDirectory does not appear anywhere in the API.
There's no such thing as a physical directory or folder in Azure Blob Storage. The directories you sometimes see are part of the blob name (e.g. folder1/folder2/file1.txt). The List Blobs request lets you pass a prefix and a delimiter, which tools such as the Azure Portal and Azure Storage Explorer use to present a folder view. For example, prefix folder1/ and delimiter / would show the content as if folder1 had been opened.
That's exactly what happens in your code. GetDirectoryReference() adds a prefix, ListBlobs() fires a request, and Any() checks whether any items were returned.
For V12, the call that lets you do the same is GetBlobsByHierarchy (or its async version). In your particular case, where you only want to know whether any blobs exist in the directory, GetBlobs with a prefix would also suffice.
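To make the prefix and delimiter idea concrete, here is a small in-memory model written in Java. It does not call the Azure SDK at all: the class VirtualDirectorySketch and its methods are hypothetical, and a Map of blob name to size stands in for a container. It only shows why "directory exists" and "directory size" reduce to prefix filters over a flat namespace, and how a delimiter collapses deeper names into folder-like entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

/** In-memory model of a flat blob namespace; hypothetical, not the Azure SDK. */
final class VirtualDirectorySketch {

    /** A "directory" exists if at least one blob name starts with its prefix. */
    static boolean directoryExists(Map<String, Long> blobs, String prefix) {
        return blobs.keySet().stream().anyMatch(name -> name.startsWith(prefix));
    }

    /** A flat prefix listing already covers nested "folders", so no recursion is needed. */
    static long directorySize(Map<String, Long> blobs, String prefix) {
        return blobs.entrySet().stream()
                .filter(e -> e.getKey().startsWith(prefix))
                .mapToLong(Map.Entry::getValue)
                .sum();
    }

    /** Hierarchical listing: anything past the next delimiter collapses into one entry. */
    static TreeSet<String> listByHierarchy(Map<String, Long> blobs, String prefix, String delimiter) {
        TreeSet<String> entries = new TreeSet<>();
        for (String name : blobs.keySet()) {
            if (!name.startsWith(prefix)) {
                continue;
            }
            int cut = name.indexOf(delimiter, prefix.length());
            entries.add(cut < 0 ? name : name.substring(0, cut + delimiter.length()));
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Long> blobs = new LinkedHashMap<>();
        blobs.put("folder1/a.txt", 10L);
        blobs.put("folder1/folder2/file1.txt", 20L);
        blobs.put("other.txt", 5L);

        System.out.println(directoryExists(blobs, "folder1/"));
        System.out.println(directorySize(blobs, "folder1/"));
        System.out.println(listByHierarchy(blobs, "folder1/", "/"));
    }
}
```

Under this model, the recursive GetDirectorySize from the question becomes a single sum over a prefix-filtered listing.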

Cloud Dataflow: Step to read csv file on AWS S3 (TextIO.read) sometimes get stuck

Example code is below.
// Java
// Apache Beam SDK version: 2.16.0
final TupleTag<TableRow> successTag = new TupleTag<TableRow>() {};
final TupleTag<TableRow> deadLetterTag = new TupleTag<TableRow>() {};
Pipeline p = Pipeline.create(dataflowOptions);
PCollection<String> input = p.apply("ReadS3File", TextIO.read().from("s3://sourceBucket/sourceFilename.csv"));
PCollectionTuple outputTuple = input.apply("StringToBigQueryTableRow", ParDo.of(new DoFn<String, TableRow>() { /**/ }).withOutputTags(successTag, TupleTagList.of(deadLetterTag)));
The ReadS3File step sometimes gets stuck.
Following the Dataflow documentation (Common error guidance), I examined the thread dump; the step appears to be stuck at com.amazonaws.internal.SdkFilterInputStream.read.
I've tried to determine the root cause of the issue by analysing the thread dump, but I'm afraid it's not enough. I recommend opening a support case with Google Cloud Platform, because this requires more information from you that shouldn't be shared publicly here.

Is it possible to load a pre-populated database from local resource using sqldelight

I have a relatively large db that may take 1 to 2 minutes to initialise, is it possible to load a pre-populated db when using sqldelight (kotlin multiplatform) instead of initialising the db on app launch?
Yes, but it can be tricky. Not just for "Multiplatform". You need to copy the db to the db folder before trying to init sqldelight. That probably means i/o on the main thread when the app starts.
There is no standard way to do this now. You'll need to put the db file in assets on android and in a bundle on iOS and copy them to their respective folders before initializing sqldelight. Obviously you'll want to check if the db exists first, or have some way of knowing this is your first app run.
If you're planning on shipping updates that will have newer databases, you'll need to manage versions beyond just checking for the existence of the db.
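As a rough illustration of the "copy before init, and track a version" idea, here is a plain JVM sketch in Java using only java.nio.file. Everything in it is an assumption made for illustration: the class name PrepopulatedDbInstaller, the ".version" marker file next to the database, and the integer version scheme. On Android or iOS the bundled file would come from assets or the app bundle rather than from a Path.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: installs a bundled database before the driver is created. */
final class PrepopulatedDbInstaller {

    /**
     * Copies bundledDb to installedDb when the installed file is missing or its
     * recorded version is older than bundledVersion. Returns true if it copied.
     */
    static boolean installIfNeeded(Path bundledDb, Path installedDb, int bundledVersion)
            throws IOException {
        // Assumption: the installed version is kept in a small marker file.
        Path marker = installedDb.resolveSibling(installedDb.getFileName() + ".version");
        if (Files.exists(installedDb) && readVersion(marker) >= bundledVersion) {
            return false;
        }
        Files.createDirectories(installedDb.toAbsolutePath().getParent());
        Files.copy(bundledDb, installedDb, StandardCopyOption.REPLACE_EXISTING);
        Files.write(marker, Integer.toString(bundledVersion).getBytes(StandardCharsets.UTF_8));
        return true;
    }

    /** Returns -1 when no readable version has been recorded yet. */
    private static int readVersion(Path marker) throws IOException {
        if (!Files.exists(marker)) {
            return -1;
        }
        String text = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim();
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Note that replacing the installed file discards anything the user has written to it, so a real app that stores user data in the same database would need a migration step instead of a plain overwrite.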
Although not directly answering your question, 1 to 2 minutes is really, really long for sqlite. What are you doing? I would first make sure you're using transactions properly. 1-2 minutes of inserting data would (probably) result in a huge db file.
Sorry, but I can't add any comments yet, which would be more appropriate...
Although not directly answering your question, 1 to 2 minutes is
really, really long for sqlite. What are you doing? I would first make
sure you're using transactions properly. 1-2 minutes of inserting data
would (probably) result in a huge db file.
Alternatively: the reason I had to use a pre-populated database was the large size of my .sq files (more than 30 MB of INSERT text per table); SQLDelight silently aborted code generation without displaying any error message.
You'll need to put the db file in assets on android and in a bundle on
iOS and copy them to their respective folders before initializing
sqldelight.
Having to load a db from resources on both android and ios feels like a lot
of work, and it means the shared project won't be the only place where the
data is initialised.
The Kotlin Multiplatform library moko-resources solves the problem of keeping a single source for the database in a shared module. In KMM it works the same way for Android and iOS.
Unfortunately, the library's samples barely cover this feature. I added a second method (getDriver) to the expect class DatabaseDriverFactory to open the prepared database, and implemented it on each platform. For example, for androidMain:
actual class DatabaseDriverFactory(private val context: Context) {
    actual fun createDriver(schema: SqlDriver.Schema, fileName: String): SqlDriver {
        return AndroidSqliteDriver(schema, context, fileName)
    }

    actual fun getDriver(schema: SqlDriver.Schema, fileName: String): SqlDriver {
        val database: File = context.getDatabasePath(fileName)
        if (!database.exists()) {
            val inputStream = context.resources.openRawResource(MR.files.dbfile.rawResId)
            val outputStream = FileOutputStream(database.absolutePath)
            inputStream.use { input: InputStream ->
                outputStream.use { output: FileOutputStream ->
                    input.copyTo(output)
                }
            }
        }
        return AndroidSqliteDriver(schema, context, fileName)
    }
}
MR.files.dbfile is the FileResource from the class generated by the library; it corresponds to the name of the file located in the resources/MR/files directory of the commonMain module. Its rawResId property is the platform-side resource ID.
The only thing you need is to specify the path to the DB file using the driver.
Let's assume your DB is at /mnt/my_best_app_dbs/super.db. Pass that path as the name parameter of the driver, something like this:
val sqlDriver: SqlDriver = AndroidSqliteDriver(Schema, context, "/mnt/my_best_app_dbs/super.db")
Keep in mind that you might need to have permissions that allow you to read a given storage type.

Issues uploading/downloading files in akka-http/akka-streams

I'm trying to use akka-streams and akka-http and the alpakka library to download/upload files to Amazon S3. I am seeing two issues which might be related...
I can only download very small files, the largest one 8kb.
I can't upload larger files. It fails with the message
Error during processing of request: 'Substream Source has not been
materialized in 5000 milliseconds'. Completing with 500 Internal
Server Error response. To change default exception handling behavior,
provide a custom ExceptionHandler.
akka.stream.impl.SubscriptionTimeoutException:
Substream Source has not been materialized in 5000 milliseconds
Here are my routes
pathEnd {
post {
fileUpload("attachment") {
case (metadata, byteSource) => {
val writeResult: Future[MultipartUploadResult] = byteSource.runWith(client.multipartUpload("bucketname", key))
onSuccess(writeResult) { result =>
complete(result.location.toString())
}
}
}
}
} ~
path("key" / Segment) {
(sourceSystem, sourceTable, sourceId) =>
get {
val result: Future[ByteString] =
client.download("bucketname", key).runWith(Sink.head)
onSuccess(result) {
complete(_)
}
}
}
Trying to download a file of say 100KB will end up fetching a truncated version of the file usually of size around 16-25Kb
Any help appreciated
Edit: For the download issue, I took Stefano's suggestion and got
[error] found : akka.stream.scaladsl.Source[akka.util.ByteString,akka.NotUsed]
[error] required: akka.http.scaladsl.marshalling.ToResponseMarshallable
This made it work
complete(HttpEntity(ContentTypes.`application/octet-stream`, client.download("bucketname", key)))
1) On the download issue: by calling
val result: Future[ByteString] =
client.download("bucketname", key).runWith(Sink.head)
you materialize the download into a single in-memory ByteString before serving it. Note that Sink.head completes with only the first element the source emits, which would explain the truncated files.
Akka HTTP has streaming support that lets you stream bytes straight from a source, without buffering them all in memory. More info on this can be found in the docs. Practically, this means the complete directive can take a Source[ByteString, _], as in
...
get {
complete(client.download("bucketname", key))
}
2) On the upload issue: you can try to tweak Akka HTTP akka.http.server.parsing.max-content-length setting:
# Default maximum content length which should not be exceeded by incoming request entities.
# Can be changed at runtime (to a higher or lower value) via the `HttpEntity::withSizeLimit` method.
# Note that it is not necessarily a problem to set this to a high value as all stream operations
# are always properly backpressured.
# Nevertheless you might want to apply some limit in order to prevent a single client from consuming
# an excessive amount of server resources.
#
# Set to `infinite` to completely disable entity length checks. (Even then you can still apply one
# programmatically via `withSizeLimit`.)
max-content-length = 8m
Resulting code to test this would be something along the lines of:
withoutSizeLimit {
fileUpload("attachment") {
...
}
}

App Folder files not visible after un-install / re-install

I noticed this in the debug environment where I have to do many re-installs in order to test persistent data storage, initial settings, etc... It may not be relevant in production, but I mention this anyway just to inform other developers.
Any files created by an app in its App Folder are not 'visible' to queries after manual un-install / re-install (from IDE, for instance). The same applies to the 'Encoded DriveID' - it is no longer valid.
It is probably 'by design', but it effectively creates 'orphans' in the app folder until they are manually cleaned via 'drive.google.com > Manage Apps > [yourapp] > Options > Delete hidden app data'. It also creates a problem if an app relies on finding files by metadata, title, etc., since these seem to be gone. As I said, this is not a production problem, but it can cause some frustration during development.
Can any friendly Googlers confirm this? Is there any other way to get to these files after a re-install?
Try this approach:
Use requestSync() in onConnected() as:
@Override
public void onConnected(Bundle connectionHint) {
    super.onConnected(connectionHint);
    Drive.DriveApi.requestSync(getGoogleApiClient()).setResultCallback(syncCallback);
}
Then, in its callback, query the contents of the drive using:
final private ResultCallback<Status> syncCallback = new ResultCallback<Status>() {
    @Override
    public void onResult(@NonNull Status status) {
        if (!status.isSuccess()) {
            showMessage("Problem while retrieving results");
            return;
        }
        query = new Query.Builder()
                .addFilter(Filters.and(Filters.eq(SearchableField.TITLE, "title"),
                        Filters.eq(SearchableField.TRASHED, false)))
                .build();
        Drive.DriveApi.query(getGoogleApiClient(), query)
                .setResultCallback(metadataCallback);
    }
};
Then, in its callback, if found, retrieve the file using:
final private ResultCallback<DriveApi.MetadataBufferResult> metadataCallback =
        new ResultCallback<DriveApi.MetadataBufferResult>() {
    @SuppressLint("SetTextI18n")
    @Override
    public void onResult(@NonNull DriveApi.MetadataBufferResult result) {
        if (!result.getStatus().isSuccess()) {
            showMessage("Problem while retrieving results");
            return;
        }
        MetadataBuffer mdb = result.getMetadataBuffer();
        // Declared outside the loop so it is still in scope for readFromDrive().
        DriveId driveId = null;
        for (Metadata md : mdb) {
            Date createdDate = md.getCreatedDate();
            driveId = md.getDriveId();
        }
        // Only read if the query actually returned a file.
        if (driveId != null) {
            readFromDrive(driveId);
        }
    }
};
Job done!
Hope that helps!
It looks like Google Play services has a problem. (https://stackoverflow.com/a/26541831/2228408)
For testing, you can do it by clearing Google Play services data (Settings > Apps > Google Play services > Manage Space > Clear all data).
Or, at this time, you need to implement it by using Drive SDK v2.
I think you are correct that it is by design.
By inspection I have concluded that until an app places data in the AppFolder, Drive does not sync down to the device, however much you try to hassle it. Therefore it is impossible to check for the existence of AppFolder data placed by another device or by a prior installation. I'd assume that this was to try to create a consistent clean install.
I can see that there are a couple of strategies to work around this:
1) Place dummy data on AppFolder and then sync and recheck.
2) Accept that in the first instance there is the possibility of duplicates: since you cannot access the existing file, by definition you will create a new copy. Then use custom metadata to come up with a scheme to differentiate like-named files and choose which one you want to keep (essentially, implement your own conflict-merge strategy across the two different files).
I've done the second. I have an update number that lets me compare data from different devices and decide which version I want, and so whether to upload, download or leave alone. As my data is an SQLite DB, I also have some code to sync only once updates have settled down. I deliberately treat people updating two devices at once as foolish: the results are consistent, but which device wins is undefined.
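The update-number comparison can be sketched in a few lines of Java. This is a guess at the shape of the logic rather than the answerer's actual code: the class SyncDecisionSketch, the Action enum and the rule that a higher number always wins are all assumptions for illustration.

```java
/** Hypothetical sketch of choosing a sync direction from two update numbers. */
final class SyncDecisionSketch {

    enum Action { UPLOAD, DOWNLOAD, LEAVE_ALONE }

    /** Assumption: the copy with the higher update number is the one to keep. */
    static Action decide(long localUpdateNumber, long remoteUpdateNumber) {
        if (localUpdateNumber > remoteUpdateNumber) {
            return Action.UPLOAD;      // local copy is newer, push it to Drive
        }
        if (localUpdateNumber < remoteUpdateNumber) {
            return Action.DOWNLOAD;    // Drive copy is newer, pull it down
        }
        return Action.LEAVE_ALONE;     // same update number, nothing to do
    }

    public static void main(String[] args) {
        System.out.println(decide(7, 5));
        System.out.println(decide(3, 5));
        System.out.println(decide(5, 5));
    }
}
```

In practice the remote update number would be read from the file's custom metadata, and the comparison would run only after local updates have settled down.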