Migrating from Microsoft.Azure.Storage.Blob to Azure.Storage.Blobs - directory concepts missing - azure-storage

These are great guides for migrating between the different versions of NuGet package:
https://github.com/Azure/azure-sdk-for-net/blob/Azure.Storage.Blobs_12.6.0/sdk/storage/Azure.Storage.Blobs/README.md
https://elcamino.cloud/articles/2020-03-30-azure-storage-blobs-net-sdk-v12-upgrade-guide-and-tips.html
However I am struggling to migrate the following concepts in my code:
// Return if a directory exists:
container.GetDirectoryReference(path).ListBlobs().Any();
where GetDirectoryReference is not understood and there appears to be no direct translation.
Also, the concept of a CloudBlobDirectory does not appear to have made it into Azure.Storage.Blobs e.g.
private static long GetDirectorySize(CloudBlobDirectory directoryBlob) {
long size = 0;
foreach (var blobItem in directoryBlob.ListBlobs()) {
if (blobItem is BlobClient)
size += ((BlobClient) blobItem).GetProperties().Value.ContentLength;
if (blobItem is CloudBlobDirectory)
size += GetDirectorySize((CloudBlobDirectory) blobItem);
}
return size;
}
where CloudBlobDirectory does not appear anywhere in the API.

There's no such thing as physical directories or folders in Azure Blob Storage. The directories you sometimes see are part of the blob (e.g. folder1/folder2/file1.txt). The List Blobs requests allows you to add a prefix and delimiter in a call, which are used by the Azure Portal and Azure Data Explorer to create a visualization of folders. As example prefix folder1/ and delimiter / would allow you to see the content as if folder1 was opened.
That's exactly what happens in your code. The GetDirectoryReference() adds a prefix. The ListBlobs() fires a request and Any() checks if any items return.
For V12 the command that'll allow you to do the same would be GetBlobsByHierarchy and its async version. In your particular case where you only want to know if any blobs exist in the directory a GetBlobs with prefix would also suffice.

Related

Is it possible to load a pre-populated database from local resource using sqldelight

I have a relatively large db that may take 1 to 2 minutes to initialise, is it possible to load a pre-populated db when using sqldelight (kotlin multiplatform) instead of initialising the db on app launch?
Yes, but it can be tricky. Not just for "Multiplatform". You need to copy the db to the db folder before trying to init sqldelight. That probably means i/o on the main thread when the app starts.
There is no standard way to do this now. You'll need to put the db file in assets on android and in a bundle on iOS and copy them to their respective folders before initializing sqldelight. Obviously you'll want to check if the db exists first, or have some way of knowing this is your first app run.
If you're planning on shipping updates that will have newer databases, you'll need to manage versions outside of just a check for the existance of the db.
Although not directly answering your question, 1 to 2 minutes is really, really long for sqlite. What are you doing? I would first make sure you're using transactions properly. 1-2 minutes of inserting data would (probably) result in a huge db file.
Sorry, but I can't add any comments yet, which would be more appropriate...
Although not directly answering your question, 1 to 2 minutes is
really, really long for sqlite. What are you doing? I would first make
sure you're using transactions properly. 1-2 minutes of inserting data
would (probably) result in a huge db file.
Alternatively, my problem due to which I had to use a pre-populated database was associated with the large size of .sq files (more than 30 MB text of INSERTs per table), and SqlDeLight silently interrupted the generation, without displaying error messages.
You'll need to put the db file in assets on android and in a bundle on
iOS and copy them to their respective folders before initializing
sqldelight.
Having to load a db from resources on both android and ios feels a lot
of work + it means the shared project wont be the only place where the
data is initialised.
Kotlin MultiPlatform library Moko-resources solves the issue of a single source for a database in a shared module. It works for KMM the same way for Android and iOS.
Unfortunately, using this feature are almost not presented in the samples of library. I added a second method (getDriver) to the expected class DatabaseDriverFactory to open the prepared database, and implemented it on the platform. For example, for androidMain:
actual class DatabaseDriverFactory(private val context: Context) {
actual fun createDriver(schema: SqlDriver.Schema, fileName: String): SqlDriver {
return AndroidSqliteDriver(schema, context, fileName)
}
actual fun getDriver(schema: SqlDriver.Schema, fileName: String): SqlDriver {
val database: File = context.getDatabasePath(fileName)
if (!database.exists()) {
val inputStream = context.resources.openRawResource(MR.files.dbfile.rawResId)
val outputStream = FileOutputStream(database.absolutePath)
inputStream.use { input: InputStream ->
outputStream.use { output: FileOutputStream ->
input.copyTo(output)
}
}
}
return AndroidSqliteDriver(schema, context, fileName)
}
}
MR.files.fullDb is the FileResource from the class generated by the library, it is associated with the name of the file located in the resources/MR/files directory of the commonMain module. It property rawResId represents the platform-side resource ID.
The only thing you need is to specify the path to the DB file using the driver.
Let's assume your DB lies in /mnt/my_best_app_dbs/super.db. Now, pass the path in the name property of the Driver. Something like this:
val sqlDriver: SqlDriver = AndroidSqliteDriver(Schema, context, "/mnt/my_best_app_dbs/best.db")
Keep in mind that you might need to have permissions that allow you to read a given storage type.

Spring Cloud Server serving multiple property files for the same application

Lets say I have applicationA that has 3 property files:
-> applicationA
- datasource.properties
- security.properties
- jms.properties
How do I move all properties to a spring cloud config server and keep them separate?
As of today I have configured the config server that will only read ONE property file as this seems to be the standard way. This file the config server picks up seems to be resolved by using the spring.application.name. In my case it will only read ONE file with this name:
-> applicationA.properties
How can I add the other files to be resolved by the config server?
Not possible in the way how you requested. Spring Cloud Config Server uses NativeEnvironmentRepository which is:
Simple implementation of {#link EnvironmentRepository} that uses a SpringApplication and configuration files located through the normal protocols. The resulting Environment is composed of property sources located using the application name as the config file stem (spring.config.name) and the environment name as a Spring profile.
See: https://github.com/spring-cloud/spring-cloud-config/blob/master/spring-cloud-config-server/src/main/java/org/springframework/cloud/config/server/environment/NativeEnvironmentRepository.java
So basically every time when client request properties from Config Server it creates ConfigurableApplicationContext using SpringApplicationBuilder. And it is launched with next configuration property:
String config = application;
if (!config.startsWith("application")) {
config = "application," + config;
}
list.add("--spring.config.name=" + config);
So possible names for property files will be only application.properties(or .yml) and config client application name that is requesting configuration - in your case applicationA.properties.
But you can "cheat".
In config server configuration you can add such property
spring:
cloud:
config:
server:
git:
search-paths: '{application}, {application}/your-subdirectory'
In this case Config Server will search for same property file names but in few directories and you can use subdirectories to keep your properties separate.
So with configuration above you will be able to load configuration from:
applicationA/application.properies
applicationA/your-subdirectory/application.properies
This can be done.
You need to create your own EnvironmentRepository, which loads your property files.
org.springframework.cloud.config.server.support.AbstractScmAccessor#getSearchLocations
searches for the property files to load :
for (String prof : profiles) {
for (String app : apps) {
String value = location;
if (app != null) {
value = value.replace("{application}", app);
}
if (prof != null) {
value = value.replace("{profile}", prof);
}
if (label != null) {
value = value.replace("{label}", label);
}
if (!value.endsWith("/")) {
value = value + "/";
}
output.addAll(matchingDirectories(dir, value));
}
}
There you could add custom code, that reads the required property files.
The above code matches exactly the behaviour described in the spring docs.
The NativeEnvironmentRepository does NOT access GIT/SCM in any way, so you should use
JGitEnvironmentRepository as base for your own implementation.
As #nmyk pointed out, NativeEnvironmentRepository boots a mini app in order to collect the properties by providing it with - sort of speak - "hardcoded" {appname}.* and application.* supported property file names. (#Stefan Isele - prefabware.com JGitEnvironmentRepository ends up using NativeEnvironmentRepository as well, for that matter).
I have issued a pull request for spring-cloud-config-server 1.4.x, that supports defining additional file names, through a spring.cloud.config.server.searchNames environment property, in the same sense one can do for a single springboot app, as defined in the Externalized Configuration.Application Property Files section of the documentation, using the spring.config.name enviroment property. I hope they review it soon, since it seems many have asked about this feature in stack overflow, and surely many many more search for it and read the currently advised solutions.
It worths mentioning that many ppl advise "abusing" the profile feature to achieve this, which is a bad practice, in my humble opinion, as I describe in this answer

Running Google Dataflow pipeline from a Google App Engine app?

I am creating a dataflow job using DataflowPipelineRunner. I tried the following scenarios.
Without specifying any machineType
With g1 small machine
with n1-highmem-2
In all the above scenarios, Input is a file from GCS which is very small file(KB size) and output is Big Query table.
I got Out of memory error in all the scenarios
The size of my compiled code is 94mb. I am trying only word count example and it did not read any input(It fails before the job starts). Please help me understand why i am getting this error.
Note: I am using appengine to start the job.
Note: The same code works with beta versoin 0.4.150414
EDIT 1
As per the suggestions in the answer tried the following,
Switched from Automatic scaling to Basic Scaling.
Used machine type B2 which provides 256MB memory
After these configuration, Java Heap Memory problem is solved. But it is trying to upload a jar into stagging location which is more than 10Mb, hence it fails.
It logs the following exception
com.google.api.client.http.HttpRequest execute: exception thrown while executing request
com.google.appengine.api.urlfetch.RequestPayloadTooLargeException: The request to https://www.googleapis.com/upload/storage/v1/b/pwccloudedw-stagging-bucket/o?name=appengine-api-L4wtoWwoElWmstI1Ia93cg.jar&uploadType=resumable&upload_id=AEnB2Uo6HCfw6Usa3aXlcOzg0g3RawrvuAxWuOUtQxwQdxoyA0cf22LKqno0Gu-hjKGLqXIo8MF2FHR63zTxrSmQ9Yk9HdCdZQ exceeded the 10 MiB limit.
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:157)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:45)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:543)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:422)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:275)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
at java.util.concurrent.FutureTask.run(FutureTask.java:260)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1168)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:605)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1$1.run(ApiProxyImpl.java:1152)
at java.security.AccessController.doPrivileged(Native Method)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$1.run(ApiProxyImpl.java:1146)
at java.lang.Thread.run(Thread.java:745)
at com.google.apphosting.runtime.ApiProxyImpl$CurrentRequestThreadFactory$2$1.run(ApiProxyImpl.java:1195)
I tried directly uploading the jar file - appengine-api-1.0-sdk-1.9.20.jar, still it tries to upload this jar appengine-api-L4wtoWwoElWmstI1Ia93cg.jar.
which i dont know what jar it is. Any idea on what this jar is appreciated.
Please help me to fix this issue.
The short answer is that if you use AppEngine on a Managed VM you will not encounter the AppEngine sandbox limits (OOM when using a F1 or B1 instance class, execution time limit issues, whitelisted JRE classes). If you really want to run within the App Engine sandbox, then your use of the Dataflow SDK most conform to the limits of the AppEngine sandbox. Below I explain common issues and what people have done to conform to the AppEngine sandbox limits.
The Dataflow SDK requires an AppEngine instance class which has enough memory to execute the users application to construct the pipeline, stage any resources, and send the job description to the Dataflow service. Typically we have seen users require using an instance class with more than 128mb of memory to not see OOM errors.
Generally constructing a pipeline and submitting it to the Dataflow service typically takes less than a couple of seconds if the required resources for your application are already staged. Uploading your JARs and any other resources to GCS can take longer than 60 seconds. This can be solved manually by pre-staging your JARs to GCS beforehand (the Dataflow SDK will skip staging them again if it detects they are already there) or using a task queue to get a 10 minute limit (note that for large applications, 10 mins may not be enough to stage all your resources).
Finally, within the AppEngine sandbox environment, you and all your dependencies are limited to using only whitelisted classes within the JRE or you'll get an exception like:
java.lang.SecurityException:
java.lang.IllegalAccessException: YYY is not allowed on ZZZ
...
EDIT 1
We perform a hash of the contents of the jars on the classpath and upload them to GCS with a modified filename. AppEngine runs a sandboxed environment with its own JARs, appengine-api-L4wtoWwoElWmstI1Ia93cg.jar refers to appengine-api.jar which is a jar that the sandboxed environment adds. You can see from our PackageUtil#getUniqueContentName(...) that we just append -$HASH before .jar.
We are working to solve why you are seeing the RequestPayloadToLarge excepton and it is currently recommended that you set the filesToStage option and filter out the jars not required to execute your Dataflow to get around the issue that you face. You can see how we build the files to stage with DataflowPipelineRunner#detectClassPathResourcesToStage(...).
I had the same problem with the 10MB limit. What I did was filtering out the JAR files bigger than that limit (instead of specific files), and then set the renaming files in the DataflowPipelineOptions with setFilesToStage.
So I just copied the method detectClassPathResourcesToStage from the Dataflow SDK and changed it sightly:
private static final long FILE_BYTES_THRESHOLD = 10 * 1024 * 1024; // 10 MB
protected static List<String> detectClassPathResourcesToStage(ClassLoader classLoader) {
if (!(classLoader instanceof URLClassLoader)) {
String message = String.format("Unable to use ClassLoader to detect classpath elements. "
+ "Current ClassLoader is %s, only URLClassLoaders are supported.", classLoader);
throw new IllegalArgumentException(message);
}
List<String> files = new ArrayList<>();
for (URL url : ((URLClassLoader) classLoader).getURLs()) {
try {
File file = new File(url.toURI());
if (file.length() < FILE_BYTES_THRESHOLD) {
files.add(file.getAbsolutePath());
}
} catch (IllegalArgumentException | URISyntaxException e) {
String message = String.format("Unable to convert url (%s) to file.", url);
throw new IllegalArgumentException(message, e);
}
}
return files;
}
And then when I'm creating the DataflowPipelineOptions:
DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
dataflowOptions.setFilesToStage(detectClassPathResourcesToStage(DataflowPipelineRunner.class.getClassLoader()));
Here's a version of Helder's 10MB-filtering solution that will adapt to the default file-staging behavior of DataflowPipelineOptions even if it changes in a future version of the SDK.
Instead of duplicating the logic, it passes a throwaway copy of the DataflowPipelineOptions to DataflowPipelineRunner to see which files it would have staged, then removes any that are too big.
Note that this code assumes that you've defined a custom PipelineOptions class named MyOptions, along with a java.util.Logger field named logger.
// The largest file size that can be staged to the dataflow service.
private static final long MAX_STAGED_FILE_SIZE_BYTES = 10 * 1024 * 1024;
/**
* Returns the list of .jar/etc files to stage based on the
* Options, filtering out any files that are too large for
* DataflowPipelineRunner.
*
* <p>If this accidentally filters out a necessary file, it should
* be obvious when the pipeline fails with a runtime link error.
*/
private static ImmutableList<String> getFilesToStage(MyOptions options) {
// Construct a throw-away runner with a copy of the Options to see
// which files it would have wanted to stage. This could be an
// explicitly-specified list of files from the MyOptions param, or
// the default list of files determined by DataflowPipelineRunner.
List<String> baseFiles;
{
DataflowPipelineOptions tmpOptions =
options.cloneAs(DataflowPipelineOptions.class);
// Ignore the result; we only care about how fromOptions()
// modifies its parameter.
DataflowPipelineRunner.fromOptions(tmpOptions);
baseFiles = tmpOptions.getFilesToStage();
// Some value should have been set.
Preconditions.checkNotNull(baseFiles);
}
// Filter out any files that are too large to stage.
ImmutableList.Builder<String> filteredFiles = ImmutableList.builder();
for (String file : baseFiles) {
long size = new File(file).length();
if (size < MAX_STAGED_FILE_SIZE_BYTES) {
filteredFiles.add(file);
} else {
logger.info("Not staging large file " + file + ": length " + size
+ " >= max length " + MAX_STAGED_FILE_SIZE_BYTES);
}
}
return filteredFiles.build();
}
/** Runs the processing pipeline with given options. */
public void runPipeline(MyOptions options)
throws IOException, InterruptedException {
// DataflowPipelineRunner can't stage large files;
// remove any from the list.
DataflowPipelineOptions dpOpts =
options.as(DataflowPipelineOptions.class);
dpOpts.setFilesToStage(getFilesToStage(options));
// Run the pipeline as usual using "options".
// ...
}

Hadoop DistributedCache caching files without absolute path?

I am in the process of migrating to YARN and it seems the behavior of the DistributedCache changed.
Previously, I would add some files to the cache as follows:
for (String file : args) {
Path path = new Path(cache_root, file);
URI uri = new URI(path.toUri().toString());
DistributedCache.addCacheFile(uri, conf);
}
The path would typically look like
/some/path/to/my/file.txt
Which pre-exists on HDFS and would essentially end up in the DistributedCache as
/$DISTRO_CACHE/some/path/to/my/file.txt
I could symlink to it in my current working directory and use with DistributedCache.getLocalCacheFiles()
With YARN, it seems this file instead ends up in the cache as:
/$DISTRO_CACHE/file.txt
ie, the 'path' part of the file URI got dropped and only the filename remains.
How does with work with different absolute paths ending up with the same filename? Consider the following case:
DistributedCache.addCacheFile("some/path/to/file.txt", conf);
DistributedCache.addCacheFile("some/other/path/to/file.txt", conf);
Arguably someone could use fragments:
DistributedCache.addCacheFile("some/path/to/file.txt#file1", conf);
DistributedCache.addCacheFile("some/other/path/to/file.txt#file2", conf);
But this seems unnecessarily harder to manage. Imagine the scenario where those are command-line arguments, you somehow need to manage that those 2 filenames, although different absolute paths would definitely clash in the DistributedCache and therefore need to re-map these filenames to fragments and propagate as such to the rest of the program?
Is there an easier way to manage this?
Try to add files into Job
It's most likely how you're actually configuring the job and then accessing them in the Mapper.
When you're setting up the job you're going to do something like
job.addCacheFile(new Path("cache/file1.txt").toUri());
job.addCacheFile(new Path("cache/file2.txt").toUri());
Then in your mapper code the urls are going to be stored in an array which can be accessed like so.
URI file1Uri = context.getCacheFiles()[0];
URI file2Uri = context.getCacheFiles()[1];
Hope this could help you.

App Folder files not visible after un-install / re-install

I noticed this in the debug environment where I have to do many re-installs in order to test persistent data storage, initial settings, etc... It may not be relevant in production, but I mention this anyway just to inform other developers.
Any files created by an app in its App Folder are not 'visible' to queries after manual un-install / re-install (from IDE, for instance). The same applies to the 'Encoded DriveID' - it is no longer valid.
It is probably 'by design' but it effectively creates 'orphans' in the app folder until manually cleaned by 'drive.google.com > Manage Apps > [yourapp] > Options > Delete hidden app data'. It also creates problem if an app relies on finding of files by metadata, title, ... since these seem to be gone. As I said, not a production problem, but it can create some frustration during development.
Can any of friendly Googlers confirm this? Is there any other way to get to these files after re-install?
Try this approach:
Use requestSync() in onConnected() as:
#Override
public void onConnected(Bundle connectionHint) {
super.onConnected(connectionHint);
Drive.DriveApi.requestSync(getGoogleApiClient()).setResultCallback(syncCallback);
}
Then, in its callback, query the contents of the drive using:
final private ResultCallback<Status> syncCallback = new ResultCallback<Status>() {
#Override
public void onResult(#NonNull Status status) {
if (!status.isSuccess()) {
showMessage("Problem while retrieving results");
return;
}
query = new Query.Builder()
.addFilter(Filters.and(Filters.eq(SearchableField.TITLE, "title"),
Filters.eq(SearchableField.TRASHED, false)))
.build();
Drive.DriveApi.query(getGoogleApiClient(), query)
.setResultCallback(metadataCallback);
}
};
Then, in its callback, if found, retrieve the file using:
final private ResultCallback<DriveApi.MetadataBufferResult> metadataCallback =
new ResultCallback<DriveApi.MetadataBufferResult>() {
#SuppressLint("SetTextI18n")
#Override
public void onResult(#NonNull DriveApi.MetadataBufferResult result) {
if (!result.getStatus().isSuccess()) {
showMessage("Problem while retrieving results");
return;
}
MetadataBuffer mdb = result.getMetadataBuffer();
for (Metadata md : mdb) {
Date createdDate = md.getCreatedDate();
DriveId driveId = md.getDriveId();
}
readFromDrive(driveId);
}
};
Job done!
Hope that helps!
It looks like Google Play services has a problem. (https://stackoverflow.com/a/26541831/2228408)
For testing, you can do it by clearing Google Play services data (Settings > Apps > Google Play services > Manage Space > Clear all data).
Or, at this time, you need to implement it by using Drive SDK v2.
I think you are correct that it is by design.
By inspection I have concluded that until an app places data in the AppFolder folder, Drive does not sync down to the device however much to try and hassle it. Therefore it is impossible to check for the existence of AppFolder placed by another device, or a prior implementation. I'd assume that this was to try and create a consistent clean install.
I can see that there are a couple of strategies to work around this:
1) Place dummy data on AppFolder and then sync and recheck.
2) Accept that in the first instance there is the possibility of duplicates, as you cannot access the existing file by definition you will create a new copy, and use custom metadata to come up with a scheme to differentiate like-named files and choose which one you want to keep (essentially implement your conflict merge strategy across the two different files).
I've done the second, I have an update number to compare data from different devices and decide which version I want so decide whether to upload, download or leave alone. As my data is an SQLite DB I also have some code to only sync once updates have settled down and I deliberately consider people updating two devices at once foolish and the results are consistent but undefined as to which will win.