What is the most efficient way to retrieve oldest blob from container? - azure-storage

Currently I'm getting all blobs in a list from the container and then sorting it based on the LastModified property. With a large number of blobs in the container, this causes performance issues.
IEnumerable<IListBlobItem> blobsList = _sourceBlobStorageClient.BlobContainer.ListBlobs(null, false);
var blobItem = blobsList.Cast<CloudBlockBlob>().OrderBy(s => s.Properties.LastModified);

No, there isn't a better way to achieve this.

According to your description, you are worried about the performance of retrieving the oldest blob from a container when it holds a large number of blobs. As Zhaoxinglu mentioned, there is no better way to achieve this directly. From my experience, you can try the workaround below to improve performance.
Since a blob's name can contain '/' or '\' characters, which many applications that read blobs interpret as folder separators, we can use this convention to improve performance.
The basic idea is as follows:
a) Store each blob in a virtual directory derived from its modified time, formatted as "docs/{year}/{month}";
b) Use a prefix derived from the modified time so that you never have to list all blobs in the container.
Please refer to the following steps to see whether they help.
You could organize your blob names as follows:
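For example, a blob last modified in January 2016 would be uploaded under the "docs/2016/1/" prefix. Here is a minimal sketch using the same classic storage SDK as your code (the file name and content are just placeholders):
// Store each blob under a "docs/{year}/{month}" virtual directory derived from its modified time.
var container = _sourceBlobStorageClient.BlobContainer;
DateTime modified = DateTime.UtcNow;
string blobName = string.Format("docs/{0}/{1}/report.txt", modified.Year, modified.Month);
CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
blob.UploadText("some content");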
When you want to retrieve the oldest blob, you could refer to the following code:
var results = (from blob in container.ListBlobs(prefix: "docs/2016/1/", useFlatBlobListing: true)
               orderby ((CloudBlockBlob)blob).Properties.LastModified
               select blob).FirstOrDefault();
Hope this helps, and please let me know if you have any questions about the above steps.

Hi, thanks for the responses. As Zhaoxing mentioned, there is no better way. I am using a workaround that introduces a staging container: whatever blobs are fetched in the first call, I move to this staging container and process from there. With this approach I save on ListBlobs() calls, so I saw a performance improvement as well.
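For anyone doing the same, the staging workaround looks roughly like this (just a sketch: the staging container name is a placeholder, error handling is omitted, and StartCopy may be called StartCopyFromBlob in older SDK versions):
CloudBlobContainer source = _sourceBlobStorageClient.BlobContainer;
CloudBlobContainer staging = source.ServiceClient.GetContainerReference("staged-blobs");
staging.CreateIfNotExists();

// Single listing call; every blob it returns is moved to the staging container.
foreach (CloudBlockBlob blob in source.ListBlobs(null, true).OfType<CloudBlockBlob>())
{
    CloudBlockBlob copy = staging.GetBlockBlobReference(blob.Name);
    copy.StartCopy(blob);

    // Wait for the server-side copy to complete before deleting the source.
    copy.FetchAttributes();
    while (copy.CopyState.Status == CopyStatus.Pending)
    {
        System.Threading.Thread.Sleep(500);
        copy.FetchAttributes();
    }

    blob.Delete();
}
// Further processing then works against the staging container only, so
// ListBlobs() is not called again on the source container.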

Related

Neo4j: how do I use an index in a relationship between two nodes?

I'm debugging the code of an api and I found a cypher instruction that takes 6 minutes to return the data.
I ran the neo4j code in smaller chunks and found that this snippet is causing the problem: MATCH(copart:CopartOperadora) WHERE NOT (copart)-[:FROM_TO]->(:Coexistence)
I'm new to neo4j so I still haven't figured out how I can optimize this instruction.
Thanks to everyone who contributed.
Optimizations of this kind usually depend on the schema of your graph database; without that it's very hard to provide any insights. But you can try this:
MATCH (copart:CopartOperadora)-[:FROM_TO]->(:Coexistence)
WITH collect(id(copart)) AS connectedNodesIds
MATCH (copart:CopartOperadora) WHERE id(copart) NOT IN connectedNodesIds
RETURN copart
We can't create any index for this, unfortunately. But if the FROM_TO relationship only ever goes from CopartOperadora to Coexistence nodes, then you can drop the Coexistence node label altogether, which will be optimal. Something like this:
MATCH (copart:CopartOperadora) WHERE NOT (copart)-[:FROM_TO]->()
RETURN copart

boto3's atomic test and create?

On normal file systems it is common to follow the pattern of trying to create a file and failing if it already exists, which guarantees that you are creating a unique filename.
How can the same be achieved with S3? If I have many parallel tasks creating keys with random names on S3, how can I "test and write" atomically, so that chance collisions don't create a race and leave me with corrupted data?
Thanks
After a few days of thinking, I believe I have found a very decent solution to my own problem: activate versioning on the bucket and save freely under the key name you want. From the response, take the versionId and encode the object URL in an agreed format (e.g. s3://your-bucket/your-key?versionId=XXXXX). This URL always refers to the object you wanted to save in the first place, with no possibility of clashes/races.
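A minimal sketch of that approach, shown here in C# with the AWS SDK for .NET to match the rest of this page (bucket and key names are placeholders, and versioning is assumed to already be enabled on the bucket; boto3's put_object exposes the same VersionId in its response):
using System;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

class VersionedWrite
{
    static async Task Main()
    {
        var s3 = new AmazonS3Client();

        // Save freely under the key you want; with versioning enabled this PUT
        // always creates its own, immutable version of the object.
        PutObjectResponse response = await s3.PutObjectAsync(new PutObjectRequest
        {
            BucketName = "your-bucket",
            Key = "your-key",
            ContentBody = "payload"
        });

        // A version-qualified URL refers to exactly the object written above,
        // no matter what other tasks later write to the same key.
        Console.WriteLine("s3://your-bucket/your-key?versionId=" + response.VersionId);
    }
}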

camel split big sql result in smaller chunks

Because of memory limitations I need to split a result from the sql component (List<Map<column, value>>) into smaller chunks (a few thousand each).
I know about
from("sql:...").split(body()).streaming().to("...")
and I also know
.split().tokenize("\n", 1000).streaming()
but the latter does not work with List<Map<>> and also returns a String.
Is there an out-of-the-box way to create those chunks? Or do I need to add a custom aggregator right after the split? Or is there another way?
Edit
Additional info as requested by soilworker:
At the moment the sql endpoint is configured this way:
SqlEndpoint endpoint = context.getEndpoint("sql:select * from " + lookupTableName + "?dataSource=" + LOOK_UP_DS,
SqlEndpoint.class);
// returns complete result in one list instead of one exchange per line.
endpoint.getConsumerProperties().put("useIterator", false);
// poll interval
endpoint.getConsumerProperties().put("delay", LOOKUP_POLL_INTERVAL);
The route using this should poll once a day (we will add CronScheduledRoutePolicy soon) and fetch a complete table (view). All the data is converted to csv with a custom processor and sent via a custom component to proprietary software. The table has 5 columns (small strings) and around 20M entries.
I don't know if there is a memory issue, but I know that on my local machine 3GB isn't enough. Is there a way to approximate the memory footprint, to know whether a certain amount of RAM would be enough?
Thanks in advance
Setting maxMessagesPerPoll on the sql endpoint will help you get the result in batches.
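For example, as an option on the sql endpoint URI (table and datasource names here are placeholders; this assumes the default useIterator=true, so that each row becomes its own exchange):
sql:select * from myTable?dataSource=myDataSource&maxMessagesPerPoll=1000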

Call RESTful service in Pig script

I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered. Capture data, send it off and the script doesn't need anything back.
I'm assuming that a UDF is required for this kind of functionality, but I'm new enough to Pig that I don't have a clear picture of what type of function I should build. My best guess would be a Store Function since the data is ultimately getting stored somewhere, but I feel like the amount of guesswork involved in coming to that conclusion is higher than I'd like.
Any insight or guidance would be much appreciated.
Have you had a look at DBStorage, which does something similar?
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
...
STORE ordered INTO RestStorage('https://...');
Having never found even a hint of an answer to this, I decided to move in a different direction. I'm using Pig to load and parse the large file, but then streaming each record that I care about to PHP for additional processing that Pig doesn't seem to have the capability to handle cleanly.
It's still not complete (read: there's a great big, very unhappy bug in the mix), but I think the concept is solid--just need to work out the implementation details.
everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
-- apply filter
-- apply filter
-- ...
-- apply last filter
ordered = ORDER filtered_categories BY category;
streamed = STREAM limited THROUGH `php -nF process_categories.php`;
DUMP streamed;

Fastest way to query for object existence in NHibernate

I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple: assume a directory tool which scans the current hard drive. When a directory is found, it should either be created or, if already present, updated.
First let's focus only on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
    var result = DatabaseController.Session
        .CreateCriteria(typeof(DatabaseDirectory))
        .Add(Restrictions.Eq("FullName", dI.FullName))
        .List<DatabaseDirectory>().FirstOrDefault();

    if (result == null)
    {
        result = new DatabaseDirectory
        {
            CreationTime = dI.CreationTime,
            Existing = dI.Exists,
            Extension = dI.Extension,
            FullName = dI.FullName,
            LastAccessTime = dI.LastAccessTime,
            LastWriteTime = dI.LastWriteTime,
            Name = dI.Name
        };
    }

    return result;
}
Is this the way to go regarding:
Speed
Separation of Concerns
What comes to mind is the following: a scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan and look them up this way. On the other hand, this may not be suitable for large sets of data, like files (which will be 600,000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (other reasons). So the real question is: what is the fastest way to check for the existence of an object where the ID is not known, just a property of that object?
And batching might be possible, by selecting a big group with something like "starts with C:\Testfiles\", but the problem then remains: how do I know in advance how big this set will be? I can't select "max 1000" and check in this buffered dictionary, because I might "hit next to the searched dir"... I hope this problem is clear. The most important question is: does buffering really affect performance this much? If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which would be OK even with 1,000,000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
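For example (a sketch; id here is the already-known identifier of the entity):
// Get returns null if no row with that id exists (Load would instead give a proxy that fails later).
var dir = session.Get<DatabaseDirectory>(id);
bool exists = dir != null;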
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
// 'directories' maps each scanned FullName to its DirectoryInfo.
Dictionary<string, DirectoryInfo> directories;

var existing = session.CreateQuery(
        "select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List<DatabaseDirectoryDTO>();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.
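A rough sketch of what such a repository method could look like (hypothetical class and method names; it batches the IN-list lookups and, for brevity, selects just FullName instead of a DTO):
using System.Collections.Generic;
using System.IO;
using System.Linq;
using NHibernate;

public class DirectoryRepository
{
    private readonly ISession _session;

    public DirectoryRepository(ISession session)
    {
        _session = session;
    }

    // Takes the directories found by the file system scan, keyed by FullName.
    public void SyncDirectories(IDictionary<string, DirectoryInfo> scanned)
    {
        using (var tx = _session.BeginTransaction())
        {
            // Look up which paths already exist, in batches so the IN (...) list stays small.
            var existing = new HashSet<string>();
            var keys = scanned.Keys.ToList();
            for (int i = 0; i < keys.Count; i += 500)
            {
                var batch = keys.Skip(i).Take(500).ToList();
                var found = _session.CreateQuery(
                        "select dd.FullName from DatabaseDirectory dd where dd.FullName in (:paths)")
                    .SetParameterList("paths", batch)
                    .List<string>();
                foreach (var path in found)
                    existing.Add(path);
            }

            foreach (var pair in scanned)
            {
                if (existing.Contains(pair.Key))
                    continue; // already in the database; update logic would go here

                var dI = pair.Value;
                _session.Save(new DatabaseDirectory
                {
                    CreationTime = dI.CreationTime,
                    Existing = dI.Exists,
                    Extension = dI.Extension,
                    FullName = dI.FullName,
                    LastAccessTime = dI.LastAccessTime,
                    LastWriteTime = dI.LastWriteTime,
                    Name = dI.Name
                });
            }

            tx.Commit();
        }
    }
}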