passing large data to Titanium.Utils.md5HexDigest - titanium

I am trying to calculate an MD5 hash for large files (about 60MB or more). The device, a Nexus 7 with 1GB RAM and 16GB of storage, is not able to allocate anything more than 30MB. The code fails with a java.lang.OutOfMemoryError.
I can't find any way to feed data piecemeal to Titanium.Utils.md5HexDigest(); it needs the whole data at once.
Is there any way I can work around this problem?
I have searched the Marketplace for products that would help me do this, but I haven't found any.

You mentioned it was to determine whether or not to download it again, so it comes from a server somewhere.
Instead of recalculating the MD5, you should have already stored that in the app when downloading the file in the first place. So just compare the stored MD5 Hash with the one on the server. This saves you a lot of trouble, and actually doesn't require you to recalculate. It also speeds up the app tremendously.
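To illustrate the idea of hashing once at download time, here is a minimal Python sketch (not Titanium code; the question states that Titanium.Utils.md5HexDigest() has no incremental form). It assumes the download arrives as a sequence of byte chunks, and the names save_and_hash and needs_download are hypothetical helpers invented for this example.

```python
import hashlib
import io


def save_and_hash(chunks, out_file):
    """Write each chunk to out_file and return the MD5 hex digest of all chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        out_file.write(chunk)
        digest.update(chunk)  # only the current chunk has to be in memory
    return digest.hexdigest()


def needs_download(stored_md5, server_md5):
    """True when no hash is stored yet or the server reports a different one."""
    return stored_md5 is None or stored_md5.lower() != server_md5.lower()


if __name__ == "__main__":
    # Hypothetical download delivered as three chunks.
    chunks = [b"first ", b"second ", b"third"]
    buffer = io.BytesIO()
    stored = save_and_hash(chunks, buffer)
    # The incremental digest matches hashing the whole payload at once.
    print(stored == hashlib.md5(b"first second third").hexdigest())
    print(needs_download(stored, stored.upper()))
```

The stored digest would be saved next to the file, so a later check only compares two short strings and never reads the 60MB file again.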

Related

A persistent simple data storage for Node.JS app implementation?

I'm planning to launch a simple Node.JS utility and push it to heroku. A fire and forget solution, will sleep for like 90% of the time probably. Unfortunately it seems that I require a persistent data storage for my purposes (heroku apps get rebooted daily and storing everything in RAM is unrealistic), and I don't know which way to look as:
Most SQL hostings are paid / limited time free / require constant refreshing ( like freemysqlhosting ).
Storing stuff in plain .txt format is seemingly hard to implement, besides git always overwrites the contents of a tracked .txt file, and leaving it untracked disposes of it on heroku and leads to ENOENT No such file error. Yeah, I tried.
So, the question is - how do I implement a simple and built in solution for storing data? Are there any relevant typical solutions? It's going to be equivalent to just 1 SQL table.
As you can see, you can answer this on many levels: maybe suggest a free deploy-and-forget SQL hosting (it obviously has to support external connections), maybe tell me how to keep a file tracked in git without actually replacing all of its content with every commit, or maybe suggest some module to install. I hope this is not too broad.

What's a best approach to create a filestore

This is an open-ended question. I have a noob's understanding of databases but am willing to learn whatever is required, though I believe my problem could be solved without learning a lot.
So, here goes the question:
I have a large number of files generated in my projects (depending on the builds). I need to archive them and also reproduce them by buildNumber if requested by users. I don't expect many of these requests, maybe 1-2 a day.
For example: 16GB of data per build every week. Most of the files in weekly builds are duplicates, and I don't want to archive them again and again. I prefer to store them only once. There is one caveat: a file's relative location can change even though its content hasn't.
My approach is as follows: create a hash from each file, then store a key-value pair of fileHash and the actual file. Record this information in some kind of manifest file for each build, so that I can recreate a build with the correct files, paths, etc.
Can it ever happen that 2 different files have the same hash? Can some database help to do this efficiently? I am currently thinking of dumping all files in one folder.
Thanks
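The approach described above (hash each file, store it once under its hash, keep a per-build manifest of paths) can be sketched in Python as follows. This is only an illustration under assumptions: SHA-256 is chosen here as the hash, the manifest is a JSON file, and archive_build and restore_build are hypothetical names. With a cryptographic hash such as SHA-256, two different files sharing a hash is theoretically possible but extremely unlikely in practice.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_build(build_dir, store_dir, manifest_path):
    """Copy each new file into store_dir under its hash; record path -> hash."""
    build_dir, store_dir = Path(build_dir), Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for path in sorted(p for p in build_dir.rglob("*") if p.is_file()):
        file_hash = file_sha256(path)
        blob = store_dir / file_hash
        if not blob.exists():  # duplicate content is stored only once
            shutil.copyfile(path, blob)
        manifest[path.relative_to(build_dir).as_posix()] = file_hash
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def restore_build(manifest_path, store_dir, target_dir):
    """Recreate a build tree from its manifest and the shared store."""
    manifest = json.loads(Path(manifest_path).read_text())
    for rel_path, file_hash in manifest.items():
        dest = Path(target_dir) / rel_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(Path(store_dir) / file_hash, dest)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    build = root / "build42"
    build.mkdir()
    (build / "app.bin").write_text("payload")
    archive_build(build, root / "store", root / "build42.json")
    restore_build(root / "build42.json", root / "store", root / "restored")
    print((root / "restored" / "app.bin").read_text())
```

Because the manifest maps relative paths to hashes, a file that moves between builds without changing content costs only a new manifest entry, not a second copy.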

Extracting Data from a VERY old unix machine

Firstly apologies if this question seems like a wall of text, I can't think of a way to format it.
I have a machine with valuable data on it (circa 1995). The machine is running Unix (SCO OpenServer 6) with some sort of database stored on it.
The data is normally accessed via a software package of which the license has expired and the developers are no longer trading.
The software package connects to the machine via telnet to retrieve data and modify data (the telnet connection no longer functions due to the license being changed).
I can access the machine via an ODBC driver (SeaODBC.dll) over a network, which is how I was planning to extract the data. So far I have retrieved 300,000 rows in just over 24 hours. I estimate there are around 50,000,000 rows in total, so at the current speed it will take about 6 months!
I need either a quicker way to extract the data from the machine via ODBC or a way to extract the entire DB locally on the machine to an external drive/network drive or other external source.
I've played around with the Unix interface and the only large files I can find are in a massive matrix of single-character folders (e.g. A\G\data.dat, A\H\Data.dat, etc.).
Does anyone know how to find out the installed DB systems on the machine? Hopefully it is a standard and I'll be able to find a way to export everything into a nicely formatted file.
Edit
Digging around the file system I have found a folder under root > L which contains lots of single lettered folders, each single lettered folder contains more single letter folders.
There are also files which are named after the table I need (eg "ooi.r") which have the following format:
<Id>
[]
l for ooi_lno, lc for ooi_lcno, s for ooi_invno, id for ooi_indate
require l="AB"
require ls="SO"
require id=25/04/1998
{<id>} is s
sort increasing Id
I do not recognize those kinds of filenames, A\G\data.dat and so on (filenames with backslashes in them???), and it's likely to be a proprietary format, so I wouldn't expect much from that avenue. You can try running the file command on these to see if they are in any recognized format.
I would suggest improving the speed of data extraction over ODBC by virtualizing the system. A modern computer will have faster memory, faster disks, and a faster CPU and may be able to extract the data a lot more quickly. You will have to extract a disk image from the old system in order to virtualize it, but hopefully a single sequential pass at reading everything off its disk won't be too slow.
I don't know what the architecture of this system is, but I guess it is x86, which means it might be not too hard to virtualize (depending on how well the SCO OpenServer 6 OS agrees with the virtualization). You will have to use a hypervisor that supports full virtualization (not paravirtualization).
I finally solved the problem: running the query using another tool (not through MS Access or MS Excel) was massively faster. I ended up using DaFT (Database Fishing Tool) to SELECT INTO a text file, which processed all 50 million rows in a few hours.
It seems the DLL driver I was using doesn't work well with any MS products.
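For readers without DaFT, the general idea of streaming a query result straight into a text file can be sketched in Python. This is not what DaFT does internally. It assumes only a Python DB-API cursor; sqlite3 stands in for the real database here, and with the old machine the cursor would have to come from an ODBC library instead. export_query is a hypothetical helper, and the column names are borrowed from the ooi.r listing above.

```python
import csv
import io
import sqlite3


def export_query(cursor, sql, out_file, batch_size=10000):
    """Run sql and write the rows to out_file as tab-separated text, in batches."""
    cursor.execute(sql)
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)  # never holds more than one batch
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


if __name__ == "__main__":
    # In-memory stand-in table; the real source would be the ODBC data source.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ooi (ooi_lno TEXT, ooi_invno INTEGER)")
    conn.executemany("INSERT INTO ooi VALUES (?, ?)", [("AB", i) for i in range(5)])
    out = io.StringIO()
    count = export_query(conn.cursor(), "SELECT ooi_lno, ooi_invno FROM ooi", out)
    print(count)
    print(out.getvalue())
```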

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code for writing Lucene indexes to Azure blob storage: I copied the blob to the local storage of the Azure web/worker role and read/wrote docs to the index there. I used a custom locking mechanism to make sure we don't have clashes between reads and writes to the blob. I am hoping the Azure Library will take care of these issues for me.
However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now, my question is: if I have to maintain the index (i.e. keep a snapshot of the index file and use it if the main index gets corrupted), how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob so that only the latest file is kept after each write to the index?
Thanks
Kapil
After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a Worker Role which would mount a VHD using the Block Storage and host the Lucene.NET index on it. The code checked to make sure the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. It could have worked well in Test and Production, but requirements meant we needed to move to EC2.
I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results as well, but hopefully this answer will be of some use to you.
Firstly, the compound-file option: from what I am reading and figuring out, the compound file is a single large file with all the index data inside. The alternative to this is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with this is that once you get to lots of files (I foolishly set this to about 5000) it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it's been running for 20 min now and still not finished...).
As for backing up indexes, I have not come up against this yet, but given we have about 5 million records currently, and that will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them and uploading them with today's date would work. If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong, but again, that depends on the number.
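The "zip the index files and name the archive with today's date" idea could look roughly like this in Python. It is a sketch only: it covers the zipping step on local disk, leaves out the download from and upload to blob storage, and backup_index plus the index-YYYY-MM-DD naming are assumptions made for the example.

```python
import datetime
import shutil
import tempfile
import zipfile
from pathlib import Path


def backup_index(index_dir, backup_dir, today=None):
    """Zip every file under index_dir into backup_dir/index-YYYY-MM-DD.zip."""
    today = today or datetime.date.today()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    base_name = backup_dir / f"index-{today.isoformat()}"
    # make_archive appends ".zip" and returns the path of the archive it wrote.
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=str(index_dir)))


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    index_dir = root / "index"
    index_dir.mkdir()
    (index_dir / "_0.cfs").write_bytes(b"fake segment data")  # stand-in index file
    archive = backup_index(index_dir, root / "backups")
    print(archive.name)
    print(zipfile.ZipFile(archive).namelist())
```

The backup directory is kept outside the index directory so that an archive never ends up inside the next archive.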

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, sometime during the processing, it causes ColdFusion to crash, and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each line (8,000+) of the file, as well as other logic provided by me in the form of CFML.
My question is what would be the best way to go through this file. One caveat, I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (Parent table). For every file you upload you should be able to keep a list of each file and what status it is in (uploaded, processed, unprocessed)
2) Temp table to store all the rows of the data file. (child table) Import the entire data file into a temporary table. Attempting to do it all in memory will inevitably lead to some errors. Each row in this table will link to a file upload table entry above.
3) Maintain a processing status - For each row of the datafile you bring in, set a "process/unprocessed" tag. This way if it breaks, you can start from where you left off. As you run through each line, set it to be "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table in step 1 to be "processed"
By using the approach above, if something fails, you can set it to start where it left off, or at least have a clearer path of where to start investigating, or worst case clean up in your data. You will have a clear way of displaying to the user the status of the current upload processing, where it's at, and where it left off if there was an error.
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, put it in 64 bit but all of those will only increase the capacity of your system so much. It's a good idea to do these per call and do it in conjunction with the above.
Java has some neat file processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are playing with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then create a cfquery manually with this data if need be.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing is that, with so many database operations, it's possible that you are generating too much debugging output. Make sure that the following settings are turned off under Debug Output Settings in the administrator:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indexes, etc.
The third thing I would suspect is that the file hanging out in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
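For comparison, the same "one line at a time, never the whole file in memory" pattern is sketched below in Python, with a progress fraction added for user feedback. It is not CFML and not a replacement for the cfloop above; iter_lines_with_progress is a hypothetical helper and UTF-8 input is assumed.

```python
import os
import tempfile


def iter_lines_with_progress(path):
    """Yield (line, fraction_of_file_read) while reading one line at a time."""
    total = os.path.getsize(path)
    read = 0
    with open(path, "rb") as f:
        for raw in f:  # the file object streams lines; nothing is preloaded
            read += len(raw)
            fraction = read / total if total else 1.0
            yield raw.decode("utf-8").rstrip("\r\n"), fraction


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write(b"row1\nrow2\nrow3\n")
    for line, fraction in iter_lines_with_progress(tmp.name):
        print(line, f"{fraction:.0%}")
    os.remove(tmp.name)
```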
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be: can you work up an interface between ColdFusion and SSIS?
If you can upgrade to CF8, take advantage of cfloop file="", which would give you greater speed, and the file would not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF is that it offers very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to read and write files more quickly.
Really helped improve the performance of our csv importing application.