File format for storing text as easily accessible lines - file-io

I have a data structure (B-Tree) that stores text as a series of nodes each one representing a single line of the text. I would like to store the text in a file that I could keep in sync with the structure without having to rewrite the entire file on each edit. So when line n of my structure is changed I can access and change only line n of the file keeping it updated with the structure.
Can someone point me in the right direction?
The reason for this is I'm trying to store the state of my structure so I can restore after a crash, but without the overhead of constantly writing the entire file. (Could be a lot of data)

Looks like you want a B-Tree.
If you go this route, keep in mind all relational databases are built on B-Trees. So you may consider using some embedded database, like SQLite, instead.

Related

Load file from Cloud Storage to BigQuery to single string column

We are designing a new ingestion framework (Cloud Storage -> BigQuery) using Cloud Functions. However, we receive some files (json, csv) that are corrupted and cannot be inserted as is (bad field names, missing columns, etc.) not even as external tables. Therefore, we would like to ingest every row to one cell as a JSON string and deal with the issues when we cleanse the data in BigQuery.
Is there a way to do that natively and efficiently and as little processing possible (so Cloud Functions wouldn't time out)? I wrote a function that processes the files and wraps lines one by one but for bigger files it won't be an option. We would prefer to stay with Cloud Functions to have this as lightweight as possible.
My option in that case is to ingest the CSV with a dummy separator, for instance # or |. I know that I will never have those characters and that's why I chose them.
Like that, the schema autodetect detect only 1 column, and create a single string column table.
If you can pick a character like that, it's the easiest solution, but without any guaranty of course (it's corrupted file, it's hard to know in advance what will be the unused characters)

Cocoa Core Data and plain text file

I'm wrinting a very simple application which can read data from a specified file and populate the fetched data to a NSTableView. The file is a plain text file where each line represents an object (I need to parse the file). I want to use Core Data and my question is what is the Cocoa way to do that?
My first idea is to parse the file and create instances for the Entity which represents one line. I'm not sure that is it the best solution. And later I'll write out the changes to the file (after save? or automatically after a given amount of time?)
My configuration is Mountain Lion and the latest XCode.
A single entity with a number of attributes sounds good. If you had an attribute which holds a reasonable amount of data and was repeated on a number of 'rows' then it would be a candidate for another entity.
Yes, each instance of your entity would represent one row in your table.
Personally, I would either save on request by the user, or not have a save button and save each change. Your issue is the conversion between your in-memory store and on-disk store. Using a plain text file doesn't help (though, depending on the file format, it could be possible to edit individual lines in the file using NSFileHandle).
Generally it would still make more sense to use a better format like XML or JSON and then make use of a framework like RestKit which will do all of the parsing and mapping for you (after you specify the mapping configuration).
You can also use bindings to connect your data to your NSTableView. Another ref.

Database Schema, pointer to file

This is probably a really simple question, but just making sure. I am designing a database schema and some of tables should link to files on the file system (PDF, PPT, etc).
How should this be done?
My initial idea is varchar(255) with the absolute/relative path to the file. Is there a better way to do this? I've searched online and found varbinary(max), but not sure if that's what I actually want; I don't wish to actually load any binary into the database, merely to have a pointer to a file.
This depends on the OS and the max length of a valid path. What you are calling a "pointer" is just a text field with the file path, so no different than other character data.
I would usually store the relative path, and have the root folder specified in my application. This way you can move files to a different drive, for example, and not have to udpate the rows in your db.
The actual data type you choose depends on the dbms you are using. Some databases also provide specific data types for files that you may want to explore, e.g., the FileStream data type introduced in SQL Server 2008.
You need to store in the database de name of the file, and it's path, is that right? Then you should create a fild with varchar(255). I always used like that and never had problems.
Hope it helped.
If you don't want to store the file's binary data in the database, then storing the path is the only way to go. Whether you store the absolute path or the relative path is up to you.
Yep that's basically it.
Relative path from some location configured as a parameter in Db is the usual way of it.
Aside from getting round length restrictions.
If you had say C:MySystem\MyData as the base path. Then you could do Images\MyImageFile.jpg, Docs\MyDopc.pdf etc.
Note the impact on backup and restore though. You have to do the database and the file system.
One other potential consideration is filenames have to be unique. So you If Fred and Wilma both up load Picture1.jpg, the db is okay, but the file system will be stuffed.
Usual way round this is to have a user filename and an actual filename.
So Fred's Picture1.jpg is actually p000004566.jpg
Don't forget to add code to cope with the file you think should be there has been deleted by some twit.
Also some sort of admin task to tidy up orphaned files might be in order, in the infinitely unlikely event that a coding error was made. :)
Also if the path to the file is configurable by software, make sure you check that the account that will be doing the work has read write access, might also want to use a UNC path, but don't saddle yourself with a mapped drive.

what data type to use for storing different files in database

Im trying to make a database accept different files in a postgres database table. The files I want to support are of different mime-types. I want to support pdf, word, plain text, and power point. The problem is that i don't know what datatype to choose. The documentation to pgadmin (the tool im using) is very (let´s say) unsatisfactory. Thanks
While you can store the file contents in the database, consider storing the file path instead and using the file system to store the file.
In the IT world "you can do anything with anything", but that doesn't mean you should.
In this case, you're trying to use a database as a file system, which it can do, but databases are not as efficient or practical as file systems for storing file contents (typically "large" data). It will:
make your backups longer and larger
slow your insert queries down (more I/O)
make your log files larger (slower and fill more often)
make accessing the files slower (query vs simple disk I/O)
require you to go via the database to access the files (hassle, can't use browser etc)
etc
You can use bytea type in PostgreSQL.

Changing hash of a files

I have a folder full of binary files and I want to make a change to these files so that the hash of these files will change. I want to do this is a fashion that doesn't pertinently corrupt the files. Meaning that the change should still allow the file to operate normally or that I should be able to undo the change at any point in time.
Does anyone know of a script that I could use to do this or many a program that will automate this?
Cheers
UPDATE
Its a edge case that I am trying to deal with. I have a system that only allows me to store a file with a given hash once. Hence I am wanting to change the content hash of the file to allow the file to be stored. Note the system in question is not one I control or can change.
Couldn't I just add a random 1 to the end of the file and then remove it afterward without breaking anything? I'm just not sure how to script this - as in how to modify the binary data in this way. Note I'm in a windows environment.
Without knowing the format of the files, we can't tell. It may in fact be impossible - for instance if these binary files are self-signed with some private key. Changing any single bit within the file is likely to render it invalid.
Is your hash calculated purely from the contents, and not any other metadata that you can change (such as filename or modified date)? If so, you're probably out of luck. If the hash is meant to detect when the content changes, but you're trying to change the hash without actually changing the content, you've clearly got a problem...
What is the hash used for? Why do you want to change it? There may be an alternative solution if you could give us more information about the bigger picture.
EDIT: One alternative is to effectively create your own container format - so while a file is stored in your container format, it's not usable in its original form, but it can be extracted easily. Your container could be as simple as "add four bytes at the end as a seed to disturb the hash" - "extracting" the file would just involve copying it and removing the last four bytes. But the important point is that what you end up with isn't an MP3 file or whatever you started with - it's your custom format, simple as it is. You need to package/extract the file any time you interact with the store.