Core Data sqlite file growing - objective-c

I have a synchronization process and I'm using Core Data to store a lot of information. Several times I have downloaded the actual SQLite database file with the Organizer to check whether the data is correct.
A few days ago I noticed that the size difference between two SQLite files was huge: one file was 80 MB, the other about 100 MB. When I inspected the data with an SQLite viewer there was no difference: the same tables, same indexes, same rows. How can that be? Is it possible that some data remains in the file after I delete objects through Core Data?
EDIT:
The solution is an option key that can be put into an options NSDictionary and passed to the addPersistentStoreWithType:configuration:URL:options:error: method.
NSSQLiteManualVacuumOption
Option key to rebuild the store file, forcing a database wide defragmentation when the store is added to the coordinator.
This invokes SQLite's VACUUM command. It is ignored by stores other than the SQLite store.
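A minimal sketch of how that looks when the store is added (coordinator and storeURL stand in for whatever your Core Data stack already has):

NSDictionary *options = @{ NSSQLiteManualVacuumOption : @YES };

NSError *error = nil;
NSPersistentStore *store =
    [coordinator addPersistentStoreWithType:NSSQLiteStoreType
                              configuration:nil
                                        URL:storeURL
                                    options:options
                                      error:&error];
if (!store) {
    NSLog(@"Unresolved error adding persistent store: %@", error);
}
// The VACUUM runs while the store is being added, so adding the store can
// take noticeably longer on large files.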

SQLite does not proactively return unused disk space as it deletes data, for performance reasons. This could be why you see the difference. See this link for more info:
SQLite FAQ
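If you want to inspect or reclaim the free pages by hand, this is standard SQLite that you can run against the store file with the sqlite3 shell:

-- pages currently sitting unused in the file
PRAGMA freelist_count;
-- rebuild the file and give the free pages back to the file system
VACUUM;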

Related

Is SELECT retrieving data from the WAL files?

To my understanding, the database can postpone writing to the table files to boost IO performance. When a transaction is COMMITted, the data is written to the WAL files.
I'm curious how long the writing to the table files can be delayed. In particular, when I run a simple SELECT, e.g.
SELECT * from myTable;
after the COMMIT, is it possible that the database has to retrieve data from the WAL files in addition to the table files?
The documentation talks about being able to postpone the flushing of data pages:
If we follow this procedure, we do not need to flush data pages to disk on every transaction commit.
What WAL files allow an RDBMS to do is keep "dirty" data pages in memory and flush them to disk at a later time. It is not the modification of the data pages that is postponed; only the flush to disk is.
So the answer to your question is "No, a SELECT is always retrieving data from the data pages, not from the WAL files".
PostgreSQL does not read from WAL for this purpose during normal operations. It would only read WAL in order to apply changes to the data files after a crash (or during replication onto another server).
When data from ordinary data files is changed, the pages of the data files are kept in shared memory (in shared_buffers) until they are written to disk. Any other processes wanting to see that data will find it in that shared memory, and it will be in its changed form. They will always look for it in shared_buffers before they try to read it from disk, so no one will ever see the stale on-disk version of the data, except for the recovery process after a crash.

archiving some redis data to disk

I have been using redis a lot lately, and really am loving it. I am mostly familiar with persistence (rdb and aof). I do have one concern. I would like to be able to selectively "archive" some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it because it might be valuable at some point.
All of my keys are named id_<id>_<someattribute>. So when I am done with id 4, I want to "archive" all keys that match id_4_*. I can view them quite easily with the command line, but I can't do anything with them, per se. I have quite a bit of data (very large bitmaps) associated with this data set, and frankly I can't afford the space once the id is no longer relevant or important.
If this were MySQL, I would have my different tables and could very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it. Or maybe I have two MySQL databases and I want to move one table from one to the other. Are there redis equivalents to these processes? Is there some way to make an rdb or aof file that is a subset of the data?
Any help or input on this matter would be appreciated! Thanks!
@Hoseong Hwang recently asked what I did, so I'm posting what I ended up doing.
It was really quite simple, actually. I benefited from the fact that my key space is segmented by user. All of my keys were of the structure user_<USERID>_<OTHERVALUES>. My archival needs were on a per-user basis: some users' data no longer needed to be kept in redis.
So, I started up another instance of redis-server, on another local port (6380?) or on another machine; it makes no difference. Then I wrote a short script that basically just called KEYS user_<USERID>_* (I understand the blocking nature of KEYS; my key space is small enough that it didn't matter, and you can use SCAN if that is an issue for you). Then, for each key, I ran MIGRATE to move it to that new redis-server instance. After they were all done, I did a SAVE to ensure that the rdb file for that instance was up to date. And now I have that rdb, which contains just the content that I wanted to archive. I then terminated that temporary redis-server and the memory was reclaimed.
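A rough sketch of what that script boils down to (this version uses SCAN via redis-cli --scan rather than KEYS, and assumes the temporary instance listens on localhost:6380 and the keys belong to user 4; adjust the pattern, host and timeout to your setup):

# stream matching keys without blocking the server, then move each one
redis-cli --scan --pattern 'user_4_*' | while read key; do
    redis-cli migrate 127.0.0.1 6380 "$key" 0 5000
done

# persist the archive instance to its own dump.rdb, then shut it down
redis-cli -p 6380 save
redis-cli -p 6380 shutdown nosave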
Now, keep that rdb file somewhere for cheap, safe keeping. And if you ever needed it again, doing the reverse of my process above to get those keys back into your main redis-server would be fairly straightforward.
Instead of trying to extract data from a live Redis instance for archiving purposes, my suggestion would be to extract the data from a dump file.
Run a bgsave command to generate a dump, and then use redis-rdb-tools to extract the keys you are interested in - you can easily get the result as a json file.
See https://github.com/sripathikrishnan/redis-rdb-tools
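For example, something along these lines (the dump path and key pattern are placeholders):

redis-cli bgsave
rdb --command json --key "id_4_.*" /var/redis/dump.rdb > archive_id_4.json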
You can keep the json data in flat files, or try to store them into a relational database or a document store if you need them to be indexed for retrieval purpose.
A few suggestions for you...
I would like to be able to selectively "archive" some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it because it might be valuable at some point.
If such data is that valuable, use a traditional database for storage. Despite redis supporting snapshotting to disk and AOF logs, you should view it as mostly volatile storage. The primary use case for redis is reducing latency, not persistence of valuable data.
So when I am done with id 4, I want to "archive" all keys that match id_4_*
What constitutes "done"? You need to ask yourself this question: does it mean that after 1 day the data can fall out of redis? If so, just use TTLs and expiration to let redis remove the object from memory. If you need it again, fall back to the database and pull the object back into redis. The first client will take the hit of pulling from the db, but subsequent requests will be cached. If "done" means something not tied to a specific duration, then you'll have to remove items from redis manually to conserve memory space.
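If expiration fits your definition of "done", the commands are just these (the key name below is made up, following your id_<id>_<someattribute> scheme):

redis-cli expire id_4_bitmap 86400   # let redis drop the key from memory after one day
redis-cli ttl id_4_bitmap            # check how many seconds it has left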
If this were MySQL, I would have my different tables and could very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it.
We do the same at my firm. Important data is imported into redis from the RDBMS by an on-demand job. We don't drop tables; we just selectively import data from the database into redis. Nothing wrong with that.
Is there some way to make an rdb or aof file that is a subset of the data?
I don't believe there is a way to do selective archiving; it's either all or none.
IMO, spend more time playing with redis. I highly recommend leveraging out-of-box features instead of reinventing and/or over-engineering solutions to suit your needs.
Hope that helps!...

BLOB's in SQL that stores a Video file

I am hoping someone can explain how to use BLOBs. I see that BLOBs can be used to store video files. My question is why would a person store a video file in a BLOB in a SQL database? What are the advantages and disadvantages compared to storing pointers to the location of the video file?
A few different reasons.
If you store a pointer to a file on disk (presumably using the BFILE data type), you have to ensure that your database is updated whenever files are moved, renamed, or deleted on disk. It's relatively common when you store data like this that over time your database gets out of sync with the file system and you end up with broken links and orphaned content.
If you store a pointer to a file on disk, you cannot use transactional semantics when you're dealing with multimedia. Since you can't issue a rollback against a file system, you either have to accept that there will be situations where the data on the file system doesn't match the data in the database (i.e. someone uploaded a video to the file system but the transaction that created the author and title in the database failed, or vice versa), or you have to add additional steps to the file upload to simulate transactional semantics (i.e. upload a second <>_done.txt file that just contains the number of bytes in the actual file that was uploaded). That's cumbersome and error-prone and may create usability issues.
For many applications, having the database serve up the data is the easiest way to provide it to a user. If you want to avoid giving a user a direct FTP URL to your files because they could use it to bypass application-level security, the easiest option is to have a database-backed application in which, to retrieve the data, the database reads it from the file system and returns it to the middle tier, which then sends it to the client. If you're going to have to read the data into the database every time it is retrieved, it often makes more sense to just store the data directly in the database and let the database read it from its own data files when the user asks for it.
Finally, databases like Oracle provide additional utilities for working with multimedia data in the database. Oracle interMedia, for example, provides a rich set of objects to interact with video data stored in the database: you can easily tag where scenes begin or end, tag where various subjects are discussed, record when the video was recorded and who recorded it, etc. And you can integrate that search functionality with searches against all your relational data. Of course, you could write an application on top of the database that did all of those things as well, but then you're either writing a lot of code or using another framework in your app. It's often much easier to leverage the database functionality.
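To make the two approaches concrete, here is a minimal sketch in Oracle-style SQL (table and column names are invented):

-- the video bytes live inside the database and participate in transactions
CREATE TABLE videos (
    id      NUMBER PRIMARY KEY,
    title   VARCHAR2(200),
    content BLOB
);

-- alternative: the row only points at a file on the server's file system
-- content BFILE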
Take a read of this: http://www.oracle.com/us/products/database/options/spatial/039950.pdf
(Obviously a biased view, but it does mention a few cons that have since been fixed by the advent of 11g.)

Is having multiple data/log files a good thing even on the same LUN?

I have read that it is a good idea to have one file per CPU/CPU Core so that SQL can more efficiently stream data to and from the disks. Ok, I can see the benefit if they are on different spindles, but what if I only have one spindle (4 drives in Raid 10) for my data files (.mdf and .ndf), will I still benefit from splitting the data files (from just the .mdf file to a .mdf and several .ndf files)? Same goes for the log file, although I see no benefit to it as the data has to be written serially and you're limited by the spindle's sequential write speed...
FYI, this is in regards to SQL Server 2005/2008...
Thanks.
The recommendation for multiple tempdb data files is definitely not about IOPS. It is about contention on the allocation pages (GAM, SGAM, PFS) in tempdb. SQL 2005+ puts less load on these pages, but contention still occurs. Not all systems require a 1 file to 1 core mapping. Most systems will perform well with 1 file to 2 or 4 cores. Having too many files adds overhead for managing the files. A good recommendation is to start with 1:4 or 1:2 and increase if contention continues. Don't go above 1:1.
For other databases, this is not recommended.
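For example, adding a second tempdb data file looks like this (the logical name, path and sizes are placeholders):

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\tempdb\tempdb2.ndf',
          SIZE = 4096MB,
          FILEGROWTH = 512MB);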
And yes, only 1 log file ... always.
8 Steps to better Transaction Log throughput:
Create only ONE transaction log file. Even though you can create multiple transaction log files, you only need one... SQL Server DOES NOT "stripe" across multiple transaction log files. Instead, SQL Server uses the transaction log files sequentially.
Misconceptions around TF 1118:
Why is the trace flag not required so much in 2005 and 2008? In SQL Server 2005, my team changed the allocation system for tempdb to reduce the possibility of contention. There is now a cache of temp tables. When a new temp table is created on a cold system (just after startup) it uses the same mechanism as for SQL 2000. When it is dropped though, instead of all the pages being deallocated completely, one IAM page and one data page are left allocated, and the temp table is put into a special cache. Subsequent temp table creations will look in the cache to see if they can just grab a pre-created temp table 'off the shelf'. If so, this avoids accessing the allocation bitmaps completely. The temp table cache isn't huge (I think it's 32 tables), but this can still lead to a big drop in latch contention in tempdb.
So the answer is NO to both questions. Log striping was never an issue, and one-NDF-per-CPU is largely a myth, one that will take a very long time to die out. Multiple files IMHO make sense only if you can stripe IO (separate LUNs). Multiple filegroups, though, do make sense, not for IO reasons but for administrative purposes: piecemeal restores and archived read-only filegroups.
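A sketch of that administrative use (database, filegroup and file names are made up): put archive data in its own filegroup and mark it read-only, which is also what enables piecemeal restores.

ALTER DATABASE Sales ADD FILEGROUP Archive;
ALTER DATABASE Sales
    ADD FILE (NAME = sales_archive1, FILENAME = 'E:\data\sales_archive1.ndf')
    TO FILEGROUP Archive;
ALTER DATABASE Sales MODIFY FILEGROUP Archive READ_ONLY;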
Still good. This is not about IOPS - it is about SQL Server BLOCKING a file for certain operations, mostly when file extents are allocated to a table / index. If you do a lot of inserts / updates, multiple files basically mean that another thread can work against another file instead of waiting on the first one.
So this is not really about IOPS load; it is about blocking behavior.

Does SQL's DELETE statement truly delete data?

Story: today one of our customers asked us whether all the data he deletes in the program is truly unrecoverable.
Aside from scheduled backups, we shrink the log file once a day, and we use the DELETE command to remove records from our tables where needed.
Still, just for the sake of it, I opened the .mdf file with an editor (I used PSPad) and searched for a particular, unique piece of data that I was sure was inside one of the tables.
Problem: I tracked it in the file, then executed the DELETE command, and it was still there.
Question:
Is there a particular command we are not aware of that deletes the records physically from the disk?
Note: we know there are particular techniques to recover lost data from the hard drives, but here I am talking about a notepad-wannabe!
The text may still be there, but SQL Server has no concept of that data having any structure or being available.
The "freed space" is simply deallocated: not removed, compacted or zeroed.
The "Instant File Initialization" feature relies on this too (the entire MDF file is not zeroed), and previous disk data is still available even for a brand new database:
Because the deleted disk content is overwritten only as new data is written to the files, the deleted content might be accessed by an unauthorized principal.
Edit: To reclaim space:
ALTER INDEX...WITH REBUILD is the best way
DBCC SHRINKFILE using NOTRUNCATE can compact pages into the gaps caused by deallocated pages, but won't reclaim space within a page for deleted rows
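For example (the table, index and file names below are placeholders):

-- rewrite the index pages, which removes the remnants of deleted rows
ALTER INDEX ALL ON dbo.MyTable REBUILD;

-- then compact the data file into the gaps left by deallocated pages
DBCC SHRINKFILE (MyDatabase_Data, NOTRUNCATE);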
SQL Server just marks the space of deleted rows as available, but does not reorganize the database and does not zero out the freed up space. Try to "Shrink" the database, and the deleted rows should no longer be found.
Thanks, gbn, for your correction. A page is the allocation unit of the database, and shrinking a database only eliminates pages, but does not compact them. You'd have to delete all rows in a page in order to see them disappear after shrinking.
If your client is concerned about data security, they should use Transparent Data Encryption. Even if you obliterate the information from the table, the record is still in the log. Even when the log is recycled, the info is still in the backups.
You could update the record with dummy values before issuing the delete, thereby overwriting the data on disk before the database marks it as free. (Whether this also works with LOB fields would warrant investigation, though).
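A rough sketch of that idea (table, column and key names are invented for illustration; LOB columns may need separate treatment):

DECLARE @Id int = 4;

-- overwrite the sensitive values in place so the old bytes are gone from the page
UPDATE dbo.Customers
SET    Name  = REPLICATE('X', LEN(Name)),
       Notes = REPLICATE('X', LEN(Notes))
WHERE  Id = @Id;

-- then delete the scrubbed row as usual
DELETE FROM dbo.Customers
WHERE  Id = @Id;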
And of course, you'd still have the problem of logs and backups, but I take it you already solved those.