SQLite: resource cost of copying a table - sql

I'm looking at using SQLite to store some data that will be converted from binary to engineering units. The binary data will be kept in one table, and the engineering data will be converted from the binary data into another table. It's likely that I'll occasionally need to change the conversions. I read that SQLite doesn't support dropping columns, and I was wondering how expensive (in time and resources) it is to copy the old table to a new table, or whether I'm better off with a different database. It's preferable to have the database in a single file and not have a server running.
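For reference, the copy I have in mind is the usual recreate-and-rename pattern inside one transaction (the table and column names below are hypothetical):

BEGIN;
-- rebuild the engineering-units table with the new set of columns
CREATE TABLE engineering_new (
    sample_id   INTEGER PRIMARY KEY,
    temperature REAL,
    pressure    REAL
);
INSERT INTO engineering_new (sample_id, temperature, pressure)
    SELECT sample_id, temperature, pressure
    FROM engineering;
DROP TABLE engineering;
ALTER TABLE engineering_new RENAME TO engineering;
COMMIT;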

Related

Keeping archived data schema up to date with running data warehouse

Recently, our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need to come up with a way to archive old data that is not frequently accessed in order to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like those changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
Is there a 'canonical' way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled for relevant literature and come up with nothing.
Should I just accept the fact that if I need to actively maintain schemas, then the data is not really archived?
There are tons of materials on the internet if you search for the term "schema evolution" in the big-data space.
The Athena documentation has a chapter on schema updates, with case-by-case examples, here.
If you are reprocessing the whole archived dataset to handle a schema change, you are probably doing more work than necessary.
Since you have Parquet files, and Athena resolves Parquet columns by name rather than by index by default, you are safe in almost all cases (adding columns, dropping columns, etc.), except for column renames. To handle renamed columns (and also added/dropped ones), the fastest way is to use a view: in the view definition you can alias the renamed column. If column renames make up most of your schema evolution and you do them a lot, you can also consider Avro, which handles renames gracefully.
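For example, a minimal sketch of such a view, assuming a hypothetical archived table archive.orders in which the column customer_id was later renamed to client_id in the live schema:

CREATE OR REPLACE VIEW reporting.orders_archive AS
SELECT
    order_id,
    customer_id AS client_id,   -- old Parquet column exposed under its new name
    order_total
FROM archive.orders;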
Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
Every period, "Export tablespace" to remove the oldest partition from the partition scheme (see the sketch at the end of Plan A).
That tablespace will then be a separate table; you could copy/dump/do whatever with it, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
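A rough sketch of the rotation, assuming a hypothetical fact table range-partitioned by quarter on a DATE column (all names and dates are illustrative, and partition exchange is just one way to detach the oldest data):

-- one-time setup: note that this ALTER rebuilds the table, which is the disk-space concern above
ALTER TABLE fact_sales
  PARTITION BY RANGE COLUMNS (sale_date) (
    PARTITION p2019q1 VALUES LESS THAN ('2019-04-01'),
    PARTITION p2019q2 VALUES LESS THAN ('2019-07-01')
    -- ... one partition per quarter ...
  );

-- each period: detach the oldest partition into a standalone table
CREATE TABLE fact_sales_old LIKE fact_sales;
ALTER TABLE fact_sales_old REMOVE PARTITIONING;
ALTER TABLE fact_sales EXCHANGE PARTITION p2019q1 WITH TABLE fact_sales_old;

-- copy/dump fact_sales_old elsewhere, then reclaim the space
ALTER TABLE fact_sales DROP PARTITION p2019q1;
DROP TABLE fact_sales_old;

-- and build a partition to receive the next period's data
ALTER TABLE fact_sales ADD PARTITION (PARTITION p2024q4 VALUES LESS THAN ('2025-01-01'));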
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
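As an illustration, the summarization step might be as simple as this (table and column names are made up):

-- roll the detail rows that are about to be removed up to monthly totals
CREATE TABLE fact_sales_monthly AS
SELECT DATE_FORMAT(sale_date, '%Y-%m-01') AS sale_month,
       product_id,
       SUM(quantity) AS total_qty,
       SUM(amount)   AS total_amount
FROM fact_sales
WHERE sale_date < '2019-01-01'
GROUP BY sale_month, product_id;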

Vertica Large Objects

I am migrating a table from Oracle to Vertica that contains an LOB column. The maximum actual size of the LOB column amounts to 800MB. How can this data be accommodated in Vertica? Is it appropriate to use the Flex Table?
Vertica's documentation says that data loaded into a Flex table is stored in the column "raw", which is a LONG VARBINARY data type. By default, it has a maximum size of 32MB, which, according to the documentation, can be changed (i.e. increased) using the parameter FlexTablesRawSize.
I'm thinking this is the approach for storing large objects in Vertica: we just need to update the FlexTablesRawSize parameter to handle 800MB of data. I'd like to ask whether this is the optimal way, or if there's a better one. Or will this conflict with Vertica's row constraint limitation that only allows up to 32MB of data per row?
Thank you in advance.
If you use Vertica for what it's built for - running a Big Data analytical database - you would, as in any analytical database, try to avoid large objects in your tables. BLOBs and CLOBs are usually used to store unstructured data: large documents, image files, audio files, video files. You can't filter by such a column, run functions on it, sum it, or group by it.
A safe and performant design stores the file name in a Vertica table column, stores the file itself elsewhere (maybe even in Hadoop), and lets the front end (usually a BI tool, and all BI tools support that) retrieve the file and bring it to a report screen ...
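For illustration only, the Vertica side could then be little more than metadata plus a pointer (all names here are made up):

CREATE TABLE documents (
    doc_id      INT NOT NULL PRIMARY KEY,
    doc_name    VARCHAR(255),
    file_uri    VARCHAR(1024),   -- e.g. an HDFS or S3 path that the BI tool resolves
    uploaded_at TIMESTAMP
);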
Good luck ...
Marco

Many small data tables: fast I/O for pandas?

I have many tables (about 200K of them), each small (typically fewer than 1K rows and 10 columns), that I need to read as fast as possible in pandas. The use case is fairly typical: a function loads these tables one at a time, computes something on them, and stores the final result (without keeping the contents of the tables in memory).
This is done many times over, and I can choose the storage format for these tables for the best (speed) performance.
What natively supported storage format would be the quickest?
IMO there are a few options in this case:
use an HDF store (AKA PyTables, H5), as #jezrael has already suggested. You can decide whether you want to group some or all of your tables and store them in the same .h5 file under different identifiers (or keys, in pandas terminology)
use the new and extremely fast Feather format (part of the Apache Arrow project). NOTE: it's still a fairly new format, so it might change in the future, which could lead to incompatibilities between different versions of the feather-format module. You also can't put multiple DataFrames in one Feather file, so you can't group them.
use a database for storing/reading the tables. PS: it might be slower for your use case.
PS: you may also want to check this comparison, especially if you want to store your data in a compressed format

Store file on file system or as varbinary(MAX) in SQL Server

I understand that there is a lot of controversy over whether it is bad practice to store files as BLOBs in a database, but I just want to understand whether it would make sense in my case.
I am creating an ASP.NET application, used internally at a large company, where users need to be able to attach files to a 'job' in the system. These files are generally PDFs or Word documents, probably never exceeding a couple of MB.
I am creating a new table like so:
ID (int)
JobID (int)
FileDescription (nvarchar(250))
FileData (varbinary(MAX))
Is the use of varbinary(MAX) here ideal, or should I be storing the path to the file and simply storing the file on the file system somewhere?
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing them - do not store the employee photo in the employee table - keep it in a separate table. That way, the Employee table can stay lean, mean and very efficient, assuming you don't always need to select the employee photo as part of your queries.
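For example, something along these lines (table and column names are hypothetical):

CREATE TABLE dbo.EmployeePhoto
(
    EmployeeID INT NOT NULL PRIMARY KEY
        REFERENCES dbo.Employee(EmployeeID),   -- assumes an existing Employee table
    Photo      VARBINARY(MAX) NOT NULL
);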
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it LARGE_DATA.
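Adding it later could look roughly like this (the database name, logical file name and path are hypothetical):

ALTER DATABASE JobsDb ADD FILEGROUP LARGE_DATA;

ALTER DATABASE JobsDb
    ADD FILE (NAME = 'JobsDb_LargeData',
              FILENAME = 'D:\SqlData\JobsDb_LargeData.ndf',
              SIZE = 512MB)
    TO FILEGROUP LARGE_DATA;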
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(....... define the fields here ......)
ON Data -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!

Saving values of different types to a SQL database

I am currently writing an application which will have a lot of transactions.
Each transaction will have a value, although the value can be an int, a bit, a short string, a large string, etc.
I want to keep processing and storage to a minimum, as I would like to run this in the cloud. Should I have a lot of different fields on the transaction, e.g.
TransactionLine.valueint
TransactionLine.valuestring
TransactionLine.valuedecimal
TransactionLine.valuebool
or should I have separate tables for each transaction value type?
TransactionLine - Table
---------------
TransactionLine.ValueId
ValueInt -Table
-------
ValueInt.ValueId
ValueInt.Value
ValueString - Table
-------
ValueString.ValueId
ValueString.Value
You could store key-value pairs in the database. The only data types that can store any other data type are VARCHAR(MAX) and BLOB. That means that all data must be converted to a string before it can be stored, and that conversion will take processing time.
In the opposite direction, when you want to do a SUM, a MAX, an AVG, ... of numeric data, you will first have to convert the string back to its real data type (see the sketch after this answer). That conversion, too, will take processing time.
Databases are read a lot more than they are written to. The conversion nightmare will bring your system to its knees. There has been a lot of debate on this topic; the high cost of conversions is the killer.
There are systems that store the whole database in one single table. But in those cases the whole system is built with one clear goal: to support that system efficiently in a fast compiled programming language, like C, C++ or C#, not in a relational database language like SQL.
I'm not sure I fully understand what you really want. If you only want to store the transactions, this may be worth trying. But why would you want to store them one field at a time? Data is stored in groups, in records, and the data type of each and every column in a record is known at the creation time of the table.
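For instance, if numeric values end up in a generic string column, an aggregation has to cast on every row (a sketch with made-up names):

SELECT SUM(CAST(Value AS DECIMAL(18, 4))) AS TotalDecimalValue
FROM TransactionLineValue
WHERE ValueType = 'decimal';   -- each matching row is converted before it can be summed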
You should really look into Cassandra. When you say a lot of transactions, do you mean millions of records? For Cassandra, handling millions of records is the norm. You will have a column family (in an RDBMS, a table is similar to a column family) storing many rows, and for each row you do not need to predefine the columns: they can be defined on demand, which reduces storage dramatically, especially if you are dealing with a lot of records.
You do not need to worry about whether the data is an int, string, decimal or bool, because the default data type for a column value is BytesType. There are other data types which you can predefine in the column family's column metadata if you want to. Since you are starting to write the application, I suggest you spend some time reading up on Cassandra and how it would help in your situation.