Keeping archived data schema up to date with running data warehouse - amazon-s3

Recently our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need to come up with a way to archive old, infrequently accessed data to free up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like those changes to be reflected in the old, archived data as well, but I can't come up with a good way to do it without reprocessing the entire dataset.
Is there a canonical way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled for relevant literature and come up with nothing.
Should I just accept that if I need to actively maintain schemas, the data is not really archived?

There is plenty of material online if you search for the term "schema evolution" in the big data space.
The Athena documentation has a chapter on handling schema updates, with case-by-case examples.
If you are reprocessing the whole archived dataset to handle a schema change, you are probably doing more work than necessary.
Since you have Parquet files, and Athena resolves Parquet columns by name rather than by index by default, you are safe in almost all cases (adding new columns, dropping columns, etc.); column renames are the exception. The fastest way to handle renamed columns (and, if you prefer, added or dropped columns as well) is to use a view: in the view definition you can alias the renamed column, as sketched below. Also, if column renames make up most of your schema evolution and you do them a lot, you can consider Avro, which handles renames gracefully.
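For illustration, a minimal sketch of the view approach in Athena SQL, assuming a hypothetical archive table archive.orders_parquet whose Parquet files still carry an old column name cust_id that the live warehouse has since renamed to customer_id:
-- Expose the archived data under the current schema (all names are placeholders).
CREATE OR REPLACE VIEW archive.orders_current AS
SELECT
  order_id,
  cust_id AS customer_id,   -- old archived name aliased to the new name
  order_total,
  order_date
FROM archive.orders_parquet;
Columns that exist only in the live warehouse can be surfaced over the archive the same way, e.g. by adding CAST(NULL AS varchar) AS new_column to the view.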

Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year, Quarter, or Month, you could:
Every period, "export" the tablespace of the oldest partition to remove it from the partition scheme.
That tablespace will then be a separate table; you could copy/dump/do whatever with it, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
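For illustration, a minimal MySQL sketch of that rotation, assuming a hypothetical fact table partitioned monthly on a DATE column and no unique keys that exclude that column. It uses EXCHANGE PARTITION to peel the oldest partition off as its own table, which is one common way to implement the "export" step (transportable tablespaces are another):
-- One-time setup (needs room for a full copy of the table, as noted above).
ALTER TABLE fact_sales
  PARTITION BY RANGE COLUMNS (report_date) (
    PARTITION p202301 VALUES LESS THAN ('2023-02-01'),
    PARTITION p202302 VALUES LESS THAN ('2023-03-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
  );

-- Each period: split pmax to create the partition for the new period...
ALTER TABLE fact_sales REORGANIZE PARTITION pmax INTO (
  PARTITION p202303 VALUES LESS THAN ('2023-04-01'),
  PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);

-- ...and peel the oldest partition off into a standalone table.
CREATE TABLE fact_sales_p202301 LIKE fact_sales;
ALTER TABLE fact_sales_p202301 REMOVE PARTITIONING;
ALTER TABLE fact_sales EXCHANGE PARTITION p202301 WITH TABLE fact_sales_p202301;
ALTER TABLE fact_sales DROP PARTITION p202301;  -- now empty; its space goes back to the OS
-- fact_sales_p202301 can now be dumped to Parquet/S3 and dropped.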
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
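A minimal sketch of that idea, with hypothetical table and column names: roll the rows that are about to be removed up into a much smaller summary table before deleting them from the big table.
-- Monthly rollup of the detail rows that are about to be archived/removed.
CREATE TABLE sales_monthly_summary (
  month      DATE NOT NULL,
  branch_id  INT  NOT NULL,
  order_cnt  INT  NOT NULL,
  total_amt  DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (month, branch_id)
);

INSERT INTO sales_monthly_summary (month, branch_id, order_cnt, total_amt)
SELECT DATE_FORMAT(order_date, '%Y-%m-01'),
       branch_id,
       COUNT(*),
       SUM(amount)
FROM fact_sales
WHERE order_date < '2018-01-01'
GROUP BY DATE_FORMAT(order_date, '%Y-%m-01'), branch_id;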

Related

How many temporary/staging tables to use during the transform step of ETL?

My first thought is to load the data from S3 into a temporary table, apply the necessary transformations, and then INSERT INTO the final target table. All the tables would have the same columns and live in Redshift.
However, how big of a performance hit would there be from using multiple UPDATEs? Would it be better to split the UPDATEs and filtering between multiple temporary tables for daily batch processing?
Instead of S3 -> TEMP -> FINAL, the flow would look like S3 -> TEMP1 -> ... -> TEMPN -> FINAL, where "->" would be "INSERT INTO".
Also, is it better to create temporary tables (CREATE TEMP TABLE) on the spot and drop them every day, or to use persistent tables that are truncated every day? I think persistent tables would be the better choice, as they let me check how the data looked as it was loaded and transformed that day.
As you are seeing, there are lots of ways to run an update process, and which is better will depend on factors that are not presented here. First off, let's clarify what a TEMP table is and differentiate it from a staging table. A TEMP table only lives as long as the current session (connection) is active; if the connection drops, so does the TEMP table. A staging table is a permanent table used for staging data, which more closely matches what you are describing in parts of your question. I'll use these two terms to be clear about which is meant (TEMP or staging).
Your question revolves around how big of a performance hit it would be to have a series of tables in the ETL (ELT?) process to improve, I expect, diagnosability / debug-ability. This is a fine goal but there are some downsides as with all tradeoffs in the real world.
If this is correct these tables will need to be staging tables as TEMP tables will disappear when the ETL session ends.
Saving a bunch of staging tables when one could be used has some downsides, but how big they are depends on your situation. If your cluster is fairly idle and the ETL data payload isn't huge, then the impact of the extra tables on the ETL process will be real but not large (a couple of seconds or less). These impacts are mostly around setting up (or truncating) the staging or TEMP tables. But if your cluster is running other workloads when the ETL runs, the impact can be much larger.
You see, there are many "resources" in a Redshift cluster that need to be shared by everything running on the database. Some, like memory allocation, can be (somewhat) controlled through the WLM. Others cannot. The two biggies are network bandwidth and disk bandwidth. These bandwidths have a fixed capacity in Redshift, and even though they are high, they are finite. There are other limits to Redshift's ability to execute a total workload, but these, in my experience, are the big two.
Every time you create a table, TEMP or permanent, the data is stored to disk. This means a write to disk as well as distributing the data per the distribution settings of the table. Then, when the table is accessed, the data needs to be read from disk. All this unneeded data movement will have some impact; how large depends on how much data there is and what else is going on at the time. So the impact will range from moderately small to very large depending on a lot of factors, not the least of which is how many tables you are creating. The cost of doing this needs to be offset by the benefit of having these extra tables, which is a business decision.
A common pattern is to load (COPY) the data into a TEMP or staging table and then extract the rows to DELETE into one staging table and the rows to INSERT into another. Once the deletes and inserts have been applied, these tables are saved with a date stamp in the name and possibly unloaded to S3. After a while these sets of data are deleted; keeping one month is common. This way you can figure out 'what happened' if things go sideways. This, plus good database backups, can be used to recover from code bugs.
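As a hedged, simplified sketch of that pattern in Redshift SQL (the table names, S3 path, and IAM role are placeholders):
-- Load the incoming batch into a TEMP table shaped like the target.
CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://my-bucket/incoming/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'   -- placeholder role
FORMAT AS PARQUET;

-- Apply the batch as delete-then-insert in one transaction.
BEGIN;
DELETE FROM orders
USING stage_orders
WHERE orders.order_id = stage_orders.order_id;

INSERT INTO orders SELECT * FROM stage_orders;
COMMIT;

-- Optionally keep a date-stamped copy of the batch for later diagnosis, then drop it after a month or so.
CREATE TABLE orders_batch_20230101 AS SELECT * FROM stage_orders;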
Your secondary question is about whether it is better to drop and recreate or to truncate. There have been a number of performance improvements to both of these statements. With a grain of salt, I'll offer my slightly dated experience comparing them. Both are fast, but I saw drop-and-recreate as slightly faster (fewer dependencies to manage). That said, the main difference is in how they interoperate with other aspects of the database. DROP will fail if there are dependent views (unless cascaded), and table permissions will be lost. DROP cannot be run in a transaction block, and since it needs an exclusive lock on the table it can be held off by another session reading the table. TRUNCATE can run in a transaction block but will force a commit, so transaction changes will become visible to all. It is usually these differences that make the decision about TRUNCATE vs. DROP, and there are other options, such as DELETE and ALTER TABLE APPEND, that have their own sets of pluses and minuses.
So I'd generally advise against creating more tables than are actually needed in the ETL process once all needs are weighed (including performance and business needs). You may have excess capacity now, but Redshift clusters usually get busier over time. The guiding principle here is: don't move large amounts of data more times than necessary.

How to archive a giant postgres table?

We have a Postgres table called history which is almost 900 GB and continuously growing by 10 GB per day.
This table is accessed by a microservice (carts). We have Postgres replication set up (one master and one slave).
Two instances of the microservice are running in production: one instance uses the master Postgres connection to write and read data for some endpoints,
and the other instance uses the slave connection for reads only.
Table definition:
id - uuid
data - jsonb column
internal - jsonb column
context - jsonb column
created_date - date
modified_date - date
In the above table, the data and internal columns hold large JSON documents for every row. We have come to the conclusion that nulling out the data and internal
columns will reduce the total space consumed by this table.
Question:
How do we archive this giant table (meaning cleaning up only the data and internal columns)?
How do we achieve this with zero downtime and no performance degradation?
Approaches tested so far:
Using pg_repack (the best idea so far, but the issue is that once pg_repack is done, the entire new table needs to be synced to the slave instance, which causes WAL overhead).
Just nullifying the data and internal columns - the problem with this approach is that it initially increases the table size, because Postgres's MVCC keeps the old row versions around until they are vacuumed.
Using a temp table and cloning the data:
Create an UNLOGGED table - historyv2
Copy the data from the original table to the historyv2 table without data and internal
Then switch the table to LOGGED (I guess this will also cause WAL overhead)
Then rename the tables.
Can you guys give me some pointers on how to achieve this?
Postgres Version: 9.5
I always feel that questions like these conflate a few different ideas, which makes them seem more complicated than they should be. What does minimal performance impact mean? Generating lots of WAL may increase file I/O, CPU, and network usage, but in most cases it does not affect the system enough to have a client-facing impact. If no downtime is the most important thing, you should focus on optimizing for that, and not worry about what the process is to get there (within reason).
That said, if I woke up in your shoes, I would first set up partitions for data going forward using table inheritance, so that the data could be more easily segmented and worked on in the future. (This isn't strictly necessary, but it will probably make your life easier later.) After that, I would write a script to slowly go through the old data, creating new partitions with the "nulled out" data and interleaving partition creation, deletion of data, and vacuums against the main table. Once that is automated, you can let it churn slowly or during off-hours until it is done. You might need to do a final repack or vacuum full on the parent once all the data is moved, but it's probably OK even without it. Again, this isn't the simplest idea, and probably not the fastest way to do it (if you could have downtime), but in the end you'll have the schema you want without causing any service disruptions.
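As a rough sketch of the inheritance-based approach described above (Postgres 9.5 predates declarative partitioning), with hypothetical partition names and a monthly split on created_date:
-- Child table for one month; the parent table keeps its definition.
CREATE TABLE history_y2020m01 (
    CHECK (created_date >= DATE '2020-01-01'
       AND created_date <  DATE '2020-02-01')
) INHERITS (history);

-- Move one slice of old rows, nulling the big JSON columns on the way.
INSERT INTO history_y2020m01 (id, data, internal, context, created_date, modified_date)
SELECT id, NULL, NULL, context, created_date, modified_date
FROM ONLY history
WHERE created_date >= DATE '2020-01-01'
  AND created_date <  DATE '2020-02-01';

DELETE FROM ONLY history
WHERE created_date >= DATE '2020-01-01'
  AND created_date <  DATE '2020-02-01';

VACUUM history;  -- reclaim dead tuples before moving on to the next slice
Queries against history still see the moved rows through inheritance, just with data and internal set to NULL.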

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the __DATASET__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __DATASET__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
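And, as a hedged sketch of the first suggestion above (copying the table while adding a timestamp column) in legacy BigQuery SQL, where the dataset, table, and column names are placeholders:
SELECT
  source_id,
  payload,
  CURRENT_TIMESTAMP() AS load_time
FROM [my_dataset.temp_table]
You would run this with allow_large_results enabled and a destination table such as my_dataset.final_table, as mentioned above.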
Another alternative to consider is the BigQuery streaming API (see the BigQuery docs for more info). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (also covered in the BigQuery docs).

Archiving the table: searching for the best way

There is a table which has 80,000 rows.
Every day I clone this table to another log table, giving it a name like 20101129_TABLE,
and each day the prefix changes according to the date.
As you can calculate, that adds up to about 2,400,000 rows every month.
Please advise on saving space, keeping access fast, and the other advantages and disadvantages! How should I think about creating the best archive or log?
It is a table that holds account info: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design-choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That will give you the option of storing the database on a separate disk (cost-efficiency) and you can have a separate backup schedule on that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this depends really on how you plan on accessing the data later.
Consider TABLE PARTITIONING (MSDN) - it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all the data in the same table and let MSSQL do the hard work in the background (choosing which partition to use based on the select criteria, etc.).
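A hedged T-SQL sketch of that suggestion, combining the ArchiveDate column idea above with partitioning; all names, types, and boundary dates are placeholders:
-- Split rows into monthly ranges by archive date.
CREATE PARTITION FUNCTION pf_archive_date (datetime)
AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

-- Everything maps to PRIMARY here; map each range to its own filegroup/disk if desired.
CREATE PARTITION SCHEME ps_archive_date
AS PARTITION pf_archive_date ALL TO ([PRIMARY]);

-- One archive table instead of a new table per day.
CREATE TABLE dbo.AccountsArchive (
    ArchiveDate datetime      NOT NULL,
    BranchCode  varchar(10)   NOT NULL,
    Balance     decimal(18,2) NOT NULL
) ON ps_archive_date (ArchiveDate);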

SQL 2005 DB Partitioning for SharePoint

Background
I have a massive db for a SharePoint site collection. It is 130 GB and growing at 10 GB per month. 100 GB of the 130 GB is in one site collection. 30 GB is the version table. There is only one site collection - this is by design.
Question
Am I able to partition a database (SharePoint) using SQL 2005's data partitioning features (creating multiple data files)?
Is it possible to partition a database that is already created?
Has anyone partitioned a SharePoint DB? Will I encounter any issues?
You would have to create a partition set and rebuild the table on that partition set. SQL 2005 can only partition on a single column, so you would have to have a column in the DB that:
Behaves fairly predictably so you don't get a large skew in the amount of data in each partition
IIRC the column has to be a numeric or datetime value
In practice it's easiest if it's monotonically increasing - you can create a series of partitions (automatically or manually) and the system will fill them up as it gets to the range definitions.
A date (perhaps the date the document was entered) would be ideal. However, you may or may not have a useful column on the large table. M.S. tech support would be the best source of advice for this.
The partitioning should be transparent to the application (again, you need a column with appropriate behaviour to use as a partition key).
Unless you are lucky enough to have a partition key column that is also used as a search predicate in the most common queries, you may not get much query performance benefit from the partitioning. An example of a column that works well is a date column on a data warehouse. However, your SharePoint application may not make extensive use of this sort of query.
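For illustration, a hedged sketch of what rebuilding an existing table onto a partition scheme looks like in T-SQL. The object names are placeholders, and it assumes the big table has a suitable datetime column (TimeCreated) with a plain, non-unique clustered index (CIX_Docs) on it:
-- Partition function and scheme over the candidate date column.
CREATE PARTITION FUNCTION pf_docs_by_date (datetime)
AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01');

CREATE PARTITION SCHEME ps_docs_by_date
AS PARTITION pf_docs_by_date ALL TO ([PRIMARY]);  -- or one filegroup per range

-- Rebuilding the existing clustered index onto the partition scheme
-- physically redistributes the rows across the partitions.
CREATE CLUSTERED INDEX CIX_Docs
ON dbo.Docs (TimeCreated)
WITH (DROP_EXISTING = ON)
ON ps_docs_by_date (TimeCreated);
Note that this rebuild needs significant time and free space on a table of that size.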
Mauro,
Is there no way you can segment the data at the SharePoint level?
I.e. you may have multiple "sites" using a single (SQL) content database.
You could migrate site data to a new content database, which would allow you to reduce the data in that large content database and then shrink the data files.
It will also assist you in managing your obvious continued growth.
James.