I am using RDS on AWS.
I inserted 5 billion rows into one table and now need to add an index on a varchar column. (To speed up the insert, I dropped the index beforehand.)
However, building the index takes too long and fails with the following error:
ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp707.0.sharedfileset/0.41": No space left on device
(The instance class is db.r6g.2xlarge.)
So I'm considering splitting the table up.
However, I want to shard by rows, based on the primary key id (nextval('table_id'::regclass)), rather than splitting off columns.
And when querying with SQL, I still want to use it as one table.
I think that's a horizontal sharding (partitioning) concept rather than a vertical one...!
Is this possible in PostgreSQL?
As an additional question...
Is it possible to split a table that already contains data this way?
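To make it concrete, here is a rough sketch of the kind of row-based split I mean, using PostgreSQL declarative partitioning; the table name, columns, and partition bounds are made up, only the 'table_id' sequence comes from my actual setup:

-- Hypothetical parent table, partitioned by ranges of the sequential id
-- (declarative partitioning needs PostgreSQL 10+, parent PK/indexes need 11+).
CREATE TABLE my_table (
    id   bigint NOT NULL DEFAULT nextval('table_id'::regclass),
    name varchar(255),
    PRIMARY KEY (id)
) PARTITION BY RANGE (id);

-- Each partition holds a contiguous slice of ids.
CREATE TABLE my_table_p1 PARTITION OF my_table
    FOR VALUES FROM (1) TO (1000000001);
CREATE TABLE my_table_p2 PARTITION OF my_table
    FOR VALUES FROM (1000000001) TO (2000000001);

-- An index created on the parent is built on every partition separately.
CREATE INDEX my_table_name_idx ON my_table (name);

-- Queries still go through the single parent table.
SELECT * FROM my_table WHERE name = 'example';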
Problem description
In an ETL pipeline, we update a table in an SQL database from a pandas dataframe. The table has about 2 million rows, and the dataframe updates approximately 1 million of them. We do it with SQLAlchemy in Python, and the database is SQL Server (I think this is not too relevant for the question, but I'm writing it for the sake of completeness).
At the moment, the code is "as found", the update consisting of the following steps:
Split the dataframe into many smaller dataframes (the number appears to be fixed and arbitrary; it does not depend on the dataframe size).
For each sub-dataframe, do an update query.
As it is, the process takes what (in my admittedly very limited SQL experience) appears to be too much time, about 1-2 hours. The table schema consists of 4 columns:
An id as the primary key
2 columns that are foreign keys (primary keys in their respective tables)
A 4th column
Questions
What can I do to make the code more efficient, and faster? Since the UPDATE is done in blocks, I'm unsure of whether the index is re-calculated every time (since the id value is not changed I don't know why that would be the case). I also don't know how the foreign key values (which could change for a given row) enter the complexity calculation.
At what point does it make sense, if at any, to insert all the rows into a new auxiliary table, re-calculate the index only at the end, truncate the original table and copy the auxiliary table into it? Are there any subtleties with this approach, with the indices, foreign keys, etc.?
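To make the second question concrete, here is a minimal sketch of a staging-table variant in SQL Server: bulk-load the dataframe into a staging table once, then apply a single set-based UPDATE instead of many chunked ones. The table and column names are invented to match the four-column schema described above:

-- Staging table mirroring the columns that need updating.
CREATE TABLE #staging (
    id    int PRIMARY KEY,
    fk_a  int,
    fk_b  int,
    value nvarchar(100)
);

-- (Bulk-load the dataframe into #staging here, e.g. with BULK INSERT
--  or pandas.DataFrame.to_sql.)

-- One set-based update keyed on the primary key.
UPDATE t
SET    t.fk_a  = s.fk_a,
       t.fk_b  = s.fk_b,
       t.value = s.value
FROM   dbo.target_table AS t
JOIN   #staging         AS s ON s.id = t.id;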
This question is about how to achieve the best possible performance with SQLite.
If you have a table with one column that you update very often and another column that you update very rarely, is it better to split that table into two tables with one column each (excluding the primary key), instead of one table with two columns (excluding the primary key)?
I have tried to find information about this in the SQLite documentation, but I have not been able to find an explanation of what exactly happens when updating one column of one row of a table. The closest answer to my question I found was this:
During part of SQLite's INSERT and SELECT processing, the complete content of each row in the database is encoded as a single BLOB. So the SQLITE_MAX_LENGTH parameter also determines the maximum number of bytes in a row.
(Quoted from here: https://www.sqlite.org/limits.html)
To me that sounds as if every row is internally stored as one big blob of all columns, and that would mean that updating a single column of the row will indeed lead to the whole row being internally re-encoded again, including all other columns of that row, even if they have not been modified as part of the UPDATE. But I am not sure if I understand that sentence I quoted correctly.
I am thinking about a case where you have one column that stores some big blob with multiple MB of size, and another column that stores some integer. The column with the big blob might only be updated once a month, while the column with the integer might be updated once per second. As far as I currently understand it based on that quote, it seems that updating the integer once per second will lead to the multiple MB of blob to be re-encoded again every time you update the integer, and that would be very inefficient. Having one table for the blob, and a different table for the integer would be a lot better then.
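For illustration, the split I have in mind would look something like this (the schema is hypothetical):

-- One table per update frequency, sharing the same primary key.
CREATE TABLE item_blob (
    id   INTEGER PRIMARY KEY,
    data BLOB               -- multi-MB payload, updated about once a month
);

CREATE TABLE item_counter (
    id      INTEGER PRIMARY KEY REFERENCES item_blob(id),
    counter INTEGER         -- small value, updated about once per second
);

-- Frequent updates then only touch the small row in item_counter.
UPDATE item_counter SET counter = counter + 1 WHERE id = 42;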
I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
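If keeping the original distribution and sort settings matters, they can be declared explicitly in the CTAS. A minimal sketch, with placeholder tables and columns standing in for Query A:

-- Declare distribution and sort keys so the copy does not fall back
-- to the EVEN default (t1, t2, id, col_a, col_b are placeholders).
CREATE TABLE tbl
DISTSTYLE ALL
SORTKEY (id)
AS
SELECT t1.id, t1.col_a, t2.col_b
FROM t1
LEFT JOIN t2 ON t2.id = t1.id;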
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the query by reading very little data from disk because it realized that very few columns of data (or perhaps no columns) were required to read from disk to display the count. This really depends upon the contents of that Query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all data that is returned by the query to disk, count query does not, that explains the difference. Writing all rows is more expensive operation compared to reading row count.
In my application, users can create custom tables with three column types, Text, Numeric and Date. They can have up to 20 columns. I create a SQL table based on their schema using nvarchar(430) for text, decimal(38,6) for numeric and datetime, along with an Identity Id column.
There is the potential for many of these tables to be created by different users, and the data might be updated frequently by users uploading new CSV files. To get the best performance during the upload of the user data, we truncate the table to get rid of existing data, and then do batches of BULK INSERT.
The user can make a selection based on a filter they build up, which can include any number of columns. My issue is that some tables with a lot of rows will have poor performance during this selection. To combat this I thought about adding indexes, but as we don't know what columns will be included in the WHERE condition we would have to index every column.
For example, on a local SQL Server, one table with just over a million rows and a WHERE condition on 6 of its columns takes around 8 seconds the first time the query runs, then under one second for subsequent runs. With indexes on every column it runs in under one second even the first time. This performance issue is amplified when we test on a SQL Azure database, where the same query takes over a minute the first time it's run and does not improve on subsequent runs, but with the indexes it takes 1 second.
So, would it be a suitable solution to add an index on every column when a user creates a column, or is there a better solution?
Yes, it's a good idea given your model. There will, of course, be more overhead maintaining the indexes on the insert, but if there is no predictable standard set of columns in the queries, you don't have a lot of choices.
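As a sketch of what that looks like for one generated table (the names are invented to match the schema described in the question):

-- Hypothetical user-defined table with an identity key and typed columns.
CREATE TABLE dbo.UserTable_42 (
    Id   int IDENTITY(1,1) PRIMARY KEY,
    Col1 nvarchar(430),
    Col2 decimal(38,6),
    Col3 datetime
);

-- One nonclustered index per user column, created when the column is created.
CREATE NONCLUSTERED INDEX IX_UserTable_42_Col1 ON dbo.UserTable_42 (Col1);
CREATE NONCLUSTERED INDEX IX_UserTable_42_Col2 ON dbo.UserTable_42 (Col2);
CREATE NONCLUSTERED INDEX IX_UserTable_42_Col3 ON dbo.UserTable_42 (Col3);

-- To keep the TRUNCATE + BULK INSERT reloads fast, the nonclustered indexes
-- can be disabled before the load and rebuilt afterwards, for example:
ALTER INDEX IX_UserTable_42_Col1 ON dbo.UserTable_42 DISABLE;
-- ... BULK INSERT here ...
ALTER INDEX IX_UserTable_42_Col1 ON dbo.UserTable_42 REBUILD;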
If by 'updated frequently' you mean that data is added frequently via uploads rather than existing records being modified, you might also consider one of the various non-SQL databases (like Apache Lucene or variants) which allow efficient querying on any combination of data. For reading massive 'flat' data sets, they are astonishingly fast.
I have a SQL Server table in production that has millions of rows, and it turns out that I need to add a column to it. Or, to be more accurate, I need to add a field to the entity that the table represents.
Syntactically this isn't a problem, and if the table didn't have so many rows and wasn't in production, this would be easy.
Really what I'm after is the course of action. There are plenty of websites out there with extremely large tables, and they must add fields from time to time. How do they do it without substantial downtime?
One thing I should add: I don't want the column to allow nulls, which means I'd need to have a default value.
So I either need to figure out how to add a column with a default value in a timely manner, or I need to figure out a way to update the column at a later time and then set the column to not allow nulls.
ALTER TABLE table1 ADD
newcolumn int NULL
GO
should not take that long... What takes a long time is inserting a column in the middle of other columns, because then the engine needs to create a new table and copy the data over to it.
I did not want the column to allow nulls, which would mean that I'd need to have a default value.
Adding a NOT NULL column with a DEFAULT Constraint to a table of any number of rows (even billions) became a lot easier starting in SQL Server 2012 (but only for Enterprise Edition) as they allowed it to be an Online operation (in most cases) where, for existing rows, the value will be read from meta-data and not actually stored in the row until the row is updated, or clustered index is rebuilt. Rather than paraphrase any more, here is the relevant section from the MSDN page for ALTER TABLE:
Adding NOT NULL Columns as an Online Operation
Starting with SQL Server 2012 Enterprise Edition, adding a NOT NULL column with a default value is an online operation when the default value is a runtime constant. This means that the operation is completed almost instantaneously regardless of the number of rows in the table. This is because the existing rows in the table are not updated during the operation; instead, the default value is stored only in the metadata of the table and the value is looked up as needed in queries that access these rows. This behavior is automatic; no additional syntax is required to implement the online operation beyond the ADD COLUMN syntax. A runtime constant is an expression that produces the same value at runtime for each row in the table regardless of its determinism. For example, the constant expression "My temporary data", or the system function GETUTCDATETIME() are runtime constants. In contrast, the functions NEWID() or NEWSEQUENTIALID() are not runtime constants because a unique value is produced for each row in the table. Adding a NOT NULL column with a default value that is not a runtime constant is always performed offline and an exclusive (SCH-M) lock is acquired for the duration of the operation.
While the existing rows reference the value stored in metadata, the default value is stored on the row for any new rows that are inserted and do not specify another value for the column. The default value stored in metadata is moved to an existing row when the row is updated (even if the actual column is not specified in the UPDATE statement), or if the table or clustered index is rebuilt.
Columns of type varchar(max), nvarchar(max), varbinary(max), xml, text, ntext, image, hierarchyid, geometry, geography, or CLR UDTs, cannot be added in an online operation. A column cannot be added online if doing so causes the maximum possible row size to exceed the 8,060 byte limit. The column is added as an offline operation in this case.
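For example, on SQL Server 2012+ Enterprise Edition an ALTER along these lines takes the metadata-only path described above (the table and constraint names are illustrative):

-- Completes almost instantly even on a huge table: the default 0 is kept
-- in metadata rather than being written to every existing row.
ALTER TABLE dbo.BigTable
    ADD Flag int NOT NULL
    CONSTRAINT DF_BigTable_Flag DEFAULT (0);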
The only real solution for continuous uptime is redundancy.
I acknowledge @Nestor's answer that adding a new column shouldn't take long in SQL Server, but nevertheless, it could still be an outage that is not acceptable on a production system. An alternative is to make the change in a parallel system, and then once the operation is complete, swap the new for the old.
For example, if you need to add a column, you may create a copy of the table, then add the column to that copy, and then use sp_rename() to move the old table aside and the new table into place.
If you have referential integrity constraints pointing to this table, this can make the swap even more tricky. You probably have to drop the constraints briefly as you swap the tables.
For some kinds of complex upgrades, you could completely duplicate the database on a separate server host. Once that's ready, just swap the DNS entries for the two servers and voilĂ !
I supported a stock exchange company in the 1990's who ran three duplicate database servers at all times. That way they could implement upgrades on one server, while retaining one production server and one failover server. Their operations had a standard procedure of rotating the three machines through production, failover, and maintenance roles every day. When they needed to upgrade hardware, software, or alter the database schema, it took three days to propagate the change through their servers, but they could do it with no interruption in service. All thanks to redundancy.
"Add the column and then perform relatively small UPDATE batches to populate the column with a default value. That should prevent any noticeable slowdowns"
And after that you have to set the column to NOT NULL, which fires off in one big transaction. So everything runs really fast until you do that, and you have probably gained very little. I only know this from first-hand experience.
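For reference, the pattern under discussion looks roughly like this (hypothetical names); the final ALTER COLUMN is the step that runs as one big operation:

-- 1. Add the column as nullable (fast, metadata-only).
ALTER TABLE dbo.BigTable ADD NewCol int NULL;

-- 2. Backfill in small batches to keep each transaction short.
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.BigTable
    SET    NewCol = 0
    WHERE  NewCol IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;

-- 3. The step criticised above: this has to validate every row in one go.
ALTER TABLE dbo.BigTable ALTER COLUMN NewCol int NOT NULL;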
You might want to rename the current table from X to Y. You can do this with this command sp_RENAME '[OldTableName]' , '[NewTableName]'.
Recreate the new table as X with the new column set to NOT NULL, and then batch insert from Y to X, supplying a default value for the new column either in your insert or via a default constraint when you recreate table X.
I have done this type of change on a table with hundreds of millions of rows. It still took over an hour, but it didn't blow out our trans log. When I tried to just change the column to NOT NULL with all the data in the table it took over 20 hours before I killed the process.
Have you tested just adding a column filling it with data and setting the column to NOT NULL?
So in the end I don't think there's a magic bullet.
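A sketch of that rename-and-recreate approach, with a hypothetical two-column schema and range-based batches on the identity key:

-- 1. Move the existing table aside.
EXEC sp_rename 'dbo.BigTable', 'BigTable_old';

-- 2. Recreate it with the new NOT NULL column (schema abbreviated).
CREATE TABLE dbo.BigTable (
    Id      int NOT NULL PRIMARY KEY,
    SomeCol nvarchar(100) NULL,
    NewCol  int NOT NULL CONSTRAINT DF_BigTable_NewCol DEFAULT (0)
);

-- 3. Copy the data back in batches keyed on ranges of Id
--    (assumes Id is reasonably dense; large gaps just produce empty batches).
DECLARE @from int = 0, @batch int = 100000;
WHILE EXISTS (SELECT 1 FROM dbo.BigTable_old WHERE Id > @from)
BEGIN
    INSERT INTO dbo.BigTable (Id, SomeCol, NewCol)
    SELECT Id, SomeCol, 0
    FROM   dbo.BigTable_old
    WHERE  Id > @from AND Id <= @from + @batch;

    SET @from = @from + @batch;
END;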
Select into a new table and rename. Example: adding column i to table A_tbl:
select *, 1 as i
into A_tmp
from A_tbl
-- Add any indexes here
exec sp_rename 'A_tbl', 'A_old'
exec sp_rename 'A_tmp', 'A_tbl'
Should be fast and won't touch your transaction log like inserting in batches might.
(I just did this today w/ a 70 million row table in < 2 min).
You can wrap it in a transaction if you need it to be an online operation (something might change in the table between the select into and the renames).
Another technique is to add the column to a new related table (assume a one-to-one relationship, which you can enforce by giving the FK a unique index). You can then populate this in batches and add the join to this table wherever you want the data to appear. Note that I would only consider this for a column that I would not want to use in every query on the original table, or if the record width of my original table was getting too large, or if I was adding several columns.
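A rough sketch of that side-table approach (all names invented):

-- The new column lives in a table keyed 1:1 to the original rows.
CREATE TABLE dbo.BigTable_Extra (
    BigTableId int NOT NULL,
    NewCol     int NOT NULL,
    CONSTRAINT FK_BigTable_Extra FOREIGN KEY (BigTableId) REFERENCES dbo.BigTable (Id),
    CONSTRAINT UQ_BigTable_Extra UNIQUE (BigTableId)   -- enforces one-to-one
);

-- Populate in batches, then join it in only where the new data is needed.
SELECT b.Id, b.SomeCol, e.NewCol
FROM   dbo.BigTable b
LEFT JOIN dbo.BigTable_Extra e ON e.BigTableId = b.Id;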