AWS RDS Postgres maximum instance and table size

I am planning to use AWS RDS Postgres for pretty big data (> ~50 TB), but I have a couple of unanswered questions. Thanks in advance!
Is 16 TB the maximum size for an AWS RDS Postgres instance? If so, how do people store more than 16 TB of data?
Is the 16 TB RDS limit the maximum database size, post compression, that Postgres can store on AWS?
Also, I do not see any option to enable compression while setting up an AWS RDS Postgres DB instance. How do I enable compression in AWS RDS Postgres?
I have followed :
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html
https://blog.2ndquadrant.com/postgresql-maximum-table-size/ (wherein a Postgres table can have a size greater than 32 TB).
https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F

In addition to RDS for PostgreSQL, which has a 32 TiB limit, you should take a look at Amazon Aurora PostgreSQL, which has a 64 TiB limit. In both cases, the largest single table you can create is 32 TiB, though you can't quite reach that size in RDS for PostgreSQL as some of the space will be taken up by the system catalog.
Full disclosure: I am the product manager for Aurora PostgreSQL at AWS.

As of 2019/02/11, Amazon's docs state that the maximum database size for a Postgres RDS instance is 32 TiB.
Additionally, this 32 TiB limit appears to be a hard limit. (Some AWS limits are considered 'soft', in that a user can request that they be raised.)
As others have suggested, one option would be to manage your own database; however, this is not likely to be easy given the scale of data we are talking about here.
Staying within the AWS ecosystem, another option would be to store all your data in S3 and use AWS Athena to run queries, although depending on what sort of queries you run, it could get quite expensive.
As for your question about compression: if you are storing data in S3, you can compress it before you upload. You might also find this answer helpful.
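For instance, a minimal sketch of compressing JSON before upload (the sample records are hypothetical, the actual S3 upload call is omitted, and gzip is just one choice of codec):

```python
import gzip
import json

# Hypothetical sample records; in practice this would be your export.
records = [{"id": i, "value": "some text " * 10} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")

# Compress before upload; gzip typically shrinks repetitive JSON a lot.
# With boto3 you would then pass `compressed` as the Body of put_object.
compressed = gzip.compress(raw)
print(len(compressed) < len(raw))  # True for repetitive data like this

# To read it back (e.g. after downloading from S3):
restored = json.loads(gzip.decompress(compressed))
assert restored == records
```

Athena can also read gzip-compressed JSON objects directly, so compressing before upload reduces both storage and scan costs.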

The RDS limit has been revised to 64 TiB:
MariaDB, MySQL, Oracle, and PostgreSQL database instances: 20 GiB–64 TiB
SQL Server for Enterprise, Standard, Web, and Express editions: 20 GiB–16 TiB

Related

postgres rds slow response time

We have an AWS RDS Postgres DB of type t3.micro.
I am running simple queries on a pretty small table and I get pretty high response times, around 2 seconds per query.
Query example:
select * from package where id='late-night';
The CPU usage is not high (around 5%).
We tried creating a bigger RDS DB (t3.medium) from a snapshot of the original one, and the performance did not improve at all.
Table size: 2600 rows
We tested the connection with both the external IP and the internal IP.
Disk size: 20 GiB
Storage type: SSD
Is there a way to improve performance?
Thanks for the help!

why does AWS Athena need a 'spill-bucket' when it dumps results in the target S3 location

Why does AWS Athena need a 'spill-bucket' when it dumps results in the target S3 location?
WITH
( format = 'Parquet',
  parquet_compression = 'SNAPPY',
  external_location = 's3://target_bucket_name/my_data'
)
AS
WITH my_data_2
AS
(SELECT * FROM "existing_tablegenerated_data" LIMIT 10)
SELECT *
FROM my_data_2;
Since it already has a bucket to store the data in, why does Athena need the spill-bucket, and what does it store there?
Trino/Presto developer here who was directly involved in Spill development.
In Trino (formerly known as Presto SQL), the term "spill" refers to dumping to disk data that does not fit in memory. It is an opt-in feature that allows you to process larger queries. Of course, if all your queries require spilling, it is more efficient to simply provision a bigger cluster with more memory, but the functionality is useful when larger queries are rare.
Spilling involves saving temporary data, not the final query results. The spilled data is re-read back and deleted before the query completes execution.
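As a rough illustration of the spill idea (this is a sketch, not Trino's actual implementation), here is an external sort that writes sorted chunks to temporary files when the working set exceeds a memory budget, then merges them back:

```python
import heapq
import itertools
import pickle
import tempfile

def spill_sort(values, max_in_memory=1000):
    """Sort an iterable that may not fit in memory by spilling sorted
    chunks to temporary files, then merging them back.
    (Illustrative sketch of the 'spill to disk' idea only.)"""
    chunks = []
    it = iter(values)
    while True:
        # Take at most max_in_memory items and sort them in memory.
        chunk = sorted(itertools.islice(it, max_in_memory))
        if not chunk:
            break
        # Spill the sorted chunk to a temporary file on disk.
        f = tempfile.TemporaryFile()
        pickle.dump(chunk, f)
        f.seek(0)
        chunks.append(f)
    # Re-read the spilled runs and merge them; the temporary files
    # (like Trino's spill data) are deleted before the result is returned.
    runs = [pickle.load(f) for f in chunks]
    for f in chunks:
        f.close()
    return list(heapq.merge(*runs))
```

As in Trino, the spilled data is intermediate state, not the final result, and it is gone once the "query" completes.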
Athena uses Lambda functions to connect to external Hive data stores.
Because of the limit on Lambda function response sizes, responses larger than the threshold spill into an Amazon S3 location that you specify when you create your Lambda function. Athena reads these responses from Amazon S3 directly.
https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html

ERROR : FAILED: Error in acquiring locks: Error communicating with the metastore org.apache.hadoop.hive.ql.lockmgr.LockException

I am getting "Error in acquiring locks" when trying to run count(*) on partitioned tables.
The table has 365 partitions; when the query is filtered to <= 350 partitions, it works fine.
When I include more partitions in the query, it fails with the error above.
I am working on Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true // cannot set this to false; it throws "<table> is missing from the ValidWriteIdList config: null"; it should be true for ACID reads and writes
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
I tried increasing/decreasing the values of the following settings in a beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.lock.query.string.max.length={default 10000} //changed to higher value
I am using the HDI-4.0 interactive-query-llap cluster; the metastore is backed by the default SQL Server provided with it.
The problem is NOT due to the service tier of the Hive metastore database.
It is most probably due to too many partitions in one query, based on the symptom.
I have met the same issue several times.
In hivemetastore.log, you should be able to see this error:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This is because in the Hive metastore, each partition involved in a Hive query requires at most 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries that read from fewer partitions.
Reduce the number of partitions by setting different partition keys.
Remove partitioning if partition keys don't have any filters.
The following parameters manage the batch size for the INSERT queries generated by direct SQL. Their default value is 1000. Set both of them to 100 (as a good starting point) in the Custom hive-site section of the Hive configs via Ambari and restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
We also faced the same error in HDInsight, and after making many configuration changes similar to the ones you have made, the only thing that worked was scaling up our Hive metastore SQL DB server.
We had to scale it all the way to a P2 tier with 250 DTUs for our workloads to work without these lock exceptions. As you may know, with a higher tier and DTU count, the SQL server's IOPS and response time improve, so we suspected that metastore performance was the root cause of these lock exceptions as workloads increased.
Following link provides information about the DTU based performance variation in SQL servers in Azure.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally, as far as I know, the default Hive metastore that gets provisioned when you opt not to provide an external DB at cluster creation is just an S1-tier DB, which is not suitable for any high-capacity workload. As a best practice, always provision your metastore external to the cluster and attach it at cluster provisioning time. This gives you the flexibility to connect the same metastore to multiple clusters (so your Hive-layer schema can be shared across clusters, e.g. Hadoop for ETL and Spark for processing / machine learning), and you have full control to scale your metastore up or down as needed at any time.
The only way to scale the default metastore is by engaging the Microsoft support.
We faced the same issue in HDInsight. We solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments. So we migrated to a custom metastore, spun up an Azure SQL Server (P2, above 250 DTUs), and set the following properties:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
The above values are set because SQL Server cannot process more than 2100 parameters. When you have more than 348 partitions you face this issue, as each partition creates up to 8 parameters for the metastore (8 x 348 = 2784, which exceeds 2100).
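Since the 8-parameters-per-partition figure is an upper bound, a back-of-the-envelope check of when a single lock request would exceed SQL Server's 2100-parameter cap might look like this (the function names are ours, just for illustration):

```python
# Worst-case estimate: up to 8 SQL parameters per partition per lock
# request, against SQL Server's 2100-bound-parameter cap.
SQLSERVER_MAX_PARAMS = 2100
PARAMS_PER_PARTITION = 8  # upper bound per the metastore lock query

def worst_case_params(num_partitions: int) -> int:
    """Worst-case number of bound parameters for one lock request."""
    return num_partitions * PARAMS_PER_PARTITION

def fits(num_partitions: int) -> bool:
    """Does a query touching this many partitions stay under the cap?"""
    return worst_case_params(num_partitions) <= SQLSERVER_MAX_PARAMS

print(fits(262))  # True  -> 2096 parameters, under the cap
print(fits(365))  # False -> 2920 parameters, over the cap
```

In the worst case the cap is hit at around 263 partitions; in practice fewer parameters per partition may be used, which is consistent with queries over ~350 partitions failing while smaller ones succeed.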

Redshift Spectrum much slower than Athena?

Our data is stored in S3 as JSON, without partitions. Until today we were using only Athena, but now we have tried Redshift Spectrum.
We are running the same query twice: once using Redshift Spectrum and once using Athena. Both connect to the same data in S3.
Using Redshift Spectrum, this report takes forever (more than 15 minutes) to run; using Athena, it takes only 10 seconds.
The query that we are running in both cases in aws console is this:
SELECT "events"."persistentid" AS "persistentid",
SUM(1) AS "sum_number_of_reco"
FROM "analytics"."events" "events"
GROUP BY "events"."persistentid"
Any idea what's going on?
Thanks
Redshift Spectrum's processing power is limited by the Redshift cluster size.
You can find more information in Improving Amazon Redshift Spectrum Query Performance:
The Amazon Redshift query planner pushes predicates and aggregations
to the Redshift Spectrum query layer whenever possible. When large
amounts of data are returned from Amazon S3, the processing is limited
by your cluster's resources. Redshift Spectrum scales automatically to
process large requests. Thus, your overall performance improves
whenever you can push processing to the Redshift Spectrum layer.
On the other hand, Athena uses an optimized amount of resources for the query, which may be more than the Spectrum layer of a small Redshift cluster can get.
This has been confirmed by our testing of Redshift Spectrum performance with different Redshift cluster sizes.

How to make duplicate a postgres database on the same RDS instance faster?

Thank you guys in advance.
I have a 60 GB Postgres RDS instance on AWS with a database, databaseA, inside it, and I want to make a duplicate of databaseA, called databaseB, on the same RDS server.
So basically what I tried is to run CREATE DATABASE databaseB WITH TEMPLATE databaseA OWNER postgres; This single query took 6 hours to complete, which is too slow. I see the max IOPS during the process is 120, not even close to AWS General Purpose SSD's limit of 10,000 IOPS. I have also tried tuning up work_mem, shared_buffers, and effective_cache_size in the parameter group; there is no improvement at all.
My last option is to just create two separate RDS instances, but it would be much easier if I could do this in one instance. I'd appreciate any suggestions.
(The instance class is db.m4.xlarge.)
As mentioned by Matt, you have two options:
Increase your server size which will give you more IOPS.
Switch to provisioned IOPS
As this is a temporary requirement, I would go with option 1, because you can upgrade to the maximum available server, do the database copy, then downgrade the DB server seamlessly, and it won't take much time. Switching from SSD to provisioned IOPS will take a lot of time because it needs to convert your data, and hence means more downtime. And later, when you switch back from provisioned IOPS to SSD, it will again take time.
Note that both 1 and 2 are expensive if used long term (if you don't really need them), so you can't leave it as is.