Table size in AWS-Athena - size

Is there SQL-based way to retrieve the size of all tables within a database in AWS-Athena?
I'm more familiar with MSSQL and there it is relatively easy to write such query.

The quick way is via S3: ... > Show Properties > Location, then look up the size in the S3 console.
Explainer
You can run SELECT * FROM some_table for each table and look at the result metadata for the amount scanned, but it will be an expensive way to do it.
Athena doesn't really know about the data in your tables the way an RDBMS does; it's only when you query a table that Athena goes out to look at the data. It's really S3 that you should ask. You can list all objects in the location(s) of your tables and sum their sizes, but that might be a time-consuming way of doing it if there are many objects.
The least expensive, and least time consuming way, when there are many hundreds of thousands of objects, is to enable S3 Inventory on the bucket that contains the data for your tables, then use the inventory to sum up the sizes for each table. You can get the inventory in CSV, ORC, or Parquet format, and they all work well with Athena – so if you have a lot of files in your bucket you can still query the inventory very efficiently.
You can read more about S3 Inventory here: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
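As a rough illustration of "use the inventory to sum up the sizes for each table", here is a minimal Python sketch. It assumes the inventory has already been parsed into (key, size in bytes) pairs and that you know each table's key prefix; the function name, table names, and prefixes are hypothetical, and reading the actual inventory files is left out.

```python
from collections import defaultdict

def sum_sizes_by_table(inventory_rows, table_prefixes):
    """Sum object sizes per table.

    inventory_rows: iterable of (key, size_in_bytes) pairs taken from an
        S3 Inventory report (parsing the report is not shown here).
    table_prefixes: dict mapping a table name to the key prefix of its
        LOCATION, e.g. {"orders": "warehouse/orders/"} (hypothetical).
    """
    totals = defaultdict(int)
    for key, size in inventory_rows:
        for table, prefix in table_prefixes.items():
            if key.startswith(prefix):
                totals[table] += int(size)
                break  # an object belongs to at most one table here
    return dict(totals)

# Hypothetical sample data standing in for inventory rows.
rows = [
    ("warehouse/orders/part-0000.parquet", 1000),
    ("warehouse/orders/part-0001.parquet", 2000),
    ("warehouse/customers/part-0000.parquet", 500),
    ("logs/2019/06/17/access.log", 42),  # not under any table prefix
]
prefixes = {"orders": "warehouse/orders/", "customers": "warehouse/customers/"}
totals = sum_sizes_by_table(rows, prefixes)
print(totals)
```

Ending each prefix with a slash keeps a table such as orders from also matching a sibling such as orders_archive.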

Related

BigQuery's hive partitioned table performance

I am trying to take advantage of the hive partitioned table. I have encountered the problem that retrieving Parquet files directly from GCS is several times faster than retrieving the same data using the hive partitioned external table.
My data is stored in parquet format in the following structure:
gs://mybucket/dataset/dt=2019-06-17/h=5/m=0/000
gs://mybucket/dataset/dt=2019-06-17/h=5/m=0/001
gs://mybucket/dataset/dt=2019-06-17/h=5/m=0/...
"h" stands for an hour, and "m" stands for a minute. "mybucket" is on region "us-central1".
Querying parquet files directly takes 3 seconds:
bq --project_id=chronosphere-production --location us-central1 query --nouse_cache --use_legacy_sql=false --external_table_definition='trace::PARQUET=gs://mybucket/dataset/dt=2019-06-17/h=5/*' "SELECT name, count(*) as c FROM trace GROUP BY name ORDER BY c DESC LIMIT 20"
The other query, which runs on the same data but using hive partitioned table where hive url is gs://mybucket/dataset/{dt:DATE}/{h:INTEGER}/{m:INTEGER} takes 12 seconds:
bq --location us-central1 query --nouse_cache --use_legacy_sql=false "SELECT name, count(*) as c FROM \`dataset.hive_table\` WHERE dt='2019-06-17' AND h=5 GROUP BY name ORDER BY c DESC LIMIT 20"
Both queries scan the same amount of data/rows and return the same result, but the response time difference is huge. Any ideas what could be the reason for such a big difference?
BTW if I create a non-hive partitioned table that points to gs://mybucket/dataset/dt=2019-06-17/h=5, it performs as well as querying parquet files directly. I think it's ok as this is temporary table vs permanent table performance.
Any help would be very appreciated.
EDIT:
It feels like it is related to file count, but I'm still not sure what the root cause is or whether it's possible to solve it.
Here are some folder/file count numbers:
dt=* folder count = 3
h=* folder count per dt folder = 24
m=* folder count per h folder = 60
files per m folder ~40
My query scans ~32M rows / 500 MB of data.
I assume that when I provide a filter like WHERE dt='2019-06-17' AND h=5, BigQuery should go directly to gs://mybucket/dataset/dt=2019-06-17/h=5/ and start searching for files from there, but it feels that's not what it does.
I'm guessing there's a significant number of files in the bucket, as this is largely a comparison of cloud storage object listing performance. It's not clear from the description how many objects are involved, and how they're distributed across your partitioning scheme. Is that a typical date, or an unusual one in terms of data distribution?
In the hive partitioned case, BigQuery must get the larger list of bucket objects (e.g. gs://mybucket/dataset/*) and then filter it, whereas in the non-hive cases you describe you're effectively pushing a more targeted filter to Cloud Storage's list operations (e.g. gs://mybucket/dataset/dt=2019-06-17/h=5/*).
The calculus here is whether the performance implications outweigh other factors like convenience/manageability/etc. There's likely a middle ground to consider as well, which would be to experiment using more of the dt prefix for defining yearly/monthly tables to see if you get a more satisfactory performance tradeoff.
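To put rough numbers on the listing difference, here is a small Python calculation using the folder and file counts from the question's EDIT. It assumes listing cost grows with the number of objects under the listed prefix, which is a simplification for illustration and not a description of BigQuery internals.

```python
# Folder/file counts taken from the question's EDIT section.
DT_FOLDERS = 3     # dt=* folders
H_PER_DT = 24      # h=* folders per dt folder
M_PER_H = 60       # m=* folders per h folder
FILES_PER_M = 40   # approximate files per m folder

def objects_under(dt_folders, h_folders, m_folders, files_per_m):
    """Approximate object count under a prefix covering the given folders."""
    return dt_folders * h_folders * m_folders * files_per_m

# Listing gs://mybucket/dataset/* (hive partitioned case).
whole_dataset = objects_under(DT_FOLDERS, H_PER_DT, M_PER_H, FILES_PER_M)
# Listing gs://mybucket/dataset/dt=2019-06-17/h=5/* (targeted prefix).
single_hour = objects_under(1, 1, M_PER_H, FILES_PER_M)

print(whole_dataset)                 # 172800
print(single_hour)                   # 2400
print(whole_dataset // single_hour)  # 72
```

Under these assumptions the hive partitioned query has about 72 times as many objects to list and filter, and that ratio grows with every additional dt folder.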

What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Our team drops parquet files on blob, and one of their main usages is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.
We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:
Cell1:
%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument('JobId')}/results_data.parquet"
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")
The cell following this can now start doing SQL queries. e.g.:
SELECT * FROM RESULTS LIMIT 5
This takes a bit of time to get up, but not too much. I'm concerned about two things:
Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results dataframe, which is only used to create the RESULTS temp table?
Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?
For your first question:
Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.
As a quick example, you can create a table using SQL or Python:
# SQL
CREATE TABLE <example-table>(id STRING, value STRING)
# Python
dataframe.write.saveAsTable("<example-table>")
Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.
# SQL
SELECT * FROM <example-table>
# Python
spark.sql("SELECT * FROM <example-table>")
For your second question:
Performance depends on multiple factors but in general, here are some tips.
If your tables are large (tens, hundreds of GB at least), you can partition by a predicate commonly used by your analysts to filter data. For example, if you typically include a WHERE clause that includes a date range or state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark and ZORDER improves data skipping.
Choose Delta Cache Accelerated instance types for the cluster that your analysts will be working on.
I know you said there's no need to persist beyond the notebook but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.
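For the notebook-scoped case described in the question, one option that is not spelled out above is to cache the DataFrame before registering the temp view, so repeated SQL queries do not all go back to blob storage. A minimal sketch, assuming `spark` is the SparkSession that the notebook provides; the function name is hypothetical, and whether caching helps depends on how the data size compares with cluster memory.

```python
def register_parquet_view(spark, path, view_name="RESULTS", cache=True):
    """Read a Parquet path and expose it to SQL cells as a temp view.

    `spark` is expected to be a SparkSession; it is passed in so the
    function does not depend on a notebook global.
    """
    df = spark.read.parquet(path)
    if cache:
        # cache() is lazy: the data is kept only after the first action
        # that actually reads it.
        df = df.cache()
    df.createOrReplaceTempView(view_name)
    return df
```

In the notebook this would be called as register_parquet_view(spark, results_path), after which SELECT * FROM RESULTS works as before. The temp view and the cache both disappear with the Spark session, which matches the "no need to persist beyond this notebook" requirement.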

How to relationalize, join and aggregate multiple files from S3

I have a bucket in S3, containing hundreds of folders, each contains files with the same structure, which are csv representation of relational db tables. The different folders differ by content of the data, but overlapping might occur.
In each folder, I want to join 3 tables, and store the output in a dedicated table. The dedicated table should eventually hold joined data from all different folders. Duplications might occur between different folders but the records have a unique key that can help with the aggregation.
The data size for a specific folder, across all its files, can reach 5 GB of disk space. 2 of the files contain hundreds of thousands of records. The third file can reach up to 20M records.
The result should be stored in AWS RDS, on a postgresql instance. However, I am considering switching to Redshift. Will it be better for this scale of data?
The 3 tables are:
Providers
Consumers
Transactions
All of them are indexed by the key which is used in the join.
My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the db. Then create the joined table from the 3 tables using SQL, and finally add the joined data to the aggregated table that should contain the data from all folders.
I am currently trying to handle 1 folder, in order to understand better how to make the process optimal, both in time and space.
After loading I noticed the db uses around 2X the disk space that I expected. Why does joining cost so much disk space? Is there a way of loading and joining with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried to use CREATE UNLOGGED TABLE but it didn't have much effect.
CREATE UNLOGGED TABLE agg_data AS SELECT * FROM
transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);
This works ok for 1 folder, time-wise. It does take a lot more disk space than I assumed it would.
How will this work at mass scale, for hundreds of folders? How will the aggregation behave over time, as I will need to search for duplicated records in a continuously growing table?
To summarize my questions:
How to choose between RDS and Redshift? My concerns are tens of millions of records in the target table, and the need to drop duplicates while adding new data to the target table.
Why does joining data take so much db storage? Is there a way to minimize it, for data that is temporary?
What is an efficient way of inserting new data into the destination table while dropping duplicates?
Will it be better to join and store the files in S3 using AWS Glue, and then load them to the target db? Currently it does not seem like an option, as Glue takes forever to join the data.
I would recommend using Amazon Athena to join the files and produce the desired output.
First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing at the folder. All files within the folder will be treated as containing data for the table and they should all be of the same format.
If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.
Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.
See: Creating a Table from Query Results (CTAS) - Amazon Athena
Glue can also be used for the join itself (see: Program AWS Glue ETL Scripts in Python - AWS Glue), but I haven't tried this so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.
Once you have the output data, you can then load it into the database of your choice. Which database you choose depends upon your use-case. I would suggest starting with Amazon RDS for PostgreSQL since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (eg billions of rows instead of millions), you can move to Amazon Redshift.
General comment: It is rather strange that you wish to join those 3 tables since there will presumably be a lot of duplicated data (very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective as to which columns you wish to include.
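To make the "join in the database, selecting only the columns you need" idea concrete, and to show one way of skipping duplicates on a unique key, here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in for PostgreSQL; the column names, the transaction_id key, and the sample rows are assumptions for illustration. SQLite's INSERT OR IGNORE plays the role that INSERT ... ON CONFLICT DO NOTHING would play in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE providers (provider_id INTEGER PRIMARY KEY, provider_name TEXT);
CREATE TABLE consumers (consumer_id INTEGER PRIMARY KEY, consumer_name TEXT);
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    consumer_id INTEGER,
    provider_id INTEGER,
    amount REAL
);
-- Target table: the primary key rejects rows already loaded from another folder.
CREATE TABLE agg_data (
    transaction_id INTEGER PRIMARY KEY,
    consumer_name TEXT,
    provider_name TEXT,
    amount REAL
);
""")

def load_folder(conn, providers, consumers, transactions):
    """Stage one folder's rows, then append the joined rows to agg_data."""
    for table in ("providers", "consumers", "transactions"):
        conn.execute(f"DELETE FROM {table}")  # staging data is short-lived
    conn.executemany("INSERT INTO providers VALUES (?, ?)", providers)
    conn.executemany("INSERT INTO consumers VALUES (?, ?)", consumers)
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", transactions)
    # Select only the needed columns instead of SELECT *, and let the
    # unique key silently skip rows that are already in the target.
    conn.execute("""
        INSERT OR IGNORE INTO agg_data
        SELECT t.transaction_id, c.consumer_name, p.provider_name, t.amount
        FROM transactions t
        JOIN consumers c USING (consumer_id)
        JOIN providers p USING (provider_id)
    """)
    conn.commit()

# Two hypothetical folders that overlap on transaction 101.
load_folder(conn, [(1, "Acme")], [(10, "Ann")],
            [(100, 10, 1, 5.0), (101, 10, 1, 7.5)])
load_folder(conn, [(1, "Acme")], [(10, "Ann")],
            [(101, 10, 1, 7.5), (102, 10, 1, 1.0)])
row_count = conn.execute("SELECT COUNT(*) FROM agg_data").fetchone()[0]
print(row_count)  # 3
```

Selecting explicit columns instead of SELECT * also avoids copying every column of all three tables into each joined row, which is one plausible contributor to the extra disk usage mentioned in the question.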

Is partitioning helpful in Amazon Athena if query doesn't filter based on partition?

I have a large amount of data, but there is no particular column I would like to filter based on (that is, my 'where clause' can be any column). In this scenario, does partitioning provide any benefit (maybe helps with read-parallelism?) when the queries end up scanning all the data?
If there is no column that all, or most, queries would filter on, then partitions will only hurt performance. Instead aim for files around 100 MB, as few as possible, Parquet if possible, and put all files directly under the table's LOCATION.
The reason why partitions would hurt performance is that when Athena starts executing your query it will list all files, and the way it does it is as if S3 was a file system. It starts by listing the table's LOCATION, and if it finds anything that looks like a directory it will list it separately, and so on, recursively. If you have a deep directory structure this can end up taking a lot of time. You want to help Athena by having all your files in a flat structure, but also fewer than 1000 of them, because that's the page size for S3's list operation. With more than 1000 files you want to have directories so that Athena can parallelize the listing (but as few as possible still, because there's a limit to how many listings it will do in parallel).
You want to keep file sizes to around 100 MB because that's a good size that trades off how long it takes to process a file against the overhead of getting it from S3. The exact recommendation is 128 MB.
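As a back-of-the-envelope aid for the sizing advice above, here is a small Python sketch that estimates how many files of roughly 128 MB a data set needs and how many list pages of 1000 keys that implies. It treats 128 MB as 128 MiB and assumes one flat directory; both are simplifications, and the function names are hypothetical.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # the 128 MB figure, taken as MiB
S3_LIST_PAGE_SIZE = 1000               # keys per list page, per the answer

def target_file_count(total_bytes):
    """Number of ~128 MB files needed to hold total_bytes."""
    return max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

def list_pages(file_count):
    """Number of list pages needed for file_count keys in one flat prefix."""
    return max(1, math.ceil(file_count / S3_LIST_PAGE_SIZE))

for gib in (10, 100, 500):
    files = target_file_count(gib * 1024**3)
    print(gib, "GiB ->", files, "files,", list_pages(files), "list page(s)")
```

Under these assumptions a table only outgrows a single list page somewhere above 125 GiB, which is where splitting the files across a few directories starts to make sense.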

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop and we want to keep a record of how many views these products had over a 1-year period. We want to record the views at least every 2 hours. The question is: what structure should we use for this task?
Right now we tried keeping stats for 30 days back in records that have 2 columns classified_id,stats where stats is like a stripped json with format date:views,date:views... for example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back, since if you want to get the sum of views of, say, 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is just to have a massive table with 3 columns classified_id,date,view and store each recording on its own row. This of course will result in a huge table with hundreds of millions of rows. For example, if we have 1.8 million classifieds and keep records 24/7 for one year every 2 hours we need
1800000*365*12=7.884.000.000 (billions with a B) rows, which, while it is well inside the theoretical limit of postgres, I imagine the queries on it (say for updating the views), even with the correct indices, will be taking some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In my current job we store metrics data for websites and the total number of rows we have is much higher. And in a previous job I worked with a pg database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data, most probably by day. With this amount of data you may find indexes quite useless. It depends on the plans you will see in the EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
Another question is how quickly you will need queries to respond, and which granularity steps (sums over hours/days/weeks etc) you will allow users to query. You may even need to make some aggregations for granularities like week or month or quarter.
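As one possible shape for "partition by day", here is a minimal Python sketch that generates the DDL for daily range partitions. It assumes PostgreSQL 10 or later with declarative partitioning, which the answer does not specify; the table name classified_views and the recorded_at and views columns are hypothetical, with classified_id taken from the question.

```python
from datetime import date, timedelta

# Parent table, partitioned by the time a view count was recorded.
PARENT_DDL = """CREATE TABLE classified_views (
    classified_id integer NOT NULL,
    recorded_at   timestamptz NOT NULL,
    views         integer NOT NULL
) PARTITION BY RANGE (recorded_at);"""

def daily_partition_ddl(day):
    """Return the CREATE TABLE statement for the partition covering `day`."""
    next_day = day + timedelta(days=1)
    return (
        f"CREATE TABLE classified_views_{day:%Y_%m_%d} "
        f"PARTITION OF classified_views "
        f"FOR VALUES FROM ('{day.isoformat()}') TO ('{next_day.isoformat()}');"
    )

print(PARENT_DDL)
for offset in range(3):
    print(daily_partition_ddl(date(2016, 12, 5) + timedelta(days=offset)))
```

With one partition per day, dropping data older than a year becomes a DROP TABLE on the oldest partition rather than a large DELETE.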
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserts of ~23000 records per second using bulk inserts with the COPY command. Every batch was several thousand records. Raw data were partitioned by minute. To avoid disk waits the db had 4 tablespaces on 4 different disks/arrays and partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
It is also a good idea to move the pg_xlog directory to a separate disk or array. Not just a different filesystem; it must be separate HW. SSDs I can recommend only in arrays with proper error checking. Lately we had problems with a corrupted database on a single SSD.
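To compare the two workloads directly, here is a short Python calculation of the average insert rates. The telco figure comes from this answer and the shop figures from the question; it assumes writes are spread evenly over the day, which real traffic will not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Telco app from this answer: ~2 billion rows per day.
telco_rows_per_day = 2_000_000_000
telco_rate = telco_rows_per_day / SECONDS_PER_DAY

# Eshop from the question: 1.8 million products, one row every 2 hours.
products = 1_800_000
snapshots_per_day = 12
shop_rows_per_day = products * snapshots_per_day
shop_rate = shop_rows_per_day / SECONDS_PER_DAY
shop_rows_per_year = shop_rows_per_day * 365

print(round(telco_rate))    # 23148 rows per second
print(shop_rows_per_day)    # 21600000 rows per day
print(shop_rate)            # 250.0 rows per second
print(shop_rows_per_year)   # 7884000000 rows per year
```

On average the shop workload is roughly two orders of magnitude below the telco one, although writing all 1.8 million rows in one burst every 2 hours would produce a much higher peak rate than 250 rows per second.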
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is a good code snippet for async writes in this response. Or you can benchmark any of the many loggers available for Java.
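As an illustration of asynchronous log writing, here is a minimal sketch in Python rather than the Java this answer has in mind, using the standard library's QueueHandler and QueueListener. The request path only puts a record on an in-memory queue, and a background thread does the file I/O. The function name, logger name, and line format are assumptions for illustration.

```python
import logging
import logging.handlers
import os
import queue
import tempfile

def make_async_view_logger(path):
    """Return (logger, listener); the listener thread writes lines to `path`."""
    log_queue = queue.Queue(-1)  # unbounded queue between app and writer
    file_handler = logging.FileHandler(path)
    file_handler.setFormatter(logging.Formatter("%(message)s"))
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("product_views")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep view events out of the root logger
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener

# Demo: write one "timestamp,classified_id" line to a temporary file.
log_path = os.path.join(tempfile.mkdtemp(), "views.log")
logger, listener = make_async_view_logger(log_path)
logger.info("%s,%d", "2016-12-05T10:00:00", 345422)
listener.stop()  # waits for queued records to be written

with open(log_path) as fh:
    lines = fh.read().splitlines()
print(lines)
```

A separate batch job can then aggregate these files into the analytics database, as described above.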
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as @JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables. But read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.