Redshift Spectrum Query - Request ran out of memory in the S3 query layer

I am trying to execute a query with grouping on 26 columns. The data is stored in S3 in Parquet format, partitioned by day. The Redshift Spectrum query returns the error below, and I am not able to find any relevant documentation in AWS regarding it.
Request ran out of memory in the S3 query layer
Total Number of rows in table : 770 Million
Total size of table in Parquet format : 45 GB
Number of records in each partition : 4.2 Million
Redshift configuration : Single node dc2.xlarge
Attached is the table DDL.

Try declaring the text columns in this table as VARCHAR rather than STRING. Also make sure to use the minimum possible VARCHAR size for the column to reduce the memory required by the GROUP BY.
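For example, a minimal sketch of what that looks like in the external table DDL (the actual DDL isn't reproduced here, so the column names, sizes, and S3 path below are hypothetical):
-- Hypothetical Spectrum external table: declare text columns as tightly
-- sized VARCHARs instead of STRING so the GROUP BY needs less memory per row.
CREATE EXTERNAL TABLE spectrum.web_sales_narrow (
    ws_order_number   BIGINT,
    ws_ship_mode      VARCHAR(20),   -- was STRING
    ws_web_site_name  VARCHAR(50)    -- was STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/web_sales/';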
Also, two further suggestions:
Recommend always using at least 2 nodes of Redshift. This gives you a free leader node and allows your compute nodes to use all their RAM for query processing.
Grouping by so many columns is an unusual query pattern. If you are looking for duplicates in the table consider hashing the columns into a single value and grouping on that. Here's an example:
SELECT MD5(ws_sold_date_sk
||ws_sold_time_sk
||ws_ship_date_sk
||ws_item_sk
||ws_bill_customer_sk
||ws_bill_cdemo_sk
||ws_bill_hdemo_sk
||ws_bill_addr_sk
||ws_ship_customer_sk
||ws_ship_cdemo_sk
||ws_ship_hdemo_sk
||ws_ship_addr_sk
||ws_web_page_sk
||ws_web_site_sk
||ws_ship_mode_sk)
, COUNT(*)
FROM spectrum.web_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
;

Related

Athena (Hive/Presto) query partitioned table IN statement

I have the following partitioned table in Athena (HIVE/Presto):
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
id STRING,
data STRING
)
PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket';
Data is stored in s3 organized in a path structure like s3://mybucket/year=2020/month=01/day=30/.
I would like to know if the following query would leverage partitioning optimization:
SELECT
*
FROM
mydb.mytable
WHERE
(year='2020' AND month='08' AND day IN ('10', '11', '12')) OR
(year='2020' AND month='07' AND day IN ('29', '30', '31'));
I am assuming that, since the IN operator will be transformed into a series of OR conditions, this query will still benefit from partitioning. Am I correct?
Unfortunately Athena does not expose information that would make it easier to understand how to optimise queries. Currently the only thing you can do is to run different variations of queries and look at the statistics returned in the GetQueryExecution API call.
One way to figure out if Athena will make use of partitioning in a query is to run the query with different values for the partition column and make sure that the amount of data scanned is different. If the amount of data is different Athena was able to prune partitions during query planning.
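For example, a sketch of that check using the table from the question (the same query shape run twice with different partition values); if the reported amount of data scanned differs between the two runs, partitions were pruned:
-- Run 1
SELECT COUNT(*) FROM mydb.mytable
WHERE year = '2020' AND month = '08' AND day = '10';
-- Run 2: different partition values; compare the "data scanned" statistic
SELECT COUNT(*) FROM mydb.mytable
WHERE year = '2020' AND month = '07' AND day = '30';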
Yes, it's also mentioned in the documentation.
When Athena runs a query on a partitioned table, it checks to see if any partitioned columns are used in the WHERE clause of the query. If partitioned columns are used, Athena requests the AWS Glue Data Catalog to return the partition specification matching the specified partition columns. The partition specification includes the LOCATION property that tells Athena which Amazon S3 prefix to use when reading data. In this case, only data stored in this prefix is scanned. If you do not use partitioned columns in the WHERE clause, Athena scans all the files that belong to the table's partitions.

Hive join query optimisation

Table A
---------
col1, col2, Adate, qty
Table B
-------
col2, cost, Bdate
The table sizes are as follows:
A: 1 million
B: 700k
Consider this query:
SELECT
A.col1,
A.col2,
B.Bdate bdate,
SUM(qty)*COLLECT_LIST(cost)[0] price
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
A.col1,
A.col2,
B.bdate;
The above Hive query takes more than 3 hours on a cluster of 4 slaves (8 GB memory, 100 GB disk) and 1 master (16 GB memory, 100 GB disk).
Can this query be optimized? If yes, where can the optimization be possible?
Use Tez and mapjoin.
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --adjust for your smaller table to fit in memory
set hive.execution.engine=tez;
Also this computation is not memory-efficient:
SUM(qty)*COLLECT_LIST(cost)[0] price
COLLECT_LIST collects all the cost values in the group into an array that is non-unique (it contains values from ALL rows in the group) and unordered (unordered because there is no distribute + sort before collect_list). This array can be quite large (the number of elements equals the number of rows in the group), depending on your data, and then you take the [0] element, which means you are picking an arbitrary cost from the group. Does it make sense to build an array just to take an arbitrary element from it? Use min() or max() instead. If it does not matter which cost is taken, then min(cost), max(cost), or some other scalar function will consume less memory. You can also use the first_value analytic function (it may require a sub-query, but it will be memory-efficient as well).
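For example, a minimal sketch of the same aggregation with min() in place of COLLECT_LIST(cost)[0], assuming any cost from the group is acceptable:
SELECT
    A.col1,
    A.col2,
    B.Bdate AS bdate,
    -- min() is a scalar aggregate, so Hive does not need to buffer
    -- every cost value of the group in memory
    SUM(A.qty) * MIN(B.cost) AS price
FROM A
JOIN B
  ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
    A.col1,
    A.col2,
    B.Bdate;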
I will try to give you some advice on improving query performance in Hive.
Check the execution engine you are using
set hive.execution.engine;
If your execution engine is mr (MapReduce), you may be able to use Apache Spark or Apache Tez instead, both of which are faster than MapReduce.
set hive.execution.engine=tez;
Join queries are computationally expensive and can be slow, especially when you’re joining three or more tables, or if you’re working with very large data.
One strategy that can be used to remedy this problem is to join the data in advance and store the pre-joined result in a separate table, which you can then query.
This is one way of denormalizing a normalized database. Pre-joining tables has some cost up front, but it can make analytic queries easier to write and faster to run.
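A minimal sketch of that idea for the two tables above (the table name pre_joined_ab is hypothetical):
-- Materialize the join once, then point analytic queries at the result
CREATE TABLE pre_joined_ab AS
SELECT
    A.col1,
    A.col2,
    A.Adate,
    A.qty,
    B.Bdate,
    B.cost
FROM A
JOIN B
  ON (A.col2 = B.col2 AND A.Adate <= B.Bdate);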
There are some other techniques for improving Hive query performance:
Join table ordering (Largest table last)
As with any type of tuning, it is important to understand the internal workings of the system. When Hive executes a join, it needs to select which table is streamed and which table is cached. Hive takes the last table in the JOIN statement for streaming, so we need to ensure that this streaming table is the larger of the two.
A: 1 million B: 700k
Hence, when these two tables are joined, it is important that the larger table comes last in the query.
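As a sketch for the tables above, this is simply the original join with the table order flipped so that the larger table is the one being streamed:
-- B (~700k rows) comes first and is cached;
-- A (~1 million rows) comes last and is streamed
SELECT A.col1, A.col2, B.Bdate, B.cost
FROM B
JOIN A
  ON (A.col2 = B.col2 AND A.Adate <= B.Bdate);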
Bucketing
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables,
BucketedTables
bucketing-in-hive
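A minimal DDL sketch of bucketing both tables on the join key col2 (the table names and the bucket count of 32 are hypothetical):
-- Bucketing can improve join performance when all joined tables
-- are bucketed on the join key column
CREATE TABLE a_bucketed (
    col1  STRING,
    col2  STRING,
    Adate DATE,
    qty   INT
)
CLUSTERED BY (col2) INTO 32 BUCKETS;

CREATE TABLE b_bucketed (
    col2  STRING,
    cost  DOUBLE,
    Bdate DATE
)
CLUSTERED BY (col2) INTO 32 BUCKETS;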
Partitioning
Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, or department.
Each table in Hive can have one or more partition keys to identify a particular partition.
Using partitions, it is easy to run queries on slices of the data.
apache-hive-partitions
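A minimal DDL sketch of a partitioned version of table A (hypothetical table name; the partition key Adate is declared in PARTITIONED BY rather than in the column list):
-- Queries that filter on Adate only read the matching partition directories
CREATE TABLE a_partitioned (
    col1 STRING,
    col2 STRING,
    qty  INT
)
PARTITIONED BY (Adate DATE);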

How to know if data is coming from a partitioned filegroup or the query is reading all records in the table

I applied partitioning on a DateTime column in an MSSQL table.
I created a partition function, a partition scheme, and 4 filegroups, and specified the boundary values.
I have queried this table with a WHERE condition on the partitioned column.
How do I know whether the query is reading all the records in the table or only the related filegroup?
In other words, how can I tell whether the query is using the partitions or not?
One way is with the actual query execution plan. The Actual Partition Count of the seek/scan operator will show the actual number of partitions touched.
Another method is to run the query with SET STATISTICS IO ON, where the scan count of the table will reflect the number of partitions used.
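A small T-SQL sketch of both checks, with a hypothetical table and partition column:
-- Check 1: run with the Actual Execution Plan enabled and look at
-- "Actual Partition Count" on the seek/scan operator for the table.
-- Check 2: the scan count reported by STATISTICS IO reflects the number
-- of partitions that were read.
SET STATISTICS IO ON;

SELECT COUNT(*)
FROM dbo.MyPartitionedTable            -- hypothetical partitioned table
WHERE PartitionDate >= '2023-01-01'    -- hypothetical partitioning column
  AND PartitionDate <  '2023-02-01';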

Need suggestion on splitting table in bigquery based on non-date column along with date partition

We have a date-partitioned table with 5 years of data (with daily incremental loads), running into many millions of records. To improve performance, we are thinking of splitting the table based on a non-date field (id), since all queries will include a WHERE clause on that column (id), and then date-partitioning each of the split tables so that we can query a smaller dataset with a date range. We will not be using wildcarded tables, since we will know the id and plan to append it to the table name and run the query against that specific table. We need to know whether that would be a good option to pursue to improve performance and reduce query cost.
[Update]: We went ahead and split the tables based on the id column (tablename_id), made each table date-partitioned, and clustered it on 4 other columns (the maximum supported) which are commonly used in queries. With that we were able to get better performance and also reduce the data accessed by each query. Based on the testing it looks like a good option to pursue, as long as wildcarded querying of the tables is avoided and until BigQuery supports partitioning on non-date/non-datetime columns.
We split the tables based on the id column, creating multiple tables. Each of the split tables is partitioned on the date column. Apart from that, each is clustered on 4 other columns as needed. Find below the performance on a sample dataset. The old table (UserInfo) has more than 500,000 rows. The stats we captured compare, for a given date range and id, the old (non-split/combined) table and the split table (split based on the id) in terms of the amount of data processed and the time taken for the same query.
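As a rough sketch of that layout in BigQuery DDL (the dataset, table, and column names are hypothetical, and it assumes the source table has a DATE column to partition on and an integer id):
-- One split table per id value, date-partitioned and clustered
-- on up to 4 commonly filtered columns
CREATE TABLE mydataset.userinfo_123
PARTITION BY event_date
CLUSTER BY col_a, col_b, col_c, col_d
AS
SELECT *
FROM mydataset.userinfo
WHERE id = 123;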
This is not possible. BigQuery doesn't support partitioning on non-date columns.
There's a feature request for it. I suggest subscribing to it to keep receiving information regarding its availability.

How to get records from row index 1 million to 2 million in big query tables?

I've been trying to download the M-Lab dataset from BigQuery recently. There seems to be a limit that you can only query and get as much as around 1 million rows with one query, and the M-Lab dataset contains multiple billions of records in many tables. I'd love to use queries like
bq query --destination_table=mydataset.table1 "select * from (select ROW_NUMBER() OVER() row_number, * from (select * from [measurement-lab:m_lab.2013_03] limit 10000000)) where row_number between 2000001 and 3000000;"
but it didn't work. Is there a workaround to make it work? Thanks a lot!
If you're trying to download a large table (like the m-lab table), your best option is to use an extract job. For example, run
bq extract 'mlab-project:dataset.table' 'gs://bucket/foo*'
which will extract the table to the Google Cloud Storage objects gs://bucket/foo000000000.csv, gs://bucket/foo000000001.csv, etc. The default extracts as CSV, but you can pass --destination_format=NEWLINE_DELIMITED_JSON to extract the table as JSON.
The other thing to mention is that you can read from the 1 millionth row in BigQuery using the tabledata.list API to read from that particular offset (no query required!).
bq head -n 1000 -s 1000000 'm-lab-project:dataset.table'
will read 1000 rows starting at the 1000000th row.