Hive DISTINCT query takes a long time when there are many files - sql

Table Structure -
hive> desc table1;
OK
col1 string
col2 string
col3 string
col4 bigint
col5 string
Time taken: 0.454 seconds, Fetched: 5 row(s);
Number of underlying files -
[user#localhost ~]$ hadoop fs -ls /user/hive/warehouse/database.db/table | wc -l
58822
[user#localhost ~]$
Distinct Query - select distinct concat(col1,'~',col2,'~',col3) from vn_req_tab;
Total records: ~2M. The above query runs for 8 hours.
What is causing the issue, and how do I debug this query?

You have a very large number of small files, and that is the main problem.
When you execute the query, one mapper runs per file, so a huge number of mappers are launched, each working on a tiny piece of data (one file each). They consume unnecessary cluster resources and wait for each other to finish.
Please note that Hadoop is designed for a small number of large files, not a large number of small ones.
If you executed the same query over bigger, consolidated files, it would perform much better.
Try setting the properties below so that many small files are combined into each split:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=100000000;
-- you can also set mapred.max.split.size for optimal performance
Try tweaking these values to reach an optimal number of mappers.
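Longer term, it is worth compacting the data once so that every later query benefits. A rough sketch, assuming a scratch table named table1_compacted (hypothetical) and that the hive.merge.* properties behave as in stock Hive:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;   -- target ~128 MB average file size
set hive.merge.size.per.task=268435456;        -- ~256 MB per merge task
CREATE TABLE table1_compacted LIKE table1;
INSERT OVERWRITE TABLE table1_compacted SELECT * FROM table1;
-- rewriting consolidates the ~58k small files into far fewer large ones;
-- the DISTINCT query can then run against table1_compacted with far fewer mappers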

Related

How to efficiently sample hive table and run a query?

I want to sample 100 rows from a big_big_table (millions and millions of rows), and run some query on these 100 rows. Mainly for testing purposes.
The way I wrote it runs for a really long time, as if it reads the whole big_big_table and only then takes the LIMIT 100:
WITH sample_table AS (
SELECT *
FROM big_big_table
LIMIT 100
)
SELECT name
FROM sample_table
ORDER BY name
;
Question: What's the correct/fast way of doing this?
Check hive.fetch.task.* configuration properties
set hive.fetch.task.conversion=more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1073741824; --1Gbyte
Set these properties before your query and, if you are lucky, it will run without MapReduce. Also consider limiting the query to a single partition.
Whether this works depends on the storage type/SerDe and file sizes. If the files are small or splittable and the table is native, it may run fast without MapReduce being started.
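Putting it together, a sketch of what that could look like; the partition column dt is hypothetical, so substitute whatever big_big_table is actually partitioned by:
set hive.fetch.task.conversion=more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1073741824;
-- restrict to one partition so the fetch task only reads a small slice
SELECT name
FROM big_big_table
WHERE dt = '2023-01-01'
LIMIT 100;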

BigQuery. Long execution time on small datasets

I created a new Google Cloud project and set up a BigQuery database. I tried different queries, and they all take too long to execute. We currently don't have a lot of data, so I expected high performance.
Below are some examples of queries and their execution time.
Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):
select date(installtime) regtime
,count(distinct userclientid) users
,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1
The query failed after more than an hour with the error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."
Query execution plan: https://prnt.sc/t30bkz
Query #2 (Job Id bquxjob_41f963ae_172cd41083f):
select fd.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1
Execution took 59.3 sec, 7.7 MB processed, which is too slow.
Query Execution plan: https://prnt.sc/t309t4
Query #3 (Job Id bquxjob_3b19482d_172cd31f629)
select date(installtime) regtime
,count(distinct userclientid) users
from DWH.DimUser du
group by 1
Execution time was 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.
Tables used :
DimUser - Table size 870.71 MB, Number of rows 2,771,379
FactAdRevenue - Table size 6.98 MB, Number of rows 53,816
FactMarketingSpent - Table size 68.57 MB, Number of rows 453,600
The question is: what am I doing wrong that makes query execution time so long? If everything is fine, I would be glad to hear any advice on how to reduce execution time for such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.
Thank you!
P.s. Previously I had experience using BigQuery for other projects and the performance and execution time were incredibly good for tables of 50+ TB size.
Posting the same reply I've given in the GCP Slack workspace:
Your first two queries both look like one particular worker is overloaded. You can see this because, in the compute section, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (looking at the 2nd input) to a table of ~50k (looking at the first input). This is not good practice; you should switch it so that the larger table is the leftmost table. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
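For query #2, for instance, the reordered pattern could look like the sketch below; BigQuery's optimizer may already reorder joins internally, so treat this as the pattern from the linked best-practices page rather than a guaranteed fix:
select fd.date
  ,sum(fd.revenue) adrevenue
  ,sum(fm.advcost) advspent
from DWH.FactMarketingSpent fm                    -- larger table (~450k rows) leftmost
join DWH.FactAdRevenue fd on fd.date = fm.date    -- smaller table (~54k rows) on the right
group by 1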
You may also have heavy skew in your join keys (e.g., 90% of rows on 1/1/2020, or NULL). Check this.
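A quick, hypothetical way to check, using query #1's DimUser side of the join as the example:
select date(installtime) as join_key
  ,count(*) as rows_per_key
from DWH.DimUser
group by 1
order by rows_per_key desc
limit 10
-- one key (or NULL) holding a large share of the ~2.8M rows would indicate skew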
For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BQ tends to get faster when you run the same query over and over, so this will get quicker.
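For example, a sketch of query #3 using BigQuery's built-in approximate aggregate:
select date(installtime) regtime
  ,approx_count_distinct(userclientid) users   -- approximate, but much cheaper than COUNT(DISTINCT ...)
from DWH.DimUser
group by 1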

Spark SQL: Number of partitions being generated seems weird

I have a very simple Hive table with the below structure.
CREATE EXTERNAL TABLE table1(
col1 STRING,
col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://path/';
The directory this table is being pointed to has just ONE file of size 51 KB.
From the pyspark shell (with all default values):
df = sparksession.sql("SELECT * from table1")
df.rdd.getNumPartitions()
The number of partitions being returned is weird: sometimes it returns 64 and sometimes 81.
My expectation was to see 1 or 2 partitions at most. Any thoughts on why I see that many partitions?
Thanks.
As you stated, the number of partitions returned is sometimes 64 and sometimes 81, because it is up to Spark to decide how many partitions it wants to use to store the data. Even the repartition command is only a request to Spark to shuffle the data into the given number of partitions; if Spark thinks that is not possible, it will decide on its own and store the data in a number of partitions it chooses.
Hope this explanation answers your question.
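If you simply want fewer partitions for a 51 KB table, you can make the request explicit. A sketch assuming Spark 2.4+, where the COALESCE hint is available in Spark SQL (the equivalent DataFrame call would be df.coalesce(1)); as noted above, this is still a request rather than a guarantee:
df = sparksession.sql("SELECT /*+ COALESCE(1) */ * FROM table1")  # ask Spark to collapse the read into 1 partition
df.rdd.getNumPartitions()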

Amazon Redshift queries mysteriously dying

Why is my Amazon Redshift query sometimes working, sometimes getting killed, and sometimes running out of memory?
This is a simple query:
dev=# EXPLAIN SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
QUERY PLAN
------------------------------------------------------------------------------------
XN Seq Scan on annotated_apache_logs (cost=0.00..114376.71 rows=9150137 width=207)
Filter: (date = '2015-09-15'::date)
Pulling about 9 million rows:
dev=# SELECT count(*) FROM annotated_apache_logs WHERE date = '2015-09-15';
count
---------
9150137
(1 row)
And choking:
dev=# SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
out of memory
Sometimes the SQL says Killed. Sometimes it works. Sometimes I get out of memory. No idea why. The table looks like this (I've removed columns not in the above query):
CREATE TABLE IF NOT EXISTS annotated_apache_logs (
row_number double precision,
browser_cookie character varying(240),
timestamp integer,
request_path character varying(2500),
status character varying(12),
outcome character varying(128),
duration integer,
referrer character varying(2500)
)
DISTKEY (date)
SORTKEY (browser_cookie);
And I've worked very hard to get all of those columns as small as I can to reduce memory usage. What do I look for now? If I read the EXPLAIN output correctly, this might return a couple of gigs of data. Not much data, no joins, nothing fancy. For a "petabyte scale data warehouse", that's trivial, so I'm assuming I'm missing something fundamental here.
You should use cursors to fetch the result set in chunks. See http://docs.aws.amazon.com/redshift/latest/dg/declare.html
If your client application uses an ODBC connection and your query creates a result set that is too large to fit in memory, you can stream the result set to your client application by using a cursor. When you use a cursor, the entire result set is materialized on the leader node, and then your client can fetch the results incrementally.
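A rough sketch of that pattern (the cursor name and fetch size are arbitrary):
BEGIN;
DECLARE log_cursor CURSOR FOR
    SELECT row_number, browser_cookie, "timestamp", request_path,
           status, outcome, duration, referrer
    FROM annotated_apache_logs
    WHERE date = '2015-09-15';
FETCH FORWARD 10000 FROM log_cursor;  -- repeat until no more rows come back
CLOSE log_cursor;
COMMIT;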
Edit:
This assumes that you want the entire result set rather than filtering it with WHERE/LIMIT.
If your query is actually running out of memory, check the concurrency of the WLM queue under which it runs. Try increasing the available memory for that queue or reducing its concurrency; this will allow your query to use more memory.
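For a one-off large query you can also claim extra slots from the queue for your session, which gives the query a bigger share of the queue's memory. A sketch (3 is an arbitrary example and must not exceed the queue's slot count):
set wlm_query_slot_count to 3;
SELECT row_number, browser_cookie, "timestamp", request_path,
       status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
set wlm_query_slot_count to 1;  -- back to the default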
P.S.:
When it says "petabyte scale", it does not mean there is a petabyte of RAM for you. Many factors decide how much memory your query actually gets during execution:
What is the node type you are using?
How many nodes?
What other queries are running when you are running this query?

Sporadic Execution Times for Query in SQL Server 2008

I have been running some speed tests on a query where I insert 10,000 records into a table that has millions (over 24mil) of records. The query (below) will not insert duplicate records.
MERGE INTO [dbo].[tbl1] AS tbl
USING (SELECT col2,col3, max(col4) col4, max(col5) col5, max(col6) col6 FROM #tmp group by col2, col3) AS src
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
INSERT (col2,col3,col4,col5,col6)
VALUES (src.col2,src.col3,src.col4,src.col5,src.col6);
The execution times of the above query are sporadic, ranging anywhere from 0:02 seconds to 2:00 minutes.
I am running these tests within SQL Server Management Studio via a script that creates the 10,000 rows of data (in the #tmp table) and then fires the MERGE query above. The point being, the exact same script is executed for each test that I run.
The execution times bounce around from seconds to minutes as in:
Test #1: 0:10 seconds
Test #2: 1:13 minutes
Test #3: 0:02 seconds
Test #4: 1:56 minutes
Test #5: 0:05 seconds
Test #6: 1:22 minutes
One metric that I find interesting is that the alternating seconds/minutes sequence is relatively consistent - i.e., every other test finishes in seconds.
Can you give me any clues as to what may be causing this query to have such sporadic execution times?
I wish I could say what the cause of the sporadic execution times was, but I can say what I did to work around the problem...
I created a new database and target table and added 25 million records to the target table. Then I ran my original tests on the new database/table by repeatedly inserting 10k records into the target table. The results were consistent execution times of approximately 0:07 seconds for each 10k insert.
For kicks I did the exact same testing on a machine that has twice as much CPU/memory as my dev laptop. The results were consistent execution times of 0:00 seconds. (It's time for a new dev machine ;))
I dislike not discovering the cause of the problem, but in this case I'm going to have to call it good and move on. Hopefully, someday, a StackO die-hard can update this question with a good answer.