It is possible to make Hive run a fetch task for simple queries, instead of a map or MapReduce job, using the hive.fetch.task.conversion parameter.
Can someone explain why the fetch task runs so much faster than a map task, especially for simple work (for example select * from table limit 10;)? What does a map-only task do additionally in this case? The performance difference is more than 20x in my case. Both tasks have to read the table data, don't they?
A FetchTask fetches the data directly, whereas the alternative launches a full MapReduce job, which pays for job submission, scheduling, and task startup before it reads a single row.
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
<description>
Some select queries can be converted to a single FETCH task,
minimizing latency. Currently the query should be single
sourced, not have any subquery, and should not have
any aggregations or distincts (which incur RS),
lateral views or joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
</description>
</property>
There is also another parameter, hive.fetch.task.conversion.threshold, which defaults to -1 in Hive 0.10-0.13 and to 1 GB (1073741824 bytes) in 0.14 and later.
This means that if the table size is greater than 1 GB, Hive uses MapReduce instead of a fetch task.
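If you want to try this from Python, here is a minimal sketch using PyHive against a HiveServer2 endpoint; the host, port, and table name are placeholders:

from pyhive import hive  # one common HiveServer2 client; host/port below are placeholders

cur = hive.Connection(host="hiveserver2.example.com", port=10000).cursor()

# Session-level overrides: 'more' also converts plain SELECT/FILTER/LIMIT queries
# (plus TABLESAMPLE and virtual columns), and the threshold caps the input size
# still eligible for fetch conversion.
cur.execute("SET hive.fetch.task.conversion=more")
cur.execute("SET hive.fetch.task.conversion.threshold=1073741824")

# With the settings above this should run as a single fetch task, no MapReduce job.
cur.execute("SELECT * FROM some_table LIMIT 10")
print(cur.fetchall())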
I have a Redshift query whose output is too big to fit into the RAM of my EC2 instance. I am using psycopg2 to execute the query. If I use the limit keyword, will rows repeat as I increment the limit?
Say I enforce a limit of 0,1000 at first and get a block, then I enforce a limit of 1001,2000. Will there be a repetition of rows in those two blocks, considering Redshift fetches data in parallel?
Is there a better alternative to this ?
You want to DECLARE a cursor to store the full results on Redshift and then FETCH rows from the cursor in batches as you need them. This way the query only runs once, when the cursor is filled. See https://docs.aws.amazon.com/redshift/latest/dg/declare.html (the example is at the bottom of the page).
This is exactly how BI tools like Tableau get data from Redshift - in blocks of 10,000 rows. Using cursors prevents the tool/system/network from being overwhelmed when you select very large result sets.
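In psycopg2 specifically, a named (server-side) cursor issues the DECLARE/FETCH statements for you; a minimal sketch, with placeholder connection details and table name:

import psycopg2

conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                        dbname="dev", user="user", password="...")
# A named cursor is server-side: the full result set stays on Redshift and
# only `itersize` rows are pulled over the network per round trip.
with conn.cursor(name="big_result") as cur:
    cur.itersize = 10000  # batch size per FETCH
    cur.execute("SELECT * FROM my_big_table")  # placeholder query
    for row in cur:
        handle(row)  # placeholder for your row processing
conn.close()

Because all batches come from a single snapshot of the query's result set, rows never repeat or get skipped, unlike stitching together separate LIMIT/OFFSET queries.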
I have an Oracle table containing 900 million records; the table is partitioned into 24 partitions and has indexes.
I tried using a hint and set the fetch buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there any faster way to get all 900 million rows (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use indexes and split my query into, say, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, since you can then read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible (see the sketch after the list):
select * from table partition (p1);
select * from table partition (p2);
...
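A sketch of that pattern in Python with the python-oracledb driver (the same idea carries over to Scala/JDBC); the credentials, DSN, table and partition names, and the file sink are all placeholders:

import concurrent.futures
import oracledb  # python-oracledb driver; thin mode needs no Oracle client install

# Placeholder credentials/DSN; size the pool to the concurrency you settle on.
pool = oracledb.create_pool(user="scott", password="...",
                            dsn="dbhost/orclpdb", min=2, max=8)

def dump_partition(part_name):
    # Each worker streams one partition over its own connection.
    with pool.acquire() as conn:
        cur = conn.cursor()
        cur.arraysize = 10000  # large fetch buffer to cut network round trips
        cur.execute(f"select * from my_table partition ({part_name})")
        for batch in iter(lambda: cur.fetchmany(), []):
            write_to_file(part_name, batch)  # placeholder for your sink

partitions = [f"p{i}" for i in range(1, 25)]  # the 24 partitions from the question
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    list(executor.map(dump_partition, partitions))  # list() surfaces worker exceptions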
Not really an answer but too long for a comment.
Too many variables impact this to give informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid using the network. Once the file(s) complete, it will be quicker to compress them and transfer them to the destination than to move the data directly from database to local file via JDBC row processing.
With JDBC remember to set the cursor fetch size to reduce round tripping - setFetchSize. The default value is tiny (10 I think), try something like 1000 to see how that helps.
As for the query: you are writing to a file, so even though Oracle might process the query in parallel, your write-to-file process probably doesn't, so it becomes a bottleneck.
My approach would be to write the Java program to operate on a range of values passed as command line parameters, and experiment to find which range size and number of concurrent instances give optimal performance. The range will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range value is on an indexed column, ideally the partition key).
Roughly speaking, I would start with a range of 5m and run as many concurrent instances as CPU cores minus 2; this is not a scientifically derived number, just the one I tend to use as a first stab to see what happens.
I have a dataset that has data added almost every day, and it needs to be processed every day as part of a larger ETL.
When I select the partition directly, the query is really fast:
SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`
Yet, the issue is that the event type does not generate data on some Sundays, resulting in a non-existent partition for that particular day. Because of this, I cannot use the previous statement to run my daily job.
Another attempt led me to have Spark find the latest partition of the dataset, to be sure the bigger query wouldn't fail:
SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This works every time, but it is unbelievably slow.
I found numerous articles and references on how to build and maintain partitions, yet nothing about how to read them correctly.
Any idea how to have this done properly?
(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reason for slowness
1. The subquery needs to be executed completely before the actual query execution starts.
2. The above query will list all the S3 files to retrieve the partition information. If the folder has a large number of files, this process will take a long time. The time taken for listing is directly proportional to the number of files.
We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme, let's say the name of the table is tbl
You could perform an
ALTER TABLE tbl RECOVER PARTITIONS
which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing to make it faster.
Then we could fire
SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)
which will get the partition information only from the metastore, without having to list the object store, which is an expensive operation.
The cost incurred in this approach is that of recovering partitions,
after which all queries will be faster (when data for a new partition arrives, we need to recover partitions again).
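End to end in PySpark, that approach might look like the following sketch (assuming Hive support is available, and keeping the table name tbl from above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One-time (and whenever new partitions land): register partitions in the metastore.
spark.sql("ALTER TABLE tbl RECOVER PARTITIONS")

# MAX(partition) is now answered from the metastore, with no S3 listing.
latest = spark.sql("SELECT MAX(partition) FROM tbl").collect()[0][0]
df = spark.sql(f"SELECT * FROM tbl WHERE partition = '{latest}'")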
Workaround when you don't have Hive:
FileSystem.get(URI.create("s3://datalake/partitoned_table"), conf).listStatus(new Path("s3://datalake/partitoned_table/"))
The above code will give you the list of partition directories, for example: List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21, ...)
This is very efficient because it only fetches metadata from the S3 location.
Just take the latest partition and use it in your SQL.
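If the rest of your job is PySpark, the same listing can be done through Spark's JVM gateway; a sketch, assuming an active SparkSession named spark (note that spark._jvm is an internal handle):

# List partition directories via the Hadoop FileSystem API from PySpark.
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    jvm.java.net.URI.create("s3://datalake/partitoned_table"), conf)
statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path("s3://datalake/partitoned_table/"))

# Directory names look like "partition=2019-05-20", so the lexicographic max is the latest.
latest = max(s.getPath().getName() for s in statuses)
df = spark.read.json("s3://datalake/partitoned_table/" + latest)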
When I run queries in a VirtualBox sandbox with Hive, I notice that select count(*) is much slower than select *.
Can anyone explain what is going on behind the scenes?
And why does this delay happen?
select * from table
can be a map-only job, but
select count(*) from table
has to be a map and reduce job.
Hope this helps.
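You can see the difference yourself by asking Hive for its plan. A sketch using PyHive (any Hive client works; host, port, and table name are placeholders):

from pyhive import hive

cur = hive.Connection(host="hiveserver2.example.com", port=10000).cursor()

# Plan for select *: typically just a fetch operator, no MapReduce stages.
cur.execute("EXPLAIN SELECT * FROM my_table")
print("\n".join(row[0] for row in cur.fetchall()))

# Plan for count(*): a map stage feeding a reduce stage.
cur.execute("EXPLAIN SELECT COUNT(*) FROM my_table")
print("\n".join(row[0] for row in cur.fetchall()))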
There are three types of operations that a hive query can perform.
In order from cheapest and fastest to most expensive and slowest, here they are.
A hive query can be a metadata only request.
Show tables, describe table are examples. In these queries the hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
A hive query can be an hdfs get request.
Select * from table would be an example. In this case Hive can return the results by performing an HDFS operation - hadoop fs -get, more or less.
A hive query can be a Map Reduce job.
Hive has to ship the jar to HDFS, the JobTracker queues the tasks, the TaskTrackers execute the tasks, and the final data is put into HDFS or shipped to the client.
The Map Reduce job has different possibilities as well.
It can be a Map only job.
Select * from table where id > 100, for example; all of that logic can be applied in the mapper.
It can be a Map and Reduce job,
Select min(id) from table;
Select * from table order by id ;
It can also lead to multiple MapReduce passes, but I think the above summarizes the main behaviors.
This is because the DB is using a clustered primary key, so the query searches each row for the key individually, row by agonizing row, not from an index.
1. Run optimize table. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
2. Create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which are much faster to scan. After creating it, check the explain plan to make sure it's using the new index.
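As a concrete sketch in Python with MySQL Connector/Python (the connection details and the change_events table name are placeholders; only the change_event_id column comes from the question):

import mysql.connector  # MySQL Connector/Python driver

conn = mysql.connector.connect(host="dbhost", user="app", password="...",
                               database="events")  # placeholder credentials
cur = conn.cursor()

# 1. Rewrite the table so data pages are physically sorted by the clustered PK.
cur.execute("OPTIMIZE TABLE change_events")
cur.fetchall()  # OPTIMIZE returns a status row that must be consumed

# 2. A narrow secondary index: its pages hold only change_event_id,
#    so a range scan touches far fewer pages than the clustered PK.
cur.execute("CREATE INDEX idx_change_event_id ON change_events (change_event_id)")

# Verify that the optimizer actually picks the new index.
cur.execute("EXPLAIN SELECT change_event_id FROM change_events "
            "WHERE change_event_id BETWEEN 100000 AND 200000")
print(cur.fetchall())
conn.close()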
I am dealing with a database (2.5 GB) whose tables range from only 40 rows up to 9 million rows of data.
When I run any query on a large table, it takes a lot of time.
I want results in less time.
A small query on a table that has only 90 rows:
hive> select count(*) from cidade;
Time taken: 50.172 seconds
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.block.size</name>
<value>131072</value>
<description>Default block size for new files, in bytes.
</description>
</property>
</configuration>
Do these settings affect the performance of Hive?
dfs.replication=3
dfs.block.size=131072
Can I set it from the Hive prompt, as
hive> set dfs.replication=5
Does this value remain for a particular session only?
Or is it better to change it in the .xml file?
The important thing is that select count(*) causes Hive to start a MapReduce job.
You may expect this to be very fast, like a MySQL query.
But for even the simplest MapReduce job in Hadoop, the total time includes submitting the job to the JobTracker, assigning tasks to TaskTrackers, and so on, so the total time is at least several tens of seconds.
Try select count(*) on a big table: the time will not increase by much.
So you need to understand that Hive and Hadoop are built to deal with big data, where this fixed overhead is amortized.
dfs.replication shouldn't impact the run-time of your Hive queries. It's a property exposed from hdfs-site.xml that determines how many HDFS nodes a block of data gets replicated to. A dfs.replication of 3 means each block of data is stored on 3 nodes in total. Consequently, it's not a per-session setting.