I am dealing with a database (2.5 GB) whose tables range from only 40 rows up to 9 million rows.
Any query on the large tables takes a long time, and I want results faster.
A small query on a table that has only 90 rows:
hive> select count(*) from cidade;
Time taken: 50.172 seconds
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.block.size</name>
<value>131072</value>
<description>The default block size for new files, in bytes.
</description>
</property>
</configuration>
Do these settings affect the performance of Hive?
dfs.replication=3
dfs.block.size=131072
Can I set it from the Hive prompt, as in
hive>set dfs.replication=5
Does this value remain in effect for a particular session only?
Or is it better to change it in the .xml file?
The important thing is that select count(*) causes Hive to start a MapReduce job.
You may expect this to be as fast as a MySQL query.
But even the simplest MapReduce job in Hadoop spends time submitting the job to the JobTracker, assigning tasks to TaskTrackers, and so on, so the total time is at least several tens of seconds.
Try select count(*) on a big table; the time will not increase by much.
So you need to understand how Hive and Hadoop deal with big data.
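You can see this overhead in the query plan; for example, from the Hive CLI (using the table name from the question):
hive> explain select count(*) from cidade;
Even for a 90-row table, the plan contains a full map/reduce stage, and submitting that stage dominates the runtime.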
dfs.replication shouldn't impact the run time of your Hive queries. It's a property exposed from hdfs-site.xml that determines how many HDFS nodes a block of data gets replicated to. A dfs.replication of 3 means each block of data exists on 3 nodes in total. Consequently, it's not a per-session setting.
Related
It is possible to enable the Fetch task in Hive for simple queries, instead of a Map or MapReduce job, using the hive.fetch.task.conversion parameter.
Please explain why the Fetch task runs much faster than a Map task, especially when doing simple work (for example, select * from table limit 10;). What is a map-only task doing additionally in this case? The performance difference is more than 20x in my case. Both tasks have to read the table data, don't they?
A FetchTask fetches the data directly, whereas the MapReduce path launches a full MapReduce job:
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
<description>
Some select queries can be converted to single FETCH task
minimizing latency. Currently the query should be single
sourced not having any subquery and should not have
any aggregations or distincts (which incur RS),
lateral views and joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
</description>
</property>
There is also another parameter, hive.fetch.task.conversion.threshold, whose default is -1 in Hive 0.10-0.13 and 1 GB (1073741824 bytes) in 0.14 and later.
This means that if the table size is greater than 1 GB, Hive uses MapReduce instead of a Fetch task.
See the Hive documentation for more detail.
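For example, the difference is easy to reproduce from the Hive CLI (the table and column names below are placeholders):
-- with 'minimal' (the old default), a filter on a non-partition column still runs MapReduce
set hive.fetch.task.conversion=minimal;
select * from my_table where some_col > 0 limit 10;   -- launches a MapReduce job

-- with 'more', simple SELECT/FILTER/LIMIT queries become a single Fetch task
set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=1073741824;  -- only convert when the table is under ~1 GB
select * from my_table where some_col > 0 limit 10;   -- reads the files directly, no job submission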
I have an Oracle table containing 900 million records. This table is partitioned into 24 partitions and has indexes.
I tried using a hint and set the fetch buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there any faster way to get all 900 million rows (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into, say, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned: you can read large chunks of data in parallel without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
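If you don't want to hard-code the 24 partition names, you can generate these statements from the data dictionary (assuming the table is in your own schema; the table name is a placeholder):
-- list the partitions so the client can spawn one reader per partition
select partition_name
from   user_tab_partitions
where  table_name = 'MY_BIG_TABLE'
order  by partition_position;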
Not really an answer but too long for a comment.
Too many variables impact this to give informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid using the network. Once the file(s) are complete, it will be quicker to compress them and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC, remember to set the cursor fetch size to reduce round-tripping (setFetchSize). The default value is tiny (10, I think); try something like 1000 to see how much that helps.
As for the query, you are writing to a file, so even though Oracle might process the query in parallel, your write-to-file process probably doesn't, so it's a bottleneck.
My approach would be to write the Java program to operate off a range of values passed as command-line parameters, and experiment to find which range size and number of concurrent instances give optimal performance. The range will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range column is indexed, ideally the partition key).
Roughly speaking, I would start with a range of 5m and run a number of concurrent instances equal to the number of CPU cores minus 2; this is not a scientifically derived number, just one I tend to use as a first stab to see what happens.
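As a rough sketch of that range-based approach (the table name, column name and date boundaries below are hypothetical, assuming the partition key is a date):
-- worker 1 of N: each worker gets its own range on its own connection,
-- so Oracle can prune to the partitions covered by that range
select *
from   my_big_table
where  load_date >= date '2017-01-01'
and    load_date <  date '2017-02-01';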
My cluster configuration is as follows:
3 Node cluster
128GB RAM per cluster node.
Processor: 16 core HyperThreaded per cluster node.
All 3 nodes run a Kudu master, a Kudu tablet server (T-Server) and an Impala daemon; one of the nodes also runs the Impala Catalog and StateStore services.
My issues are as follows:
1) I have a hard time figuring out dynamic resource pooling in Impala while running concurrent queries. I've tried setting mem_limit, still no luck. I've also tried static service pools, but with those I couldn't achieve the required concurrency either. Even with admission control, the required concurrency was not achieved.
I) The time taken for 1 query: 500-800ms.
II) But if 10 concurrent queries are given the time taken grows to 3-6s per query.
III) But if more than 20 concurrent queries are given the time taken is exceeding 10s per query.
2) One of my cluster nodes is not taking any load after I submit a query; I checked this in the query summary. I've tried setting NUM_NODES to 0 and 1 on the node that is not taking load, but the summary still shows that the node is not taking any load.
What is the table size? How many rows are there in the tables? Are the tables partitioned? It would be nice if you could compare your configuration with the Impala benchmarks.
As mentioned above, Impala is designed to run on massively parallel processing infrastructure. When we had a setup of 10 nodes with 80 cores and 160 virtual cores and 12 TB of SAN storage, we could get a computation time of 60 seconds with 5 concurrent users.
I'm new to PDI and am using PDI 7. I have an Excel input with 6 rows and want to insert it into a Postgres DB. My transformation is: EXCEL INPUT --> Postgres Bulk Loader (2 steps only).
Condition 1: When I run the transformation, the Postgres Bulk Loader does not stop and nothing is inserted into my Postgres DB.
Condition 2: So I add an "Insert/Update" step after the Postgres Bulk Loader, and all the data is inserted into the Postgres DB, which means success, but the bulk loader is still running.
My transformation
From all the sources I can find, they only need the input and Bulk Loader steps, and after the transformation finishes, the bulk loader shows "finished" (mine shows "running"). So I want to ask how to do this properly for Postgres. Did I skip something important? Thanks.
The PostgreSQL bulk loader used to be only experimental. Haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant use of a bulk loader.
Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.
To insert just 7 rows you don't need a bulk loader.
The bulk loader is designed to load huge amounts of data. It uses the native psql client, which transfers data much faster since it uses all the features of the binary protocol without the restrictions of the JDBC specification. JDBC is used in other steps, such as Table Output, and most of the time Table Output is sufficient.
The Postgres Bulk Loader step just builds the incoming rows into CSV data in memory and passes it to the psql client.
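Under the hood this is roughly equivalent to streaming that CSV into a COPY statement; a minimal sketch (table and column names are placeholders):
-- what the bulk loader effectively does with the generated CSV
COPY my_table (id, name, amount) FROM STDIN WITH (FORMAT csv);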
I did some experiments.
Environment:
DB: Postgres v9.5 x64
PDI Kettle v5.2.0
PDI Kettle default JVM settings (512 MB)
Data source: DBF file with over 2_215_000 rows
Both PDI Kettle and the database on the same localhost
Table truncated on each run
PDI Kettle restarted on each run (to avoid heavy CPU load from GC due to the huge number of rows)
The results are below to help you make a decision.
Bulk loader: average over 150_000 rows per second, around 13-15 s in total.
Table output (SQL inserts): average 11_500 rows per second. Total is around 3 min 18 s.
Table output (batch inserts, batch size 10_000): average 28_000 rows per second. Total is around 1 min 30 s.
Table output (batch inserts in 5 threads, batch size 3_000): average 7_600 rows per second per thread, which means around 37_000 rows per second. Total time is around 59 s.
The advantage of the bulk loader is that it doesn't fill the JVM's memory; all data is streamed into the psql process immediately.
Table Output fills JVM memory with data. After around 1_600_000 rows the memory is full and GC starts; CPU load then rises to 100% and speed slows down significantly. That is why it is worth playing with the batch size to find the value that gives the best performance (bigger is better), but at some point it causes GC overhead.
Last experiment: give the JVM enough memory to hold the data. This can be tweaked via the PENTAHO_DI_JAVA_OPTIONS variable. I set the JVM heap size to 1024 MB and increased the batch size.
Table output (batch inserts in 5 threads, batch size 10_000): average 12_500 rows per second per thread, which means around 60_000 rows per second in total. Total time is around 35 s.
Now it is much easier to make a decision. But note that Kettle/PDI and the database were located on the same host; if the hosts are different, network bandwidth can play a role in performance.
Slow insert/update step
Why should you avoid the Insert/Update step (when processing a huge amount of data, or when you are limited by time)?
Let's look at the documentation:
The Insert/Update step first looks up a row in a table using one or
more lookup keys. If the row can't be found, it inserts the row. If it
can be found and the fields to update are the same, nothing is done.
If they are not all the same, the row in the table is updated.
As stated above, for each row in the stream the step executes 2 queries: a lookup first, and then an update or insert. The PDI Kettle source shows that a PreparedStatement is used for all queries: insert, update and lookup.
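Per incoming row this amounts to roughly the following (table, key and column names are hypothetical):
-- 1. lookup by the configured key(s)
SELECT name, amount FROM target_table WHERE id = ?;
-- 2a. if no row was found: insert
INSERT INTO target_table (id, name, amount) VALUES (?, ?, ?);
-- 2b. if a row was found and any field differs: update
UPDATE target_table SET name = ?, amount = ? WHERE id = ?;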
So if this step is the bottleneck, try to figure out what exactly is slow; see the sketch below.
Is the lookup slow? (Run the lookup query manually against the database on sample data and check whether it is slow. Do the lookup columns used to find the corresponding row have an index?)
Is the update slow? (Run the update manually against the database on sample data and check whether it is slow. Does the update's WHERE clause use an index on the lookup fields?)
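A quick way to check this in Postgres, with hypothetical table, column and index names:
-- does the lookup use an index or a sequential scan?
EXPLAIN ANALYZE SELECT * FROM target_table WHERE id = 12345;
-- if it is a sequential scan, an index on the lookup key usually fixes it
CREATE INDEX idx_target_table_id ON target_table (id);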
In any case this step is slow, since it requires a lot of network round trips and data processing in Kettle.
The only way to make it faster is to load all the data into a "temp" table in the database and call a function that upserts the data, or just use a simple SQL step in the job to do the same.
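With Postgres 9.5 (the version used above) that SQL step could be a single upsert statement; a minimal sketch with hypothetical table and column names, assuming a unique key on id:
-- bulk-load into a staging table first, then upsert everything in one statement
INSERT INTO target_table (id, name, amount)
SELECT id, name, amount FROM temp_table
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    amount = EXCLUDED.amount;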
I'm loading a table in Hive that's partitioned by date. It currently contains about 3 years' worth of records, so circa 900 partitions (i.e. 365*3).
I'm loading daily deltas into this table, adding an additional partition per day. I achieve this using dynamic partitioning, as I can't guarantee my source data only contains one day's worth of data (e.g. if I'm recovering from a failure I may have multiple days of data to process).
This is all fine and dandy; however, I've noticed the final step of actually writing the partitions has become very slow. By this I mean the logs show the MapReduce stage completes quickly, but it is very slow on the final step, as it seems to scan and open all existing partitions regardless of whether they will be overwritten.
Should I be explicitly creating partitions to avoid this step?
Whether the partitions are dynamic or static typically should not alter the performance drastically. Can you check in each of the partitions how many actual files are getting created? Just want to make sure that the actual writing is not serialized which it could be if it's writing to only one file. Also check how many mappers & reducers were employed by the job.
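As a hedged sketch of both checks and the static-partition alternative (the table, column and partition values below are placeholders):
-- inspect what already exists and how many files one partition holds
SHOW PARTITIONS my_table;
DESCRIBE FORMATTED my_table PARTITION (load_date='2017-06-01');

-- dynamic partitioning (current approach): Hive works out the target partitions
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE my_table PARTITION (load_date)
SELECT col1, col2, load_date FROM staging_table;

-- static partitioning: only the named partition is touched
INSERT OVERWRITE TABLE my_table PARTITION (load_date='2017-06-01')
SELECT col1, col2 FROM staging_table WHERE load_date='2017-06-01';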