Find the percentage of used memory in Impala

How can I see the percentage of memory currently in use in Impala?
I would also like to check the total memory size.

The best option would be to use Cloudera Manager.
It shows the health of Impala and the queries that are running, including which queries are taking the most memory, and lets you kill them if needed.
It can also manage and distribute memory between YARN and Impala.
I am not sure about licensing, and it is probably not free, but you can ask your CDH administrator.
The original answer included a few screenshots from a sample system: the Health page, the CDH Manager view, and the Running Impala Queries page (which shows memory in GB and duration).
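If you are not using Cloudera Manager, each impalad daemon also exposes a debug web UI (port 25000 by default) whose /memz page reports current memory usage against the configured mem_limit, from which you can work out the percentage in use. A minimal sketch, assuming a reachable impalad host (the hostname is hypothetical; on recent Impala versions appending ?json requests a machine-readable view, otherwise drop it and read the HTML page):

curl -s 'http://impalad-host.example.com:25000/memz?json'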

Related

Matillion: How to identify performance bottleneck

We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32 GB RAM) and the database (Snowflake) don't get very busy and seem to be mostly idle (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed sequentially.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be forced to run serially for that reason.
You could try bumping up the number of concurrent connections under the Edit Environment dialog (the original answer illustrated this with a screenshot).
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
Transactions (begin, commit, etc.) will force transformation jobs to run in serial again.
If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
The original answer illustrated this with two screenshots: orchestration steps connected in a single chain will be done in sequence, while steps fanned out from a common component will be done in parallel (and will depend on Snowflake warehouse size to scale).
Also, try the Alter Warehouse component with a higher concurrency level.
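For reference, the Alter Warehouse component should end up issuing the equivalent of a Snowflake ALTER WAREHOUSE statement; a minimal sketch of the SQL, where the warehouse name and concurrency level are hypothetical:

ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16;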

Spotfire and BigQuery

I am quite puzzled by the BigQuery connector in Spotfire. It is taking extremely long to import my dataset in-memory.
My configuration: Spotfire on an AWS Windows instance (8 vCPU, 32 GB RAM); dataset: 50 GB, >100M rows, on BigQuery.
Yes, I should use in-database for such a large dataset, push the queries to BigQuery, and use Spotfire only for display, but that is not my question today 😋
Today I am trying to understand how the import works and why it is taking so long. This import job started 21 hours ago and is still not finished. The resources of the server are barely used (CPU, disk, network).
Testing done:
I tried importing data from Redshift, and it was much faster (14 min for 22 GB).
I checked the resources used during import: network speed (Redshift ~370 Mbps, BQ ~8 Mbps for 30 min), CPU (Redshift ~25%, BQ <5%), RAM (Redshift & BQ ~27 GB), disk write (Redshift ~30 MB/s, BQ ~5 MB/s).
I really don't understand what Spotfire is actually doing all this time while importing the dataset from BQ in-memory. There seems to be no use of server resources, and there is no indication of status apart from the time running.
Do any Spotfire experts have insights into what's happening? Is the BigQuery connector actually not meant to be used for in-memory analysis? What is the actual limiting factor in the implementation?
Thanks!
😇
We had an issue which is fixed in the Spotfire versions below:
TS 10.10.3 LTS HF-014
TS 11.2.0 HF-002
Please also vote and comment on the idea of using the Storage API when extracting data from BigQuery:
https://ideas.tibco.com/ideas/TS-I-7890
Thanks,
Thomas Blomberg
Senior Product Manager TIBCO Spotfire
@Tarik, you need to install the above hotfix at your end.
You can download the latest hotfix from the link: https://community.tibco.com/wiki/list-hotfixes-tibco-spotfire-clients-analyst-web-player-consumerbusiness-author-and-automation
An update after more testing. Thanks to @Thomas and @Manoj for their very helpful support. Here are the results:
I updated the Spotfire version to 11.2.0 HF-002, and it fixed the issue with bringing data in-memory from BigQuery 👌. Using (Data > Add Data...), the data throughput was still very low though, at ~13 min/GB; the network throughput came in bursts of ~8 Mbps.
As suggested in TIBCO Ideas by Thomas, I installed the Simba JDBC driver, and the data throughput improved dramatically to ~50 s/GB using (Data > Information Designer). The issue, of course, is that you need access to the server to install it. The network throughput was roughly 200 Mbps. I am not sure what the limiting factor is (Spotfire config, the Simba driver, or BigQuery).
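For anyone trying to reproduce this setup, a Simba BigQuery JDBC URL for a service-account login looks roughly like the sketch below; the project ID, service-account email, and key path are hypothetical placeholders, and the exact options depend on your driver version, so check the Simba documentation:

jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project;OAuthType=0;OAuthServiceAcctEmail=spotfire@my-project.iam.gserviceaccount.com;OAuthPvtKeyPath=C:\keys\spotfire-key.json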
Using the Redshift connector against a Redshift cluster with the same data, and connecting via (Data > Information Designer), I get a data import throughput of ~30 s/GB with a network throughput of 380 Mbps.
So my recommendation is to use the latest Simba driver along with the Information Designer to get the best "in-memory" data import throughput when connecting to medium-sized datasets (10-30 GB) in BigQuery. This yields a data import throughput of about 1 min/GB.
It's still not clear what makes the Redshift connection faster, though, or whether there is a faster method to import data from GCP/BigQuery into Spotfire 🤷‍♂️
Any comments or suggestions are welcome!
Tarik

Apache Impala: YARN-like CPU utilization report for queries (on Cloudera)

We have YARN and Impala co-located on the same Cloudera cluster. The YARN utilization report and the YARN History Server provide valuable information such as YARN CPU (vcores) and memory usage.
Does something like that exist for Impala, where I can fetch CPU and memory usage per query and for the Cloudera cluster as a whole?
Specifically, I want to know how many vcores are utilized out of the CPU allocation.
For example, say an Impala query takes 10 s to execute and uses 4 vcores and 50 MB of RAM; how do I find out that 4 vcores were utilized?
Is there any direct way to query this from the cluster, or any other method to compute the CPU utilization?
You can get a lot of information through the Cloudera Manager Charts. You can find an overview of all available metrics on their website or by clicking on the help symbol on the right side when creating a new chart.
There are quite a few metric categories for Impala that might be worth a read for you, for example the general Impala metrics and the Impala query metrics. The query metrics contain "memory_usage", measured in bytes, and the general metrics contain "impala_query_cm_cpu_milliseconds_rate" and "impala_query_memory_accrual_rate". These seem relevant to your use case, but check them and the linked pages to see which ones fit best.
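As an illustration, a tsquery for these metrics in the Chart Builder could look roughly like the sketch below; the WHERE clause, and whether you need an aggregate variant of the metrics instead, depends on your cluster, so treat it as a starting point:

SELECT impala_query_cm_cpu_milliseconds_rate, impala_query_memory_accrual_rate WHERE serviceName = "impala"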
More information is available from the service page of the Impala service in your Cloudera Manager. You can find out more about this page here; for example, the linked page mentions:
The Impala Queries page displays information about Impala queries that are running and have run in your cluster. You can filter the queries by time period and by specifying simple filtering expressions.
It also allows you to display "Threads: CPU Time" and "Work CPU Time" for each query, which again could be relevant for you.
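If you would rather pull the same per-query information programmatically than through the UI, the Cloudera Manager REST API exposes an impalaQueries endpoint. A minimal sketch with curl, where the host, credentials, cluster name, and service name are hypothetical and the /api/v19 version prefix depends on your CM release:

curl -s -u admin:admin 'http://cm-host.example.com:7180/api/v19/clusters/cluster1/services/impala/impalaQueries'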
That is all the information available from Impala.

Presto Nodes with too much load

I'm running some queries over a 100 GB TPC-H dataset on Presto. I have 4 nodes: 1 master and 3 workers. When I try to run some queries (not all of them), I see in the Presto web interface that the nodes die during execution, resulting in query failure. The error is the following:
com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or been under too much load. This is probably a transient issue, so please retry your query in a few minutes.
I rebooted all nodes and the Presto service, but the error remains. This problem doesn't occur if I run the same queries over a smaller dataset. Can someone provide some help with this problem?
Thanks
There are 3 possible causes for this kind of error. You can SSH into one of the workers to find out what the problem is while the query is running.
High CPU
Tune down task.concurrency to, for example, 8.
High memory
In jvm.config, -Xmx should be no more than 80% of total memory. In config.properties, query.max-memory-per-node should be no more than half of the -Xmx value (see the sketch after this list).
Low open file limit
Set a larger number for the Presto process in /etc/security/limits.conf. The default is definitely way too low.
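As a worked example, on a hypothetical worker with 64 GB of RAM, the memory guidance above could translate into something like this sketch (the numbers are illustrative, not a recommendation, and assume Presto runs as the user "presto"):

# jvm.config: 48G is 75% of 64 GB, under the 80% ceiling
-Xmx48G

# config.properties: half of the -Xmx value
query.max-memory-per-node=24GB

# /etc/security/limits.conf: raise the open file limit for the presto user
presto soft nofile 131072
presto hard nofile 131072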
It might be a configuration issue. For example, if the local maximum memory is not set appropriately and the query uses too much heap memory, full GC might happen and cause such errors. I would suggest asking in the Presto Google Group and describing a way to reproduce the issue :)
I was running Presto on a Mac with 16 GB of RAM; below is the configuration of my jvm.config file.
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
I was getting the following error even when running this query:
Select now();
Query 20200817_134204_00005_ud7tk failed: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.
I changed my -Xmx16G value to -Xmx10G, and it worked fine.
I used the following link to install Presto on my system:
Link for Presto Installation

S3cmd sync returns "killed"

I am trying to sync a few large buckets on Amazon S3.
When I run my s3cmd sync --recursive command, I get a response saying "killed".
What does this refer to? Is there a limit on the number of files that can be synced in S3?
After reading around, it looks like the program has memory-consumption issues. In particular, these can cause the OOM killer (out-of-memory killer) to take down the process in order to keep the system from getting bogged down. A quick look at dmesg after the process is killed will generally show whether this is the case.
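For example, right after the process dies, something like the sketch below will usually surface the kernel's OOM-killer message (the exact wording varies by kernel version):

dmesg | grep -i 'killed process'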
With that in mind, I would make sure you're on the latest release, whose release notes mention that memory-consumption issues have been resolved.
Old question, but I would like to say that before you try to add more physical memory or increase VM memory, try just adding more swap.
I did this on 4 servers (Ubuntu and CentOS) with low RAM (700 MB total, only 15 MB available), and they are working fine now.
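For reference, adding a swap file on a typical Linux box looks roughly like this sketch; the 2G size and the /swapfile path are arbitrary choices, and you would add an /etc/fstab entry to make it survive reboots:

sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile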