Apache Impala - YARN-like CPU utilization report for queries (on Cloudera)

We have YARN and Impala co-located on the same Cloudera cluster. The YARN utilization report and the YARN history server provide valuable information such as YARN CPU (vcores) and memory usage.
Does something like that exist for Impala, where I can fetch CPU and memory usage per query and for the Cloudera cluster as a whole?
Specifically, I want to know how many vcores are utilized out of the CPU allocation.
For example, if an Impala query takes 10 s to execute and, let's say, it used 4 vcores and 50 MB of RAM, how do I find out that 4 vcores were utilized?
Is there any direct way to query this from the cluster, or any other method to compute the CPU utilization?

You can get a lot of information through the Cloudera Manager Charts. You can find an overview of all available metrics on their website or by clicking on the help symbol on the right side when creating a new chart.
There are quite a few categories for Impala that might be worth a read for you, for example the general Impala metrics and the Impala query metrics. The query metrics contain "memory_usage", measured in bytes, and the general metrics contain "impala_query_cm_cpu_milliseconds_rate" and "impala_query_memory_accrual_rate". These seem to be relevant for your use case, but check them and the linked pages to see which ones fit best.
More information is available from the service page of the Impala service in your Cloudera Manager. You can find out more about this page here, but for example the linked page mentions:
The Impala Queries page displays information about Impala queries that are running and have run in your cluster. You can filter the queries by time period and by specifying simple filtering expressions.
It also allows you to display "Threads: CPU Time" and "Work CPU Time" for each query, which again could be relevant for you.
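If you would rather pull these numbers programmatically than read them off the charts, the same data is exposed through the Cloudera Manager REST API. The sketch below is only a rough illustration: the host, credentials, API version, cluster and service names are placeholders, and you should check the exact endpoint paths, metric names and response fields against the API documentation for your CM version.

import requests

CM = "https://cm.example.com:7183/api/v19"   # placeholder host and API version
AUTH = ("admin", "admin")                    # placeholder credentials

# Cluster-wide Impala CPU/memory rates via a tsquery against the timeseries endpoint.
resp = requests.get(
    CM + "/timeseries",
    params={"query": "SELECT impala_query_cm_cpu_milliseconds_rate, "
                     "impala_query_memory_accrual_rate"},
    auth=AUTH,
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    for series in item.get("timeSeries", []):
        name = series["metadata"]["metricName"]
        points = series.get("data", [])
        print(name, points[-1] if points else "no data")

# Per-query details (the data behind the Impala Queries page, e.g. thread CPU
# time and per-query memory; attribute names depend on your CM version) are
# available from the Impala service's impalaQueries endpoint.
q = requests.get(
    CM + "/clusters/MyCluster/services/impala/impalaQueries",
    params={"from": "2023-01-01T00:00:00", "limit": 10},
    auth=AUTH,
)
for query in q.json().get("queries", []):
    print(query.get("queryId"), query.get("attributes", {}))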
That is all the information available from Impala.

Related

Suggestion for Non Analytical Distributed Processing Frameworks

Can someone please suggest a tool, framework or service to perform the task below faster?
Input: a CSV file consisting of an identifier and several image columns, with over a million rows.
Objective: to check whether any of the image columns in a row meets the minimum resolution, and to create a new boolean column for every row with the result (a rough sketch of this check follows the question).
True - if any image in the row meets the minimum resolution
False - if no image in the row meets the minimum resolution
Current implementation: a Python script with pandas and multiprocessing running on a large VM (60-core CPU), which takes about 4-5 hours. Since this is a periodic task, we schedule and manage it with Cloud Workflow and a Celery backend.
Note: we are looking to cut down on costs, as the server is only up for about 4-6 hours a day; a 60-core CPU running 24x7 would waste a lot of resources.
Options Explored:
We have ruled out Cloud Run due to its memory, CPU and timeout limitations.
Apache Beam with Cloud Dataflow: there seems to be less support for non-analytical workloads, and the DataFrame implementation in Apache Beam still looks buggy.
Spark on Dataproc seems to be good for analytical workloads, although a serverless option would be much preferred.
Which direction should I be looking in?
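For reference, here is roughly what the per-row check described above looks like with pandas and Pillow; the column names, minimum resolution, and the idea of fetching each image over HTTP are assumptions for illustration, not part of the original setup.

import pandas as pd
import requests
from io import BytesIO
from PIL import Image  # Pillow

MIN_WIDTH, MIN_HEIGHT = 800, 600  # assumed minimum resolution

def meets_min_resolution(url):
    # Download one image and compare its dimensions against the minimum.
    try:
        img = Image.open(BytesIO(requests.get(url, timeout=10).content))
        return img.width >= MIN_WIDTH and img.height >= MIN_HEIGHT
    except Exception:
        return False  # unreachable or broken images count as not meeting it

def row_has_valid_image(row, image_cols):
    # True if any image column in this row meets the minimum resolution.
    return any(meets_min_resolution(row[c]) for c in image_cols if pd.notna(row[c]))

df = pd.read_csv("input.csv")                                   # placeholder file name
image_cols = [c for c in df.columns if c.startswith("image_")]  # assumed column naming
df["meets_min_resolution"] = df.apply(row_has_valid_image, axis=1, image_cols=image_cols)

Whatever framework is chosen ultimately has to distribute this row-wise function; the workload is I/O-heavy (image downloads) rather than analytical.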

Matillion: How to identify performance bottleneck

We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32 GB RAM) and the database (Snowflake) don't get very busy and seem to be more or less idle most of the time (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced serial for that reason.
You could try bumping up the number of concurrent connections under the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
Transactions (begin, commit etc.) will force transformation jobs to run in serial again.
If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
Steps that are chained one after another in an orchestration job will be done in sequence, while steps that branch out from the same component will be done in parallel (and will depend on Snowflake warehouse size to scale).
Also - try the Alter Warehouse Component with a higher concurrency level
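For reference, the Alter Warehouse Component ultimately just issues Snowflake DDL, so you can experiment with the same setting directly from Python; the warehouse name and value below are placeholders, and raising MAX_CONCURRENCY_LEVEL only helps if statements are actually queueing on the warehouse.

import os
import snowflake.connector

# Placeholder connection details -- substitute your own account and credentials.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH",
)
cur = conn.cursor()
try:
    # Allow more statements to run concurrently on the warehouse before queueing
    # (the Snowflake default is 8); resizing or scaling out the warehouse is the
    # other lever if queries are competing for compute.
    cur.execute("ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 12")
finally:
    cur.close()
    conn.close()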

Google Dataflow instance and BigQuery cost considerations

I am planning to spin up a Dataflow instance on Google Cloud Platform to run some experiments. I want to get familiar and experiment with using Apache Beam to pull data from BigQuery, run some ETL jobs (in Python) and streaming jobs, and finally store the results in BigQuery.
However, I am also concerned about sending my company's GCP bill through the roof. What are the main cost considerations, and are there any methods to estimate what the cost will be, so I don't get an earful from my boss?
Any help would be greatly appreciated, thanks!
You can use the Google Cloud pricing calculator to get an estimate of the price of the job.
One of the most important resources on the Dataflow side is CPU hours. To limit the CPU hours, you can cap the number of workers using the maxNumWorkers option (max_num_workers in the Python SDK) in your pipeline.
Here are more pipeline options that you can set while running your dataflow job https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
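As a rough illustration in the Python SDK, the worker cap and machine type can be set directly on the pipeline options; the project, region, bucket and table names below are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket/region values -- substitute your own.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    max_num_workers=5,             # caps autoscaling, and therefore CPU hours
    machine_type="n1-standard-2",  # smaller workers also keep vCPU cost down
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromBigQuery(
            query="SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`",
            use_standard_sql=True)
        | "Transform" >> beam.Map(lambda row: {"word": row["word"], "doubled": row["word_count"] * 2})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="word:STRING,doubled:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )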
For BQ, you can do a similar estimate using the calculator.
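Since on-demand BigQuery queries are billed by bytes scanned, a dry run (which is free) gives you the number to plug into the calculator; here is a small sketch with the google-cloud-bigquery client, using a placeholder project and a public sample table.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# A dry run costs nothing and reports how many bytes the query would scan.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed / 2**30:.2f} GiB")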

Why does the vertex count work only in Development mode in DSE Graph

When I try to get the vertex count in DSE Graph using
g.V().count()
in Production mode, the error below is displayed.
Could not find a suitable index to answer graph query and graph scans are disabled
But the same query works in Development mode. Am I doing something wrong?
I am following the quick start guide below, but created my own schema:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/QuickStartStudio.html
Schema:
schema.propertyKey("id").Int().single().create()
schema.propertyKey("name").Text().single().create()
schema.edgeLabel("created").multiple().create()
schema.vertexLabel("feed").properties("id").create()
schema.vertexLabel("user").properties("id", "name").create()
schema.vertexLabel("feed").index("byFeedId").materialized().by("id").add()
schema.vertexLabel("user").index("byUser").materialized().by("id").add()
TL;DR
Counting across your vertices is an expensive query that is not expected to be done in production in an OLTP use case.
Why?
Vertices are stored in an adjacency list by vertex type in DSE Graph, and to get a count of all the vertices you would have to do a full table scan for each of these, which is impractical in a distributed transactional system (you'd have to hit an entire replica set, which could be a lot of nodes).
It's a real query. What do I do?
If this is a real use case, it is probably an analytical use case, in which case you should be leveraging the Spark graph computer by running the query in analytics mode.
Note: the full table scan may still take time but it will be executed in massively distributed fashion via DSE Analytics.
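If you want to run the same count from code rather than from Studio or the Gremlin console, the DataStax Python driver can route a traversal to the analytics (Spark) traversal source via its built-in analytics execution profile. Treat the following as a sketch only: the contact point and graph name are placeholders, and the exact import paths (cassandra.cluster vs. the older dse package) depend on your driver version, so verify them against the driver documentation.

from cassandra.cluster import (
    Cluster,
    EXEC_PROFILE_GRAPH_ANALYTICS_DEFAULT,
    GraphAnalyticsExecutionProfile,
)
from cassandra.graph import GraphOptions

# Placeholder contact point and graph name -- substitute your own.
profile = GraphAnalyticsExecutionProfile(
    graph_options=GraphOptions(graph_name="my_graph", graph_source="a")
)
cluster = Cluster(
    ["10.0.0.1"],
    execution_profiles={EXEC_PROFILE_GRAPH_ANALYTICS_DEFAULT: profile},
)
session = cluster.connect()

# Executing through the analytics profile hands the traversal to the Spark
# graph computer (OLAP), so the full scan is permitted and distributed.
for row in session.execute_graph(
    "g.V().count()", execution_profile=EXEC_PROFILE_GRAPH_ANALYTICS_DEFAULT
):
    print(row.value)

cluster.shutdown()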

Google BigQuery basic questions

These may be a few basic questions.
When I load data into BQ tables, where exactly is the data stored (if billing is already enabled)? If it is a data center, what is the data center capacity? Does our data co-exist with other users' data?
When we fire queries, how are they processed? What is the default compute engine used for this?
How can we increase query processing capacity?
Thanks
CP
BigQuery datacenter capacity is practically unlimited. If you plan to upload petabytes in a very short time frame you might need to contact support first just to make sure, but for normal big loads everything should be fine.
BigQuery doesn't use Compute Engine; instead, queries run on a series of very large shared clusters. That's the secret to a low cost per query, without the ongoing per-hour costs of other alternatives.
BigQuery elastically increases the number of CPUs (slots) involved in your query as the query needs them. You don't need to manage storage or processing capacity.
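To see that elasticity in practice, you can inspect the slot time (BigQuery's internal unit of compute) a finished query consumed using the google-cloud-bigquery Python client; the project name below is a placeholder and the table is a public sample.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job = client.query(
    "SELECT COUNT(*) AS n FROM `bigquery-public-data.samples.shakespeare`"
)
rows = list(job.result())  # waits for the query to finish

print("row count:", rows[0].n)
print("bytes processed:", job.total_bytes_processed)
# slot_millis is the total slot (virtual CPU) time spent on the query; it is
# allocated per query by the service, not something you provision yourself.
print("slot milliseconds:", job.slot_millis)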