How to get Impala queries

I need to monitor and gather statistics on Impala's query history.
The Cloudera Manager UI can show the Impala query history.
Does Impala have any RESTful API to get the query history?

On my cluster, $MYMACHINE:25000/queries (the Impala daemon's debug web UI) shows a list of queries.
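For a quick scrape, many pages of that debug web UI can return JSON instead of HTML. A minimal sketch, assuming your Impala version supports the json parameter on /queries; the key names below are version-dependent assumptions, so inspect the payload first:

    import requests

    # Hypothetical host; 25000 is the default impalad debug web UI port.
    resp = requests.get("http://MYMACHINE:25000/queries",
                        params={"json": ""}, timeout=10)
    resp.raise_for_status()
    data = resp.json()

    # Key names such as "completed_queries" vary by Impala version.
    for q in data.get("completed_queries", []):
        print(q.get("stmt"), q.get("state"))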

Yep, it sure does: http://cloudera.github.io/cm_api/apidocs/v10/
Many of this API's bugs have been worked out over the last few releases.
I am guessing the specific GET you will want is this one:
http://cloudera.github.io/cm_api/apidocs/v10/path__clusters_-clusterName-services-serviceName-_impalaQueries.html
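For illustration, a minimal sketch of calling that endpoint with plain HTTP from Python. The host, credentials, cluster name, and service name are placeholders, and the exact request parameters and response fields should be checked against the apidocs:

    import requests

    CM_HOST = "cm-host.example.com"   # hypothetical Cloudera Manager host
    AUTH = ("admin", "admin")         # replace with real CM credentials

    url = (f"http://{CM_HOST}:7180/api/v10/clusters/Cluster1"
           "/services/impala/impalaQueries")
    params = {
        "from": "2016-01-01T00:00:00",   # time window for the history
        "to": "2016-01-02T00:00:00",
        "limit": 100,
        "offset": 0,
    }
    resp = requests.get(url, params=params, auth=AUTH, timeout=30)
    resp.raise_for_status()
    for q in resp.json().get("queries", []):
        print(q.get("user"), q.get("statement"))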

CDH also has its own REST metrics API (the same one used for the Cloudera Manager charts):
e.g. Total Queries Across Impala Daemons: http://CDH_MGMT_HOST:7180/api/v6/timeseries?query=select+total_num_queries_rate_across_impalads+where+entityName%3D%22impala%22&contentType=application%2Fjson
Official CDH documentation: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_metrics_impala_daemon.html
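The same tsquery can be issued from a script; a sketch using the URL above, with placeholder host and credentials:

    import requests

    resp = requests.get(
        "http://CDH_MGMT_HOST:7180/api/v6/timeseries",
        params={
            "query": 'select total_num_queries_rate_across_impalads '
                     'where entityName = "impala"',
            "contentType": "application/json",
        },
        auth=("admin", "admin"),   # replace with real CM credentials
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())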

Related

Power BI query is not visible in BigQuery query history

More of a curiosity question, really. I load data into a Power BI report from Google BigQuery (using the native Google BigQuery connector in Power BI). All works fine, but for some reason I don't see this query in BigQuery's query history.
Has anyone experienced something similar and knows why this happens, or how to change it (if at all possible)?
If I do exactly the same thing using the Simba ODBC connector, I do see the query in BigQuery's query history, as expected.
Never seen that before; I have always been able to find the query history no matter which third-party connection I used. Could you confirm which GCP service account (or user account) and which GCP project the BigQuery query jobs run under when you use the native Google BigQuery connector in Power BI?
Please make sure you have access to the query history of that GCP account in that BigQuery job project.
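One way to check this is to list the project's recent jobs programmatically and see which account (if any) the Power BI queries run under. A minimal sketch with the google-cloud-bigquery client library, using a placeholder project id:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # hypothetical project id
    # all_users=True requires permission to see other users' jobs in the project.
    for job in client.list_jobs(all_users=True, max_results=20):
        print(job.job_id, job.user_email, job.job_type)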

Dump materialize aggregation from BigQuery to SQL server, Dataflow vs Airflow

I use a BigQuery dataset as a data lake to store all record/event-level data, and a SQL Server instance to store aggregated reports that are updated regularly. Because the reports will be accessed frequently by clients via a web interface, and each report aggregates a large amount of data, storing them in BigQuery is a no-go.
What is the best practice for doing this? Internally we have two ideas going around:
Run a Dataflow batch job every X hours to recalculate the aggregation and update SQL Server. It will need a scheduler to trigger the job, and the same job can be used to backfill all data.
Run an Airflow job that does the same thing. A separate job will be needed for backfill (but can still share most of the code with the regular job).
I know Dataflow does well at processing chunks of data in parallel, but I wonder about Airflow's performance, as well as the risk of exhausting the connection limit.
Please check this answer from a previous, similar question.
In conclusion: using Airflow will give you a more efficient way to manage the whole process from the workflow side. A solution Google offers based on Airflow is Cloud Composer. A minimal sketch of the Airflow variant follows.
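For concreteness, here is a hypothetical sketch using a plain PythonOperator; the table names, connection string, and aggregation SQL are all placeholders, and the load is a naive full refresh rather than a proper upsert:

    from datetime import datetime, timedelta

    import pyodbc
    from google.cloud import bigquery
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    AGG_SQL = """
        SELECT client_id, DATE(event_ts) AS day, COUNT(*) AS events
        FROM `my-project.lake.events`   -- hypothetical source table
        GROUP BY client_id, day
    """

    def aggregate_and_load(**_):
        # Run the aggregation in BigQuery, then rewrite the report table.
        rows = bigquery.Client().query(AGG_SQL).result()
        conn = pyodbc.connect("DSN=reports;UID=etl;PWD=secret")   # placeholder DSN
        cur = conn.cursor()
        cur.execute("TRUNCATE TABLE dbo.daily_events")   # full refresh for simplicity
        cur.executemany(
            "INSERT INTO dbo.daily_events (client_id, day, events) VALUES (?, ?, ?)",
            [(r.client_id, r.day, r.events) for r in rows],
        )
        conn.commit()

    with DAG(
        dag_id="bq_to_sqlserver_aggregates",
        start_date=datetime(2019, 1, 1),
        schedule_interval=timedelta(hours=6),   # "every X hr"
        catchup=False,
    ) as dag:
        PythonOperator(task_id="aggregate_and_load",
                       python_callable=aggregate_and_load)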

Accessing BigQuery from Presto

I'm curious whether there is a way to connect BigQuery to one's Presto catalog. I don't see anything from either project that references the other, and nothing in Presto's roadmap suggests a BigQuery connector is forthcoming. BigQuery has ODBC/JDBC drivers, which is promising, and Presto has Postgres and MySQL connectors, but I'm not seeing a way to connect one to the other. Any ideas? Thanks!
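For reference, a Presto catalog is just a properties file under etc/catalog/ naming a connector; this is roughly what the existing PostgreSQL one looks like (values are placeholders), and there is no analogous connector.name=bigquery to point it at:

    # etc/catalog/postgresql.properties
    connector.name=postgresql
    connection-url=jdbc:postgresql://example.net:5432/mydb
    connection-user=presto
    connection-password=secret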

Data migration from Teradata to BigQuery

My requirement is to migrate data from a Teradata database to Google BigQuery, where the table structure and schema remain unchanged. Later, using the BigQuery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH to export data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
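A sketch of the second half of that pipeline, loading the Avro files from GCS into BigQuery with the Python client library (the bucket, path, and destination table are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.load_table_from_uri(
        "gs://my-bucket/teradata-export/*.avro",   # hypothetical TDCH output path
        "my-project.warehouse.orders",             # destination table
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.AVRO,   # schema is read from the Avro files
        ),
    )
    job.result()   # block until the load finishes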
I was part of a team that addressed this issue in a whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try to use cloud composer(apache airflow) or install apache airflow in instance.
If you can open the ports from Teradata DB then you can run 'gsutil' command from there and schedule it via airflow/composer to run the jobs on daily basis. Its quick and you can leverage the scheduling capabilities of airflow.
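A sketch of that scheduling idea as a daily Airflow/Composer DAG; the export step itself is Teradata-specific and left as a placeholder, and the paths and bucket are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="teradata_export_to_gcs",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="upload_export",
            # Assumes an earlier (Teradata-specific) step wrote /exports/orders_{{ ds }}.csv
            bash_command="gsutil cp /exports/orders_{{ ds }}.csv gs://my-bucket/teradata/",
        )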
BigQuery introduced the Migration Service, which is a comprehensive solution for migrating a data warehouse to BigQuery. It includes free-to-use tools that help with each phase of the migration, from assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

Impala or Hive with Spark as execution engine?

I want to design a web UI which fetches data from HDFS and generates some reports from that data, in my own custom report format. I am writing REST APIs to fetch the data, but running Hive queries gives latency issues, so I want a different approach. I could think of two:
Using Impala to create tables, but I am not sure about REST support for Impala.
Using Hive, but with Spark as the execution engine instead of MapReduce;
spark-jobserver provides REST support and fetches data with Spark SQL.
Which of these approaches would be suitable, or is there a better approach?
Can anyone please help, as I am very new to this.
I'd prefer Impala if latency is the main consideration. It's dedicated to SQL processing on HDFS and does it well. As for the REST API and the application logic you are building, this seems to be a good example.
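To make the Impala option concrete, a tiny sketch of a REST endpoint that runs an Impala query through the impyla client; the host, port, and SQL are placeholders, and a real service would add connection pooling, auth, and pagination:

    from flask import Flask, jsonify
    from impala.dbapi import connect

    app = Flask(__name__)

    @app.route("/reports/daily")
    def daily_report():
        # Hypothetical impalad host; 21050 is the default HiveServer2-protocol port.
        conn = connect(host="impalad-host", port=21050)
        cur = conn.cursor()
        cur.execute("SELECT day, COUNT(*) AS n FROM events GROUP BY day")  # placeholder SQL
        rows = cur.fetchall()
        conn.close()
        return jsonify(rows)

    if __name__ == "__main__":
        app.run()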