Accessing BigQuery from Presto

I'm curious if there is a way to connect BigQuery to one's Presto catalog. I don't see anything from either project that references the other, and nothing in Presto's roadmap suggests that a BigQuery connector is forthcoming. BigQuery has ODBC/JDBC drivers, which is promising, and Presto has Postgres and MySQL connectors, but I'm not seeing a way to connect one to the other. Any ideas? Thanks!

Related

Audit hive table

I have a Hive table, let's call it table A. My requirement is to capture all the DML and DDL operations on table A in table B. Is there any way to capture this?
Thanks in advance.
I have not come across any such tool; however, Cloudera Navigator helps to manage this. Refer to the detailed documentation.
Cloudera Navigator
Cloudera Navigator auditing supports tracking access to:
- HDFS entities accessed by HDFS, Hive, HBase, Impala, and Solr services
- HBase and Impala
- Hive metadata
- Sentry
- Solr
- Cloudera Navigator Metadata Server
Alternatively, if you are not using the Cloudera distribution, you can still check the Hive metastore log file under /var/log/hive/hadoop-cmf-hive-HIVEMETASTORE.log.out for the changes applied to the different tables.
I haven't used Apache Atlas yet, but from the documentation it looks like it has an audit store and a Hive bridge. That works for operational events as well.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/atlas-overview/content/apache_atlas_features.html

Data migration from Teradata to BigQuery

My requirement is to migrate data from a Teradata database to Google BigQuery, keeping the table structure and schema unchanged. Later, using the BigQuery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH (the Teradata Connector for Hadoop) to export the data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
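Once the Avro files are in a Cloud Storage bucket, one way to handle the import step is BigQuery's LOAD DATA statement (the bq load CLI works as well). A minimal sketch, with a hypothetical bucket, dataset, and table name:

    -- Load the TDCH-exported Avro files from GCS into a BigQuery table
    -- (dataset, table, and bucket names are illustrative placeholders).
    LOAD DATA INTO teradata_migration.customer
    FROM FILES (
      format = 'AVRO',
      uris = ['gs://my-migration-bucket/customer/*.avro']
    );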
I was part of a team that addressed this issue in a Whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try Cloud Composer (Apache Airflow) or install Apache Airflow on an instance.
If you can open the ports from the Teradata DB, then you can run the 'gsutil' command from there and schedule it via Airflow/Composer to run the jobs on a daily basis. It's quick, and you can leverage the scheduling capabilities of Airflow.
BigQuery introduced the Migration Service, which is a comprehensive solution for migrating a data warehouse to BigQuery. It includes free-to-use tools that help with each phase of migration, from assessment and planning through execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

Presto and Hive

I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only needed the Hive metastore to save my table configurations for Presto, or whether I have to create the tables in Hive first.
The documentation makes it seem that you can use Presto without having to configure Hive, just using Hive syntax. Is that accurate? In my experience, I have not been able to get the connection to AWS S3 working.
Presto syntax is similar to Hive syntax. For most simple queries, the same syntax works in both. However, there are some key differences that mean Presto and Hive are not entirely interchangeable. For example, in Hive you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST (compare the sketch below). There are many such nuanced syntactic differences between the two.
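To make the comparison concrete, here is a sketch against a hypothetical table orders with an array column items (table and column names are illustrative, not from the question):

    -- Hive: expand an array column with LATERAL VIEW EXPLODE
    SELECT id, item
    FROM orders
    LATERAL VIEW explode(items) t AS item;

    -- Presto: the equivalent expansion with CROSS JOIN UNNEST
    SELECT id, item
    FROM orders
    CROSS JOIN UNNEST(items) AS t (item);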
It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only the distributed execution engine; it lacks metadata information about tables. The Presto coordinator therefore needs the Hive metastore to retrieve table metadata in order to parse and execute a query.
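As a rough sketch of what that looks like in practice (the bucket path, schema, and the catalog name hive are illustrative assumptions), you first register the CSV files as an external table in the Hive metastore, and only then can Presto query them:

    -- Hive DDL: register the S3 CSV files in the metastore as an external table
    CREATE EXTERNAL TABLE logs_csv (
      event_time STRING,
      user_id    STRING,
      amount     DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://my-bucket/logs/';

    -- Presto: query the same table through the Hive connector
    SELECT user_id, sum(amount) AS total
    FROM hive.default.logs_csv
    GROUP BY user_id;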
However, you can use AWS Athena, which is managed Presto, to run queries on top of S3.
Another option: the recent 0.198 release of Presto adds the capability to connect to AWS Glue and retrieve table metadata for files in S3.
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query and process CSV data living in S3 (a quick sketch follows below).
Also, I'm curious: what solution did you end up implementing to resolve your issue?
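For reference, a minimal Spark SQL sketch of that approach, assuming a hypothetical bucket path and column names:

    -- Spark SQL: expose the S3 CSV files as a temporary view, then query them
    CREATE TEMPORARY VIEW sales_csv
    USING csv
    OPTIONS (path 's3a://my-bucket/sales/', header 'true', inferSchema 'true');

    SELECT country, count(*) AS order_count
    FROM sales_csv
    GROUP BY country;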

How to get Impala queries

I need to monitor and gather statistics on Impala query history.
The Cloudera Manager UI can show the Impala query history.
Does Impala have any RESTful API to get the query history?
On my cluster, $MYMACHINE:25000/queries has a list of queries.
Yep, it sure does: http://cloudera.github.io/cm_api/apidocs/v10/
A lot of the bugs in this API have been worked out over the last few releases.
I am guessing the specific get() you will want will come from here:
http://cloudera.github.io/cm_api/apidocs/v10/path__clusters_-clusterName-services-serviceName-_impalaQueries.html
CDH has its own REST metrics API (the same one used for Cloudera Charts):
e.g. Total Queries Across Impala Daemons: http://CDH_MGMT_HOST:7180/api/v6/timeseries?query=select+total_num_queries_rate_across_impalads+where+entityName%3D%22impala%22&contentType=application%2Fjson
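For readability, the URL-encoded tsquery in that request decodes to:

    select total_num_queries_rate_across_impalads where entityName="impala"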
Official CDH documentation: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_metrics_impala_daemon.html

How to set up Hive metastore off Redshift

I couldn't find a way to set up the Hive metastore on Redshift. I am wondering whether anyone has tried this. Also, since Redshift is based on PostgreSQL, maybe it is possible. Please share if you have any experience.
I am new to Hive and am using CDH5.4.
Redshift, as a DSS (decision-support) database, isn't suitable for storing the Hive metastore by definition. Use the RDS service for that purpose.