Apache Drill and Apache Kudu: unable to run "select * from <some table>" in Apache Drill against a table created in Kudu through Apache Impala

I'm able to connect to Kudu through Apache Drill and can list tables fine. But when I try to fetch data from the table "impala::default.customer" shown below, none of the options I have tried works for me.
The table in Kudu was created through Impala-Shell as an external table.
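For context, the question doesn't show the DDL; below is a hedged sketch of the kind of impala-shell statement that produces a table with this naming (the column list is omitted and the kudu.table_name value is an assumption, since the impala:: prefix is what Impala generates for tables it manages):
-- hypothetical DDL, not from the question
CREATE EXTERNAL TABLE customer
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'impala::default.customer');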
Initial connection to Kudu and listing objects
ubuntu@ubuntu-VirtualBox:~/Downloads/apache-drill-1.19.0/bin$ sudo ./drill-embedded
Apache Drill 1.19.0
"A Drill is a terrible thing to waste."
apache drill> SHOW DATABASES;
+--------------------+
| SCHEMA_NAME        |
+--------------------+
| cp.default         |
| dfs.default        |
| dfs.root           |
| dfs.tmp            |
| information_schema |
| kudu               |
| sys                |
+--------------------+
7 rows selected (24.818 seconds)
apache drill> use kudu;
+------+----------------------------------+
| ok   | summary                          |
+------+----------------------------------+
| true | Default schema changed to [kudu] |
+------+----------------------------------+
1 row selected (0.357 seconds)
apache drill (kudu)> SHOW TABLES;
+--------------+--------------------------------+
| TABLE_SCHEMA | TABLE_NAME                     |
+--------------+--------------------------------+
| kudu         | impala::default.customer       |
| kudu         | impala::default.my_first_table |
+--------------+--------------------------------+
2 rows selected (9.045 seconds)
Now, when I try to run "select * from impala::default.customer", I'm not able to run it at all. These were my attempts:
Attempt 1:
apache drill (kudu)> SELECT * FROM `impala::default`.customer;
Error: VALIDATION ERROR: Schema [[impala::default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 8a4ca4da-2488-4775-b2f3-443b8b4b17ef ] (state=,code=0)

Attempt 2:
apache drill (kudu)> SELECT * FROM `default`.customer;
Error: VALIDATION ERROR: Schema [[default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: ce96ea13-392f-4910-9f6c-789a6052b5c1 ] (state=,code=0)
Attempt 3:
apache drill (kudu)> SELECT * FROM `impala`::`default`.customer;
Error: PARSE ERROR: Encountered ":" at line 1, column 23.
SQL Query: SELECT * FROM `impala`::`default`.customer
                                 ^
[Error Id: 5aacdd98-db6e-4308-9b33-90118efa3625 ] (state=,code=0)
Attempt 4:
apache drill (kudu)> SELECT * FROM `impala::`.`default`.customer;
Error: VALIDATION ERROR: Schema [[impala::, default]] is not valid with respect to either root schema or current default schema.
Current default schema: kudu
[Error Id: 5450bd90-dfcd-4efe-a8d3-b517be85b10a ] (state=,code=0)

In Drill, the first part of the FROM clause is the storage plugin, in this case kudu. When you ran the SHOW TABLES query, you saw that the table name is actually impala::default.my_first_table. If I'm reading that correctly, that whole string is the table name, and the query below is how you should escape it.
Note the backtick before impala and after my_first_table, but nowhere else.
SELECT *
FROM kudu.`impala::default.my_first_table`
Does that work for you?
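Applying the same quoting rule to the table from the original attempts (this is just the answer's pattern applied to the other name SHOW TABLES reported):
SELECT *
FROM kudu.`impala::default.customer`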

Related

Insufficient Resources error while inserting into SQL table using Vertica

I'm running a Python script to load data from a DataFrame into a SQL Table. However, the insert command is throwing this error:
(pyodbc.Error) ('HY000', '[HY000] ERROR 3587: Insufficient resources to execute plan on pool fastlane [Request exceeds session memory cap: 28357027KB > 20971520KB]\n (3587) (SQLExecDirectW)')
This is my code:
df.to_sql('TableName', engine, schema='trw', if_exists='append', index=False)  # copy data from the DataFrame df into a SQL table
Can you do the following for me:
Run the command below and share the output. MAXMEMORYSIZE, MEMORYSIZE and MAXQUERYMEMORYSIZE, plus PLANNEDCONCURRENCY, give you an idea of the (memory) budget at the time the query / COPY command was planned.
gessnerm@gessnerm-HP-ZBook-15-G3:~/1/fam/fam-ostschweiz$ vsql -x -c \
"select * from resource_pools where name='fastlane'"
-[ RECORD 1 ]------------+------------------
pool_id | 45035996273841188
name | fastlane
is_internal | f
memorysize | 0%
maxmemorysize |
maxquerymemorysize |
executionparallelism | 16
priority | 0
runtimepriority | MEDIUM
runtimeprioritythreshold | 2
queuetimeout | 00:05
plannedconcurrency | 2
maxconcurrency |
runtimecap |
singleinitiator | f
cpuaffinityset |
cpuaffinitymode | ANY
cascadeto |
Then, you should dig the actual SQL command that your Python script triggered out of the QUERY_REQUESTS system table. It should be in the format of:
COPY <_the_target_table_>
FROM STDIN DELIMITER ',' ENCLOSED BY '"'
DIRECT REJECTED DATA '<_bad_file_name_>'
or similar.
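A query along these lines should surface it (a hedged sketch; QUERY_REQUESTS has many more columns, and the filter may need adjusting to your workload):
SELECT start_timestamp, request
FROM query_requests
WHERE request ILIKE 'COPY%'
ORDER BY start_timestamp DESC
LIMIT 10;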
Then: how big is the file / how big are the files you're trying to load in one go? If too big, then B.Muthamizhselvi is right - you'll need to split the data volume you load into smaller portions.
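From the Python side, one hedged way to portion the load, reusing the df and engine from the question, is pandas' chunksize parameter; the batch size below is an assumption to tune against your session memory cap:
df.to_sql('TableName', engine, schema='trw', if_exists='append',
          index=False, chunksize=10000)  # insert in batches of 10000 rows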
Can you also run:
vsql -c "SELECT EXPORT_OBJECTS('','<schema>.<table>',FALSE)"
... and share the output? It could well be that you have too many projections for the memory to be enough, or that you are sorting by too many columns.
Hope this helps for starters ...

How to fix: cannot retrieve all fields of a MongoDB collection with an Apache Drill SQL query

I'm trying to retrieve all (*) columns from a MongoDB object with an Apache Drill SQL query that filters on the expression:
`_id`.`$oid`
Background: I'm using Apache Drill to query MongoDB collections. By default, Drill retrieves the ObjectId values in a different format than the one stored in the database. For example:
Mongo: ObjectId("59f2c3eba83a576fe07c735c")
Drill query result: [B@3149a…]
In order to get the data in String format (59f2c3eba83a576fe07c735c) from the object, I changed the Drill config "store.mongo.bson.record.reader" to "false".
ALTER SESSION SET `store.mongo.bson.record.reader` = false;
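(A hedged aside, not part of the original question: you can confirm the session option took effect via Drill's sys.options table.)
SELECT * FROM sys.options WHERE name = 'store.mongo.bson.record.reader';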
Drill query result after config set to false:
select * from calc;
+-------------------------------------+--------+
| _id                                 | name   |
+-------------------------------------+--------+
| {"$oid":"5cb0e161f0849231dfe16d99"} | thiago |
+-------------------------------------+--------+
Running a query by _id:
select `o`.`_id`.`$oid` , `o`.`name` from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
Result:
+--------------------------+--------+
| EXPR$0                   | name   |
+--------------------------+--------+
| 5cb0e161f0849231dfe16d99 | thiago |
+--------------------------+--------+
For an object with a few columns like the one above (_id, name), it's fine to specify all the columns in the select-by-id query. However, in my production database, the objects have hundreds of columns.
If I try to query all (*) columns from the collection, this is the result:
select `o`.* from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
or
select * from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
+-----+
| ** |
+-----+
+-----+
No rows selected (6.112 seconds)
Expected result: retrieve all columns from a MongoDB collection instead of declaring all of them in the SQL query.
I have no suggestions here, because it is a bug in the MongoDB storage plugin.
I have created a Jira ticket for it; please take a look and feel free to add any related info there: DRILL-7176

Can I display column headings when querying via gcloud dataproc jobs submit spark-sql?

I'm issuing a spark-sql job to dataproc that simply displays some data from a table:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10"
When the data is returned and written to stdout, I don't see column headings; I only see the raw data, whitespace delimited. I'd really like the output to be formatted better, and specifically I'd like to see column headings. I tried this:
gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "SET hive.cli.print.header=true;select * from mydb.mytable limit 10"
But it had no effect.
Is there a way to get spark-sql to display column headings on dataproc?
If there is a way to get the data displayed like so:
+----+-------+
| ID | Name |
+----+-------+
| 1 | Jim |
| 2 | Ann |
| 3 | Simon |
+----+-------+
then that would be even better.
I have been performing some tests with a Dataproc cluster, and it looks like it is not possible to retrieve query results with the column names using Spark SQL. However, this is more an Apache Spark SQL issue than a Dataproc one, so I have added that tag to your question too, in order for it to receive better attention.
If you get into the Spark SQL console in your Dataproc cluster (by SSHing into the master and typing spark-sql), you will see that the result for SELECT queries does not include the column names:
SELECT * FROM mytable;
18/04/17 10:31:51 INFO org.apache.hadoop.mapred.FileInputFormat: Total input files to process : 3
2 Ann
1 Jim
3 Simon
There's no change if using SELECT ID FROM mytable; instead. Therefore, the issue is not in the gcloud dataproc jobs submit spark-sql command, but in the fact that Spark SQL does not provide this type of data.
If you do not necessarily have to use Spark SQL, you can try using Hive instead. Hive does provide the type of information you want (including the column names plus prettier formatting):
user@project:~$ gcloud dataproc jobs submit hive --cluster <CLUSTER_NAME> -e "SELECT * FROM mytable;"
Job [JOB_ID] submitted.
Waiting for job output...
+-------------+---------------+--+
| mytable.id  | mytable.name  |
+-------------+---------------+--+
| 2           | Ann           |
| 1           | Jim           |
| 3           | Simon         |
+-------------+---------------+--+
Job [JOB_ID] finished successfully.
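Another hedged alternative, beyond what the original answer covered: if the DataFrame API is acceptable, DataFrame.show() prints column headers in exactly the boxed format you asked for, and a small PySpark job can be submitted the same way (the file name headers_job.py is made up for this example):
# headers_job.py
from pyspark.sql import SparkSession

# Hive support lets spark.sql() resolve tables such as mydb.mytable
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SELECT * FROM mydb.mytable LIMIT 10").show()
Submit it with:
gcloud dataproc jobs submit pyspark headers_job.py --cluster mycluster --region europe-west1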

Issue when issuing a SELECT query to a Hive table in Prestodb

I am able to connect to my Hive metastore, and a DESCRIBE works:
DESCRIBE sample_07;
Query 20131113_025614_00005_af2fx, RUNNING, 1 node, 2 splits
   Column    |  Type   | Null | Partition Key
-------------+---------+------+---------------
 code        | varchar | true | false
 description | varchar | true | false
 total_emp   | bigint  | true | false
 salary      | bigint  | true | false
(4 rows)
However, a SELECT does not work:
select * from sample_07;
2013-11-12T16:54:58.611-0800 DEBUG query-scheduler-7 com.facebook.presto.execution.QueryStateMachine Query 20131113_005458_00004_af2fx is PLANNING
Query 20131113_005458_00004_af2fx failed: java.io.IOException: Failed on local exception: com.facebook.presto.hadoop.shaded.com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: "sandbox.hortonworks.com/xx.xx.2.15"; destination host is: "sandbox.hortonworks.com":8020;
presto:default> 2013-11-12T16:56:04.771-0800 ERROR Stage-20131113_005458_00004_af2fx.1-219 com.facebook.presto.execution.SqlStageExecution Error while starting stage 20131113_005458_00004_af2fx.1 ~[guava-15.0.jar:na]
at com.facebook.presto.hive.HiveSplitIterable$HiveSplitQueue.computeNext(HiveSplitIterable.java:433) ~[na:na]
at com.facebook.presto.hive.HiveSplitIterable$HiveSplitQueue.computeNext(HiveSplitIterable.java:392) ~[na:na]
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-15.0.jar:na]
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-15.0.jar:na]
As you can tell, I am using Hortonworks' sandbox, so it may be that that's the issue? Or is it choking on the IP address? I'm not completely sure I understand the problem.
cheers,
Matt
Your error message suggests that you are not running Presto against CDH4 but against the Hortonworks Sandbox, which I believe is Hadoop 2.2.0. There are known incompatibilities at this point. See this thread on the Presto Google Group for more information: https://groups.google.com/forum/#!topic/presto-users/lVLvMGP1sKE
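For reference, a hedged sketch of where that compatibility choice lived: Presto's Hive connector flavor is selected in the catalog properties file, and at the time the shipped flavors targeted CDH4-era Hadoop (the metastore port below is an assumption based on the usual default):
# etc/catalog/hive.properties -- hypothetical contents
connector.name=hive-cdh4
hive.metastore.uri=thrift://sandbox.hortonworks.com:9083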

Get list of MySQL databases, and server version?

My connection string for MySQL is:
"Server=localhost;User ID=root;Password=123;pooling=yes;charset=utf8;DataBase=.;"
My questions are:
What query should I write to get the database names that exist?
What query should I write to get the server version?
I get an error because my connection string ends with DataBase=.
What should I write instead of the dot?
SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
SELECT VARIABLE_NAME, VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES WHERE VARIABLE_NAME = 'VERSION'
Use INFORMATION_SCHEMA as the database.
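A hedged side note: on MySQL 5.7 and later, GLOBAL_VARIABLES moved out of INFORMATION_SCHEMA, so the same version lookup reads from performance_schema instead:
SELECT VARIABLE_NAME, VARIABLE_VALUE FROM performance_schema.global_variables WHERE VARIABLE_NAME = 'version';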
To get the list of databases, you can use SHOW DATABASES:
SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| test               |
+--------------------+
3 rows in set (0.01 sec)
To get the version number of your MySQL Server, you can use SELECT VERSION():
SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.1.45    |
+-----------+
1 row in set (0.01 sec)
As for the question about the connection string, you'd want to put a database name instead of the dot, such as Database=test.
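Putting that together with the original string, a hedged example using the test database would be:
"Server=localhost;User ID=root;Password=123;pooling=yes;charset=utf8;Database=test;"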
show databases;
will return all the registered databases. And
show variables;
will return a bunch of name-value pairs, one of which is the version number.