Situation:
I am aiming to retrieve the location info for a list of Hive external tables. For a single table the easiest way is just show create table table_name, but I have quite a number of tables, so I am looking for an alternative. I found that Hive has a sys db.
It seems this db stores the meta info of all tables, and I found the table sds, which stores the location info.
However, when I query this sds table with the simplest where clause, select * from sds where sd_id = a_sd_id, looking up only one table, it takes more than 50 seconds to return the result.
On the other hand, what is weird is that if I retrieve the same info with the show create table the_table_name command, all the table info, including the location, is returned in 0.05 seconds.
So my question is: when I trigger show create table, where does Hive retrieve this info from? Is it the same source I hit when I query the sys.sds table? If the two are the same source, the huge time gap between the two approaches cannot be explained.
Could anyone cast some light on why this happens, and how I can retrieve the location info as I expected, i.e. from the MySQL metastore, which returns as fast as the show create table command? I suppose show create table should be accessing MySQL. But if the sys db is a mapping of the MySQL db, why do queries on its tables return 100 times slower than show create table?
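For reference, the kind of sys-db lookup I am ultimately trying to run for a list of tables looks roughly like the sketch below; the table and column names are as they appear in my sys schema and may differ by Hive version.

-- Sketch only: join the table list to its storage descriptor to get locations.
SELECT t.tbl_name, s.location
FROM sys.tbls t
JOIN sys.sds s ON t.sd_id = s.sd_id
WHERE t.tbl_name IN ('table_a', 'table_b');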
We have a table in Azure Data Warehouse with 17 billion records. Now we have a scenario where we have to delete records from this table based on some where condition. We are writing Spark code in Scala in Azure Databricks notebooks.
We searched for different options to do this in Spark, but all of them suggest first reading the entire table, deleting the records from it, and then overwriting the entire table in the Data Warehouse. However, this approach will not work in our case due to the huge number of records in our table.
Can you please suggest how we can achieve this functionality using Spark/Scala?
1) Checked whether we can call a stored procedure through Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) Tried reading the entire table first to delete the records, but it goes into a never-ending loop.
It is possible to create a view with a select clause that matches your requirement, and then use that view instead of the base table.
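A minimal sketch of that idea in T-SQL, with placeholder table, column, and date names standing in for your actual table and delete condition:

-- dbo.FactEvents, EventDate, and the cutoff date are placeholders.
-- The view simply hides the rows you would otherwise have to delete.
CREATE VIEW dbo.FactEvents_Active AS
SELECT *
FROM dbo.FactEvents
WHERE EventDate >= '2018-01-01';

Downstream jobs can then read from the view instead of the base table, which avoids rewriting 17 billion rows.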
I have created a table using Drill and it is located at
/user/abc/drill/Drilltable.
Now I would like to load the data from DrillTable into HiveTable, which is located at the path
/user/hive/warehouse/userxyz.db
I am using the statement below to load the data:
INSERT INTO TABLE HiveTable select * from DrillTable;
I get the error
Table not found
and I am a bit confused about how to let Hive know the path of the Drill table.
What would be the right way to handle this?
Hive might be confused about the schema of the Drill data as well as its location. If you're willing to experiment, try something like this:
Store the data in a Drill format you can model in Hive, CSV for example, as described in this post.
In Hive, create an external table that defines the schema and location of the textual data. You can then convert the external table to a managed table (optional).
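For example, a sketch that assumes the Drill output is comma-delimited text under /user/abc/drill/Drilltable and has just two columns (the column names and types are assumptions; adjust them to your data):

-- External table pointing at the Drill output directory (schema is assumed).
CREATE EXTERNAL TABLE drill_table_ext (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/abc/drill/Drilltable';

-- The original INSERT then works against the external table:
INSERT INTO TABLE HiveTable SELECT * FROM drill_table_ext;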
Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Storage for the year 2016, stored in one bucket and partitioned by year, month, and date. Here I want to create a table partitioned by date.
Thanks in Advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016, you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs (one per date) to complete this process. Please note that you'll be charged for each query job based on the bytes processed.
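A minimal sketch of the per-day query in Standard SQL, assuming the GCS files are exposed through a (hypothetical) federated or staging table mydataset.raw_logs with a TIMESTAMP column event_time; the partition decorator is set as the query job's destination table, not in the SQL itself:

-- Filters one day of logs; write the result to YourLogs$20160501 by setting
-- the job's destination table. mydataset.raw_logs and event_time are placeholders.
SELECT *
FROM mydataset.raw_logs
WHERE DATE(event_time) = DATE '2016-05-01';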
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into a separate table named YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them either using Table Wildcard Functions (Legacy SQL) or a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples.
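For instance, a sketch of the Standard SQL wildcard form (the dataset name is a placeholder):

-- Scans every per-day table named YourLogs_YYYYMMDD for May 2016.
SELECT *
FROM `mydataset.YourLogs_*`
WHERE _TABLE_SUFFIX BETWEEN '20160501' AND '20160531';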
Option 2
You can create a date-partitioned table (just one table, YourLogs), but you will still need to load each daily file into its respective partition; see Creating and Updating Date-Partitioned Tables.
After the table is loaded, you can easily Query Date-Partitioned Tables.
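For example, a sketch of querying one month from the partitioned table in Standard SQL (the dataset name is a placeholder):

-- _PARTITIONTIME limits the scan to the May 2016 partitions.
SELECT *
FROM `mydataset.YourLogs`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-05-01') AND TIMESTAMP('2016-05-31');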
Having partitions on an external table is not allowed as of now. There is a feature request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.
If every query creates a temporary table, is it possible to find out the name of that temporary table in a subsequent SQL statement?
Per https://cloud.google.com/bigquery/querying-data:
All query results are saved to a table, which can be either persistent or temporary: A temporary table is a randomly named table saved in a special dataset; the table has a lifetime of approximately 24 hours. Temporary tables are not available for sharing, and are not visible using any of the standard list or other table manipulation methods.
For example:
/* query 1 */
select * from whatever;
/* query 2 */
select * from [temporary_table_that_just_got_created];
You cannot determine that name from within the BigQuery Web UI using BigQuery SQL.
But, yes, you can find it by looking in the Query History:
Find your query in the history list and expand it by clicking on it.
The name you are looking for is under "Destination Table".
It is also possible to do this with the BigQuery API, which is very useful in client applications.
Use the Jobs: get API and look at configuration.query.destinationTable (https://cloud.google.com/bigquery/docs/reference/v2/jobs#resource).
So I am stuck on this Teradata problem and I am looking to the community for advice as I am new to the TD platform. I am currently working with a Teradata Data Warehouse and have an interesting task to solve. Currently we store our information in a live production database but want to stage tables in another database before using FastExport to export the files. Basically we want to move our tables into a database to take a quick snapshot.
I have been exploring different solutions and am unsure how to proceed. I need to be able to automate a create-table process from one DB in Teradata to another. The tricky part is that I would like to create many tables off of the source table using a WHERE clause. For example, I have a transaction table and want to take a snapshot of it for a certain date range, month by month, meaning the original table Transaction would be split into many tables such as Transaction_May2001, Transaction_June2001, Transaction_July2001, and so on.
Thanks
This assumes that by two databases you are referring to the same physical installation of Teradata.
You can use the CREATE TABLE AS construct to accomplish this:
CREATE TABLE {MyDB}.Transaction_May2001
AS (
SELECT *
FROM Transaction
WHERE Transaction_Date BETWEEN DATE '2001-05-01' AND DATE '2001-05-31'
)
{UNIQUE} PRIMARY INDEX ({Same PI definition as Transaction Table})
WITH DATA AND STATS;
If you neglect to specify an explicit PI in the CREATE TABLE AS, Teradata will take the first column of the SELECT clause and use it as the PI of the new table.
Otherwise, you would be looking to use a Teradata utility as suggested by ryanbwork in the comment to your question.