I've tried to make a Hive view so we can start sqooping on the columns we need.
The statement used to create the view:
CREATE VIEW new_view (col1, col2, col3)
AS SELECT col1, col2, col3 FROM source_table;
However, the view has no apparent location within our Hadoop cluster. What's odd is that if we run a SELECT * FROM new_view; we get the data back.
But when we try to run an Oozie job that hooks into the view, we get a table not found error. The table isn't in the file browser either.
The reason you are getting this error is that your view has no HDFS location. Sqoop export looks for an export-dir, but since your view has no location behind it, the export fails.
A view is just a layer of abstraction on top of your Hive table (which maps to the HDFS location associated with it). The documentation describes views as follows:
A VIEW statement lets you create a shorthand abbreviation for a more complicated query. The base query can involve joins, expressions, reordered columns, column aliases, and other SQL features that can make a query hard to understand or maintain. It is purely a logical construct (an alias for a query) with no physical data behind it.
You may have to use the source_table for sqoop-export.
Note that a view is a purely logical object with no associated storage.
When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query.
You can read more here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView
So you might consider materializing the view's data into a table in order to access it from Oozie.
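A minimal sketch of that approach, assuming the view defined above (the table name and storage format here are illustrative, not taken from your setup):

CREATE TABLE new_view_export
STORED AS TEXTFILE
AS SELECT col1, col2, col3 FROM new_view;

The new table has a real warehouse directory (for example /user/hive/warehouse/new_view_export under a default configuration), so sqoop export can point --export-dir at that directory, or the Oozie action can reference the table directly.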
Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants exist for SQL Server, or for adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads data onto its own disks or into memory, Athena works by scanning data in S3. This is how you get the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
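A rough sketch of such a CTAS statement, with a hypothetical destination table name, bucket path, and column list (none of these are taken from your setup):

CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/table1_partitioned/',
  partitioned_by = ARRAY['year']
) AS
SELECT col1, col2, year
FROM table1;

The partition column has to come last in the SELECT list, and a single CTAS query in Athena can write at most 100 partitions, so for more partitions you would follow up with INSERT INTO statements.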
I have created a table using Drill and it is located at
/user/abc/drill/Drilltable.
Now I would like to load the data from DrillTable into HiveTable which is located at path
/user/hive/warehouse/userxyz.db
I am using below statement to load data
INSERT INTO TABLE HiveTable select * from DrillTable;
I get the error
Table not found
and I am a bit confused about how to let Hive know the path of the Drill table.
What would be the right way to handle this?
Hive might be confused about the schema of the Drill data as well as its location. If you're willing to experiment, try something like this:
Write the data out of Drill in a format you can model in Hive (CSV, for example), as described in this post.
In Hive, create an external table that defines the schema and location of the textual data. You can then convert the external table to a managed table (optional). For example:
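Assuming Drill wrote comma-delimited text under /user/abc/drill/Drilltable, and with placeholder column names and types:

CREATE EXTERNAL TABLE drill_staging (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/abc/drill/Drilltable';

INSERT INTO TABLE HiveTable
SELECT * FROM drill_staging;

Once the external table reads correctly, the INSERT copies its rows into the managed table; alternatively, you can skip the copy and keep querying the external table directly.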
I have a DWH with a few schemas. I regularly have to combine tables and views from different schemas to build new views. This is done in a REPORT schema that has synonyms for almost all the tables and views in the other schemas. All those tables and views also have privileges granted to REPORT.
Whenever I make a reference to a schema other than REPORT, DataGrip is not able to resolve that reference, reporting "Unable to resolve symbol.."
I am not sure whether this is needed for the referencing or not, but I have database connections to all schemas in the project.
Let's say I need col1 and col2 from a table srctable located in a schema DATA. I write all the code in the schema REPORT.
I have tried code like
SELECT
col1,
col2
FROM
srctable
And also
SELECT
col1,
col2
FROM
DATA.srctable
Since I get the data from the query, everything seems to be set up right. But I want to utilise the power of DataGrip, and it is annoying that I cannot get the references to work.
You need to select the relevant schemas in the database tree (so that DataGrip introspects them) for completion and reference resolution to work.
I have a lot of data in a Parquet based Hive table (Hive version 0.10). I have to add a few new columns to the table. I want the new columns to have data going forward. If the value is NULL for already loaded data, that is fine with me.
If I add the new columns and do not update the old Parquet files, it gives an error, which looks strange since I am only adding String columns.
Error getting row data with exception java.lang.UnsupportedOperationException: Cannot inspect java.util.ArrayList
Can you please tell me how to add new fields to a Parquet-backed Hive table without affecting the data already in the table?
I use Hive version 0.10.
Thanks.
1)
Hive, starting with version 0.13, has Parquet schema evolution built in.
https://issues.apache.org/jira/browse/HIVE-6456
https://github.com/Parquet/parquet-mr/pull/297
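On Hive 0.13 or later, adding the columns could look like the following sketch (the table and column names are made up):

ALTER TABLE my_parquet_table ADD COLUMNS (new_col1 STRING, new_col2 STRING);
-- old Parquet files do not contain the new columns, so existing rows come back as NULL
SELECT new_col1, new_col2 FROM my_parquet_table LIMIT 10;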
P.S. Note that out-of-the-box support for schema evolution might take a toll on performance. For example, Spark has a knob to turn Parquet schema merging on and off; after one of the more recent Spark releases it is off by default because of the performance hit (especially when there are a lot of Parquet files). I am not sure whether Hive 0.13+ has such a setting too.
2)
I also wanted to suggest creating views in Hive on top of such Parquet tables where you expect frequent schema changes, and using the views everywhere instead of the tables directly.
For example, if you have two tables, A and B, with compatible schemas, but table B has two more columns, you could work around this with:
CREATE VIEW view_1 AS
SELECT col1, col2, col3, NULL AS col4, NULL AS col5 FROM tableA
UNION ALL
SELECT col1, col2, col3, col4, col5 FROM tableB;
So you don't actually have to recreate any tables as #miljanm has suggested; you can just recreate the view. It'll help with the agility of your projects.
Create a new table with the two new columns. Insert data by issuing:
insert into new_table select old_table.col1, old_table.col2,...,null,null from old_table;
The last two nulls are for the two new columns. That's it.
If you have too many columns, it may be easier for you to write a program that reads the old files and writes the new ones.
Hive 0.10 does not have support for schema evolution in Parquet, as far as I know. Hive 0.13 does have it, so you may try to upgrade Hive.
I have a view that is returning four columns of data to be pushed to an external program. When I simply query the view ("Select * from schema.view_name") I get 10353 rows. When I run the actual SQL that created the view (I literally copied and pasted what Oracle had stored, minus the "Create or Replace" statement), I get 238745 rows.
Any ideas why this might occur?
Best guess: when you run the query standalone you're not running it in the same schema the view was created in (I am inferring this from the fact that you included the schema name in your example SELECT). The schema where you're running the query either has its own table with the same name as one of the base tables in the view, or one of the names is a synonym pointing to yet another view that contains only a subset of the rows in the underlying table.
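One way to check is to ask the data dictionary what each base-table name resolves to from the schema you are running the query in; 'BASE_TABLE' below is a placeholder for one of the view's base table names:

-- objects with that name that are visible to you, across all owners
SELECT owner, object_name, object_type
FROM   all_objects
WHERE  object_name = 'BASE_TABLE';

-- synonyms that redirect the name to another owner's object
SELECT owner, synonym_name, table_owner, table_name
FROM   all_synonyms
WHERE  synonym_name = 'BASE_TABLE';

If the standalone query resolves one of those names differently than the view's owning schema does, that would explain the different row counts.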