Apache Hive column comment with CTAS

Sorry for all the setup. This is a hive datatype and comment question.
I have a single file in HDFS which combines 4 sets of table data. Breaking the data out ahead of time is not my preferred option. The first 4 rows specify the column headers:
*1 col1, col2, col3
*2 cola, colb, colc, cold, col5e
etc....
Data rows begin with the number that matches position 1 of their header:
1 data, data, data,
2 data, data, data, data, data,
etc...
The base hive table is just col0 - col60 for the raw file. I've tried creating a CTAS table to hold all of the "1" columns and one for the "2" columns where I can specify data type, and comments. Since the column names vary, I cannot give the columns names on the base table nor can I comment them with column based metadata.
The DDL below doesn't work, but it shows what I'm hoping to do. Any thoughts?
CREATE TABLE foo (
col1 as meaningful_name string comment 'meaningful comment')
as
SELECT col1
FROM base_hive table
WHERE col1 = 1;
CREATE TABLE foo
as
SELECT col1 string comment 'meaningful comment'
FROM base_hive table
WHERE col1 = 1;
thanks TD

I don't fully understand what you are trying to achieve here, but looking at your DDL, I can see some syntax errors. For a correct CREATE TABLE AS SELECT, please use the DDL below:
CREATE TABLE foo (
col1 STRING COMMENT 'meaningful comment')
AS
SELECT col1 AS meaningful_name
FROM base_hive_table
WHERE col1 = 1;
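
If your Hive version rejects a column list on a CREATE TABLE ... AS SELECT statement, one alternative is a plain CREATE TABLE (with types and comments) followed by INSERT ... SELECT. As a rough sketch, the Python helper below only builds those two statements as strings for one record type. The helper name, the records_<type> table naming, and the choice of col0 as the column holding the record-type marker are assumptions for illustration, not something stated in the question.

```python
def build_hive_statements(record_type, columns,
                          base_table="base_hive_table",
                          type_column="col0"):
    """Build a CREATE TABLE and an INSERT ... SELECT for one record type.

    columns: list of (raw_column, meaningful_name, hive_type, comment).
    type_column is an ASSUMPTION: the raw column holding the record-type marker.
    Comments are assumed to contain no single quotes.
    """
    target = f"records_{record_type}"  # hypothetical naming scheme

    # Column definitions carry the type and the comment.
    col_defs = ",\n".join(
        f"  {name} {hive_type} COMMENT '{comment}'"
        for _, name, hive_type, comment in columns
    )
    create = f"CREATE TABLE {target} (\n{col_defs}\n)"

    # The SELECT maps each raw positional column to its meaningful name.
    select_list = ",\n".join(
        f"  CAST({raw} AS {hive_type}) AS {name}"
        for raw, name, hive_type, _ in columns
    )
    insert = (
        f"INSERT INTO TABLE {target}\n"
        f"SELECT\n{select_list}\n"
        f"FROM {base_table}\n"
        f"WHERE {type_column} = '{record_type}'"
    )
    return create, insert


create_sql, insert_sql = build_hive_statements(
    "1", [("col1", "meaningful_name", "STRING", "meaningful comment")]
)
print(create_sql)
print(insert_sql)
```

The column metadata could be filled in by hand or derived from the header rows of the file; either way the generated statements should be reviewed before running them against Hive.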

Related

How do I append query results to create new columns in a dataset on BigQuery?

I would like to alter my original dataset to include the results from a query. Currently year and month are stored together in one string; I would like to split the string and append the results to the original dataset as new columns.
I see that you are sourcing the value for your new column from your own source table.
You can do it two ways:
Idea 1:
Let's say you have a table with columns:
col1, col2
and you want to add col3, you can always do something like:
CREATE OR REPLACE table your_source_table as
select col1, col2, (your_calculation_for_col3) as col3 from your_source_table
Idea 2:
Add a new column to your table and update value of it like below:
ALTER TABLE your_source_table
ADD COLUMN COL3 DATA_TYPE_FOR_COL3;
UPDATE your_source_table
SET col3 = your_new_calculated_value
WHERE TRUE;
See if either of these helps.
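
To make Idea 2 concrete, here is a small runnable sketch using Python's built-in sqlite3 module as a stand-in for BigQuery. The year_month column, its 'YYYY-MM' format, and the amount column are assumptions, since the question does not show the real schema, and the string functions available in BigQuery may differ from SQLite's substr.

```python
import sqlite3


def add_year_month_columns(conn):
    """Add year and month columns and fill them from year_month ('YYYY-MM')."""
    conn.execute("ALTER TABLE your_source_table ADD COLUMN year INTEGER")
    conn.execute("ALTER TABLE your_source_table ADD COLUMN month INTEGER")
    # No WHERE clause: every row is updated (BigQuery needs WHERE TRUE instead).
    conn.execute(
        """
        UPDATE your_source_table
        SET year  = CAST(substr(year_month, 1, 4) AS INTEGER),
            month = CAST(substr(year_month, 6, 2) AS INTEGER)
        """
    )


conn = sqlite3.connect(":memory:")
# Hypothetical source table: year and month combined in one string column.
conn.execute("CREATE TABLE your_source_table (year_month TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO your_source_table VALUES (?, ?)",
    [("2021-03", 10), ("2022-11", 20)],
)
add_year_month_columns(conn)

rows = conn.execute(
    "SELECT year_month, year, month FROM your_source_table ORDER BY year_month"
).fetchall()
print(rows)
```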

Changing order of columns for a table

I have a BigQuery table with this schema:
CREATE TABLE `abc`
(
col2 STRING,
col1 DATE,
col3 STRING,
);
and after creating and loading months worth of data in it, I realised I want the DDL to look like,
CREATE TABLE `abc`
(
col1 DATE,
col2 STRING,
col3 STRING,
);
I want this change because the upstream ETL code expects it in this way.
Is there a way to achieve this?
PS: drop and create the table isn't an option as it has important data.
Thanks :)
You won't lose any data. Try this:
create or replace table <SCHEMA.NEW_TABLE_NAME> as
select col1,col2,col3 from <SCHEMA.OLD_TABLE_NAME>;
When you make the select you can pass the order you want for your columns.
Instead of selecting it like SELECT * FROM ...
just do it as SELECT col1, col2, col3 FROM ...
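
As a quick check of the idea, the sketch below rebuilds a table with a new column order and confirms that the row count is unchanged. It uses Python's sqlite3 module rather than BigQuery, so the table names abc_old and abc and the helper function are only illustrative; in BigQuery it is also worth checking whether table options such as partitioning need to be restated when a table is recreated this way.

```python
import sqlite3


def rebuild_with_column_order(conn, old_table, new_table, ordered_columns):
    """Create new_table from old_table with columns in the given order."""
    column_list = ", ".join(ordered_columns)
    conn.execute(
        f"CREATE TABLE {new_table} AS SELECT {column_list} FROM {old_table}"
    )


conn = sqlite3.connect(":memory:")
# Original layout from the question: col2 comes before col1.
conn.execute("CREATE TABLE abc_old (col2 TEXT, col1 TEXT, col3 TEXT)")
conn.execute("INSERT INTO abc_old VALUES ('b', '2024-01-01', 'c')")

rebuild_with_column_order(conn, "abc_old", "abc", ["col1", "col2", "col3"])

# PRAGMA table_info returns one row per column; index 1 is the column name.
new_columns = [row[1] for row in conn.execute("PRAGMA table_info(abc)")]
old_count = conn.execute("SELECT COUNT(*) FROM abc_old").fetchone()[0]
new_count = conn.execute("SELECT COUNT(*) FROM abc").fetchone()[0]
print(new_columns, old_count, new_count)
```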

SQL UPDATE value based on row and column location without ID or key

In SQL (I'm using postgres, but am open to other variations), is it possible to update a value based on a row location and a column name when the table doesn't have unique rows or keys? ...without adding a column that contains unique values?
For example, consider the table:
col1 | col2 | col3
-----|------|-----
1    | 1    | 1
1    | 1    | 1
1    | 1    | 1
I would like to update the table based on the row number or numbers. For example, change the values of rows 1 and 3, col2 to 5 like so:
col1 | col2 | col3
-----|------|-----
1    | 5    | 1
1    | 1    | 1
1    | 5    | 1
I can start with the example table:
CREATE TABLE test_table (col1 int, col2 int, col3 int);
INSERT INTO test_table (col1, col2, col3) values(1,1,1);
INSERT INTO test_table (col1, col2, col3) values(1,1,1);
INSERT INTO test_table (col1, col2, col3) values(1,1,1);
Now, I could add an additional column, say "id" and simply:
UPDATE test_table SET col2 = 5 WHERE id = 1;
UPDATE test_table SET col2 = 5 WHERE id = 3;
But can this be done just based on row number?
I can select based on row number using something like:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER() FROM test_table
) as sub
WHERE row_number BETWEEN 1 AND 2
But this doesn't seem to play well with UPDATE (at least in Postgres). Likewise, I have tried using some subsets or common table expressions, but again I run into difficulties with the UPDATE part. How can I do something like this pseudocode: UPDATE <my table> SET <col name> = <new value> WHERE row_number = 1 or 3, or...? This is trivial in other languages like R or Python (e.g., using pandas' .iloc indexer). It would be interesting to know how to do this in SQL.
Edit: in my table example, I should have specified the column types to something like int.
This is one of the many instances where you should embrace the lesser evil that is Surrogate Keys. Whichever table has a primary key of (col1,col2,col3) should have an additional key created by the system, such as an identity or GUID.
You don't specify the data type of (col1,col2,col3), but if for some reason you're allergic to surrogate keys you can embrace the slightly greater evil of a "combined key", where instead of a database-created value your unique key field is derived from some other fields. (In this instance, it'd be something like CONCAT(col1, '-', col2, '-', col3) ).
Should neither of the above be practical, you will be left with the greatest evil of having to manually specify all three columns each time you query a record. Which means that any other object or table which references this one will need to have not one but three distinct fields to identify which record you're talking about.
Ideally, btw, you would have some business key in the actual data which you can guarantee by design will be unique, never-changing, and never-blank. (Or at least changing so infrequently that the db can handle cascade updates reasonably well.)
You may wind up using a surrogate key for performance in such a case anyway, but that's an implementation detail rather than a data modeling requirement.
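
For completeness, the position-based update the question asks about can be sketched with a hidden row identifier. The example below uses Python's sqlite3 module and SQLite's implicit rowid (window functions need SQLite 3.25 or later); Postgres has a comparable system column, ctid, but this has not been tested there. Keep in mind that rows have no inherent order, so "row 1" only means something relative to the ORDER BY you choose, which is one more argument for a real key. The helper name is made up for this example.

```python
import sqlite3


def update_by_row_position(conn, positions, new_value):
    """Set col2 = new_value for the rows at the given 1-based positions.

    Positions are defined by ROW_NUMBER() ordered by SQLite's implicit rowid.
    """
    placeholders = ", ".join("?" for _ in positions)
    conn.execute(
        f"""
        UPDATE test_table
        SET col2 = ?
        WHERE rowid IN (
            SELECT rid FROM (
                SELECT rowid AS rid,
                       ROW_NUMBER() OVER (ORDER BY rowid) AS rn
                FROM test_table
            )
            WHERE rn IN ({placeholders})
        )
        """,
        [new_value, *positions],
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (col1 INT, col2 INT, col3 INT)")
conn.executemany("INSERT INTO test_table VALUES (?, ?, ?)", [(1, 1, 1)] * 3)

update_by_row_position(conn, [1, 3], 5)

result = conn.execute(
    "SELECT col1, col2, col3 FROM test_table ORDER BY rowid"
).fetchall()
print(result)
```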

Archive one table's data in another table with an archive date in Oracle

I have a table, test, with 10 columns and 20 rows.
I need to move this data to an archive_test table that has 11 columns (the same 10 as test, plus an archive date column).
When I tried the insert below, it raised an error because the number of columns doesn't match.
insert into archive_test
select * from test;
Please suggest a better way to do this. Thanks!
Well, obviously you need to supply values for all the columns, and although you can avoid doing so, you should also explicitly state which value is going to be inserted into which column. If you have an extra column in the target table, you either:
Do not mention it in the insert, and instead:
  Specify a default value as part of its column definition in the table, or
  Have a trigger to populate it
Or specify a value for that column.
eg.
insert into archive_test (col1, col2, col3 ... col11)
select col1,
col2,
col3,
...
sysdate
from test;
Assuming that archive_date is the last column:
INSERT INTO archive_test
SELECT test.*, sysdate
FROM test;
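
Here is a small runnable sketch of the explicit-column-list version, using Python's sqlite3 module in place of Oracle. The tables are cut down to two data columns for brevity, and datetime('now') stands in for Oracle's sysdate; the helper name is made up for this example.

```python
import sqlite3


def archive_rows(conn):
    """Copy every row of test into archive_test, stamping the archive date."""
    conn.execute(
        """
        INSERT INTO archive_test (col1, col2, archive_date)
        SELECT col1, col2, datetime('now')
        FROM test
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (col1 INTEGER, col2 TEXT)")
conn.execute(
    "CREATE TABLE archive_test (col1 INTEGER, col2 TEXT, archive_date TEXT)"
)
conn.executemany("INSERT INTO test VALUES (?, ?)", [(1, "a"), (2, "b")])

archive_rows(conn)

archived = conn.execute(
    "SELECT col1, col2, archive_date FROM archive_test ORDER BY col1"
).fetchall()
print(archived)
```

Naming the columns on both sides keeps the insert working even if the column order of either table changes later.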

Can I set a formula for a particular column in SQL?

I want to implement something like Col3 = Col2 + Col1 in SQL.
This is somewhat similar to Excel, where every value in column 3 is sum of corresponding values from column 2 and column 1.
Have a look at Computed Columns:
"A computed column is computed from an expression that can use other columns in the same table. The expression can be a noncomputed column name, constant, function, and any combination of these connected by one or more operators."
Also from CREATE TABLE point J
Something like
CREATE TABLE dbo.mytable
( low int, high int, myavg AS (low + high)/2 ) ;
Yes, you can do it in SQL using the UPDATE command:
UPDATE table_name
SET col3=col1+col2
WHERE <SOME CONDITION>
This assumes that you already have a table with populated col1 and col2 and you want to populate col3.
Yes. Provided it is not aggregating data across rows.
Assume that col1 and col2 are integers:
SELECT col1, col2, (col1 + col2) as col3 FROM mytable
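
One practical difference between these answers is that a value stored with UPDATE does not follow later changes to col1 or col2, while an expression evaluated at query time (or a computed column) always reflects the current values. A minimal sketch with Python's sqlite3 module, using a made-up single-row table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
conn.execute("INSERT INTO mytable (col1, col2) VALUES (1, 2)")

# UPDATE approach: col3 is stored once, as a snapshot of col1 + col2.
conn.execute("UPDATE mytable SET col3 = col1 + col2")

# Change col1 afterwards; the stored col3 is NOT recalculated.
conn.execute("UPDATE mytable SET col1 = 10")

stored = conn.execute("SELECT col3 FROM mytable").fetchone()[0]
# SELECT approach: the sum is evaluated at query time from current values.
computed = conn.execute("SELECT col1 + col2 AS col3 FROM mytable").fetchone()[0]

print(stored, computed)
```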