External Tables (HIVE) Choose only a few columns from a file - hive

How can I create an external table that uses only a few columns from a file?
Example: the file has six columns, A, B, C, D, E, F, but in my table I want only A, C, F.
Is it possible?

I do not know of a way to selectively include columns from HDFS files for an external table. Depending on your use case, it may be sufficient to define a view based on the external table to only include the columns you want. For example, given the following silly example of an external table:
hive> CREATE EXTERNAL TABLE ext_table (
> A STRING,
> B STRING,
> C STRING,
> D STRING,
> E STRING,
> F STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/tmp/ext_table';
OK
Time taken: 0.401 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B row_1_col_C row_1_col_D row_1_col_E row_1_col_F
row_2_col_A row_2_col_B row_2_col_C row_2_col_D row_2_col_E row_2_col_F
row_3_col_A row_3_col_B row_3_col_C row_3_col_D row_3_col_E row_3_col_F
Time taken: 0.222 seconds, Fetched: 3 row(s)
Then create a view to only include the columns you want:
hive> CREATE VIEW filtered_ext_table AS SELECT A, C, F FROM ext_table;
OK
Time taken: 0.749 seconds
hive> DESCRIBE filtered_ext_table;
OK
a string
c string
f string
Time taken: 0.266 seconds, Fetched: 3 row(s)
hive> SELECT * FROM filtered_ext_table;
OK
row_1_col_A row_1_col_C row_1_col_F
row_2_col_A row_2_col_C row_2_col_F
row_3_col_A row_3_col_C row_3_col_F
Time taken: 0.301 seconds, Fetched: 3 row(s)
Another approach requires that you are able to modify the HDFS files backing your external table. If the columns you are interested in all sit at the beginning of each line, you can define the external table with only those leading columns, regardless of how many more columns the file actually contains. For example, with the same data file as above, declaring only the first 3 columns:
hive> DROP TABLE IF EXISTS ext_table;
OK
Time taken: 1.438 seconds
hive> CREATE EXTERNAL TABLE ext_table (
> A STRING,
> B STRING,
> C STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/tmp/ext_table';
OK
Time taken: 0.734 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B row_1_col_C
row_2_col_A row_2_col_B row_2_col_C
row_3_col_A row_3_col_B row_3_col_C
Time taken: 0.727 seconds, Fetched: 3 row(s)
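The two approaches above can be mimicked outside Hive to see what each one does to a row. This is a minimal Python sketch, not Hive code. It assumes the text reader simply splits each line on the delimiter and ignores fields beyond the declared columns, which is what the transcript above shows. The helper names `read_row` and `project` are hypothetical.

```python
def read_row(line, n_columns, delimiter=","):
    """Split a delimited line and keep only the first n_columns fields.

    Missing trailing fields become None, standing in for Hive's NULL.
    """
    fields = line.rstrip("\n").split(delimiter)
    fields = fields[:n_columns]
    return fields + [None] * (n_columns - len(fields))


def project(row, names, wanted):
    """Pick the wanted columns by name, like a view over the full table."""
    index = {name: i for i, name in enumerate(names)}
    return [row[index[name]] for name in wanted]


line = "row_1_col_A,row_1_col_B,row_1_col_C,row_1_col_D,row_1_col_E,row_1_col_F"

# Table declared with 3 columns: the trailing D, E, F fields are dropped.
first_three = read_row(line, 3)

# Table declared with all 6 columns, then a view selecting A, C, F.
full_row = read_row(line, 6)
view_row = project(full_row, ["A", "B", "C", "D", "E", "F"], ["A", "C", "F"])

print(first_three)
print(view_row)
```

The sketch makes the trade-off visible: declaring fewer columns can only keep a prefix of each line, while the view can pick any subset.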

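I found an answer. Note that this is Oracle external-table syntax (oracle_loader), not Hive. The unwanted leading fields are read into dummy names that do not appear in the table's column list: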
create table tmpdc_ticket(
SERVICE_ID CHAR(144),
SERVICE_TYPE CHAR(50),
CUSTOMER_NAME CHAR(200),
TELEPHONE_NO CHAR(144),
ACCOUNT_NUMBER CHAR(144),
FAULT_STATUS CHAR(50),
BUSINESS_GROUP CHAR(100)
)
organization external(
type oracle_loader
default directory sample_directory
access parameters(
records delimited by newline
nologfile
skip 1
fields terminated by '|'
missing field values are null
(DUMMY_1,
DUMMY_2,
SERVICE_ID CHAR(144),
SERVICE_TYPE CHAR(50),
CUSTOMER_NAME CHAR(200),
TELEPHONE_NO CHAR(144),
ACCOUNT_NUMBER CHAR(144),
FAULT_STATUS CHAR(50),
BUSINESS_GROUP CHAR(100)
)
)
location(sample_directory:'sample_file.txt')
)
reject limit 1
noparallel
nomonitoring;

Related

SQL Group by joining with time difference

I have this college project with a good focus on the frontend, but I'm struggling with a SQL query (PostgreSQL) that needs to be executed at one of the backend endpoints.
The table I'm speaking of is the following:
id  todo_id  column_id  time_in_status
0   259190   3          0
1   259190   10300      30
2   259190   10001      60
3   259190   10600      90
4   259190   6          30
A simple way to describe it: it is a to-do organizer with vertical columns, where each column is identified by its column_id and each row is a task column-change event.
What I need is a view (or a better approach, if you can suggest one) built from this table that shows how long each task spent in each column_id. For a given todo_id, column_id is not unique, so there could be multiple events for column 10300; the result should group them and sum their times.
For example, the table above would output a view like this:
id  todo_id  time_in_column_3  time_in_column_10300  time_in_column_10001  ...
0   259190   0                 30                    60                    ...
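-- Requires the tablefunc extension: create extension if not exists tablefunc;
select *
from crosstab(
'select todo_id, column_id, sum(time_in_status)::int
from t
group by todo_id, column_id
order by todo_id, column_id',
'values (3), (10300), (10001), (10600), (6)'
)
as t(todo_id int, "time_in_column_3" int, "time_in_column_10300" int, "time_in_column_10001" int, "time_in_column_10600" int, "time_in_column_6" int )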
todo_id  time_in_column_3  time_in_column_10300  time_in_column_10001  time_in_column_10600  time_in_column_6
259190   0                 30                    60                    90                    30
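The grouping and summing the question asks for can also be checked independently of the database. This is a minimal Python sketch using the rows from the question's table. The function name `time_per_column` and the extra repeat event are hypothetical, added only to illustrate that repeated visits to the same column are summed.

```python
from collections import defaultdict

# Rows copied from the question's table: (id, todo_id, column_id, time_in_status)
events = [
    (0, 259190, 3, 0),
    (1, 259190, 10300, 30),
    (2, 259190, 10001, 60),
    (3, 259190, 10600, 90),
    (4, 259190, 6, 30),
]


def time_per_column(rows):
    """Sum time_in_status for every (todo_id, column_id) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for _id, todo_id, column_id, time_in_status in rows:
        totals[todo_id][column_id] += time_in_status
    return {todo: dict(cols) for todo, cols in totals.items()}


pivot = time_per_column(events)
print(pivot[259190])

# A hypothetical repeat visit to column 10300 is added to the earlier 30.
with_repeat = time_per_column(events + [(5, 259190, 10300, 15)])
print(with_repeat[259190][10300])
```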

Encountering "Error: Unable to alter table." when altering column positions in Hive

I have a simple table fiction_movie:
hive>describe fiction_movie;
OK
title string
actors array<string>
genre string
rating int
4 rows selected (0.03 seconds)
content in table:
hive> select * from fiction_movie;
OK
avatar ["zoe saldana","Sam worthington"] science fiction 7
logan ["hugh jackman","Patrick stewart"] science fiction 8
2 rows selected (0.352 seconds)
What I want to do is to rearrange column positions and put title after genre:
#I tried
hive>alter table fiction_movie change column title title string after genre;
But it gave me the following error:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
actors,genre (state=08S01,code=1)
Does anyone know why? Thank you!
This code works fine on my machine:
hive> create table fiction_movie(
> title string,
> actors array<string>,
> genre string,
> rating int);
OK
Time taken: 0.155 seconds
hive> alter table fiction_movie change column title title string after genre;
OK
Time taken: 0.4 seconds
hive> describe fiction_movie;
OK
actors array<string>
genre string
title string
rating int
Time taken: 0.37 seconds, Fetched: 4 row(s)
If it still fails for you, try changing this property. You can force Hive to allow the change by turning off the incompatible column type check:
`SET hive.metastore.disallow.incompatible.col.type.changes=false;`
Then alter your table.
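One plausible reading of the error message is that the metastore compares the old and new column lists position by position and rejects the change when the types at a position do not match. The Python sketch below illustrates that reading only. It is an assumption rather than Hive's actual implementation, it treats "compatible" as "identical type", and the function name `incompatible_positions` is hypothetical.

```python
def incompatible_positions(old_columns, new_columns):
    """Return names of new columns whose type differs from the old column
    in the same position. Simplification: 'compatible' means 'identical'."""
    return [
        new_name
        for (_, old_type), (new_name, new_type) in zip(old_columns, new_columns)
        if old_type != new_type
    ]


# Column order before the ALTER TABLE.
old = [("title", "string"), ("actors", "array<string>"),
       ("genre", "string"), ("rating", "int")]

# Column order after moving title behind genre.
new = [("actors", "array<string>"), ("genre", "string"),
       ("title", "string"), ("rating", "int")]

print(incompatible_positions(old, new))
```

Under this simplified model the flagged columns are actors and genre, the same two names listed in the error message from the question.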

Finding the First & Last of Array struct

I have an array of structs in a file, like below:
[{"A":"1","B":"2","C":"3"},{"A":"4","B":"5","C":"6"},{"A":"7","B":"8","C":"9"}]
How can I get the first and last values of field "A" ("1", "7")?
I need to write this in Hive SQL.
Thanks in advance.
The first element of an array is array_name[0]; the last is array_name[size(array_name)-1].
Demo:
select example_data[0].A, example_data[size(example_data)-1].A
from
( --Your example data
select array(named_struct("A","1","B","2","C","3"),named_struct("A","4","B","5","C","6"),named_struct("A","7","B","8","C","9")) as example_data
)s;
OK
1 7
Time taken: 2.72 seconds, Fetched: 1 row(s)

Loading null values to Hive table

I have a .txt file that has the following rows:
Steve,1 1 1 1 1 5 10 20 10 10 10 10
When I created an external table, loaded the data and ran select *, I got null values. How can I get the numeric values to show instead of null? I very much appreciate the help!
create external table Teller(Name string, Bill array<int>)
row format delimited
fields terminated by ','
collection items terminated by '\t'
stored as textfile
location '/user/training/hive/Teller';
load data local inpath '/home/training/hive/input/*.txt' overwrite into table Teller;
output:
Steve [null]
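It seems the integers are separated by spaces, not tabs. With '\t' as the collection delimiter, the whole string "1 1 1 1 1 5 10 20 10 10 10 10" is treated as a single array item, which cannot be converted to an int, so you get [null]. Use a space as the collection delimiter instead: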
bash
hdfs dfs -mkdir -p /user/training/hive/Teller
echo Steve,1 1 1 1 1 5 10 20 10 10 10 10 | hdfs dfs -put - /user/training/hive/Teller/data.txt
hive
hive> create external table Teller(Name string, Bill array<int>)
> row format delimited
> fields terminated by ','
> collection items terminated by ' '
> stored as textfile
> location '/user/training/hive/Teller';
OK
Time taken: 0.417 seconds
hive> select * from teller;
OK
Steve [1,1,1,1,1,5,10,20,10,10,10,10]
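The effect of the delimiter choice can be reproduced outside Hive. This is a minimal Python sketch, not Hive code. It assumes that an item which cannot be parsed as an integer becomes NULL (None here), which matches the [null] output in the question. The helper name `parse_int_array` is hypothetical.

```python
def parse_int_array(field, item_delimiter):
    """Split a collection field and convert each item to int.

    Items that are not valid integers become None, standing in for NULL.
    """
    items = []
    for raw in field.split(item_delimiter):
        try:
            items.append(int(raw))
        except ValueError:
            items.append(None)
    return items


line = "Steve,1 1 1 1 1 5 10 20 10 10 10 10"
name, bill = line.split(",", 1)

# Tab delimiter: no tab is found, so the whole string is one unparsable item.
with_tab = parse_int_array(bill, "\t")

# Space delimiter: every number becomes its own array item.
with_space = parse_int_array(bill, " ")

print(name, with_tab)
print(name, with_space)
```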

hive count and count distinct not correct

I have a table in Hive that has 20 columns and I want to count unique records and all records per hour.
Table looks like:
CREATE EXTERNAL TABLE test1(
log_date string,
advertiser_creatives_id string,
cookieID string,
)
STORED AS ORC
LOCATION "/day1orc"
tblproperties ("orc.compress"="ZLIB");
And my query like this:
SELECT Hour(log_date),
Count(DISTINCT cookieid) AS UNIQUE,
Count(1) AS impressions
FROM test1
GROUP BY Hour(log_date);
But the results are not correct. I have about 70 million entries, and when I sum the impressions I only get about 8 million, so I suspect the distinct takes too many columns into account.
So how can I fix this so that I get the correct amount of impressions?
** Extra information **
hive.vectorized.execution.enabled is undefined so it is not active.
The same query in TEXT format returns even fewer rows (about 2.7 million)
Result of COUNT(*): 70643229
Result of COUNT(cookieID): 70643229
Result of COUNT(DISTINCT cookieID): 1440195
Cheers
Here is an example that may be useful to you. I suspect the problem is related to your "row format delimited fields terminated by" setting.
I have a text file, separated by "\t", like this:
id date value
1 01-01-2014 10
1 03-01-2014 05
1 07-01-2014 40
1 05-01-2014 20
2 05-01-2014 10
But I created a table with only 2 columns, like this:
use tmp ;
create table sw_test(id string,td string) row format delimited fields terminated by '\t' ;
LOAD DATA LOCAL INPATH '/home/hadoop/b.txt' INTO TABLE sw_test;
What do you think "select td from sw_test;" returns?
Not this:
td
01-01-2014 10
03-01-2014 05
07-01-2014 40
05-01-2014 20
05-01-2014 10
But this:
td
01-01-2014
03-01-2014
07-01-2014
05-01-2014
05-01-2014
So I think your cookie column contains some special characters, including your defined separator.
I hope this helps.
Good luck!