Adding column headers to hive result set - amazon-s3

I am using a Hive script on Amazon EMR to analyze some data, and I am transferring the output to an Amazon S3 bucket. The results of the Hive script do not contain column headers.
I have also tried using this:
set hive.cli.print.header=true;
But it does not help. Can you help me out?

Exactly what does your Hive script look like?
Does the output from your Hive script have the header data in it? Is it then being lost when you copy the output to your S3 bucket?
If you could provide some more details about exactly what you are doing, that would be helpful.
Without knowing those details, here is something you could try.
Create your hive script as follows:
USE dbase_name;
SET hive.cli.print.header=true;
SELECT some_columns FROM some_table WHERE some_condition;
Then run your script:
$ hive -f hive_script.hql > hive_output
Then copy your output to your S3 bucket:
$ aws s3 cp ./hive_output s3://some_bucket_name/foo/hive_output
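If you would rather skip the intermediate local file, the aws CLI can read from standard input, so you could pipe the Hive output straight to S3. A minimal sketch, assuming the same script and bucket names as above:
$ hive -f hive_script.hql | aws s3 cp - s3://some_bucket_name/foo/hive_output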

I guess the direct way is still impossible (see Hive: writing column headers to local file?).
One workaround would be to export the result of DESCRIBE table_name to a file:
$ hive -e 'DESCRIBE table_name' > file
and then write a small script that adds the column names to your data file. GL!
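For example, a rough shell sketch of that idea (assuming the data file from the earlier answer is named hive_output, the output is tab-separated, and DESCRIBE prints one column per line with the name in the first field; all names here are placeholders):
$ hive -e 'DESCRIBE table_name' | awk '{cols = cols sep $1; sep = "\t"} END {print cols}' > header
$ cat header hive_output > hive_output_with_header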

I ran into this problem today and was able to get what I needed by doing a UNION ALL between the original query and a new dummy query that creates the header row. I added a sort column on each section and set the header to 0 and the data to a 1 so I could sort by that field and ensure the header row came out on top.
create table new_table as
select
  field1,
  field2,
  field3
from
(
  select
    0 as sort_col,           --header row gets lowest number
    'field1_name' as field1,
    'field2_name' as field2,
    'field3_name' as field3
  from
    some_small_table         --table needs at least 1 row
  limit 1                    --only need 1 header row
  union all
  select
    1 as sort_col,           --original query goes here
    field1,
    field2,
    field3
  from
    main_table
) a
order by
  sort_col                   --make sure header row is first
It's a little bulky, but at least you can get what you need with a single query.
Hope this helps!

It might be just a typo (or a version-dependent change), but the following works for me:
set hive.cli.print.headers=true;
It's "headers" instead of "header"

Related

Using Update statement with the _PARTITIONDATE Pseudo-column

I'm trying to update a table in BigQuery that is partitioned on _PARTITIONTIME and really struggling.
Source is an extract from destination that I need to backfill destination with. Destination is a large partitioned table.
To move data from source to destination, I tried this:
update t1 AS destination
set destination._PARTITIONTIME = '2022-02-09'
from t2 as source
WHERE source.id <> "1";
Because it said that the WHERE clause was required for UPDATE, but when I run it, I get a message that "update/merge must match at most one source row for each target row". I've tried... so many other methods that I can't even remember them all. INSERT INTO seemed like a no-brainer early on but it wants me to specify column names and these tables have about 800 columns each so that's less than ideal.
I would have expected this most recent attempt to work because if I do
select * from source where source.id <> "1";
I do, in fact, get results exactly the way I would expect, so that query clearly functions, but for some reason it can't load the data. This is interesting, because I created the source table by running something along the lines of:
select * from destination where DATE(createddate) = '2022-02-09' and DATE(_PARTITIONTIME) = '2022-02-10'
Is there a way to make Insert Into work for me in this instance? If there is not, does someone have an alternate approach they recommend?
You can use the bq command line tool (usually comes with the gcloud command line utility) to run a query that will overwrite a partition in a target table with your query results:
bq query --allow_large_results --replace --noflatten_results --destination_table 'target_db.target_table$20220209' "select field1, field2, field3 from source_db.source_table where _PARTITIONTIME = '2022-02-09'";
Note the $YYYYMMDD suffix on the target_table. This indicates that the partition corresponding to YYYYMMDD is to be overwritten by the query results.
Make sure to select fields explicitly in your query (as a good practice) to avoid unexpected surprises. For instance, select field1, field2, field3 from table is far more explicit and readable than select * from table.
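As a quick sanity check after the overwrite finishes, you could count the rows that landed in that partition. A sketch reusing the placeholder names from the command above:
bq query --use_legacy_sql=false "SELECT COUNT(*) FROM target_db.target_table WHERE DATE(_PARTITIONTIME) = '2022-02-09'"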

Hive tblproperties ("skip.header.line.count"="1") not working with select distinct

We have a little problem with our tblproperties ("skip.header.line.count"="1").
If we do a basic select like select * from tableabc we do not get back this header. But once we do a select distinct columnname from tableabc we get the header back!
Of course we do not want this for obvious reasons.
Did somebody else also have this issue? If so did you find a fix for this?
Thx in advance
-----Update 20/06/2018-----
Hive2 version: 2.1
running on: Azure HDInsight hive interactive query cluster
This is a very small data set already, 48 records (with header included)
Create Statement:
-----------------------------------
--sap_0bill_typea--
-----------------------------------
CREATE EXTERNAL TABLE IF NOT EXISTS ext.test_type_in
(
  test_type string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
STORED AS TEXTFILE
LOCATION 'adl://{adlslocation}data/data2/test'
tblproperties ("skip.header.line.count"="1");
Select statement:
select * from test_type_in;
Distinct statement:
select distinct test_type from test_type_in ORDER BY test_type;
I cannot show the exact statement because of an NDA, so I changed those values to test.

How to get input file name as column in AWS Athena external tables

I have external tables created in AWS Athena to query S3 data; however, the location path has 1000+ files. So I need the corresponding filename of each record to be displayed as a column in the table.
select file_name , col1 from table where file_name = "test20170516"
In short, I need to know INPUT__FILE__NAME(hive) equivalent in AWS Athena Presto or any other ways to achieve the same.
You can do this with the $path pseudo column.
select "$path" from table
If you need just the filename, you can extract it with regexp_extract().
To use it in Athena on the "$path" you can do something like this:
SELECT regexp_extract("$path", '[^/]+$') AS filename from table;
If you need the filename without the extension, you can do:
SELECT regexp_extract("$path", '[ \w-]+?(?=\.)') AS filename_without_extension from table;
Here is the documentation on Presto Regular Expression Functions
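Putting the pieces together for the original question, you could both expose the filename and filter on it in one query. A rough sketch, where my_table and col1 stand in for your table and column names:
SELECT regexp_extract("$path", '[^/]+$') AS file_name, col1
FROM my_table
WHERE "$path" LIKE '%test20170516%';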

BigQuery command line tool - append to table using query

Is it possible to append the results of running a query to a table using the bq command line tool? I can't see flags available to specify this, and when I run it, it fails and states "table already exists":
bq query --allow_large_results --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"
BigQuery error in query operation: Error processing job '':
Already Exists: Table project:DATASET.table
Originally BigQuery did not support the standard SQL idiom
INSERT foo SELECT a,b,c from bar where d>0;
and you had to do it their way with --append_table.
But according to #Will's answer, it works now.
Originally with bq, there was
bq query --append_table ...
The help for the bq query command is
$ bq query --help
And the output shows an append_table option in the top 25% of the output.
Python script for interacting with BigQuery.
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
query Execute a query.
Examples:
bq query 'select count(*) from publicdata:samples.shakespeare'
Usage:
query <sql_query>
Flags for query:
/home/paul/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_large_results: Enables larger destination table sizes.
--[no]append_table: When a destination table is specified, whether or not to
append.
(default: 'false')
--[no]batch: Whether to run the query in batch mode.
(default: 'false')
--destination_table: Name of destination table for query results.
(default: '')
...
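Applied to the command from the question, the append would look something like this (a sketch using the question's placeholder project and dataset names):
bq query --allow_large_results --append_table --destination_table=project:DATASET.table "SELECT * FROM [project:DATASET.another_table]"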
Instead of appending two tables together, you might be better off with a UNION ALL, which is SQL's version of concatenation.
In BigQuery, the comma operator between two tables, as in SELECT something FROM tableA, tableB, is a UNION ALL, NOT a JOIN, or at least it was the last time I looked.
Just in case someone ends up finding this question on Google: BigQuery has evolved a lot since this post, and it now supports Standard SQL.
If you want to append the results of a query to a table using the DML syntax feature of the Standard version, you could do something like:
INSERT dataset.Warehouse (warehouse, state)
SELECT *
FROM UNNEST([('warehouse #1', 'WA'),
('warehouse #2', 'CA'),
('warehouse #3', 'WA')])
As presented in the docs.
For the command line tool it follows the same idea, you just need to add the flag --use_legacy_sql=False, like so:
bq query --use_legacy_sql=False "insert into dataset.table (field1, field2) select field1, field2 from table"
According to the current documentation (March 2018): https://cloud.google.com/bigquery/docs/loading-data-local#appending_to_or_overwriting_a_table_using_a_local_file
You should add:
--noreplace or --replace=false
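For example, when loading a local file into an existing table with bq load, making the append explicit would look roughly like this (a sketch; the dataset, table, and file names are placeholders):
bq load --noreplace DATASET.table ./data.csv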

How to know the location of a partition in Hive?

If I write Hive SQL like
ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023';
how can I query the location of this partition later? I found there is some data in the location, but I can't query it with Hive SQL like
SELECT data FROM tbl_name where dt=20131023;
Do a describe on the partition instead of the full table.
This will show the linked location if it's an external table.
describe formatted tbl_name partition (dt='20131023')
show table extended like 'tbl_name' partition (dt='20131023');
Show Tables/Partitions Extended
SHOW TABLE EXTENDED will list information for all tables matching the given regular expression. Users cannot use regular expression for table name if a partition specification is present. This command's output includes basic table information and file system information like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If partition is present, it will output the given partition's file system information instead of table's file system information.
If you have multiple nested partitions, the syntax is:
describe formatted table_name partition (day=123,hour=2);
If you want to know the location of files you're reading, use
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE FROM <table> WHERE <part_name> = '<part_key>'
Then you get
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_0.snappy, 0
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_1.snappy, 0
This is the format of the command I use to get the exact HDFS location of a specific partition in a specific table:
show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
In the command above, the partition spec consists of three separate fields. Your example may have more or less.
See results below. Notice the "location:" field shows the HDFS folder location.
hive (nva_test)> show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
OK
tableName:flight_context_fused_record
owner:nva-prod
location:hdfs://hdp1-ha/tmp/vfisher/cms-context-acquisition-2019-06-13/FlightContextFusedRecord/2018/10/13/ZMP/P-DUK2nESsv
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string primary_key, string facility, string position, i32 dalr_channel, i64 start_time_unix_millis, i64 end_time_unix_millis, string foreign_key_to_audio_segment, struct<on_frequency_flight_list:list<struct<acid:string,ac_type:string>>,transfer_list:list<struct<primary_key:string,acid:string,data_id:string,ac_type:string,from_facility:string,from_position:string,transition_time:i64,transition_time_start:i64,transtition_time_end:i64,to_facility:string,to_position:string,source:string,source_info:string,source_time:i64,confidence:double,confidence_description:string,uuid:string>>,source_list:list<string>,domain:string,domains:list<string>> flight_context}
partitioned:true
partitionColumns:struct partition_columns { i32 date_key, string partition_id, string custom_partition_1}
totalNumberFiles:1
totalFileSize:247075687
maxFileSize:247075687
minFileSize:247075687
lastAccessTime:1561122938361
lastUpdateTime:1561071155639
The generic form of the command (taking out my specific values and putting in argument specifiers) looks like this:
show table extended like <your table name here> partition(<your partition spec here>);
You can simply do this:
DESC FORMATTED tablename PARTITION (yr_no='y2019');
OR
DESC EXTENDED tablename PARTITION (yr_no='y2019');
You can get the location of the Hive partitions on HDFS by running any of the following Hive commands.
DESCRIBE FORMATTED tbl_name PARTITION(dt=20131023);
SHOW TABLE EXTENDED LIKE tbl_name PARTITION(dt=20131023);
Alternatively, you can also get it by running the HDFS list command:
hdfs dfs -ls <your Hive store location>/<tablename>
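For instance, to list the files under the partition from the question (a sketch assuming the default warehouse location; adjust the path for your cluster):
hdfs dfs -ls /user/hive/warehouse/tbl_name/dt=20131023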
Link: Hive show or list all partitions
You can get this info via the Hive Metastore Thrift protocol, e.g. with the hmsclient library:
Hive CLI:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python CLI:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='metastore.host', port=9083)  # substitute your metastore host/port
>>> with client as c:
...     partition = c.get_partition_by_name(db_name='default',
...                                         tbl_name='test_table_with_partitions',
...                                         part_name='dt=20210504')
...
>>> partition.sd.location
'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'
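If you need the locations of all partitions rather than a single one, the same Thrift client also exposes get_partitions; a minimal sketch reusing the connection above (passing -1 for max_parts to mean "no limit" is an assumption about the metastore API):
>>> with client as c:
...     parts = c.get_partitions(db_name='default',
...                              tbl_name='test_table_with_partitions',
...                              max_parts=-1)
...
>>> [p.sd.location for p in parts]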