Column names when exporting ORC files from hive server 2 using beeline - hive

I am facing a problem where results exported from Hive Server 2 to ORC files show default column names (e.g. _col0, _col1, _col2) instead of the original ones created in Hive. We are using pretty much default components from HDP-2.6.3.0.
I am also wondering if the below issue is related:
https://issues.apache.org/jira/browse/HIVE-4243
Below are the steps we are taking:
Connecting:
export SPARK_HOME=/usr/hdp/current/spark2-client
beeline
!connect jdbc:hive2://HOST1:2181,HOST2:2181,HOST2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Creating test table and inserting sample values:
create table test(str string);
insert into test values ('1');
insert into test values ('2');
insert into test values ('3');
Running test query:
select * from test;
+-----------+--+
| test.str  |
+-----------+--+
| 1         |
| 2         |
| 3         |
+-----------+--+
Exporting as ORC:
insert overwrite directory 'hdfs://HOST1:8020/tmp/test' stored as orc select * from test;
Getting the results:
hdfs dfs -get /tmp/test/000000_0 test.orc
Checking the results:
java -jar orc-tools-1.4.1-uber.jar data test.orc
Processing data file test.orc [length: 228]
{"_col0":"1"}
{"_col0":"2"}
{"_col0":"3"}
java -jar orc-tools-1.4.1-uber.jar meta test.orc
Processing data file test.orc [length: 228]
Structure for test.orc
File Version: 0.12 with HIVE_13083
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:string>
Stripe Statistics:
Stripe 1:
Column 0: count: 2 hasNull: false
Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2
File Statistics:
Column 0: count: 2 hasNull: false
Column 1: count: 2 hasNull: false min: 1 max: 3 sum: 2
Stripes:
Stripe: offset: 3 data: 11 rows: 2 tail: 60 index: 39
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 28
Stream: column 1 section DATA start: 42 length 5
Stream: column 1 section LENGTH start: 47 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 228 bytes
Padding length: 0 bytes
Padding ratio: 0%
Looking at the results, I can see _col0 as the column name while I expected the original str.
Any ideas on what I am missing?
Update
I noticed that the connection from beeline was going to Hive 1.x, and not 2.x as intended. I changed the connection to the Hive Server 2 Interactive URL:
Connected to: Apache Hive (version 2.1.0.2.6.3.0-235)
Driver: Hive JDBC (version 1.21.2.2.6.3.0-235)
Transaction isolation: TRANSACTION_REPEATABLE_READ
And tried again with the same sample. It even prints out the schema correctly:
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test.str, type:string, comment:null)], properties:null)
But still no luck getting the column names into the ORC file.

Solution
You need to enable Hive LLAP (Interactive SQL) in Ambari, then change the connection string you are using. For example, my connection became jdbc:hive2://.../;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Note the additional "-hive2" at the end of the URL. There is also a tutorial video from Hortonworks on enabling it.
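For reference, reconnecting from beeline then looks something like this (a sketch reusing the placeholder ZooKeeper hosts from above):
beeline
!connect jdbc:hive2://HOST1:2181,HOST2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2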
"Proof"
After connecting to the updated Hive endpoint, I ran
create table t_orc(customer string, age int) stored as orc;
insert into t_orc values('bob', 12),('kate', 15);
Then
~$ hdfs dfs -copyToLocal /apps/hive/warehouse/t_orc/000000_0 ~/tmp/orc/hive2.orc
~$ orc-metadata tmp/orc/hive2.orc
{ "name": "tmp/orc/hive2.orc",
"type": "struct<customer:string,age:int>",
"rows": 2,
"stripe count": 1,
"format": "0.12", "writer version": "HIVE-13083",
"compression": "zlib", "compression block": 262144,
"file length": 305,
"content": 139, "stripe stats": 46, "footer": 96, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 2,
"offset": 3, "length": 136,
"index": 67, "data": 23, "footer": 46
}
]
}
Here, orc-metadata is a tool distributed with the ORC repository on GitHub.

You have to set this in the Hive script or the Hive shell; otherwise, put it in a .hiverc file in your home directory or in any of the other Hive user properties files.
set hive.cli.print.header=true;
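For example, to make it persistent for the Hive CLI, you could put the line in a .hiverc file (a sketch; the path assumes your home directory):
# ~/.hiverc -- executed automatically when the Hive CLI starts
set hive.cli.print.header=true;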

Related

Azure Data Factory: ErrorCode=TypeConversionFailure,Exception occurred when converting value : ErrorCode: 2200

Can someone let me know why Azure Data Factory is trying to convert a value from type String to type Double?
I am getting the error:
{
"errorCode": "2200",
"message": "ErrorCode=TypeConversionFailure,Exception occurred when converting value '+44 07878 44444' for column name 'telephone2' from type 'String' (precision:255, scale:255) to type 'Double' (precision:15, scale:255). Additional info: Input string was not in a correct format.",
"failureType": "UserError",
"target": "Copy Table to EnrDB",
"details": [
{
"errorCode": 0,
"message": "'Type=System.FormatException,Message=Input string was not in a correct format.,Source=mscorlib,'",
"details": []
}
]
}
My Sink looks like the following:
I don't have any mapping set.
The column setting for the field 'telephone2' is as follows:
I changed the 'table option' to none; however, I got the following error:
{
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed with the following error: 'Internal system error occurred.\r\nStatement ID: {C2C38377-5A14-4BB7-9298-28C3C351A40E} | Query hash: 0x2C885D2041993FFA | Distributed request ID: {6556701C-BA76-4D0F-8976-52695BBFE6A7}. Total size of data scanned is 134 megabytes, total size of data moved is 102 megabytes, total size of data written is 0 megabytes.',Source=,''Type=System.Data.SqlClient.SqlException,Message=Internal system error occurred.\r\nStatement ID: {C2C38377-5A14-4BB7-9298-28C3C351A40E} | Query hash: 0x2C885D2041993FFA | Distributed request ID: {6556701C-BA76-4D0F-8976-52695BBFE6A7}. Total size of data scanned is 134 megabytes, total size of data moved is 102 megabytes, total size of data written is 0 megabytes.,Source=.Net SqlClient Data Provider,SqlErrorNumber=75000,Class=17,ErrorCode=-2146232060,State=1,Errors=[{Class=17,Number=75000,State=1,Message=Internal system error occurred.,},{Class=0,Number=15885,State=1,Message=Statement ID: {C2C38377-5A14-4BB7-9298-28C3C351A40E} | Query hash: 0x2C885D2041993FFA | Distributed request ID: {6556701C-BA76-4D0F-8976-52695BBFE6A7}. Total size of data scanned is 134 megabytes, total size of data moved is 102 megabytes, total size of data written is 0 megabytes.,},],'",
"failureType": "UserError",
"target": "Copy Table to EnrDB",
"details": []
}
Any more thoughts?
The issue was resolved by changing the column DataType on the database to match the DataType recorded in Azure Data Factory, i.e. StringType.
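As a hedged illustration of that kind of fix on a SQL-based sink, assuming a hypothetical sink table named dbo.Contacts (only the column name telephone2 comes from the error above):
-- Widen the sink column to a string type so values like '+44 07878 44444'
-- no longer need to be converted to Double during the copy.
ALTER TABLE dbo.Contacts
ALTER COLUMN telephone2 NVARCHAR(255) NULL;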

Reading an EXCEL file from bucket into Bigquery

I'm trying to upload this data from my bucket into BigQuery, but it's complaining.
My file is an Excel file.
ID          A    B  C           D              E       F         Value1  Value2
333344      ALG  A  RETAIL      OTHER          YIPP    Jun 2019  123     4
34563       ALG  A  NON-RETAIL  OTHER          TATS    Mar 2019  124     0
7777777777  -    E  RETAIL      NASAL          KHPO    Jul 2020  1,448   0
7777777777  -    E  RETAIL      SEVERE ASTHMA  PZIFER  Oct 2019  1,493   162
From Python I load the file as follows:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

table_id = "project.dataset.my_table"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('A', 'STRING'),
        bigquery.SchemaField('B', 'STRING'),
        bigquery.SchemaField('C', 'STRING'),
        bigquery.SchemaField('D', 'STRING'),
        bigquery.SchemaField('E', 'STRING'),
        bigquery.SchemaField('F', 'STRING'),
        bigquery.SchemaField('Value1', 'STRING'),
        bigquery.SchemaField('Value2', 'STRING'),
    ],
    skip_leading_rows=1
)

uri = "gs://bucket/folder/file.xlsx"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Wait for the job to complete.

table = client.get_table(table_id)
print("Loaded {} rows to table {}".format(table.num_rows, table_id))
I am getting the following error, and it's complaining about a line that isn't even there.
BadRequest: 400 Error while reading data, error message: CSV table references column position 8, but line starting at position:660 contains only 1 columns.
I thought the problem was the data types, as I had originally selected ID, Value1, and Value2 as INTEGER and F as TIMESTAMP, so now I'm trying everything as STRING, and I still get the error.
My file has only 4 rows in this test I'm doing.
Excel files are not supported by BigQuery.
A few workaround solutions:
Upload a CSV version of your file into your bucket (a simple bq load command will do, cf here),
Read the Excel file with Pandas in your Python script and insert the rows into BQ with the to_gbq() function (see the sketch after this list),
Upload your Excel file to your Google Drive, make a spreadsheet version of it, and create an external table linked to that spreadsheet.
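If you go with the Pandas route, a minimal sketch could look like this (assuming pandas, openpyxl, gcsfs and pandas-gbq are installed; the bucket path and table names are the placeholders from the question):
import pandas as pd

# Read the Excel file straight from GCS; dtype=str keeps every column as STRING,
# matching the schema used above.
df = pd.read_excel("gs://bucket/folder/file.xlsx", dtype=str)

# Append the rows to the BigQuery table via pandas-gbq.
df.to_gbq("dataset.my_table", project_id="project", if_exists="append")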
Try specifying field_delimiter in the LoadJobConfig.
Your input file looks like a TSV, so you need to set the field delimiter to '\t', like this:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField('ID', 'STRING'),
        bigquery.SchemaField('A', 'STRING'),
        bigquery.SchemaField('B', 'STRING'),
        bigquery.SchemaField('C', 'STRING'),
        bigquery.SchemaField('D', 'STRING'),
        bigquery.SchemaField('E', 'STRING'),
        bigquery.SchemaField('F', 'STRING'),
        bigquery.SchemaField('Value1', 'STRING'),
        bigquery.SchemaField('Value2', 'STRING'),
    ],
    skip_leading_rows=1,
    field_delimiter='\t'
)

Why Reducer required in a Hive Insert

This question is about how the MapReduce job works when we fire an INSERT INTO statement from the Hive command line.
While inserting records into an internal Hive table: as there is no aggregation involved in the INSERT INTO, why is a reducer also invoked? It should be a mapper-only job.
What is the role of the reducer here?
insert into table values (1,1);
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2021-04-28 10:30:26,487 Stage-1 map = 0%, reduce = 0%
INFO : 2021-04-28 10:30:30,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.96 sec
INFO : 2021-04-28 10:30:36,774 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.35 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 350 msec
hive> set hive.merge.mapfiles;
hive.merge.mapfiles=true
hive> set hive.merge.mapredfiles;
hive.merge.mapredfiles=false
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
explain insert into test values (10,14);
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-2 depends on stages: Stage-0
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: _dummy_table
Row Limit Per Split: 1
Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: array(const struct(10,14)) (type: array<struct<col1:int,col2:int>>)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
UDTF Operator
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
function name: inline
Select Operator
expressions: col1 (type: int), col2 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Select Operator
expressions: _col0 (type: int), _col1 (type: int)
outputColumnNames: i, j
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: compute_stats(i, 'hll'), compute_stats(j, 'hll')
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>), _col1 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>)
Reduce Operator Tree:
Group By Operator
aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-7
Conditional Operator
Stage: Stage-4
Move Operator
files:
hdfs directory: true
destination:<path>
Stage: Stage-0
Move Operator
tables:
replace: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Stats Work
Basic Stats Work:
Column Stats Desc:
Columns: i, j
Column Types: int, int
Table: db.test.test
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-6
Move Operator
files:
hdfs directory: true
destination: <PATH>
Time taken: 5.123 seconds, Fetched: 121 row(s)
It seems you have automatic statistics gathering enabled:
SET hive.stats.autogather=true;
and the reducer is calculating column statistics:
Reduce Operator Tree:
  Group By Operator
    aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
    mode: mergepartial
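If you want the INSERT to be a map-only job, you can turn the auto-gathering off for the session (a sketch; you then lose the automatically collected statistics, and the exact property names can vary by Hive version):
-- Disable automatic (column) statistics collection so the plan has no reduce stage.
set hive.stats.autogather=false;
set hive.stats.column.autogather=false;

insert into test values (10,14);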

pig latin - not showing the right record numbers

I have written a pig script for wordcount which works fine. I could see the results from pig script in my output directory in hdfs. But towards the end of my console, I see the following:
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1695568121_0002 1 1 0 0 0 0 0 0 0 0 words_sorted SAMPLER
job_local2103470491_0003 1 1 0 0 0 0 0 0 0 0 words_sorted ORDER_BY /output/result_pig,
job_local696057848_0001 1 1 0 0 0 0 0 0 0 0 book,words,words_agg,words_grouped GROUP_BY,COMBINER
Input(s):
Successfully read 0 records from: "/data/pg5000.txt"
Output(s):
Successfully stored 0 records in: "/output/result_pig"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local696057848_0001 -> job_local1695568121_0002,
job_local1695568121_0002 -> job_local2103470491_0003,
job_local2103470491_0003
2014-07-01 14:10:35,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
As you can see, the job is a success, but not the Input(s) and Output(s): both of them say they successfully read/stored 0 records, and the counter values are all 0.
Why are the values zero? They should not be zero.
I am using hadoop2.2 and pig-0.12
Here is the script:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;
words_grouped = group words by word;
words_agg = foreach words_grouped generate group as word, COUNT(words);
words_sorted = ORDER words_agg BY $1 DESC;
STORE words_sorted into '/output/result_pig' using PigStorage(':','-schema');
NOTE: my data is present in /data/pg5000.txt and not in the default directory, which is /usr/name/data/pg5000.txt.
EDIT: here is the output of printing my file to console
hadoop fs -cat /data/pg5000.txt | head -10
The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
cat: Unable to write to output stream.
Please correct the following line
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
to
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
I am assuming the delimiter is a comma here; use the one that is used to separate the records in your file. This will solve the issue.
Also note --
If no argument is provided, PigStorage will assume tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter.

Printing numeric values < NF in awk?

I ran a bunch of tests using pgbench, logging the results:
run-params: 1 1 1
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 90 s
number of transactions actually processed: 280465
tps = 3116.254233 (including connections establishing)
tps = 3116.936248 (excluding connections establishing)
run-params: 1 1 2
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 90 s
number of transactions actually processed: 505943
tps = 5621.570463 (including connections establishing)
tps = 5622.811538 (excluding connections establishing)
run-params: 10000 10 3
transaction type: SELECT only
scaling factor: 10000
query mode: simple
number of clients: 10
number of threads: 1
duration: 90 s
number of transactions actually processed: 10
tps = 0.012268 (including connections establishing)
tps = 0.012270 (excluding connections establishing)
I want to extract the values for graphing, and I'm trying to learn AWK at the same time. Here's my AWK program:
/run-params/ { scaling = $2 ; clients = $3 ; attempt = $4 }
/^tps.*excluding/ { print $scaling "," $clients "," $attempt "," $3 }
When I run that, I get the following output:
$ awk -f b.awk -- b.log
tps,tps,tps,3116.936248
tps,tps,=,5622.811538
,,0.012270,0.012270
Which is not what I want.
I understand that when scaling = 1, $scaling references field 1, which in this case happens to be tps. When scaling = 10000, because there aren't 10,000 fields on the line, an empty string (null) is returned. I did try assigning scaling and friends using "" $2, to no avail.
How does one use / report numeric values in a subsequent action block?
Simply drop the $ in front of scaling, etc. That is, scaling is a variable reference, $scaling is a field reference.
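A corrected version of the program above, with plain variable references in the action block:
# scaling, clients and attempt are awk variables; $3 is still a field reference.
/run-params/ { scaling = $2 ; clients = $3 ; attempt = $4 }
/^tps.*excluding/ { print scaling "," clients "," attempt "," $3 }
For the log above, this should print lines like 1,1,1,3116.936248 and 10000,10,3,0.012270.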