When I do show create table, I see the following delimiter:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '
and when I do describe extended table_name, I see:
parameters:{serialization.format, field.delim})
So is there a way to identify what the delimiter is for the existing table showing the above?
ROW FORMAT DELIMITED - this line tells Hive that the data is stored as delimited text, where each new line in a file is a new row.
FIELDS TERMINATED BY - this parameter tells Hive which character separates the fields within each row. If none is set, the default is used, which is Ctrl-A.
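To make the default concrete, here is a minimal Python sketch of how a row splits on Ctrl-A. The sample row and the helper name split_fields are made up for illustration; this only models the splitting, it is not how Hive itself is implemented.

```python
# Ctrl-A is the non-printing control character with code point 1 (octal 001).
CTRL_A = "\x01"


def split_fields(line, delimiter=CTRL_A):
    """Split one text row into fields on the given delimiter."""
    return line.rstrip("\n").split(delimiter)


# Hypothetical row: three fields joined by the default delimiter.
sample = "42" + CTRL_A + "alice" + CTRL_A + "2013-06-20\n"

print(split_fields(sample))  # ['42', 'alice', '2013-06-20']
print(repr(CTRL_A))          # '\x01'
```

Because the character is non-printing, it is usually easier to recognise by its code (1, or \001 in octal) than by looking at the raw file.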
I ran this in AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_test` (
`Data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'nina'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, when I try to query the table using the statement below:
SELECT * FROM "nina-nba-database"."nina_nba_table" limit 10;
It gives me this error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
This query ran against the "layla-nba-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: b96e4344-5bbe-4eca-9da4-70be11f8e87d
Would anyone be able to help?
The input.regex in your query does not look valid. Each capture group specified in the regex when creating the table becomes a column, so the number of groups has to match the number of columns. If you want to extract parts of each line into columns, specify a valid regex; to understand more, refer to the Regex SerDe examples in the AWS documentation. Or, if your use case is just to read columnar data, you can create the table with the proper delimiter. For example, if your data is comma separated, you can specify the delimiter as
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
...
Have a look at this example for more details.
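The mismatch behind the error message can be checked by counting capture groups. The sketch below uses Python's re module as a stand-in for the Java regex engine used by the SerDe (group counting works the same way for simple patterns like these). The helper name group_count and the second pattern are hypothetical illustrations, not a recommended regex for the actual data.

```python
import re


def group_count(pattern):
    """Return the number of capture groups in a regex pattern."""
    return re.compile(pattern).groups


# The regex from the question has no capture group,
# but the table declares one column (Data).
print(group_count("nina"))        # 0

# Hypothetical alternative: one pair of parentheses gives one group,
# which matches a one-column table.
print(group_count("(.*nina.*)"))  # 1
```

A table with N columns needs a pattern with N capture groups, so a quick count like this can catch the problem before the table is created.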
I am facing a strange issue. I tried a tab delimiter, both in the file and in the table definition, and a comma as well.
But in both cases it reads the decimal values as NULL. When I define these fields as INT, it works fine.
Sample data with comma delimited values:
1,22.334
2,445.322
3,999.233
I defined the table as
create table x(ID INT,SAL DECIMAL(3,3)) row format delimited fields terminated by '\t' location '\tmp\data\'
and similarly for the comma-delimited file
create table x(ID INT,SAL DECIMAL(3,3)) row format delimited fields terminated by ',' location '\tmp\data\'
But in both cases it reads the decimal values as NULL. What is the issue?
First, the DECIMAL datatype does not accept a comma inside the value.
Second, DECIMAL(3,3) cannot hold any of the three sample values: it allows three digits in total, all of them after the decimal point.
You have to increase it to at least DECIMAL(6,3) for the sample data provided; DECIMAL(7,3) leaves some headroom.
If your raw data contains commas inside the values,
load it into a table with all columns as the STRING datatype first.
Then use a regular expression to remove the commas and load the result into a second-level Hive table with the DECIMAL datatype.
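The precision point can be checked with a small Python sketch. It is a simplified model of the DECIMAL(precision, scale) rule (precision is the total number of digits, scale is the number of digits after the decimal point), not Hive's actual implementation, and the helper name fits_decimal is made up for illustration.

```python
from decimal import Decimal


def fits_decimal(text, precision, scale):
    """True if the value fits DECIMAL(precision, scale) without losing digits."""
    _sign, digits, exponent = Decimal(text).as_tuple()
    frac_digits = max(0, -exponent)              # digits after the decimal point
    int_digits = max(0, len(digits) + exponent)  # digits before the decimal point
    return frac_digits <= scale and int_digits <= precision - scale


for value in ["22.334", "445.322", "999.233"]:
    print(value, fits_decimal(value, 3, 3), fits_decimal(value, 7, 3))
# 22.334 False True
# 445.322 False True
# 999.233 False True
```

With DECIMAL(3,3) there is no room for any digit before the decimal point, which is consistent with the NULLs seen in the question.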
I have two input files which are semicolon delimited. I loaded these files into two tables. Both tables contain information on books. I joined the tables on the ISBN field. To create these tables I used the query below, which skips the header and reads semicolon-delimited files:
Create table books (ISBN STRING,BookTitle STRING,BookAuthor STRING,YearOfPublication STRING,Publisher STRING,ImageURLS STRING,ImageURLM STRING,ImageURLL STRING) row format delimited fields terminated by '\;' lines terminated by '\n' tblproperties ("skip.header.line.count"="1");
Now I am trying the query below, but I am not getting the desired output:
SELECT a.BookRating, COUNT(BookTitle)
FROM Books b
JOIN Rating a
on (b.ISBN = a.ISBN)
WHERE b.YearOfPublication = 2002
GROUP BY a.BookRating;
I am not getting anything. It just shows OK on the terminal after the query runs completely. Please let me know what can be done. Thanks in advance.
Your DDL script is not correct.
You have written
row format delimited fields terminated by '\;'
But actually it should be
row format delimited fields terminated by ';'
Try this and let me know.
YearOfPublication is a string, so you need to change the condition to
WHERE b.YearOfPublication = '2002'
I am loading a CSV file into a Hive table.
About the CSV file: column values are enclosed in double quotes and separated by commas.
Sample record from csv
"4","good"
"3","not bad"
"1","very worst"
I created a Hive table with the following statement:
create external table currys(review_rating string, review_comment string) row format delimited fields terminated by ',';
The table was created.
I then loaded the data using the command load data local inpath, and it was successful.
When I query the table,
select * from currys;
The result is :
"4" "good"
"3" "not bad"
"1" "very worst"
instead of
4 good
3 not bad
1 very worst
The records are loaded with the double quotes, which should not be there.
Please let me know how to get rid of these double quotes. Any help or guidance is highly appreciated.
Thanks in advance!
Are you using any SerDe? If so, you can write a regex in the SERDEPROPERTIES to remove the quotes.
Or you can use the csv-serde from here and define the quote character.
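The difference between a plain delimiter split and a quote-aware CSV parser can be sketched in Python. The standard csv module is used here only as an analogy for what a CSV SerDe with a quote character does; the helper name parse_rows is made up for illustration.

```python
import csv
import io

# The sample records from the question.
raw = '"4","good"\n"3","not bad"\n"1","very worst"\n'


def parse_rows(text, delimiter=",", quotechar='"'):
    """Parse CSV text, treating quotechar as a field wrapper rather than data."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return list(reader)


# Splitting on the delimiter alone keeps the quotes, as in the query result above.
naive = [line.split(",") for line in raw.splitlines()]
print(naive[0])            # ['"4"', '"good"']

# A quote-aware parser strips them.
print(parse_rows(raw)[0])  # ['4', 'good']
```

A quote-aware parser also keeps a comma inside a quoted field as part of the value, which a plain delimiter split would break into two fields.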
I have an inconsistent log file which I would like to partition with Hive using dynamic partitioning. File example:
20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000\n
20/06/13 20:21:42.638 FLW \n
BandwidthGlobalSettings:Old Bandwidth common defines\n
Sometimes the log file contains a line that starts with a word rather than a date. Each line is delimited with \n.
I am running these commands:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_temp (date STRING,time STRING,severity STRING,message STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/tmp';
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_partitioned (time STRING,severity STRING,message STRING) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040' LOCATION '/examples/hive/partitions';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM log_messages_temp pvs INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date) SELECT pvs.time, pvs.severity, pvs.message, pvs.date;
As a result, two dynamic partitions were created: date=20/06/13 and date=BandwidthGlobalSettings:Old
I would like to tell Hive to ignore lines that do not start with a date string.
How can I do this? Or is there another solution?
Thanks.
I think you can write a UDF that uses a regular expression to accept only the date format (e.g. 20/06/13) and discard everything else, such as "BandwidthGlobalSettings:Old". You can use this UDF in your last query when inserting into the final table.
I hope this helps with your requirement.
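The filtering logic such a UDF would apply can be sketched in Python. The pattern assumes the date always has the form of two digits, slash, two digits, slash, two digits, followed by a space, as in the sample lines; the names DATE_PREFIX and starts_with_date are made up for illustration.

```python
import re

# Two-digit groups separated by slashes at the start of the line, e.g. "20/06/13 ".
DATE_PREFIX = re.compile(r"^\d{2}/\d{2}/\d{2} ")


def starts_with_date(line):
    """True if the line begins with a date in the assumed format."""
    return DATE_PREFIX.match(line) is not None


# The sample lines from the question.
log_lines = [
    "20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000",
    "20/06/13 20:21:42.638 FLW ",
    "BandwidthGlobalSettings:Old Bandwidth common defines",
]

kept = [line for line in log_lines if starts_with_date(line)]
print(len(kept))  # 2
```

Hive also has a built-in RLIKE operator, so a similar pattern in a WHERE clause on the date column may be enough without writing a custom UDF; that is a suggestion to try, not something verified against this data.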