I am new to Hadoop. I am trying to create an EXTERNAL table in Hive.
The following is the query I am using:
CREATE EXTERNAL TABLE stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///data/stocks'
I am getting an error:
' ParseException cannot recognize input near 'exchange' 'STRING' ',' in column specification.
What am I missing? I tried reading the command - I don't think I am missing anything.
exchange is a reserved keyword in Hive, so you can't use it as a column name as-is. If you want to keep that name, wrap it in backticks.
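A minimal sketch of the backtick approach, reusing the table definition from the question unchanged apart from the quoted column name and a terminating semicolon. It assumes a Hive version with quoted identifiers enabled, which is the default in recent releases.

```sql
-- `exchange` is quoted with backticks so the parser treats it as an identifier,
-- not as the EXCHANGE keyword.
CREATE EXTERNAL TABLE stocks (
  `exchange`      STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///data/stocks';

-- The backticks are needed wherever the column is referenced later as well.
SELECT `exchange`, symbol, price_close FROM stocks LIMIT 10;
```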
exchange is a reserved keyword in Hive, so try a different column name in its place:
Create table Stocks (exchange1 String, stock_symbol String, stock_date String, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume double, stock_price_adj_close double) row format delimited fields terminated by ",";
I am just playing around with Athena, and I tried following this link
https://awsfeed.com/whats-new/big-data/use-ml-predictions-over-amazon-dynamodb-data-with-amazon-athena-ml
Create an Athena table with geospatial data of neighborhood boundaries
I followed the code based on the sample plus looking at the picture.
However, this is where I ran into issues and had to change the code to this based on the error messages Athena was giving me. Now the current error is mismatched input 'STORED'. Expecting: <EOF
FROM WEBSITE -
CREATE EXTERNAL TABLE <table name
"objectid" int,
"nh_code" int,
"nh_name" string,
"shapearea" double,
"shapelen" double,
"bb_west" double,
"bb_south" double,
"bb_east" double,
"bb_north" double,
"shape" string,
"cog_longitude" double,
"cog_latitude" double)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
I kept getting errors around ROW FORMAT and have tweaked it below
WITH (ROW = DELIMITED
,FIELDS = '\t'
,LINES = '\n'
)
STORED INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
The error messages started at ROW and I've edited above. Now the error code relates to STORED so perhaps the changes I made are necessary. I am not sure. I am not very good with Athena so I was just following the guide and was hoping it would work. Any suggestions on what I am doing wrong?
Thanks.
You have a syntax error in your SQL, the first line should be:
CREATE EXTERNAL TABLE table_name (
There is a stray < in your example, table names can't have spaces, and there should be a ( to start the list of columns.
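Putting that together with the original ROW FORMAT clause from the guide, the statement might look like the sketch below. The table name neighborhoods and the S3 path are placeholders invented for illustration, not values from the guide; the LOCATION clause is included because Athena expects one for an external table. The column names are written unquoted here, since none of them is a reserved word.

```sql
-- Hypothetical table name and S3 location; replace both with your own.
CREATE EXTERNAL TABLE neighborhoods (
  objectid      int,
  nh_code       int,
  nh_name       string,
  shapearea     double,
  shapelen      double,
  bb_west       double,
  bb_south      double,
  bb_east       double,
  bb_north      double,
  shape         string,
  cog_longitude double,
  cog_latitude  double
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-bucket/your-prefix/';
```

With the opening parenthesis in place, the original ROW FORMAT and STORED AS INPUTFORMAT clauses should parse as written, so the WITH (...) rewrite is probably not needed.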
I ran into this problem when loading a CSV file through my HQL script on HDFS.
After inserting the data, I get NULLs in the partition column and some columns are missing. I tried many different ways to insert the data, but I still get odd symbols and lost columns, as if Hive cannot read the CSV file.
Here is the code:`
CREATE database covid_db;
use covid_db;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500;
set hive.exec.max.dynamic.partitions.pernode=500;
CREATE TABLE IF NOT EXISTS covid_db.covid_staging
(
Country STRING,
Total_Cases DOUBLE,
New_Cases DOUBLE,
Total_Deaths DOUBLE,
New_Deaths DOUBLE,
Total_Recovered DOUBLE,
Active_Cases DOUBLE,
Serious DOUBLE,
Tot_Cases DOUBLE,
Deaths DOUBLE,
Total_Tests DOUBLE,
Tests DOUBLE,
CASES_per_Test DOUBLE,
Death_in_Closed_Cases STRING,
Rank_by_Testing_rate DOUBLE,
Rank_by_Death_rate DOUBLE,
Rank_by_Cases_rate DOUBLE,
Rank_by_Death_of_Closed_Cases DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_LZ'
tblproperties ("skip.header.line.count"="1", "serialization.null.format" = "''");
CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned
(
Country STRING,
Total_Cases DOUBLE,
New_Cases DOUBLE,
Total_Deaths DOUBLE,
New_Deaths DOUBLE,
Total_Recovered DOUBLE,
Active_Cases DOUBLE,
Serious DOUBLE,
Tot_Cases DOUBLE,
Deaths DOUBLE,
Total_Tests DOUBLE,
Tests DOUBLE,
CASES_per_Test DOUBLE,
Death_in_Closed_Cases STRING,
Rank_by_Testing_rate DOUBLE,
Rank_by_Death_rate DOUBLE,
Rank_by_Cases_rate DOUBLE,
Rank_by_Death_of_Closed_Cases DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';
FROM
covid_db.covid_staging
INSERT INTO TABLE covid_db.covid_ds_partitioned PARTITION(COUNTRY_NAME)
SELECT *,Country WHERE Country is not null;
CREATE EXTERNAL TABLE covid_db.covid_final_output
(
TOP_DEATH STRING,
TOP_TEST STRING
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_FINAL_OUTPUT';
`
1st: You are checking the file contents, but the partition column is not stored in the file; it is stored in the metadata. Dynamically created partitions are also directories named in the format key=value. So the last column you see in the file is not the partition column, it is Rank_by_Death_of_Closed_Cases.
2nd: You did not specify a delimiter or a NULL format in the DDL of the second table. The default delimiter is '\001' (Ctrl-A). You can specify a delimiter, for example TAB (\t), and the desired NULL format:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
NULL DEFINED AS ''
STORED AS TEXTFILE;
But it is better not to redefine the NULL format if you want to be able to distinguish NULLs from empty strings.
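Applied to the partitioned table from the question, this could look like the sketch below. It assumes a comma is the wanted delimiter (to match the staging table) and leaves the NULL format at its default. Because the table was created with IF NOT EXISTS, the existing definition has to be dropped first; for an external table this normally removes only the metadata, but files already written with the old Ctrl-A delimiter would still need to be regenerated by re-running the insert.

```sql
-- Drop the old definition so the new ROW FORMAT takes effect.
DROP TABLE IF EXISTS covid_db.covid_ds_partitioned;

CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned
(
  Country STRING,
  Total_Cases DOUBLE,
  New_Cases DOUBLE,
  Total_Deaths DOUBLE,
  New_Deaths DOUBLE,
  Total_Recovered DOUBLE,
  Active_Cases DOUBLE,
  Serious DOUBLE,
  Tot_Cases DOUBLE,
  Deaths DOUBLE,
  Total_Tests DOUBLE,
  Tests DOUBLE,
  CASES_per_Test DOUBLE,
  Death_in_Closed_Cases STRING,
  Rank_by_Testing_rate DOUBLE,
  Rank_by_Death_rate DOUBLE,
  Rank_by_Cases_rate DOUBLE,
  Rank_by_Death_of_Closed_Cases DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';

-- After re-running the INSERT, the partition values are listed from the
-- metastore rather than read from the data files.
SHOW PARTITIONS covid_db.covid_ds_partitioned;
```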
I have the following SQL:
#standardSql
CREATE OR REPLACE TABLE batch_report
(
date DATE,
memberId STRING OPTIONS(description="xxxx member ID"),
variables ARRAY<STRUCT<
id STRING,
datatype STRING,
effectiveDate TIMESTAMP,
values ARRAY<STRUCT<
id STRING,
value STRING
>
>,
isSensitive BOOLEAN,
name STRING
>
>
)
PARTITION BY date
OPTIONS (
partition_expiration_days=62, -- two months
description="Stores the raw response from the xxxx batch endpoint"
)
When I run this via the CLI using bq query --dataset=dev < create_batch_report.sql, I get the following error message:
Incompatible table partitioning specification. Expected partitioning specification none, but input partitioning specification is
interval(type:day,field:date)
However, when I run it in the BigQuery console and supply the dataset name in the CREATE OR REPLACE TABLE statement, it executes correctly. Is this a bug? If so, how do I get around it?
When running via the CLI, I modified the first line to include the dataset rather than passing it in using the dataset flag. This caused it to execute correctly. I've modified the SQL to be:
#standardSql
CREATE OR REPLACE TABLE {ENVIRONMENT}.batch_report
(
date DATE,
memberId STRING OPTIONS(description="xxxx member ID"),
variables ARRAY<STRUCT<
id STRING,
datatype STRING,
effectiveDate TIMESTAMP,
values ARRAY<STRUCT<
id STRING,
value STRING
>
>,
isSensitive BOOLEAN,
name STRING
>
>
)
PARTITION BY date
OPTIONS (
partition_expiration_days=62, -- two months
description="Stores the raw response from the xxxx batch endpoint"
)
and execute it via CLI with:
sed s/"{ENVIRONMENT}"/${ENVIRONMENT}/g create_batch_report.sql | \
bq query
For all my CSV sources, I set the extractor to:
USING Extractors.Csv(silent:true,skipFirstNRows:1);
- silent is set to true to ignore bad rows
- skipFirstNRows is set to 1 to skip the header row
Yet oddly, I still get this error:
HEX: "223122" Invalid character when converting input record.
Position: line 2, column index: 7, column name: "IncludeOnCheck".
Invalid character when attempting to convert column data.
Data (sample row & row in question)
29,1,10,DC Tax,DC Tax,0.100000,0.00,1,1,1,2014-07-12 21:34:52.4200000 +00:00,NULL,NULL,0,-1,0,0,0,NULL,NULL,NULL,1031,NULL,0,0
33,4,10,Amenities,Amenities,1.000000,0.00,1,0,1,2014-07-12 21:34:54.1330000 +00:00,NULL,NULL,0,-1,0,0,0,NULL,NULL,NULL,1031,NULL,0,0
Column Definition
EXTRACT AncillaryAmountTypeID int,
AncillaryAmountCategoryID int,
CustomerID int,
CheckTitle string,
ReportTitle string,
Percentage decimal,
FixedAmount decimal,
IncludeOnCheck bool,
AutoCalculate bool,
StoreAtCheckLevel bool?,
DateTimeModified DateTime?,
CheckTitleToken Guid?,
ReportTitleToken Guid?,
DeletedFlag bool,
MaxUsageQty int?,
ApplyToBasePriceOnly bool?,
Exclusive bool,
IsItem bool,
MinValue decimal,
MaxValue decimal,
ItemGroupID int?,
LocationID int,
ApplicationOrder int?,
RequiresReason bool,
Exemptable bool?
Questions
Why am I getting conversion errors when I specified silent is true,
which should ignore any bad rows, right?
The character it tried to convert was "1", and into a boolean. Is U-SQL or ADLA not able to understand or convert 1 and 0 into booleans?
Yes, I have also observed this behavior: it does not convert 0 or 1 into bool automatically. If you want that, EXTRACT the column as int and then convert it to bool using the Convert.ToBoolean method.
I think the silent switch only applies when there is a mismatch between the provided schema and the schema of the actual data.
Here is the documentation for the silent parameter: It only ignores conversion errors if your target type is nullable. Otherwise it still errors.
Also, we are following C# semantics on conversion as described here.
CREATE TABLE DowJones (quarter int, stock string, StockDate date, open double, high double, low double, close double, volume double, percent_change_price double, percent_change_volume_over_last_wk double, previous_weeks_volume double, next_weeks_open double, next_weeks_close double, percent_change_next_weeks_price double, days_to_next_dividend int, percent_return_next_dividend double) row format delimited fields terminated by ‘,’;
Error I get:
Error while compiling statement: FAILED: ParseException line 1:431 mismatched input ',' expecting StringLiteral near 'by' in table row format's field separator [ERROR_STATUS]
New to SQL, so apologies in advance if it's a really obvious fix.
Try like this...
row format delimited fields terminated by '\;'
Let us know
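One more thing that may be worth checking: in the statement as posted, the comma is wrapped in typographic quotes (‘,’) rather than straight single quotes. Hive does not treat those as string delimiters, which would be consistent with the "expecting StringLiteral" message at the position of the separator. A sketch with straight quotes follows; the column list is copied unchanged from the question, since the parser got past it there.

```sql
-- Same statement as in the question; only the quotes around the
-- field separator are replaced with plain ASCII single quotes.
CREATE TABLE DowJones (
  quarter int,
  stock string,
  StockDate date,
  open double,
  high double,
  low double,
  close double,
  volume double,
  percent_change_price double,
  percent_change_volume_over_last_wk double,
  previous_weeks_volume double,
  next_weeks_open double,
  next_weeks_close double,
  percent_change_next_weeks_price double,
  days_to_next_dividend int,
  percent_return_next_dividend double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```

Typographic quotes usually creep in when a statement is pasted from a word processor or a web page, so retyping the quotes in a plain-text editor is a quick way to test this.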