How should I work with NULL values in RegexSerDe?
I have a file with data:
cat MOS/ex1.txt
123,dwdjwhdjwh,456
543,\N,956
I have the table:
CREATE TABLE mos.stations (usaf string, wban STRING, name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*)"
);
I successfully loaded the data from the file into the table:
LOAD DATA LOCAL INPATH '/home/hduser/MOS/ex1.txt' OVERWRITE INTO TABLE mos.stations;
Simple select works fine:
hive> select * from mos.stations;
123    dwdjwhdjwh    456
543    \N    956
And next ends with error:
select * from mos.stations where wban is null;
[Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
What is wrong?
I see a couple of possible issues:
1) It may not have anything to do with null handling at all. The first query doesn't actually spawn an M/R job while the second one does, so it might be a simple classpath issue where RegexSerDe is not being seen by the M/R tasks because its jar is not in the classpath of the tasktracker. You'll need to find where the hive-contrib jar lives on your system and then make Hive aware of it via something like:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar
Note, your path and jar name may be different. You can run the above through hive right before your query.
2) Another issue might be that the RegexSerDe doesn't really deal with "\N" the same way as the default LazySimpleSerDe. Judging by the output you are getting in the first query (where it returns a literal "\N"), that could be the case. What happens if you query where wban='\\N', or where wban='\N'? (I forget if you need to double escape.)
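For example (I'm guessing at which escaping Hive expects here, so try both):
select * from mos.stations where wban = '\\N';
-- or, if single escaping is enough:
select * from mos.stations where wban = '\N';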
Finally, one word of caution about RegexSerDe. While it's really handy, it's slow as molasses going uphill in January compared to the default serde. If the dataset is large and you plan to run a lot of queries against it, it's best to pre-process the data so that you don't need the RegexSerDe; otherwise you're going to pay that penalty for every query. The dataset above looks like it would be fine with the default serde.
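If you go that route, a minimal sketch might look like this (the table name mos.stations_plain is hypothetical; the default LazySimpleSerDe should also treat \N as NULL out of the box):
CREATE TABLE mos.stations_plain (usaf STRING, wban STRING, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/hduser/MOS/ex1.txt' OVERWRITE INTO TABLE mos.stations_plain;
Then a query like select * from mos.stations_plain where wban is null should behave as expected.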
I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names like 0000003_0, etc. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent filename like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better not to do this; instead, modify your program to accept any number of files in the directory.
Each reducer (or mapper, if the job is map-only) creates its own file. The reducers know nothing about each other; the files are named during creation. Files are marked as 000001_0, 000002_0, and so on. But it can also be 000001_1 if attempt number 0 failed and attempt number 1 succeeded. Also, if the table is partitioned and there is no DISTRIBUTE BY partition key at the end, each reducer will create its own file in each partition.
You can force it to work on a single final reducer (for example by adding an ORDER BY clause, or by setting set mapred.reduce.tasks=1;), as in the sketch below. But bear in mind that this solution is not scalable, because too much data will cause performance problems on a single reducer. And what happens if attempt 0 fails, is restarted, and attempt 1 succeeds? It will create 000001_1 instead of 000001_0.
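A rough sketch of the single-reducer approach (the setting and the DISTRIBUTE BY clause here are illustrative, and the attempt-number caveat above still applies):
set mapred.reduce.tasks=1;

INSERT OVERWRITE TABLE tab_content
PARTITION (datekey)
SELECT * FROM source
DISTRIBUTE BY datekey;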
I have to translate long Teradata scripts (10,000 lines long) into Impala. I have never done this before with Impala.
The tools I've got to work with are impala-shell and Hue.
I've not seen an example of Impala code that's more than 50 lines long in either impala-shell or Hue. Can someone point me to an example of an Impala script, in either tool, that's at least 500 lines long?
I can handle the syntax changes; I don't need advice on that. I'm looking for gotchas or traps in writing long code in these tools.
You need to create an external table that points to your source data file (as shown in the Impala tutorial).
-- The EXTERNAL clause means the data is located outside the central location
-- for Impala data files and is preserved when the associated Impala table is dropped.
-- We expect the data to already exist in the directory specified by the LOCATION clause.
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';
Then you can easily move your data whenever you want using an INSERT statement.
INSERT INTO table2
SELECT * FROM tab1;
When creating a table, how do you deal with a timestamp in a CSV file that has the following format: MM/DD/YY HH:MI? Here's an example: 1/1/16 19:00
I have tried the following script in PostgreSQL:
create table timetable (
time timestamp
);
copy timetable from '<path>' delimiter ',' CSV;
But, I receive an error message saying:
ERROR: invalid input syntax for type timestamp: "visit_datetime"
Where: COPY air_reserve, line 16, column visit_datetime: "visit_datetime"
One solution I have considered is first loading the timestamp column as char, then running a separate query that converts it to the proper timestamp datatype with something like to_timestamp(time, 'MM/DD/YY HH24:MI'). But I'm looking for a solution that would load the data with the correct datatype in a single step.
You may find a datestyle that enables you to load the data you have, but sooner or later someone will deliver to you something that doesn't fit.
The solution you have considered is probably the best.
We use this as a standard pattern for loading data warehouses. We take today's data and load it into a staging table, using varchar columns for any data that will not load directly into its target data type. We then run whatever scripts we need to get the data into a good state, raising warnings for anything that is broken in a way we haven't seen before. Then we add the cleaned version of today's data to the table containing cleaned data for all previous days.
We don't mind if this takes several steps; we put them all in a script and run it as an automated job.
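A minimal sketch of that pattern for this case (the staging table name is hypothetical, and the format string assumes a 24-hour clock, which your example value 19:00 suggests):
-- stage the raw value as text
create table staging_timetable (
    visit_datetime varchar(32)
);

copy staging_timetable from '<path>' delimiter ',' CSV;

-- then convert into the real table
insert into timetable (time)
select to_timestamp(visit_datetime, 'MM/DD/YY HH24:MI')
from staging_timetable;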
I'm working on documenting the techniques we use. You can see the beginnings of this at http://www.thedatastudio.net.
I'm trying to insert around 80 million records created using MATLAB into a Vertica database table. I wanted to know whether we can call a COPY LOCAL statement in MATLAB as a regular SQL statement using exec(conn, sql). For test purposes, I tried with a .dat file having around 4 million records, as follows:
sqlstmnt = 'COPY schema.table_name (FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL ''/my/file/full/path/test1.dat''';
results = exec(conn,sqlstmnt);
But it gave an error in results.Message like:
[Vertica]JDBC A ResultSet was expected but not generated from query "COPY schema.table_name(FK_CUSTOMER_ID,FK_RUN_START_DATE_ID,FK_RUN_END_DATE_ID,FK_TRAVEL_ID,FK_ORIGIN_ID,FK_DEST_ID,FK_SEGMENT_ID,SEGMENT_PERCENTAGE,LAST_UPDATED) FROM LOCAL '/my/file/full/path/test1.dat'". Query not executed.
I have the data in the '.dat' file in the order in which the columns are mentioned in COPY LOCAL.
I could not find any helpful resource explaining this error.
I have this test1.dat file which I'm able to insert using COPY from vsql, but since I run my code in MATLAB with many iterations, each producing about a million records, I want to insert them during each iteration. Any help will be really great.
The COPY command returns a ResultSet that includes the amount of loaded data. I see two main options:
1) results = exec(conn,sqlstmnt);
2) results = runsqlscript(conn,'nameOfSQLScriptthatIncludeTheCopyCommand.sql')
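For option 2, the script file would contain just the plain COPY statement from your question, without the MATLAB quote escaping; for example:
COPY schema.table_name (FK_CUSTOMER_ID, FK_RUN_START_DATE_ID, FK_RUN_END_DATE_ID,
                        FK_TRAVEL_ID, FK_ORIGIN_ID, FK_DEST_ID, FK_SEGMENT_ID,
                        SEGMENT_PERCENTAGE, LAST_UPDATED)
FROM LOCAL '/my/file/full/path/test1.dat';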
I hope you will find it useful
Thanks
I just finished reviewing your input sample data.
I see a major problem with the mapping of the input CSV to the target table.
The main issues are:
1) Lines are broken into 2 lines (you should have one sample per line and avoid breaking it into 2 lines), e.g.:
Eg : "1,20150101,0,2,2573,2714,1,8.147237e-01
50,48,49,54,45,48,51,-28 12:11:46"
2) When you define data types on the Vertica table, e.g. TIMESTAMP, the data in the CSV must conform to them (what you have is "-28 12:11:46", which will not work).
After you fix all these issues, make sure you test it using vsql, then go and try it with MATLAB.
I hope you will find it useful.
Is it possible in Hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track any mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I usually do is run hive -e "select * from db.table" > filename.tsv to get the data locally; however, when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly, doing it the way you are is the better of the two possible ways, but it is worth noting that you can perform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store the result somewhere in the local directory (as long as there is enough space and the correct privileges).
A disadvantage of this is that with a pipe you get the data stored nicely as '|'-delimited and newline-separated, but this method will store the values with Hive's default field delimiter ('^A', I think).
A work around is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only available in Hive 0.11 or higher.