I dump a Hive table as follows
hive -e 'select * from sometable' | sed 's/[\t]/,/g' > /tmp/us.csv
I drop a destination table and create it again based on a model table.
drop table blargh; create table blargh like modeltable;
modeltable is partitioned on a field called mkt_cd so now blargh is too.
I run a script against /tmp/us.csv that rewrites the timestamp field to the current time, now(), in the format YYYY-mm-DD HH:MM:SS, and writes the result to a file new.csv. The timestamps in the old CSV, /tmp/us.csv, were already in this format, but they were stale and needed refreshing.
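The timestamp-refresh step could look roughly like the sketch below. The question does not show the actual script, so the helper name refresh_timestamp and the column position are assumptions, and the sketch only handles plain comma-separated lines without quoted fields.

```shell
#!/bin/sh
# Hypothetical helper (not the asker's script): overwrite one
# comma-separated column with the current time, YYYY-mm-DD HH:MM:SS.
# Usage: refresh_timestamp COLUMN_NUMBER < in.csv > out.csv
refresh_timestamp() {
    col="$1"
    # Take the timestamp once so every row gets the same value.
    now=$(date '+%Y-%m-%d %H:%M:%S')
    # Split and re-join on commas; replace the chosen column on each line.
    awk -F, -v OFS=, -v col="$col" -v now="$now" '{ $col = now; print }'
}

# Example with the paths from the question (column 3 is an assumption):
# refresh_timestamp 3 < /tmp/us.csv > new.csv
```

Because the file was produced by replacing tabs with commas, any value that itself contains a comma would shift the columns, so it is worth checking a few rows by hand.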
Finally I try to load with the new CSV file:
hive> load data inpath '/path/to/new/new.csv' into table blargh partition(mkt_cd);
FAILED: NullPointerException null
This error occurs even if I use just the head (a few rows) of new.csv as input. In an editor the file looks fine, and there are no empty lines. The error also occurs if I simply use the original CSV, /tmp/us.csv, before I changed it. What could be the reason?
My question is somewhat similar to the post below. I want to download some data from a Hive table using a select query. Because the data is large, I want to write it as an external table at a given path so that I can create a CSV file. I use the code below:
create external table output(col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output';
INSERT OVERWRITE TABLE output
Select col1, col2 from atable limit 1000;
This works fine and creates a file named in the 000000_0 style, which can be copied as a CSV file.
But my question is: how can I ensure that the output will always be a single file? If no partition is defined, will it always be a single file? What rule does Hive use to split files?
I saw a few similar questions, like the one below, but they discuss HDFS file access.
How to point to a single file with external table
I know the alternative below, but I use a Hive connection object to execute queries from a remote node.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
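For the command-line alternative above, the tab-to-comma step can be sketched with tr instead of sed. This is only an illustration: the helper name tabs_to_commas is made up here, and the motivation is that the [\t] escape inside a bracket expression is a sed extension that may not be understood by every sed implementation.

```shell
#!/bin/sh
# Hypothetical helper: convert tab-separated lines on stdin to
# comma-separated lines on stdout.
# Fields are not quoted, so values that themselves contain commas
# would still produce an ambiguous CSV.
tabs_to_commas() {
    tr '\t' ','
}

# Example (query and output path are the placeholders from the question):
# hive -e 'selectsql;' | tabs_to_commas > outpathwithfilename
```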
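You can set the following property before running the INSERT OVERWRITE. It forces a single reducer, so a query that has a reduce stage writes a single output file: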
set mapreduce.job.reduces=1;
Note: If the Hive engine does not allow this property to be modified at runtime, whitelist it by setting the property below in hive-site.xml:
hive.security.authorization.sqlstd.confwhitelist.append=|mapreduce.job.*|mapreduce.map.*|mapreduce.reduce.*
I need to load S3 data into a Hive table. The S3 location is dynamic and is stored in another, static S3 location.
The dynamic S3 location that I want to load into the Hive table has the path format
s3://s3BucketName/some-path/yyyy-MM-dd
and the file at the static location has the data format
{"datasetDate": "datePublished", "s3Location": "s3://s3BucketName/some-path/yyyy-MM-dd"}
Is there a way to read this data in Hive? I have searched a lot but could not find anything.
You can read the JSON data from the file at your static location, parse the s3Location field, and pass it as a parameter to your ADD PARTITION clause.
One possible way to read the JSON is with Hive, although you can use other means for the same purpose.
Example using Hive:
create table data_location(location_info string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://s3BucketName/some-static-location-path/';
Then get the location in a shell script and pass it as a parameter to the ADD PARTITION statement.
For example, suppose you have a table named target_table partitioned by datePublished. You can add partitions like this:
#!/bin/bash
#read the dynamic S3 location from the JSON row (\$ keeps the shell from touching the JSON path)
data_location=$(hive -e "set hive.cli.print.header=false; select get_json_object(location_info,'\$.s3Location') from data_location")
#get partition name
partition=$(basename "${data_location}")
#Create partition in your target table:
hive -e "ALTER TABLE TARGET_TABLE ADD IF NOT EXISTS PARTITION (datePublished='${partition}') LOCATION '${data_location}'"
If you do not want a partitioned table, you can use
ALTER TABLE SET LOCATION instead of adding a partition:
hive -e "ALTER TABLE TARGET_TABLE SET LOCATION '${data_location}'"
If only the last subfolder name (the date) is dynamic and the base directory is always the same, for example s3://s3BucketName/some-path/ with only yyyy-MM-dd changing, you can create the table once with location s3://s3BucketName/some-path/ and issue a RECOVER PARTITIONS statement. In this case you do not need to read the contents of the file that holds the location. Just schedule RECOVER PARTITIONS to run daily so that each new partition gets attached.
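As an illustration of reading the JSON by other means, the location can also be pulled out in the shell itself. This is a minimal sketch, assuming the static file holds a single one-line JSON record like the one in the question. The helper names extract_s3_location and add_partition_sql are hypothetical, and the sed pattern is a simple match, not a real JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: extract the s3Location value from a one-line
# JSON record on stdin. Assumes the value contains no escaped quotes.
extract_s3_location() {
    sed -n 's/.*"s3Location"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: print the ADD PARTITION statement for a location.
# TARGET_TABLE and datePublished are the names used in the answer.
add_partition_sql() {
    location="$1"
    # The last path component (the date) becomes the partition value.
    partition=$(basename "$location")
    printf "ALTER TABLE TARGET_TABLE ADD IF NOT EXISTS PARTITION (datePublished='%s') LOCATION '%s';\n" "$partition" "$location"
}

# Example (static.json is a placeholder for a local copy of the static file):
# loc=$(extract_s3_location < static.json)
# hive -e "$(add_partition_sql "$loc")"
```

Printing the statement before running it makes it easy to inspect what will be sent to Hive.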
I am working on a Hive solution in which I need to append some values to high-volume files. Instead of appending to the files directly, I am trying a MapReduce-based approach.
The approach is as follows.
Table creation:
create external table demo_project_data(data string) PARTITIONED BY (business_date string, src_sys_file_nm string, prd_typ_cd string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
LOCATION '/user/hive/warehouse/demo/project/data';
hadoop fs -mkdir -p /user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd'
echo "ALTER TABLE demo_project_data ADD IF NOT EXISTS PARTITION(business_date='20180707',src_sys_file_nm='a ch_ach_fotp_20180707_1.dat.gz',prd_typ_cd='ach')
LOCATION '/user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd';"|hive
hadoop fs -cp /apps/tdi/data/a_b_c_20180707_1.dat.gz /user/hive/warehouse/demo/project/data/business_date='20180707'/src_sys_file_nm='a_b_c_20180707_1.dat.gz'/prd_typ_cd='abcd'
echo "INSERT OVERWRITE DIRECTORY '/user/20180707' select *,'~karthick~kb~demo' from demo_project_data where src_sys_file_nm='a_b_c_20180707_1.dat.gz' and business_date='20180707' and prd_typ_cd='abcd';"|hive
There is data in the file, but I do not see any results from the query above. The files are copied to the correct location.
What am I doing wrong? The query itself runs without errors.
I will also be looping over multiple dates, and I would like to know whether this is the right way to do it.
You can use the command below to register the partition directories in the metastore, so that queries return the data in them:
MSCK REPAIR TABLE <tablename>;
See the documentation for
MSCK REPAIR TABLE:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
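For the loop over multiple dates mentioned in the question, one option is to generate all ADD PARTITION statements first and run them in a single Hive session. The following is only a sketch: the helper name partition_statements is hypothetical, and the file-name pattern a_b_c_<date>_1.dat.gz and the prd_typ_cd value abcd are assumptions taken from the example paths in the question.

```shell
#!/bin/sh
# Hypothetical helper: print one ADD PARTITION statement per business date.
# Table name, partition columns, and base path come from the question;
# the file-name pattern and prd_typ_cd value are assumptions.
partition_statements() {
    base='/user/hive/warehouse/demo/project/data'
    for d in "$@"; do
        file="a_b_c_${d}_1.dat.gz"
        # Partition values and the directory path are built from the same
        # variables, so they cannot drift apart.
        printf "ALTER TABLE demo_project_data ADD IF NOT EXISTS PARTITION (business_date='%s', src_sys_file_nm='%s', prd_typ_cd='abcd') LOCATION '%s/business_date=%s/src_sys_file_nm=%s/prd_typ_cd=abcd';\n" "$d" "$file" "$base" "$d" "$file"
    done
}

# Example: write the statements to a file and run them in one session.
# partition_statements 20180707 20180708 > /tmp/add_partitions.hql
# hive -f /tmp/add_partitions.hql
```

Generating the partition spec and the location together helps keep the registered partition values consistent with the directory names that the files were copied into.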
Yesterday I installed Cloudera QuickStart VM 5.8. After importing files from the database through HUE, some tables contained NULL values in an entire column. In the previous steps the data was displayed properly, as it should have been imported.
Can you run the command describe formatted table_name in the Hive shell and check the field delimiter, then go to the warehouse directory and check whether the delimiter in the data is the same as the one in the table definition? I am sure they will not match, and that is why you see NULL.
I am assuming you have imported the data into the default warehouse directory.
Then you can do one of the following:
1) Drop your Hive table and create it again with the delimiter that is actually in the data (row format delimited fields terminated by "your delimiter"), giving the location of your data file,
or
2) Delete the imported data and run the sqoop import again, passing --fields-terminated-by with the delimiter from the Hive table definition.
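To compare the delimiter in the data with the table definition, a quick check is to count the fields each line splits into for a candidate delimiter. This is a rough sketch, assuming a local copy of one data file; the helper name field_counts and the file name in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct per-line field counts seen in a
# data file when it is split on the given single-character delimiter.
# A single count equal to the table's column count suggests a match;
# a count of 1 suggests the delimiter does not occur in the data.
field_counts() {
    delim="$1"
    file="$2"
    awk -F"$delim" '{ print NF }' "$file" | sort -n | uniq
}

# Example: test a comma, then Hive's default Ctrl-A (\001) delimiter.
# part-m-00000 is a placeholder for a local copy of one data file.
# field_counts , part-m-00000
# field_counts "$(printf '\001')" part-m-00000
```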
Also check the data types of the second (col_1) and third (col_2) columns in the original database you are exporting from.
This cannot be a case of a missing delimiter; otherwise the fourth column (col_3) would not have been populated correctly, and it is.
I have loaded a web file into a table using a SerDe in Hive, and I am able to view the table data. Now I want to copy the data to a new table. If I run
Create table new_xxx as select * from XXX;
the job fails.
Error in the log file:
Execution error,return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Run time Exception:error in configuring object.
Since you are using a SerDe to load web data into the first table, Hive serializes and deserializes the table data on insert and select. So the second table, into which you are trying to insert data, should also be aware of the SerDe used.
Try the following syntax; it might help:
CREATE TABLE new_table_XX ROW FORMAT SERDE "org.apache.hadoop.hive.serde" AS SELECT .....