I am trying to load an input data file using Hive.
Consider I have the following input in a text file:
"10"
Is it possible to load the input without the quotation marks, i.e. as an integer?
You can use a third-party CSV SerDe, like this:
add jar path/to/csv-serde.jar;
create table table_name (a string, b string, ...)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
;
Here is the link: https://github.com/ogrodnek/csv-serde.git
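Note that this SerDe reads every column as a string, so a cast is still needed to get an integer. A minimal sketch, assuming the table definition above (column and table names are illustrative):

```sql
-- The SerDe has already stripped the surrounding quotes,
-- so the string column can be cast to an int at query time.
SELECT CAST(a AS INT) AS a_int
FROM table_name;
```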
Using Databricks with SQL, I have to import my csv dataset into a table and analyse the data with it.
My problem is that after I imported the csv dataset, all columns are of type String, but some of them need to be numeric. How can I solve this?
How can I define the column types of a csv file? I tried converting the file to xlsx and setting numeric types, but then it isn't possible to convert it back to csv (or I don't know how).
Thanks for helping
PS: Databricks wants just a csv file, not xlsx or similar.
If you are using Databricks on Azure, when you select "Create table with UI" there should be options to choose a data type for each column, as in screenshot A below.
If you are importing the table with Python Spark code, there is an option, inferSchema, for you to set. If it is set to "true", all columns that contain only numeric values will be given appropriate numeric data types.
file_location = "/FileStore/shared_uploads/xxx/dbo_project.csv"
file_type = "csv"
infer_schema = "true"
first_row_is_header = "false"
delimiter = ","
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
Screenshot A
I have to get the filename with each row, so I used
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray);
But in data.csv some columns have a comma (,) in their content as well, so to handle the comma issue I used
data = LOAD 'data.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (filename:chararray);
But I didn't find any way to use the -tagFile option with CSVExcelStorage.
Please let me know how I can use CSVExcelStorage and the -tagFile option at once.
Thanks
I found a way to perform both operations (get the file name in each row and handle the delimiter when it appears in column content):
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/* replace comma(,) if it appears in column content */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;
/* remove the quotes ("") placed around a column when it contains a comma, as that is a csv file feature */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record, '"', '') AS record;
Once the data is loaded properly without stray commas, I am free to perform any operation.
A detailed use case is available on my blog.
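For reference, the same regex can be checked outside Pig. A minimal Python sketch (the sample line is illustrative, not from the original data):

```python
import re

# Remove commas that fall inside quoted CSV fields: a comma is dropped when
# the rest of the line contains an odd number of double quotes, i.e. when
# the comma sits inside an open quoted field.
pattern = r',(?!(([^"]*"){2})*[^"]*$)'

line = '1,"Doe, John",42'
cleaned = re.sub(pattern, '', line)   # -> 1,"Doe John",42
cleaned = cleaned.replace('"', '')    # -> 1,Doe John,42
print(cleaned)
```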
You can't use -tagFile with CSVExcelStorage, since CSVExcelStorage does not have a -tagFile option. The workaround is to change the delimiter of the file and use PigStorage with the new delimiter and -tagFile, or to replace the comma in your data.
I am new to Pig Latin and I tried this schema on my data:
A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
My sample data is
10,1,{(2,4),(5,6)}
10,3,{(1,3),(6,9)}
On performing \d A the output on my terminal is:
(10,1,)
(10,3,)
Please tell me what I am doing wrong.
The sample data you have is not in the correct format. Your load statement uses ',' as the field separator. However, the tuples in the bag are also separated by ',' and hence the data is not loaded correctly.
One way to fix this is to choose a different delimiter for the fields, for example tab, pipe, or semicolon.
Using Tabs as field separator and comma as tuple separator
10 1 {(2,4),(5,6)}
10 3 {(1,3),(6,9)}
Script for tab delimited fields with the schema
A = LOAD 'test8.txt' using PigStorage('\t') AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
DUMP A;
Output:
(10,1,{(2,4),(5,6)})
(10,3,{(1,3),(6,9)})
Alternatively, you can load the sample data without specifying the fields
10,1,{(2,4),(5,6)}
10,3,{(1,3),(6,9)}
Script for load without schema but with ',' as field separator
A = LOAD '/test8.txt' USING PigStorage(',');
DUMP A;
Output (note that here the line is split into plain chararray fields, not a bag):
(10,1,{(2,4),(5,6)})
(10,3,{(1,3),(6,9)})
I am just trying to load an unstructured input file and add the filename, so what I want to get is two fields:
filename:chararray, inputrow:chararray.
I can load the filename if I have a field delimiter, using PigStorage(';','-tagFile'), but I do not want to delimit fields at this point; I just want the string and the filename. How can I do this?
The way to load files without applying a delimiter is to choose a delimiter that does not (and cannot) occur in the file.
For example, if your file is separated by ; and cannot contain tabs \t, you could do:
PigStorage('\t','-tagFile')
I am trying to export a table with an XMLType field from DB2 to a csv file.
I found that in the csv file the relational fields of the table are output correctly.
But the value of the XMLType field is a pointer to an exported XML file.
The csv file content:
1349714,,2,<XDS FIL='result.csv.001.xml' OFF='0' LEN='7013' />,2014-01-22-16.38.58.314000
You can see that the 4th field value is a pointer to an XML file.
How can I include the XML content when exporting to a csv file in DB2?
For now, I'm using this cmd to do export:
EXPORT TO result.csv OF DEL MODIFIED BY NOCHARDEL SELECT col1, col2, coln FROM dbtable;
Thanks Buddy.
You need to convert XML to a character data type, e.g. using XMLSERIALIZE(yourxmlcolumn AS VARCHAR(32672)). Keep in mind that both the VARCHAR data type and the delimited export format have limitations on the value length (32672 and 32700 bytes respectively), so if your serialized XML fragment is longer than that it will be truncated.
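A hedged sketch of the export command with the serialization applied inline, following the example command above (xmlcol stands for the XMLType column and is illustrative):

```sql
-- Serialize the XML column to VARCHAR so its content is written
-- into the csv instead of an <XDS ...> file pointer.
EXPORT TO result.csv OF DEL MODIFIED BY NOCHARDEL
  SELECT col1, col2, XMLSERIALIZE(xmlcol AS VARCHAR(32672))
  FROM dbtable;
```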