I am trying to load an input data file using Hive.
Consider I have the following input in a text file:
"10"
Is it possible to load the input without the quotation marks, i.e. as an integer?
You can use a third-party CSV SerDe, like this:
add jar path/to/csv-serde.jar;
create table table_name (a string, b string, ...)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
;
Here is the link: https://github.com/ogrodnek/csv-serde.git
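Note that this SerDe reads every column as a string, so a cast is still needed to get an integer. A minimal sketch, assuming the table definition above (column and table names are illustrative):

```sql
-- The SerDe has already stripped the surrounding quotes,
-- so the string column can be cast to an int at query time.
SELECT CAST(a AS INT) AS a_int
FROM table_name;
```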
Using Databricks with SQL, I have to import my csv dataset into a table and analyse the data with it.
My problem is that after I imported the csv dataset, all columns are of type String, but some of them need to be numeric. How can I solve this?
How can I define the column types of a csv file? I tried converting the file to xlsx and setting numeric types, but then it isn't possible to convert it back to csv (or I don't know how).
Thanks for helping
PS: Databricks wants just a csv file, not xlsx or similar.
If you are using Databricks on Azure, when you select "Create table with UI" there should be options to choose a data type for each column, as in screenshot A below.
If you are importing the table with Python Spark code, there is an option, inferSchema, for you to set. If it is set to "true", all columns that contain only numeric values will be given appropriate numeric data types.
file_location = "/FileStore/shared_uploads/xxx/dbo_project.csv"
file_type = "csv"
infer_schema = "true"
first_row_is_header = "false"
delimiter = ","
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
Screenshot A
I have to get the filename with each row, so I used
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray);
But in data.csv some columns have a comma (,) in their content as well, so to handle the comma issue I used
data = LOAD 'data.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (filename:chararray);
But I didn't find any way to use the -tagFile option with CSVExcelStorage.
Please let me know how I can use CSVExcelStorage and the -tagFile option at once.
Thanks
I found a way to perform both operations (get the file name in each row and handle the delimiter when it appears in column content):
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/* replace comma(,) if it appears in column content */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;
/* remove the quotes ("") placed around a column when it contains a comma, as that is a csv file feature */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record, '"', '') AS record;
Once the data is loaded properly without stray commas, I am free to perform any operation.
A detailed use case is available on my blog.
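For reference, the same regex can be checked outside Pig. A minimal Python sketch (the sample line is illustrative, not from the original data):

```python
import re

# Remove commas that fall inside quoted CSV fields: a comma is dropped when
# the rest of the line contains an odd number of double quotes, i.e. when
# the comma sits inside an open quoted field.
pattern = r',(?!(([^"]*"){2})*[^"]*$)'

line = '1,"Doe, John",42'
cleaned = re.sub(pattern, '', line)   # -> 1,"Doe John",42
cleaned = cleaned.replace('"', '')    # -> 1,Doe John,42
print(cleaned)
```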
You can't use -tagFile with CSVExcelStorage, since CSVExcelStorage does not have a -tagFile option. The workaround is to change the delimiter of the file and use PigStorage with the new delimiter and -tagFile, or to replace the comma in your data.
I am new to Pig Latin and I tried this schema on my data:
A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
My sample data is
10,1,{(2,4),(5,6)}
10,3,{(1,3),(6,9)}
On performing \d A the output on my terminal is:
(10,1,)
(10,3,)
Please tell me what I am doing wrong.
The sample data you have is not in the correct format. Your load statement uses ',' as the field separator. However, the tuples in the bag are also separated by ',' and hence the data is not loaded correctly.
One way to fix this is to choose a different delimiter for the fields, for example tab, pipe, or semicolon.
Using Tabs as field separator and comma as tuple separator
10 1 {(2,4),(5,6)}
10 3 {(1,3),(6,9)}
Script for tab delimited fields with the schema
A = LOAD 'test8.txt' using PigStorage('\t') AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
DUMP A;
Output:
(10,1,{(2,4),(5,6)})
(10,3,{(1,3),(6,9)})
Alternatively, you can load the sample data without specifying the fields
10,1,{(2,4),(5,6)}
10,3,{(1,3),(6,9)}
Script for load without schema but with ',' as field separator
A = LOAD '/test8.txt' USING PigStorage(',');
DUMP A;
Output (note that here the line is split into plain chararray fields, not a bag):
(10,1,{(2,4),(5,6)})
(10,3,{(1,3),(6,9)})
I am just trying to load an unstructured input file and add the filename, so what I want to get is two fields:
filename:chararray, inputrow:chararray.
I can load the filename if I have a field delimiter, using PigStorage(';','-tagFile'), but I do not want to delimit fields at this point; I just want the string and the filename. How can I do this?
The way to load files without applying a delimiter is to choose a delimiter that does not (and cannot) occur in the file.
For example, if your file is separated by ; and cannot contain tabs \t, you could do:
PigStorage('\t','-tagFile')
I am trying to export a table with an XMLType field from DB2 to a csv file.
I found that in the csv file the relational fields of the table are output correctly.
But the value of the XMLType field is a pointer to an exported XML file.
The csv file content:
1349714,,2,<XDS FIL='result.csv.001.xml' OFF='0' LEN='7013' />,2014-01-22-16.38.58.314000
You can see that the 4th field value is a pointer to an XML file.
How can I include the XML content when exporting to a csv file in DB2?
For now, I'm using this cmd to do export:
EXPORT TO result.csv OF DEL MODIFIED BY NOCHARDEL SELECT col1, col2, coln FROM dbtable;
Thanks Buddy.
You need to convert XML to a character data type, e.g. using XMLSERIALIZE(yourxmlcolumn AS VARCHAR(32672)). Keep in mind that both the VARCHAR data type and the delimited export format have limitations on the value length (32672 and 32700 bytes respectively), so if your serialized XML fragment is longer than that it will be truncated.
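A hedged sketch of the export command with the serialization applied inline, following the example command above (xmlcol stands for the XMLType column and is illustrative):

```sql
-- Serialize the XML column to VARCHAR so its content is written
-- into the csv instead of an <XDS ...> file pointer.
EXPORT TO result.csv OF DEL MODIFIED BY NOCHARDEL
  SELECT col1, col2, XMLSERIALIZE(xmlcol AS VARCHAR(32672))
  FROM dbtable;
```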