I have data something like this in a CSV file:
1|abc|"Hello,
how are you"|pqr
2|xyz|I am fine|tuv
3|hjd|what abt you|klf
You can see we have a multiline record in the CSV. I can load this into a Hive table, but it will not show me the correct result.
How can I handle multiline records when loading them into Hive?
Hive is suited to structured datasets, where the column sequence and delimiters are well defined. In your case, the EOL character is the row delimiter and can also occur inside the data, so the data is only semi-structured.
You have a few options:
If the file is being generated by another program, change the row delimiter to something other than a newline. The usual practice is to use Ctrl+A for the column delimiter and Ctrl+B for the row delimiter.
If option 1 is not feasible, write a MapReduce job with a custom record reader implementation, where the record reader determines the boundary of one record (i.e. the logic to pick one complete record). Reformat each record in the MapReduce job to output records delimited with Ctrl+A (column delimiter) and Ctrl+B (row delimiter). Now you can load the file into Hive; a simplified sketch of the same reformatting step follows below.
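For illustration, a minimal preprocessing sketch of that reformatting step, done here in Python rather than MapReduce, might look like the following; it assumes the input is pipe-delimited with multiline values enclosed in double quotes, and the file names are placeholders.

import csv

# csv.reader keeps newlines that occur inside quoted fields, so each iteration
# yields one complete logical record even when it spans several physical lines.
with open("multiline_input.csv", newline="") as src, open("reformatted_for_hive.txt", "w") as dst:
    reader = csv.reader(src, delimiter="|", quotechar='"')
    for row in reader:
        # Ctrl+A (\x01) between columns, Ctrl+B (\x02) after each record,
        # so embedded newlines no longer act as record boundaries.
        dst.write("\x01".join(row) + "\x02")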
Related
I am trying to format text data stored in my SQLite file so that, when queried, it is returned with formatting, more specifically indents.
I've read that this should be done in the front-end application, but it seems it would be taxing to format hundreds of different, nonuniform entries (these entries do share the same column value in the table, though).
I have bullets in almost all of my entries but cannot indent them to nest properly.
Is there any way to execute an SQL statement against the column to add indent behavior like so:
alksdjf ;lkasjdfl kajsl;dk fja;lsk djfal;ksjdfl;aksjd fl;kasjdfslkdfjals;kdjf;laksjdflkajs;dlkfja;lskjdfl;aksjdflkasjd;lfkjasl;kdfjl;
so that the wrapped lines of text line up with each other past the bullet?
I've been given some CSV files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field, the process should be trivial:
Consume the row from either side by the count of good delimiters, replacing them with a new, unique delimiter; what remains in the middle is the column with the extra old delimiters, which you can just leave as is.
If you have two bad fields straddling good fields, you will need more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found a terminating tag and then process the remaining delimiters as needed.
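For the simple case of one known bad column, a minimal Python sketch of that "consume from either side" idea could look like this; the column count, the index of the bad column, and the file names are assumptions you would adapt.

NUM_COLS = 5     # expected number of columns per row (an assumption)
BAD_INDEX = 2    # zero-based index of the column that may contain stray commas
NEW_DELIM = "|"

def repair(line):
    parts = line.rstrip("\n").split(",")
    tail = NUM_COLS - BAD_INDEX - 1                         # good columns after the bad one
    left = parts[:BAD_INDEX]                                # good columns before the bad one
    right = parts[len(parts) - tail:] if tail else []
    middle = ",".join(parts[BAD_INDEX:len(parts) - tail])   # the bad column, commas preserved
    return NEW_DELIM.join(left + [middle] + right)

with open("input.csv") as src, open("repaired.psv", "w") as dst:
    for line in src:
        dst.write(repair(line) + "\n")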
Import from CSV fails if there is more than one record in the CSV file. In this sample file, the data is delimited by a single space character. The problem is that every record has a single space even after the last column value, so when the system encounters this last single space on a line it assumes another column value follows and does not move on to the next record (as it is unable to find the newline character).
Is there any way to specify that the single space after the last column value on each line should be ignored?
Or any way to treat this last single space on each line as a newline character?
I have thousands of rows, so it is impractical to manually replace the last single space on each line with an end-of-line character.
On another note, is there a good ETL tool that can easily move raw data into Cassandra and avoid this kind of problem?
Error message
$COPY sensors_data(samplenumber,magx,magy,magz,accx,accy,accz,gyror,gyrop,gyroy,lbutton,rbutton) FROM '/home/swift/cassandra/input-data/FallFromDesk1.csv' WITH DELIMITER=' ';
Record #0 (line 1) has the wrong number of fields (13 instead of 12).
Note
The above command works perfectly if there is only one row in the .csv file or if we manually remove the single space after the last column value on each row.
Kindly help me out.
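Since the file only needs to be fixed once before running COPY, one option is a small preprocessing script. A minimal Python sketch, assuming the only problem is the single trailing space before the newline on every row (file names are placeholders):

with open("FallFromDesk1.csv") as src, open("FallFromDesk1_clean.csv", "w") as dst:
    for line in src:
        dst.write(line.rstrip(" \n") + "\n")   # drop the trailing space, keep exactly one newline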
I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See below the original data and the changed one; remove the space before the file name :)
[Correct]: i.stack.imgur.com/ DLNTT.png
[Messed up]: i.stack.imgur.com/ 7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and struct fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items separated by tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
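As a rough illustration of how those terminators behave, here is how a single Hive text-format row could be pulled apart outside of Hive (for example, in a Python script reading a file copied off HDFS); the row contents and column layout are invented for the example.

# One row as stored in a Hive text file: an int column, a string column, and an array<string> column.
raw_row = "1\x01alice\x01item1\x02item2\x02item3"

columns = raw_row.split("\x01")          # \001 (Ctrl+A) separates columns
array_items = columns[2].split("\x02")   # \002 (Ctrl+B) separates array elements

print(columns[:2])    # ['1', 'alice']
print(array_items)    # ['item1', 'item2', 'item3']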
Now, this would explain why you would see ASCII 002 popping up if you were reading the file contents off of HDFS, but it looks like you are seeing it from the Hive command line interface, which should be aware of the collection terminators (and therefore use them to separate the elements of the array instead of printing them). My best guess is that you have specified the schema wrong and the column in the result table is a string where you meant to make it an array. This would explain why Hive went ahead and printed the ASCII 002's instead of using them as collection terminators.
I would like to be able to produce a file by running a command or batch which basically exports a table or view (SELECT * FROM tbl) in text form (default conversions to text for dates, numbers, etc. are fine), tab-delimited, with NULLs converted to empty fields (i.e. a NULL column would have nothing between the tab characters), with appropriate line termination (CRLF or Windows), and preferably with column headings.
This is the same export I can get in SQL Assistant 12.0 by choosing the export option, using a tab delimiter, setting my NULL value to '' and including column headings.
I have been unable to find the right combination of options - the closest I have gotten is by building a single column with CAST and '09'XC, but the rows still have a leading 2-byte length indicator in most settings I have tried. I would prefer not to have to build large strings for the various different tables.
To eliminate the 2-byte length indicator in the FastExport output:
.EXPORT OUTFILE &dwoutfile MODE RECORD FORMAT TEXT;
Your SELECT must also generate a fixed-length export field, e.g. CHAR(n), so you will inflate the size of the file and end up with a delimited but fixed-length export file.
The other option, if you are in a UNIX/Linux environment, is to post-process the file and strip the leading two bytes, or to write an AXSMOD in C to do it as the records are streamed to the file.
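If you do go the post-processing route on output that still carries the indicator, a rough Python sketch might look like the following; the 2-byte little-endian record length and the file names are assumptions to verify against your actual export format.

import struct

with open("export.dat", "rb") as src, open("export.txt", "wb") as dst:
    while True:
        header = src.read(2)
        if len(header) < 2:
            break                                  # end of file
        (rec_len,) = struct.unpack("<H", header)   # 2-byte record length indicator (assumed little-endian)
        payload = src.read(rec_len)
        dst.write(payload + b"\r\n")               # re-terminate as a normal CRLF text row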