Escaping delimiter in BQ - google-bigquery

I have a ton of files that are delimited by |; however, they also contain | as values inside the fields. The | in the data has been escaped with \, but I don't think BigQuery is picking that up. Is this something I can fix without having to open every single file and update it? There are 2,000-3,000 files, all zipped, so doing it one by one is not at all practical.

Read each row in as one whole line (load as CSV with a field delimiter that never appears in the data).
Parse it in BigQuery, either with regular expressions or a JavaScript UDF.
I describe a similar approach here:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
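As a minimal sketch of that idea (all names here are assumptions, not from the question): suppose the zipped files are loaded unparsed into a staging table my_dataset.raw_lines with a single STRING column line, by loading them as CSV with a field delimiter that never occurs in the data. Unescaped pipes can then be split out with an RE2 pattern:

-- A field is a run of escaped characters (backslash + anything) or characters
-- that are neither a pipe nor a backslash. Empty fields between two adjacent
-- pipes are dropped by this pattern, which is a limitation of the sketch.
SELECT REGEXP_EXTRACT_ALL(line, r'(?:\\.|[^|\\])+') AS fields
FROM my_dataset.raw_lines;

-- Columns can then be picked by position and the escapes removed, e.g.:
SELECT
  REPLACE(fields[SAFE_OFFSET(0)], '\\|', '|') AS col_1,
  REPLACE(fields[SAFE_OFFSET(1)], '\\|', '|') AS col_2
FROM (
  SELECT REGEXP_EXTRACT_ALL(line, r'(?:\\.|[^|\\])+') AS fields
  FROM my_dataset.raw_lines
);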

Related

Removing or preserving unmatched quotes in Dataweave

We're currently reading in a client's tab-delimited file row by row and using Dataweave to handle the transforming of the data to models for persisting to the database.
The issue we're having is that single double-quotes are causing problems with the mapping to the models.
Is there a way to handle unmatched double-quotes in Dataweave? We have a short term option of removing the offending quotes or removing all quotes entirely.
The other, preferred option is to preserve the data as is, single double-quotes and all, so the database data matches the original source data.
Can I achieve either of these results in Dataweave alone?
Many thanks.
An unmatched double quote will cause issues. If possible, escape it as \" so that it can be passed through to downstream systems as-is. The other options you mentioned will alter the source data; using an escape character prevents data alteration.

Reading SQL CLOB via a script

I have a lengthy CLOB object (about 1,000 characters) stored in a table. I need to read this value from a bash script. How can I do this?
I've tried using a normal SELECT query, but the output comes back as multiple lines. I cannot simply merge them, because that does not reproduce the exact text in the database in special cases (e.g. if there is a space at the end of a single line).
e.g.
abcd
efg
hijk
If I merged the lines with sed ':a;N;$!ba;s/\n//g;', this becomes abcdefghijk when the actual text is abcdefg hijk.
What is the best approach for doing what I'm trying to do here?
Since I couldn't find a way to make the above method work, I managed to do what I wanted using a different approach.
Since the space character was the problem, and since I am the one inserting those CLOBs, instead of inserting the text directly I first base64-encoded the text and inserted the encoded text into the table.
I could use the same SELECT query after this; I just had to base64-decode the SELECT output to get the original text back.
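For illustration, a minimal SQL sketch of that workaround, using a hypothetical table my_clobs with an ID column and a column holding the base64 text (the names and the Oracle-ish flavour are assumptions, not from the original post):

-- Store base64 text instead of the raw text.
-- 'YWJjZGVmZyBoaWpr' is base64 for 'abcdefg hijk', encoded outside the database.
INSERT INTO my_clobs (id, body_b64) VALUES (1, 'YWJjZGVmZyBoaWpr');

-- The bash side runs the same plain SELECT as before.
SELECT body_b64 FROM my_clobs WHERE id = 1;

Because base64 output never contains spaces, any line wrapping added by the client can be stripped safely before piping the result through a base64 decoder.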

Importing File With Field Terminators In Data

I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field, the process should be trivial:
Consume the row from either side, counting off the known-good delimiters and replacing them with a new, unique delimiter; what remains in the middle is the column containing the extra old delimiters, which you can leave as-is.
If you have two bad fields straddling good fields, you need more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found the terminating tag and then process the remaining delimiters as needed.
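As a hedged illustration of the single-bad-field case, here is a SQL sketch. It assumes (hypothetically) that each raw line has been bulk-loaded into a one-column staging table dbo.staging_raw(line) and that the layout is id,description,price, where only description may contain commas:

-- One good field is taken from the left and one from the right;
-- the remainder in the middle is the comma-laden column, left as-is.
SELECT
    LEFT(line, CHARINDEX(',', line) - 1)                 AS id,
    SUBSTRING(line,
              CHARINDEX(',', line) + 1,
              LEN(line) - CHARINDEX(',', line)
                        - CHARINDEX(',', REVERSE(line))) AS description,
    RIGHT(line, CHARINDEX(',', REVERSE(line)) - 1)       AS price
FROM dbo.staging_raw;

For example, the line 42,Widget, blue,9.99 comes out as id = 42, description = Widget, blue, price = 9.99.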

Writing on HDFS messed the data

I was trying to save the output of a Hive query on HDFS but the data got changed. Any idea?
See below the original data and the messed-up version (screenshots):
[Correct]: i.stack.imgur.com/DLNTT.png
[Messed up]: i.stack.imgur.com/7WIO3.png
Any feedback would be appreciated.
Thanks in advance.
It looks like you are importing an array into Hive, which is one of the available complex types. Internally, Hive separates the elements in an array with the ASCII character 002. If you consult an ASCII table, you can see that this is the non-printable character "start of text". It looks like your terminal does actually print the non-printable character, and by comparing the two images you can see that 002 does indeed separate every item of your array.
Similarly, Hive will separate every column in a row with ASCII 001, and it will separate map keys/values and structure fields/values with ASCII 003. These values were chosen because they are unlikely to show up in your data. If you want to change this, you can manually specify delimiters using ROW FORMAT in your CREATE TABLE statement. Be careful though: if you switch the collection items terminator to something like a comma, then any commas in your input will look like collection terminators to Hive.
Unless you need to store the data in human-readable form and are sure there is a printable character that will not collide with your terminators, I would leave them as is. If you need to read the HDFS files, you can always run hadoop fs -cat /exampleWarehouseDir/exampleTable/* | tr '\002' '\t' to display array items separated with tabs. If you write a MapReduce or Pig job against the Hive tables, just be aware of what your delimiters are. Learning how to write and read Hive tables from MapReduce was how I learned about these terminators in the first place. And if you are doing all of your processing in Hive, you shouldn't ever have to worry about what the terminators are (unless they show up in your input data).
Now, this would explain why you would see ASCII 002 popping up if you were reading the file contents straight off of HDFS, but it looks like you are seeing it from the Hive command-line interface, which should be aware of the collection terminators (and therefore use them to separate the elements of the array instead of printing them). My best guess is that you have specified the schema wrong and the column in question is a string where you meant to make it an array. That would explain why Hive printed the ASCII 002's instead of using them as collection terminators.
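For reference, here is a hedged sketch of a CREATE TABLE that declares the column as an ARRAY and writes the delimiters out explicitly (the table and column names are made up; the terminators shown are simply Hive's defaults):

-- ASCII 001 between columns, 002 between array items, 003 between map keys/values.
CREATE TABLE example_table (
  id   INT,
  tags ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE;

With tags declared as ARRAY<STRING>, the Hive CLI prints the elements itself instead of showing raw 002 characters.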

How do you add new line breaks to a csv file using sql?

I am currently outputting the results of an sql query to a CSV file with a header using the copy command in Postgresql. What I want to do is add a line break between the header and the start of the CSV data so that it is more readable.
Is there an easy way to do this by just altering my query? Thanks for your help!
No, there isn't. Do it in a script that post-processes the output. PostgreSQL's COPY isn't supposed to produce pretty-printed readable output, it's there to produce reliable, consistent output.
Actually, it is possible, although it is a bit of a hack:
Rename the final column of the output using an AS clause in the query, so that the column name ends with a newline character. You have to be sure to double-quote the column name in order to preserve the non-standard character, and you can't use that column in the FORCE QUOTE clause of COPY.
I was able to get it to work as you requested with this query:
copy (select typname, typlen, typinput, typoutput as "typoutput^L" from pg_type limit 10) to stdout with csv header;
The first three lines of output on my system look like this:
typname,typlen,typinput,typoutput
bool,1,boolin,boolout
Please note that the "^L" in this output is me actually entering the control character with the key combination Ctrl-V Ctrl-L in a Linux console, in the PostgreSQL 9.2 psql client. Typing the literal two-character string "^L" doesn't work. Also, the following doesn't work either, although it gets you closer (note the line break):
copy (select typname, typlen, typinput, typoutput as "typoutput
" from pg_type limit 10) to stdout with csv header;
This ends up with the first two lines looking like this:
typname,typlen,typinput,"typoutput
"
Which isn't quite correct.
You'll just need to tinker in your specific environment to work out how to get that newline character added, but hopefully there is enough information here to get you started.