Data between quotes and field separator - google-bigquery

In the example given below, the last line is not uploaded. I get an error:
Data between close double quote (") and field separator:
This looks like a bug, since all the data between the pipe symbols should be treated as a single field.
Schema: one:string,two:string,three:string,four:string
Upload file:
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | "start quote" and | a word after quotes
The first and second lines above are processed, but not the third.
Update:
Can someone please explain why the following works, except for the third line?
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | "start quote" and | a word after quotes
fourth line | enclosed | {"GPRS","MCC_DETECTED":false,"MNC_DETECTED":false} | how does this work?
fifth line | with | {"start quote"} and | a word after quotes
There may be some fancy explanation for this, but from the end user's perspective it is absurd.

From the CSV RFC4180 page: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
You probably want to do this:
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | " ""start quote"" and " | a word after quotes
More about our CSV input format here.
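If you generate the upload file from code, the easiest way to get this escaping right is to let a CSV writer do it for you. A minimal Python sketch, assuming a pipe-delimited file like the sample above (the file name and row values are only illustrative):

import csv

rows = [
    ["This ", " is ", " test only ", " to check quotes"],
    ["third line ", " with ", ' "start quote" and ', " a word after quotes"],
]

# QUOTE_MINIMAL quotes only the fields that need it, and doublequote=True
# doubles any embedded double quote, which is exactly the RFC 4180 rule.
with open("upload.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|", quoting=csv.QUOTE_MINIMAL, doublequote=True)
    writer.writerows(rows)

The quoted field in the second row comes out as " ""start quote"" and ", matching the escaped example above.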

Using --quote worked perfectly.
bq load --source_format CSV --quote "" --field_delimiter \t --max_bad_records 10 -E UTF-8 <destination_table> <source_files>

API V2
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.quote
bq command
--quote: Quote character to use to enclose records. Default is ". To indicate no quote character at all, use an empty string.
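The same option exists outside the bq CLI. For example, if you load through the Python client library, quote_character on the load job configuration plays the role of --quote. A sketch (the bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# quote_character="" is the API equivalent of `--quote ""`: treat the input
# as having no quote character at all.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",
    quote_character="",
    max_bad_records=10,
    encoding="UTF-8",
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/source-files/*",  # placeholder source files
    "my_dataset.my_table",            # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish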

Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL.
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
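Step 2 relies on BigQuery federated queries against the Cloud SQL instance. A hedged sketch of what that looks like from the Python client (the project, region, connection, and table names are placeholders, and the connection must be set up first as the how-to describes):

from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY runs the inner statement on the Cloud SQL (MySQL) instance
# behind the named connection and returns the result to BigQuery.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT * FROM mydb.mytable;'
)
"""
for row in client.query(sql).result():
    print(dict(row))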

You can also use the other flags while uploading the data. I used the bq tool with the following flags:
bq load -F , --source_format CSV --skip_leading_rows 1 --max_bad_records 1 --format csv -E UTF-8 yourdataset gs://datalocation.

Try loading with the bq shell every time.
I had to load 1100 columns. While trying with the console with all the error options, it threw a lot of errors. Ignoring the errors in the console means losing records.
Hence I tried with the shell and succeeded in loading all the records.
Try the following:
bq load --source_format CSV --quote "" --field_delimiter \t --allow_jagged_rows --ignore_unknown_values --allow_quoted_newlines --max_bad_records 10 -E UTF-8 {dataset_name}.{table_name} gs://{google_cloud_storage_location}/* {col_1}:{data_type1},{col_2}:{data_type2}, ....
References:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bigquery_load_table_gcs_csv-cli
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#csv-options

Related

Getting Error as "Regex: syntax error in subpattern name (missing terminator)." in SPLUNK

I have been extracting fields in Splunk and this looks to be working fine for all headers, but for the header l-m-n I am getting the error "syntax error in subpattern name (missing terminator)."
I have done similar extractions for other headers and they all work, but this is the only header with a "hyphen" sign, and it is giving this error. I have tried multiple times, but nothing helps.
Headers:
Content-Type: application/json
Accept: application/json,application/problem json
l-m-n: txxxmnoltr
Accept-Encoinding:gzip
The regex I am trying in Splunk is rex field=u "l-m-n: (?<l-m-n>.*)". Could you please guide me here?
rex cannot extract into a field name with hyphens. However, you can solve this with rename:
| rex field=u "l-m-n: (?<lmn>.*)" | rename lmn AS "l-m-n"
In general, I would avoid the use of hyphens in a field name, as they can be mistaken for a minus. If you want to use the field l-m-n, you will need to quote it everywhere, like 'l-m-n'. I would strongly suggest you stick with the field name lmn.
Try running the following to see what I mean:
| makeresults | eval l-m-n=10 | eval l=1 | eval m=1 | eval n=1 | eval result_noquote=l-m-n | eval result_quoted='l-m-n'

Copy csv & json data from S3 to Redshift

I have data like below format from s3 bucket.
"2010-9","aws cloud","{"id":1,"name":"test"}"
"2010-9","aws cloud1","{"id":2,"name":"test2"}"
I want to copy the data into the database like below.
Table
year | env | desc
2010-9 | aws cloud |{"id":1,"name":"test"}
2010-9 | aws cloud1 |{"id":2,"name":"test2"}
I have written this command, but it is not working. Could you please help me?
copy table
from 's3://bucketname/manifest' credentials 'aws_access_key_id=xx;aws_secret_access_key=xxx'
delimiter ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
manifest;
You are almost there - you just need to escape the double quotes inside the 3rd field (desc). Per the CSV specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example: "aaa","b""bb","ccc"
This is per rfc-4180 - https://www.ietf.org/rfc/rfc4180.txt
I've also loaded JSON into a text field in Redshift and then used the JSON functions to parse the field. Works great.
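If you can pre-process the files before the COPY, the simplest fix is to let a CSV writer double the embedded quotes for you; the rewritten file is then valid RFC 4180 CSV. A rough Python sketch, assuming every row looks like the two sample rows (fully quoted fields joined by ","; the file names are placeholders):

import csv

def split_broken_row(line: str) -> list[str]:
    # The broken rows are fully quoted and joined by "," so splitting on that
    # exact sequence recovers the three raw field values, even though the JSON
    # in the last field contains unescaped double quotes.
    return line.rstrip("\n").strip('"').split('","')

with open("input.csv", encoding="utf-8") as src, \
        open("fixed.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, doublequote=True)
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        writer.writerow(split_broken_row(line))

After this, the desc field is written as "{""id"":1,""name"":""test""}", which a CSV-aware load can parse into the table above.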

Escape all special characters in Presto

We are uploading data (CSV) to S3 and then to Presto, but due to problems with the data inside the files we are having trouble loading from S3 into Presto.
The metadata is correctly formed, but the loads are failing because of problems in column B.
A;B;DATE
EPA;Ørsted Energy Sales & Distribution;2019-01-11 12:10:13
EPA;De MARIA GærfaPepeer A/S; 2019-02-12 12:10:13
EPA;Scan Convert A/S; 2019-02-11 11:10:12
EPA;***Mega; 2019-02-11 11:10:13
EPA;sAYSlö-SähAAdkö Oy; 2019-02-11 11:11:11
We are adding replacement formulas in the previous step (Informatica Cloud) to add \ so the values are read correctly.
Is there a list of characters we should look for and prefix with \ ?
The problem is that, according to the standard, if your B column can contain the separator, then you should quote that column. If there are quotation marks inside it (which can very well happen), then you should add an escape character before them.
A;B;DATE
EPA;"company";01/01/2000
EPA;"Super \"company\""; 01/01/2000
EPA,"\"dadad\" \;"; 01/01/2000
I had a similar problem; it's quite easy to solve with regular expressions.
In your scenario you can search for:
(^EPA;) and replace it with: $1" ==> s/(^EPA;)/$1"/g
(;[0-9]{1,2}/[0-9]{1,2}) and replace it with: "$1 ==> s/\s*(;[0-9]{1,2}/[0-9]{1,2})/"$1/g
Final step would be global backslashes enrichment:
s/([^;"]|;")(")([^;\n])/$1\\$2$3/g
Please take a look at this:
https://fullouterjoin.wordpress.com/2019/04/05/dealing-with-broken-csv-strings-with-missing-escape-characters-powercenter/
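If the clean-up can happen in a scripted step rather than through Informatica replacement formulas, the same idea (quote column B and backslash-escape any quotes already inside it) is a few lines of Python. A sketch assuming exactly three semicolon-separated columns where only B may contain awkward characters (the file names are placeholders):

import csv

with open("input.csv", encoding="utf-8") as src, \
        open("fixed.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, delimiter=";", quotechar='"',
                        doublequote=False, escapechar="\\",
                        quoting=csv.QUOTE_ALL)
    for line in src:
        a, rest = line.rstrip("\n").split(";", 1)  # column A never contains ";"
        b, date = rest.rsplit(";", 1)              # the date column never contains ";"
        writer.writerow([a, b, date])

This produces values like "Super \"company\"", the same escaped form shown in the example above.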

Remove all occurrences of a list of words vim

Given a document whose first line is foo,bar,baz,qux,quux, is there a way to store these words in a variable as a list ['foo','bar','baz','qux','quux'] and remove all their occurrences in the document with vim?
Something like a command :removeall in visual mode, highlighting the list:
foo,bar,baz,qux,quux
hello foo how are you
doing foo bar baz qux
good quux
will change the text to:
hello how are you
doing good
A safer way is to write a function that checks each part of your "list" for anything that needs to be escaped, and then does the substitution (removal). A quick & dirty way to do it with your input is this mapping:
nnoremap <leader>R :s/,/\|/g<cr>dd:%s/\v<c-r>"<c-h>//g<cr>
Then, in Normal mode, go to the line that contains the parts to delete (it must be in CSV format) and press <leader>R; you will get the expected output.
The substitution will fail if that line has regex special characters, like /, *, . or \.
Something like this one liner should work:
:for f in split(getline("."), ",") | execute "%s/" . f | endfor | 0d
Note that you'll end up with a lot of trailing spaces.
Edit: this version of the command above takes care of those pesky trailing spaces (but not the one on line 2 of your sample text):
:for f in split(getline("."), ",") | execute "%s/ *" . f | endfor | 0d
Result:
hello how are you
doing
good

escaping characters for substitution into a PDF

Can anyone tell me the set of control characters for a PDF file, and how to escape them? I have a (non-deflated (inflated?)) PDF document whose text I would like to edit, but I'm afraid of accidentally creating some control sequence with parentheses and the like.
Thanks.
Okay, I think I found it. On page 15 of the PDF 1.7 spec (PDF link), it appears that the only characters I need to worry about are the parentheses and the backslash.
Sequence | Meaning
---------------------------------------------
\n | LINE FEED (0Ah) (LF)
\r | CARRIAGE RETURN (0Dh) (CR)
\t | HORIZONTAL TAB (09h) (HT)
\b | BACKSPACE (08h) (BS)
\f | FORM FEED (0Ch) (FF)
\( | LEFT PARENTHESIS (28h)
\) | RIGHT PARENTHESIS (29h)
\\ | REVERSE SOLIDUS (5Ch) (Backslash)
\ddd | Character code ddd (octal)
Hopefully this was helpful to someone.
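In code, this means an escaping helper only has to deal with the backslash and the two parentheses (plus any control characters you choose to write with the named escapes). A small illustrative Python function, not tied to any particular PDF library:

def escape_pdf_string(text: str) -> str:
    # Escape the backslash first, otherwise the backslashes added for the
    # parentheses would themselves get doubled.
    out = text.replace("\\", "\\\\")
    out = out.replace("(", "\\(").replace(")", "\\)")
    # Optionally rewrite common control characters using their named escapes.
    for ch, esc in (("\n", "\\n"), ("\r", "\\r"), ("\t", "\\t"),
                    ("\b", "\\b"), ("\f", "\\f")):
        out = out.replace(ch, esc)
    return out

print(escape_pdf_string("Hello (world) \\ test"))  # -> Hello \(world\) \\ test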
You likely already know this, but PDF files have an index at the end that contains byte offsets to everything in the document. If you edit the doc by hand, you must ensure that the new text you write has exactly the same number of characters as the original.
If you want to extract PDF page content and edit that, it's pretty straightforward. My CAM::PDF library lets you do it programmatically or via the command line:
use CAM::PDF;
my $pdf = CAM::PDF->new($filename);
my $page_content = $pdf->getPageContent($pagenum);
# ...
$pdf->setPageContent($pagenum, $page_content);
$pdf->cleanoutput($out_filename);
or
getpdfpage.pl in.pdf 1 > page1.txt
setpdfpage.pl in.pdf page1.txt 1 out.pdf