How to escape special characters in CNTK Text Reader?

For a reader such as this one, reading a file in CTF format,
from cntk.io import StreamDef, StreamDefs
streams = StreamDefs(
    query = StreamDef(field='S0', shape=vocab_size, is_sparse=True),
    intent = StreamDef(field='S1', shape=num_intents, is_sparse=True),
    slot_labels = StreamDef(field='S2', shape=num_labels, is_sparse=True))
How do I escape a special character such as "|" if it is a token?
I am getting a warning for the middle line below, where the token itself is "|":
48155 |S0 196:1 |# - |S2 0:1 |# None
48155 |S0 18217:1 |# | |S2 0:1 |# None
48155 |S0 3152:1 |# Cindy |S2 0:1 |# None
I can remove these when creating the CTF file, but I was wondering how we can handle this.
Thanks

Inside a comment, the pipe can be escaped by appending the hash symbol to it: |# this is a CTF comment with an escaped pipe: '|#'

You can also map pipes to another word or character that does not appear in your corpus, replacing them before the CTF file is written.
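As a minimal sketch of that preprocessing step (the '<PIPE>' placeholder and the sample token_rows input are illustrative assumptions, not part of CNTK):

def sanitize_token(token):
    # Replace the CTF field separator '|' so a literal pipe cannot break the line.
    return token.replace('|', '<PIPE>')  # '<PIPE>' is an arbitrary stand-in

# Sample rows of (seq_id, token_id, token, label_id), taken from the question.
token_rows = [(48155, 196, '-', 0), (48155, 18217, '|', 0), (48155, 3152, 'Cindy', 0)]

with open('corpus.ctf', 'w', encoding='utf-8') as out:
    for seq_id, token_id, token, label_id in token_rows:
        out.write('{} |S0 {}:1 |# {} |S2 {}:1 |# None\n'.format(
            seq_id, token_id, sanitize_token(token), label_id))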

Related

Awk - How to escape the | in sub?

I'd like to substitute a string which contains a |.
My STDIN :
13|Test|123|6232
14|Move|126|6692
15|Test|123|6152
I'd like to obtain :
13|Essai|666|6232
14|Move|126|6692
15|Essai|666|6152
I tried like this
{sub("|Test|123","|Essai|666") ;} {print;}
But I think the | is what bothers me... I really need to replace the complete string, including the |.
What should I do to get this result?
Many thanks for your precious help
You can use
awk '{sub(/\|Test\|123\|/,"|Essai|666|")}1' file
Note:
/\|Test\|123\|/ is a regex that matches |Test|123| substring
sub(/\|Test\|123\|/,"|Essai|666|") - replaces the first occurrence of the regex pattern in the whole record (when the third argument is omitted, $0 is assumed)
1 triggers the default print action, no need to explicitly call print here.
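The same escaping issue appears in any regex engine; here is a minimal Python sketch of the equivalent substitution (illustrative, not from the original answer):

import re

line = '13|Test|123|6232'
# '|' means alternation in a regex, so it must be escaped (or wrapped with re.escape()).
print(re.sub(r'\|Test\|123\|', '|Essai|666|', line))  # -> 13|Essai|666|6232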

Need solution for line-break issue in string

I have the string below, in which an Enter character appears at random; fields are separated by ~$~ and records end with ##&.
Please help me merge the broken line into one.
In the string below, the Enter character occurs in the address field (4/79A).
------- String ----------
23510053~$~ABC~$~4313708~$~19072017~$~XYZ~$~CHINNUSAMY~$~~$~R~$~~$~~$~~$~42~$~~$~~$~~$~~$~28022017~$~
4/79A PQR Marg, Mumbai 4000001~$~TN~$~637301~$~Owns~$~RAT~$~31102015~$~12345~$~##&
Thanks in advance.
Rupesh
Seems to be a (more or less) duplicate of https://stackoverflow.com/a/802439/3595749
Note: you should ask your client to remove the CRLF characters (rather than applying the code below).
Nevertheless, try this:
tr -d '\n' < inputfile | sed 's/##&/##\&\n/g' > outputfile
Explanation:
tr deletes the newline characters,
sed adds one back (only where ##& is encountered). s/##&/##\&\n/g substitutes "##&" with "##&\n" (a newline is added, and "&" must be escaped in the replacement). This applies globally (the "g" at the end).
Note: depending on the source (Unix or Windows), "\n" may need to be replaced by "\r\n" in some cases.
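A minimal Python equivalent of the tr/sed pipeline above (the file names are illustrative):

with open('inputfile', encoding='utf-8') as f:
    joined = f.read().replace('\n', '')        # drop the stray line breaks
with open('outputfile', 'w', encoding='utf-8') as f:
    f.write(joined.replace('##&', '##&\n'))    # restore one newline per record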

Remove all occurrences of a list of words vim

Having a document whose first line is foo,bar,baz,qux,quux, is there a way to store these words in a variable as a list ['foo','bar','baz','qux','quux'] and remove all their occurrences in the document with vim?
Like a command :removeall in visual mode highlighting the list:
foo,bar,baz,qux,quux
hello foo how are you
doing foo bar baz qux
good quux
will change the text to:
hello how are you
doing good
A safer way is to write a function that checks each part of your "list" for anything that needs to be escaped, then does the substitution (removal). A quick-and-dirty way to do it with your input is this mapping:
nnoremap <leader>R :s/,/\|/g<cr>dd:%s/\v<c-r>"<c-h>//g<cr>
then, in Normal mode, go to the line that contains the parts to delete (it must be in CSV format) and press <leader>R to get the expected output.
The substitution would fail if that line contained regex special characters such as /, *, . or \.
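Outside vim, the same "escape each word first" idea looks like this in Python (a sketch; the word list comes from the example above):

import re

words = ['foo', 'bar', 'baz', 'qux', 'quux']
# re.escape() neutralises regex metacharacters such as . * / \ in each word.
pattern = re.compile(r' ?\b(?:' + '|'.join(map(re.escape, words)) + r')\b')
text = 'hello foo how are you\ndoing foo bar baz qux\ngood quux'
print(pattern.sub('', text))  # -> hello how are you / doing / good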
Something like this one liner should work:
:for f in split(getline("."), ",") | execute "%s/" . f | endfor | 0d
Note that you'll end up with a lot of trailing spaces.
edit
This version of the command above takes care of those pesky trailing spaces (but not the one on line 2 of your sample text):
:for f in split(getline("."), ",") | execute "%s/ *" . f | endfor | 0d
Result:
hello how are you
doing
good

Data between quotes and field separator

In the example given below, the last line is not uploaded. I get an error:
Data between close double quote (") and field separator:
This looks like a bug, since all the data between pipe symbols should be treated as a single field.
Schema: one:string,two:string,three:string,four:string
Upload file:
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | "start quote" and | a word after quotes
The first and second lines above are processed, but not the third.
Update:
Can someone please explain why the following works, except for the third line?
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | "start quote" and | a word after quotes
forth line | enclosed | {"GPRS","MCC_DETECTED":false,"MNC_DETECTED":false} | how does this work?
fifth line | with | {"start quote"} and | a word after quotes
There may be some fancy explanation for this, but from the end-user perspective it is absurd.
From the CSV RFC4180 page: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
You probably want to do this:
This | is | test only | to check quotes
second | line | "with quotes" | no text
third line | with | " ""start quote"" and " | a word after quotes
More about our CSV input format here.
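To see the RFC 4180 doubling rule in action, here is a small Python csv sketch (illustrative; the spaces around the delimiters are omitted for clarity):

import csv, io

line = 'third line|with|" ""start quote"" and "|a word after quotes'
# The doubled quotes inside the quoted field collapse to single quotes on parsing.
print(next(csv.reader(io.StringIO(line), delimiter='|')))
# -> ['third line', 'with', ' "start quote" and ', 'a word after quotes']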
Using --quote worked perfectly.
bq load --source_format CSV --quote "" --field_delimiter \t \
  --max_bad_records 10 -E UTF-8 <destination table> <source files>
API V2
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.quote
bq command
--quote: Quote character to use to enclose records. Default is ". To indicate no quote character at all, use an empty string.
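If you load through the Python client library instead of the bq CLI, the same options map onto LoadJobConfig (a sketch; the table name and gs:// URI are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    quote_character='',     # no quote character, as with --quote ""
    field_delimiter='\t',
    max_bad_records=10,
    encoding='UTF-8',
)
job = client.load_table_from_uri(
    'gs://my-bucket/data.csv', 'my_dataset.my_table', job_config=job_config)
job.result()  # wait for the load to finish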
Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL.
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
You can also use the other flags while uploading the data. I used the bq tool with the following flags:
bq load -F , --source_format CSV --skip_leading_rows 1 --max_bad_records 1 --format csv -E UTF-8 yourdataset gs://datalocation
Try loading with the bq shell every time.
I had to load 1100 columns. While trying the console with all the error options, it threw a great many errors, and ignoring the errors in the console means losing records.
Hence I tried with the shell and succeeded in loading all the records.
Try the following:
bq load --source_format CSV --quote "" --field_delimiter \t \
  --allow_jagged_rows --ignore_unknown_values --allow_quoted_newlines \
  --max_bad_records 10 -E UTF-8 \
  {dataset_name}.{table_name} gs://{google_cloud_storage_location}/* \
  {col_1}:{data_type1},{col_2}:{data_type2}, ....
References:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bigquery_load_table_gcs_csv-cli
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#csv-options

What character encoding should I use for an HTTP header?

I'm using a "fun" HTML special character (✰) (see http://html5boilerplate.com/ for more info) in a Server HTTP header and am wondering whether it is "allowed" per spec.
Using the Network tab in the dev tools in Chrome on Windows XP Pro SP3, I see the ✰ just fine.
In IE8 the ✰ is not rendered correctly.
The w3.org HTML validator does not render it correctly either (it displays "â°" instead).
Now, I'm not too keen on character encodings... and frankly I don't really care too much about them; I just blindly use UTF-8 because I'm told to. :-)
Is the disparity caused by bugs in the different parsers/browsers/engines/(whatever they are called)?
Is there a spec for this or maybe a list of allowed characters for an HTTP-header "value"?
In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but are not supposed to be displayable.
HTTPbis gave up and specified that in the headers there is no useful encoding besides ASCII:
Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
Previously, RFC 2616 from 1999 defined this:
Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
and RFC 2047 is the MIME encoding, so it'd be:
=?UTF-8?Q?=E2=9C=B0?=
but I don't think that many (if any) clients support it.
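As an illustration, Python's standard library can produce an RFC 2047 encoded-word for that character (it happens to pick the base64 form; the Q-form above is equivalent):

from email.header import Header

# Encode the star as an RFC 2047 encoded-word.
print(Header('✰', charset='utf-8').encode())  # -> =?utf-8?b?4pyw?=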
Please read the comments first; this answer likely draws wrong conclusions from the right sources and needs editing.
You can use any printable ASCII characters, but no special characters like ✰ (which is not ASCII).
Tip: you can encode anything in JSON.
Edit: though it may not be obvious at first, the character encoding declared in a header only applies to the response body, not to the header itself (as that would cause a chicken-and-egg problem).
I'd like to sum up all the relevant definitions as per the spec linked by Penchant.
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
So, we are after field-value.
LWS = [CRLF] 1*( SP | HT )
CRLF = CR LF
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.
Let's simplify it to this:
field-value = <any field-content or Space or Tab>
Now we are after field-content.
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
OCTET = <any 8-bit sequence of data>
TEXT = <any OCTET except CTLs,
but including LWS>
CTL = <any US-ASCII control character
(octets 0 - 31) and DEL (127)>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
TEXT is the most general and includes all the rest, so forget about the rest.
Here is the US-ASCII charset (= ASCII)
As you can see, all printable ASCII chars are allowed.
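As a closing illustration, here is a minimal Python predicate for the "printable ASCII plus tab" rule that these definitions boil down to (a sketch of the conclusion, not of the full TEXT/obs-text grammar):

def usable_in_header_value(ch):
    # Printable US-ASCII (32-126) plus horizontal tab; CTLs (0-31, 127)
    # and non-ASCII characters are excluded.
    o = ord(ch)
    return o == 9 or 32 <= o <= 126

print(all(map(usable_in_header_value, 'Apache/2.4')))  # True
print(all(map(usable_in_header_value, '✰')))           # False: not ASCII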