'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape - dataframe

I am trying to push data from Databricks into SQL. However, I get the following error:
I noticed in the file that I am processing that one of the columns has the following as a value:
I have tried to filter out the records by using the following:
df = df.filter(df.COLUMN != "\N")
However, when the above runs, I get the error message idenitfied above. Is there some way to filter out values that have an escape character in them?
I would really appreciate any help. Thank you.

As the error suggested, you need to escape the backslash \
df.filter(df.value != "\\N")

Related

How to remove new line characters from data rows in Presto/AWS Athena?

I'm querying some tables on Athena (Presto SAS) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains new line characters that doesn't appear on AWS interface, only in the CSV and need to get rid of them. Tried using the function replace(string, search, replace) → varchar to skip the newline char replacing \n for \\n without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contains \n anywhere, instead, the actual newline character, which is represented by char(10). I was able to achieve the expected behaviour using the replace function passing it as parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000

BigQuery - Illegal Escape Sequence

I'm having an issue matching regular expression in BigQuery. I have the following line of code that tries to identify user agents:
when regexp_contains((cs_user_agent), '^AppleCoreMedia\/1\.(.*)iPod') then "iOS App - iPod"
However, BigQuery doesn't seem to like escape sequences for some reason and I get this error that I can't figure out:
Syntax error: Illegal escape sequence: \/ at [4:63]
This code works fine in a regex validator I use, but BigQuery is unhappy with it and I can't figure out why. Thanks in advance for the help
Use regexp_contains((cs_user_agent), r'^AppleCoreMedia\/1\.(.*)iPod')

Loading data from csv file using pandas

File "<ipython-input-10-1de27a02fcfe>", line 1
df = pd.read_csv('C:\Users\niharika\Downloads\questions-data.csv')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Even though I place r in front of C, changing \ to / and also \\. Nothing worked to me.could someone please help me.
just try
df = pd.read_csv('C:\\Users\\niharika\\Downloads\\questions-data.csv')
Try adding an r in front of the string like this
df = pd.read_csv(r'C:\Users\niharika\Downloads\questions-data.csv')
This case seems similar to yours
Error reading csv file unicodeescape

Unexpected token \n in JSON when parsing with Elm.Json

Actually I'm working with Elm but I have few issues with the json parsing in this language, the error that give me the compiler is:
Err "Given an invalid JSON: Unexpected token \n in JSON at position 388"
What I need to do is this:
example
At the char_meta I want its something like this:
[("Biographical Information", [("Japanese Name", "緑谷出久"), ...]), ...]
Here the code:
Ellie link
PD: The only constant keys are character_name, lang, summary and char_meta, they keys inside of char_meta are dynamic (thats why I use keyvaluepair) and the length its always different of this array (sometimes its empty)
Thanks, hope can help me.
EDIT:
The Ellie link now redirect to the fixed code
The issue is that elm (or JS once transcoded) interprets the \n and \" sequences when parsing the string literal, and they are replaced with an actual new line and double quotes respectively, which results in invalid JSON.
If you want to have the JSON inline in the code, you need to escape the 5 \s by doubling them (\\n and \\").
This only applies for literals, you won't have the issue if you load JSON from the network for instance.

How to escape delimiter found in value - pig script?

In pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with value "foo:bar" I want that string interpreted as a single column without having the loader pick up the comma in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig takes the Input as the string its not so intelligent to identify how what is data or what is not.
The pig Storage works on the Strong Tokenizer. So if u want to do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
It doesn't seems to be solving your problem. But if we can write our own PigStorage() Method possibly we could come across some solution.
I will try posting the Code to resolve this.
you can use STRSPLIT(string, regex, limit); for the column split based on the delimiter.