dbt Error : Encountered an error: 'utf-8' codec can't decode byte 0xa0 in position 441: invalid start byte - dbt

I have upgraded my dbt version to 1.0.0 yesterday night and ran few connection test. It went well . Now when i am running the my first dbt example model , i am getting below error , even though i have not changed any code in this default example model.
Same error i am getting while running dbt seed command also for a csv dataset . The csv is utf-8 encoded and no special character in it .
I am using python 3.9
Could anyone suggest what is the issue ?
Below is my first dbt model sql

After lots of back and forth, I figured out the issue. This is more like fundamental concept issue.
Every time we execute dbt run, dbt will scan through the entire project directory ( including seeds directory even though it is not materializing the seed ) [Attached screenshot below].
If it finds any csv it also parsed it .
In case of above error, I had a csv file which looks follows :
If we see the highlighted line it contains some symbol character which dbt (i.e python) was not able to parse it causing above error.
This symbol was not visible earlier in excel or notepad++.
It could be the issue with Snowflake python connector that #PeterH has pointed out .
As temporary solution , for now we are manually removing these character from Data file.

I’d leave this as a comment but I don’t have the rep yet…
This appears to be related to a recently-opened issue.
https://github.com/dbt-labs/dbt-snowflake/issues/66
Apparently it’s something to do with the snowflake python adapter.
Since you’re seeing the error from a different context, it might be helpful for you to post in that issue that you’re seeing this outside of query preview.

Related

AzureSynapse Lookup UserErrorFileNotFound with Wildcard path

I am facing an odd issue where my lookup is returning a filenotfound error when I use a wildcard path. If I specify and exact file path, the lookup runs without error. However, if I replace the filename with a *, I get a filenotfound error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2, under the labournavigatorfile system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json.
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace the 'Data_643.json' with a '*', the pipeline crashes with a filenotfound error.
What am I doing wrong? Many Thanks for any support. This must be something very simple that I am missing.
Exact path works:
Wildcrad path throws error:
I have 3 files in my container as file1.json, file2.json, file3.json as shown below:
The following is how I configured my dataset to read using wildcard with configuration same as in the image provided in the question.
When I used this in lookup I got the same error:
To overcome this, go to your lookup activity. When you want to use wildcards to read a file/files, check the wildcard file path option. Then specify the folder structure and use wildcard where required. The following is an image for reference.
The following is the debug output when I run the pipeline (Each of my files had 10 rows):

"Error loading data: 42000" in Pentaho PDI MonetDB bulk Loader step

I want to insert data from a large CSV file to MonetDB. I can't use MonetDB "mclient" because this procedure must run inside a Pentaho Server application within a Docker container. MonetDB is inside a Docker container too.
Here's my very simple transformation:
When I test the transformation, I always get the following error message:
2021/03/20 22:37:37 - MonetDB bulk loader.0 - Error loading data: 42000!COPY INTO: record separator contains '\r\n' but in the input stream, '\r\n' is being normalized into '\n'
Does anyone have any idea what is happening?
Thank you!
This is related to line endings. Pentaho issues a COPY INTO statement,
COPY INTO <table> FROM <file>
USING DELIMITERS ',', '\r\n'
Here, \r\n means DOS/Windows line endings. Since the Oct2020 release, MonetDB always normalizes the line endings from DOS/Windows to Unix style \n when loading data. Before, it used to sometimes normalize and sometimes not. However, normalizing to \n means that looking for \r\n would yield one giant line containing the whole file, hence the error message.
I will submit a patch to MonetDB to automatically replace the USING '\r\n' with '\n'.
This will fix it in the longer term.
In the short term I have no good solution to offer. I have no experience with Pentaho, but looking at the source code, it seems Pentaho is uses the system property lines.separator, which is \r\n on Windows.
This means, if you have access to a Mac or Linux machine to run Pentaho on, that will work as line.separator is \n there. Otherwise, maybe you can ask the Pentaho people if the JVM can be started with something like java -Dline.separator="\n" as a workaround, see also this Stack Overflow question.
Otherwise, we'll have to use a patched version of either Pentaho, the MonetDB JDBC driver or MonetDB. I could send you a patched version of the JDBC driver that automagically replaces the '\r\n' with '\n' before sending the query to the server but you have to figure out yourself how to get Pentaho to use this JDBC driver instead of the default one.

Row larger than the maximum allowed size

I have successfully imported many gzipped JSON files on several occasions. For the two files BQ import choked. Both files reported the same error:
File: 0 / Offset:0 / Line:1 / Column:20971521, Row larger than the maximum allowed size
Now I've read about the row limit of 20MB and I understand that the number above is 20MB +1 but what really bugs me is that the meaning is totally off. My GZs have millions of JSONs (each on a new line). I have written a script to measure the longest line (longest JSON) in the failed GZ file and found it to be 103571 bytes. Why is the BQ import choking then?
I have inspected the longest JSON and it looks perfectly normal. How should I interpret the error? How can I fix it?
Why is BQ thinking the import is on line 1, column 20971521 when there are millions of lines in the file?
All your investigations are correct, but you must check your file as new lines are not identified, and BQ seas all the import as a large line.
That's why it reports column 20971521 for the problem.
You should try importing a sample from the file.
Some of the answers here gave me an idea so I went on a tried it. It appears as if for some strange reason BQ didn't like line endings so I wrote a quick script to rewrite the original input file to use line endings. Automagically the import worked!
This is utterly strange considering I already imported many GBs of data with pure line endings.
I am happy that it worked but I could never guess why. I hope this helps someone else.

doxygen latex make fails for input encoding error

I have a git repo project in eclipse which I have been documenting using doxygen (v1.8.4).
If I run the latex make ion a fresh clone of the project it runs fine and the PDF is made.
However, if I then run a doxy build, which completes OK, then attempt to run the latex make, it fails for
! Package inputenc Error: Keyboard character used is undefined
(inputenc) in inputencoding `utf8'.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
I have tried switching the encoding of the doxyfile by setting DOXYFILE_ENCODING to ISO-8859-1 with no change in the result... How can I fix this?? Thanks.
EDIT: I have used no non-UTF-8 chars as far as I know in my files, the file referenced before the error is very short and definitely doesn't have non-UTF-8 chars in it. I've even tried clearing my latex output dir and building from scratch with no luck...
EDIT: Irealised that the doxy build only appears to run correctly. It doesnt show any errors, but it should, for example run DOT and build about 10 graphs. The console output says Running dot, but it doesn't say generating graph (n/x) like it should when it actually makes the graphs...
Short answer: So by a slow process of elimination I found that this was caused by a single apostrophe in a file that had appeared to be already built and made without error!!
Long answer: Firstly I used used the project properties to flip the encoding from the default Cp1252 to UTF-8. Then I started removing files one-by-one until rebuilding and remaking after each removal, until the make ran successfully. I re-added all files, but deleted the content in the most recently removed file and tested the make - to confirm it was this file and only this file that caused the issue. the make ran fine. So I pasted the content back into the empty file, and started deleting smaller and smaller sections of the file, again rebuilding and remaking each time until I was left with a good make without the apostrophe and a bad one with it... I simply retyped the apostrophe (as this would then force it to be a UTF-8 char) and success!! Such an annoying bug!
Dude you made it a hard way. Why not use python to do the work for you:
f = open(fn,"rb")
data = f.read()
f.close()
for i in range(len(data)):
ch = data[i]
if(ch > 0x7F): # non ASCII character
print("char: %c, idx: %d, file: %s"%(ch,i,fn))
str2 = str(data[i-30:i+30])#.decode("utf-8")
print("txt: %s" % (str2))

Inconsistent Behavior In A Batch File's For Statement

I've done very little with batch files but I'm trying to track down a strange bug I've been encountering on a legacy system.
I have a number of .exe files in particular folder. This script is supposed to duplicate them with a different file name.
Code From Batch File
for %%i in (*.exe) do copy \\networkpath\folder\%%i \\networkpath\folder\%%i.backup.exe
(Note: The source and destination folders are THE SAME)
Example Of Desired Behavior:
File1.exe --> Becomes --> File1.exe.backup.exe
File2.exe --> Becomes --> File2.exe.backup.exe
Now first, let me say that this is not the approach I would take. I know there are other (potentially more straight forward) ways to do this. I also know that you might wonder WHY on earth we care about creating a FileX.exe.backup.exe. But this script has been running for years and I'm told the problem only started recently. I'm trying to pinpoint the problem, not rewrite the code (even if it would be trivial).
Example Buggy Output:
File1.exe.backup.exe
File1.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
File1.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
etc...
File2.exe.backup.exe
File2.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
File2.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe.backup.exe
Not knowing anything about batch files, I looked at this and figured that the condition of the for statement was being re-evaluated after each iteration - creating a (near) infinite loop of copying (I can see that, eventually, the copy will fail when the names get too long).
This would explain the behaviour I'm seeing. And when cleaned the directory in question so that it had only the original File1.exe file and ran the script it produced the bug code. The problem is that I CANNOT replicate the behaviour anywhere else!?!
When I create a folder locally with a few .exe files and run the script - I get the expected output. And yes, if I run it again, I get one instance of 'File1.exe.backup.exe.backup.exe' (and each time I run it again, it increases in length by one). But I cannot get it to enter the near-infinite loop case.
It's been driving me crazy.
The bug is occurring on a networked location - so I've tried to recreate it on one - but again, no success. Because it's a shared network location, I wondered if it could have something to do with other people accessing or modifying files in the folder and even introduced delays and wrote a tiny program to perform actions in the same folder - but without any success.
The documentation I can find on the 'for' statement doesn't really help, but all of the tests I've run seem to suggest that the in (*.exe) section is only evaluated once at the beginning of execution.
Does anyone have any suggestions for what might be going on here?
I agree with Andriy M's comment - it looks to be related to Windows 7 Batch Script 'For' Command Error/Bug
The following change should fix the problem:
for /f "eol=: delims=" %%i in ('dir /b *.exe') do copy \\networkpath\folder\%%i \\networkpath\folder\%%i.backup.exe
Any file that starts with a semicolon (highly unlikely, but it can happen) would be skipped with the default EOL of semicolon. To be safe you should set EOL to some character that could never start a file name (or any path). That is why I chose the colon - it cannot appear in a folder or file name, and can only appear after a drive letter. So it should always be safe.
Copy supports wildcard characters also in target path. You can use
copy \\networkpath\folder\*.exe \\networkpath\folder\*.backup.exe