How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters from a dataframe?

Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
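To see what that lambda does to a single cell, here is a minimal sketch in plain Python 3. The name escape_cell is made up for illustration. Note that unicode_escape rewrites every non-ASCII character as well as the control characters, so accented text ends up as escape sequences in the spreadsheet.

```python
def escape_cell(value):
    """Escape control and non-ASCII characters in strings; leave other values alone."""
    # escape_cell is a hypothetical helper name, not part of pandas or openpyxl.
    if isinstance(value, str):
        return value.encode("unicode_escape").decode("utf-8")
    return value

print(escape_cell("bad\x0bcell"))  # the vertical tab is replaced by a visible escape sequence
print(escape_cell("señor"))        # the accented letter is escaped as well
print(escape_cell(3.5))            # non-strings pass through unchanged
```

If the accents matter, removing only the characters that openpyxl rejects may be a better fit than escaping everything.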

The same problem happened to me. I solved it as follows:
install python package xlsxwriter:
pip install xlsxwriter
replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')

Trying a different Excel writer engine solved my problem.
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')

If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes the IllegalCharacterError error to be raised.
Open cell.py which is found at /path/to/your/python/site-packages/openpyxl/cell/, look for check_string function, you'll see it is using a defined regular expression pattern ILLEGAL_CHARACTERS_RE to find those illegal characters. Trying to locate its definition you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy this line to your program and execute the below code before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source Excel spreadsheets contain those characters, you will still face this problem. So if you can control how the source spreadsheets are generated, try to remove these characters there to begin with.
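Put together, the approach in this answer could look like the sketch below. The pattern is the one quoted above from cell.py; strip_illegal is a hypothetical helper name. Tab, line feed and carriage return are not in the pattern, so they are kept.

```python
import re

# Same pattern as the line quoted above from openpyxl's cell.py.
ILLEGAL_CHARACTERS_RE = re.compile(r"[\000-\010]|[\013-\014]|[\016-\037]")

def strip_illegal(value):
    """Remove the characters matched by the pattern from strings only."""
    # strip_illegal is an illustrative name; non-string cells are returned as-is.
    if isinstance(value, str):
        return ILLEGAL_CHARACTERS_RE.sub("", value)
    return value

print(repr(strip_illegal("a\x00b\x0bc\x1fd")))
print(repr(strip_illegal("tab\there\n")))
```

A function like this can be passed to dataframe.applymap in place of the lambda shown above.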

I was also struggling with some weird characters in a data frame when writing the data frame to html or csv. For example, for characters with accent, I can't write to html file, so I need to convert the characters into characters without the accent.
My method may not be the best, but it helps me to convert unicode string into ascii compatible.
# install unidecode first
from unidecode import unidecode
def FormatString(s):
    # Python 2: unicode is the text type (use str on Python 3)
    if isinstance(s, unicode):
        try:
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            return unidecode(s)
    else:
        return s
df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return 'StringYouWantToReplace'.
Hope this gives you some ideas for dealing with your problem.
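If installing unidecode is not an option, a rough standard-library substitute is to decompose the text with unicodedata and drop whatever is not ASCII. This is a different technique from unidecode, not a reimplementation of it: characters that have no decomposition (for example "ß") are simply removed rather than transliterated. The sketch assumes Python 3, and to_ascii is a made-up name.

```python
import unicodedata

def to_ascii(value):
    """Strip accents from strings by NFKD decomposition; leave other values alone."""
    # to_ascii is a hypothetical helper, not part of pandas or unidecode.
    if not isinstance(value, str):
        return value
    decomposed = unicodedata.normalize("NFKD", value)
    # "ignore" drops the combining marks and anything else outside ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("señor"))
print(to_ascii("café"))
```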

You can use built-in strip() method for python strings.
for each cell:
text = str(illegal_text).strip()
for entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())

If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

Related

Is it possible to read a csv with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (I suppose single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs suggest length-1 for C parsers, maybe I can plugin some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file. Specifically ParserError: Error tokenizing data., it expects the correct number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the csv file I'm trying to read.

The problem might be in the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
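The effect can be reproduced with Python's own csv reader, which is only a stand-in here: pandas uses its own parser and may not behave identically. The sample line is invented for illustration.

```python
import csv

# One tab-separated record whose quoted field ends with "!".
line = 'a\t"some important text!"\tb'

# Without an escapechar the record splits into three fields.
plain = next(csv.reader([line], delimiter="\t"))

# With escapechar="!" the closing quote is treated as escaped,
# so the quoted field swallows the rest of the line.
escaped = next(csv.reader([line], delimiter="\t", escapechar="!"))

print(len(plain), plain)
print(len(escaped), escaped)
```

If the second call yields fewer fields than the first, a stray "!" before a quote is a plausible cause of the misaligned columns.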

I need to replace non-ASCII characters in pandas data frame column in python 2.7

This question has been asked many times, but none of the solutions worked for me.
The data frame was pulled from a third party excel file with 'UTF-8' encoding:
pd.read_excel(file, encoding = 'UTF-8', sheet_name = worksheet)
But I still have characters like " â€™ " instead of " ' " in some lines.
On the top of the code I have the following
# -*- encoding: utf-8 -*-
The following line does not throw errors, but does not change anything in the data:
df['text'] = df['text'].str.replace("â€™","'")
I tried with dictionary (which has the same core), like
repl_dict = {"â€™": "'"}
for k, v in repl_dict.items():
    df.loc[df.text.str.contains(k), 'text'] = df.text.str.replace(pat=k, repl=v)
and tried many other approaches including regex, but nothing worked.
When I tried:
def replace_apostrophy(text):
    return text.replace("â€™","'")
df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))
I received the following error -
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
When I tried:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))
I got the following error -
TypeError: normalize() argument 2 must be unicode, not float
The text also has emojis that I need to count somehow afterwards.
Can someone give me some good advice?
Thank you very much!

I have found a solution myself. It might look clumsy, but works perfectly in my case:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
I had to replace nan values prior to running that code.
That operation gives me ASCII symbols only, which can be easily replaced:
def replace_apostrophy(text):
    return text.replace("a\u0302\u20acTM","'")
Hope this would help someone.
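For readers on Python 3, a sequence like "â€™" usually means that UTF-8 bytes were decoded as Windows-1252 somewhere along the way. That is an assumption about this data, not something the question confirms. Under that assumption the damage can be reversed by re-encoding as cp1252 and decoding as UTF-8, as in this sketch (fix_mojibake is a made-up name). Values that do not round-trip, including NaN floats and text that was never damaged, are returned unchanged.

```python
def fix_mojibake(value):
    """Undo a UTF-8-read-as-cp1252 mix-up; return the input unchanged if it does not apply."""
    # fix_mojibake is a hypothetical helper; the cp1252 guess is an assumption.
    if not isinstance(value, str):
        return value  # e.g. NaN floats from empty cells
    try:
        return value.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError (emojis, etc.) and UnicodeDecodeError.
        return value

print(fix_mojibake("itâ€™s"))
print(fix_mojibake("plain text"))
```

Because emojis cannot be encoded as cp1252, cells that already contain correctly decoded emojis are left alone.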

'UnicodeEncodeError' when using 'set_dataframe'

When using set_dataframe to update my Google Sheets via pygsheets and pandas, I get error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 13: ordinal not in range(128)
This is due to utf-8 marks over some text, e.g.,: "señor"
This happens on executing:
wks.set_dataframe(df, start='A10')
Pandas to_csv accepts an encoding parameter such as encoding="utf-8"; may I suggest that set_dataframe do the same?
wks.set_dataframe(df, start='A10', encoding="utf-8")
I see there's a ticket opened 10 days ago here but is there a workaround?

Solution:
I ran into the same issue, and I think that, more than a bug in the pygsheets module, it would be a limitation as you clearly point out.
What I did to solve this issue was:
def encodeDataFrame(df, encoding='UTF-8'):
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == unicode:
                try:
                    encodedSeries = series.str.encode(encoding)
                    df[column] = encodedSeries
                except Exception as e:
                    print 'Could not encode column %s' % column
                    raise
And you can call the function this way:
encodeDataFrame(df)
wks.set_dataframe(df, start='A10')
This might no longer be a good solution because of a change that was made in pygsheets to avoid this issue. See EDIT section below
Explanation:
You solve the issue by encoding the unicode values yourself, before sending them to the set_dataframe function.
This problem comes up whenever you try to use the Worksheet.set_dataframe function, using a dataframe that contains unicode characters that cannot be encoded in ascii (like accents, and many other).
The exception is thrown because the set_dataframe function attempts to cast the unicode values into str values (using the default encoding). For Python 2, the default encoding is ascii and when a character out of the range of ascii is found, the exception is thrown.
Some people have suggested reloading the sys module to circumvent this problem, but here is explained why you should not do it
The other solution I would think of would be to use the pygsheets module in Python 3, where this should no longer be a problem because the default encoding for Python 3 is UTF-8 (see docs)
Bonus:
Ask yourself:
1) Is Unicode an encoding?
2) What is an encoding?
If you hesitated with any of those questions, you should read this article, which gave me the knowledge needed to think of this solution. The time invested was completely worth it.
For more information, you can try this article which links to the previous one at the end.
EDIT:
A change was made 2 days ago (07/26/19) to pygsheets that is supposed to fix this. It looks like the intention is to avoid encoding into the str type, but I would think that this change might try decoding strings into the unicode type from the default ascii encoding, which could also lead to trouble. When/If this change is released it is probably going to be better not to encode anything and pass values as unicode to the set_dataframe function.
EDIT 2:
This change has now been released in version 2.0.2. If my predictions were true, using the encodeDataFrame function I suggested will result in a UnicodeDecodeError because the Worksheet.set_dataframe function will attempt to decode the str values with the default ascii encoding. So the best way to use the function will be not to have any str values in your dataframe. If you have them, decode them to unicode before calling the set_dataframe function. You could have a mirror version of the function I suggested. It would look something like this:
def decodeDataFrame(df, encoding='UTF-8'):
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == str:
                try:
                    decodedSeries = series.str.decode(encoding)
                    df[column] = decodedSeries
                except Exception as e:
                    print 'Could not decode column %s' % column
                    raise

pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 11 [duplicate]

I'm trying to use pandas to manipulate a .csv file but I get this error:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
I have tried to read the pandas docs, but found nothing.
My code is simple:
path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)
How can I resolve this? Should I use the csv module or another language ?
File is from Morningstar

You could also try:
data = pd.read_csv('file1.csv', on_bad_lines='skip')
Do note that this will cause the offending lines to be skipped.
Edit
For Pandas < 1.3.0 try
data = pd.read_csv("file1.csv", error_bad_lines=False)
as per pandas API reference.

It might be an issue with:
1) the delimiters in your data
2) the first row, as @TomAugspurger noted
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,
df = pandas.read_csv(filepath, sep='delimiter', header=None)
In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.
According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.
Another solution may be to try auto detect the delimiter
# use the first 2 lines of the file to detect separator
temp_lines = csv_file.readline() + '\n' + csv_file.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=';,')
# remember to go back to the start of the file for the next time it's read
csv_file.seek(0)
df = pd.read_csv(csv_file, sep=dialect.delimiter)
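Here is a self-contained version of the same idea that runs without a file on disk. An in-memory buffer with invented sample data stands in for csv_file; with a real file you would open it and pass the handle instead.

```python
import csv
import io

# Stand-in for an open file; the contents are made up for illustration.
csv_file = io.StringIO("a;b;c\n1;2;3\n4;5;6\n")

# readline() keeps the trailing newline, so the two lines can be joined directly.
sample = csv_file.readline() + csv_file.readline()
dialect = csv.Sniffer().sniff(sample, delimiters=";,")

# Rewind so the next reader starts from the first line again.
csv_file.seek(0)
rows = list(csv.reader(csv_file, dialect))

print(repr(dialect.delimiter))
print(rows[0])
```

The detected dialect.delimiter can then be handed to read_csv through sep, as in the snippet above.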

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.
Try it with data = pd.read_csv(path, skiprows=2)

This is definitely a delimiter issue, as most CSV files are created using sep='\t'. Try calling read_csv with the tab character (\t) as the separator, using the following line of code:
data=pd.read_csv("File_path", sep='\t')

I had this problem, where I was trying to read in a CSV without passing in column names.
df = pd.read_csv(filename, header=None)
I specified the column names in a list beforehand and then pass them into names, and it solved it immediately. If you don't have set column names, you could just create as many placeholder names as the maximum number of columns that might be in your data.
col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)

Your CSV file might have variable number of columns and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:
1) Change the CSV file to have a dummy first line with max number of columns (and specify header=[0])
2) Or use names = list(range(0,N)) where N is the max number of columns.
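One way to find N without guessing is to scan the file once with the csv module, which also respects quoted fields that contain the delimiter. This is a sketch with invented sample data; max_field_count is a hypothetical helper name.

```python
import csv
import io

def max_field_count(lines, delimiter=","):
    """Return the largest number of fields found on any line (0 for empty input)."""
    # max_field_count is an illustrative name, not a pandas function.
    reader = csv.reader(lines, delimiter=delimiter)
    return max((len(row) for row in reader), default=0)

# In-memory stand-in for a ragged CSV file; the third row has a quoted comma.
ragged = io.StringIO('1,2,3\n1,2,3,4\n1,"2,5",3,4,5\n1,2\n')

n_columns = max_field_count(ragged)
column_names = list(range(n_columns))
print(n_columns, column_names)
```

column_names can then be passed as names= to read_csv so that short rows are padded with NaN.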

I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:
data = pd.read_csv('file1.csv', error_bad_lines=False)
If you want to keep the lines an ugly kind of hack for handling the errors is to do something like the following:
line = []
expected = []
saw = []
cont = True
while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        message = str(e)  # use e.message on Python 2
        errortype = message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror = message.split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
            raise  # avoid retrying forever on an unrelated error
if line != []:
    pass  # Handle the errors however you want
I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
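The string splitting used to pull the numbers out of the message can be written more compactly with a regular expression. This sketch relies on the message wording quoted in this thread ("Expected 2 fields in line 3, saw 12"); pandas does not promise that the wording stays stable, so treat the pattern as an assumption. The names are made up for illustration.

```python
import re

# Pattern based on the error text quoted in this thread (assumed format).
TOKENIZING_ERROR_RE = re.compile(r"Expected (\d+) fields in line (\d+), saw (\d+)")

def parse_tokenizing_error(message):
    """Return (expected, line_number, saw) as ints, or None if the message differs."""
    match = TOKENIZING_ERROR_RE.search(message)
    if match is None:
        return None
    expected, line_number, saw = (int(group) for group in match.groups())
    return expected, line_number, saw

msg = "Error tokenizing data. C error: Expected 2 fields in line 3, saw 12"
print(parse_tokenizing_error(msg))
```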

The following worked for me (I posted this answer, because I specifically had this problem in a Google Colaboratory Notebook):
df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)

You can try:
data = pd.read_csv('file1.csv', sep='\t')

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.
Edit:
I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (greater than one line, so skip_header doesn't work) which will not be separated by the same number of commas as your actual data (when using read_csv). read_table uses a tab as the delimiter, which could circumvent the user's current error but introduce others.
I usually get around this by reading the extra data into a file then use the read_csv() method.
The exact solution might differ depending on your actual file, but this approach has worked for me in several cases

I've had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by "properly", I mean each row had the same number of separators or columns.
Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.
Any file saved with pandas to_csv will be properly formatted and shouldn't have that issue. But if you open it with another program, it may change the structure.
Hope that helps.

I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:
1115794 4218 "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180 "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444 2328 "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""
import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
This says it has something to do with C parsing engine (which is the default one). Maybe changing to a python one will change anything
counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')
Segmentation fault (core dumped)
Now that is a different error.
If we go ahead and try to remove spaces from the table, the error from python-engine changes once again:
1115794 4218 "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180 "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444 2328 "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""
_csv.Error: ' ' expected after '"'
And it gets clear that pandas was having problems parsing our rows. To parse a table with python engine I needed to remove all spaces and quotes from the table beforehand. Meanwhile C-engine kept crashing even with commas in rows.
To avoid creating a new file with replacements I did this, as my tables are small:
from io import StringIO
with open(path_counts) as f:
    input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')
tl;dr
Change parsing engine, try to avoid any non-delimiting quotes/commas/spaces in your data.

Use delimiter in parameter
pd.read_csv(filename, delimiter=",", encoding='utf-8')
It will read.

The dataset that I used had a lot of quote marks (") that were extraneous to the formatting. I was able to fix the error by including this parameter for read_csv():
quoting=3 # 3 correlates to csv.QUOTE_NONE for pandas
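The constant is easier to read by name than as the bare number 3. As a small illustration with Python's own csv reader (a stand-in; pandas has its own parser), QUOTE_NONE makes quote marks ordinary characters. The sample line is invented.

```python
import csv

# A line with stray quote marks that are part of the data, not field wrappers.
line = '12" pipe,4,"spare'

# With QUOTE_NONE the reader does no special processing of quote characters.
fields = next(csv.reader([line], quoting=csv.QUOTE_NONE))

print(csv.QUOTE_NONE)
print(fields)
```

In read_csv the equivalent spelling would be quoting=csv.QUOTE_NONE after import csv.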

As far as I can tell, and after taking a look at your file, the problem is that the csv file you're trying to load has multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stackoverflow answer. It shows how to achieve that programmatically.
Another dynamic approach to do that would be to use the csv module, read every single row at a time and make sanity checks/regular expressions, to infer if the row is (title/header/values/blank). You have one more advantage with this approach, that you can split/append/collect your data in python objects as desired.
The easiest of all would be to use pandas function pd.read_clipboard() after manually selecting and copying the table to the clipboard, in case you can open the csv in excel or something.
Irrelevant:
Additionally, irrelevant to your problem, but because no one made mention of this: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error was occurring because some separators had more whitespaces than a true tab \t. See line 3 in the following for instance
14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1
14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1
14.11 14.1 0.8911 5.42 3.302 2.7 5 1
Therefore, use \t+ in the separator pattern instead of \t.
data = pd.read_csv(path, sep='\t+', header=None)
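The difference between splitting on a single tab and on a run of tabs can be checked on one line with the re module. The sample line below is invented, with a doubled tab between the second and third values.

```python
import re

# Two tabs in a row between "14.1" and "0.8911".
line = "14.11\t14.1\t\t0.8911\t5.42"

naive = line.split("\t")            # the doubled tab produces an empty field
collapsed = re.split(r"\t+", line)  # a run of tabs counts as one separator

print(naive)
print(collapsed)
```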

Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
The error gives a clue to solve the problem " Expected 2 fields in line 3, saw 12", saw 12 means length of the second row is 12 and first row is 2.
When you have data like the one shown below, if you skip rows then most of the data will be skipped
data = """1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4"""
If you don't want to skip any rows, do the following
# First, let's find the maximum number of columns across all rows
with open("file_name.csv", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]
### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(max(col_count))]
import pandas as pd
# inside range set the maximum value you can see in "Expected 4 fields in line 2, saw 8"
# here will be 8
data = pd.read_csv("file_name.csv",header = None,names=column_names )
Use range instead of manually setting names as it will be cumbersome when you have many columns.
Additionally you can fill up the NaN values with 0, if you need to use even data length. Eg. for clustering (k-means)
new_data = data.fillna(0)

For those who are having a similar issue with Python 3 on Linux:
pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Try:
pd.read_csv('file.csv', encoding='utf8', engine='python')

In my case the separator was not the default "," but Tab.
pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')
Note: "\t" did not work as suggested by some sources. "\\t" was required.

I believe the solutions engine='python' and error_bad_lines=False are fine if the extra columns are dummy columns that you want to drop.
In my case, the second row really had more columns, and I wanted those columns to be included so that the number of columns = MAX(columns).
Here is the solution I used, which I could not find anywhere else:
try:
    df_data = pd.read_csv(PATH, header = bl_header, sep = str_sep)
except pd.errors.ParserError as err:
    str_find = 'saw '
    int_position = int(str(err).find(str_find)) + len(str_find)
    str_nbCol = str(err)[int_position:]
    l_col = range(int(str_nbCol))
    df_data = pd.read_csv(PATH, header = bl_header, sep = str_sep, names = l_col)

Although not the case for this question, this error may also appear with compressed data. Explicitly setting the value for kwarg compression resolved my problem.
result = pandas.read_csv(data_source, compression='gzip')

Simple resolution: open the CSV file in Excel and save it under a different name in CSV format. Then try importing it in Spyder again; your problem should be resolved.

The issue is with the delimiter. Find what kind of delimiter is used in your data and specify it like below:
data = pd.read_csv('some_data.csv', sep='\t')

I came across multiple solutions for this issue. Lots of folks have already given good explanations in their answers, but for beginners I think the two methods below will be enough:
import pandas as pd
#Method 1
data = pd.read_csv('file1.csv', error_bad_lines=False)
#Note that this will cause the offending lines to be skipped.
#Method 2 using sep
data = pd.read_csv('file1.csv', sep='\t')

Sometimes the problem is not how to use python, but with the raw data.
I got this error message
Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.
It turned out that in the column description there were sometimes commas. This means that the CSV file needs to be cleaned up or another separator used.
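If you control how the file is written, quoting the affected fields avoids the problem without changing the separator. The sketch below uses Python's csv module and invented rows: the writer quotes the description that contains a comma, a naive split on commas miscounts that line, and a csv-aware reader recovers the original rows.

```python
import csv
import io

# Invented rows; the second one has a comma inside the description.
rows = [["id", "description"], ["1", "red, large"], ["2", "blue"]]

# QUOTE_MINIMAL wraps only the fields that need it in quote marks.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)
text = buffer.getvalue()

# Splitting on commas miscounts the quoted field...
naive_counts = [len(line.split(",")) for line in text.splitlines()]
# ...while a csv-aware reader gets the original rows back.
parsed = list(csv.reader(io.StringIO(text)))

print(naive_counts)
print(parsed)
```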

An alternative that I have found to be useful in dealing with similar parsing errors uses the CSV module to re-route data into a pandas df. For example:
import csv
import pandas as pd
path = 'C:/FileLocation/'
file = 'filename.csv'
f = open(path+file,'rt')
reader = csv.reader(f)
#once contents are available, I then put them in a list
csv_list = []
for l in reader:
    csv_list.append(l)
f.close()
#now pandas has no problem getting into a df
df = pd.DataFrame(csv_list)
I find the CSV module to be a bit more robust to poorly formatted comma separated files and so have had success with this route to address issues like these.

The following sequence of commands works (I lose the first line of the data since header=None is not passed, but at least it loads):
df = pd.read_csv(filename,
usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']
Following does NOT work:
df = pd.read_csv(filename,
names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
usecols=range(0, 42))
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
Following does NOT work:
df = pd.read_csv(filename,
header=None)
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
Hence, in your problem you have to pass usecols=range(0, 2)

use
pandas.read_csv('CSVFILENAME',header=None,sep=', ')
when trying to read csv data from the link
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
I copied the data from the site into my csvfile. It had extra spaces so used sep =', ' and it worked :)

I had a similar case as this and setting
train = pd.read_csv('input.csv' , encoding='latin1',engine='python')
worked

Check if you are loading the csv with the correct separator.
df = pd.read_csv(csvname, header=0, sep=",")

I had a dataset with pre-existing row numbers, so I used index_col:
pd.read_csv('train.csv', index_col=0)

How to mark strings in Pandas Dataframe

I have a Pandas Dataframe where each column is Series of strings. The strings represent the path to files that might be or not be physically present on the hard drive.
I would like to mark the path pointing to not existent files, i.e. by coloring the string or its background.
Unfortunately I can't use Styler.applymap(func) because func should take a scalar input, not a string.
Also I just got that anyway Styler wouldn't really work for me because I use the Pycharm python console or the terminal on Ubuntu, not Jupyter
What can I do?
Example function for checking the existence of a file and returning a color
import os
def color_not_existent_file(path):
    color = ('red' if not os.path.exists(path) else 'green')
    return 'color: {}'.format(color)
