How to mark strings in a pandas DataFrame

I have a pandas DataFrame where each column is a Series of strings. The strings are paths to files that may or may not be physically present on the hard drive.
I would like to mark the paths pointing to non-existent files, e.g. by coloring the string or its background.
Unfortunately I can't use Styler.applymap(func), because func is supposed to take a scalar input, not a string.
Also, I just realized that Styler wouldn't really work for me anyway, because I use the PyCharm Python console or the terminal on Ubuntu, not Jupyter.
What can I do?
Example function for checking the existence of a file and returning a color
import os

def color_not_existent_file(path):
    color = 'red' if not os.path.exists(path) else 'green'
    return 'color: {}'.format(color)
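Since Styler only produces HTML, one option in a PyCharm console or an Ubuntu terminal is to color the strings yourself with ANSI escape codes before printing. A minimal sketch (the helper name, the constants, and the sample frame are hypothetical, not from the question):

import os
import pandas as pd

RED, GREEN, RESET = '\033[31m', '\033[32m', '\033[0m'

def color_path(path):
    # Wrap the path in an ANSI color code based on file existence.
    color = GREEN if os.path.exists(path) else RED
    return '{}{}{}'.format(color, path, RESET)

# Hypothetical example frame; the coloring is for printing only.
df = pd.DataFrame({'files': ['/etc/hosts', '/no/such/file.txt']})
print(df.applymap(color_path).to_string())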

Related

read_csv pandas, encoding issue

I have a csv-file with a list of keywords that I want to use for some filtering of texts.
I saved the csv file and tried to open it in my notebook using pd.from_csv('file.csv', encoding = 'UTF-8').
This didn't work even though I specified the encoding explicitly.
After some searching I found several different encodings and decided to go for
keywords = pd.read_csv('file.csv', encoding = 'latin1')
which gets me the actual keywords, but when inspecting the words I see that the spaces come through as follows:
['falsification\xa0',
'détournement\xa0de\xa0subsides\xa0',
'parachutes\xa0dorés\xa0',...]
About the csv file: it has two columns of keywords, one column in Dutch, the other one in French. The issue with the spaces persists even when I use other encodings like
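For what it's worth, \xa0 is the non-breaking space (U+00A0), which latin1 decodes as-is, so the file itself most likely contains NBSPs rather than ordinary spaces. A minimal cleanup sketch (assuming the keyword columns hold plain strings):

import pandas as pd

keywords = pd.read_csv('file.csv', encoding='latin1')

# Replace non-breaking spaces with ordinary spaces, then trim the ends.
keywords = keywords.applymap(
    lambda s: s.replace('\xa0', ' ').strip() if isinstance(s, str) else s)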

pandas to_latex() escaping "\\newline" for use in LaTeX

I am struggling with inexplicable differences in how Windows and Mac handle \newline in a jinja template string.
I have a string, mystring = '123 \\newline 456', that is inside a dataframe. When I use pandas to_latex() on a Mac, this gets converted to 123 \textbackslashnewline. But on Windows, it gets converted to 123 \textbackslash newline.
The difference between the two complicates the work-around, which is to replace \textbackslash with an empty string: on the Windows version I need to include the space, and on the Mac version I need to leave it out.
Is there a better way to incorporate \newline into the table? I tried r'\newline' and I tried just '\newline', and neither works. The reason the latter doesn't work is that the leading \n is interpreted as a newline character rather than as part of the string. The double backslash that I use is simply meant to get around that, but then to_latex() interprets the double backslash as \textbackslash.
I discovered that the issue was really with the pandas to_latex() command and have edited the original question.
For the Windows version, pandas was replacing \\newline with '\textbackslash newline'. But on the Mac, it was replacing it with '\textbackslashnewline', without the space. I was then using:
.replace('textbackslash', '')
which worked on the Mac but failed on Windows because of the extra space. What I cannot figure out, though, is why there is a difference between the Mac and Windows to_latex() functions.
Note that I am aware of the escape=False option in to_latex(), however that messes up instances where I want the parts of the string escaped (e.g. \$).
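One platform-independent approach might be to do the escaping yourself and pass escape=False, so that to_latex() never touches the backslash in \newline. A sketch under that assumption (the helper, the SPECIALS table, and the sample strings are mine, not from the question):

import pandas as pd

# Escape the LaTeX specials we care about, but deliberately leave the
# backslash alone so commands like \newline survive untouched.
SPECIALS = {'&': r'\&', '%': r'\%', '$': r'\$', '#': r'\#', '_': r'\_'}

def escape_keep_commands(s):
    return ''.join(SPECIALS.get(ch, ch) for ch in s)

df = pd.DataFrame({'text': ['123 \\newline 456', 'cost $5 & up']})
print(df.applymap(escape_keep_commands).to_latex(escape=False))

This keeps 123 \newline 456 intact while still producing \$ and \& in the other cell.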

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters from a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(
    lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
The same problem happened to me. I solved it as follows:
install the Python package xlsxwriter:
pip install xlsxwriter
replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern that causes the IllegalCharacterError to be raised.
Open cell.py, which is found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see it uses a predefined regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Tracing back to its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This pattern is what you need to remove those characters. Copy this line into your program (along with import re) and execute the code below before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters from every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem. So if you can control the generation process of the source spreadsheets, try to remove these characters there to begin with.
I was also struggling with some weird characters in a data frame when writing it to html or csv. For example, with accented characters I couldn't write to an html file, so I needed to convert them into characters without the accent.
My method may not be the best, but it helped me convert Unicode strings into ASCII-compatible ones.
# install unidecode first: pip install unidecode
from unidecode import unidecode

def FormatString(s):
    # Python 2 code: on Python 3 there is no `unicode`, so test
    # isinstance(s, str) instead.
    if isinstance(s, unicode):
        try:
            # Already ASCII-safe? Return it unchanged.
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            # Otherwise transliterate accents etc. to plain ASCII.
            return unidecode(s)
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, you can change return unidecode(s) to return whatever replacement string you want.
Hope this gives you some ideas for dealing with your problem.
You can use the built-in strip() method for Python strings.
For each cell:
text = str(illegal_text).strip()
For the entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd

# Load the DataFrame that refused to write via ExcelWriter.
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'

# Write through a live Excel instance instead of openpyxl.
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

Getting wrong zero values with numpy fromfile when reading binary files

I am trying to read a binary file with Python. This is the code I use:
import numpy

fb = open(Bin_File, "r")
a = numpy.fromfile(fb, dtype=numpy.float32)
However, I get zero values at the end of the array. For example, for a case where nrows=296 and ncols=439 (so a should have 296*439 entries), I get zero values for a[-922:]. I know these values should be noData (-9999 in this example) from a trusted piece of code in R. Does anybody know why I am getting these nonsensical zeros?
P.S.: I am not sure whether it is related or not, but len(a) is nrows*ncols+2! I have to get rid of these two extra entries using a = a[0:-2] so that when I reshape the array into rows and columns with a_reshape = a.reshape(nrows, ncols) I don't get an error.
When opening a file for reading as binary, you should use the mode "rb" instead of "r".
Here is some background from the docs. On Linux machines you don't need the "b", but it won't hurt. On Windows machines you must use "rb" for binary files.
Also note that the two extra entries you're getting are a common bug/feature of the "unformatted" binary output format of Fortran. Each write statement given in this mode produces a record that is surrounded by two 4-byte blocks.
These blocks represent integers that list the number of bytes in the block of unformatted data. For example, [223] [223 bytes of data] [223].
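Putting both points together, a minimal sketch (assuming Bin_File holds a single unformatted Fortran record of nrows*ncols float32 values, as the question suggests):

import numpy as np

nrows, ncols = 296, 439

# "rb" is essential on Windows: "r" translates line endings and stops
# at the first 0x1A byte, which corrupts or truncates the data.
with open(Bin_File, "rb") as fb:
    a = np.fromfile(fb, dtype=np.float32)

# Strip the two 4-byte Fortran record markers (one before the data,
# one after), which fromfile has read as bogus float32 values.
grid = a[1:-1].reshape(nrows, ncols)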

Reading just 1 column from a file using NumPy's loadtxt() function

I want to read in data from multiple files that I want to use for plotting (matplotlib).
I found a function loadtxt() that I could use for this purpose. However, I only want to read in one column from each file.
How would I do this?
The following command works for me if I read in at least 2 columns, for example:
numpy.loadtxt('myfile.dat', usecols=(2,3))
But
numpy.loadtxt('myfile.dat', usecols=(3))
would throw an error.
You need a comma after the 3 in order to tell Python that (3,) is a tuple. Python interprets (3) to be the same value as the int 3, and loadtxt wants a sequence-type argument for usecols.
numpy.loadtxt('myfile.dat', usecols=(3,))
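As an aside, newer NumPy releases (1.11 and later, if I remember correctly) also accept a bare integer for usecols:

numpy.loadtxt('myfile.dat', usecols=3)

The tuple form above is the one that works on every version, though.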