Saving IPython variables to a text file

I have a few lists and arrays in ipython which I should like to save to a text file so that I can then use them in another context. How can this be done?

Look at the %store magic function
important = ['item', 42, 'list']
%store important
... time passes, sessions restarted
%store -r
%store
Stored variables and their in-db values:
important -> ['item', 42, 'list']
Or look at pickle.
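For instance, a minimal pickle sketch (the file name is just an example; json works similarly if you want a human-readable text file):
import pickle

important = ['item', 42, 'list']

# write the object to disk (pickle produces a binary file, not plain text)
with open('important.pkl', 'wb') as f:
    pickle.dump(important, f)

# ...later, in another session or script
with open('important.pkl', 'rb') as f:
    important = pickle.load(f)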

'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte [duplicate]

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...
File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
data = pd.read_csv(filepath, names=fields)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
return parser.read()
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
The source/creation of these files all come from the same place. What's the best way to correct this to proceed with the import?
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.
You can also use one of several alias options like 'latin' or 'cp1252' (Windows) instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).
See the relevant Pandas documentation, the Python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about Unicode and character sets.
To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
Simplest of all Solutions:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
Alternate Solution:
Sublime Text:
Open the csv file in the Sublime Text editor or VS Code.
Save the file in UTF-8 format.
In Sublime, click File -> Save with Encoding -> UTF-8
VS Code:
In the bottom bar of VSCode, you'll see the label UTF-8. Click it. A popup opens. Click Save with encoding. You can now pick a new encoding for that file.
Then, you could read your file as usual:
import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')
Other encoding types you can try are:
encoding = "cp1252"
encoding = "ISO-8859-1"
Pandas allows you to specify the encoding, but does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.
You know the encoding, and there is no encoding error in the file.
Great: you just have to specify the encoding:
file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code point):
pd.read_csv(input_file_and_path, ..., encoding='latin1')
You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and which contains some lines with a different encoding. Pandas has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequences:
file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
pd.read_csv(input_fd, ...)
with open('filename.csv') as f:
    print(f)
After executing this code you will find the encoding of 'filename.csv'; then execute the code as follows:
data = pd.read_csv('filename.csv', encoding="the encoding you found earlier")
There you go.
This is a more general script approach for the stated question.
import pandas as pd
encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737'
, 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862'
, 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950'
, 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254'
, 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr'
, 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2'
, 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2'
, 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9'
, 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab'
, 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2'
, 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32'
, 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']
for encoding in encoding_list:
    worked = True
    try:
        df = pd.read_csv(path, encoding=encoding, nrows=5)
    except:
        worked = False
    if worked:
        print(encoding, ':\n', df.head())
One starts with all the standard encodings available for the Python version (in this case 3.7; see the Python 3.7 standard encodings).
A usable Python list of the standard encodings for the different Python versions is provided in this helpful Stack Overflow answer.
The script tries each encoding on a small chunk of the data and prints only the encodings that work, so the output is immediately readable.
The output also addresses the problem that an encoding like 'latin1', which runs through without any error, does not necessarily produce the wanted outcome.
In the case of the question, I would try this approach for the specific problematic CSV file and then maybe try to use the working encoding found for all the others.
Please try to add
import pandas as pd
df = pd.read_csv('file.csv', encoding='unicode_escape')
This will help. Worked for me. Also, make sure you're using the correct delimiter and column names.
You can start by loading just 1000 rows so the file loads quickly.
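A minimal sketch of that, assuming the same file name as above (delimiter and column handling left at the defaults):
import pandas as pd

# load only the first 1000 rows so a wrong encoding guess fails fast
df = pd.read_csv('file.csv', encoding='unicode_escape', nrows=1000)
print(df.head())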
Try changing the encoding.
In my case, encoding = "utf-16" worked.
df = pd.read_csv("file.csv",encoding='utf-16')
In my case, a file had UCS-2 LE BOM encoding, according to Notepad++.
That is encoding="utf_16_le" for Python.
Hope it helps someone find an answer a bit faster.
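A minimal sketch of that case (the file name is illustrative; the codec name follows the mapping described above):
import pandas as pd

# Notepad++'s 'UCS-2 LE BOM' label corresponds to Python's utf_16_le codec
df = pd.read_csv('file.csv', encoding='utf_16_le')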
Try specifying the engine='python'.
It worked for me but I'm still trying to figure out why.
df = pd.read_csv(input_file_path,...engine='python')
In my case this worked for python 2.7:
data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False)
And for python 3, only:
data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False)
You can always try to detect the encoding of the file first, with chardet or cchardet or charset-normalizer:
from pathlib import Path
import chardet
import pandas as pd
filename = "file_name.csv"
detected = chardet.detect(Path(filename).read_bytes())
# detected is something like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
encoding = detected.get("encoding")
assert encoding, "Unable to detect encoding, is it a binary file?"
df = pd.read_csv(filename, encoding=encoding)
Struggled with this for a while and thought I'd post on this question as it's the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.
If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
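A minimal sketch of what that looks like (the path and encoding here are just examples):
import pandas as pd

# when passing a file handle, the encoding goes on open(), not on read_csv()
with open('data.csv', encoding='iso-8859-1') as f:
    df = pd.read_csv(f)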
I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you're going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you'll hit the following error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte
Fortunately, there are a few solutions.
Option 1: fix the exporting. Be sure to use UTF-8 encoding.
Option 2: if fixing the exporting problem is not available to you and you need to use pandas.read_csv, be sure to include the parameter engine='python'. By default, pandas uses engine='c', which is great for reading large clean files but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.
pd.read_csv(<your file>, engine='python')
Option 3 is my preferred solution personally: read the file using vanilla Python.
import pandas as pd
data = []
with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns=header)
Hope this helps people encountering this issue for the first time.
Another important issue that I faced which resulted in the same error was:
_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlsx")
^This line resulted in the same error because I was reading an Excel file using the read_csv() method. Use read_excel() for reading .xlsx files.
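A minimal sketch of the fix, reusing the path from the snippet above (newer pandas versions typically need the openpyxl package for .xlsx):
import pandas as pd

# use read_excel for Excel workbooks; a raw string avoids backslash-escape surprises
_values = pd.read_excel(r"C:\Users\Mujeeb\Desktop\file.xlsx")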
You can try this.
import csv
import pandas as pd
df = pd.read_csv(filepath,encoding='unicode_escape')
I had trouble opening a CSV file in simplified Chinese downloaded from an online bank.
I tried latin1, I tried iso-8859-1, I tried cp1252, all to no avail.
But pd.read_csv("",encoding ='gbk') simply does the work.
This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:
>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])
Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:
Python read csv - BOM embedded into the first key
The solution is to load the CSV with encoding="utf-8-sig":
>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])
Hopefully this helps someone.
I am posting an update to this old thread. I found one solution that worked, but it requires opening each file. I opened my csv file in LibreOffice and chose Save As > edit filter settings. In the drop-down menu I chose UTF-8 encoding. Then I added encoding="utf-8-sig" to the read: data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep=',', encoding="utf-8-sig").
Hope this helps someone.
I am using Jupyter Notebook, and in my case it was showing the file in the wrong format; the 'encoding' option was not working.
So I saved the csv in UTF-8 format, and it works.
Try this:
import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)
It looks like it will take care of the encoding without explicitly specifying it through an argument.
Check the encoding before you pass to pandas. It will slow you down, but...
with open(path, 'r') as f:
    encoding = f.encoding

df = pd.read_csv(path, sep=sep, encoding=encoding)
This works in Python 3.7.
Sometimes the problem is with the .csv file only. The file may be corrupted.
When faced with this issue, 'Save As' the file as a csv again.
1. Open the xls/csv file
2. Go to -> File
3. Click -> Save As
4. Write the file name
5. Choose 'file type' as -> CSV [very important]
6. Click -> OK
In my case, I could not manage to overcome this issue using any method provided before. Changing the encoding to utf-8, utf-16, iso-8859-1, or any other type somehow did not work.
But instead of using pd.read_csv(filename, delimiter=';'), I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
and things seem working just fine.
You can try with:
df = pd.read_csv('./file_name.csv', encoding='gbk')
Pandas does not automatically replace the offending bytes; you have to pick the right encoding yourself. In my case, changing the encoding parameter from encoding="utf-8" to encoding="utf-16" resolved the issue.

How to open and read a .gz file in Nim (preferably line by line)

I just sat down to write my first Nim script to parse a .vcf (Variant Call Format) file. This file format stores genetic mutations from sequencing data.
For scripting languages, I 'grew up' on Perl and later migrated to Python, but I would love to use a language with the speed that Nim offers. I realize Nim is still young, but I couldn't even find a clear example for how to open and read a .gz (gzip) file (preferably line by line).
Can anyone provide a simple example to open and read a gzip file using Nim, line by line?
In Python, I'm accustomed to the following (uber-simple) code:
import gzip
my_file = gzip.open('my_file.vcf.gz', 'rt')  # open for reading as text
for line in my_file:
    pass  # do something with the line
my_file.close()
I have seen related questions, but they're not clear. The posts are also relatively old and I hope/suspect something better has come about. Here's what I've found:
Read gzip-compressed file line by line
File, FileStream, and GZFileStream
Reading files from tar.gz archive in Nim
Really appreciate it.
P.S. I also think it would be useful if someone created a Nim tag in StackOverflow. I do not have the reputation to create tags.
Just in case you need to handle VCF rather than .gz, there's a nice wrapper for htslib written by Brent Pedersen:
https://github.com/brentp/hts-nim
You need to install htslib on your system, and then require the library in your .nimble file with requires "hts", or install the library with nimble install hts. If you are going to do NGS analysis in Nim, you'll need it.
The code you need:
import hts
var v: VCF
doAssert open(v, "myfile.vcf.gz")

# Here you have the VCF file loaded in v, and can access the headers through
# the v.header property
for record in v:
  # Here you get a Record object per line, e.g. extract the Ref and Alts:
  echo record.REF, " ", record.ALT

v.close()
Be sure to follow the docs, because some things differ from Python, especially when getting the INFO and FORMAT fields.
Check out Brent's whole repo. It has plenty of wrappers, code samples and utilities to handle NGS problems (e.g. an ultra-fast coverage utility called Mosdepth).
Per suggestion from Maurice Meyer, I looked at the tests for the Nim zip package. It turned out to be quite simple. This is my first Nim script, so my apologies if I didn't follow convention, etc.
import zip/gzipfiles # Import zip package

block:
  let vcf = newGzFileStream("my_file.vcf.gz") # Open gzip file
  defer: vcf.close() # Close file (like a 'finally' clause in a 'try' block)

  var line: string # Declare line variable

  # Loop over each line in the file
  while not vcf.atEnd():
    line = vcf.readLine()
    # Cure disease with my VCF file
To install the zip package, I simply ran the following, because it is already in the Nim package library:
> nimble refresh
> nimble install zip
I tried to use Nim some time ago to parse a fastq or fastq.gz file.
The code should be available here:
https://gitlab.pasteur.fr/bli/qaf_demux/blob/master/Nim/src/qaf_demux.nim
I don't remember exactly how this works, but apparently, I did an import zip/gzipfiles and used newGZFileStream on the input file name to obtain a Stream from which lines can be read using .readLine() in this piece of code:
proc fastqParser(stream: Stream): iterator(): Fastq =
  result = iterator(): Fastq =
    var
      nameLine: string
      nucLine: string
      quaLine: string
    while not stream.atEnd():
      nameLine = stream.readLine()
      nucLine = stream.readLine()
      discard stream.readLine()
      quaLine = stream.readLine()
      yield [nameLine, nucLine, quaLine]
It is used in something that amounts to this piece of code:
let inputFqs = fastqParser(newGZFileStream($inFastqFilename))
Hopefully you can adapt this to your case.
My .nimble file has a requires "zip#head". I suppose this triggers the installation of zip/gzipfiles.

GIMP Script.Fu script to batch convert JPEG to PNG

Can someone give me the script I would need to run to batch convert many *.jpeg files to *.png in Script.Fu in GIMP?
Currently I am spending way too much time manually exporting every image.
I can't install anything right now so can't use alternative applications.
Alright, after a lot of trial and error I finally figured out how to convert one file format to another using only GIMP.
This is the Script-Fu script for conversion to PNG:
(let* ((filename "{{filename}}")
       (output "{{output}}")
       (image (car (gimp-file-load 1 filename filename)))
       (drawable (car (gimp-image-get-active-layer image))))
  (file-png-save-defaults 1 image drawable output output))
Where {{filename}} is the input file that needs to be converted (a JPEG file, for example) and {{output}} is the output file that you need (it can simply be the same file name but with a .png extension).
How to run it (this can probably be improved):
gimp -i -n -f -d --batch "{{one-line script-fu}}"
You can find more about the command-line options in the GIMP online documentation.
The place that needs to be changed is {{one-line script-fu}} and it has to be... one line! You can probably do all of this in one file using cmd (if you use Windows), but for me it was easier to use Python, so here's the script for it:
import subprocess, os

def convert_to_png(file_dds):
    # Loads the command to run the GIMP CLI (second code block)
    # Note: remove "{{one-line script-fu}}" and leave one space after the --batch
    with open("gimp-convert.bat", "r") as f:
        main_script = f.read()

    # Prepares the Script-Fu script to be run, replacing the necessary file names and making it one line (the first code block)
    with open("gimp-convert-png.fu", "r") as f:
        script = f.read().replace("\n", " ").replace("{{filename}}", file_dds) \
            .replace("{{output}}", file_dds[:-3] + "PNG").replace("\\", "\\\\").replace("\"", "\\\"")

    subprocess.run(main_script + " \"" + script + "\" --batch \"(gimp-quit 1)\"",
                   cwd=os.getcwd(),
                   shell=True)
And you should get your file converted to PNG!
I needed this for my texture upscale project; you can find all of the code here.
Tested with GIMP 2.10
The real solution is to use ImageMagick's convert; it's as simple as magick convert some.jpeg some.png. There must be a "portable" version somewhere that you can use off a USB key.
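For instance, a minimal batch sketch around that command (it assumes the magick executable is on PATH and the JPEGs sit in the current directory):
import subprocess
from pathlib import Path

# convert every .jpeg in the current directory to a .png alongside it
for jpeg in Path('.').glob('*.jpeg'):
    png = jpeg.with_suffix('.png')
    subprocess.run(['magick', 'convert', str(jpeg), str(png)], check=True)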
Otherwise, with GIMP, there is a much less manual way that doesn't require writing a new script, since it uses an existing one:
get/install ofn-export-layers
File>Open the first JPEG
File>Open as Layers the remaining JPEGs. You can select several/all JPEGs in one call (the actual number is limited mostly by available RAM). Once this is done you have many JPEGs stacked in the same image.
File>Export all layers, making sure the name pattern you use ends in .png (the doc that comes with the script explains how that works).

Display output from another python script in jupyter notebook

I run a loop in my jupyter notebook that references another python file using the execfile command.
I want to be able to see all the various prints and outputs from the file I call via execfile. However, I don't see any of the pandas dataframe printouts. E.g. if I just write 'df' I don't see the dataframe rendered as a table, although I do see the output of 'print 5'.
Can someone help me what options I need to set to enable this to be viewed?
import pandas as pd
list2loop =['a','b','c','d']
for each_item in list2loop:
    execfile("test_file.py")
where 'test_file.py' is:
df=pd.DataFrame([each_item])
df
print 3
The solution is simply using the %run magic instead of execfile (whatever execfile is).
Say you have a file test.py:
#test.py
print(test_input)
Then you can simply do
for test_input in (1, 2, 3):
    %run -i test.py
The -i tells IPython to run the file in IPython's namespace, so the script knows about all your variables, and variables defined in your script are in your namespace afterwards. If you explicitly call sys.exit in your script, you additionally have to use -e.
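Note that a bare df on the last line of the script still won't render as a rich table the way it does at the end of a cell; one way around that (a sketch, using IPython's standard display() helper and the variable names from the question) is to display the dataframe explicitly inside the script:
# test_file.py
import pandas as pd
from IPython.display import display

df = pd.DataFrame([each_item])  # each_item comes from the notebook's namespace via %run -i
display(df)  # renders the dataframe as a table in the notebook output
print(3)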

Anyone know how to make a self-contained Awk/Gawk program on Windows

I'm using an awk script to do some reasonably heavy parsing that could be useful to repeat in the future but I'm not sure if my unix-unfriendly co-workers will be willing to install awk/gawk in order to do the parsing. Is there a way to create a self-contained executable from my script?
I'm not aware of a way to make a self-contained binary using AWK. However, if you like AWK, chances seem good that you might like Python, and there are several ways to make a self-contained Python program. For example, Py2Exe.
Here's a quick example of Python:
# comments are introduced by '#', same as AWK
import re # make regular expressions available
import sys # system stuff like args or stdin
# read from specified file, else read standard input
if len(sys.argv) == 2:
    f = open(sys.argv[1])
else:
    f = sys.stdin
# Compile some regular expressions to use later.
# You don't have to pre-compile, but it's more efficient.
pat0 = re.compile("regexp_pattern_goes_here")
pat1 = re.compile("some_other_regexp_here")
# for loop to read input lines.
# This assumes you want normal line separation.
# If you want lines split on some other character, you would
# have to split the input yourself (which isn't hard).
# I can't remember ever changing the line separator in my AWK code...
for line in f:
    FS = None  # default: split on whitespace
    # change FS to some other string to change the field sep
    words = line.split(FS)
    if pat0.search(line):
        pass  # handle the pat0 match case
    elif pat1.search(line):
        pass  # handle the pat1 match case
    elif words[0].lower() == "the":
        pass  # handle the case where the first word is "the"
    else:
        for word in words:
            pass  # do something with each word
Not the same as AWK, but easy to learn, and actually more powerful than AWK (the language has more features and there are many "modules" to import and use). Python doesn't have anything implicit like the
/pattern_goes_here/ {
    # code goes here
}
feature in AWK, but you can simply have an if/elif/elif/else chain with patterns to match.
There's a standalone awk.exe in the Cygwin Toolkit, as far as I know.
You could just bundle that in with whatever files you're distributing to your colleagues.
Does it have to be self-contained? You could write a small executable that invokes awk with the right arguments and pipes the results to a file the user chooses, or to stdout - whichever is appropriate for your co-workers.
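For instance, a minimal wrapper sketch along those lines (file names are illustrative and it assumes awk is on PATH):
import subprocess
import sys

def run_awk(awk_script, input_file, output_file=None):
    # run the bundled awk program over the input and send the result to a file or stdout
    out = open(output_file, 'w') if output_file else sys.stdout
    try:
        subprocess.run(['awk', '-f', awk_script, input_file], stdout=out, check=True)
    finally:
        if output_file:
            out.close()

if __name__ == '__main__':
    # usage: python wrapper.py parse.awk data.txt [result.txt]
    run_awk(sys.argv[1], sys.argv[2], sys.argv[3] if len(sys.argv) > 3 else None)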
MAWK in GnuWin32 — http://gnuwin32.sourceforge.net/packages/mawk.htm
Also an interesting alternative: a Java implementation — http://sourceforge.net/projects/jawk/