When using pandas DataFrame.to_csv() with compression='zip', it creates a zip file with two archive members with the EXACT same name

I am trying to save OHLCV (stock pricing) data from a dataframe into a single zipped csv file as follows. My test data is ohlcvData.csv, which I read into a dataframe with
import pandas as pd
df = pd.read_csv('ohlcvData.csv', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')
and when I try to write it to a zip file like so (following stackoverflow.com/questions/55134716):
df.to_csv('ohlcvData.zip', header=False, compression=dict(method='zip', archive_name='ohlcv.csv'))
I get the following warning ...
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\zipfile.py:1473: UserWarning: Duplicate name: 'ohlcv.csv'
return self._open_to_write(zinfo, force_zip64=force_zip64)
and the resultant ohlcvData.zip file contains two files, both named ohlcv.csv, each containing a portion of the results.
When I try to read the zip file back into a dataframe ...
dfRead = pd.read_csv('ohlcvData.zip', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')
... I get the following error...
*File "C:\Users\jeffm\AppData\Roaming\Python\Python37\site-packages\pandas\io\common.py", line 618, in get_handle
"Multiple files found in ZIP file. "
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['ohlcv.csv', 'ohlcv.csv']*
However, when I reduce the number of rows in the input file from 200 to around 175 (for this file structure it varies slightly how many lines I have to remove, depending on the data), it works and produces a zip file containing one csv file, which can be loaded back into a dataframe without error. I have tried many different files, with different data and formats, and I still get the same result -- any file with over (approx) 175 lines fails and any file with fewer works fine. So it looks like it's splitting the file after a certain size, but from the docs there doesn't appear to be such a setting. Any help on this would be appreciated. Thanks.

This appears to be a bug introduced in 1.2.0. I created a minimal reproducing example and posted an issue: https://github.com/pandas-dev/pandas/issues/39190
import pandas as pd
# enough data to cause chunking into multiple files
n_data = 100000
df = pd.DataFrame(
    {'name': ["Raphael"] * n_data,
     'mask': ["red"] * n_data,
     'weapon': ["sai"] * n_data,
     }
)
compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('out.csv.zip', index=False, compression=compression_opts)
# reading back the data produces an error
r_df = pd.read_csv("out.csv.zip")
# passing in compression_opts doesn't work either
r_df = pd.read_csv("out.csv.zip", compression=compression_opts)

Looks like this may be a recent Pandas bug. I was having the same issue in Pandas 1.2.0. Reverting to 1.1.3 (i.e. what I was using before) solved the issue. I haven't tested 1.1.4 and 1.1.5.
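Until the fix is released, one workaround is to build the zip archive yourself with the zipfile module, the same idea as in the "Pandas save CSV ZIP with proper internal name" answer further down. A minimal sketch with example file and member names:
import zipfile
import pandas as pd

df = pd.DataFrame({'name': ["Raphael"] * 100000,
                   'mask': ["red"] * 100000,
                   'weapon': ["sai"] * 100000})

# write the CSV text into a single, explicitly named archive member
with zipfile.ZipFile('out.csv.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('out.csv', df.to_csv(index=False))

# the archive now contains exactly one member, so it reads back cleanly
r_df = pd.read_csv('out.csv.zip')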

Related

How to read data from multiple folders from ADLS into a Databricks dataframe

The file path format is data/year/weeknumber/day number/data_hour.parquet
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How can I read all these files one by one in a Databricks notebook and store them into a dataframe?
import pandas as pd
# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)
# Create the list of file paths
file_list = df.path.tolist()
for i in file_list:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am only able to read the last file; the other files are skipped.
The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration.
Removing the for loop and trying the code below would give you an idea of how file masking with asterisks works. Note that the path should be a full path. (I'm not sure if the data folder is your root folder or not.)
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
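If you do need to read the files one by one (for example to process each part separately), the loop has to accumulate the result instead of reassigning df each time. A rough sketch, assuming spark is the Databricks-provided session and using the paths listed in the question:
paths = [
    '/data/2022/05/01/00/data_00.parquet',
    '/data/2022/05/01/01/data_01.parquet',
    '/data/2022/05/01/02/data_02.parquet',
]

# spark.read.parquet accepts several paths at once, so no loop is needed
df = spark.read.parquet(*paths)

# or, when looping, keep a running union instead of overwriting df
df = None
for p in paths:
    part = spark.read.load(path=p, format='parquet')
    df = part if df is None else df.unionByName(part)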

'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte [duplicate]

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...
File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
data = pd.read_csv(filepath, names=fields)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
return parser.read()
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
The source/creation of these files all come from the same place. What's the best way to correct this to proceed with the import?
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.
You can also use one of several alias options like 'latin' or 'cp1252' (Windows) instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).
See relevant Pandas documentation,
python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.
To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
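For example, a minimal round trip (the file names here are just placeholders): read with an explicit single-byte encoding, then write back out as UTF-8.
import pandas as pd

df = pd.read_csv('file.csv', encoding='ISO-8859-1')
df.to_csv('file_utf8.csv', encoding='utf-8', index=False)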
Simplest of all Solutions:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
Alternate Solution:
Sublime Text:
Open the csv file in Sublime text editor or VS Code.
Save the file in utf-8 format.
In Sublime, click File -> Save with encoding -> UTF-8
VS Code:
In the bottom bar of VSCode, you'll see the label UTF-8. Click it. A popup opens. Click Save with encoding. You can now pick a new encoding for that file.
Then, you could read your file as usual:
import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')
Other encodings you can try are:
encoding = "cp1252"
encoding = "ISO-8859-1"
Pandas allows you to specify the encoding, but it does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.
You know the encoding, and there is no encoding error in the file.
Great: you just have to specify the encoding:
file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code):
pd.read_csv(input_file_and_path, ..., encoding='latin1')
You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is a UTF8 file that has been edited with a non-UTF8 editor and which contains some lines with a different encoding. Pandas has no provision for special error processing, but the Python open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:
file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
pd.read_csv(input_fd, ...)
with open('filename.csv') as f:
    print(f)
After executing this code you will find the encoding of 'filename.csv' (the printed file object includes an encoding attribute); then execute the code as follows:
data = pd.read_csv('filename.csv', encoding="encoding as you found earlier")
there you go
This is a more general script approach for the stated question.
import pandas as pd
encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737'
, 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862'
, 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950'
, 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254'
, 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr'
, 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2'
, 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2'
, 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9'
, 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab'
, 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2'
, 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32'
, 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']
for encoding in encoding_list:
    worked = True
    try:
        df = pd.read_csv(path, encoding=encoding, nrows=5)
    except:
        worked = False
    if worked:
        print(encoding, ':\n', df.head())
One starts with all the standard encodings available for the Python version (in this case the Python 3.7 standard encodings).
A usable python list of the standard encodings for the different python version is provided here: Helpful Stack overflow answer
Trying each encoding on a small chunk of the data;
only printing the working encoding.
The output is directly obvious.
This output also addresses the problem that an encoding like 'latin1', which runs through without any error, does not necessarily produce the wanted outcome.
In the case of the question, I would try this approach specifically for the problematic CSV file and then maybe try to use the working encoding found for all the others.
Please try to add
import pandas as pd
df = pd.read_csv('file.csv', encoding='unicode_escape')
This will help. Worked for me. Also, make sure you're using the correct delimiter and column names.
You can start with loading just 1000 rows to load the file quickly.
Try changing the encoding.
In my case, encoding = "utf-16" worked.
df = pd.read_csv("file.csv",encoding='utf-16')
In my case, a file has UCS-2 LE BOM encoding, according to Notepad++.
It is encoding="utf_16_le" for Python.
Hope it helps someone find an answer a bit faster.
Try specifying the engine='python'.
It worked for me but I'm still trying to figure out why.
df = pd.read_csv(input_file_path,...engine='python')
In my case this worked for python 2.7:
data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False)
And for python 3, only:
data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False)
You can always try to detect the encoding of the file first, with chardet or cchardet or charset-normalizer:
from pathlib import Path
import chardet
filename = "file_name.csv"
detected = chardet.detect(Path(filename).read_bytes())
# detected is something like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
encoding = detected.get("encoding")
assert encoding, "Unable to detect encoding, is it a binary file?"
df = pd.read_csv(filename, encoding=encoding)
Struggled with this a while and thought I'd post on this question as it's the first search result. Adding encoding="iso-8859-1" to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.
If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
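A minimal sketch of that point (the file name and encoding are just examples):
import pandas as pd

# the encoding belongs on open(), not on read_csv, when a handle is passed in
with open('data.csv', encoding='iso-8859-1') as handle:
    df = pd.read_csv(handle)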
I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you're going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you'll hit the following error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte
Fortunately, there are a few solutions.
Option 1, fix the exporting. Be sure to use UTF-8 encoding.
Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following parameter: engine='python'. By default, pandas uses engine='C', which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.
pd.read_csv(<your file>, engine='python')
Option 3: solution is my preferred solution personally. Read the file using vanilla Python.
import pandas as pd

data = []
with open(<your file>, "rb") as myfile:
    # read the header separately:
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns=header)
Hope this helps people encountering this issue for the first time.
Another important issue that I faced which resulted in the same error was:
_values = pd.read_csv(r"C:\Users\Mujeeb\Desktop\file.xlsx")
This line resulted in the same error because I was reading an Excel file using the read_csv() method. Use read_excel() for reading .xlsx files.
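A minimal sketch of the fix (same path as above, assuming the intended extension is .xlsx):
import pandas as pd

# read_excel is the right reader for Excel workbooks
_values = pd.read_excel(r"C:\Users\Mujeeb\Desktop\file.xlsx")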
You can try this.
import pandas as pd
df = pd.read_csv(filepath, encoding='unicode_escape')
I had trouble opening a CSV file in simplified Chinese downloaded from an online bank.
I tried latin1, iso-8859-1, and cp1252, all to no avail.
But pd.read_csv("", encoding='gbk') simply does the work.
This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:
>>> from csv import DictReader
>>> f = open(filename, "r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])
Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:
Python read csv - BOM embedded into the first key
The solution is to load the CSV with encoding="utf-8-sig":
>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])
Hopefully this helps someone.
I am posting an update to this old thread. I found one solution that worked, but it requires opening each file. I opened my csv file in LibreOffice and chose Save As > Edit filter settings. In the drop-down menu I chose UTF-8 encoding. Then I added encoding="utf-8-sig" when reading: data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep=',', encoding="utf-8-sig").
Hope this helps someone.
I am using Jupyter Notebook. In my case, it was showing the file in the wrong format, and the 'encoding' option was not working.
So I saved the csv in utf-8 format, and it works.
Try this:
import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)
Looks like it will take care of the encoding without explicitly specifying it through an argument.
Check the encoding before you pass to pandas. It will slow you down, but...
with open(path, 'r') as f:
    encoding = f.encoding
df = pd.read_csv(path, sep=sep, encoding=encoding)
(In Python 3.7.)
Sometimes the problem is with the .csv file only. The file may be corrupted.
When faced with this issue, 'Save As' the file as csv again.
1. Open the xls/csv file
2. Go to File
3. Click Save As
4. Write the file name
5. Choose 'file type' as CSV [very important]
6. Click OK
In my case, I could not manage to overcome this issue using any method provided before. Changing the encoder type to utf-8, utf-16, iso-8859-1, or any other type somehow did not work.
But instead of using pd.read_csv(filename, delimiter=';'), I used;
pd.read_csv(open(filename, 'r'), delimiter=';')
and things seem to work just fine.
You can try with:
df = pd.read_csv('./file_name.csv', encoding='gbk')
Pandas will not automatically replace the offending bytes, whatever encoding style you pass. In my case, changing the encoding parameter from encoding="utf-8" to encoding="utf-16" resolved the issue.

Appending csv files in directory into a pandas dataframe

I have written a scraper which downloads daily flight prices, stores them as pandas data frames and saves them off as csv files in a given folder. I am now trying to combine these csv files into pandas for data analysis using append, but end result is an empty data frame.
Specifically, individual csv files are loaded correctly into pandas, but the append seems to fail (and several methods found on stackoverflow posts don't seem to work). Code is below, any pointers? Thanks!
directory = os.path.join("C:\\Testfolder\\")
for root, dirs, files in os.walk(directory):
    for file in files:
        daily_flight_df = pd.read_csv(directory + file, sep=";")  # loads csv into dataframe - works correctly
        cons_flight_df.append(daily_flight_df)  # appends daily flight prices into a consolidated dataframe - does not seem to work
print(cons_flight_df)  # currently prints out an empty data frame
cons_flight_df.to_csv('C:\\Testfolder\\test.csv')  # currently returns an empty csv file
In pandas, the append method isn't in place. You need to assign it.
cons_flight_df = cons_flight_df.append(daily_flight_df)
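A sketch of the corrected loop from the question, with the result assigned back each time:
import os
import pandas as pd

directory = os.path.join("C:\\Testfolder\\")
cons_flight_df = pd.DataFrame()
for root, dirs, files in os.walk(directory):
    for file in files:
        daily_flight_df = pd.read_csv(os.path.join(root, file), sep=";")
        # append returns a new dataframe, so assign it back
        cons_flight_df = cons_flight_df.append(daily_flight_df)

cons_flight_df.to_csv('C:\\Testfolder\\test.csv')
On newer pandas versions where DataFrame.append has been removed, collect the frames in a list and call pd.concat(frames) once instead.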

open all csv files with different structure at the same time in pandas

I have a text file and I am using pandas to process it. I am trying to skip the first few rows while opening the file. I don't have a fixed number of lines to remove in the different files (I am making a script), so I can't just use the skiprows argument. But I know the line that should become the first line of the new dataframe. All the files have this line in common, and I want to remove every line before it in all of them.
small example:
ID,Sample01
Owner,Administrator
ID,identifier
IL21R,6
CD84,21
KLRC2,9
TNFRSF11A,18
and here is the expected output:
ID,identifier
IL21R,6
CD84,21
KLRC2,9
TNFRSF11A,18
the common line among all files is:
ID,identifier
The following code works for the small example since there are only 2 lines above the common line. How can I change the code so it also works for files with more than 2 lines above the common line?
df = pd.read_csv(filename, skiprows=2, sep=',')
Logic - read the file into a list and find the index of 'ID,identifier'.
Then use that value as skiprows.
with open(r'file_path') as f:
    total = f.readlines()
skip_value = total.index('ID,identifier\n')

import pandas as pd
df = pd.read_csv(r'file_path', skiprows=skip_value, sep=',')
print(df)

Pandas save CSV ZIP with proper internal name

I'm running on Pandas 0.23.4.
I have a DataFrame called df. On it, I invoke:
df.to_csv('name.csv.zip', compression='zip')
This creates a zip file called name.csv.zip. Inside it, however, the CSV file is called name.csv.zip and not name.csv. How can I correct this?
In pandas 0.24, there is a new to_csv keyword compression='infer' which will look at the suffix of the file being saved. Unfortunately, it doesn't work that great with zip archives, because the name of the file being saved is used as the name of the member of the zip archive, and it is unclear how to provide archive member names. So what happens is you get the replace df.csv.zip? [y]es, [n]o, [A]ll, [N]one, [r]ename: prompt on extraction and are left to rename the members of the archive yourself. This also happens when infer is not used and a name and a compression method of zip are used instead.
saving df.csv with compression zip gives df.csv with df.csv in it - the archive does not get a .zip suffix. Can be annoying to someone trying to use the file.
saving df.csv.zip with compression zip gives df.csv.zip with df.csv.zip as the archive member name. Can be annoying when extracting, because there is then an archive/member name collision.
Yet a zip archive can be constructed with proper zip archive member names.
import pandas as pd
import zipfile as zf
from pandas.compat import StringIO
print(pd.__version__)
csvdata = StringIO("""index,id1,id2,timestamp,number
465,255,3644,2019-05-02 08:00:20.137000,123123
62,87,912,2019-05-02 5:00:00,435456
""")
# prep dataframe
df = pd.read_csv(csvdata, sep=",")
with zf.ZipFile('archive.zip', 'w') as myziparchive:
    myziparchive.writestr('df.csv', df.to_csv())
file archive.zip
archive.zip: Zip archive data, at least v2.0 to extract
Richs-MBP:pandas_examples randrews$ zip --show-files archive.zip
Archive contains:
df.csv
Total 1 entries (119 bytes)
And more than one dataframe can be placed inside.
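For example, packing two frames into one archive works the same way (df2 here is just a second example frame, reusing df and zf from above):
df2 = df.copy()
with zf.ZipFile('archive_multi.zip', 'w') as myziparchive:
    myziparchive.writestr('df.csv', df.to_csv())
    myziparchive.writestr('df2.csv', df2.to_csv())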