How do I read a gzipped XLSX file in Julia? - dataframe

I have a gz file which I downloaded using HTTP.jl (the URL is in the code below). Now I want to read the xlsx file contained in the gz file and convert it to a DataFrame. I tried this:
julia> using HTTP, XLSX, DataFrames, GZip
julia> file = HTTP.get("http://www.tsetmc.com/tsev2/excel/IntraDayPrice.aspx?i=35425587644337450&m=30")
julia> write("c:/users/shayan/desktop/file.xlsx.gz", file.body);
julia> df = GZip.open("c:/users/shayan/desktop/file.xlsx.gz", "r") do io
           XLSX.readxlsx(io)
       end
But this throws a MethodError:
ERROR: MethodError: no method matching readxlsx(::GZipStream)
Closest candidates are:
readxlsx(::AbstractString) at C:\Users\Shayan\.julia\packages\XLSX\FFzH0\src\read.jl:37
Stacktrace:
[1] (::var"#23#24")(io::GZipStream)
# Main c:\Users\Shayan\Documents\Python Scripts\test.jl:15
[2] gzopen(::var"#23#24", ::String, ::String)
# GZip C:\Users\Shayan\.julia\packages\GZip\JNmGn\src\GZip.jl:269
[3] open(::Function, ::Vararg{Any})
# GZip C:\Users\Shayan\.julia\packages\GZip\JNmGn\src\GZip.jl:265
[4] top-level scope
# c:\Users\Shayan\Documents\Python Scripts\test.jl:14

XLSX.jl does not work on streams, so you need to ungzip the file to a temporary location and then read it.
tname = tempname() * ".xlsx"
GZip.open("c://temp//journals.xlsx.gz", "r") do io
    open(tname, "w") do out
        write(out, read(io))
    end
end
df = XLSX.readxlsx(tname)

Related

Error while converting csv to parquet file using pandas

I would like to upload a csv as a parquet file to an S3 bucket. Below is the code snippet.
import pandas as pd
from io import BytesIO

df = pd.read_csv('right_csv.csv')
csv_buffer = BytesIO()
df.to_parquet(csv_buffer, compression='gzip', engine='fastparquet')
csv_buffer.seek(0)
Above is giving me an error: TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
How to make it work?
As per the documentation, when fastparquet is used as the engine, io.BytesIO cannot be used; the auto or pyarrow engines have to be used instead. Quoting from the documentation:
The engine fastparquet does not accept file-like objects.
The code below works without any issues.
import io
f = io.BytesIO()
df.to_parquet(f, compression='gzip', engine='pyarrow')
f.seek(0)
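Since the original goal was to land the parquet bytes in an S3 bucket, a minimal follow-up sketch using boto3 might look like the following; the bucket and key names here are hypothetical placeholders:
import boto3

s3 = boto3.client("s3")
# f is the BytesIO buffer from the snippet above; getvalue() returns its full contents
s3.put_object(Bucket="my-bucket", Key="right_csv.parquet.gz", Body=f.getvalue())
The buf buffer produced by the NamedTemporaryFile workaround below can be uploaded the same way.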
As mentioned in the other answer, this is not supported. One workaround would be to save as parquet to a NamedTemporaryFile, then copy the content to a BytesIO buffer:
import io
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    df.to_parquet(tmp.name, compression='gzip', engine='fastparquet')
    with open(tmp.name, 'rb') as fh:
        buf = io.BytesIO(fh.read())

Python reading csv file inside a subfolder in a zipped folder

I am trying the following:
import pandas as pd
loc = r'T:\Analysis\calibraer19.zip\col1\profiles\myfile.csv'
pd.read_csv(loc)
But I keep getting a "file does not exist" error. I am not sure how to read this file, as the zip archive is very large with hundreds of files in it, so unzipping everything is not a good option.
You can use the zipfile library to extract only the file you want to read:
import zipfile
import pandas as pd

with zipfile.ZipFile(r'T:\Analysis\calibraer19.zip') as z:
    with open('myfile.csv', 'wb') as f:
        # member names inside a zip archive use forward slashes
        f.write(z.read('col1/profiles/myfile.csv'))
df = pd.read_csv('myfile.csv')
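If you would rather not write an intermediate copy of the csv to disk, a sketch of the same idea streams the zip member straight into pandas (pd.read_csv accepts file-like objects, and member paths inside a zip archive use forward slashes):
import zipfile
import pandas as pd

with zipfile.ZipFile(r'T:\Analysis\calibraer19.zip') as z:
    # z.open returns a file-like object that pd.read_csv can consume directly
    with z.open('col1/profiles/myfile.csv') as member:
        df = pd.read_csv(member)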
You can try the following approach with the zipfile module:
import zipfile
import pandas as pd

with zipfile.ZipFile("Desktop.zip") as z:
    data = z.read("pandas_test_data.csv").decode("utf-8-sig")
lines = (elem for elem in data.split("\r\n"))
# lines = (elem for elem in data.split("\n")) if your csv contains \n instead of \r\n
rows_of_data = (elem.split(",") for elem in lines)
df = pd.DataFrame(rows_of_data)
You read the data once and then simply create generators for subsequent steps. The generators can be consumed by the pandas DataFrame class's constructor.
Note: I added the decode("utf-8-sig") since I have encountered UTF BOM characters when reading zip files.
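If the member is a plain comma-separated file, a simpler sketch of the same idea hands the decoded text to pd.read_csv via io.StringIO, which also gives you quoting, header, and dtype handling for free:
import io
import zipfile
import pandas as pd

with zipfile.ZipFile("Desktop.zip") as z:
    data = z.read("pandas_test_data.csv").decode("utf-8-sig")
df = pd.read_csv(io.StringIO(data))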

How to make pandas read csv input as string, not as an url

I'm trying to load a csv (from an API response) into pandas, but keep getting errors:
"ValueError: stat: path too long for Windows" and "FileNotFoundError: [Errno 2] File b'"fwefwe","fwef..."
indicating that pandas interprets it as a file path or URL, not as the csv content itself.
The code below causes the errors above.
fake_csv='"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(fake_csv, encoding='utf8')
df
How do I force pandas to interpret my argument as a csv string?
You can do that using StringIO:
import io
import pandas as pd

fake_csv = '"fwefwe","fwefw","fwefew";"2","5","7"'
df = pd.read_csv(io.StringIO(fake_csv), encoding='utf8', sep=',', lineterminator=';')
df
Result:
Out[30]:
  fwefwe fwefw fwefew
0      2     5      7

Read multiple parquet files in a folder and write to single csv file using python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.
I need to read these parquet files starting from file1 in order and write them to a single csv file. After writing the contents of file1, the contents of file2 should be appended to the same csv without the header. Note that all files have the same column names and only the data is split into multiple files.
I learnt to convert single parquet to csv file using pyarrow with the following code:
import pandas as pd
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')
But I couldn't extend this to loop over multiple parquet files and append them to a single csv.
Is there a method in pandas to do this? Any other way to do this would also be of great help. Thank you.
I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on the size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0  # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a'  # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file is to open the target CSV file in "a+" mode at the outset, keeping the file handle scanned to the end of the file for each write/append (I believe this works, but haven't actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0  # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
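One caveat for all of the variants above, given the question's par_file1 ... par_file100 naming: data_dir.glob('*.parquet') makes no ordering guarantee, and a plain lexicographic sort would put par_file10 before par_file2. A small sketch of a numeric sort key, assuming the file names follow the question's pattern:
import re
from pathlib import Path

data_dir = Path('dir/to/parquet/files')

def numeric_key(path):
    # pull the trailing number out of names like 'par_file12.parquet'
    match = re.search(r'(\d+)', path.stem)
    return int(match.group(1)) if match else 0

parquet_files = sorted(data_dir.glob('*.parquet'), key=numeric_key)
Any of the snippets above can then iterate over parquet_files instead of data_dir.glob('*.parquet').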
I had a similar need, and I read that the current pandas version supports a directory path as the argument for the read_parquet function. So you can read multiple parquet files like this:
import pandas as pd
df = pd.read_parquet('path/to/the/parquet/files/directory')
It concatenates everything into a single DataFrame, so you can convert it to a csv right after:
df.to_csv('csv_file.csv')
Make sure you have the following dependencies according to the doc:
pyarrow
fastparquet
This helped me to load all the parquet files into one data frame:
import glob
import pandas as pd

files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f, engine='fastparquet') for f in files]
merged_data = pd.concat(data, ignore_index=True)
If you are going to copy the files over to your local machine and run your code, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files. It also assumes the naming of files as you provided above ("par_file1, par_file2, par_file3 and so on up to 100 files in a folder"). If you need to search for your files, you will need to get the file names using glob and explicitly provide the path where you want to save the csv: open(r'this\is\your\path\to\csv_file.csv', 'a'). Hope this helps.
import pandas as pd

# Create an empty csv file and write the first parquet file with headers
with open('csv_file.csv', 'w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet appended to csv_file.csv\n')

# create your file names and append to an empty list to look for in the current directory
files = []
for i in range(2, 101):
    files.append(f'par_file{i}.parquet')

# open files and append to csv_file.csv
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv', 'a') as file:
        df.to_csv(file, header=False, index=False)
        print(f'{f} appended to csv_file.csv\n')
You can remove the print statements if you want.
Tested in python 3.6 using pandas 0.23.3
A small change for those trying to read remote files, which helps to read them faster (calling read_parquet directly on the remote files was much slower for me):
import io
import pandas as pd

merged = []
# remote_reader = ... <- init some remote reader, for example AzureDLFileSystem()
for f in files:
    with remote_reader.open(f, 'rb') as f_reader:
        merged.append(f_reader.read())
merged = pd.concat((pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged))
Adds a little temporary memory overhead though.
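If the downloads themselves are the bottleneck, a further (untested) sketch along the same lines fetches the files concurrently with a thread pool; remote_reader and files are assumed to be set up exactly as in the snippet above:
import io
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def fetch(path):
    # download one remote parquet file into memory as raw bytes
    with remote_reader.open(path, 'rb') as f_reader:
        return f_reader.read()

with ThreadPoolExecutor(max_workers=8) as pool:
    blobs = list(pool.map(fetch, files))

merged = pd.concat(pd.read_parquet(io.BytesIO(b)) for b in blobs)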
You can use Dask to read in the multiple Parquet files and write them to a single CSV.
Dask accepts an asterisk (*) as wildcard / glob character to match related filenames.
Make sure to set single_file to True and index to False when writing the CSV file.
import pandas as pd
import numpy as np
# create some dummy dataframes using np.random and write to separate parquet files
rng = np.random.default_rng()
for i in range(3):
    df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)), columns=list('ABCD'))
    df.to_parquet(f"dummy_df_{i}.parquet")
# load multiple parquet files with Dask
import dask.dataframe as dd
ddf = dd.read_parquet('dummy_df_*.parquet', index=False)
# write to single csv
ddf.to_csv("dummy_df_all.csv",
           single_file=True,
           index=False)
# test to verify
df_test = pd.read_csv("dummy_df_all.csv")
Using Dask for this means you won't have to worry about the resulting file size (Dask processes the data in partitions rather than loading it all into memory at once, whereas pandas might throw a MemoryError if the resulting DataFrame is too large), and you can easily read from and write to cloud data storage like Amazon S3.

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing an .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da
data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2, 2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)
data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply reuse the lines of code that come after loading the "info" file, assuming you have the right values. Some of these values, i.e., the dtype and the shape of each array for building chunks, could presumably be obtained by looking at the first of the data files:
import os
from itertools import product
import numpy as np
from dask.array.core import Array

# dirname, chunks, dtype, axis and mmap_mode must be supplied by you
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
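Alternatively, since the info file is just the small pickled dict shown above, a workaround is to infer the metadata from the first array, write the info file yourself, and then let da.from_npy_stack do the rest. This sketch assumes the files are named 0.npy, 1.npy, ... (as from_npy_stack expects) and that every block has the same shape and dtype:
import glob
import os
import pickle

import numpy as np
import dask.array as da

dirname = '/home/tom/data'

# sort the .npy files numerically: 0.npy, 1.npy, 2.npy, ...
npy_files = sorted(glob.glob(os.path.join(dirname, '*.npy')),
                   key=lambda p: int(os.path.splitext(os.path.basename(p))[0]))
first = np.load(npy_files[0])

axis = 0
# one chunk per file along `axis`, the full extent along the remaining axes
chunks = ((first.shape[axis],) * len(npy_files),) + tuple((s,) for s in first.shape[1:])

with open(os.path.join(dirname, 'info'), 'wb') as f:
    pickle.dump({'chunks': chunks, 'dtype': first.dtype, 'axis': axis}, f)

data = da.from_npy_stack(dirname)
If the blocks are not all the same size along the stacking axis, you would instead have to load each file's shape and build the chunks[axis] entry file by file.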