How can I read many large .7z files containing many CSV files? - pandas

I have many .7z files, each containing several large CSV files (more than 1 GB each). How can I read these in Python (in particular into pandas and Dask dataframes)? Should I change the compression format to something else?

I believe you should be able to open the file using

import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)

This is, strictly speaking, meant for the xz file format, but it may work for 7z too. If not, you will need to use libarchive.
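If lzma refuses the container, one possibility is the libarchive-c bindings; this is a minimal sketch under that assumption, and the archive and member names are placeholders rather than code from the original answer:

import io

import libarchive  # pip install libarchive-c
import pandas as pd

frames = []
with libarchive.file_reader("myfile.7z") as archive:
    for entry in archive:
        if entry.pathname.endswith(".csv"):
            # archive entries are streamed; buffer one CSV into memory
            data = b"".join(entry.get_blocks())
            frames.append(pd.read_csv(io.BytesIO(data)))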
For use with Dask, you can do the above for each file with dask.delayed.
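A minimal sketch of that approach, assuming (as above) that each archive is really a single xz-compressed CSV stream; the glob pattern is a placeholder:

import glob
import lzma

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_one(path):
    # read one compressed CSV into a pandas dataframe
    with lzma.open(path, "rt") as f:
        return pd.read_csv(f)

parts = [load_one(path) for path in glob.glob("myfiles.*.7z")]
df = dd.from_delayed(parts)  # meta is inferred from the first partition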
dd.read_csv also allows you to specify storage_options={'compression': 'xz'} directly; however, random access within a compressed file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
df = dd.read_csv('myfiles.*.7z', storage_options={'compression': 'xz'},
                 blocksize=None)

Related

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but the issue is that it is written twice (i.e., once with the actual data and once empty). I have checked the data frame by printing it and it looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema=op_df.columns)
df.write.option("header", "true").csv("s3://" + src_bucket_name + "/src/output/" + row.brand + '/' + fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header", "true").csv("s3://" + src_bucket_name + "/src/output/" + row.brand + '/' + fileN)
Note that having one partition doesn't necessarily mean the output will be one file, since this also depends on the spark.sql.files.maxRecordsPerFile configuration. Assuming that config is set to 0 (the default), you should get only one file in the output.
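To rule that configuration out, you can check or set it explicitly; a small sketch, assuming a live SparkSession named spark:

# 0 (the default) means no per-file record cap, so one partition
# produces one output file
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)
print(spark.conf.get("spark.sql.files.maxRecordsPerFile"))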

Not able to filter files using pathGlobFilter

We are trying to read files from a directory in Azure Blob Storage based on a pattern, using the pathGlobFilter option to select files. The directory contains the following files:
Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
We need to process only those files that do not have "T" in the file name, i.e. only these two files:
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
But we are not able to read only these two files.
Here is the code:

df = spark.read.format("csv").schema(structSchema).options(header=False, inferSchema=True, sep='|', pathGlobFilter="Sales_\d{5}_\d{8}_[a-z0-9]+.csv$").load("wasbs://abc#xxxxx.blob.core.windows.net/abc/2022/02/11/")
Glob patterns are not standard regular expressions; there are differences between them. For example, glob has no quantifier syntax like \d{5} to match a token a fixed number of times. For details, see here.
Back to this question: a somewhat clumsy workaround is below; perhaps someone can offer a more elegant solution.
pathGlobFilter="Sales_[0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[a-z0-9]*.csv"

trouble with utf-8 with julia and jupyterlab

I'm reading the csv file at https://github.com/VinitaSilaparasetty/julia-beginners/blob/master/data/nba/nba19-20.csv
I get a DataFrame and save it as XLSX. When I try to open the result in JupyterLab I get the error that the file is not UTF-8 encoded, and therefore the file is not read.
This is my code:
using HTTP, XLSX, CSV, DataFrames
df = CSV.read(HTTP.get("https://raw.githubusercontent.com/VinitaSilaparasetty/julia-beginners/master/data/nba/nba19-20.csv").body, DataFrame)
# first(df,5) # first shows the top five rows ok
XLSX.writetable("data/nba/nba19-20.XLSX", collect(eachcol(df)), names(df), overwrite = true)
The file is saved in my data folder. When I try to open it with JupyterLab, I get a pop-up saying the file is not UTF-8 encoded, and the file is not opened.
When I try to open the file in Ubuntu (with LibreOffice) I do not see anything suspicious.
As I'm new to Julia I'm struggling to understand where the problem lies or how to fix it.
I tried to see if I could encode the dataframe in UTF-8 (after saving the file to disk) with
data = DataFrame(CSV.File(open(read,"data/nba/nba19-20.csv", enc"utf-8")))
But I did not see any change. Any suggestion is welcome.
Do you have the jupyterlab-spreadsheet plugin installed? JupyterLab by default doesn't support opening xlsx files (it isn't mentioned in the file formats list here for example).
See also this similar question involving Python pandas (which says pretty much the same thing).

pandas.read_csv of a gzip file within a zipped directory

I would like to use pandas.read_csv to open a gzip file (.asc.gz) within a zipped directory (.zip). Is there an easy way to do this?
This code doesn't work:
csv = pd.read_csv(r'C:\folder.zip\file.asc.gz')  # can't find the file
This code does work (however, it requires me to unzip the folder, which I want to avoid because my dataset currently contains thousands of zipped folders):
csv = pd.read_csv(r'C:\folder\file.asc.gz')
I have tried using a combination of zipfile.ZipFile and read_csv, but have been unsuccessful (I think partly because this is an ASCII file as well).
Maybe the following will help:

df = pd.read_csv('filename.gz', compression='gzip')

or

import gzip
file = gzip.open('filename.gz', 'rb')
content = file.read()
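Since the question is specifically about a gzip file sitting inside a zip archive, here is a hedged sketch that chains zipfile and gzip so nothing has to be extracted to disk (the paths are the placeholders from the question):

import gzip
import zipfile

import pandas as pd

with zipfile.ZipFile(r'C:\folder.zip') as zf:
    # open the gzipped member as a file-like object inside the archive
    with zf.open('file.asc.gz') as member:
        with gzip.open(member, 'rt') as f:
            df = pd.read_csv(f)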

How does numpy handle mmap's over npz files?

I have a case where I would like to open a compressed numpy file using mmap mode, but can't seem to find any documentation about how it will work under the covers. For example, will it decompress the archive in memory and then mmap it? Will it decompress on the fly?
The documentation doesn't cover that combination.
The short answer, based on looking at the code, is that archiving and compression, whether using np.savez or gzip, is not compatible with accessing files in mmap_mode. It's not just a matter of how it is done, but whether it can be done at all.
Relevant bits in the np.load function:

elif isinstance(file, gzip.GzipFile):
    fid = seek_gzip_factory(file)
...
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp)
...
if mmap_mode:
    return format.open_memmap(file, mode=mmap_mode)
Look at np.lib.npyio.NpzFile. An npz file is a ZIP archive of .npy files. It loads a dictionary-like object, and only loads the individual variables (arrays) when you access them (e.g. obj[key]). There's no provision in its code for opening those individual files in mmap_mode.
It's pretty obvious that a file created with np.savez cannot be accessed as a mmap. The ZIP archiving and compression is not the same as the gzip compression handled earlier in np.load.
But what of a single array saved with np.save and then gzipped? Note that format.open_memmap is called with file, not fid (which might be a gzip file).
More details on open_memmap can be found in np.lib.npyio.format. Its first test is that file must be a string, not an existing file object. It ends up delegating the work to np.memmap, and I don't see any provision in that function for gzip.
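A quick experiment illustrates this; a minimal sketch, with the caveat that the exact behavior of mmap_mode on an npz file (silently ignored versus raising) may vary across NumPy versions:

import numpy as np

a = np.arange(10)
np.save("single.npy", a)      # plain .npy file: mmap works
np.savez("archive.npz", a=a)  # ZIP archive of .npy files

m = np.load("single.npy", mmap_mode="r")
print(type(m))       # <class 'numpy.memmap'>

z = np.load("archive.npz", mmap_mode="r")
print(type(z["a"]))  # <class 'numpy.ndarray'> -- the member is read into
                     # memory; mmap_mode has no effect on npz members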