Pandas save CSV ZIP with proper internal name

I'm running on Pandas 0.23.4.
I have a DataFrame called df. On it, I invoke:
df.to_csv('name.csv.zip', compression='zip')
This creates a zip file called name.csv.zip. Inside it, however, the CSV file is called name.csv.zip and not name.csv. How can I correct this?

In pandas 0.24, there is a new to_csv compression option, 'infer', which looks at the suffix of the file being saved. Unfortunately it does not work that well with zip archives, because the name of the file being saved is also used as the name of the member inside the zip archive, and there is no obvious way to provide a different member name. So on extraction you get the prompt replace df.csv.zip? [y]es, [n]o, [A]ll, [N]one, [r]ename: and are left to rename the archive members yourself. The same thing happens when infer is not used and a file name plus compression='zip' are given instead.
Saving to df.csv with zip compression gives an archive named df.csv with df.csv inside it - the archive does not get a .zip suffix, which can be annoying to someone trying to use the file.
Saving to df.csv.zip with zip compression gives df.csv.zip with df.csv.zip as the archive member name, which is annoying when extracting because the archive name and the member name then collide.
Yet a zip archive can be constructed with proper zip archive member names.
import pandas as pd
import zipfile as zf
from pandas.compat import StringIO
print(pd.__version__)
csvdata = StringIO("""index,id1,id2,timestamp,number
465,255,3644,2019-05-02 08:00:20.137000,123123
62,87,912,2019-05-02 5:00:00,435456
""")
# prep dataframe
df = pd.read_csv(csvdata, sep=",")
with zf.ZipFile('archive.zip', 'w') as myziparchive:
    myziparchive.writestr('df.csv', df.to_csv())
$ file archive.zip
archive.zip: Zip archive data, at least v2.0 to extract
$ zip --show-files archive.zip
Archive contains:
  df.csv
Total 1 entries (119 bytes)
And more than one dataframe can be placed inside.
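On recent pandas versions (the same mechanism used in the last question on this page), the member name can also be set directly from to_csv by passing a dict as the compression argument. A minimal sketch, assuming the df built above:
df.to_csv('name.csv.zip',
          compression={'method': 'zip', 'archive_name': 'name.csv'})
# the archive is name.csv.zip and the single member inside it is name.csv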

Related

combine all csv in various subfolders in pandas or in powershell/terminal and create a pandas dataframe

I have individual csv files within subfolders of subfolders: year folders contain month folders, each month folder contains day folders, and within each day folder is an individual csv. I would like to combine all of the individual csv files into one and create a pandas df.
In the tree diagram, it looks like this:
I tried this approach below but nothing was created:
import pandas as pd
import glob
path = r'~/root/up/to/the/folder/2022'
alldata = glob.glob(path + "each*.csv")
alldata.head()
I initially had it just looking for "each*.csv" files, but realized something is missing in between in order to reach the individual csv files within each folder. Maybe a for loop would work, like looping through each folder within each subfolder, but that is where I am stuck right now.
The answer to this: Combining separate daily CSVs in pandas only covers files that are in the same folder.
I tried to make sense of this answer: batch file to concatenate all csv files in a subfolder for all subfolders, but it just doesn't click for me.
I also tried the following as suggested in Python importing csv files within subfolders
import os
import pandas as pd
path = '<Insert Path>'
file_extension = '.csv'
csv_file_list = []
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(file_extension):
            file_path = os.path.join(root, name)
            csv_file_list.append(file_path)
dfs = [pd.read_csv(f) for f in csv_file_list]
but nothing is showing; I think there is something wrong with the path, given the tree shown above.
Or maybe there is a further step I need to do, because when I ran dfs.head() it says AttributeError: 'list' object has no attribute 'head'
The following should work:
from pathlib import Path
import pandas as pd
csv_folder = Path('.') # path to your folder, e.g. to `2022`
df = pd.concat(pd.read_csv(p) for p in csv_folder.glob('**/*.csv'))
Alternatively, if you prefer you can also use glob.glob('**/*.csv', recursive=True) instead of the Path.glob method.
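For instance, a minimal sketch of the glob.glob variant, assuming the script is run from inside the 2022 folder:
import glob
import pandas as pd

# recursive=True is required for the ** pattern to descend into subfolders
csv_paths = glob.glob('**/*.csv', recursive=True)
df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)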

how to read data from multiple folder from adls to databricks dataframe

The file path format is data/year/weeknumber/day number/data_hour.parquet
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How do I read all these files one by one in a Databricks notebook and store them into a dataframe?
import pandas as pd
# Get all the files under the folder
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)
# Create the list of file paths
list = df.path.tolist()
for i in list:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am only able to read the last file; the other files get skipped.
The last line of your code cannot load data incrementally; instead, it overwrites the df variable with the data from each path on every iteration.
Removing the for loop and trying the code below will give you an idea of how file masking with asterisks works. Note that the path should be a full path (I'm not sure whether the data folder is your root folder or not).
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
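If you really do need to go file by file rather than use a wildcard, a minimal sketch (assuming the Databricks-provided spark and dbutils objects, and the day folder /data/2022/05/01/ from the question) is to collect each read into a list and union the results instead of overwriting df:
from functools import reduce

# list the hour folders under one day and read each one
hour_folders = [f.path for f in dbutils.fs.ls('/data/2022/05/01/')]
per_hour = [spark.read.parquet(p) for p in hour_folders]
df = reduce(lambda a, b: a.union(b), per_hour)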

Save the file in a different folder using python

I have a pandas dataframe and I would like to save it as a text file to another folder. Here is what I have tried so far:
import pandas as pd
df.to_csv(path = './output/filename.txt')
This does not save the file and gives me an error. How do I save the dataframe (df) into the folder called output?
The first argument of to_csv() is named path_or_buf, so either use that keyword or simply pass the path positionally:
df.to_csv('./output/filename.txt')
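Equivalently, with the keyword spelled out. Note that to_csv does not create missing folders, so the output directory must already exist; a minimal sketch with a stand-in dataframe:
import os
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})        # stand-in for the question's dataframe
os.makedirs('./output', exist_ok=True)  # make sure the target folder exists
df.to_csv(path_or_buf='./output/filename.txt')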

When using pandas dataframe.to_csv(), with compression='zip', it creates a zip file with two archive files with the EXACT same name

I am trying to save OHLCV (stock pricing) data from a dataframe into a single zipped csv file as follows. My test data is ohlcvData.csv, which I read into a dataframe with
import pandas as pd
df = pd.read_csv('ohlcvData.csv', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')
and when I try to write it to a zip file like so (following stackoverflow.com/questions/55134716) :
df.to_csv('ohlcvData.zip', header=False, compression=dict(method='zip', archive_name='ohlcv.csv'))
I get the following warning ...
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\zipfile.py:1473: UserWarning: Duplicate name: 'ohlcv.csv'
return self._open_to_write(zinfo, force_zip64=force_zip64)
and the resultant ohlcvData.zip file contains two files, both named ohlcv.csv, each containing a portion of the results.
When I try to read the zip file back into a dataframe ...
dfRead = pd.read_csv('ohlcvData.zip', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')
... I get the following error...
*File "C:\Users\jeffm\AppData\Roaming\Python\Python37\site-packages\pandas\io\common.py", line 618, in get_handle
"Multiple files found in ZIP file. "
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['ohlcv.csv', 'ohlcv.csv']*
However, when I reduce the number of rows in the input file from 200 to around 175 (for this file structure, exactly how many lines I have to remove varies slightly with the data), it works and produces a zip file containing one csv file, which can be loaded back into a dataframe without error. I have tried many different files, with different data and formats, and I still get the same result: any file with over (approx) 175 lines fails and any file with fewer works fine. So it looks like it's splitting the file after a certain size, but from the docs there doesn't appear to be such a setting. Any help on this would be appreciated. Thanks.
This appears to be a bug introduced in 1.2.0. I created a minimal reproducing example and posted an issue: https://github.com/pandas-dev/pandas/issues/39190
import pandas as pd
# enough data to cause chunking into multiple files
n_data = 100000
df = pd.DataFrame(
    {'name': ["Raphael"] * n_data,
     'mask': ["red"] * n_data,
     'weapon': ["sai"] * n_data,
     }
)
compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('out.csv.zip', index=False, compression=compression_opts)
# reading back the data produces an error
r_df = pd.read_csv("out.csv.zip")
# passing in compression_opts doesn't work either
r_df = pd.read_csv("out.csv.zip", compression=compression_opts)
Looks like this may be a recent Pandas bug. I was having the same issue in Pandas 1.2.0. Reverting to 1.1.3 (i.e. what I was using before) solved the issue. I haven't tested 1.1.4 and 1.1.5.
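Until you can move off an affected version, a workaround in the spirit of the accepted answer to the first question above is to build the single-member archive yourself with zipfile; a minimal sketch with a stand-in dataframe:
import zipfile
import pandas as pd

df = pd.DataFrame({'close': [1.0, 2.0]})  # stand-in for the question's OHLCV data
with zipfile.ZipFile('ohlcvData.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('ohlcv.csv', df.to_csv(header=False))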

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (e.g. 142.csv, 278.csv, etc.). Not all the numbers between 100-400 are associated with a file, for example there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching for and typing out the file names over and over. Any ideas to get me started with this?
You can use glob and read all the csv files in the directory.
import glob
import numpy as np
import pandas as pd

files = glob.glob('*.csv')
# replace=False so the same file is not picked twice
random_files = np.random.choice(files, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
From this you can choose the 5 random files from the directory and then read them separately.
Hope this answers your question.
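Since the question asks for separate dataframes, here is a minimal sketch of a variant (assuming the numbered csv files sit in the current folder) that keeps each file's name as a dictionary key:
import random
from pathlib import Path
import pandas as pd

csv_paths = list(Path('.').glob('*.csv'))
picked = random.sample(csv_paths, 5)                       # 5 distinct files
dfs_by_name = {p.stem: pd.read_csv(p) for p in picked}     # e.g. dfs_by_name['142']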