Importing a *random* csv file from a folder into pandas

I have a folder with several csv files whose names are numbers between 100 and 400 (e.g. 142.csv, 278.csv). Not every number in that range has a file; for example, there is no 143.csv. I want to write a loop that imports 5 random files into separate pandas dataframes, instead of manually searching for and typing out the file names over and over. Any ideas to get me started with this?

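You can use glob to list all the csv files in the directory, pick five of them at random, and read each one into its own dataframe.
import glob
import numpy as np
import pandas as pd
files = glob.glob('*.csv')
# replace=False so that the same file is never picked twice
random_files = np.random.choice(files, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
This chooses 5 random files from the directory and reads them separately, one dataframe per file.
Hope this answers your question.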
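A variation on the same idea that uses only the standard library for the selection step, as a minimal sketch: random.sample never returns the same file twice, and sorting the names first makes the pick reproducible when a seed is given. The function name pick_random_csvs and its parameters are made up for this illustration.

```python
import random
from pathlib import Path


def pick_random_csvs(folder, k=5, seed=None):
    """Return k distinct, randomly chosen .csv paths from folder."""
    # Sort first: the order of a directory listing is not guaranteed.
    candidates = sorted(Path(folder).glob("*.csv"))
    if len(candidates) < k:
        raise ValueError(f"only {len(candidates)} csv files in {folder}")
    rng = random.Random(seed)  # seed=None gives a different pick on each run
    return rng.sample(candidates, k)
```

Each returned path can then be read with pandas, for example dataframes = {p.stem: pd.read_csv(p) for p in pick_random_csvs('.', 5)}, which keys every dataframe by its file number.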

Related

how to read multiple .xlsx files from multiple subfolders using python

I have one folder that contains 10-12 subfolders, and from each subfolder I need to read a specific .xlsx file. I am stuck: I can find all the .xlsx files I want with os.walk, but I don't know how to proceed further.
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith("abc.xlsx"):
            ...  # how do I proceed from here?
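If you would like to use os.walk, this is how:
import os
reqfiles = []
for root, dirs, files in os.walk(path):
    # keep the full path so that each file can be opened later
    reqfiles += [os.path.join(root, i) for i in files if i.endswith("abc.xlsx")]
If all the files sit directly in one folder (os.listdir does not descend into subfolders), you can use just os.listdir:
reqfiles = [i for i in os.listdir(path) if i.endswith("abc.xlsx")]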
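The os.walk approach can be wrapped in a small helper, sketched here under the assumption that the wanted files are identified by a common name suffix. The name find_files and its arguments are hypothetical, chosen for this example.

```python
import os


def find_files(root_dir, suffix):
    """Return full paths of all files under root_dir whose names end with suffix."""
    matches = []
    for folder, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(suffix):
                # Join with the folder so the result can be opened from anywhere.
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

Each path in the result can then be passed to pd.read_excel, which needs an Excel engine such as openpyxl to be installed.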

how to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path, not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df=sc.wholeTextFiles("path/*")
But I can't even display the data, and my main goal is to perform queries on the data in different ways.
wholeTextFiles returns an RDD of key-value pairs, with the file name as the key and the file contents as the value, rather than a dataframe.
Instead, try read.json and give it your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.

file format of training/ testing dataset

Lately I have been building a dataset, gathered from the internet, to use for training NN models. I now have a bunch of jpg images in one folder and their labels in a txt file. My first question is which file format I should convert this data to so that it is easy to load in Python frameworks. My second question is how to build a metadata file for this dataset and which format it should have.
In my opinion the easiest way is to build a csv file with two columns: directory and label. The directory value is the (relative) path to the image, and label is of course the label. This requires merging the txt file and the list of jpg files into one csv file, but csv is easier to work with in pandas.
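A rough sketch of that merge step is given here. The layout of the labels file is not stated in the question, so this assumes one "image_name label" pair per line, separated by whitespace. The function name build_index and the file names in the usage note are invented for the example.

```python
import csv
from pathlib import Path


def build_index(image_dir, labels_txt, out_csv):
    """Write a two-column csv (directory, label) and return the number of rows."""
    image_dir = Path(image_dir)
    rows = []
    with open(labels_txt, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split(maxsplit=1)  # assumed format: "<image name> <label>"
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, label = parts
            path = image_dir / name
            if path.is_file():  # ignore labels whose image is missing
                rows.append((path.as_posix(), label.strip()))
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["directory", "label"])
        writer.writerows(rows)
    return len(rows)
```

After a call such as build_index("images", "labels.txt", "index.csv"), the result can be loaded with pd.read_csv("index.csv") and fed to whatever data loader the framework provides.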

pandas.read_csv of a gzip file within a zipped directory

I would like to use pandas.read_csv to open a gzip file (.asc.gz) within a zipped directory (.zip). Is there an easy way to do this?
This code doesn't work:
csv = pd.read_csv(r'C:\folder.zip\file.asc.gz')  # can't find the file
This code does work (however, it requires me to unzip the folder, which I want to avoid because my dataset currently contains thousands of zipped folders):
csv = pd.read_csv(r'C:\folder\file.asc.gz')
Is there an easy way to do this? I have tried using a combination of zipfile.ZipFile and read_csv, but have been unsuccessful (I think partly because this is an ASCII file as well).
Maybe the following might help.
import pandas as pd
df = pd.read_csv('filename.gz', compression='gzip')
Or:
import gzip
with gzip.open('filename.gz', 'rb') as file:
    content = file.read()
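Neither snippet reaches into the zip archive itself. One way to combine the two layers without unzipping to disk is sketched below, using only the standard library: zipfile opens the member as a byte stream and gzip decompresses it on the fly. The helper name open_gz_member is made up, and the default ASCII encoding is an assumption based on the question.

```python
import gzip
import zipfile
from contextlib import contextmanager


@contextmanager
def open_gz_member(zip_path, member, encoding="ascii"):
    """Yield a text stream for a gzip-compressed member of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open gives the raw (still gzip-compressed) bytes of the member
        with archive.open(member) as raw:
            # gzip.open accepts an existing file object as well as a file name
            with gzip.open(raw, mode="rt", encoding=encoding) as text:
                yield text
```

The yielded stream can be handed to pandas inside the with block, for example: with open_gz_member(r'C:\folder.zip', 'file.asc.gz') as fh: df = pd.read_csv(fh). Whether read_csv needs extra arguments such as sep or skiprows depends on how the .asc file is laid out.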

How can I read many large .7z files containing many CSV files?

I have many .7z files every file containing many large CSV files (more than 1GB). How can I read this in python (especially pandas and dask data frame)? Should I change the compression format to something else?
I believe you should be able to open the file using
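import lzma
import pandas as pd
with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)
This is, strictly speaking, meant for the xz file format. A .7z archive is a container that can hold many files rather than a bare xz stream, so lzma.open will most likely fail on it; in that case you will need to use libarchive.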
For use with Dask, you can do the above for each file with dask.delayed.
dd.read_csv also lets you specify compression='xz' directly (with the same caveat about xz versus 7z); however, random access within a compressed file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
df = dd.read_csv('myfiles.*.7z', compression='xz',
                 blocksize=None)
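To find out up front whether a given file is an xz stream that lzma can read or a 7z container that needs another tool, the first six bytes can be checked against each format's signature. This is only a sketch: the function name sniff_format is invented here, and it looks at nothing beyond the signature bytes.

```python
XZ_MAGIC = b"\xfd7zXZ\x00"            # signature at the start of an .xz stream
SEVENZ_MAGIC = b"7z\xbc\xaf\x27\x1c"  # signature at the start of a .7z archive


def sniff_format(path):
    """Return 'xz', '7z' or 'unknown' based on the first six bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(6)
    if head == XZ_MAGIC:
        return "xz"
    if head == SEVENZ_MAGIC:
        return "7z"
    return "unknown"
```

Files reported as 'xz' can go through lzma.open or compression='xz'; files reported as '7z' need a 7z-aware library or recompression to a format pandas and Dask read directly.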