file format of training/testing dataset - dataframe

I have lately been building a dataset that I gathered from the internet to use for training NN models. Now I have a bunch of jpg images in one folder and their labels in a txt file. The question is: to which file format should I convert this data to make it easy to load in (Python) frameworks? A second question is how to build a metadata file about this dataset and which format it should have.

In my opinion the easiest way is to build a CSV file with two columns: directory and label. The directory value is the (relative) path to the image, and label is of course the label. It requires merging the txt file and the list of jpg files into one CSV, but it is essentially easier to work with CSV in pandas.
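A minimal sketch of building such a CSV, assuming the images sit in an images/ folder and labels.txt holds one "filename,label" line per image (both names are placeholders - adjust to your actual layout):
import os
import pandas as pd
image_dir = "images"                                   # hypothetical folder holding the jpg files
labels = {}
with open("labels.txt") as f:                          # hypothetical "filename,label" lines
    for line in f:
        name, label = line.strip().split(",")
        labels[name] = label
rows = [{"directory": os.path.join(image_dir, name), "label": labels[name]}
        for name in sorted(os.listdir(image_dir))
        if name.endswith(".jpg")]
pd.DataFrame(rows).to_csv("dataset.csv", index=False)
The resulting dataset.csv can then be read back with pd.read_csv and fed into whatever loading pipeline your framework uses.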

Related

Create an ADF Dataset to load multiple csv files (same format) from the Blob

I am trying to create a dataset containing multiple csv files from the Blob. In the file path of the dataset settings I create a parameter - #dataset().FolderName - and add FolderName in the Parameters.
I leave the file (from File Path) empty, as I want to grab all files in the folder. However, there is no data when I preview data. Is there anything missing? Thank you
I have tested this on my side and it works fine.
(screenshot: add FolderName parameter)
(screenshot: preview data)
If you want to merge all the csv files in a Data Flow, you can do this:
1. Output to a single file
2. Set Single partition

Pandas to_csv with ZIP compresses whole directory

df.to_csv("/path/to/destination.zip", compression="zip")
The above line will generate a file called destination.zip in the directory /path/to/.
Decompressing the ZIP file will result in a directory structure path/to/destination.zip, where destination.zip is the CSV file.
Why is the path/to/ folder structure included in the compressed file? Is there any way to avoid this?
I was taken aback by this; currently I am writing the ZIP locally (destination.zip) and using os.rename to move it to the desired location. Is this a bug?
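One way to avoid this (a sketch, assuming a pandas version new enough to accept a dict for the compression argument, roughly 1.0 and later) is to set the archive member name explicitly, so the entry inside the ZIP is just the CSV name rather than the full path:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})   # toy frame for illustration
df.to_csv(
    "/path/to/destination.zip",
    index=False,
    compression={"method": "zip", "archive_name": "destination.csv"},  # name of the entry inside the ZIP
)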

how to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path and not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df = sc.wholeTextFiles("path/*")
but I can't even display the data, and my main goal is to perform queries on the data in different ways.
Instead of wholeTextFiles (which gives a key-value pair with the file name as the key and the file content as the value),
try read.json and give it your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.
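Since the goal is to query the data in different ways, here is a minimal sketch of querying the resulting dataframe (assuming a SparkSession named spark; some_field is a placeholder for whatever field your JSON records actually contain):
df = spark.read.json("<directory_path>/*")
df.printSchema()                       # inspect the inferred schema first
df.createOrReplaceTempView("records")  # register as a temp view for SQL
spark.sql("SELECT some_field, COUNT(*) AS n FROM records GROUP BY some_field").show()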

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (e.g. 142.csv, 278.csv, etc.). Not every number between 100 and 400 is associated with a file; for example there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas, instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
You can use glob and read all the csv files in the directory.
import glob
import numpy as np
import pandas as pd
files = glob.glob('*.csv')                                # all csv files in the current directory
random_files = np.random.choice(files, 5, replace=False)  # pick 5 distinct files at random
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))                    # one dataframe per sampled file
From this you can choose 5 random files from the directory and then read each of them separately.
Hope this answers your question.

importing training data to CloudML with images that do not have a file-extension

I created some training data and put the CSV in Google Cloud Storage, but it looks like the import won't work when the files do not have a proper .jpg extension:
Error: INVALID_ROW: Invalid input found at row 1 of gs://weg-li-production/training/test.csv: "Unsupported file extension."
The values look like this:
TRAIN,gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj,Opel
Is there a way to work around this issue?
It seems you put the whole "TRAIN,gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj,Opel" into a single cell of your csv file. Each comma should separate a new cell. You can open the csv file in Excel to check it; the correct format should show three columns.
Assuming gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj is the image file and Opel is the label, it all looks fine, except that the image file name does not have a valid extension.
Check https://cloud.google.com/vision/automl/docs/prepare for the valid file types (extensions) during training and prediction.
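One possible workaround (a hedged sketch, assuming the google-cloud-storage client library and that you are free to rename the objects) is to rename each extension-less object so it carries a .jpg extension, and then regenerate the CSV with the new URIs:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket("weg-li-production")           # bucket from the question
for blob in client.list_blobs(bucket):
    if "." not in blob.name:                          # object has no file extension yet
        bucket.rename_blob(blob, blob.name + ".jpg")  # copies to a new name with .jpg and deletes the old object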