Importing training data to CloudML with images that do not have a file extension - google-vision

I created some training data and put the CSV in Google Cloud Storage, but it looks like the import won't work when the image files do not have a proper .jpg extension:
Error: INVALID_ROW: Invalid input found at row 1 of gs://weg-li-production/training/test.csv: "Unsupported file extension."
The rows look like this:
TRAIN,gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj,Opel
Is there a way to work around this issue?

It seems the whole string "TRAIN,gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj,Opel" has ended up in a single cell of your CSV file. Each comma should separate one field from the next. You can open the file in Excel to check: the correct format shows three columns.

Assuming gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj is the image file and Opel is the label, everything looks fine except that the image file name does not have a valid extension.
See https://cloud.google.com/vision/automl/docs/prepare for the file types (extensions) that are valid during training and prediction.
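One possible workaround, sketched here under assumptions, is to give the objects a supported extension and rewrite the import CSV to match. The helper below only rewrites the CSV text; the objects in the bucket would still have to be copied or renamed separately, and the files must really be in the format the extension claims. The extension set and both function names are illustrative assumptions, not taken from the AutoML documentation, so check the page linked above for the authoritative list.

```python
import csv
import io
import posixpath

# Assumption: extensions the importer accepts; verify against the linked docs.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".ico"}


def has_supported_extension(uri: str) -> bool:
    """Return True if the object name ends in one of the assumed extensions."""
    _, ext = posixpath.splitext(uri)
    return ext.lower() in SUPPORTED_EXTENSIONS


def add_extension_to_rows(csv_text: str, extension: str = ".jpg") -> str:
    """Append `extension` to every image URI that lacks a supported one.

    Expects rows of the form SET,gs://bucket/object,label.
    Only the CSV is rewritten; the bucket objects are not touched.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and not has_supported_extension(row[1]):
            row[1] += extension
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    original = "TRAIN,gs://weg-li-production/d7nwcheo8774rvbcgj4lyta3athj,Opel\n"
    print(add_extension_to_rows(original), end="")
```

Rows that already carry a supported extension pass through unchanged, so the helper can be run over a mixed CSV.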

Related

pyspark dataframe writing csv files twice in s3

I have created a PySpark DataFrame and am trying to write it to an S3 bucket in CSV format. The CSV is written, but the problem is that two files are written: one with the actual data and another that is empty. I checked the DataFrame by printing it, and it looks fine. Please suggest a way to prevent the empty file from being created.
code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition does not necessarily mean that the output will be a single file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.

trouble with utf-8 with julia and jupyterlab

I'm reading the csv file at https://github.com/VinitaSilaparasetty/julia-beginners/blob/master/data/nba/nba19-20.csv
I get a DataFrame and save it as XLSX. When I try to open the file in JupyterLab, I get an error saying the file is not UTF-8 encoded, and the file is not read.
This is my code:
using HTTP, XLSX, CSV, DataFrames
df = CSV.read(HTTP.get("https://raw.githubusercontent.com/VinitaSilaparasetty/julia-beginners/master/data/nba/nba19-20.csv").body)
# first(df,5) # first shows the top five rows ok
XLSX.writetable("data/nba/nba19-20.XLSX", collect(eachcol(df)), names(df), overwrite = true)
The file is saved in my data folder. When I try to open it with JupyterLab, I get a pop-up saying the file is not UTF-8 encoded, and the file is not opened.
When I try to open the file in Ubuntu (with LibreOffice) I do not see anything suspicious.
As I'm new to Julia I'm struggling to understand where the problem lies or how to fix it.
I tried to see if I could encode the dataframe in UTF-8 (after saving the file to disk) with
data = DataFrame(CSV.File(open(read,"data/nba/nba19-20.csv", enc"utf-8")))
But I did not see any change. Any suggestion is welcome.
Do you have the jupyterlab-spreadsheet plugin installed? By default, JupyterLab doesn't support opening xlsx files (the format isn't mentioned in its list of supported file formats, for example).
See also this similar question involving Python pandas (which says pretty much the same thing).

Create an ADF dataset to load multiple CSV files (same format) from Blob storage

I am trying to create a dataset containing multiple CSV files from Blob storage. In the file path of the dataset settings, I use a parameter, @dataset().FolderName, and I added FolderName under Parameters.
I leave the file name (in File Path) empty because I want to pick up all files in the folder. However, no data appears when I preview the data. Is anything missing? Thank you.
I have tested this on my side and it works fine: add the FolderName parameter, then preview the data.
If you want to merge all CSV files in Data Flow, you can do this:
1. Output to a single file
2. Set Single partition

File format of training/testing dataset

I have recently been building a dataset, gathered from the internet, to use for training NN models. I now have a bunch of JPG images in one folder and their labels in a txt file. My first question is which file format I should convert this data to so that it is easy to load in Python frameworks. My second question is how to build a metadata file for this dataset, and which format it should have.
In my opinion, the easiest way is to build a CSV file with two columns: directory and label. The directory value is the (relative) path to the image, and label is of course the label. This requires you to merge the txt file and the list of JPG files into one CSV file, but a CSV is easier to work with in pandas.
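The merge described above can be sketched with the standard library alone. This is a minimal sketch under assumptions: each line of the labels file is assumed to look like "<image file name> <label>", and the function name build_index and all file names are hypothetical, so adapt the parsing to the actual layout of your txt file.

```python
import csv
from pathlib import Path


def build_index(image_dir: str, labels_file: str, out_csv: str) -> int:
    """Write a CSV with columns (directory, label); return the row count.

    Assumes each non-empty line of labels_file is "<image file name> <label>".
    Paths are written as given, so they are relative if image_dir is relative.
    """
    image_root = Path(image_dir)
    rows = []
    for line in Path(labels_file).read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, label = parts
        image_path = image_root / name
        if image_path.is_file():  # keep only labels whose image exists
            rows.append((image_path.as_posix(), label.strip()))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["directory", "label"])
        writer.writerows(rows)
    return len(rows)
```

The resulting file can then be loaded with pandas.read_csv and used to look up each image and its label during training.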

Cannot open a csv file

I have a CSV file that I need to work on in my Jupyter notebook. I am able to view the contents of the file using the code in the picture.
But when I try to convert the data into a DataFrame, I get a "no columns to parse from file" error.
The file has no headers. My CSV file looks like this, and I have saved it in UTF-8 format.
Try using pandas to read the CSV file:
import pandas as pd
df = pd.read_csv("BON3_NC_CUISINES.csv", header=None)  # header=None: the file has no header row
print(df)