I have been reading and working on SO questions related to the Street View House Numbers (SVHN) datasets. The files are available at 2 different locations:
Stanford:
The Street View House Numbers (SVHN) Dataset
kaggle:
Street View House Numbers (SVHN) | Kaggle
My question is related to the format of the digitStruct.mat files for each image set (train, test, and extras). These define the name, label, and bounding box dimensions for each image. As I understand, the mat file is written as a Matlab structure in HDF5 format (that can be read with h5py).
I have been able to access and read the digitStruct.mat files from kaggle with h5py. I cannot open the same files from the Stanford site with h5py (or with HDFView). Some SO posts I've read indicate the Stanford files are an older Matlab format and should be read with scipy.io.loadmat.
Are the files at Stanford and kaggle the same?
If not, what are the differences?
Should I be able to open the Stanford digitStruct.mat files with h5py?
If so, what method should I use to download and extract the Stanford tar.gz files? (FYI, I'm on Win-7, and have been using HTTP download and WinZip to extract.)
I am adding additional info to document different behavior observed with different .mat files. It may help with diagnosis.
I can open and operate on .mat files from kaggle with this call:
h5f = h5py.File('digitStruct.mat','r')
For files from Stanford, I get different errors depending on the file and function used to open.
The command below executes without an error message. That leads me to believe it is not a Matlab v7.3 file that can be opened with h5py.
mat = scipy.io.loadmat('./Stanford/test_32x32.mat')
Neither of these calls works (brief error message provided):
mat = scipy.io.loadmat('./test/digitStruct.mat')
Traceback...
NotImplementedError: Please use HDF reader for matlab v7.3 files
h5f = h5py.File('./test/digitStruct.mat','r')
Traceback...
OSError: Unable to open file (file signature not found)
In addition, I cannot open test/digitStruct.mat with HDFView. My conclusion for the Stanford digitStruct.mat files: they might be Matlab v7.3 files, but were corrupted when I downloaded. However, I'm not sure what I did wrong (since I can download and read kaggle files without problems).
With some Linux detective work, I figured out the problem.
As I suspected, the digitStruct.mat files extracted from the *.tar.gz files on the Stanford site are HDF5 (Matlab v7.3) files, and were corrupted when I downloaded.
To confirm, I downloaded the 3 tar.gz files with a browser on a Linux system, then used the tar command to extract them, and successfully opened with h5py on Linux. I then transferred them to my Windows system, and each worked as expected with h5py.
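For anyone who wants to avoid the archive tool entirely, here is a minimal sketch that extracts the tarball with Python's standard tarfile module, which copies member bytes verbatim on any platform. The function name is made up for illustration, and it assumes the .tar.gz itself was downloaded intact. A possible cause of the damage is WinZip's "TAR file smart CR/LF conversion" option, which can rewrite line-ending bytes inside binary files, but I have not verified that for these archives.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir):
    """Extract a .tar.gz archive byte-for-byte and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; "data"
            # rejects unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

Usage would look like extract_tar_gz("test.tar.gz", "Stanford"), where both the archive name and the destination folder are placeholders for whatever you downloaded.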
This is a little surprising, as I have used WinZip to extract tarball files in the past. Apparently there's something special about these that caused the corruption.
Hopefully this saves someone the same headache in the future.
Note: the 3 xxxx_32x32.mat files are an older Matlab format that must be accessed with scipy.io.loadmat()
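To tell the two formats apart (and to spot a damaged file) before choosing a reader, you can look at the leading bytes. This is a rough sketch, not a full validator: the function name and the returned labels are my own, it only checks the HDF5 signature at offsets 0 and 512, and the "suspect" label is my interpretation of a file whose text header says v7.3 but whose HDF5 signature is missing, which would match the pair of errors shown in the question.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def mat_file_kind(path):
    """Classify a .mat file by its leading bytes.

    Returns "hdf5" (Matlab v7.3, read with h5py), "mat5" (older format,
    read with scipy.io.loadmat), "suspect" (v7.3 text header but no HDF5
    signature where expected), or "unknown".
    """
    with open(path, "rb") as f:
        head = f.read(520)
    # HDF5 allows the signature at offset 0 or after a user block;
    # Matlab v7.3 files use a 512-byte user block for their text header.
    if head[:8] == HDF5_SIGNATURE or head[512:520] == HDF5_SIGNATURE:
        return "hdf5"
    if head.startswith(b"MATLAB 5.0 MAT-file"):
        return "mat5"
    if head.startswith(b"MATLAB 7.3 MAT-file"):
        return "suspect"
    return "unknown"
```

Note that the HDF5 signature itself contains \r\n and \n bytes, so any tool that "fixes" line endings during extraction will break it.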
I prepared my dataset and created a version of it. I then tried to export the dataset in TensorFlow Object Detection CSV format, but the zip file I got contains nothing except "README.roboflow.txt" and "README.Dataset.txt".
Am I doing something wrong, is this feature still in development, or is it a bug?
Thanks
I just tried this and was able to get my test, train, valid in the folder which also included the README.Dataset.txt and README.roboflow.txt.
Can you try again or share the email you used to build this project so one of us Roboflow staff members can take a look at it? You can feel free to dm it to me in our forum if it still doesn't work.
Kelly M., Developer Advocate
I am working on a Deep Learning project, and the data was provided to me in a file with the ".data" extension. I am able to read the data from the file using the Pandas "read_csv" function. I searched the web for information about this file type, but I am still not clear about its properties, usage, etc. Here are the questions I have:
What is the ".data" file?
How are they created? (I mean, are they exported from an application or a database?)
Is pd.read_csv the correct way to read a ".data" file? (I tried read_table as well.)
Is there any other way to read the ".data" file?
Recently I found a solution for .data files using pandas' fixed-width reader:
import pandas as pd
data = pd.read_fwf("example.data")
For more details check here.
I just ran into a .data file in the wild myself. I've been able to view it in any text editor (Notepad, Visual Studio Code, JupyterLab, etc.). This helped determine what the separator should be. Mine was not tab-delimited as mrinali mentioned, but that's not to say that there aren't any tab-delimited .data files. Mine was space-delimited, so I just specified this as "sep" in pandas' .read_csv() method:
pd.read_csv('<your_path>', sep=' ')
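If you would rather not eyeball the file, the standard library can make a guess at the separator for you. This is only a sketch: guess_separator is a made-up helper, the candidate list is an assumption about which separators are plausible, and csv.Sniffer is a heuristic that can guess wrong or raise csv.Error on irregular files.

```python
import csv


def guess_separator(path, candidates=" \t,;", sample_chars=4096):
    """Guess the column separator of a text file from its first few KB."""
    with open(path, "r", newline="") as f:
        sample = f.read(sample_chars)
    # Sniffer looks for a candidate character that occurs the same
    # number of times on (almost) every line of the sample.
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

The result can then be passed on, e.g. pd.read_csv(path, sep=guess_separator(path)).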
A DATA file is a data file used by Analysis Studio, a statistical analysis and data mining program. It contains mined data in a plain text, tab-delimited format, including an Analysis Studio file header. DATA files are commonly used to store data for offline data analysis when not connected to an Analysis Studio server, but may also be used in online mode.
Due to their tab-delimited format, DATA files may be imported using pandas via read_csv function once their header information is stripped.
HOW TO OPEN A .DATA FILE?
Launch a .data file, or any other file on your PC, by double-clicking it. If your file associations are set up correctly, the application that's meant to open your .data file will open it. It's possible you may need to download or purchase the correct application. It's also possible that you have the correct application on your PC, but .data files aren't yet associated with it. In this case, when you try to open a .data file, you can tell Windows which application is the correct one for that file. From then on, opening a .data file will open the correct application.
I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started this process and it asked for the directory, I chose C:, my main hard drive.
It took hours to load and when it did, the list of files it generated included word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I removed all the pdfs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until problem files were removed, even though none of the files remaining in the list were flagged as having issues.
My firm is trying to make sure all PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.
I think you can do this using a combination of
regular Java : to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText : to iterate over the PDF document and extract all images
Tess4J : a port of Tesseract (Google's OCR engine) for Java, to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you. But only for 1 pdf at a time. So you'd still need some windows/linux scripting to pipe in all files of a given directory.
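As a rough illustration of the first step only (finding the files), here is a sketch in Python rather than Java; find_pdfs is a hypothetical helper, and the image extraction and OCR steps are deliberately left out. Filtering on the extension up front also keeps Word documents out of the list.

```python
from pathlib import Path


def find_pdfs(root):
    """Yield every file under root whose extension is .pdf (any letter case)."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path
```

Pointing it at a narrower folder than C: should also cut the scan time considerably.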
MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
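For what it's worth, that header-plus-bytes idea can be sketched in a few lines. The record layout below is invented purely for illustration (a 2-byte name length and an 8-byte data length, little-endian, followed by the UTF-8 name and the raw bytes); it does no compression and no error recovery.

```python
import struct

# Hypothetical record header: name length (uint16), data length (uint64).
HEADER = struct.Struct("<HQ")


def pack_files(files, container_path):
    """Write a dict of {name: bytes} into one uncompressed container file."""
    with open(container_path, "wb") as out:
        for name, data in files.items():
            encoded = name.encode("utf-8")
            out.write(HEADER.pack(len(encoded), len(data)))
            out.write(encoded)
            out.write(data)


def unpack_files(container_path):
    """Read the container back into a dict of {name: bytes}."""
    files = {}
    with open(container_path, "rb") as src:
        while True:
            header = src.read(HEADER.size)
            if not header:  # clean end of file
                break
            name_len, data_len = HEADER.unpack(header)
            name = src.read(name_len).decode("utf-8")
            files[name] = src.read(data_len)
    return files
```

Real formats usually add a magic number at the start and an index (ZIP keeps one at the end) so a reader can jump straight to one entry instead of scanning every record.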
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extension. If you change a file with a .docx extension to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
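You can see the same thing without renaming anything. A small sketch using Python's standard zipfile module, assuming the .docx is a regular ZIP package as described above; list_parts is just an illustrative name.

```python
import zipfile


def list_parts(docx_path):
    """Return the names of the files stored inside a .docx (ZIP) package."""
    with zipfile.ZipFile(docx_path) as package:
        return package.namelist()
```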
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in Sqlite, which gives you a library to handle some (e.g.) *.sqlite file as an SQL database
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
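To make the SQLite suggestion concrete, here is a minimal sketch that uses one database file as a container for named blobs. The table name, column names, and helper functions are all made up for the example; Python's built-in sqlite3 module is used only because it needs no extra install.

```python
import sqlite3


def store_blob(db_path, name, data):
    """Insert or overwrite one named blob inside a single SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS files "
                "(name TEXT PRIMARY KEY, data BLOB)"
            )
            con.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?)", (name, data)
            )
    finally:
        con.close()


def load_blob(db_path, name):
    """Return the stored bytes for name, or None if it is not present."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT data FROM files WHERE name = ?", (name,)
        ).fetchone()
    finally:
        con.close()
    return None if row is None else bytes(row[0])
```

The library takes care of the "insert in the middle" problem: replacing one entry does not require you to rewrite the rest of the file yourself.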
I'm looking for a way to access files within a zip file without extracting the whole file. All the zip solutions I find on the internet seems to extract the whole zip. Does anyone know of a solution?
Google has an objective-c lib based on minizip. http://code.google.com/p/objective-zip/
Supports unzip of individual files
EDIT: the project has moved to GitHub
The zlib library source distribution comes with a 'contrib' directory. In it, you'll find a library called 'minizip' (same license as zlib itself), which has APIs for creating (zip.h) and navigating/extracting (unzip.h) ZIP files. Despite the filename, there are functions in unzip.h which let you list or search for files within the zip file without extracting it.
If the zip is up on the internet you can have a look at pinch which will let you extract individual files from the zip without downloading the whole file.
https://github.com/epatel/pinch-objc
Maybe you can use it as a base to extract individual files from a local zip archive.
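The reason this is possible at all is that a ZIP archive carries a directory of its entries, so a reader can seek to one member and decompress only that. The sketch below shows the idea in Python rather than Objective-C, purely as an illustration of the technique and not of any of the libraries above; read_member is a hypothetical helper.

```python
import zipfile


def read_member(zip_path, member_name):
    """Return the bytes of one archive member without extracting the rest."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member_name)
```

Nothing is written to disk here; the other members are never decompressed.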