I am working on a deep learning project, and the data was provided to me in a file with the ".data" extension. I am able to read the data from the file using the pandas "read_csv" function. I searched the web for information about this file type, but I am still not clear about its properties, usage, etc. Here are my questions:
What is a ".data" file?
How are they created? (I mean, are they exported from an application or a database?)
Is pd.read_csv the correct way to read a ".data" file? (I tried read_table as well.)
Is there any other way to read a ".data" file?
I recently found a solution for .data files using pandas: read_fwf, which reads a file of fixed-width formatted lines.
import pandas as pd
data = pd.read_fwf("example.data")
For more details check here.
I just ran into a .data file in the wild myself. I was able to view it in any text editor (Notepad, Visual Studio Code, JupyterLab, etc.), which helped me determine what the separator should be. Mine was not tab-delimited as mrinali mentioned, but that's not to say that there aren't any tab-delimited .data files. Mine was space-delimited, so I just specified this as "sep" in pandas' read_csv() function:
pd.read_csv('<your_path>', sep=' ')
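If you'd rather not eyeball the separator in a text editor, the standard library's csv.Sniffer can usually guess it from a sample of the file. This is only a rough sketch: the sample rows and the list of candidate separators below are made up for illustration, and the sniffer can guess wrong on messy data.

```python
import csv

def guess_delimiter(sample_text, candidates=",\t; |"):
    """Guess the field separator from a sample of the file's text.

    `candidates` is an assumed list of plausible separators; adjust as needed.
    """
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Hypothetical sample standing in for the first lines of a .data file.
# With a real file you would read a few KB instead, e.g. open(path).read(4096).
sample = (
    "5.1 3.5 1.4 0.2 setosa\n"
    "4.9 3.0 1.4 0.2 setosa\n"
    "6.2 2.9 4.3 1.3 versicolor\n"
)
sep = guess_delimiter(sample)
print(repr(sep))
```

The detected value can then be passed as sep to pd.read_csv, exactly as in the call above.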
A DATA file is a data file used by Analysis Studio, a statistical analysis and data mining program. It contains mined data in a plain text, tab-delimited format, including an Analysis Studio file header. DATA files are commonly used to store data for offline data analysis when not connected to an Analysis Studio server, but may also be used in online mode.
Because they are tab-delimited, DATA files can be imported into pandas with the read_csv function once their header information is stripped.
HOW TO OPEN A .DATA FILE?
Launch a .data file, or any other file on your PC, by double-clicking it. If your file associations are set up correctly, the application that's meant to open your .data file will open it. It's possible you may need to download or purchase the correct application. It's also possible that you have the correct application on your PC, but .data files aren't yet associated with it. In this case, when you try to open a .data file, you can tell Windows which application is the correct one for that file. From then on, opening a .data file will open the correct application.
I am trying to import a .xlsx file into Oracle via the import wizard. However, when I select the .xlsx file nothing happens. Usually when I import a .csv I then specify the format etc., but here I am just brought back to the home screen. The file is quite small, so I don't see why this wouldn't work. Does anyone have any advice?
The fastest way is to convert your Excel data to CSV and import it as usual. Depending on the size of the file, the version of SQL Developer, and the operating system, there seem to be some problems with memory (especially on 64-bit systems with a 64-bit JDK), even though the file looks small.
Some users report that they succeeded in importing an xls file after increasing the SQL Developer memory limit by adding a line like AddVMOption -Xmx1280M (or larger) to the SQLDeveloper.conf file.
Converting xls to csv is easy, fast, and less stressful than messing with the config file.
I have been reading and working on SO questions related to the Street View House Numbers (SVHN) datasets. The files are available at 2 different locations:
Stanford:
The Street View House Numbers (SVHN) Dataset
kaggle:
Street View House Numbers (SVHN) | Kaggle
My question is related to the format of the digitStruct.mat files for each image set (train, test, and extras). These define the name, label, and bounding box dimensions for each image. As I understand, the mat file is written as a Matlab structure in HDF5 format (that can be read with h5py).
I have been able to access and read the digitStruct.mat files from kaggle with h5py. I cannot open the same files from the Stanford site with h5py (or with HDFView). Some SO posts I've read indicate the Stanford files are an older Matlab format and should be read with scipy.io.loadmat.
Are the files at Stanford and kaggle the same?
If not, what are the differences?
Should I be able to open the Stanford digitStruct.mat files with h5py?
If so, what method should I use to download and extract the Stanford tar.gz files? (FYI, I'm on Win-7, and have been using HTTP download and WinZip to extract.)
I am adding additional info to document different behavior observed with different .mat files. It may help with diagnosis.
I can open and operate on .mat files from kaggle with this call:
h5f = h5py.File('digitStruct.mat','r')
For files from Stanford, I get different errors depending on the file and function used to open.
The command below executes without an error message. That leads me to believe it is not a Matlab v7.3 file that can be opened with h5py.
mat = scipy.io.loadmat('./Stanford/test_32x32.mat')
Neither of these calls works (brief error message provided):
mat = scipy.io.loadmat('./test/digitStruct.mat')
Traceback...
NotImplementedError: Please use HDF reader for matlab v7.3 files
h5f = h5py.File('./test/digitStruct.mat','r')
Traceback...
OSError: Unable to open file (file signature not found)
In addition, I cannot open test/digitStruct.mat with HDFView. My conclusion for the Stanford digitStruct.mat files: they might be Matlab v7.3 files, but were corrupted when I downloaded. However, I'm not sure what I did wrong (since I can download and read kaggle files without problems).
With some Linux detective work, I figured out the problem.
As I suspected, the digitStruct.mat files extracted from the *.tar.gz files on the Stanford site are HDF5 (Matlab v7.3) files, and were corrupted when I downloaded.
To confirm, I downloaded the 3 tar.gz files with a browser on a Linux system, then used the tar command to extract them, and successfully opened with h5py on Linux. I then transferred them to my Windows system, and each worked as expected with h5py.
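For anyone stuck on Windows without a Linux box handy, Python's built-in tarfile module is another way to extract the archives; it writes the member bytes out unchanged. A minimal sketch, where the function name and the example paths are my own placeholders (and, as with any extractor, only use it on archives you trust):

```python
import tarfile

def extract_tarball(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the list of member names so you can see what was unpacked.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names

# Hypothetical usage (paths are placeholders, not the actual download names):
# print(extract_tarball("test.tar.gz", "./Stanford"))
```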
This is a little surprising, as I have used WinZip to extract tarball files in the past. Apparently there's something special about these that caused the corruption.
Hopefully this saves someone the same headache in the future.
Note: the 3 xxxx_32x32.mat files are an older Matlab format that must be accessed with scipy.io.loadmat()
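A quick way to tell the two kinds of .mat file apart (and to spot a damaged one) before choosing between h5py and scipy.io.loadmat is to look at the first bytes of the file. The sketch below relies on two things I'm fairly sure of: older MAT files start with the text "MATLAB 5.0 MAT-file", and HDF5 files carry an 8-byte signature at offset 0, 512, 1024 or 2048 (Matlab v7.3 puts its own text header in the first 512 bytes). The function name and return strings are just my choices. Note that the HDF5 signature deliberately contains line-ending bytes, so a tool that rewrites line endings would break it, which would match the "file signature not found" error.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def sniff_mat_format(path):
    """Return 'mat-v5', 'hdf5', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(4096)
    if head.startswith(b"MATLAB 5.0 MAT-file"):
        return "mat-v5"   # read with scipy.io.loadmat
    # HDF5 permits the signature at offset 0 or after a user block of 512, 1024, 2048... bytes.
    for offset in (0, 512, 1024, 2048):
        if head[offset:offset + 8] == HDF5_SIGNATURE:
            return "hdf5"  # read with h5py
    return "unknown"       # possibly corrupted, or not a .mat file at all

# Hypothetical usage:
# print(sniff_mat_format("./test/digitStruct.mat"))
```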
MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking/expecting any source code that does this, I just need a pointer. Is there something you think I'm misunderstanding based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
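For what it's worth, the "headers plus raw bytes" idea can be sketched in a few lines of Python. Everything about the layout here is invented for illustration (the WRAP magic marker, the field sizes, the function names); it is not any existing format, just one way such a container could look, with no compression involved.

```python
import io
import struct

MAGIC = b"WRAP"  # hypothetical 4-byte marker identifying the container

def pack_files(files, out):
    """Write a dict of {name: bytes} into one binary stream."""
    out.write(MAGIC)
    out.write(struct.pack("<I", len(files)))            # number of entries
    for name, data in files.items():
        encoded = name.encode("utf-8")
        # Per-file header: name length (2 bytes) and data length (8 bytes).
        out.write(struct.pack("<HQ", len(encoded), len(data)))
        out.write(encoded)
        out.write(data)

def unpack_files(src):
    """Read the stream written by pack_files back into a dict."""
    if src.read(4) != MAGIC:
        raise ValueError("not a WRAP container")
    (count,) = struct.unpack("<I", src.read(4))
    files = {}
    for _ in range(count):
        name_len, data_len = struct.unpack("<HQ", src.read(10))
        name = src.read(name_len).decode("utf-8")
        files[name] = src.read(data_len)
    return files

original = {"notes.txt": b"hello", "image.bin": b"\x00\x01\x02\xff"}
buf = io.BytesIO()
pack_files(original, buf)
buf.seek(0)
restored = unpack_files(buf)
print(restored == original)
```

A real format would usually also want a version field and some checksum, but the principle is the same: the reader only needs to know the conventions the writer used.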
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies. I think the Office file formats are similar to ZIP files with a different extension. If you change the extension of a .docx file to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files"
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
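The renaming trick described above can also be tried from code, since Python's zipfile module does not care about the extension. In this sketch a small stand-in archive is built in memory so it runs without a real document; the member names are merely typical of what a .docx contains, and the helper name is my own. ZIP_STORED is used to show that a ZIP can wrap files without compressing them.

```python
import io
import zipfile

def list_archive_members(file_or_path):
    """Return the names of the files stored inside a ZIP-based container."""
    with zipfile.ZipFile(file_or_path) as zf:
        return zf.namelist()

# Build a tiny stand-in for a .docx in memory (contents are placeholders).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")
buf.seek(0)

members = list_archive_members(buf)
print(members)

# With a real file you would pass its path instead, e.g.
# list_archive_members("report.docx")  # hypothetical file name
```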
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes. And you can only grow (or shrink, i.e. truncate) it at the end - there is no way to insert bytes in the middle (without copying the rest).
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, which gives you a library to handle a file (e.g. *.sqlite) as an SQL database.
You could also use GDBM - a library giving you some indexed file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
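To illustrate the SQLite route, here is a minimal sketch that keeps several files inside one database file as BLOBs, using Python's built-in sqlite3 module. The table layout and function names are assumptions of mine, and an in-memory database is used so the example leaves nothing on disk.

```python
import sqlite3

def store_file(conn, name, data):
    """Insert (or overwrite) a file's bytes under the given name."""
    conn.execute(
        "INSERT OR REPLACE INTO files (name, content) VALUES (?, ?)",
        (name, data),
    )

def load_file(conn, name):
    """Return the stored bytes for name, or None if it is not present."""
    row = conn.execute(
        "SELECT content FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]

# ":memory:" keeps the demo self-contained; use a path to get a real container file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, content BLOB)")
store_file(conn, "hello.bin", b"\x00\x01binary payload")
payload = load_file(conn, "hello.bin")
print(payload)
```

Unlike a hand-rolled container, this lets you replace or delete one stored file without rewriting the rest yourself.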
I want to be able to put a file into a variable so I can interact with it. For example, I could put a wav file into a variable and play it back without having to distribute the separate file. Is this possible, for instance by using Base64? I have seen some Python programs, for example, that have images embedded in the code.
Yes, you could conceivably store the contents of a binary .wav file as a static, uuencoded text array.
Probably a better way to go about it would be to create a "resource" for your binary data:
http://msdn.microsoft.com/en-us/library/xbx3z216.aspx
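Since the question mentions Base64 and Python, here is a rough sketch of the embedding idea with the standard library only. The tiny silent WAV is generated on the fly purely so the example is self-contained; in a real program the Base64 text would be pasted into the source as a string literal. The names below are my own, and actual playback is left out because it depends on the platform.

```python
import base64
import io
import wave

def make_tiny_wav():
    """Create a very short silent mono WAV in memory (stand-in for a real file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * 80)
    return buf.getvalue()

# In practice this string would be a literal generated once from the real file.
EMBEDDED_WAV = base64.b64encode(make_tiny_wav()).decode("ascii")

# At run time: decode the text back to the original bytes and use them in memory.
wav_bytes = base64.b64decode(EMBEDDED_WAV)
with wave.open(io.BytesIO(wav_bytes), "rb") as r:
    frames = r.getnframes()
print(frames, "frames decoded from", len(EMBEDDED_WAV), "Base64 characters")
```

Keep in mind that Base64 makes the data roughly a third larger, so this suits small assets better than large ones.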
Is there any spreadsheet program that supports reading HDF5 files?
Have you already tried HDFView?
Its tabular view is quite similar to a spreadsheet application. You can also save the data to a text file and then open it with a more standard spreadsheet application if you prefer:
http://www.hdfgroup.org/hdf-java-html/hdfview/UsersGuide/ug05spreadsheet.html#ug05save
You can download HDFview here:
http://www.hdfgroup.org/hdf-java-html/hdfview/index.html