Opening tsv format Eurostat data - file-io

I've been trying to open this data: http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fdemo_gind.tsv.gz. I've already unzipped it and got the tsv file, but when I open it in gedit, it looks like a binary file. Could anybody help me open this file?

The file is correctly formatted even if not so readable for human beings.
TSV is a file extension for a tab-delimited file used with spreadsheet
software. TSV stands for Tab Separated Values. TSV files are used for
raw data and can be imported into and exported from spreadsheet
software. TSV files are essentially text files, and the raw data can
be viewed by text editors, though they are often used when moving raw
data between spreadsheets.
You can import it into Excel or OpenOffice. Otherwise, you can convert it with an online service (for example, Google Sheets).

Once you've unarchived the original .gz file there are two more steps required to view the data, as noted on Eurostat's website.
TSV files may be imported into Excel by (1) Saving on hard disk with
the suffix .tsv and (2) unzipping and (3) saving the table(s) as Text
(*.txt).
As per user74158's comment, decompress/unzip the tsv file. This can
likely be done with many different programs; I used 7zip and it
worked for me. On Windows 7 I did this by right-clicking, hovering
over 7zip, selecting extract files, telling 7zip where I'd like to
extract the files to and pressing OK.
Next, go to the file and change the .tsv file extension to .txt. Confirm that you're sure you want to change the file extension, and then you should be able to read the data.
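If a scripted route is acceptable, the decompress-and-read steps can also be sketched in Python using only the standard library. This is a minimal sketch, not something Eurostat documents: the function name read_tsv_gz and the UTF-8 encoding are assumptions, and the file name in the usage note is simply the one from the download URL.

```python
import csv
import gzip


def read_tsv_gz(source, max_rows=5):
    """Return the first max_rows rows of a gzip-compressed, tab-separated file.

    source may be a path or an open binary file object.
    The UTF-8 encoding is an assumption about the file's contents.
    """
    rows = []
    # "rt" decompresses and decodes in one step, so no separate unzip is needed.
    with gzip.open(source, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            rows.append(row)
            if len(rows) >= max_rows:
                break
    return rows
```

Calling read_tsv_gz("demo_gind.tsv.gz") on the file as downloaded would return the header row and the first few data rows as lists of strings.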

Related

How to Create a Program Which Searches for Values from a .txt or any Text Document in Specific Folders

I am relatively new to programming and want to create a program which can solve a problem that I frequently have.
So here's the background to my short story: I was on a website which hosted many files (we're talking about around 500-1000 small files). I was then like, "Oh sweet! I want to have all these things on my hard drive so I know that I have access to them... but am probably not going to use them either way". I proceeded to download all 500-1000 files on that site, but encountered a problem when I looked at the properties of my destination folder. Let's say that out of 500 on the site, my computer only had 499 files. Just my luck. I wanted to know which pesky file had slipped right by me and download that file specifically. What I didn't want to do was delete all the files and then try my luck once more at downloading all the files from the website. On the site, there was no indication of which files I had downloaded, so I was completely in the dark. I could Ctrl+C each item, then Ctrl+V it into the file manager search bar, but it would be tedious to repeat that 500 times.
Now, what I want to do: I wanted to go ahead and take all of the file names from the website (The file name that I downloaded and the file name that was in my drive are the same), put them all in a simple .txt document or something (The website has multiple unwanted text alongside the text I need, such as:
. If this is not possible to extract the text from the site like this, then I am ok with manually entering the names via copy paste). Then I want the computer to take these values in the document and then search for it in a specific folder path (Note: the actual files are in subfolders within the root folder I want to choose, so the program has to be able to search within multiple folders of the root). Then I want the computer to know if the value in the document, is present as a file. If the file doesn't exist, then I want that value/those values in the document to be displayed as the output. I want this cycle to repeat until all the values have been gone through. The output should list the values that were not present.
Conclusion: You probably now get what I am trying to do; if you don't, tell me what I need to elaborate on. I really don't care how this program is made (what language or software), I just want something that works... but I don't know how to create it myself.
Thanks for reading and any response is appreciated!
Dhanwanth P :)
Here's a solution in Python in case you would like to explore...
Similar to what you described, all files from the website are listed in an Excel file 'website_files.xlsx'.
All downloaded files are saved in a folder 'downloaded_wav'. The script works regardless of whether the files are saved in the root directory or in sub-folders.
Then I run the Python script below to look for the missing file:
import os
import pandas as pd
path_folder = 'C:\\Users\\Admin\\Downloads\\downloaded_wav'
downloaded_files = []
d, m = 0, 0  # d counts downloaded files, m counts missing files
for path_name, subfolders, files in os.walk(path_folder):  # include all subfolders
    for file in files:
        d += 1
        downloaded_files.append(file)
df = pd.read_excel('website_files.xlsx')
for file in df.values:  # each row of the sheet holds one file name
    if file not in downloaded_files:
        print('MISSING', file)
        m += 1
print(len(df), 'files on website')
print(d, 'files downloaded')
print(m, 'missing file(s) found')
Output:
MISSING ['OLIVER_snare_disco_mixready_hybrid.wav']
3 files on website
2 files downloaded
1 missing file(s) found
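Since the question mentions keeping the names in a simple .txt document, the same comparison can also be sketched without pandas or Excel. This is a minimal sketch under assumptions: the function names find_missing and read_names are hypothetical, and the list file is assumed to hold one file name per line.

```python
import os


def read_names(list_path):
    """Read one expected file name per line, skipping blank lines."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def find_missing(expected_names, root_folder):
    """Return the expected names not found anywhere under root_folder."""
    present = set()
    # os.walk visits the root folder and every subfolder beneath it.
    for _dirpath, _dirnames, filenames in os.walk(root_folder):
        present.update(filenames)
    # A set makes each membership check fast, which helps with long lists.
    return [name for name in expected_names if name not in present]
```

Something like find_missing(read_names("website_files.txt"), path_folder) would then list the names to download again; "website_files.txt" is a placeholder name.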
No worries; I found a solution by myself using Excel (God, it's powerful!).
Basically, I copied and pasted my values from the website, then used a filter to show the values only with .wav. Then I used a Power Query from the folder to get me a list of all names of files in a folder. Finally, I went ahead and compared the two using a formula:
=IF(COUNTIF(B:B,D1),"OK","MISSING")
(entered in the first row and filled down). If you need more elaboration, I'd be happy to help, just reply to this. There might be an easier way, but I personally liked the straightforwardness of this. You only need Microsoft Excel!
EDIT:
For me, I used these two videos which go over the power query and countif function:
How to Get the List of File Names in a Folder in Excel (without VBA): https://www.youtube.com/watch?v=OSCPVBWOqwc
How to Compare Two Excel Sheets (and find the differences): https://www.youtube.com/watch?v=8Ou_wfzcKKk
In my case, I made my sheet look like this:

Modify the content of an MS Word file contained inside a .zip file, without extracting it?

Is it possible to manipulate the content of an MS Word file contained inside a .zip file, without extracting it?
I have 2,000 zip files containing Word files. I need to modify the same field in each of the 2,000 zipped MS Word files. Is this possible without extracting the file first?
Yes it is possible, but the difference is semantics. When I do this, with single documents, I COPY (not extract) the xml file from the zip container, edit as required, and then OVERWRITE back into the zip container.
I've also tried to edit the file from within the zip file, but it can't be saved directly (at least not the way I have tried), so a Save As (for example in Notepad++) would be required.
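For 2,000 archives, the copy, edit and overwrite cycle described above can be sketched in Python without writing anything to disk in between. Python's zipfile module does not replace a member in place, so this sketch rebuilds the archive in memory instead. It assumes the Word files are .docx (which are themselves zip containers holding word/document.xml); the function names are hypothetical, and a plain byte replacement will miss text that Word has split across several runs.

```python
import io
import zipfile


def rewrite_member(zip_bytes, member_name, transform):
    """Return new archive bytes with member_name replaced by transform(old_bytes)."""
    out_buffer = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out_buffer, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member_name:
                data = transform(data)
            # Passing the ZipInfo keeps each entry's name, timestamp and compression type.
            dst.writestr(info, data)
    return out_buffer.getvalue()


def edit_docx_in_zip(outer_bytes, docx_name, old, new):
    """Replace the bytes old with new in word/document.xml of a .docx inside a zip."""
    def edit_xml(xml_bytes):
        return xml_bytes.replace(old, new)

    def edit_docx(docx_bytes):
        return rewrite_member(docx_bytes, "word/document.xml", edit_xml)

    return rewrite_member(outer_bytes, docx_name, edit_docx)
```

Reading each .zip with open(path, "rb"), passing the bytes through edit_docx_in_zip and writing the result to a new file keeps the originals untouched until the output has been checked.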

Preventing other application from opening custom file vb.net

I have a text file. Now I have changed its file type from .txt to .abc. My VB.NET program loads the text into textboxes from that file. After changing the file type, however, other apps like NotePad and Word are able to open and read my .abc file.
Is there any way that only my application will be able to open/read from the file and no other app would be able to do so? What I mean is, suppose I have a Photoshop document (.psd file): no app other than Photoshop itself can open it. How do I make my file unreadable by other apps?
There is no way to prevent an app that you don't develop from opening any file. The extensions are just there for helping us humans, and maybe a bit for the computer to know the default app you select for an extension.
Like you said, a .txt file can be opened by many many apps. You can open a .txt file with Notepad, Firefox, VSCode, and many others.
Same way, a .psd file can be opened by many many apps. You can open that .psd file with Photoshop, but also Notepad, Firefox, and VSCode, and probably the same apps as above.
The difference is which apps can read and understand the file.
In order to make a file not understandable by other apps, you need to put it into a format that they cannot recognize, because you designed it "in secret".
Like Visual Vincent said above, you could encrypt the file in some way, or you could use a binary format that basically only your app knows how to understand.
If you don't own the app you want the file to be understood by, then you either have to accept that it can be opened by any app that can open files, or you can protect the file outside the app, for example by encrypting it or zipping it with a password, and then decrypting or unzipping it when you want to use it.
Firstly, any file can be read unless it is still open by a particular process or service. Even PhotoShop files can be 'read' by NotePad - try it!
So, an attempt at my first answer...
You can try a couple of methods to prevent opening the file, for instance, applying a file lock. As an example, SQL Server .mdf files are locked by the SQL Server service. This happens because the files are maintained in an open state; however, your application would have to remain running to keep these files open. Technically, though, the files can still be copied.
Another way is to set the hidden attribute for the file. This hides the file from less savvy users, but it will be displayed if the user shows hidden files.
And my second answer: You refer to the format of files by saying only PhotoShop can read or write its own files (not true, but I know what you're saying).
The format of the file must be decided by yourself. You must determine how you are going to store the data that you output from your application. It looks like you have been attempting to write your application data into a text file. Perhaps you should try writing to binary files instead. Binary files, while not encrypted, as suggested by Visual Vincent in the comments to your question, still provide a more tailored approach to storing your data.
Binary files write raw binary data instead of humanised text. For instance, if you write an integer to the file it will appear as a string of four bytes, not your usual 123456789 textual format.
So, you really need to clarify what data you want to write to the file, decide on a set structure to your file (as you also have to be able to read it back in to your application) and then be able to write the information.
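To make the text-versus-binary point concrete, here is a small sketch in Python rather than VB.NET, using the standard struct module. The little-endian, signed 32-bit layout ("<i") is an assumption chosen for the illustration; any fixed layout works as long as the reader and the writer agree on it.

```python
import struct

value = 123456789

# Text form: one ASCII character per digit, readable in any editor.
as_text = str(value).encode("ascii")

# Binary form: the same number packed as a 4-byte little-endian signed integer.
as_binary = struct.pack("<i", value)

print(len(as_text), as_text)
print(len(as_binary), as_binary.hex())

# Reading it back requires knowing the layout that was used for writing.
restored = struct.unpack("<i", as_binary)[0]
```

Note that this only makes the file harder to read casually; anyone who works out the layout can still decode it.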

Method to inspect first 4 bytes and rename file extension

I have a large batch of assorted files, all missing their file extension.
I'm currently using Windows 7 Pro. I am able to "open with" and experiment to determine what application opens these files, and rename manually to suit.
However I would like some method to identify the correct file type (typically PDF, others include JPG, HTML, DOC, XLS and PPT), and batch rename to add the appropriate file extension.
I am able to open some files with notepad and review the first four bytes, which in some cases shows "%PDF".
I figure a small script would be able to inspect these bytes, and rename as appropriate. However not all files give such an easy method. HTML, JPG, DOC etc do not appear to give such an easy identifier.
This Powershell method appears to be close: https://superuser.com/questions/186942/renaming-multiple-file-extensions-based-on-a-condition
Difficulty here is focusing the method to work on file types with no extension; and then what to do with the files that don't have the first four bytes identifier?
Appreciate any help!!
EDIT: Solution using TriD seen here: http://mark0.net/soft-trid-e.html
And recursive method using Powershell to execute TriD here: http://mark0.net/forum/index.php?topic=550.0
You could probably save some time by getting a file utility for Windows (see What is the equivalent to the Linux File command for windows?) and then writing a simple script that maps from file type to extension.
EDIT: Looks like the TriD utility that's mentioned on that page can do what you want out of the box (see the -ae and -ce options).
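Use Python 3.
import os, re
fldrPth = "path/to/folder"  # folder that holds the extensionless files
os.chdir(fldrPth)
for i in os.listdir():
    if not os.path.isfile(i) or os.path.splitext(i)[1]:
        continue  # skip folders and files that already have an extension
    with open(i, 'rb') as doc:  # binary mode, so non-text files can be read too
        st = doc.read(4).decode('ascii', errors='ignore')
    m = re.search(r'\w+', st)
    if m:  # rename after the file is closed, and only if letters were found
        os.rename(i, i + '.' + m.group())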
Hopefully this would work.
I don't have test files to check the code. Take a backup and then run it and let me know if it works.
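For the file types that do not start with a readable marker, a lookup on well-known leading bytes can go a little further. The sketch below is an illustration under assumptions, not a replacement for TriD: the function name guess_extension is hypothetical, the HTML check is a heuristic, and legacy DOC, XLS and PPT files all share the same OLE2 signature, so they cannot be told apart from the header alone and are left unresolved here.

```python
# Leading bytes that identify a format on their own.
SIGNATURES = [
    (b"%PDF", ".pdf"),
    (b"\xff\xd8\xff", ".jpg"),
]

# Shared by legacy DOC, XLS and PPT files, so it does not name one extension.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def guess_extension(header):
    """Guess an extension from the first bytes of a file, or return None."""
    for magic, extension in SIGNATURES:
        if header.startswith(magic):
            return extension
    # Heuristic: HTML usually begins with a doctype or an <html> tag.
    text = header.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return ".html"
    # OLE2 files and anything unrecognised are left for a tool such as TriD.
    return None
```

Reading the first 32 bytes or so of each file in binary mode and passing them to guess_extension gives enough data for every check above.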

Creating an ics file from data on a PDF file

I'm looking for a way to convert a PDF document into multiple ics files that staff can use to add their fortnightly roster to their smartphone calendars or the Outlook calendar on their desktops. The information required to create the multiple files would be pulled from the PDF by searching for selected initials in each column and then referencing data from the same row as the initials. Is there a particular order the data needs to appear in within the ics file to allow it to be imported into a smartphone calendar?
You can search for PDF APIs for more details on handling a PDF programmatically.
Here are some online converters that could help. They convert a PDF into a Word document:
http://www.pdftoword.com/success.aspx
http://www.pdfescape.com/account/?expired
However, reconstructing structured data from a PDF is not trivial, because a program has to deduce the semantics from the layout. So most programs can only restore scattered data from a PDF.
I've done this with Perl, using the Windows Adobe PDF viewer to highlight all the text in the PDF and then cutting and pasting it into a text file. As the previous answer said, you have to write Perl (or any other text-processing language) to pick out the format of the PDF you have. Then you can print it with Perl to csv or to ical or whatever format you want. I've shared my code on github.com. I'm not sure if you know Git, but send me a private message if you want me to send the Perl code outside of Git.
The PDFs I've converted are here:
http://recplexonline.com/sports/hockey/old-geezers-hockey-35
The GitHub repository with my Perl code and the input files I used is here:
https://github.com/jdeltoft/PdfParse
It's pretty ugly Perl, sorry for that. But it works. I'll try to clean it up soon.
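On the ics side of the question: in the iCalendar format (RFC 5545), what matters most is the nesting of the BEGIN/END lines and the presence of the required properties, rather than the order of properties inside an event. A minimal single-event file can be sketched in Python as below. The function name make_ics and the PRODID value are hypothetical, all times are assumed to be in UTC, and the summary is assumed to contain no characters that need escaping (such as commas or semicolons).

```python
from datetime import datetime, timezone


def make_ics(uid, summary, start, end, stamp):
    """Build a minimal one-event iCalendar document from UTC datetimes."""
    fmt = "%Y%m%dT%H%M%SZ"  # the trailing Z marks the time as UTC
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//roster-sketch//EN",  # placeholder product identifier
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp.strftime(fmt)}",
        f"DTSTART:{start.strftime(fmt)}",
        f"DTEND:{end.strftime(fmt)}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"
```

One such file per staff member, with one VEVENT block per shift, would match the multiple-files plan described in the question.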