Extracting text from several images

Extracting text from several images - cycle

I want to extract text from several images.
I want to do it in Colab.
I know how to do it with one image: https://github.com/bhadreshpsavani/ExploringOCR/blob/master/OCRusingTesseract.ipynb
But how do I do it in a loop, since I have more than a hundred pictures?
Thanks in advance!

I uploaded my images to the root directory in Colab and solved the task with the following code:
import os
import pytesseract
from PIL import Image

image_ext = ['.jpg', '.png', '.jpeg']
directory = '/'
for file in os.listdir(directory):
    # Skip anything that is not an image
    ext = os.path.splitext(file)[-1].lower()
    if ext not in image_ext:
        continue
    filename = os.path.join(directory, file)
    # Run OCR on the image and print the recognized text
    extracted_information = pytesseract.image_to_string(Image.open(filename))
    print(extracted_information)
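
With more than a hundred pictures it may be handy to keep the results per file instead of only printing them. The following is a minimal sketch of that idea, not part of the original answer: `list_images` and `ocr_folder` are hypothetical helper names, and the OCR step is passed in as a function so the file handling can be tried without Tesseract installed.

```python
import os

IMAGE_EXT = {'.jpg', '.png', '.jpeg'}

def list_images(directory, extensions=IMAGE_EXT):
    """Return the sorted full paths of files in `directory` that have an image extension."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.splitext(name)[-1].lower() in extensions
    )

def ocr_folder(directory, ocr):
    """Map each image path in `directory` to the text returned by `ocr(path)`."""
    return {path: ocr(path) for path in list_images(directory)}
```

In Colab, the `ocr` argument could be something like `lambda p: pytesseract.image_to_string(Image.open(p))`, assuming `pytesseract` and `PIL.Image` are imported as in the answer above.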

Related

How to rename multiple files from the multiple text files?

My goal is the following.
I am using Windows 10 and I have files like this:
folder
2020-04-23_19-30-52_UTC.mp4
2020-04-23_19-30-52_UTC.txt which contains string "This video is me at a wedding"
2020-05-25_19-30-52_UTC.mp4
2020-05-25_19-30-52_UTC.txt which contains string "This video is dogwalk at the sunset"
Each .txt file contains the desired name of the .mp4 with the same timestamp, and I want to end up with this:
folder
This video is me at a wedding.mp4
2020-04-23_19-30-52_UTC.txt
This video is dogwalk at the sunset.mp4
2020-05-25_19-30-52_UTC.txt
There are a few ways to achieve this, but I am not that good with coding. My only priority is to get it done, and for now I am not limited to any particular tool or programming language.
Thanks

I'd tackle this problem with Python.
import os

folder = '[path to original folder]'
files = os.listdir(folder)
# Iterate through all the files in the folder
for path in files:
    filetype = path[-4:]  # Grabs the last 4 characters of the filename
    # Checks if it's a text file
    if filetype == '.txt':
        f = open(os.path.join(folder, path), "r")  # Open the text file
        new_name = f.read().strip()  # Grab the description, dropping any trailing newline
        f.close()  # Close the text file
        new_name = new_name + '.mp4'  # Add the proper file type
        path = path[:-4]  # Throws away the last 4 characters of the filename
        path = path + '.mp4'  # Add the proper file type
        os.rename(os.path.join(folder, path), os.path.join(folder, new_name))  # Rename
If any more issues arise, please let me know so I can help.
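
A somewhat more defensive variant can be sketched with `pathlib`. This is an illustration under assumptions rather than the answerer's method: `rename_videos` is a hypothetical name, the descriptions are assumed to be UTF-8 text, and characters that Windows does not allow in file names are replaced with underscores. Descriptions without a matching video, empty descriptions, and names that already exist are skipped.

```python
import re
from pathlib import Path

def rename_videos(folder):
    """Rename each X.mp4 to the text stored in X.txt and return the new file names."""
    renamed = []
    for txt in sorted(Path(folder).glob('*.txt')):
        video = txt.with_suffix('.mp4')
        if not video.exists():
            continue  # no matching video for this description
        title = txt.read_text(encoding='utf-8').strip()
        # Replace characters that Windows does not allow in file names
        title = re.sub(r'[\\/:*?"<>|]', '_', title)
        if not title:
            continue  # empty description, nothing to rename to
        target = video.with_name(title + '.mp4')
        if target.exists():
            continue  # never overwrite an existing file
        video.rename(target)
        renamed.append(target.name)
    return renamed
```

It is worth running this on a copy of the folder first, since renames are not undone automatically.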

Changing the value/file of output_stream dynamically in tf.print?

I am doing some processing on WAV audio files using TensorFlow
and saving them using tf.print with the output_stream option.
pcm = contrib_audio.encode_wav(processed_audio, 16000)
tf.print(pcm, output_stream="file:///tmp/test.wav", summarize=-1)
The problem is that I am not able to change the value of /tmp/test.wav dynamically
so that multiple WAV files are stored.

Kindly refer to the code below.
# Using a counter
for i in range(1, 10):
    fname = "test_" + str(i) + ".wav"  # filename
    path = "//content/sample_data/"  # path to save
    fname = "file://{path}{fname}".format(fname=fname, path=path)
    tf.print(pcm, output_stream=fname, summarize=-1)  # pcm is the encoded audio from the question
You can build the filename dynamically so that each one is unique.
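
The naming step on its own can be sketched as a small helper that returns the `file://` strings to pass as `output_stream`. The function name `wav_stream_urls` and its parameters are hypothetical, and no TensorFlow call is made here.

```python
import posixpath

def wav_stream_urls(directory, count, prefix="test_"):
    """Return file:// strings for <prefix>1.wav through <prefix><count>.wav in `directory`."""
    return [
        "file://" + posixpath.join(directory, f"{prefix}{i}.wav")
        for i in range(1, count + 1)
    ]
```

Each returned string can then be used as the `output_stream` value for one iteration of the loop.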

Read contents of .gz file with python

I'm new to Python and am running into issues reading the contents of a .gz file:
I've got a folder full of .gz files that I've downloaded programmatically using a private API. Each .gz file contains a single .xml file, so I need to iterate over the directory and extract them.
The problem appears when I programmatically extract these .gz files into their respective .xml versions. The files are created without error, and when I open one (using TextWrangler) it looks like a regular .xml file, but NOT when I view it in a hex editor. Also, when I open the .xml file programmatically and print its contents, it shows up as a bunch of (binary?) jumbled text.
With the above in mind, if I manually extract one of the files (i.e. using OS X, not Python), the file is viewable in a hex editor as I'd expect it to be.
Here is my code snippet (appropriate imports not shown, but they are glob and gzip):
searchpattern = siteid + "_" + resource + "_*.gz"
for infile in glob.glob(workingDir + searchpattern):
    print infile
    #read the zipped contents (https://docs.python.org/2/library/gzip.html)
    f = gzip.open(infile, 'rb')
    file_content = f.read()
    file_content = str(file_content) #This was an attempt to fix
    print file_content # This shows a bunch of mumbo jumbo
    #write the contents we just read to a new file (uncompressed)
    newfilename = infile[0:-3] # the filename without the ".gz"
    newfilename = newfilename + ".xml"
    fnew = open(newfilename, 'w+b')
    fnew.write(str(file_content))
    fnew.close()
    #delete the .gz version of the file
    #os.remove(infile)

If I run this against XML, I don't get any issues with the program.
If I compress an XML file, extract it with this program, and diff the original against the output, I get no differences.
This program does add an extra ".xml" extension.

So this turns out to be a silly mistake on my part, but I'll post this as a follow-up for anybody else who makes the same mistake I did.
The problem was that I was compressing data that had already been compressed earlier in my program. With that in mind, my code snippet in this thread didn't have anything wrong with it, and (technically) neither did the code that created the .gz file. As you can see below, opening the file normally earlier in the program, instead of with the gzip library, did the trick.
#Download and write the contents of each response to a .gz file
if limitCounter < limit or int(limit) == 0:
    print _name + " " + scopeStartDate + " through " + scopeEndDate + " at " + href
    file = api.get(href)
    gz_file_content = file.content
    #gz_file = gzip.open(workingDir + _name, "wb") # This breaks the program later
    gz_file = open(workingDir + _name, 'wb') # This works.
    gz_file.write(gz_file_content)
    gz_file.close()
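
The effect of compressing twice can be reproduced in a few lines. This is a Python 3 sketch for illustration, not the original program: the sample XML and the helper name `gunzip_fully` are made up, and the check relies on the two-byte gzip magic number `1f 8b`. After one decompression of doubly compressed data, the result is still gzip data, which is why it looks like binary noise.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def gunzip_fully(data):
    """Decompress until the payload no longer starts with the gzip magic bytes.

    Returns the final payload and the number of gzip layers that were removed.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
        layers += 1
    return data, layers

xml = b"<root><item>1</item></root>"
once = gzip.compress(xml)
twice = gzip.compress(once)  # what writing gzip bytes through gzip.open(..., "wb") produces

# One round of decompression still yields gzip data, not XML
still_compressed = gzip.decompress(twice)[:2] == GZIP_MAGIC
```

Checking the first two bytes of an extracted file in a hex editor is a quick way to spot this situation.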

How does one load some variables at runtime in Photoshop Script?

I have about 200 folders with X images in each of them.
I have a master script in the root folder that does some stuff to the images.
Each folder has some variables specific to it and its contents.
When my master script processes folder Y, I want it to load some sort of config file from within folder Y to get those variables; then, when folder Z is processed, it should load the config file from that folder.
I know of #include "config.jsx", which I use at the moment, but it is loaded at the beginning of the script. I need something dynamic, and it doesn't need to be a .jsx file at all.

I store all my parameters in XML format and read them in using the XML objects in ExtendScript. As long as your parameters file is always named something like 'config.xml', it is easily located.
var file = new File("/c/folder/file.xml"); // the path must be a quoted string
file.open("r");
var str = file.read();
file.close();
var xml = new XML(str);

Rails 3: Can't uncompress zip after compressing it in

I want to compress some files in Ruby on Rails and save the zip file in the tmp folder. I've got a Document model which has a name field with an associated uploader. I'm also using Carrierwave to upload files to Amazon S3. I've got the following code:
class Document < ActiveRecord::Base
  mount_uploader :name, DocumentUploader
  ...
end

def create_zip
  documents = Document.all
  folder = "#{Rails.root}/tmp"
  tmp_filename = "#{folder}/export.zip"
  zip_path = tmp_filename
  Zip::ZipFile::open(zip_path, true) do |zipfile|
    documents.each do |document|
      zipfile.get_output_stream(document.name.identifier) do |io|
        io.write document.name.file.read
      end
    end
  end
end
This creates an export.zip file in my tmp folder, but when I try to open it, Archive Manager (Mac OS X) begins unarchiving it and never finishes. I believe there's something missing from my code. The zip file size does make sense to me, but I still have that problem. Any thoughts? Thanks!

Actually, I found out I could open the zip file using another program (Zipeg). However, only the last file from the documents array was in the compressed file. I believe I had been overwriting the previous files, because every entry had the same name (export, like the zip file itself).
The code below works for me:
def create_zip
  documents = Document.all
  folder = "#{Rails.root}/tmp"
  tmp_filename = "#{folder}/export.zip"
  zip_path = tmp_filename
  Zip::ZipOutputStream.open(zip_path) do |zos|
    documents.each do |document|
      # Use each document's own identifier as the entry name
      path = document.name_identifier
      zos.put_next_entry(path)
      zos.write document.name.file.read
    end
  end
end
