Using string output from pytesseract to do a vlookup in pandas dataframe

I'm very new to Python, and I'm trying to make a simple image-to-song-title-to-BPM program. My approach is to use pytesseract to generate a string, and then use that string to do a vlookup-style match in a dataframe created by pandas. However, it always returns zero even though the song does exist in the data.
import PIL.ImageGrab
from PIL import ImageGrab
import numpy as np
import pytesseract
import pandas as pd
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
def getTitleImage(left, top, width, height):
    printscreen_pil = ImageGrab.grab((left, top, left + width, top + height))
    printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \
        .reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3))
    return printscreen_numpy
# Printscreen:
titleImage = getTitleImage(x, y, w, h)
# pytesseract to string:
songTitle = pytesseract.image_to_string(titleImage)
print('Name of the song: ', songTitle)
# Importing the csv data via pandas.
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
# A simple vlookup-style filter that returns the BPM of the song by taking data from the same row.
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 0
However, when I hard-code the string into the songTitle variable, it works:
songTitle = 'Macarena'
print('Name of the song: ', songTitle)
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 103
I have checked the string generated by pytesseract: it has no extra space at the front or the back and looks identical to the hard-coded string, yet they still produce different results. What could be the problem?

I found the answer.
It is because the songTitle coming from:
songTitle = pytesseract.image_to_string(titleImage)
...is actually 'Macarena\n' rather than 'Macarena'.
The two look identical when printed, except that the former adds a new line after the title.
A great lesson learned for me.
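A minimal fix, assuming the trailing newline is the only artifact in the OCR output, is to strip surrounding whitespace before the lookup:
# .strip() removes leading/trailing whitespace, including the trailing '\n'
songTitle = pytesseract.image_to_string(titleImage).strip()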

Related

Tensor to Dataframe for each sentence

For a 6 class sentence classification task, I have a list of sentences for which I retrieve the raw output values (logits) before the softmax is applied. Example list of sentences:
s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']
I get the values the following way:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    print(output.logits.detach().numpy())
# returns [[-0.8390876 2.9480567 -0.5134539 0.70386493 -0.5019671 -2.619496 ]]
#[[-0.8847909 -0.9642067 -2.2108874 -0.43932158 4.3386173 -0.37383893]]
#[[-0.48750368 3.2949197 2.1660519 -0.6453249 -1.7101991 -2.817954 ]]
How do I create a data frame with columns sentence, class_1, class_2, class_3, class_4, class_5, class_6, to which I append each new sentence and its values iteratively (or in some more optimal way)? What would be the best way?
Expected output:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
If I only had one sentence, I could transform it to a data frame like this, but I would still need to append the sentence somehow
sentence = tokenizer("Love is in the air", return_tensors="pt")
output = model(sentence["input_ids"])
px = pd.DataFrame(output.logits.detach().numpy())
Maybe creating two separate data frames and then appending them would be one plausible way of doing this?
Save the model outputs in a list and then create the dataframe from a dict:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
outputs = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    outputs.append(output.logits.detach().numpy()[0])
# convert to one numpy array
outputs = np.array(outputs)
# create dataframe
obj = {"sentence": s}
for class_id in range(outputs.shape[1]):
    # get the data column for that class
    obj[f"class_{class_id}"] = outputs[:, class_id].tolist()
df = pd.DataFrame(obj)
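As a small aside, inference-only loops like this are usually wrapped in torch.no_grad() so PyTorch skips gradient bookkeeping; a sketch of the same collection loop under that assumption:
import torch

outputs = []
with torch.no_grad():
    # no gradients are tracked here, so .detach() is unnecessary
    for i in s:
        sentence = tokenizer(i, return_tensors="pt")
        output = model(sentence["input_ids"])
        outputs.append(output.logits.numpy()[0])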
I managed to come up with a solution and am posting it, as someone might find it useful.
The idea is to initialize a data frame and append the values for every sentence while iterating:
absolute_vals = pd.DataFrame()
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    px = pd.DataFrame(output.logits.detach().numpy())
    absolute_vals = absolute_vals.append(px, ignore_index=True)
absolute_vals
Returns:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
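One caveat with this solution: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the loop above raises an AttributeError. A sketch of the same idea that collects the per-sentence frames in a list and concatenates once at the end:
frames = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    frames.append(pd.DataFrame(output.logits.detach().numpy()))
absolute_vals = pd.concat(frames, ignore_index=True)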

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in PDF and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc. on them. So I converted them into .txt files and tried to get tf-idf scores, reusing code I had previously written for tf-idf on a CSV file with some changes, but the attempt was unsuccessful.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code works up to the 'Non Zero Count' print statement, printing:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it is giving error:
ZeroDivisionError: division by zero
Even if I run this code ignoring the ZeroDivisionError, it is still wrong, because no frequencies are being counted.
I have no idea how to work with .txt files here. What is the proper way to work on .txt files for NLP tasks?
Thanks in advance!
You are getting the error because the data variable is a file object, not the file's contents. Iterating over a file yields its lines exactly once: cvec.fit(data) exhausts the iterator while building the vocabulary, so the later cvec.transform(data) sees no documents at all, which is why the sparse matrix has shape (0, 7132) and the sparsity calculation divides by zero. Read the contents into a string variable and do the preprocessing on that. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
    data = file.read()
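One more caveat, since it is easy to trip over next: CountVectorizer expects an iterable of documents, and passing it a single string raises a ValueError in current scikit-learn. Also, with only one document, the max_df=.5 setting would prune every term. One reasonable choice, sketched here with the question's variable names, is to treat each paragraph of the book as a document:
with open('jungle book.txt', 'r') as file:
    data = file.read()
# CountVectorizer wants an iterable of documents, not one big string;
# splitting on blank lines treats each paragraph as a document
docs = [p for p in data.split('\n\n') if p.strip()]
cvec.fit(docs)
cvec_count = cvec.transform(docs)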

Unable to reload data as a csv file from IPython Notebook

I have the following IPython Notebook, in which I am trying to access a database of movies from the Rotten Tomatoes website.
However, Rotten Tomatoes limits users to 10,000 API requests a day, so I don't want to re-run the fetching code every time I restart the notebook; instead I am trying to save and reload the data as a CSV file. When I convert the data to a CSV file, the cell just shows the in-progress symbol [*] in the IPython Notebook, and after some time I get the following error:
ConnectionError: HTTPConnectionPool(host='api.rottentomatoes.com', port=80): Max retries exceeded with url: /api/public/v1.0/movie_alias.json?apikey=5xr26r2qtgf9h3kcq5kt6y4v&type=imdb&id=0113845 (Caused by <class 'socket.gaierror'>: [Errno 11002] getaddrinfo failed)
Is this problem due to a slow internet connection, or should I make some changes to my code? Kindly help me with this.
The code for the file is shown below:
%matplotlib inline
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
api_key = '5xr26r2qtgf9h3kcq5kt6y4v'
movie_id = '770672122' # toy story 3
url = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json' % movie_id
#these are "get parameters"
options = {'review_type': 'top_critic', 'page_limit': 20, 'page': 1, 'apikey': api_key}
data = requests.get(url, params=options).text
data = json.loads(data) # load a json string into a collection of lists and dicts
print json.dumps(data['reviews'][0], indent=2) # dump an object into a json string
from io import StringIO
movie_txt = requests.get('https://raw.github.com/cs109/cs109_data/master/movies.dat').text
movie_file = StringIO(movie_txt) # treat a string like a file
movies = pd.read_csv(movie_file,delimiter='\t')
movies
#print the first row
movies[['id', 'title', 'imdbID', 'year']]
def base_url():
    return 'http://api.rottentomatoes.com/api/public/v1.0/'
def rt_id_by_imdb(imdb):
    """
    Queries the RT movie_alias API. Returns the RT id associated with an IMDB ID,
    or raises a KeyError if no match was found
    """
    url = base_url() + 'movie_alias.json'
    imdb = "%7.7i" % imdb
    params = dict(id=imdb, type='imdb', apikey=api_key)
    r = requests.get(url, params=params).text
    r = json.loads(r)
    return r['id']
def _imdb_review(imdb):
    """
    Query the RT reviews API, to return the first page of reviews
    for a movie specified by its IMDB ID
    Returns a list of dicts
    """
    rtid = rt_id_by_imdb(imdb)
    url = base_url() + 'movies/{0}/reviews.json'.format(rtid)
    params = dict(review_type='top_critic',
                  page_limit=20,
                  page=1,
                  country='us',
                  apikey=api_key)
    data = json.loads(requests.get(url, params=params).text)
    data = data['reviews']
    data = [dict(fresh=r['freshness'],
                 quote=r['quote'],
                 critic=r['critic'],
                 publication=r['publication'],
                 review_date=r['date'],
                 imdb=imdb, rtid=rtid
                 ) for r in data]
    return data
def fetch_reviews(movies, row):
    m = movies.irow(row)
    try:
        result = pd.DataFrame(_imdb_review(m['imdbID']))
        result['title'] = m['title']
    except KeyError:
        return None
    return result
def build_table(movies, rows):
    dfs = [fetch_reviews(movies, r) for r in range(rows)]
    dfs = [d for d in dfs if d is not None]
    return pd.concat(dfs, ignore_index=True)
critics = build_table(movies, 3000)
critics.to_csv('critics.csv', index=False)
critics = pd.read_csv('critics.csv')
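Regarding the goal of not re-fetching after every restart: one common pattern, sketched here assuming critics.csv lives next to the notebook, is to hit the API only when the cached file is missing. (The getaddrinfo failure itself is a DNS/network error, so it points at the connection rather than the pandas code.)
import os

if os.path.exists('critics.csv'):
    critics = pd.read_csv('critics.csv')    # reload the cached table
else:
    critics = build_table(movies, 3000)     # expensive API calls
    critics.to_csv('critics.csv', index=False)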

How do I import xyz and roll/pitch/yaw from csv file to Blender?

I want to know if it is possible to import attitude and position data (roll/pitch/yaw & xyz) from a comma-separated file into Blender.
I recorded data from a little RC car, and I want to represent its movement in a 3D world.
I have timestamps too, so if there's a way to animate the movement of the object, that would be superb!
Any help will be greatly appreciated!!
Best Regards.
A slight modification, making use of the csv module:
import bpy
import csv
position_vectors = []
filepath = "C:\\Work\\position.log"
csvfile = open(filepath, 'r', newline='')
ofile = csv.reader(csvfile, delimiter=',')
for row in ofile:
    position_vectors.append(tuple([float(i) for i in row]))
csvfile.close()
This will get your points into Blender. Note the delimiter parameter in csv.reader; change it according to your file. With a real example file from your RC car we could provide a more complete solution.
For blender v2.62:
If you have a file "positions.log" looking like:
-8.691985196313894e-002; 4.119284642631801e-001; -5.832147659661263e-001
1.037146774956164e+000; 8.137243553005405e-002; -5.703274929662892e-001
-3.602584527944123e-001; 8.378614512537046e-001; 2.615265921163826e-001
6.266465707681335e-001; -1.128416901202341e+000; -1.664644365541639e+000
3.327523280880091e-001; 4.488553740582839e-001; -2.449449085462368e+000
-7.311567199869298e-001; -1.860587923723032e+000; -1.297179602213110e+000
-7.453603745688361e-003; 4.770473577895327e-001; -2.319515785100494e+000
1.935170866863264e-001; -2.010280476717868e+000; 3.748000986190077e-001
5.201529166915653e-001; 3.952972788761738e-001; 1.658581747430548e+000
4.719198263774027e-001; 1.526020825619557e+000; 3.187088567866725e-002
you can read it with this Python script in Blender (watch out for the indentation!):
import bpy
from mathutils import *
from math import *
from bpy.props import *
import os
import time
# Init
position_vector = []
# Open file
file = open("C:\\Work\\position.log", "r")
# Loop over lines in file
for line in file:
    # Split line at ";"
    splittet_line = line.split(";")
    # Append new position
    position_vector.append(
        Vector((float(splittet_line[0]),
                float(splittet_line[1]),
                float(splittet_line[2]))))
# Close file
file.close()
# Get first selected object
selected_object = bpy.context.selected_objects[0]
# Move the object through the parsed positions
for position in position_vector:
    selected_object.location = position
This reads the file and updates the position of the first selected object accordingly. The way forward: what you have to find out is how to set the keyframes for the animation...
Consider this Python snippet to add to the solutions above; it assumes locRotArray holds rows of (x, y, z, rx, ry, rz, timestamp):
obj = bpy.context.object
temporalScale = bpy.context.scene.render.fps
for lrt in locRotArray:
    obj.location = (lrt[0], lrt[1], lrt[2])
    # radians, and do you want XYZ, or ZYX?
    obj.rotation_euler = (lrt[3], lrt[4], lrt[5])
    time = lrt[6] * temporalScale
    obj.keyframe_insert(data_path="location", frame=time)
    obj.keyframe_insert(data_path="rotation_euler", frame=time)
I haven't tested it, but it will probably work, and gets you started.
With a spice2xyzv file as the input file, the script written by "Mutant Bob" seems to work.
But the xyz velocity data are in km/s, not Euler angles, I think, and the import does not work for the angles.
# Records are <jd> <x> <y> <z> <vel x> <vel y> <vel z>
# Time is a TDB Julian date
# Position in km
# Velocity in km/sec
2456921.49775 213928288.518 -446198013.001 -55595492.9135 6.9011736 15.130842 0.54325805
Is there a solution to get them into Blender? Should I convert the velocities to Euler angles, and is that actually possible?
I use this script:
import bpy
from mathutils import *
from math import *
from bpy.props import *
import os
import time
# Init
position_vector = []
# Open file
file = open("D:\\spice2xyzv\\export.xyzv", "r")
obj = bpy.context.object
temporalScale = bpy.context.scene.render.fps
for line in file:
    # Split line at " "
    print("line = %s" % line)
    line = line.replace("\n", "")
    locRotArray = line.split(" ")
    print("locRotArray = %s" % locRotArray)
    print(locRotArray[1])
    obj.location = (float(locRotArray[1]), float(locRotArray[2]), float(locRotArray[3]))
    # radians, and do you want XYZ, or ZYX?
    obj.rotation_euler = (float(locRotArray[4]), float(locRotArray[5]), float(locRotArray[6]))
    time = float(locRotArray[0]) * temporalScale
    print("time = %s" % time)
    obj.keyframe_insert(data_path="location", frame=time)
    obj.keyframe_insert(data_path="rotation_euler", frame=time)

Empty outputs with python GDAL

Hello, I'm new to GDAL and I'm struggling with my code. Everything seems to go well, but the output band at the end is empty. The NoData value ends up set to 256 when I specify 255, so I don't really know what's wrong. Thanks, any help will be appreciated!
Here is my code
from osgeo import gdal
from osgeo import gdalconst
from osgeo import osr
from osgeo import ogr
import numpy
#graticule
src_ds = gdal.Open("E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\graticule1.tif")
band = src_ds.GetRasterBand(1)
band.SetNoDataValue(0)
graticule = band.ReadAsArray()
print('graticule done')
band="none"
#Biomass
dataset1 = gdal.Open("E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\Biomass_NFI.tif")
band1 = dataset1.GetRasterBand(1)
band1.SetNoDataValue(-1)
Biomass = band1.ReadAsArray()
maskbiomass = numpy.greater(Biomass, -1).astype(int)
print("biomass done")
Biomass="none"
band1="none"
dataset1="none"
#Baseline
dataset2 = gdal.Open("E:\\NFI_photo_plot\\Baseline\\TOTBM_250.tif")
band2 = dataset2.GetRasterBand(1)
band2.SetNoDataValue(0)
baseline = band2.ReadAsArray()
maskbaseline = numpy.greater(baseline, 0).astype(int)
print('baseline done')
baseline="none"
band2="none"
dataset2="none"
#sommation
biosource=(graticule+maskbiomass+maskbaseline)
biosource1=numpy.uint8(biosource)
biosource="none"
#Écriture
dst_file="E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\Biosource.tif"
dst_driver = gdal.GetDriverByName('GTiff')
dst_ds = dst_driver.Create(dst_file, src_ds.RasterXSize,
                           src_ds.RasterYSize, 1, gdal.GDT_Byte)
#projection
dst_ds.SetProjection( src_ds.GetProjection() )
dst_ds.SetGeoTransform( src_ds.GetGeoTransform() )
outband=dst_ds.GetRasterBand(1)
outband.WriteArray(biosource1,0,0)
outband.SetNoDataValue(255)
biosource="none"
graticule="none"
A few pointers:
Where you have ="none", these need to be = None to close/clean up the objects; otherwise you are just rebinding the names to the string "none", which is not what you intend to do.
Why do you have band1.SetNoDataValue(-1), while other NoData values are 0? Is this data source signed or unsigned? If unsigned, then -1 doesn't exist.
When you open rasters with gdal.Open without the access option, it defaults to gdal.GA_ReadOnly, which means your subsequent SetNoDataValue calls do nothing. If you want to modify the dataset, you need to use gdal.GA_Update as your second parameter to gdal.Open.
Another strategy to create a new raster is to use driver.CreateCopy; see the tutorial for details.
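To make the first and third pointers concrete, here is a minimal sketch of the end of the script, keeping the question's variable names: flushing and releasing the output dataset is what actually writes the GTiff to disk.
# If you need SetNoDataValue on an input to stick, open it with update access,
# e.g. src_ds = gdal.Open(src_path, gdal.GA_Update)  # src_path: your graticule1.tif
outband = dst_ds.GetRasterBand(1)
outband.SetNoDataValue(255)
outband.WriteArray(biosource1, 0, 0)
outband.FlushCache()
# Rebind to None (not the string "none") so GDAL closes the datasets
# and flushes everything to disk
outband = None
dst_ds = None
src_ds = None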