Have error by using pypandoc writing to file - pypandoc

I'm trying to write to file the string that i got after running pypandoc.
After writing to file im getting a bounch of UnicodeEncodeError's.
My question how can i write to file without any encoding.
To write string as is?
Thank you.
from tools.general import *
from definitions.url_Parse import *
from data import *
import spacy
import re
import pypandoc
url='https://groupprops.subwiki.org/w/index.php?title=1-automorphism-
invariant_subgroup&action=edit'
strng = url_Parse(url)
str_ = strng.string
output = pypandoc.convert_text(str_, 'plain', format = 'mediawiki')
file1=open('output.txt','w')
file1.write(output)
file1.close()
My string is:
A subgroup H of a group G is termed a 1-AUTOMORPHISM-INVARIANT subgroup
if any 1-automorphism of G sends H to itself. In other words, for every
1-automorphism φ of G, φ(H) ⊆ H.

Related

Using string output from pytesseract to do a vlookup in pandas dataframe

I'm very new to Python, and I'm trying to make a simple image to song title to BPM program. My approach is using pytesseract to generate a string output; and then, using that string output, I wish to vlookup in a dataframe created by pandas. However, it always return zero value even though that song does exist in the data.
import PIL.ImageGrab
from PIL import ImageGrab
import numpy as np
import pytesseract
import pandas as pd
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
def getTitleImage(left, top, width, height):
printscreen_pil = ImageGrab.grab((left, top, left + width, top + height))
printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \
.reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3))
return printscreen_numpy
# Printscreen:
titleImage = getTitleImage(x, y, w, h)
# pytesseract to string:
songTitle = pytesseract.image_to_string(titleImage)
print('Name of the song: ', songTitle)
# Importing the csv data via pandas.
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
# A simple vlookup formula that return the BPM of the song by taking data from the same row.
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 0
However, when I tried to forcefully provide the string to the songTitle variable, it works:
songTitle = 'Macarena'
print('Name of the song: ', songTitle)
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 103
I have checked the string generated from pytesseract: It has no extra space in the front or the back, totally identical to the forced string, but they still produce different results. What could be the problem?
I found the answer.
It is because the songTitle coming from:
songTitle = pytesseract.image_to_string(titleImage)
...is actually 'Macarena\n' instead of 'Macarena'.
They might look the same after print out, except the former will create a new line after it.
A great lesson learn for me.

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in pdf and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc on those books. So I converted them into .txt files and was trying to get tf-idf scores. Previously I performed tf-idf on a CSV file, so I made some changes in that code and tried to use it for .txt file. But I am unsuccessful in my attempt.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code is working until print ('Non Zero Count :', cvec_count.shape) and printing:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it is giving error:
ZeroDivisionError: division by zero
Even if I run this code with ignoring ZeroDivisionError, still it is wrong as it is not counting any frequencies.
I have no idea how to work around .txt file. What is the proper way to work on .txt file for NLP tasks?
Thanks in advance!
You are getting the error because data variable is empty or wrong type. Just opening the text file is not enough. You have to read the contents into a string variable and then do the preprocessing on that variable. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
data = file.read()

I need to convert netCDF44 variables to .csv format .I have difficulty in exporting multiple variables using pandas library.Can you find me a solution

I have altimeter satellite data in netCDF4 format.I need to export all the variables present in data to csv format.
I have tried using pandas library.But my code is not working
import netCDF4
import pandas as pd
nc_file = 'F:/data/JASON 3/2016/00 11.11.16/JA3_GPR_2PTP000_166_20160213_230539_20160214_000152.nc'
nc = netCDF4.Dataset(nc_file, mode='r')
#print(nc)
nc.variables.keys()
range_ku=nc.variables['range_ku'][:]
range1=((range_ku*0.001)+1300000.0)
mean_sea_surface=nc.variables['mean_sea_surface'][:]
mean_sea_surf_scaled=(mean_sea_surface*0.0001)
mean_topography=nc.variables['mean_topography'][:]
mean_topo_scaled=(mean_topography*0.0001)
model_dry_tropo_corr=nc.variables['model_dry_tropo_corr'][:]
topo_cr=(model_dry_tropo_corr*0.0001)
rad_wet_tropo_corr=nc.variables['rad_wet_tropo_corr'][:]
wet_topo_cr=(rad_wet_tropo_corr*0.0001)
iono_corr_alt_ku=nc.variables['iono_corr_alt_ku'][:]
iono_cr=(iono_corr_alt_ku*0.0001)
sea_state_bias_ku=nc.variables['sea_state_bias_ku'][:]
sea_state_bias_cr=(sea_state_bias_ku*0.0001)
lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:],time_var.units)
ssha= nc.variables['ssha'][:]
ssha1=(ssha*0.001)
alt = nc.variables['alt'][:]
alt1=((alt*0.001)+1300000.0)
print(dtime)
print(lat)
print(lon)
print(alt1)
print(range1)
print(mean_sea_surface)
print(mean_topography)
print(model_dry_tropo_corr)
print(rad_wet_tropo_corr)
print(iono_corr_alt_ku)
print(sea_state_bias_ku)
nc_ts = pd.Series(lat, lon,alt1,range1,mean_sea_surface,mean_topography,model_dry_tropo_corr,rad_wet_tropo_corr,iono_corr_alt_ku,sea_state_bias_ku,dtime)
nc_ts.to_csv('data.csv',index=False, header=True)
File "F:/untitled3.py", line 50, in <module>
nc_ts = pd.Series(lat, lon,alt1,range1,mean_sea_surface,mean_topography,model_dry_tropo_corr,rad_wet_tropo_corr,iono_corr_alt_ku,sea_state_bias_ku,dtime)
TypeError: __init__() takes from 1 to 7 positional arguments but 12 were given

How do I import xyz and roll/pitch/yaw from csv file to Blender?

I want to know if it is possible to import data of attitude and position (roll/pitch/yaw & xyz) from a comma separated file to Blender?
I recorded data from a little RC car and I want to represent its movement in a 3D world.
I have timestamps too, so if there's a way to animated the movement of the object it'll be superb!!
Any help will be greatly appreciated!!
Best Regards.
A slight modifcation, making use of the csv module
import bpy
import csv
position_vectors = []
filepath = "C:\\Work\\position.log"
csvfile = open(filepath, 'r', newline='')
ofile = csv.reader(csvfile, delimiter=',')
for row in ofile:
position_vectors.append(tuple([float(i) for i in row]))
csvfile.close()
This will get your points into Blender. Note the delimiter parameter in csv.reader, change that accordingly. With a real example file of your RC car we could provide a more complete solution.
For blender v2.62:
If you have a file "positions.log" looking like:
-8.691985196313894e-002; 4.119284642631801e-001; -5.832147659661263e-001
1.037146774956164e+000; 8.137243553005405e-002; -5.703274929662892e-001
-3.602584527944123e-001; 8.378614512537046e-001; 2.615265921163826e-001
6.266465707681335e-001; -1.128416901202341e+000; -1.664644365541639e+000
3.327523280880091e-001; 4.488553740582839e-001; -2.449449085462368e+000
-7.311567199869298e-001; -1.860587923723032e+000; -1.297179602213110e+000
-7.453603745688361e-003; 4.770473577895327e-001; -2.319515785100494e+000
1.935170866863264e-001; -2.010280476717868e+000; 3.748000986190077e-001
5.201529166915653e-001; 3.952972788761738e-001; 1.658581747430548e+000
4.719198263774027e-001; 1.526020825619557e+000; 3.187088567866725e-002
you can read it with this python script in blender (watch out for the indentation!)
import bpy
from mathutils import *
from math import *
from bpy.props import *
import os
import time
# Init
position_vector = []
# Open file
file = open("C:\\Work\\position.log", "r")
# Loop over line in file
for line in file:
# Split line at ";"
splittet_line = line.split(";")
# Append new postion
position_vector.append(
Vector((float(splittet_line[0]),
float(splittet_line[1]),
float(splittet_line[2]))))
# Close file
file.close()
# Get first selected object
selected_object = bpy.context.selected_objects[0]
# Get first selected object
for position in position_vector:
selected_object.location = position
This reads the file and updates the position of the first selected object accordingly. Way forward: What you have to find out is how to set the keyframes for the animation...
Consider this python snippet to add to the solutions above
obj = bpy.context.object
temporalScale=bpy.context.scene.render.fps
for lrt in locRotArray:
obj.location = (lrt[0], lrt[1], lrt[2])
# radians, and do you want XYZ, or ZYX?
obj.rotation_euler = (lrt[3], lrt[4], lrt[5])
time = lrt[6]*temporalScale
obj.keyframe_insert(data_path="location", frame=time)
obj.keyframe_insert(data_path="rotation_euler", frame=time)
I haven't tested it, but it will probably work, and gets you started.
With a spice2xyzv file as input file. The script writed by "Mutant Bob" seems to work.
But the xyz velocity data are km/s not euler angles, I think, and the import does not work for the angles.
# Records are <jd> <x> <y> <z> <vel x> <vel y> <vel z>
# Time is a TDB Julian date
# Position in km
# Velocity in km/sec
2456921.49775 213928288.518 -446198013.001 -55595492.9135 6.9011736 15.130842 0.54325805
Is there a solution to get them in Blender? Should I convert velocity angle to euler, is that possible in fact?
I use this script :
import bpy
from mathutils import *
from math import *
from bpy.props import *
import os
import time
# Init
position_vector = []
# Open file
file = open("D:\\spice2xyzv\\export.xyzv", "r")
obj = bpy.context.object
temporalScale=bpy.context.scene.render.fps
for line in file:
# Split line at ";"
print("line = %s" % line)
line = line.replace("\n","")
locRotArray = line.split(" ")
print("locRotArray = %s" % locRotArray )
#for lrt in locRotArray:
print(locRotArray[1])
obj.location = (float(locRotArray[1]), float(locRotArray[2]), float(locRotArray[3]))
# radians, and do you want XYZ, or ZYX?
obj.rotation_euler = (float(locRotArray[4]), float(locRotArray[5]), float(locRotArray[5]))
time = float(locRotArray[0])*temporalScale
print("time = %s" % time)
obj.keyframe_insert(data_path="location", frame=time)
obj.keyframe_insert(data_path="rotation_euler", frame=time)

Empty outputs with python GDAL

Hello im new to Gdal and im struggling a with my codes. Everything seems to go well in my code mut the output bands at the end is empty. The no data value is set to 256 when i specify 255, so I don't really know whats wrong. Thanks any help will be appreciated!!!
Here is my code
from osgeo import gdal
from osgeo import gdalconst
from osgeo import osr
from osgeo import ogr
import numpy
#graticule
src_ds = gdal.Open("E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\graticule1.tif")
band = src_ds.GetRasterBand(1)
band.SetNoDataValue(0)
graticule = band.ReadAsArray()
print('graticule done')
band="none"
#Biomass
dataset1 = gdal.Open("E:\\NFI_photo_plot\\photoplotdownloadAllCanada\provincial_merge\\Aggregate\\Biomass_NFI.tif")
band1 = dataset1.GetRasterBand(1)
band1.SetNoDataValue(-1)
Biomass = band1.ReadAsArray()
maskbiomass = numpy.greater(Biomass, -1).astype(int)
print("biomass done")
Biomass="none"
band1="none"
dataset1="none"
#Baseline
dataset2 = gdal.Open("E:\\NFI_photo_plot\\Baseline\\TOTBM_250.tif")
band2 = dataset2.GetRasterBand(1)
band2.SetNoDataValue(0)
baseline = band2.ReadAsArray()
maskbaseline = numpy.greater(baseline, 0).astype(int)
print('baseline done')
baseline="none"
band2="none"
dataset2="none"
#sommation
biosource=(graticule+maskbiomass+maskbaseline)
biosource1=numpy.uint8(biosource)
biosource="none"
#Écriture
dst_file="E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\Biosource.tif"
dst_driver = gdal.GetDriverByName('GTiff')
dst_ds = dst_driver.Create(dst_file, src_ds.RasterXSize,
src_ds.RasterYSize, 1, gdal.GDT_Byte)
#projection
dst_ds.SetProjection( src_ds.GetProjection() )
dst_ds.SetGeoTransform( src_ds.GetGeoTransform() )
outband=dst_ds.GetRasterBand(1)
outband.WriteArray(biosource1,0,0)
outband.SetNoDataValue(255)
biosource="none"
graticule="none"
A few pointers:
Where you have ="none", these need to be = None to close/cleanup the objects, otherwise you are setting the objects to an array of characters: n o n e, which is not what you intend to do.
Why do you have band1.SetNoDataValue(-1), while other NoData values are 0? Is this data source signed or unsigned? If unsigned, then -1 doesn't exist.
When you open rasters with gdal.Open without the access option, it defaults to gdal.GA_ReadOnly, which means your subsequent SetNoDataValue calls do nothing. If you want to modify the dataset, you need to use gdal.GA_Update as your second parameter to gdal.Open.
Another strategy to create a new raster is to use driver.CreateCopy; see the tutorial for details.