Autocorrect a column in a pandas dataframe using pyenchant

I tried to apply the code from the accepted answer of this question to one of my dataframe columns where each row is a sentence, but it didn't work.
My code looks like this:
from enchant.checker import SpellChecker
checker = SpellChecker("id_ID")
h = df['Jawaban'].astype(str).str.lower()
hayo = []
for text in h:
    checker.set_text(text)
    for s in checker:
        sug = s.suggest()[0]
        s.replace(sug)
    hayo.append(checker.get_text())
I got the following error:
IndexError: list index out of range
Any help is greatly appreciated.

I don't get the error using your code. The only thing I'm doing differently is to import the spell checker.
import pandas as pd
from enchant.checker import SpellChecker
checker = SpellChecker('en_US','en_UK') # not using id_ID
# sample data
ds = pd.DataFrame({ 'text': ['here is a spllng mstke','the wrld is grwng']})
p = ds['text'].str.lower()
hayo = []
for text in p:
    checker.set_text(text)
    for s in checker:
        sug = s.suggest()[0]
        s.replace(sug)
    print(checker.get_text())
    hayo.append(checker.get_text())
print(hayo)
here is a spelling mistake
the world is growing
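The IndexError in the question comes from s.suggest()[0]: when the dictionary has no suggestion for a word, suggest() returns an empty list and indexing it fails. Below is a minimal sketch of a guarded loop. The helper name autocorrect is made up for illustration, and it assumes a checker object that behaves like enchant.checker.SpellChecker (set_text, iteration over errors, suggest, replace, get_text).

```python
def autocorrect(text, checker):
    """Replace each misspelled word with its first suggestion, if there is one.

    `checker` is assumed to behave like enchant.checker.SpellChecker.
    """
    checker.set_text(text)
    for err in checker:
        suggestions = err.suggest()
        if suggestions:  # empty list means no suggestion: keep the original word
            err.replace(suggestions[0])
    return checker.get_text()


# Usage with pyenchant (assumes the id_ID dictionary is installed):
# from enchant.checker import SpellChecker
# checker = SpellChecker("id_ID")
# hayo = [autocorrect(text, checker) for text in df['Jawaban'].astype(str).str.lower()]
```

Words without any suggestion are left unchanged, so the loop no longer stops on them.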

Related

How can I read an input file with pandas/Python?

I want to prompt for a file name, read that file with pandas, and then compute some counts sorted by value. The user should type the name of the file to process, but I am running into a problem. I hope someone can help me.
The file is an Excel file!
Here is the code:
import pandas as pd
doc = input('Ingresa el nombre del archivo: ')
print(f'Ingresaste {doc}')
df=pd.read.excel(doc)
df['Recordinaciones'] = df.apply(lambda _: '', axis=1)
rcs=df[['Cliente','# Externo','Recordinaciones']].groupby(['Cliente','# Externo']).count().reset_index().sort_values(['Recordinaciones'],ascending=False)
rcs
You have to use an underscore, not a dot, in pandas.read_excel:
Replace this:
df=pd.read.excel(doc)
With this:
df=pd.read_excel(doc)
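Once the file loads, the empty helper column is not strictly needed for the count. Here is a minimal sketch, assuming the goal is the number of rows per (Cliente, # Externo) pair. The helper name count_records and the sample rows are made up for illustration; in the real script the frame would come from pd.read_excel(doc).

```python
import pandas as pd


def count_records(df):
    """Count rows per (Cliente, # Externo) pair, largest count first."""
    return (
        df.groupby(['Cliente', '# Externo'])
        .size()
        .reset_index(name='Recordinaciones')
        .sort_values('Recordinaciones', ascending=False)
    )


# Made-up sample standing in for pd.read_excel(doc)
sample = pd.DataFrame({
    'Cliente': ['A', 'A', 'B'],
    '# Externo': [1, 1, 2],
})
rcs = count_records(sample)
print(rcs)
```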

TypeError: 'Value' object is not iterable : iterate around a Dataframe for prediction purpose with GCP Natural Language Model

I'm trying to iterate over a dataframe in order to apply a predict function, which calls a Natural Language model hosted on GCP. Here is the loop code:
model = 'XXXXXXXXXXXXXXXX'
barometre_df_processed = barometre_df
barometre_df_processed['theme'] = ''
barometre_df_processed['proba'] = ''
print('DEBUT BOUCLE FOR')
for ind in barometre_df.index:
    if barometre_df.verbatim[ind] is np.nan :
        barometre_df_processed.theme[ind]="RAS"
        barometre_df_processed.proba[ind]="1"
    else:
        print(barometre_df.verbatim[ind])
        print(type(barometre_df.verbatim[ind]))
        res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind], 'mime_type': 'text/plain'} },model_name=model)
        print(res)
        theme = res['displayNames']
        proba = res["classification"]["score"]
        barometre_df_processed.theme[ind]=theme
        barometre_df_processed.proba[ind]=proba
and the get_prediction function that I took from the Natural Language AI Documentation :
def get_prediction(file_path, model_name):
    options = ClientOptions(api_endpoint='eu-automl.googleapis.com:443')
    prediction_client = automl_v1.PredictionServiceClient(client_options=options)
    payload = file_path
    # Uncomment the following line (and comment the above line) if want to predict on PDFs.
    # payload = pdf_payload(file_path)
    parameters_dict = {}
    params = json_format.ParseDict(parameters_dict, Value())
    request = prediction_client.predict(name=model_name, payload=payload, params=params)
    print("fonction prediction")
    print(request)
    return resultat[0]["displayName"], resultat[0]["classification"]["score"], resultat[1]["displayName"], resultat[1]["classification"]["score"], resultat[2]["displayName"], resultat[2]["classification"]["score"]
I'm looping this way because I want each [displayName, score] pair to create a new row in my final dataframe, giving something like this:
verbatim1, theme1, proba1
verbatim1, theme2, proba2
verbatim1, theme3, proba3
verbatim2, theme1, proba1
verbatim2, theme2, proba2
...
The "if barometre_df.verbatim[ind] is np.nan" check is not causing problems; I only use it to handle NaNs, so you can ignore it.
The error that I have is this one :
TypeError: 'Value' object is not iterable
I guess the issue is with
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]} },model_name=model)
but I can't figure out what's going wrong here.
I already tried removing
,'mime_type': 'text/plain'}
from my get_prediction parameters, but it doesn't change anything.
Does anyone know how to deal with this issue?
Thank you in advance.
I think you are not iterating correctly.
The way to iterate through a dataframe is:
for index, row in df.iterrows():
    print(row['col1'])
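Applied to the question, iterrows can build the long-format result with one output row per (theme, score) pair. This is only a sketch: fake_predict is a made-up stand-in for the real get_prediction call and returns hard-coded pairs, explode_predictions is a hypothetical helper name, and the column names verbatim, theme and proba are taken from the question.

```python
import pandas as pd


def fake_predict(text):
    """Hypothetical stand-in for the model call; returns (theme, score) pairs."""
    return [("theme1", 0.7), ("theme2", 0.2), ("theme3", 0.1)]


def explode_predictions(df, predict=fake_predict):
    """Return one row per (verbatim, theme, proba) triple."""
    rows = []
    for _, row in df.iterrows():
        if pd.isna(row['verbatim']):
            # Missing text: use the placeholder theme "RAS" from the question
            rows.append({'verbatim': row['verbatim'], 'theme': 'RAS', 'proba': 1.0})
            continue
        for theme, proba in predict(row['verbatim']):
            rows.append({'verbatim': row['verbatim'], 'theme': theme, 'proba': proba})
    return pd.DataFrame(rows, columns=['verbatim', 'theme', 'proba'])


barometre_df = pd.DataFrame({'verbatim': ['first comment', None]})
result = explode_predictions(barometre_df)
print(result)
```

Collecting plain dicts in a list and building the dataframe once at the end avoids assigning into the frame cell by cell inside the loop.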

Why is my code giving me data in one column when it should be in two different columns?

I need to know what is happening in my code. It should put the data in separate columns, but it gives me the same data in both columns.
I tried changing the value of the row variable, but I couldn't find the reason.
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import time
arrayofRequest= []
prices=[]
location=[]
columns=['Price', 'Location']
df = pd.DataFrame(columns=columns)
for i in range(0,50):
    arrayofRequest.append("https://www.zameen.com/Homes/Karachi-2-"+str(i+1)+".html?gclid=Cj0KCQjw3JXtBRC8ARIsAEBHg4mj4jX1zZUt3WzGScjH6nfwzrEqkuILarcmg372imSneelSXPj0fGIaArNeEALw_wcB")
    request = requests.get(arrayofRequest[i])
    soupobj= BeautifulSoup(request.content,"lxml")
    # print(soupobj.prettify())
    links =soupobj.find_all('span',{'class':'f343d9ce'})
    addresses =soupobj.find_all('div',{'class':'_162e6469'})
    price = ""
    for i in range(0,len(links)):
        price = str(links[i]).split(">")
        price = price[len(price)-2].split("<")[0]
        prices.append(price)
        address = str(addresses[i]).split(">")
        address = address[len(address)-2].split("<")[0]
        location.append(address)
        row=location[i]+","+prices[i]
        df = df.append(pd.Series(row, index=columns), ignore_index=False)
# filewriter = csv.writer(csvfile, delimiter=',',filewriter.writerow(['Price', 'Location']),filewriter.writerow([prices[0],location[0]])
df.to_csv('DATA.csv', index=False)
It is because of this:
pd.Series(row, index=columns)
Here row is a single string, so the same value is repeated for both columns. Try something like
pd.DataFrame([[prices[i], location[i]]], columns=columns)
However, this can be done just once, outside of your for loop:
pd.DataFrame(list(zip(prices, location)), columns=columns)
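As a minimal sketch of the build-it-once approach: the values below are made up and only stand in for the scraped lists, while the list names prices and location follow the question. Building the frame after the loop also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Made-up values standing in for the scraped lists
prices = ['1.2 Crore', '85 Lakh']
location = ['DHA Defence, Karachi', 'Gulshan-e-Iqbal, Karachi']

# Build the frame once, after the loop, instead of appending row by row.
# zip pairs the i-th price with the i-th location, matching the column order.
df = pd.DataFrame(list(zip(prices, location)), columns=['Price', 'Location'])
print(df)
```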

Python 3.6 Pandas Difflib Get_Close_Matches to filter a dataframe with user input

Using a CSV imported into a pandas dataframe, I am trying to search one column of the df for entries similar to a user-supplied input. I have never used difflib before, and my attempts have ended in either a TypeError: object of type 'float' has no len() or an empty list [].
import difflib
import pandas as pd
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1")
word = input ("Enter a vendor: ")
def find_it(w):
    w = w.lower()
    return difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6)
alternatives = find_it(word)
print (alternatives)
The error seems to occur at "return difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6)"
I am trying to find entries similar to "word" in a column called 'vendorname'.
Help is greatly appreciated.
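Your column vendorname is of the incorrect type: the error message shows that it contains float values (most likely NaN for empty cells), and difflib can only compare strings.
Try this in your return statement: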
return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
import difflib
import pandas as pd
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1")
word = input ("Enter a vendor: ")
def find_it(w):
    w = w.lower()
    return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
alternatives = find_it(word)
print (alternatives)
As stated in the comments by @johnchase:
The question also mentions the return of an empty list. The return value of get_close_matches is a list of matches; if no item matches within the cutoff, an empty list is returned. – johnchase
I skipped the astype(str) in:
return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
Instead, I used dtype='string' in:
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1", dtype='string')
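Both symptoms from the question can be reproduced without pandas. The sketch below filters out non-string entries (which trigger the TypeError) and lowercases the candidates as well as the query, since comparing a lowercased query against mixed-case names lowers the similarity scores and may contribute to empty results. The helper name find_close and the vendor names are made up for illustration.

```python
import difflib


def find_close(word, names, n=50, cutoff=0.6):
    """Return entries of `names` similar to `word`, ignoring case and non-strings."""
    # Map lowercased name -> original spelling; skip NaN and other non-strings
    lookup = {name.lower(): name for name in names if isinstance(name, str)}
    matches = difflib.get_close_matches(word.lower(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[m] for m in matches]


# Made-up vendor names; float('nan') mimics an empty CSV cell
vendors = ['Acme Corp', float('nan'), 'Acme Corporation', 'Zenith']
print(find_close('acme corp', vendors))
```

With the dataframe from the question, the call would look like find_close(word, df.vendorname).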

Empty outputs with python GDAL

Hello, I'm new to GDAL and I'm struggling with my code. Everything seems to go well, but the output band at the end is empty. The NoData value is set to 256 when I specify 255, so I don't really know what's wrong. Thanks, any help will be appreciated!
Here is my code:
from osgeo import gdal
from osgeo import gdalconst
from osgeo import osr
from osgeo import ogr
import numpy
#graticule
src_ds = gdal.Open("E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\graticule1.tif")
band = src_ds.GetRasterBand(1)
band.SetNoDataValue(0)
graticule = band.ReadAsArray()
print('graticule done')
band="none"
#Biomass
dataset1 = gdal.Open("E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\Biomass_NFI.tif")
band1 = dataset1.GetRasterBand(1)
band1.SetNoDataValue(-1)
Biomass = band1.ReadAsArray()
maskbiomass = numpy.greater(Biomass, -1).astype(int)
print("biomass done")
Biomass="none"
band1="none"
dataset1="none"
#Baseline
dataset2 = gdal.Open("E:\\NFI_photo_plot\\Baseline\\TOTBM_250.tif")
band2 = dataset2.GetRasterBand(1)
band2.SetNoDataValue(0)
baseline = band2.ReadAsArray()
maskbaseline = numpy.greater(baseline, 0).astype(int)
print('baseline done')
baseline="none"
band2="none"
dataset2="none"
#sommation
biosource=(graticule+maskbiomass+maskbaseline)
biosource1=numpy.uint8(biosource)
biosource="none"
#Écriture
dst_file="E:\\NFI_photo_plot\\photoplotdownloadAllCanada\\provincial_merge\\Aggregate\\Biosource.tif"
dst_driver = gdal.GetDriverByName('GTiff')
dst_ds = dst_driver.Create(dst_file, src_ds.RasterXSize,
                           src_ds.RasterYSize, 1, gdal.GDT_Byte)
#projection
dst_ds.SetProjection( src_ds.GetProjection() )
dst_ds.SetGeoTransform( src_ds.GetGeoTransform() )
outband=dst_ds.GetRasterBand(1)
outband.WriteArray(biosource1,0,0)
outband.SetNoDataValue(255)
biosource="none"
graticule="none"
A few pointers:
Where you have ="none", these need to be = None to close/cleanup the objects, otherwise you are setting the objects to an array of characters: n o n e, which is not what you intend to do.
Why do you have band1.SetNoDataValue(-1), while other NoData values are 0? Is this data source signed or unsigned? If unsigned, then -1 doesn't exist.
When you open rasters with gdal.Open without the access option, it defaults to gdal.GA_ReadOnly, which means your subsequent SetNoDataValue calls do nothing. If you want to modify the dataset, you need to use gdal.GA_Update as your second parameter to gdal.Open.
Another strategy to create a new raster is to use driver.CreateCopy; see the tutorial for details.
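The pointers above can be combined into a small writer function. This is a sketch under stated assumptions, not a drop-in fix: write_byte_raster is a hypothetical helper name, both paths are supplied by the caller, and the array is assumed to be a 2-D NumPy array that already fits the Byte range. It writes the band, flushes it, and then releases the objects with None so that GDAL closes the output file.

```python
def write_byte_raster(array, template_path, dst_path, nodata=255):
    """Write a 2-D array to a single-band Byte GeoTIFF, reusing the
    size and georeferencing of an existing raster at `template_path`."""
    from osgeo import gdal  # imported here so the function can be defined without GDAL

    template = gdal.Open(template_path)  # read-only access is enough for a template
    driver = gdal.GetDriverByName('GTiff')
    dst = driver.Create(dst_path, template.RasterXSize, template.RasterYSize,
                        1, gdal.GDT_Byte)
    dst.SetProjection(template.GetProjection())
    dst.SetGeoTransform(template.GetGeoTransform())

    band = dst.GetRasterBand(1)
    band.SetNoDataValue(nodata)
    band.WriteArray(array)
    band.FlushCache()

    # Release the objects with None (not the string "none") to close the files
    band = None
    dst = None
    template = None


# Usage, with the arrays and paths from the question:
# write_byte_raster(biosource1, "graticule1.tif", "Biosource.tif", nodata=255)
```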