Python 3.6 Pandas Difflib Get_Close_Matches to filter a dataframe with user input - pandas

Using a csv imported using a pandas dataframe, I am trying to search one column of the df for entries similar to a user generated input. Never used difflib before and my tries have ended in a TypeError: object of type 'float' has no len() or an empty [] list.
import difflib
import pandas as pd
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1")
word = input ("Enter a vendor: ")
def find_it(w):
w = w.lower()
return difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6)
alternatives = find_it(word)
print (alternatives)
The error seems to occur at "return.difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6)"
Am attempting to get similar results to "word" with a column called 'vendorname'.
Help is greatly appreciated.

Your column vendorname is of the incorrect type.
Try in your return statement:
return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
import difflib
import pandas as pd
df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1")
word = input ("Enter a vendor: ")
def find_it(w):
w = w.lower()
return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)
alternatives = find_it(word)
print (alternatives)
As stated in the comments by #johnchase
The question also mentions the return of an empty list. The return of get_close_matches is a list of matches, if no item matched within the cutoff an empty list will be returned – johnchase

I've skipped the:
astype(str)in (return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6))
Instead used:
dtype='string' in (df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1"))

Related

Use pandas df.concat to replace .append with custom index

I'm currently trying to replace .append in my code since it won't be supported in the future and I have some trouble with the custom index I'm using
I read the names of every .shp files in a directory and extract some date from it
To make the link with an excel file I have, I use the name I extract from the title of the file
df = pd.DataFrame(columns = ['date','fichier'])
for i in glob.glob("*.shp"):
nom_parcelle = i.split("_")[2]
if not nom_parcelle in df.index:
# print(df.last_valid_index())
date_recolte = i.split("_")[-1]
new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
df = df.append(new_row, ignore_index=False)
This works exactly as I want it to be
Sadly, I can't find a way to replace it with .concat
I looked for ways to keep the index whith concat but didn't find anything that worked as I intended
Did I miss anything?
Try the approach below with pandas.concat based on your code :
import glob
import pandas as pd
​
df = pd.DataFrame(columns = ['date','fichier'])
dico_dfs={}
​
for i in glob.glob("*.shp"):
nom_parcelle = i.split("_")[2]
if not nom_parcelle in df.index:
# print(df.last_valid_index())
date_recolte = i.split("_")[-1]
new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
dico_dfs[i]= new_row.to_frame()
df= pd.concat(dico_dfs, ignore_index=False, axis=1).T.droplevel(0)
# Output :
print(df)
date fichier
nom1 20220101 a_xx_nom1_20220101.shp
nom2 20220102 b_yy_nom2_20220102.shp
nom3 20220103 c_zz_nom3_20220103.shp

How to quickly convert a pandas dataframe to a list of tuples

I have a pandas dataframe as follows.
thi 0.969378
text 0.969378
is 0.969378
anoth 0.699030
your 0.497120
first 0.497120
book 0.497120
third 0.445149
the 0.445149
for 0.445149
analysi 0.445149
I want to convert it to a list of tuples as follows.
[["this", 0.969378], ["text", 0.969378], ..., ["analysi", 0.445149]]
My code is as follows.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
tokens = word_tokenize(text)
stems = []
for item in tokens: stems.append(PorterStemmer().stem(item))
return stems
# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
print(top_words)
I tried the following two options.
list(zip(*map(top_words.get, top_words)))
I got the error as TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.9693779251346359] of <class 'float'>
list(top_words.itertuples(index=True))
I got the error as AttributeError: 'Series' object has no attribute 'itertuples'.
Please let me know a quick way of doing this in pandas.
I am happy to provide more details if needed.
Use zip by index with map tuples to lists:
a = list(map(list,zip(top_words.index,top_words)))
Or convert index to column, convert to nupy array and then to lists:
a = top_words.reset_index().to_numpy().tolist()
print (a)
[['thi', 0.9693780000000001], ['text', 0.9693780000000001],
['is', 0.9693780000000001], ['anoth', 0.69903],
['your', 0.49712], ['first', 0.49712], ['book', 0.49712],
['third', 0.44514899999999996], ['the', 0.44514899999999996],
['for', 0.44514899999999996], ['analysi', 0.44514899999999996]]

Why my code is giving me data in 1 column it should give me in two different column

i need to know what is happening in my code? it should give data in separate columns it is giving me same data in a oath columns.
i tried to change the value of row variable but it didn't found the reason
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import time
arrayofRequest= []
prices=[]
location=[]
columns=['Price', 'Location']
df = pd.DataFrame(columns=columns)
for i in range(0,50):
arrayofRequest.append("https://www.zameen.com/Homes/Karachi-2-"+str(i+1)+".html?gclid=Cj0KCQjw3JXtBRC8ARIsAEBHg4mj4jX1zZUt3WzGScjH6nfwzrEqkuILarcmg372imSneelSXPj0fGIaArNeEALw_wcB")
request = requests.get(arrayofRequest[i])
soupobj= BeautifulSoup(request.content,"lxml")
# print(soupobj.prettify())
links =soupobj.find_all('span',{'class':'f343d9ce'})
addresses =soupobj.find_all('div',{'class':'_162e6469'})
price = ""
for i in range(0,len(links)):
price = str(links[i]).split(">")
price = price[len(price)-2].split("<")[0]
prices.append(price)
address = str(addresses[i]).split(">")
address = address[len(address)-2].split("<")[0]
location.append(address)
row=location[i]+","+prices[i]
df = df.append(pd.Series(row, index=columns), ignore_index=False)
# filewriter = csv.writer(csvfile, delimiter=',',filewriter.writerow(['Price', 'Location']),filewriter.writerow([prices[0],location[0]])
df.to_csv('DATA.csv', index=False)
because of this:
pd.Series(row, index=columns)
try smthg like
pd.DataFrame([[locations[i], prices[i]]], index=columns))
However this could be done only once outside of your for loop
pd.DataFrame(list(zip(locations, prices)), index=columns))

Autocorrect a column in a pandas dataframe using pyenchant

I tried to apply the code from the accepted answer of this question to one of my dataframe columns where each row is a sentence, but it didn't work.
My code looks this:
from enchant.checker import SpellChecker
checker = SpellChecker("id_ID")
h = df['Jawaban'].astype(str).str.lower()
hayo = []
for text in h:
checker.set_text(text)
for s in checker:
sug = s.suggest()[0]
s.replace(sug)
hayo.append(checker.get_text())
I got this following error:
IndexError: list index out of range
Any help is greatly appreciated.
I don't get the error using your code. The only thing I'm doing differently is to import the spell checker.
from enchant.checker import SpellChecker
checker = SpellChecker('en_US','en_UK') # not using id_ID
# sample data
ds = pd.DataFrame({ 'text': ['here is a spllng mstke','the wrld is grwng']})
p = ds['text'].str.lower()
hayo = []
for text in p:
checker.set_text(text)
for s in checker:
sug = s.suggest()[0]
s.replace(sug)
print(checker.get_text())
hayo.append(checker.get_text())
print(hayo)
here is a spelling mistake
the world is growing

How to drop the row?

I'm handling my data now.
I got a problems when I using pandas
Here is the code.
import pandas as pd
import numpy as np
import os
join_file2 = r'D:\raw data\서울시 공공데이터\5.16년7월분\17.상권-추정매출\tbsm_trdar_selng.txt\tbsm_trdar_selng_utf8.txt'
os.chdir(os.path.dirname(join_file2))
join_data2 = pd.read_csv(os.path.basename(join_file2),sep='|',
header=None ,
usecols=[0,1,2,3,4,11],
names=['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD','SVC_INDUTY_CD_NM','THSMON_SELNG_AMT'],
dtype = { '0' : int},
encoding='utf-8' )
join_data2_d = join_data2[(join_data2.SVC_INDUTY_CD != 'CS000000') | (join_data2.SVC_INDUTY_CD != 'CS100000') | (join_data2.SVC_INDUTY_CD != 'CS200000')| (join_data2.SVC_INDUTY_CD != 'CS300000') ]
When I print join_data2.head(), I got this data
and print join_data2_d, I got this which doesn't drop rows.
How can I fix it??
Here is my data link.
http://blogattach.naver.com/f96ce5504c7273c4e30c6a5e6581fe8323768bd1/20170214_57_blogfile/khm2963_1487051333659_hYIWuM_zip/tbsm_trdar_selng_utf8.zip?type=attachment
It appears that you are trying to filter out rows that have SVC_INDUTY_CD not equal to several values. You should use the isin method and reverse it using the unary operator ~
join_data2_d = join_data2[~join_data2.SVC_INDUTY_CD.isin(['CS000000',
'CS100000',
'CS200000',
'CS300000'])]