How can I Read a input file with pandas/python? - pandas
I want to create a input that can bea readable by pandas and do some count with sorted values. The problem is that I want that in the input you put the name of the file to do the action but i have some problem. Hope someone can help me.
The file is an excel!
Here I give you the code:
import pandas as pd
doc = input('Ingresa el nombre del archivo: ')
print(f'Ingresaste {doc}')
df=pd.read.excel(doc)
df['Recordinaciones'] = df.apply(lambda _: '', axis=1)
rcs=df[['Cliente','# Externo','Recordinaciones']].groupby(['Cliente','# Externo']).count().reset_index().sort_values(['Recordinaciones'],ascending=False)
rcs
You have to put an underscore and not a dot in pandas.read_excel :
Replace this :
df=pd.read.excel(doc)
By this :
df=pd.read_excel(doc)
Related
Use pandas df.concat to replace .append with custom index
I'm currently trying to replace .append in my code since it won't be supported in the future and I have some trouble with the custom index I'm using I read the names of every .shp files in a directory and extract some date from it To make the link with an excel file I have, I use the name I extract from the title of the file df = pd.DataFrame(columns = ['date','fichier']) for i in glob.glob("*.shp"): nom_parcelle = i.split("_")[2] if not nom_parcelle in df.index: # print(df.last_valid_index()) date_recolte = i.split("_")[-1] new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle) df = df.append(new_row, ignore_index=False) This works exactly as I want it to be Sadly, I can't find a way to replace it with .concat I looked for ways to keep the index whith concat but didn't find anything that worked as I intended Did I miss anything?
Try the approach below with pandas.concat based on your code : import glob import pandas as pd df = pd.DataFrame(columns = ['date','fichier']) dico_dfs={} for i in glob.glob("*.shp"): nom_parcelle = i.split("_")[2] if not nom_parcelle in df.index: # print(df.last_valid_index()) date_recolte = i.split("_")[-1] new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle) dico_dfs[i]= new_row.to_frame() df= pd.concat(dico_dfs, ignore_index=False, axis=1).T.droplevel(0) # Output : print(df) date fichier nom1 20220101 a_xx_nom1_20220101.shp nom2 20220102 b_yy_nom2_20220102.shp nom3 20220103 c_zz_nom3_20220103.shp
'poorly' organized csv file
I have a CSV file that I have to do some data processing and it's a bit of a mess. It's about 20 columns long, but there are multiple datasets that are concatenated in each column. see dummy file below I'm trying to import each sub file into a separate pandas dataframe, but I'm not sure the best way to parse the csv other than manually hardcoding importing a certain length. any suggestions? I guess if there is some way to find where the spaces are (I could loop through the entire file and find them, and then read each block, but that doesn't seem very efficient). I have lots of csv files like this to read. import pandas as pd nrows = 20 skiprows = 0 #but this only reads in the first block df = pd.read_csv(csvfile, nrows=nrows, skiprows=skiprows) Below is a dummy example: TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4 0.473934934,0.944026678,0.460177668,0.157028404,0.221362174 0.911384892,0.336694914,0.586014563,0.828339071,0.632790473 0.772652589,0.318146985,0.162987171,0.555896202,0.659099194 0.541382917,0.033706768,0.229596419,0.388057901,0.465507295 0.462815443,0.088206108,0.717132904,0.545779038,0.268174922 0.522861489,0.736462083,0.532785319,0.961993893,0.393424116 0.128671067,0.56740537,0.689995486,0.518493779,0.94916205 0.214026742,0.176948186,0.883636252,0.732258971,0.463732841 0.769415726,0.960761306,0.401863804,0.41823372,0.812081565 0.529750933,0.360314266,0.461615009,0.387516958,0.136616263 TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4 0.92264286,0.026312552,0.905839375,0.869477136,0.985560264 0.410573341,0.004825381,0.920616162,0.19473237,0.848603523 0.999293171,0.259955029,0.380094352,0.101050014,0.428047493 0.820216119,0.655118219,0.586754951,0.568492346,0.017038336 0.040384337,0.195101879,0.778631044,0.655215972,0.701596844 0.897559206,0.659759362,0.691643603,0.155601111,0.713735399 0.860188233,0.805013656,0.772153733,0.809025634,0.257632085 0.844167809,0.268060979,0.015993504,0.95131982,0.321210766 0.86288383,0.236599974,0.279435193,0.311005146,0.037592509 0.938348876,0.941851279,0.582434058,0.900348616,0.381844182 0.344351819,0.821571854,0.187962046,0.218234588,0.376122331 0.829766776,0.869014514,0.434165111,0.051749472,0.766748447 0.327865017,0.938176948,0.216764504,0.216666543,0.278110502 0.243953506,0.030809033,0.450110334,0.097976735,0.762393831 0.484856452,0.312943244,0.443236377,0.017201097,0.038786057 0.803696521,0.328088545,0.764850865,0.090543472,0.023363909 TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4 0.342418934,0.290979228,0.84201758,0.690964176,0.927385229 0.173485057,0.214049903,0.27438753,0.433904377,0.821778689 0.982816721,0.094490904,0.105895645,0.894103833,0.34362529 0.738593272,0.423470984,0.343551191,0.192169774,0.907698897 0.021809601,0.406001002,0.072701623,0.964640184,0.023427393 0.406226618,0.421944527,0.413150342,0.337243905,0.515996389 0.829989793,0.168974332,0.246064043,0.067662474,0.851182924 0.812736737,0.667154845,0.118274705,0.484017732,0.052666038 0.215947395,0.145078319,0.484063281,0.79414799,0.373845815 0.497877968,0.554808367,0.370429652,0.081553316,0.793608698 0.607612542,0.424703584,0.208995066,0.249033837,0.808169709 0.199613478,0.065853429,0.77236195,0.757789625,0.597225697 0.044167285,0.1024231,0.959682778,0.892311813,0.621810775 0.861175219,0.853442735,0.742542086,0.704287769,0.435969078 0.706544823,0.062501379,0.482065481,0.598698867,0.845585046 0.967217599,0.13127149,0.294860203,0.191045015,0.590202032 0.031666757,0.965674812,0.177792841,0.419935921,0.895265056 TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4 0.306849588,0.177454423,0.538670939,0.602747137,0.081221293 0.729747557,0.11762043,0.409064884,0.051577964,0.666653287 0.492543468,0.097222882,0.448642979,0.130965724,0.48613413 0.0802024,0.726352481,0.457476151,0.647556514,0.033820374 0.617976299,0.934428994,0.197735831,0.765364856,0.350880707 0.07660401,0.285816636,0.276995238,0.047003343,0.770284864 0.620820688,0.700434525,0.896417099,0.652364756,0.93838793 0.364233925,0.200229902,0.648342989,0.919306736,0.897029239 0.606100716,0.203585366,0.167232701,0.523079381,0.767224301 0.616600448,0.130377791,0.554714839,0.468486555,0.582775753 0.254480861,0.933534632,0.054558237,0.948978985,0.731855548 0.620161044,0.583061202,0.457991555,0.441254272,0.657127968 0.415874646,0.408141761,0.843133575,0.40991199,0.540792744 0.254903429,0.655739954,0.977873649,0.210656057,0.072451639 0.473680525,0.298845701,0.144989283,0.998560665,0.223980961 0.30605008,0.837920854,0.450681322,0.887787908,0.793229776 0.584644405,0.423279153,0.444505314,0.686058204,0.041154856
from io import StringIO import pandas as pd data =""" TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4 0.473934934,0.944026678,0.460177668,0.157028404,0.221362174 0.911384892,0.336694914,0.586014563,0.828339071,0.632790473 0.772652589,0.318146985,0.162987171,0.555896202,0.659099194 0.541382917,0.033706768,0.229596419,0.388057901,0.465507295 0.462815443,0.088206108,0.717132904,0.545779038,0.268174922 0.522861489,0.736462083,0.532785319,0.961993893,0.393424116 TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4 0.92264286,0.026312552,0.905839375,0.869477136,0.985560264 0.410573341,0.004825381,0.920616162,0.19473237,0.848603523 0.999293171,0.259955029,0.380094352,0.101050014,0.428047493 0.820216119,0.655118219,0.586754951,0.568492346,0.017038336 0.040384337,0.195101879,0.778631044,0.655215972,0.701596844 TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4 0.342418934,0.290979228,0.84201758,0.690964176,0.927385229 0.173485057,0.214049903,0.27438753,0.433904377,0.821778689 0.982816721,0.094490904,0.105895645,0.894103833,0.34362529 0.738593272,0.423470984,0.343551191,0.192169774,0.907698897 """ df = pd.read_csv(StringIO(data), header=None) start_marker = 'TIME' grouper = (df.iloc[:, 0] == start_marker).cumsum() groups = df.groupby(grouper) frames = [gr.T.set_index(gr.index[0]).T for _, gr in groups]
Using string output from pytesseract to do a vlookup in pandas dataframe
I'm very new to Python, and I'm trying to make a simple image to song title to BPM program. My approach is using pytesseract to generate a string output; and then, using that string output, I wish to vlookup in a dataframe created by pandas. However, it always return zero value even though that song does exist in the data. import PIL.ImageGrab from PIL import ImageGrab import numpy as np import pytesseract import pandas as pd pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" def getTitleImage(left, top, width, height): printscreen_pil = ImageGrab.grab((left, top, left + width, top + height)) printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \ .reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3)) return printscreen_numpy # Printscreen: titleImage = getTitleImage(x, y, w, h) # pytesseract to string: songTitle = pytesseract.image_to_string(titleImage) print('Name of the song: ', songTitle) # Importing the csv data via pandas. songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv') # A simple vlookup formula that return the BPM of the song by taking data from the same row. bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum() print('The BPM of the song is: ', bpmSong) Output: Name of the song: Macarena The BPM of the song is: 0 However, when I tried to forcefully provide the string to the songTitle variable, it works: songTitle = 'Macarena' print('Name of the song: ', songTitle) songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv') bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum() print('The BPM of the song is: ', bpmSong) Output: Name of the song: Macarena The BPM of the song is: 103 I have checked the string generated from pytesseract: It has no extra space in the front or the back, totally identical to the forced string, but they still produce different results. What could be the problem?
I found the answer. It is because the songTitle coming from: songTitle = pytesseract.image_to_string(titleImage) ...is actually 'Macarena\n' instead of 'Macarena'. They might look the same after print out, except the former will create a new line after it. A great lesson learn for me.
Autocorrect a column in a pandas dataframe using pyenchant
I tried to apply the code from the accepted answer of this question to one of my dataframe columns where each row is a sentence, but it didn't work. My code looks this: from enchant.checker import SpellChecker checker = SpellChecker("id_ID") h = df['Jawaban'].astype(str).str.lower() hayo = [] for text in h: checker.set_text(text) for s in checker: sug = s.suggest()[0] s.replace(sug) hayo.append(checker.get_text()) I got this following error: IndexError: list index out of range Any help is greatly appreciated.
I don't get the error using your code. The only thing I'm doing differently is to import the spell checker. from enchant.checker import SpellChecker checker = SpellChecker('en_US','en_UK') # not using id_ID # sample data ds = pd.DataFrame({ 'text': ['here is a spllng mstke','the wrld is grwng']}) p = ds['text'].str.lower() hayo = [] for text in p: checker.set_text(text) for s in checker: sug = s.suggest()[0] s.replace(sug) print(checker.get_text()) hayo.append(checker.get_text()) print(hayo) here is a spelling mistake the world is growing
Python 3.6 Pandas Difflib Get_Close_Matches to filter a dataframe with user input
Using a csv imported using a pandas dataframe, I am trying to search one column of the df for entries similar to a user generated input. Never used difflib before and my tries have ended in a TypeError: object of type 'float' has no len() or an empty [] list. import difflib import pandas as pd df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1") word = input ("Enter a vendor: ") def find_it(w): w = w.lower() return difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6) alternatives = find_it(word) print (alternatives) The error seems to occur at "return.difflib.get_close_matches(w, df.vendorname, n=50, cutoff=.6)" Am attempting to get similar results to "word" with a column called 'vendorname'. Help is greatly appreciated.
Your column vendorname is of the incorrect type. Try in your return statement: return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6) import difflib import pandas as pd df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1") word = input ("Enter a vendor: ") def find_it(w): w = w.lower() return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6) alternatives = find_it(word) print (alternatives) As stated in the comments by #johnchase The question also mentions the return of an empty list. The return of get_close_matches is a list of matches, if no item matched within the cutoff an empty list will be returned – johnchase
I've skipped the: astype(str)in (return difflib.get_close_matches(w, df.vendorname.astype(str), n=50, cutoff=.6)) Instead used: dtype='string' in (df = pd.read_csv("Vendorlist.csv", encoding= "ISO-8859-1"))