Use pandas to read a csv file with several uncertain factors
I have asked a related string question in: Find the number of \n before a given word in a long string. But that method cannot handle the more complicated case I ran into, so I want to find a pandas solution here.
I have a csv file (I just represent as a string):
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I want to use pandas:
value = pandas.read_csv(csvfile, sep='\t', skiprows=3).set_index('maturity')
to obtain a table with the first column maturity set as the index.
But there are several uncertain factors in the csvfile:
1. .set_index('maturity'): the index key maturity is given in the row key: maturity, so I first have to find the row key: xxxx and extract the string xxxx.
2. skiprows=3: the number of rows to skip before the header is uncertain. The csvfile can be something like:
'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
so I have to find the row number of the header (namely the row beginning with the xxxx found in the row key: xxxx).
3. sep='\t': the csvfile may use spaces as the separator, like:
csvfile = 'Idnum Id\nkey: maturity\n2\nmaturity para1 para2\n1Y 0 0\n2Y 0 0'
So is there any general pandas code to deal with a csvfile that has the above uncertain factors?
Actually the string:
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
is actually from a StringIO object data:
data.getvalue() = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I am not familiar with this structure, and even reading the original data without any editing, using:
value = pandas.read_csv(data, sep='\t')
raises an error.
You can read the file line by line, collecting the necessary information and then pass the remainder to pd.read_csv with the appropriate arguments:
from io import StringIO
import re
import pandas as pd
with open('data.csv') as fh:
    # the index name follows 'key:' (slicing off the prefix is safer than
    # lstrip('key:'), which strips *characters*, not a prefix)
    key = next(filter(lambda x: x.startswith('key:'), fh))[len('key:'):].strip()
    # the header is the first subsequent line starting with that name
    header = re.split('[ \t]+', next(filter(lambda x: x.startswith(key), fh)).strip())
    # everything left in the file handle is the data block
    df = pd.read_csv(StringIO(fh.read()), header=None, names=header, index_col=0, sep=r'\s+')
Example for data via StringIO:
fh = StringIO('Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0')
key = next(filter(lambda x: x.startswith('key:'), fh))[len('key:'):].strip()
header = re.split('[ \t]+', next(filter(lambda x: x.startswith(key), fh)).strip())
df = pd.read_csv(fh, header=None, names=header, index_col=0, sep=r'\s+')
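The same two-step lookup can be wrapped in a small helper; read_keyed_csv is a name introduced here for illustration, and the sketch assumes the separator is either tabs or spaces:

```python
import re
from io import StringIO

import pandas as pd

def read_keyed_csv(fh):
    # the index name follows 'key:' somewhere before the header row
    key = next(line for line in fh if line.startswith('key:'))[len('key:'):].strip()
    # the header is the first subsequent line starting with that name
    header_line = next(line for line in fh if line.startswith(key))
    header = re.split(r'[ \t]+', header_line.strip())
    # whatever remains in the handle is the data block
    return pd.read_csv(StringIO(fh.read()), header=None, names=header,
                       index_col=0, sep=r'\s+')

# works for both separator styles from the question
tab_df = read_keyed_csv(StringIO(
    'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'))
space_df = read_keyed_csv(StringIO(
    'Idnum Id\nkey: maturity\n2\nmaturity para1 para2\n1Y 0 0\n2Y 0 0'))
```

Since sep=r'\s+' treats any run of whitespace as a separator, the same helper also skips the blank lines in the variant with extra empty rows.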
If you do not mind reading the csv file twice, you can try something like:
from io import StringIO
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', on_bad_lines='skip', header=None)  # error_bad_lines=False on pandas < 1.3
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows)
The same works for your other example:
csvfile = 'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', on_bad_lines='skip', header=None)  # error_bad_lines=False on pandas < 1.3
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows)
This assumes that you know the separator of the file.
Also, if you want to find the key:
csvfile = 'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', on_bad_lines='skip', header=None)  # error_bad_lines=False on pandas < 1.3
key = [x.replace('key:','') for x in data[0] if x.find('key')>-1]
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows).set_index(key)
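If the separator itself is also unknown, one option is to let csv.Sniffer guess it from the tabular part of the text. This is a sketch, assuming the delimiter is either a space or a tab:

```python
import csv
from io import StringIO

import pandas as pd

csvfile = 'Idnum Id\nkey: maturity\n2\nmaturity para1 para2\n1Y 0 0\n2Y 0 0'
lines = csvfile.splitlines()

# locate the index name and the header row without knowing the separator
key = next(l for l in lines if l.startswith('key:'))[len('key:'):].strip()
header_row = next(i for i, l in enumerate(lines) if l.startswith(key))

# Sniffer inspects the regular, tabular part and guesses the delimiter in use
dialect = csv.Sniffer().sniff('\n'.join(lines[header_row:]), delimiters=' \t')
df = pd.read_csv(StringIO(csvfile), sep=dialect.delimiter,
                 skiprows=header_row).set_index(key)
```

Restricting delimiters=' \t' keeps the sniffer from picking some other character that happens to appear in the data.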
Related
Use pandas df.concat to replace .append with custom index
I'm currently trying to replace .append in my code since it won't be supported in the future, and I have some trouble with the custom index I'm using. I read the names of every .shp file in a directory and extract some date from it. To make the link with an Excel file I have, I use the name I extract from the title of the file:
df = pd.DataFrame(columns = ['date','fichier'])
for i in glob.glob("*.shp"):
    nom_parcelle = i.split("_")[2]
    if not nom_parcelle in df.index:
        # print(df.last_valid_index())
        date_recolte = i.split("_")[-1]
        new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
        df = df.append(new_row, ignore_index=False)
This works exactly as I want it to. Sadly, I can't find a way to replace it with .concat. I looked for ways to keep the index with concat but didn't find anything that worked as I intended. Did I miss anything?
Try the approach below with pandas.concat, based on your code:
import glob
import pandas as pd

df = pd.DataFrame(columns = ['date','fichier'])
dico_dfs = {}
for i in glob.glob("*.shp"):
    nom_parcelle = i.split("_")[2]
    if not nom_parcelle in df.index:
        # print(df.last_valid_index())
        date_recolte = i.split("_")[-1]
        new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
        dico_dfs[i] = new_row.to_frame()
df = pd.concat(dico_dfs, ignore_index=False, axis=1).T.droplevel(0)
# Output :
print(df)
          date                 fichier
nom1  20220101  a_xx_nom1_20220101.shp
nom2  20220102  b_yy_nom2_20220102.shp
nom3  20220103  c_zz_nom3_20220103.shp
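An alternative sketch (the file names below are hypothetical stand-ins for the glob.glob results): collect the Series in a dict keyed by nom_parcelle, which both de-duplicates and keeps the custom index, then call concat once at the end:

```python
import pandas as pd

# hypothetical file names standing in for glob.glob("*.shp")
shp_files = ['a_xx_nom1_20220101.shp', 'b_yy_nom2_20220102.shp',
             'c_zz_nom3_20220103.shp']

rows = {}
for i in shp_files:
    nom_parcelle = i.split('_')[2]
    if nom_parcelle not in rows:  # same duplicate check as the original loop
        date_recolte = i.split('_')[-1]
        rows[nom_parcelle] = pd.Series(
            {'date': date_recolte.split('.')[0], 'fichier': i},
            name=nom_parcelle)

# one concat at the end; the Series names become the row index after .T
df = pd.concat(list(rows.values()), axis=1).T
```

Deferring concat to a single call after the loop also avoids the quadratic copying that repeated append/concat inside the loop causes.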
'poorly' organized csv file
I have a CSV file that I have to do some data processing on, and it's a bit of a mess. It's about 20 columns long, but there are multiple datasets concatenated in each column; see the dummy file below. I'm trying to import each sub-file into a separate pandas dataframe, but I'm not sure of the best way to parse the csv other than hardcoding a certain length to import. Any suggestions? I guess I could find where the blank rows are (loop through the entire file and find them, then read each block), but that doesn't seem very efficient, and I have lots of csv files like this to read.
import pandas as pd
nrows = 20
skiprows = 0
# but this only reads in the first block
df = pd.read_csv(csvfile, nrows=nrows, skiprows=skiprows)
Below is a dummy example:
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
0.128671067,0.56740537,0.689995486,0.518493779,0.94916205
0.214026742,0.176948186,0.883636252,0.732258971,0.463732841
0.769415726,0.960761306,0.401863804,0.41823372,0.812081565
0.529750933,0.360314266,0.461615009,0.387516958,0.136616263
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
0.897559206,0.659759362,0.691643603,0.155601111,0.713735399
0.860188233,0.805013656,0.772153733,0.809025634,0.257632085
0.844167809,0.268060979,0.015993504,0.95131982,0.321210766
0.86288383,0.236599974,0.279435193,0.311005146,0.037592509
0.938348876,0.941851279,0.582434058,0.900348616,0.381844182
0.344351819,0.821571854,0.187962046,0.218234588,0.376122331
0.829766776,0.869014514,0.434165111,0.051749472,0.766748447
0.327865017,0.938176948,0.216764504,0.216666543,0.278110502
0.243953506,0.030809033,0.450110334,0.097976735,0.762393831
0.484856452,0.312943244,0.443236377,0.017201097,0.038786057
0.803696521,0.328088545,0.764850865,0.090543472,0.023363909
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
0.021809601,0.406001002,0.072701623,0.964640184,0.023427393
0.406226618,0.421944527,0.413150342,0.337243905,0.515996389
0.829989793,0.168974332,0.246064043,0.067662474,0.851182924
0.812736737,0.667154845,0.118274705,0.484017732,0.052666038
0.215947395,0.145078319,0.484063281,0.79414799,0.373845815
0.497877968,0.554808367,0.370429652,0.081553316,0.793608698
0.607612542,0.424703584,0.208995066,0.249033837,0.808169709
0.199613478,0.065853429,0.77236195,0.757789625,0.597225697
0.044167285,0.1024231,0.959682778,0.892311813,0.621810775
0.861175219,0.853442735,0.742542086,0.704287769,0.435969078
0.706544823,0.062501379,0.482065481,0.598698867,0.845585046
0.967217599,0.13127149,0.294860203,0.191045015,0.590202032
0.031666757,0.965674812,0.177792841,0.419935921,0.895265056
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.306849588,0.177454423,0.538670939,0.602747137,0.081221293
0.729747557,0.11762043,0.409064884,0.051577964,0.666653287
0.492543468,0.097222882,0.448642979,0.130965724,0.48613413
0.0802024,0.726352481,0.457476151,0.647556514,0.033820374
0.617976299,0.934428994,0.197735831,0.765364856,0.350880707
0.07660401,0.285816636,0.276995238,0.047003343,0.770284864
0.620820688,0.700434525,0.896417099,0.652364756,0.93838793
0.364233925,0.200229902,0.648342989,0.919306736,0.897029239
0.606100716,0.203585366,0.167232701,0.523079381,0.767224301
0.616600448,0.130377791,0.554714839,0.468486555,0.582775753
0.254480861,0.933534632,0.054558237,0.948978985,0.731855548
0.620161044,0.583061202,0.457991555,0.441254272,0.657127968
0.415874646,0.408141761,0.843133575,0.40991199,0.540792744
0.254903429,0.655739954,0.977873649,0.210656057,0.072451639
0.473680525,0.298845701,0.144989283,0.998560665,0.223980961
0.30605008,0.837920854,0.450681322,0.887787908,0.793229776
0.584644405,0.423279153,0.444505314,0.686058204,0.041154856
from io import StringIO
import pandas as pd

data = """TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
"""

df = pd.read_csv(StringIO(data), header=None)
start_marker = 'TIME'
grouper = (df.iloc[:, 0] == start_marker).cumsum()
groups = df.groupby(grouper)
frames = [gr.T.set_index(gr.index[0]).T for _, gr in groups]
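A minimal, self-contained variant of the same cumsum trick (with made-up two-column data), adding a cleanup step that promotes each block's first row to the header and restores numeric dtypes:

```python
from io import StringIO

import pandas as pd

data = """TIME,HDRA-1,HDRA-2
0.1,0.2,0.3
0.4,0.5,0.6
TIME,HDRB-1,HDRB-2
0.7,0.8,0.9
"""

df = pd.read_csv(StringIO(data), header=None)
# every row whose first field is 'TIME' starts a new block
grouper = (df.iloc[:, 0] == 'TIME').cumsum()

frames = []
for _, gr in df.groupby(grouper):
    block = gr.iloc[1:].copy()   # drop the embedded header row...
    block.columns = gr.iloc[0]   # ...and use its values as column names
    frames.append(block.reset_index(drop=True).astype(float))
```

Without the astype(float) step every column stays as strings, because the embedded header rows force the whole file to be parsed as object dtype.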
Copy/assign a Pandas dataframe based on their name in a for loop
I am relatively new to python, and I am struggling with the following: I have a set of dataframes with sequential names (df_i), which I want to access in a for loop based on their name (as a string). How can I do that? e.g.
df_1 = pd.read_csv('...')
df_2 = pd.read_csv('...')
df_3 = pd.read_csv('...')
....
n_df = 3
for i in range(n_df):
    df_namestr = 'df_' + str(i+1)
    # ---------------------
    df_temp = df_namestr
    # ---------------------
    # Operate with df_temp. For i+1 = 1, df_temp should be df_1
Kind regards,
DF
You can try something like that:
for n in range(1, n_df+1):
    df_namestr = f"df_{n}"
    df_tmp = locals().get(df_namestr)
    if not isinstance(df_tmp, pd.DataFrame):
        continue
    print(df_namestr)
    print(df_tmp)
Refer to the documentation of locals() to know more.
Would it be better to approach the accessing of multiple dataframes by reading them into a list? You could put all the required csv files in a subfolder and read them all in. Then they are in a list and you can access each one as an item in that list. Example:
import pandas as pd
import glob

path = r'/Users/myUsername/Documents/subFolder'
csv_files = glob.glob(path + "/*.csv")
dfs = []
for filename in csv_files:
    df = pd.read_csv(filename)
    dfs.append(df)
print(len(dfs))
print(dfs[1].head())
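If you later need to know which frame came from which file, a dict keyed by file name can be handier than a plain list. A sketch using throwaway files in a temporary folder (the files and their contents are fabricated for the demo):

```python
import glob
import os
import tempfile

import pandas as pd

# create two throwaway csv files to stand in for the subfolder
path = tempfile.mkdtemp()
for name in ('a.csv', 'b.csv'):
    pd.DataFrame({'x': [1, 2]}).to_csv(os.path.join(path, name), index=False)

# key each frame by its file name instead of a bare list position
dfs = {os.path.basename(f): pd.read_csv(f)
       for f in sorted(glob.glob(path + "/*.csv"))}
```

Access then reads as dfs['a.csv'] rather than a positional dfs[0], which is harder to get wrong.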
Pandas saving in text format
I am trying to save the output, which is a number, to a text file in pandas after working on the dataset.
import pandas as pd
df = pd.read_csv("sales.csv")

def HighestSales():
    df.drop(['index', "month"], axis =1, inplace = True)
    df2 = df.groupby("year").sum()
    df2 = df2.sort_values(by = 'sales', ascending = True).reset_index()
    df3 = df2.loc[11, 'year']
    df4 = pd.Series(df3)
    df5 = df4.iloc[0]
    # the output here is 1964, which alone needs to be saved in the text file
    df5.to_csv("modified.txt")

HighestSales()
But I get "'numpy.int64' object has no attribute 'to_csv'" - this error. Is there a way to save just one single value to a text file?
you can do:
# open a file named modified.txt for writing
with open('modified.txt', 'w') as f:
    # df5 is just an integer (1964); str() is needed before
    # concatenating the line break
    f.write(str(df5) + '\n')
You cannot save a single value to csv by using pd.to_csv; in your case you should wrap it in a DataFrame again and then save that. If you want to see only the number in the .txt file, you need to add some parameters:
result = pd.DataFrame([df5])
result.to_csv('modified.txt', index=False, header=False)
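For a single scalar, plain file I/O (or pathlib) is arguably simpler than round-tripping through a DataFrame. A sketch with made-up sales data, since the original sales.csv is not available:

```python
import tempfile
from pathlib import Path

import pandas as pd

# stand-in for the result of the groupby in the question
df = pd.DataFrame({'year': [1963, 1964], 'sales': [10, 20]})
df5 = df.loc[df['sales'].idxmax(), 'year']   # a numpy scalar, here 1964

# str() is required: df5 is a numpy integer, not a string
out = Path(tempfile.mkdtemp()) / 'modified.txt'
out.write_text(str(df5) + '\n')
```

Using idxmax also sidesteps the sort-then-loc[11] step, which silently depends on the table having exactly twelve rows.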
Pandas - Trying to save a set of files by reading it using Pandas but only the latest file gets saved
I am trying to read a set of txt files into pandas, as below. I am able to read them into a DataFrame; however, when I try to save the DataFrame it only saves the last file it read, even though print(df) prints all the records. Given below is the code I am using:
files = '/users/user/files'
list = []
for file in files:
    df = pd.read_csv(file)
    list.append(df)
    print(df)
df.to_csv('file_saved_path')
Could anyone advise why only the last file is being saved to the csv file and not the entire list?
Expected output: output1 output2 output3
Current output: output1,output2,output3
Try this (accumulating into df1 each time, so the final to_csv sees every file, not just the last pair):
import os
import pandas as pd

path = '/users/user/files'
df1 = pd.DataFrame()
for id, file in enumerate(os.listdir(path)):
    data = pd.read_csv(path + '/' + file, sep='\t')
    if id == 0:
        df1 = data
    else:
        df1 = pd.concat([df1, data], ignore_index=True)
df1.to_csv('file_saved_path')
First change the variable name list, because list is a builtin name in Python, then build the final DataFrame with concat:
files = '/users/user/files'
L = []
for file in files:
    df = pd.read_csv(file)
    L.append(df)

bigdf = pd.concat(L, ignore_index=True)
bigdf.to_csv('file_saved_path')
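Note that in the question files = '/users/user/files' is a string, so for file in files iterates over single characters; listing the folder with glob fixes that. A self-contained sketch with throwaway files (folder and contents fabricated for the demo):

```python
import glob
import os
import tempfile

import pandas as pd

# build a throwaway folder with three small csv files
folder = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({'a': [i]}).to_csv(os.path.join(folder, f'f{i}.csv'), index=False)

# glob yields real file paths, unlike iterating over the folder string
frames = [pd.read_csv(f) for f in sorted(glob.glob(os.path.join(folder, '*.csv')))]
bigdf = pd.concat(frames, ignore_index=True)
```

A single concat after the loop is also much faster than concatenating inside the loop, since each in-loop concat copies everything accumulated so far.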