How to read every file in folder to dataframe named after filename and overlay column names? - pandas

I am working on a project where I am downloading public data from http://pdata.hcad.org/download/, specifically the zip files "real_acct_ownership" and "real_building_land".
Each of these zip files contains data on homes built in the Houston area, such as addresses, fixtures, sq ft, etc.
My goal is to organize the data so that every file in the zip folder becomes a data frame indexable by the column "account".
I am running into the issue of how to create a function or for loop that reads the data into a data frame named after the file, and how to overlay column names, since the files in the zip folders do not contain column names. The column names can be found in "access.zip", linked at the top left-hand corner of the website.
In my code so far I am calling each file from the above two folders and specifying each column name by hand. I want this to be an iterative process, as I will have to do this for other counties, and I would like a way to loop over the files in the folder.
my code so far with NO loops:
import pandas as pd

fixtures = pd.read_csv('/Users/Desktop/Real_building_land/fixtures.txt', header=None,
                       encoding='cp037', error_bad_lines=False, sep='\t')
real_acct = pd.read_csv('/Users/Desktop/Real_acct_owner/real_acct.txt', header=None,
                        encoding='cp037', error_bad_lines=False, sep='\t')
exterior = pd.read_csv('/Users/Desktop/Real_building_land/exterior.txt', header=None,
                       encoding='cp037', error_bad_lines=False, sep='\t')

fixtures.columns = ('ACCOUNT', 'BUILDING_NUMBER', 'FIXTURE_TYPE', 'FIXTURE_DESCRIPTION', 'UNITS')
real_acct.columns = ("ACCOUNT", "TAX_YEAR", "MAILTO", "MAIL_ADDR_1", "MAIL_ADDR_2", "MAIL_CITY",
                     "MAIL_STATE", "MAIL_ZIP", "MAIL_COUNTRY", "UNDELIVERABLE", "STR_PFX", "STR_NUM",
                     "STR_NUM_SFX", "STR_NAME", "STR_SFX", "STR_SFX_DIR", "STR_UNIT", "SITE_ADDR_1",
                     "SITE_ADDR_2", "SITE_ADDR_3", "STATE_CLASS", "SCHOOL_DIST", "MAP_FACET", "KEY_MAP",
                     "NEIGHBORHOOD_CODE", "NEIGHBORHOOD_GROUP", "MARKET_AREA_1", "MARKET_AREA_1_DSCR",
                     "MARKET_AREA_2", "MARKET_AREA_2_DSCR", "ECON_AREA", "ECON_BLD_CLASS", "CENTER_CODE",
                     "YR_IMPR", "YR_ANNEXED", "SPLT_DT", "DSC_CD", "NXT_BUILDING", "TOTAL_BUILDING_AREA",
                     "TOTAL_LAND_AREA", "ACREAGE", "CAP_ACCOUNT", "SHARED_CAD_CODE", "LAND_VALUE",
                     "IMPROVEMENT_VALUE", "EXTRA_FEATURES_VALUE", "AG_VALUE", "ASSESSED_VALUE",
                     "TOTAL_APPRAISED_VALUE", "TOTAL_MARKET_VALUE", "PRIOR_LND_VALUE", "PRIOR_IMPR_VALUE",
                     "PRIOR_X_FEATURES_VALUE", "PRIOR_AG_VALUE", "PRIOR_TOTAL_APPRAISED_VALUE",
                     "PRIOR_TOTAL_MARKET_VALUE", "NEW_CONSTRUCTION_VALUE", "TOTAL_RCN_VALUE",
                     "VALUE_STATUS", "NOTICED", "NOTICE_DATE", "PROTESTED", "CERTIFIED_DATE",
                     "LAST_INSPECTED_DATE", "LAST_INSPECTED_BY", "NEW_OWNER_DATE", "LEGAL_DSCR_1",
                     "LEGAL_DSCR_2", "LEGAL_DSCR_3", "LEGAL_DSCR_4", "JURS")
exterior.columns = ("ACCOUNT", "BUILDING_NUMBER", "EXTERIOR_TYPE", "EXTERIOR_DESCRIPTION", "AREA")

df = fixtures.merge(real_acct, on='ACCOUNT').merge(exterior, on='ACCOUNT')
#df = df.loc[df['ACCOUNT'] == 10020000015]
print(df.shape)
Code from a few trials with loops, none of which worked:
import pandas as pd
import glob
import os

dfs = {os.path.basename(f): pd.read_csv(f, sep='\t', header=None, encoding='cp037',
                                        error_bad_lines=False)
       for f in glob.glob('/Users/Desktop/Real_building_land/*.txt')}
print(dfs)

path = r'path'  # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
Thank you in advance.
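One way to make this iterative (a sketch, not tested against the real files): keep a dict that maps each file's base name to its column list from access.zip, read every .txt in the folder with glob, attach the matching columns, and merge everything that carries ACCOUNT. The paths and the abbreviated column lists below are placeholders; error_bad_lines matches the pandas version used above (newer pandas spells it on_bad_lines='skip').
import glob
import os
from functools import reduce
import pandas as pd

# Column lists transcribed from the layout tables in access.zip.
# Abbreviated here; fill in the full list for each file (e.g. the
# long real_acct list shown above).
COLUMNS = {
    'fixtures': ['ACCOUNT', 'BUILDING_NUMBER', 'FIXTURE_TYPE',
                 'FIXTURE_DESCRIPTION', 'UNITS'],
    'exterior': ['ACCOUNT', 'BUILDING_NUMBER', 'EXTERIOR_TYPE',
                 'EXTERIOR_DESCRIPTION', 'AREA'],
}

dfs = {}
for path in glob.glob('/Users/Desktop/Real_building_land/*.txt'):
    name = os.path.splitext(os.path.basename(path))[0]  # 'fixtures.txt' -> 'fixtures'
    df = pd.read_csv(path, sep='\t', header=None, encoding='cp037',
                     error_bad_lines=False)
    if name in COLUMNS:
        df.columns = COLUMNS[name]
    dfs[name] = df  # the dataframe is now "named" after its file

# Merge every frame that has an ACCOUNT column into one account-indexable frame.
with_account = [d for d in dfs.values() if 'ACCOUNT' in d.columns]
merged = reduce(lambda a, b: a.merge(b, on='ACCOUNT'), with_account)
print(merged.shape)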

Related

'poorly' organized csv file

I have a CSV file that I have to do some data processing on, and it's a bit of a mess. It's about 20 columns wide, but there are multiple datasets concatenated in each column; see the dummy file below.
I'm trying to import each sub-file into a separate pandas dataframe, but I'm not sure of the best way to parse the csv other than hardcoding a fixed length to import. Any suggestions? I guess I could find where the repeated header rows are (loop through the entire file and locate them, then read each block), but that doesn't seem very efficient. I have lots of csv files like this to read.
import pandas as pd
nrows = 20
skiprows = 0 #but this only reads in the first block
df = pd.read_csv(csvfile, nrows=nrows, skiprows=skiprows)
Below is a dummy example:
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
0.128671067,0.56740537,0.689995486,0.518493779,0.94916205
0.214026742,0.176948186,0.883636252,0.732258971,0.463732841
0.769415726,0.960761306,0.401863804,0.41823372,0.812081565
0.529750933,0.360314266,0.461615009,0.387516958,0.136616263
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
0.897559206,0.659759362,0.691643603,0.155601111,0.713735399
0.860188233,0.805013656,0.772153733,0.809025634,0.257632085
0.844167809,0.268060979,0.015993504,0.95131982,0.321210766
0.86288383,0.236599974,0.279435193,0.311005146,0.037592509
0.938348876,0.941851279,0.582434058,0.900348616,0.381844182
0.344351819,0.821571854,0.187962046,0.218234588,0.376122331
0.829766776,0.869014514,0.434165111,0.051749472,0.766748447
0.327865017,0.938176948,0.216764504,0.216666543,0.278110502
0.243953506,0.030809033,0.450110334,0.097976735,0.762393831
0.484856452,0.312943244,0.443236377,0.017201097,0.038786057
0.803696521,0.328088545,0.764850865,0.090543472,0.023363909
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
0.021809601,0.406001002,0.072701623,0.964640184,0.023427393
0.406226618,0.421944527,0.413150342,0.337243905,0.515996389
0.829989793,0.168974332,0.246064043,0.067662474,0.851182924
0.812736737,0.667154845,0.118274705,0.484017732,0.052666038
0.215947395,0.145078319,0.484063281,0.79414799,0.373845815
0.497877968,0.554808367,0.370429652,0.081553316,0.793608698
0.607612542,0.424703584,0.208995066,0.249033837,0.808169709
0.199613478,0.065853429,0.77236195,0.757789625,0.597225697
0.044167285,0.1024231,0.959682778,0.892311813,0.621810775
0.861175219,0.853442735,0.742542086,0.704287769,0.435969078
0.706544823,0.062501379,0.482065481,0.598698867,0.845585046
0.967217599,0.13127149,0.294860203,0.191045015,0.590202032
0.031666757,0.965674812,0.177792841,0.419935921,0.895265056
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.306849588,0.177454423,0.538670939,0.602747137,0.081221293
0.729747557,0.11762043,0.409064884,0.051577964,0.666653287
0.492543468,0.097222882,0.448642979,0.130965724,0.48613413
0.0802024,0.726352481,0.457476151,0.647556514,0.033820374
0.617976299,0.934428994,0.197735831,0.765364856,0.350880707
0.07660401,0.285816636,0.276995238,0.047003343,0.770284864
0.620820688,0.700434525,0.896417099,0.652364756,0.93838793
0.364233925,0.200229902,0.648342989,0.919306736,0.897029239
0.606100716,0.203585366,0.167232701,0.523079381,0.767224301
0.616600448,0.130377791,0.554714839,0.468486555,0.582775753
0.254480861,0.933534632,0.054558237,0.948978985,0.731855548
0.620161044,0.583061202,0.457991555,0.441254272,0.657127968
0.415874646,0.408141761,0.843133575,0.40991199,0.540792744
0.254903429,0.655739954,0.977873649,0.210656057,0.072451639
0.473680525,0.298845701,0.144989283,0.998560665,0.223980961
0.30605008,0.837920854,0.450681322,0.887787908,0.793229776
0.584644405,0.423279153,0.444505314,0.686058204,0.041154856
from io import StringIO
import pandas as pd
data ="""
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
"""
df = pd.read_csv(StringIO(data), header=None)

# Each repeated 'TIME' row marks the start of a new block; the cumulative
# sum assigns every row the number of the block it belongs to.
start_marker = 'TIME'
grouper = (df.iloc[:, 0] == start_marker).cumsum()
groups = df.groupby(grouper)

# Promote each block's first row (the repeated header) to its column names.
frames = [gr.T.set_index(gr.index[0]).T for _, gr in groups]
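Since header=None reads each column as strings here (the repeated header rows block numeric inference), the split frames still hold text. Assuming every data cell is numeric, a short follow-up casts them:
# The promoted frames still contain strings; cast to float for analysis.
frames = [f.astype(float) for f in frames]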

Copy/assign a Pandas dataframe based on their name in a for loop

I am relatively new to python, and I am struggling to do the following:
I have a set of data frames with sequential names (df_i), which I want to access in a for loop by building their name as a string. How can I do that? e.g.
df_1 = pd.read_csv('...')
df_2 = pd.read_csv('...')
df_3 = pd.read_csv('...')
....
n_df = 3
for i in range(n_df):  # range() takes the int directly; len() fails on an int
    df_namestr = 'df_' + str(i + 1)
    # ---------------------
    df_temp = df_namestr  # this only copies the string, not the dataframe
    # ---------------------
    # Operate with df_temp. For i+1 = 1, df_temp should be df_1
Kind regards,
DF
You can try something like this:
for n in range(1, n_df + 1):
    df_namestr = f"df_{n}"
    df_tmp = locals().get(df_namestr)
    if not isinstance(df_tmp, pd.DataFrame):
        continue
    print(df_namestr)
    print(df_tmp)
Refer to the documentation of locals() to know more.
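A safer pattern, if you control how the frames are created, is to store them in a dict keyed by name rather than reaching into locals(). A minimal sketch under the same setup (the '...' paths are placeholders, as in the question):
# Build the frames directly into a dict, then look them up by name.
dfs = {f"df_{n}": pd.read_csv('...') for n in range(1, n_df + 1)}
for name, df_tmp in dfs.items():
    print(name)
    print(df_tmp)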
Would it be better to approach the accessing of multiple dataframes by reading them into a list?
You could put all the csv files required in a subfolder and read them all in. Then they are in a list and you can access each one as an item in that list.
Example:
import pandas as pd
import glob
path = r'/Users/myUsername/Documents/subFolder'
csv_files = glob.glob(path + "/*.csv")
dfs = []
for filename in csv_files:
    df = pd.read_csv(filename)
    dfs.append(df)
print(len(dfs))
print(dfs[1].head())

Pandas saving in text format

I am trying to save the output, which is a number, to a text file after working on the dataset in pandas.
import pandas as pd
df = pd.read_csv("sales.csv")

def HighestSales():
    df.drop(['index', "month"], axis=1, inplace=True)
    df2 = df.groupby("year").sum()
    df2 = df2.sort_values(by='sales', ascending=True).reset_index()
    df3 = df2.loc[11, 'year']
    df4 = pd.Series(df3)
    df5 = df4.iloc[0]
    # the output here is 1964, which alone needs to be saved in the text file
    df5.to_csv("modified.txt")

HighestSales()
But I get this error: 'numpy.int64' object has no attribute 'to_csv'. Is there a way to save just one single value in the text file?
you can do:
# open a file named modified.txt for writing
with open('modified.txt', 'w') as f:
    # df5 is just the integer 1964; convert it to str before writing,
    # then add a line break
    f.write(str(df5) + '\n')
You cannot save a single scalar by calling to_csv on it. In your case you should wrap it in a DataFrame again and then save that. If you want to see only the number in the .txt file, you need to add some parameters:
result = pd.DataFrame([df5])  # wrap the scalar in a list, or the constructor raises
result.to_csv('modified.txt', index=False, header=False)
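Another minimal option (an aside, assuming df5 holds the single value as above):
from pathlib import Path
# write_text replaces the file contents with the single value plus a newline
Path('modified.txt').write_text(str(df5) + '\n')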

Pandas - Trying to save a set of files by reading it using Pandas but only the latest file gets saved

I am trying to read a set of txt files into Pandas as below. I can read them into a DataFrame; however, when I try to save the DataFrame it only saves the last file it read, even though print(df) prints all the records.
Given below is the code I am using:
import glob
import pandas as pd

files = glob.glob('/users/user/files/*.txt')
list = []
for file in files:
    df = pd.read_csv(file)
    list.append(df)
    print(df)
df.to_csv('file_saved_path')
Could anyone advise why only the last file is being saved to the csv file and not the entire list?
Expected output:
output1
output2
output3
Current output:
output1,output2,output3
Try this:
import os

path = '/users/user/files'
for i, file in enumerate(os.listdir(path)):
    data = pd.read_csv(path + '/' + file, sep='\t')
    if i == 0:
        df1 = data
    else:
        df1 = pd.concat([df1, data], ignore_index=True)  # accumulate, don't overwrite
df1.to_csv('file_saved_path')
First, rename the variable list, because list is a Python builtin; then build the final DataFrame with concat:
import glob

L = []
for file in glob.glob('/users/user/files/*.txt'):
    df = pd.read_csv(file)
    L.append(df)
bigdf = pd.concat(L, ignore_index=True)
bigdf.to_csv('file_saved_path')
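A compact variant of the same idea (a sketch, assuming the same folder layout): pd.concat accepts any iterable of frames, so the loop collapses to a generator expression:
import glob
import pandas as pd

# Read and stack every file in one pass.
bigdf = pd.concat((pd.read_csv(f) for f in glob.glob('/users/user/files/*.txt')),
                  ignore_index=True)
bigdf.to_csv('file_saved_path')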

Set Multiple Restrictions for Rows Called to Print in Pandas

import pandas as pd
import numpy as np
#load data
#data file and py file must be in same file path
df = pd.read_csv('cbp15st.txt', delimiter=',', encoding='utf-8-sig')
#define load data DataFrame columns
state = df['FIPSTATE']
industry = df['NAICS']
legal_form_of_organization = df['LFO']
suppression_flag = df['EMPFLAG']
total_establishment = df['EST']
establishment_1_4 = df['N1_4']
establishment_5_9 = df['N5_9']
establishment_10_19 = df['N10_19']
establishment_20_49 = df['N20_49']
establishment_50_99 = df['N50_99']
establishment_100_249 = df['N100_249']
establishment_250_499 = df['N250_499']
establishment_500_999 = df['N500_999']
establishment_1000_more = df['N1000']
#use df.loc to parse dataset for particular value types
print(df.loc[df['EMPFLAG']=='A'], df.loc[df['FIPSTATE']==1],
      df.loc[df['NAICS']=='------'])
Currently I am using df.loc to locate specific values in the df columns, but this prints rows that match any one of these conditions, not only rows that match all of them (an 'or' versus 'and' situation).
I am trying to find a way to place multiple restrictions so that only rows meeting criteria x, y, and z are returned.
Current readout from the above: (screenshot not preserved)
You can use the & operator while specifying multiple filtering criteria, something like:
df1 = df.loc[(df['EMPFLAG']=='A') & (df['FIPSTATE']==1) & (df['NAICS']=='------')]
print(df1)
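The same filter can also be written with DataFrame.query, which some find more readable (an equivalent sketch under the same column names):
# 'and' in the query string requires every condition to hold for a row
df1 = df.query("EMPFLAG == 'A' and FIPSTATE == 1 and NAICS == '------'")
print(df1)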