KeyError: '3' when extracting data from a pandas DataFrame

My code plan is as follows:
1) find the csv files in a folder using glob and create a list of files
2) convert each csv file into a dataframe
3) extract the data from a column location and convert it into a separate dataframe
4) append the new data to a separate summary csv file
The code is as follows:
import glob
import pandas as pd

Result = []

def result(filepath):
    files = glob.glob(filepath)
    print files
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfb = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i + 1)
        selected_data = df['3'].ix[0:4]
        new_dfb[colname] = selected_data
        Result.append(new_dfb)
    folder = r"C:/Users/Joey/Desktop/tcd/summary.csv"
    new_dfb.to_csv(folder)

result("C:/Users/Joey/Desktop/tcd/*.csv")
print Result
The error is shown below. The issue seems to be with line 36, which corresponds to selected_data = df['3'].ix[0:4].
I show one of my csv files below:
I'm not sure what the problem is with the DataFrame constructor.

Your csv snippet is a bit unclear, but as suggested in the comments, read_csv (from_csv in this case) automatically takes the first row as the header. The behaviour you appear to want is for the columns to be labelled numerically instead. To achieve this you need:
[pd.DataFrame.from_csv(f, index_col=None, header=None) for f in files]
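As a side note, DataFrame.from_csv has since been deprecated in favour of pd.read_csv, and with header=None the column labels are integers, so the column would be selected as df[3] rather than df['3'] (.ix is likewise deprecated). A minimal sketch of the modern equivalent, reusing the folder path from the question:
import glob
import pandas as pd

files = glob.glob("C:/Users/Joey/Desktop/tcd/*.csv")

# header=None keeps the first row as data and labels the columns 0, 1, 2, ...
dataframes = [pd.read_csv(f, header=None) for f in files]

for i, df in enumerate(dataframes):
    selected_data = df[3].iloc[:5]  # first five rows, as .ix[0:4] selects on a default index
    print(selected_data)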

Compare two files and find the difference

I have two csv files and have to find the difference between them, then generate an output file: sheet1 with the difference data for txt1.csv and sheet2 with the difference data for txt2.csv. Kindly advise me.
Sample Input:
txt1.csv
txt2.csv
Code:
with open('txt1.csv', 'r') as t1, open('txt2.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

with open('update1.csv', 'w') as outFile:
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
Expected output:
In sheet1
In sheet2
Note: when the input files are large, the above code executes very slowly.
You could try the following.
Dataset:
df1 = pd.DataFrame({"A": [1, 2, 3, 4]})
df2 = pd.DataFrame({"A": [1, 2, 8, 9]})
Output1:
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))].reset_index(drop=True)
   A
0  3
1  4
Output2:
df2[~df2.apply(tuple,1).isin(df1.apply(tuple,1))].reset_index(drop=True)
   A
0  8
1  9
In your case something like:
df1 = pd.read_csv("txt1.csv")
df2 = pd.read_csv("txt2.csv")
delta1 = df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))].reset_index(drop=True)
delta2 = df2[~df2.apply(tuple,1).isin(df1.apply(tuple,1))].reset_index(drop=True)
delta1.to_csv("txt1_delta.csv", index=False)
delta2.to_csv("txt2_delta.csv", index=False)
Edit: or, if you want to have it in Excel with multiple sheets:
# pip install xlsxwriter  (if required)
import xlsxwriter
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter("your_output_excel.xlsx", engine="xlsxwriter")
# Write each dataframe to a different worksheet.
delta1.to_excel(writer, sheet_name="Delta1")
delta2.to_excel(writer, sheet_name="Delta2")
# Close the Pandas Excel writer and output the Excel file.
writer.save()
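Note that writer.save() has been removed in recent pandas versions; using pd.ExcelWriter as a context manager saves and closes the file automatically. A minimal sketch with the same hypothetical output name:
with pd.ExcelWriter("your_output_excel.xlsx", engine="xlsxwriter") as writer:
    delta1.to_excel(writer, sheet_name="Delta1")
    delta2.to_excel(writer, sheet_name="Delta2")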

Add column with filename wildcard

I have files that have the pattern
XXXX____________030621_120933_D.csv
YYYY____________030621_120933_E.csv
ZZZZ____________030621_120933_F.csv
I am using glob.glob and a for loop to parse each file into pandas to create a DataFrame, which I will merge at the end. I want to add a column which will add XXXX, YYYY, and ZZZZ to each data frame accordingly.
I can create the column called ID with df['ID'] and want to pick the value from the filenames. What is the easiest way to grab that from the filename when reading the CSV and processing via pd?
If the file names are as what you have presented, then use this code:
dir_path = 'path/to/your/directory/'  # path to your directory
file_paths = glob.glob(dir_path + '*.csv')
result = pd.DataFrame()
for file_ in file_paths:
    df = pd.read_csv(file_)
    df['ID'] = file_[<index of the ID>]
    result = result.append(df, ignore_index=True)
Finding the right index might take a bit of time, but that should do it.
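Assuming the ID is always the text before the first underscore (as in the XXXX/YYYY/ZZZZ examples above), splitting the basename avoids hunting for the right slice index; collecting the frames and concatenating once also avoids the now-deprecated DataFrame.append. A sketch, with the directory path as a placeholder:
import glob
import os
import pandas as pd

frames = []
for file_ in glob.glob('path/to/your/directory/*.csv'):
    df = pd.read_csv(file_)
    # everything before the first underscore in the filename is the ID
    df['ID'] = os.path.basename(file_).split('_')[0]
    frames.append(df)

result = pd.concat(frames, ignore_index=True)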

Pandas dataframe: Splitting single-column data from txt file into multiple columns

I have an obnoxious .txt file that is output from a late 1990's program for an Agilent instrument. I am trying to comma-separate and organize the single column of the text file into multiple columns in a pd dataframe. After some organization, the txt file currently looks like the following: See link here:
Organized Text File
Each row is indexed in a pd dataframe. The code used to reorganize the file and attempt to split into multiple columns follows:
quantData = pd.read_csv(epaTemp, header = None)
trimmed_File = quantData.iloc[16:,]
trimmed_File = trimmed_File.drop([17,18,70,71,72], axis = 0)
print (trimmed_File)
###
splitFile = trimmed_File.apply( lambda x: pd.Series(str(x).split(',')))
print (splitFile)
The split function above did not get applied to all rows present in the txt file. It only applied split(',') to the first row rather than to all of them:
0 16 Compound R... 1
dtype: object
I would like this split functionality to apply to all rows in my txt file so I can further organize my data. Thank you for the help.
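For what it's worth, str(x) inside apply stringifies an entire column at once, which is why only a single Series comes back. pandas' vectorised string methods split each row instead; a minimal sketch, assuming the text sits in column 0 of trimmed_File:
# split every row on commas and expand the pieces into separate columns
splitFile = trimmed_File[0].str.split(',', expand=True)
print(splitFile)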

How to access nested tables in hdf5 with pandas

I want to retrieve a table from an HDF5 file using pandas.
Following several references I found, I have tried to open the file using:
df = pd.read_hdf('data/test.h5', g_name),
where g_name is the path to the object I want to retrieve, i.e. the table TAB1, for instance, MAIN/Basic/Tables/TAB1.
g_name is retrieved as follows:
def get_all(name):
    if 'TAB1' in name:
        return name

with h5py.File('data/test.h5') as f:
    g_name = f.visit(get_all)
    print(g_name)
    group = f[g_name]
    print(type(group))
I have also tried retrieving the object itself, as seen in the above code snippet, but the object type is an h5py Dataset, not something pandas can read directly.
How would I convert this to something I can read as a data frame in pandas?
For the first case, I get the following error:
"cannot create a storer if the object is not existing "
I do not understand why it cannot find the object, if the path is the same as retrieved during the search.
I found the following solution:
hf = h5py.File('data/test.h5')
data = hf.get('MAIN/Basic/Tables/TAB1')
result = data[()]
# This last step just converts the table into a pandas df
df = pd.DataFrame(result)
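A likely reason read_hdf failed: pd.read_hdf only understands HDF5 files that pandas itself wrote via PyTables, so an arbitrary HDF5 layout has no "storer" it recognises. The same solution, with the file opened read-only and closed automatically:
import h5py
import pandas as pd

with h5py.File('data/test.h5', 'r') as hf:
    # read the whole TAB1 table into memory as a numpy array
    result = hf['MAIN/Basic/Tables/TAB1'][()]

df = pd.DataFrame(result)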

Pandas - Issue when concatenating multiple csv files into one

I have a list of csv files that I am trying to concatenate using Pandas.
Given below is sample view of the csv file:
Note: Column 4 - stores the latitude
Column 5 - stores the longitude
store-001,store_name,building_no_060,23.4324,43.3532,2018-10-01 10:00:00,city_1,state_1
store-002,store_name,building_no_532,12.4345,45.6743,2018-10-01 12:00:00,city_2,state_1
store-003,store_name,building_no_536,54.3453,23.3444,2018-07-01 04:00:00,city_3,state_1
store-004,store_name,building_no_004,22.4643,56.3322,2018-04-01 07:00:00,city_2,state_3
store-005,store_name,building_no_453,76.3434,55.4345,2018-10-02 16:00:00,city_4,state_2
store-006,store_name,building_no_456,35.3455,54.3334,2018-10-05 10:00:00,city_6,state_2
When I try to concat multiple csv files in the above format, the latitude and longitude values end up stacked in the first column (cells A2 - A30), followed by the other columns all placed in row 1.
Given below is the way I am performing the concat:
# 'path' here is the folder where all the csv files are stored
masterlist = glob.glob('path')

# this also includes the file name as a column in each dataframe
df_v1 = [pd.read_csv(fp, sep=',', error_bad_lines=False).assign(FileName=os.path.basename(fp)) for fp in masterlist]

df = pd.concat(df_v1, ignore_index=True)

# this stores the final concatenated csv file
df.to_csv('path', index=False)
Could anyone tell me why the concatenation is not working properly? Thanks.
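One plausible cause, since the sample rows above have no header line: read_csv treats the first data row of each file as its header, so every file gets different column names and concat aligns them into a jumble. A sketch of reading with explicit names instead (the column names and paths here are assumptions):
import glob
import os
import pandas as pd

# hypothetical column names inferred from the sample rows
cols = ['store_id', 'store_name', 'building', 'latitude', 'longitude',
        'timestamp', 'city', 'state']

frames = [
    pd.read_csv(fp, header=None, names=cols).assign(FileName=os.path.basename(fp))
    for fp in glob.glob('path/to/csvs/*.csv')  # placeholder path
]

df = pd.concat(frames, ignore_index=True)
df.to_csv('combined.csv', index=False)  # hypothetical output name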