Add column with filename wildcard - pandas

I have files that have the pattern
XXXX____________030621_120933_D.csv
YYYY____________030621_120933_E.csv
ZZZZ____________030621_120933_F.csv
I am using glob.glob and a for loop to read each file into a pandas DataFrame, and I will merge the DataFrames at the end. I want to add a column that holds XXXX, YYYY, or ZZZZ for each DataFrame accordingly.
I can create the column called ID with df['ID'], and I want to pick its value from the filename. What is the easiest way to grab that from the filename when reading the CSV and processing it with pandas?

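If the file names follow the pattern you have shown, you can use this code:
import glob
import os
import pandas as pd

dir_path = 'path/to/your/directory/'  # replace with the path to your directory
file_paths = glob.glob(os.path.join(dir_path, '*.csv'))
frames = []
for file_ in file_paths:
    df = pd.read_csv(file_)
    # ID = first four characters of the file name (XXXX, YYYY, ZZZZ), not of the full path
    df['ID'] = os.path.basename(file_)[:4]
    frames.append(df)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
result = pd.concat(frames, ignore_index=True)
Adjust the slice if your IDs are not exactly four characters long.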
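If the ID length can vary, splitting the file name on its first underscore may be more robust than a fixed index. The following is only a sketch under that assumption; load_with_id and the demo files written to a temporary directory are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_with_id(dir_path):
    """Read every CSV in dir_path and tag its rows with the file name prefix."""
    frames = []
    for file_path in sorted(glob.glob(os.path.join(dir_path, "*.csv"))):
        df = pd.read_csv(file_path)
        # Assumption: the ID is everything before the first underscore.
        df["ID"] = os.path.basename(file_path).split("_", 1)[0]
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up files that follow the naming pattern from the question.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("XXXX____________030621_120933_D.csv",
                 "YYYY____________030621_120933_E.csv"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("value\n1\n2\n")
    merged = load_with_id(tmp)

print(merged)
```

Collecting the frames in a list and calling pd.concat once also avoids copying the growing result on every iteration.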

Related

How to put the first value in one column and the remaining values into another column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas DataFrame from a CSV file.
I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do so?
In case your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
with open(csv_path) as f:
    raw_data = f.readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569

Pandas dataframe: Splitting single-column data from txt file into multiple columns

I have an obnoxious .txt file that is output from a late-1990s program for an Agilent instrument. I am trying to comma-separate and organize the single column of the text file into multiple columns in a pandas DataFrame. After some organization, the txt file currently looks like the linked "Organized Text File".
Each row is indexed in a pandas DataFrame. The code used to reorganize the file and attempt to split it into multiple columns follows:
quantData = pd.read_csv(epaTemp, header = None)
trimmed_File = quantData.iloc[16:,]
trimmed_File = trimmed_File.drop([17,18,70,71,72], axis = 0)
print (trimmed_File)
###
splitFile = trimmed_File.apply( lambda x: pd.Series(str(x).split(',')))
print (splitFile)
The split function above did not get applied to all rows present in the txt file. It only applied split(',') to the first row rather than to all of them:
0 16 Compound R... 1
dtype: object
I would like this split functionality to apply to all rows in my txt file so I can further organize my data. Thank you for the help.
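One likely cause, offered as a sketch rather than a confirmed diagnosis: DataFrame.apply works column by column by default (axis=0), so the lambda receives the whole column and str(x) is the column's printed representation, not one row's text. Splitting the string column directly with Series.str.split(expand=True) handles every row. The sample rows below are invented stand-ins for the real file.

```python
import pandas as pd

# Hypothetical stand-in for trimmed_File: one column (label 0) of comma-separated text.
trimmed = pd.DataFrame({0: ["Benzene,1.23,45", "Toluene,2.34,56", "Xylene,3.45"]})

# Split every row on commas; expand=True spreads the pieces into separate columns.
# Rows with fewer pieces are padded with missing values.
split_file = trimmed[0].str.split(",", expand=True)

print(split_file)
```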

Concatenate multiple file rowwise in a single dataframe with the same header name

I have 400 CSV files, each containing a single column with 4667 rows. Every row holds a name and its corresponding value, for example "A=54", "B=56", and so on for all 4667 rows. My problem statement is:
1. Fetch the variable names and put them in different columns.
2. Fetch the corresponding variable values and put them in the row below those columns.
3. Repeat this for all 400 files, appending each file's values as a new row, which gives 400 rows.
I have done this for a single file, but I don't know how to do it for multiple files.
import glob
from collections import OrderedDict
import pandas as pd
path = r'Github/dataset/Raw_Dataset/'
filenames = glob.glob(path + "*.csv")
dict_of_df = OrderedDict((f, pd.read_csv(f, header=None, names=['Devices'])) for f in filenames)
eda = pd.concat(dict_of_df)
Do you mean a concatenation?
You could use pandas like this:
import pandas as pd
df1 = pd.DataFrame(columns=["A","B","C","D"],data=[["1","2","3","4"]])
df2 = pd.DataFrame(columns=["A","B","C","D"],data=[["5","6","7","8"]])
df3 = pd.DataFrame(columns=["A","B","C","D"],data=[["9","10","11","12"]])
df_concat = pd.concat([df1, df2, df3])
print(df_concat)
Combined with a loop over the files in the folder that end with ".csv", it would be:
import os
import pandas as pd
path = r"C:\YOUR\DIRECTORY"
frames = []
for filename in os.listdir(path):
    if filename.endswith(".csv"):
        print(filename)
        frames.append(pd.read_csv(os.path.join(path, filename)))
# concatenate once, after all files have been read
df_concat = pd.concat(frames)
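Plain concatenation stacks the files as they are. To get one row per file with the variable names as columns, each file can first be turned into a single row. This is only a sketch, assuming every line has the form name=value as described in the question; file_to_row and the two tiny demo files are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd


def file_to_row(path):
    """Turn one file of 'name=value' lines into a Series indexed by name."""
    col = pd.read_csv(path, header=None, names=["Devices"])["Devices"]
    # Split each line at the first '=' into a name part and a value part.
    parts = col.str.split("=", n=1, expand=True)
    return pd.Series(parts[1].values, index=parts[0].values,
                     name=os.path.basename(path))


# Demo with two made-up files instead of the real 400.
with tempfile.TemporaryDirectory() as tmp:
    for i, text in enumerate(["A=54\nB=56\n", "A=60\nB=61\n"]):
        with open(os.path.join(tmp, f"file{i}.csv"), "w") as fh:
            fh.write(text)
    rows = [file_to_row(p) for p in sorted(glob.glob(os.path.join(tmp, "*.csv")))]

# A list of Series becomes one row per Series, with the names as columns.
eda = pd.DataFrame(rows)
print(eda)
```

The values stay as strings here; converting them with pd.to_numeric is a possible follow-up if the data are numeric.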

Not able to add Index and replace header rows - Pandas DataFrame

A dataframe was created by the following code (after setting a path variable):
filenames = glob.glob(path + "/*.csv")
df = []
for filename in filenames:
    df.append(pd.read_csv(filename))
frame = pd.concat(df)
The dataframe comes in without an index, and I want the first row of the dataframe to be the header.
In an attempt to rename the header with the first-row values, I wrote the following code and got the following error:
header = frame.iloc[0]
frame2 = df[1:]
frame2.rename(columns = header)
'list' object has no attribute 'rename'
When I run the following, it seems to me to be a dataframe:
type(frame)
pandas.core.frame.DataFrame
How would I go about giving the dataframe (called frame) a numbered index too?
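The error message points at df, which is the Python list of per-file frames; frame is the concatenated DataFrame. A minimal sketch of slicing frame instead, promoting its first row to the header, and renumbering the index follows. The small frames here are invented stand-ins for the files read from disk.

```python
import pandas as pd

# Hypothetical stand-ins for the per-file frames collected in the list.
frames = [
    pd.DataFrame({"junk1": ["name", "a"], "junk2": ["score", "1"]}),
    pd.DataFrame({"junk1": ["b"], "junk2": ["2"]}),
]
frame = pd.concat(frames)

header = frame.iloc[0]             # first row holds the intended column names
frame2 = frame.iloc[1:].copy()     # slice the DataFrame, not the list
frame2.columns = list(header)      # use the first-row values as the header
frame2 = frame2.reset_index(drop=True)  # numbered index 0, 1, 2, ...

print(frame2)
```

Passing ignore_index=True to pd.concat is another way to get a continuous numbered index at concatenation time.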

Key error: '3' When extracting data from Pandas DataFrame

My code plan is as follows:
1) find CSV files in a folder using glob and create a list of files
2) convert each CSV file into a dataframe
3) extract data from a column location and convert it into a separate dataframe
4) append the new data to a separate summary CSV file
The code is as follows:
Result = []
def result(filepath):
    files = glob.glob(filepath)
    print files
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfb = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i+1)
        selected_data = df['3'].ix[0:4]
        new_dfb[colname] = selected_data
    Result.append(new_dfb)
    folder = r"C:/Users/Joey/Desktop/tcd/summary.csv"
    new_dfb.to_csv(folder)
result("C:/Users/Joey/Desktop/tcd/*.csv")
print Result
The code error is shown below. The issue seems to be with line 36, which corresponds to selected_data = df['3'].ix[0:4].
I show one of my csv files below:
I'm not sure what the problem is with the dataframe constructor?
Your CSV snippet is a bit unclear, but as suggested in the comments, read_csv (from_csv in this case) automatically takes the first row as the list of headers. The behaviour you appear to want is for the columns to be labelled by number rather than by the first row's values. To achieve this you need:
[pd.DataFrame.from_csv(f, index_col=None,header=None) for f in files]
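For current pandas, the same idea can be sketched with pd.read_csv, since DataFrame.from_csv and the .ix indexer were removed in pandas 1.0. With header=None the columns get integer labels starting at 0, so the column is selected with the integer 3 rather than the string '3'. The inline CSV text is made up for illustration.

```python
import io

import pandas as pd

# Invented CSV content with no header row.
csv_text = "10,11,12,13\n20,21,22,23\n30,31,32,33\n"

# header=None keeps the first line as data and labels columns 0, 1, 2, 3.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# .loc slices by label and includes both endpoints, like the old .ix[0:1] on a default index.
selected = df[3].loc[0:1]

print(selected)
```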