Looping through csv files and creating a DataFrame that summarizes info by locating text in columns - pandas

The script that I have so far (see below) does the following:
1: loops through a folder and converts each .xlsx to .csv.
2: loops through the csv list and populates a dataframe with data extracted from each file.
3: appends a new column populated with the filename.
import os

import pandas as pd
import numpy as np

cwd = os.path.abspath('')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.csv'):
        df1 = pd.read_csv(file, header=None, encoding='latin1')
        df1 = df1.assign(Filename=os.path.basename(file))
        df = df1.append(df, ignore_index=False)
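(Aside: DataFrame.append was removed in pandas 2.0, so on a recent version the same accumulation has to go through pd.concat instead; a minimal sketch, using a throwaway folder with one sample CSV standing in for the working directory:)

```python
import glob
import os
import tempfile

import pandas as pd

# A throwaway folder with one sample CSV stands in for your working dir
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'sample.csv'), 'w') as f:
    f.write('a,b\n1,2\n')

# Collect one DataFrame per CSV, then concatenate once at the end
frames = []
for file in glob.glob(os.path.join(tmp, '*.csv')):
    part = pd.read_csv(file, header=None, encoding='latin1')
    part = part.assign(Filename=os.path.basename(file))
    frames.append(part)

df = pd.concat(frames, ignore_index=True)
print(df)
```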
What I want to do now, while still inside the 'file' loop:
1: First column (named Pozo): where column 1 of the dataframe contains 'Pozo:', extract the value from the same row in column 3.
2: Second column: search for the text 'MD' in column 0 of the dataframe, and populate with all the values below the row where that text is found in the same column.
The attached image shows what I want to extract (the searched text in red, and the values to extract in yellow).
What I want is to take the dataframe shown in the image and 'clean it' like so:

        Pozo                MD    Filename
First   NWI-RC-176 calidad  0.00  NWIRC-176-22_SPT_Offline_OutRun_QC.csv
Second  NWI-RC-176 calidad  5.00  NWIRC-176-22_SPT_Offline_OutRun_QC.csv
[...]
Thanks for helping me!

See if this helps. First, choose the value from column 3 where column 1 has 'Pozo:', and assign it to a variable (with header=None the columns are integers, so use df[1] and 3 rather than the strings '1' and '3'):
var_pozo = df.at[df[df[1] == 'Pozo:'].index[0], 3]
var_pozo
To choose all the values below where 'MD' is found, we first find that row, then take all rows below that index into another DataFrame (md_df), then find the first null value and exclude it and everything after it from md_df:
# all rows from df, from the row where 'MD' is found onward
md_df = df[df[df[0] == 'MD'].index[0]:].reset_index()
# find the first null value and keep only the rows above it
md_df = md_df[:md_df[md_df[0].isnull()].index[0]][0].reset_index()
md_df
We then add the required columns to md_df:
md_df['Pozo'] = var_pozo
md_df['filename'] = 'your filename' # replace this with your file variable
md_df.drop(columns='index', inplace=True)
md_df.rename(columns={0: 'MD'}, inplace=True)
md_df
md_df becomes your extracted df. Here is the output with the test csv I created:
    MD                Pozo       filename
0  (m)  NWI-RC-176 calidad  your filename
1    0  NWI-RC-176 calidad  your filename
2    5  NWI-RC-176 calidad  your filename
3   10  NWI-RC-176 calidad  your filename
4   15  NWI-RC-176 calidad  your filename
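For reference, here is the whole extraction as one runnable sketch on an in-memory CSV shaped like the screenshot (the exact layout and the filename are assumptions):

```python
import io

import pandas as pd

# Hypothetical CSV mimicking the layout: 'Pozo:' in column 1 with its
# value in column 3, and an 'MD' marker in column 0 followed by depths.
raw = (
    "x,Pozo:,y,NWI-RC-176 calidad\n"
    "MD,,,\n"
    "(m),,,\n"
    "0.00,,,\n"
    "5.00,,,\n"
)
df = pd.read_csv(io.StringIO(raw), header=None)

# Value in column 3 of the row where column 1 equals 'Pozo:'
var_pozo = df.loc[df[1] == 'Pozo:', 3].iloc[0]

# All rows of column 0 below the 'MD' marker, up to the first null
start = df[df[0] == 'MD'].index[0] + 1
md = df.loc[start:, 0]
first_null = md[md.isnull()].index
if len(first_null):
    md = md.loc[:first_null[0] - 1]

md_df = md.to_frame(name='MD')
md_df['Pozo'] = var_pozo
md_df['Filename'] = 'example.csv'  # replace with your file variable
print(md_df)
```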

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. So far I have read the tables using the skiprows and nrows method, which works for now, but in future it won't, because the table lengths are dynamic.
Is there any other workaround, apart from skiprows & nrows, to read the tables shown in the picture?
I want to read data1 as one table & data2 as another table, and in particular I want the columns "RR", "FF" & "WW" as two different data frames.
I'd appreciate it if someone could help or guide me on this.
Method I have tried:
all_files = glob.glob(INPATH + "*sample*")
df1 = pd.read_excel(all_files[0], skiprows=11, nrows=3)
df2 = pd.read_excel(all_files[0], skiprows=23, nrows=3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one in your image, here is one way to do it:
import pandas as pd

df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)

# Setup
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]

dfs = []
for i in range(len(indices)):
    # Slice df from the current index up to the row before the next target
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        data = df.loc[indices[i] :, :]
    # Within the slice, keep only values from the row starting with 'rr'
    r_idx = data.loc[data["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout

How to put the first value in one column and the remaining into another column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas data frame from a csv file. I want to put the first entry of each row (the filename) into a column named "filenames", and the remaining entries into another column named "class". How can I do so?
In case your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
raw_data = open(csv_path).readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
              filenames                       class
0  ROCO2_CLEF_00001.jpg           C3277934,C0002978
1  ROCO2_CLEF_00002.jpg  C3265939,C0002942,C2357569
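An alternative that stays inside pandas: read the lines as single strings and use Series.str.split with n=1, so only the first comma splits. A sketch with the two rows from your example held in memory (replace raw with open(csv_path).read() for a real file):

```python
import pandas as pd

raw = (
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978\n"
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569\n"
)

# One Series entry per line, then split on the first comma only
s = pd.Series(raw.strip().splitlines())
df = s.str.split(',', n=1, expand=True)
df.columns = ['filenames', 'class']
print(df)
```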

New pandas data frame filled from continuous scrape, column names known

I scraped data like this:
for row in stat_table.find_all("tr"):
    for cell in row.find_all('td'):
        print(cell.text)
The output looks like this:
1
2019-10-24
31-206
MIL
#
HOU
W (+6)
0
16:35
1
3
.333
0
2
etc.
I created a columns variable:
columns = ['G','Date', 'Age','Team',"at","Opp",'Score','Starter','MP','FG','FGA','FG%','3P','3PA',"3P%",
'FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS',"GmSC","+/-"]
I would like to read in the output and create a new pandas data frame with those columns. Any idea how I can read that in?
The way I would do it is to gather the cells of each table row into one list and append that to a list of lists (body):
header = [...]  # your column names
body = []  # list of lists
for row in stat_table.find_all("tr"):
    # collect one list of cell texts per table row
    body.append([cell.text for cell in row.find_all('td')])
Then, make sure header and the lists within body are of equal length and:
df = pd.DataFrame(data=body, columns=header)
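If the cells do come back as one flat list (as in the printed output in the question), another option is to chunk that list by the number of columns; a minimal sketch with made-up values and a shortened column list, assuming every row really has len(columns) cells:

```python
import pandas as pd

# Shortened column list and fake cell texts for the example
columns = ['G', 'Date', 'Age']
flat = ['1', '2019-10-24', '31-206',
        '2', '2019-10-25', '31-207']

# Chunk the flat list into consecutive groups of len(columns)
n = len(columns)
rows = [flat[i:i + n] for i in range(0, len(flat), n)]
df = pd.DataFrame(rows, columns=columns)
print(df)
```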

Pandas dataframe: Splitting single-column data from txt file into multiple columns

I have an obnoxious .txt file that is output from a late 1990's program for an Agilent instrument. I am trying to comma-separate and organize the single column of the text file into multiple columns in a pd dataframe. After some organization, the txt file currently looks like the following: See link here:
Organized Text File
Each row is indexed in a pd dataframe. The code used to reorganize the file and attempt to split into multiple columns follows:
quantData = pd.read_csv(epaTemp, header = None)
trimmed_File = quantData.iloc[16:,]
trimmed_File = trimmed_File.drop([17,18,70,71,72], axis = 0)
print (trimmed_File)
###
splitFile = trimmed_File.apply(lambda x: pd.Series(str(x).split(',')))
print (splitFile)
The split function above did not get applied to all rows in the txt file; it only applied split(',') to the first row rather than all of them:
0 16 Compound R... 1
dtype: object
I would like this split functionality to apply to all rows in my txt file so I can further organize my data. Thank you for the help.
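A likely cause: apply over a one-column frame passes the whole column in as a Series, and str(x) stringifies its repr, which is why only one "row" appeared split. The vectorized Series.str.split(',', expand=True) splits every row instead; a sketch on made-up data shaped like the organized txt file:

```python
import pandas as pd

# Made-up single-column data standing in for the organized txt file
quantData = pd.DataFrame({0: ['Compound,RT,Area',
                              'Benzene,1.23,4567',
                              'Toluene,2.34,8910']})

# Vectorized split: one new column per comma-separated field
splitFile = quantData[0].str.split(',', expand=True)
print(splitFile)
```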

Key error: '3' When extracting data from Pandas DataFrame

My code plan is as follows:
1) find csv files in a folder using glob and create a list of files
2) convert each csv file into a dataframe
3) extract data from a column location and convert it into a separate dataframe
4) append the new data into a separate summary csv file
code is as follows:
Result = []

def result(filepath):
    files = glob.glob(filepath)
    print files
    dataframes = [pd.DataFrame.from_csv(f, index_col=None) for f in files]
    new_dfb = pd.DataFrame()
    for i, df in enumerate(dataframes):
        colname = 'Run {}'.format(i + 1)
        selected_data = df['3'].ix[0:4]
        new_dfb[colname] = selected_data
        Result.append(new_dfb)
    folder = r"C:/Users/Joey/Desktop/tcd/summary.csv"
    new_dfb.to_csv(folder)

result("C:/Users/Joey/Desktop/tcd/*.csv")
print Result
The error is shown below. The issue seems to be with line 36, which corresponds to selected_data = df['3'].ix[0:4].
I show one of my csv files below:
I'm not sure what the problem is with the dataframe constructor?
Your csv snippet is a bit unclear, but as suggested in the comments, read_csv (from_csv in this case) automatically takes the first row as the list of headers. The behaviour you appear to want is for the columns to be labelled 0, 1, 2, etc. To achieve this you need to have
[pd.DataFrame.from_csv(f, index_col=None, header=None) for f in files]
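One more detail worth flagging: with header=None the column labels become the integers 0, 1, 2, ..., so the lookup then needs df[3] (an int) rather than df['3'] (and .ix has since been removed; .iloc covers the positional slice). A sketch with modern read_csv, since from_csv was removed in later pandas:

```python
import io

import pandas as pd

raw = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

# header=None keeps the first row as data and labels columns 0..3
df = pd.read_csv(io.StringIO(raw), header=None)

# Columns are integers now, so use df[3], not df['3']
selected_data = df[3].iloc[0:5]
print(selected_data.tolist())
```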