Add (insert) new columns to an existing DataFrame with a different shape in pandas

I have 5 files, each loaded as a pandas DataFrame with a different shape:
1st file contains 3968 rows x 7 columns (Date, Open, High, Low, Close, Adj Close, Volume)
2nd file contains 3774 rows x 7 columns (Date1, Open1, High1, Low1, Close1, Adj Close1, Volume1)
3rd file contains 58 rows x 3 columns (No, Date, Rate)
4th file contains 192 rows x 3 columns (No1, Date1, Rates1)
5th file contains 1850 rows x 3 columns (No2, Date2, Rate2)
My desired output is:
3968 rows x 16 columns
(Date, Open, High, Low, Close, Adj Close, Volume, Open1, High1, Low1, Close1, Adj Close1, Volume1, Rate, Rates1, Rate2)
How do I append/insert the new columns into the 1st file from the 2nd-5th files with different shapes? Is there any technique to match the different shapes?
Here is my code:
import os
import pandas as pd

df = pd.read_csv('^JKLQ45.csv')  # 1st file
files = [file for file in os.listdir('./Raw Data')]  # all the input files
all_data = pd.DataFrame()
for file in files:
    current_data = pd.read_csv('./Raw Data' + '/' + file)
    all_data = pd.concat([all_data, current_data])
all_data.to_csv("all_data_copy.csv", index=False)
The output is 9842 rows × 14 columns, but I want the shape to be 3968 rows x 16 columns.

Can you add this code inside the loop?
pd.concat([df1.reset_index(drop=True),df2.reset_index(drop=True)],axis=1)
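To see what that line does with frames of different lengths — and, as an alternative way to match the shapes, a left merge on the shared date column — here is a toy sketch (made-up values and column names, not the real files):

```python
import pandas as pd

# Positional concat after reset_index pairs rows up by order and pads the
# shorter frame with NaN. If the files share a date column, a left merge
# aligns rows by date instead, keeping all rows of the first file.
df1 = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
                    'Close': [10, 11, 12]})
df2 = pd.DataFrame({'Date': ['2020-02-02', '2020-01-02', '2020-01-03'],
                    'Rate': [0.4, 0.5, 0.6]})

# Positional alignment: rows pair up by position, not by date
positional = pd.concat([df1.reset_index(drop=True),
                        df2.reset_index(drop=True)], axis=1)

# Date alignment: keeps all 3 rows of df1, NaN where df2 has no match
by_date = df1.merge(df2, on='Date', how='left')
```

With the real files, merging each of the 2nd-5th frames onto the 1st by date would keep the 3968 rows and add the extra columns.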

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. I have read the tables using the skiprows and nrows parameters, which works as of now, but it won't in the future because the table lengths are dynamic.
Is there any other workaround, apart from skiprows & nrows, to read the tables shown in the picture?
I want to read data1 as one table and data2 as another, and in particular I want columns "RR", "FF" and "WW" as two different data frames.
I'd appreciate it if someone can help or guide me on this.
Method I have tried:
all_files = glob.glob(INPATH + "*sample*")
df1 = pd.read_excel(all_files[0], skiprows=11, nrows=3)
df2 = pd.read_excel(all_files[0], skiprows=23, nrows=3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one in your image, here is one way to do it:
import pandas as pd

df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)

# Setup
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]

dfs = []
for i in range(len(indices)):
    # Slice df from the current index up to the next one
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        data = df.loc[indices[i] :, :]
    # Within one slice, keep only values from the row starting with 'rr'
    r_idx = data.loc[df["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
rr ff ww
1 car1 1000000 sellout
2 car2 1500000 to be sold
3 car3 1300000 sellout
rr ff ww
1 car1 1000000 sellout
2 car2 1500000 to be sold
3 car3 1300000 sellout

Looping through csv files and creating a DataFrame that summarize info by locating text in columns

The script that I have so far (see below) does the following:
1: loops through a folder and converts .xlsx files to .csv.
2: loops through the csv list and populates a dataframe with data extracted from each one of them.
3: appends a new column that is populated with the filename.
import os
import pandas as pd
import numpy as np

cwd = os.path.abspath('')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.csv'):
        df1 = pd.read_csv(file, header=None, encoding='latin1')
        df1 = df1.assign(Filename=os.path.basename(file))
        df = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
What I want to do now, while still in the 'file' loop, is:
The 1st column (to be named Pozo): where dataframe column 1 contains 'Pozo:', extract the value from the same row but from column 3.
For the second column, search for the text 'MD' inside the first column (column 0) of the dataframe, and populate it with all the values below the row where the text is found in that same column.
The image shows what I want to extract (in red the searched text, in yellow the values to extract).
What I want is to take the dataframe shown in the image and 'clean it' like so:
        Pozo                MD    Filename
First   NWI-RC-176 calidad  0.00  NWIRC-176-22_SPT_Offline_OutRun_QC.csv
Second  NWI-RC-176 calidad  5.00  NWIRC-176-22_SPT_Offline_OutRun_QC.csv
[...]
Thanks for helping me !!
See if this helps.
Choose the value from column 3, where column 1 has 'Pozo:', and assign it to a variable:
var_pozo = df.at[df[df['1'] == 'Pozo:'].index[0],'3']
var_pozo
To choose all the values below where 'MD' is found, we first find that row,
then take all rows below that index and assign them to another DF (md_df),
then find the first null value and exclude it from md_df:
# all rows from DF where MD is found are selected and assigned to md_df
md_df = df[df[df['0'] == 'MD'].index[0]:].reset_index()
# find the first null value and select all rows above the null
md_df = md_df[:md_df[md_df['0'].isnull()].index[0]]['0'].reset_index()
md_df
We add the required columns to md_df:
md_df['Pozo'] = var_pozo
md_df['filename'] = 'your filename' # replace this with your file variable
md_df.drop(columns='index', inplace=True)
md_df.rename(columns={'0': 'MD'}, inplace=True)
md_df
md_df becomes your extracted df.
Here is the output with the test csv I created:
MD Pozo filename
0 (m) NWI-RC-176 calidad your filename
1 0 NWI-RC-176 calidad your filename
2 5 NWI-RC-176 calidad your filename
3 10 NWI-RC-176 calidad your filename
4 15 NWI-RC-176 calidad your filename
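For reference, here is a compact toy version of the same locate-and-slice idea (made-up values, and integer column labels, which is what read_csv with header=None actually produces — the quoted '0'/'1' labels above assume string column names):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the raw csv (header=None -> integer columns)
df = pd.DataFrame({
    0: ['junk', 'Pozo:', 'MD', '0.00', '5.00', np.nan],
    2: [np.nan, 'NWI-RC-176 calidad', np.nan, np.nan, np.nan, np.nan],
})

# Value in column 2 on the row where column 0 holds 'Pozo:'
var_pozo = df.at[df[df[0] == 'Pozo:'].index[0], 2]

# All values in column 0 below the 'MD' row, stopping at the first null
start = df[df[0] == 'MD'].index[0] + 1
below = df.loc[start:, 0]
first_null = below.index[below.isna()][0]        # label of the first null row
md_values = below.loc[:first_null - 1].tolist()  # ['0.00', '5.00']
```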

Adding file name to column name pandas dataframe

I have a pandas dataframe created from several csv files. The csv files are all structured the same way, so I have the same column names over and over again. I want the column names to be prefixed with the names of the files they come from (which I have in a list).
I know how to add a count to same-name columns and I know how to rename columns, but I fail at bringing the right file name to the right column values.
That should be the relevant part of the code:
for i in range(0, len(file_list)):
    data = pd.read_table(file_list[i], encoding='unicode_escape')
    df = pd.DataFrame(data)
    df = df.drop(droplist, axis=1)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
You can use a dictionary in concat to generate a MultiIndex:
list_of_files = ['f1.csv', 'f2.csv']
pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
           for f in list_of_files}, axis=1)
example:
# f1.csv
a,b
1,2
3,4
# f2.csv
a,b
5,6
7,8
output:
f1.csv f2.csv
a b a b
0 1 2 5 6
1 3 4 7 8
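One nicety of the MultiIndex result: each file's columns can be pulled back out by file name. A self-contained illustration using the same two example files as in-memory CSV text:

```python
import io
import pandas as pd

# The example files f1.csv and f2.csv, as in-memory CSV text
f1 = io.StringIO("a,b\n1,2\n3,4")
f2 = io.StringIO("a,b\n5,6\n7,8")

combined = pd.concat({'f1.csv': pd.read_csv(f1),
                      'f2.csv': pd.read_csv(f2)}, axis=1)

# The first column level is the file name, so one file's block
# comes back out by indexing with it
sub = combined['f1.csv']
```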
Alternative using add_prefix in the list comprehension:
pd.concat([pd.read_table(f, encoding='unicode_escape', sep=',')
             .add_prefix(f[:-3])  # strip the "csv" extension (the dot stays as separator)
           for f in list_of_files], axis=1)
output:
f1.a f1.b f2.a f2.b
0 1 2 5 6
1 3 4 7 8

New pandas data frame filled from a continuous scrape, column names known

I scraped the data like this:
for row in stat_table.find_all("tr"):
    for cell in row.find_all('td'):
        print(cell.text)
The output looks like this:
1
2019-10-24
31-206
MIL
#
HOU
W (+6)
0
16:35
1
3
.333
0
2
etc.
I created a columns variable:
columns = ['G','Date', 'Age','Team',"at","Opp",'Score','Starter','MP','FG','FGA','FG%','3P','3PA',"3P%",
'FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS',"GmSC","+/-"]
I would like to read in the output and create a new pandas data frame with those columns. Any idea how I can read that in?
The way I would do it is to split your text so that it becomes a list inside your for loop, and append it to a list of lists (body):
header = [**your column names**]
body = []  # list of lists
for row in stat_table.find_all("tr"):
    for cell in row.find_all('td'):
        body.append(cell.text.split(' '))  # splitting on space
Then, make sure header and the lists within body are of equal length and:
df = pd.DataFrame(data=body, columns=header)
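Since each td usually holds a single value, another option is to collect the flat stream of cell texts and chunk it into rows of len(columns) values before building the DataFrame. A sketch with toy column names and values (not the real 29-column stat table):

```python
import pandas as pd

# Toy column names and a flat list of scraped cell texts (two rows' worth)
columns = ['G', 'Date', 'PTS']
cells = ['1', '2019-10-24', '12', '2', '2019-10-26', '20']

# Chunk the flat list into rows of len(columns) values each
rows = [cells[i:i + len(columns)] for i in range(0, len(cells), len(columns))]
df = pd.DataFrame(rows, columns=columns)
```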

Read from the specific lines of a csv in pandas [duplicate]

I'm having trouble figuring out how to skip n rows in a csv file but keep the header, which is the first row.
What I want to do is iterate, but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way of doing this?
data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)
You can pass a list of row numbers to skiprows instead of an integer.
By giving the function the integer 10, you're just skipping the first 10 lines.
To keep the first row 0 (as the header) and then skip everything else up to row 10, you can write:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
Other ways to skip rows using read_csv
The two main ways to control which rows read_csv uses are the header or skiprows parameters.
Suppose we have the following CSV file with one column:
a
b
c
d
e
f
In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")).
Read all lines as values (no header, defaults to integers)
>>> pd.read_csv(f, header=None)
0
0 a
1 b
2 c
3 d
4 e
5 f
Use a particular row as the header (skip all lines before that):
>>> pd.read_csv(f, header=3)
d
0 e
1 f
Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):
>>> pd.read_csv(f, header=[2, 4])
c
e
0 f
Skip N rows from the start of the file (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=3)
d
0 e
1 f
Skip one or more rows by giving the row indices (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=[2, 4])
a
0 b
1 d
2 f
Great answers already. Consider this generalized scenario:
Say your xls/csv has junk in the top 2 rows (rows #0 and #1). Row #2 (the 3rd row) is the real header, and you want to load 10 rows starting from row #50 (i.e. the 51st row).
Here's the snippet:
pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
To expand on @AlexRiley's answer, the skiprows argument takes a list of numbers that determines which rows to skip. So:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
is the same as:
pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])
The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range that returns a list of integers) and pass it to skiprows.
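As a runnable illustration (toy CSV text, not the asker's file) — note that skiprows also accepts a callable evaluated per row index, which saves building the ignore list by hand when the rule is simple:

```python
import io
import pandas as pd

text = "col\na\nb\nc\nd\ne\nf"  # header plus six data rows

# Explicit ignore list, as described above (skip data rows 1, 3, 5)
df_list = pd.read_csv(io.StringIO(text), skiprows=[1, 3, 5])

# Equivalent callable: keep the header (row 0) and skip odd-numbered rows
df_func = pd.read_csv(io.StringIO(text),
                      skiprows=lambda i: i != 0 and i % 2 == 1)
```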
If you're iterating through a long csv file, you can use the chunksize argument. If for some reason you need to manually step through it, you can try the following as long as you know how many iterations you need to go through:
for i in range(num_iters):
    pd.read_csv('test.csv', sep='|', header=0,
                skiprows=range(i*10 + 1, (i+1)*10), nrows=10)
If you need to skip/drop specific rows, say the first 3 rows (i.e. 0, 1, 2) and then 2 more rows (i.e. 4, 5), you can use the following and still retain the header row:
df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)