How to shift the column headers in pandas - pandas

I have .txt files I'm reading in with pandas and the header line starts with '~A'. I need to ignore the '~A' and have the next header correspond to the data in the first column. Thanks!

You can do this:
import pandas as pd
data = pd.read_csv("./test.txt", names=[ 'A', 'B' ], skiprows=1)
print(data)
and the output for input:
~A, A, B
1, 2
3, 4
is:
c:\Temp\python>python test.py
A B
0 1 2
1 3 4
You have to name the columns yourself, but given that your file seems to be malformed, I guess that is not too bad.
If the header lines are not the same in all files, you can read the header line in plain Python first and pass the names on to pandas:
import pandas as pd

# read the first line (the header)
with open("./test.txt") as myfile:
    headRow = next(myfile)
# split the header into column names
columns = [x.strip() for x in headRow.split(',')]
# drop the leading '~A' entry and let pandas read the data rows
data = pd.read_csv("./test.txt", names=columns[1:], skiprows=1)
print(data)
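The same idea can be tried without touching the disk. This is only a minimal sketch, assuming the marker is always the first comma-separated token of the header; the helper name `read_shifted` and the sample text are hypothetical and not part of the original answer.

```python
import io

import pandas as pd


def read_shifted(buffer, marker="~A"):
    """Read CSV text whose header has one extra leading token (hypothetical helper)."""
    # Take the header line and split it into stripped tokens
    header = [token.strip() for token in buffer.readline().split(",")]
    # Drop the leading marker so the names line up with the data columns
    if header and header[0] == marker:
        header = header[1:]
    # The header line is already consumed, so what remains is data only
    return pd.read_csv(buffer, names=header, skipinitialspace=True)


# Sample input taken from the answer above
sample = "~A, A, B\n1, 2\n3, 4\n"
data = read_shifted(io.StringIO(sample))
print(data)
```

Because the header is consumed before pandas sees the buffer, no `skiprows` argument is needed in this variant.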

Related

assigning csv file to a variable name

I have a .csv file, and I use pandas to read it.
import pandas as pd
from pandas import read_csv
data=read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data at a later time, for example by doing something like
new_data= df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read your data in as one column (by using a 'wrong' separator) and rename the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you wish, you could then add other files by doing the same thing (with a different name for df at first) and concatenating the result with df.
A more useful approach IMHO would be to add the dataframe to a dictionary (a list would also be possible).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv']
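The loop hinted at in the comment above could look roughly like this. It is a sketch under stated assumptions: the temporary folder, the file names `first.csv` and `second.csv`, and their contents are made up so that the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create two small CSV files in a temporary folder (made-up sample data)
folder = Path(tempfile.mkdtemp())
(folder / "first.csv").write_text("a,b\n1,2\n3,4\n")
(folder / "second.csv").write_text("a,b\n5,6\n")

# Map each file name to the DataFrame read from that file
data_dict = {path.name: pd.read_csv(path) for path in sorted(folder.glob("*.csv"))}

# Look up one file's data by its name
print(data_dict["first.csv"])
```

Keying the dictionary by file name keeps every file as a proper DataFrame, so the individual columns stay accessible.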

Pandas read csv using column names included in a list

I'm quite new to Pandas.
I'm trying to create a dataframe by reading thousands of csv files.
The files are not all structured in the same way, and I only want to extract the columns I'm interested in. I created a list which includes all the column names I want, but I get an error because not all of them are present in each dataset.
import pandas as pd
import numpy as np
import os
import glob
# select the csv folder
csv_folder = r'myPath'
# select all csv files within the folder
all_files = glob.glob(csv_folder + "/*.csv")
# set the column names to include in the dataframe
columns_to_use = ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6']
# read the csv files one by one
for filename in all_files:
    df = pd.read_csv(filename,
                     header=0,
                     usecols=columns_to_use)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-0d9670495660> in <module>
1 for filename in all_files:
----> 2 df = pd.read_csv(filename,
3 header=0,
4 usecols = columns_to_use)
5
ValueError: Usecols do not match columns, columns expected but not found: ['Name1', 'Name2', 'Name4']
How can I handle this so that a column is only included if it is present in the file?
Use a callable for usecols, i.e. df = pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use). From the docs of the usecols parameter:
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True.
Working example that will only read col1 and not throw an error on missing col3:
import pandas as pd
import io
s = """col1,col2
1,2"""
df = pd.read_csv(io.StringIO(s), usecols=lambda c: c in ['col1', 'col3'])
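To build a single dataframe from many differently structured files, the callable can be combined with pd.concat. The following is a rough sketch, not the answerer's code: the in-memory strings in `sources` stand in for files on disk, and the shortened `columns_to_use` list is chosen only for illustration.

```python
import io

import pandas as pd

columns_to_use = ["Name1", "Name2", "Name3"]

# Stand-ins for files on disk; each holds a different subset of the wanted columns
sources = [
    "Name1,Other\n1,x\n",
    "Name2,Name3\n2,3\n",
]

# Read only the wanted columns that actually exist in each source
frames = [
    pd.read_csv(io.StringIO(text), usecols=lambda c: c in columns_to_use)
    for text in sources
]

# Columns missing from a source are filled with NaN after concatenation
combined = pd.concat(frames, ignore_index=True)
# Put the columns into a predictable order
combined = combined.reindex(columns=columns_to_use)
print(combined)
```

Collecting the frames in a list and concatenating once at the end avoids growing a dataframe inside the loop.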

how to put first value in one column and remaining into other column?

ROCO2_CLEF_00001.jpg,C3277934,C0002978
ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569
I want to make a pandas data frame from a csv file.
I want to put the first entry of each row (the filename) into a column named "filenames" and the remaining entries into another column named "class". How can I do this?
If your file doesn't have a fixed number of commas per row, you could do the following:
import pandas as pd
csv_path = 'test_csv.csv'
# read all rows and close the file afterwards
with open(csv_path) as f:
    raw_data = f.readlines()
# clean rows
raw_data = [x.strip().replace("'", "") for x in raw_data]
print(raw_data)
# make split between data
raw_data = [ [x.split(",")[0], ','.join(x.split(",")[1:])] for x in raw_data]
print(raw_data)
# build the pandas Dataframe
column_names = ["filenames", "class"]
temp_df = pd.DataFrame(data=raw_data, columns=column_names)
print(temp_df)
filenames class
0 ROCO2_CLEF_00001.jpg C3277934,C0002978
1 ROCO2_CLEF_00002.jpg C3265939,C0002942,C2357569
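A slightly shorter variant splits each row on the first comma only, using the maxsplit argument of str.split. This is a sketch that assumes every row contains at least one comma; `io.StringIO` stands in for the real file, and the two rows are the ones from the question.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file
raw = io.StringIO(
    "ROCO2_CLEF_00001.jpg,C3277934,C0002978\n"
    "ROCO2_CLEF_00002.jpg,C3265939,C0002942,C2357569\n"
)

# Split on the first comma only and skip blank lines
rows = [line.strip().split(",", 1) for line in raw if line.strip()]

df = pd.DataFrame(rows, columns=["filenames", "class"])
print(df)
```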

Pandas Function to Add Underscore to all Column Headers in a DataFrame

I am looking to write a pandas function that adds underscores to the beginning of all column headers of a given data frame.
Use DataFrame.add_prefix.
It works even if the original column labels aren't strings.
import pandas as pd
df = pd.DataFrame([[1,1,1]], columns=['a', 0, 'foo'])
# a 0 foo
# 1 1 1
df.add_prefix('_')
# _a _0 _foo
#0 1 1 1
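Note that add_prefix returns a new dataframe rather than changing the existing one, so the result has to be assigned to a name. A small sketch reusing the dataframe from above (the name `prefixed` is just illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 1]], columns=["a", 0, "foo"])

# add_prefix returns a new DataFrame; the original keeps its labels
prefixed = df.add_prefix("_")

print(list(df.columns))
print(list(prefixed.columns))
```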

Reading CSV and import column data as an numpy array

I have many csv files that each contain two columns, 'Energy' and 'Counts'. My goal is to import the data and keep the columns as separate numpy arrays: X should hold all the Energy values and Y all the Counts values. The problem is that my csv files have a blank row after each data row, which seems to cause a lot of trouble. How can I eliminate those lines and save the data as arrays?
Energy Counts
-0.4767 0
-0.4717 0
-0.4667 0
-0.4617 0
-0.4567 0
-0.4517 0
import pandas as pd
import glob
import numpy as np
import os
import matplotlib.pyplot as plt
file_path = "path" ###file path
read_files = glob.glob(os.path.join(file_path,"*.csv")) ###get all files
X = [] ##create empty list
Y = [] ##create empty list
for files in read_files:
    df = pd.read_csv(files,header=[0])
    X.append(['Energy'])##store X data
    Y.append(['Counts'])##store y data
X=np.array(X)
Y=np.array(Y)
print(X.shape)
print(Y.shape)
plt.plot(X[50],Y[50])
plt.show()
Ideally, if all the data were saved correctly, I should get my plot, but since the data is not being saved correctly, I am not getting any plot.
Set the skip_blank_lines parameter to True (which is also the pandas default) and these lines won't be read into the dataframe:
df = pd.read_csv(files, header=[0], skip_blank_lines=True)
So your whole program should be something like this (each file has the same column headers in the first line and the columns are separated by spaces):
...
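# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them
frames = []
for file in read_files:
    frames.append(pd.read_csv(file, sep=r'\s+', skip_blank_lines=True))
df = pd.concat(frames)
df.plot(x='Energy', y='Counts')
# show the figure through matplotlib (imported as plt in the question)
plt.show()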
# save both columns in one file
df.to_csv('myXYFile.csv', index=False)
# or two files with one column each
df.Energy.to_csv('myXFile.csv', index=False)
df.Counts.to_csv('myYFile.csv', index=False)
TEST PROGRAM
import pandas as pd
import io
input1="""Energy Counts
-0.4767 0
-0.4717 0
-0.4667 0
-0.4617 0
-0.4567 0
-0.4517 0
"""
input2="""Energy Counts
-0.4767 0
-0.4717 0
"""
frames = []
for text in (input1, input2):
    frames.append(pd.read_csv(io.StringIO(text), sep=r'\s+', skip_blank_lines=True))
df = pd.concat(frames)
print(df)
TEST OUTPUT:
Energy Counts
0 -0.4767 0
1 -0.4717 0
2 -0.4667 0
3 -0.4617 0
4 -0.4567 0
5 -0.4517 0
0 -0.4767 0
1 -0.4717 0
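Since the question asks for numpy arrays, the two columns can be pulled out of the dataframe with to_numpy. This is a minimal sketch, assuming whitespace-separated columns as in the answer; the sample text with blank rows between the data rows is made up to mirror the description in the question.

```python
import io

import pandas as pd

# Made-up sample with a blank row after each data row, as described in the question
text = "Energy Counts\n\n-0.4767 0\n\n-0.4717 0\n\n-0.4667 0\n"

# Blank lines are skipped while parsing
df = pd.read_csv(io.StringIO(text), sep=r"\s+", skip_blank_lines=True)

# Each column becomes a one-dimensional numpy array
X = df["Energy"].to_numpy()
Y = df["Counts"].to_numpy()
print(X.shape, Y.shape)
```

For real files, the same two to_numpy calls can be applied to the concatenated dataframe built in the answer above.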