I'm quite new to Pandas.
I'm trying to create a DataFrame by reading thousands of CSV files. The files are not all structured the same way, but I only want to extract the columns I'm interested in, so I created a list of all the column names I want. However, I then get an error because not all of those columns are present in every dataset.
import pandas as pd
import numpy as np
import os
import glob
# folder containing the csv files
csv_folder = r'myPath'
# select all csv files within the folder
all_files = glob.glob(csv_folder + "/*.csv")
# set the column names to include in the dataframe
columns_to_use = ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6']
# read the csv files one by one
for filename in all_files:
    df = pd.read_csv(filename,
                     header=0,
                     usecols=columns_to_use)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-0d9670495660> in <module>
      1 for filename in all_files:
----> 2     df = pd.read_csv(filename,
      3                      header=0,
      4                      usecols=columns_to_use)
      5
ValueError: Usecols do not match columns, columns expected but not found: ['Name1', 'Name2', 'Name4']
How can I handle this, so that a column is included only if it is actually present in the file?
Use a callable for usecols, i.e. df = pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use). From the docs of the usecols parameter:
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True.
Working example that will only read col1 and not throw an error on missing col3:
import pandas as pd
import io
s = """col1,col2
1,2"""
df = pd.read_csv(io.StringIO(s), usecols=lambda c: c in ['col1', 'col3'])
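Applied back to the original loop, the callable lets files with different column sets be read and concatenated without error. A small self-contained sketch using in-memory "files" (with real files you would iterate over glob.glob as in the question):

```python
import io
import pandas as pd

columns_to_use = ['Name1', 'Name2', 'Name3']

# Two in-memory "files" with different column sets
file_a = io.StringIO("Name1,Name2,Extra\n1,2,9\n")
file_b = io.StringIO("Name2,Name3\n5,6\n")

frames = [
    pd.read_csv(f, header=0, usecols=lambda c: c in columns_to_use)
    for f in (file_a, file_b)
]
# Concatenate; columns missing from a file are filled with NaN
df = pd.concat(frames, ignore_index=True, sort=False)
print(df)
```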
Related
I have a .csv file, and I use pandas to read it.
import pandas as pd
from pandas import read_csv
data=read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data at a later time, for example by doing something like
new_data = df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read your data in as one column (by using a 'wrong' separator) and rename the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you wish, you could then add other files by doing the same thing (first with a different name for df) and concatenating them with df.
A more useful approach, IMHO, would be to add each dataframe to a dictionary (a list would also work).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv'].
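A minimal sketch of that loop, using in-memory text in place of real files (with glob.glob you would iterate over actual filenames instead; the names here are illustrative):

```python
import io
import pandas as pd

# Simulated files; in practice you'd loop over glob.glob('path/*.csv')
files = {
    'input1.csv': "a,b\n1,2\n",
    'input2.csv': "a,b\n3,4\n",
}

data_dict = {}
for name, text in files.items():
    # keyed by filename, so each frame can be looked up later
    data_dict[name] = pd.read_csv(io.StringIO(text))

# Access a frame later by its filename
print(data_dict['input2.csv'])
```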
Need help with merging multiple csv files.
import pandas as pd
import glob
import csv
r1 = glob.glob("path/*.csv")
wr1 = csv.writer(open("path/merge.csv", 'wb'), delimiter=',')
for files in r1:
    rd = csv.reader(open(files, 'r'), delimiter=',')
    for row in rd:
        print(row)
        wr1.writerow(row)
I am getting a type error (TypeError: a bytes-like object is required, not 'str') and am not sure how to resolve it.
Using pandas you can do it like this:
files = glob.glob('path/*.csv')
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
result.to_csv('path/merge.csv', index=False)
(Note: to_csv has no ignore_index parameter; use index=False to drop the row index.)
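Alternatively, if you want to stay with the csv module: the TypeError comes from opening the output file in binary mode ('wb') while Python 3's csv.writer writes str. Opening in text mode with newline='' fixes it. A runnable sketch (a temporary directory stands in for the question's "path" folder):

```python
import csv
import glob
import os
import tempfile

# Stand-in for the question's "path" folder, with two small csv files
tmpdir = tempfile.mkdtemp()
for name, rows in [('a.csv', [['1', '2']]), ('b.csv', [['3', '4']])]:
    with open(os.path.join(tmpdir, name), 'w', newline='') as f:
        csv.writer(f).writerows(rows)

# In Python 3, open the output in text mode ('w', newline=''), not 'wb'
merged = os.path.join(tmpdir, 'merge.csv')
with open(merged, 'w', newline='') as out:
    wr1 = csv.writer(out, delimiter=',')
    for path in sorted(glob.glob(os.path.join(tmpdir, '*.csv'))):
        if path == merged:  # don't copy the output file into itself
            continue
        with open(path, 'r', newline='') as src:
            for row in csv.reader(src, delimiter=','):
                wr1.writerow(row)
```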
I have a file that is not cleanly formatted or searchable, so I downloaded it in CSV format. It contains 4 columns and 116424 rows.
I'm not able to plot three of its columns, namely Year, Age and Ratio, as a heat map.
The link for the csv file is: https://gist.github.com/JustGlowing/1f3d7ff0bba7f79651b00f754dc85bf1
import numpy as np
import pandas as pd
from pandas import DataFrame
from numpy.random import randn
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('new_file.csv')
print(df.info())
print(df.shape)
couple_columns = df[['Year','Age','Ratio']]
print(couple_columns.head())
Error
The output of df.info() and df.shape:
RangeIndex: 116424 entries, 0 to 116423
Data columns (total 4 columns):
AREA     116424 non-null object
YEAR     116424 non-null int64
AGE      116424 non-null object
RATIO    116424 non-null object
dtypes: int64(1), object(3)
memory usage: 2.2+ MB
None
(116424, 4)
The traceback:
C:\Users\Pranav\AppData\Local\Programs\Python\Python36-32\python.exe C:/Users/Pranav/PycharmProjects/takenmind/Data_Visualization/a1.py
Traceback (most recent call last):
  File "C:/Users/Pranav/PycharmProjects/takenmind/Data_Visualization/a1.py", line 12, in <module>
    couple_columns = df[['Year','Age','Ratio']]
  File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['Year' 'Age' 'Ratio'] not in index"
It seems that your columns are uppercase from the info output: YEAR 116424 non-null int64. You should be able to get e.g. the year column with df[['YEAR']].
If you would rather use lowercase, you can use
df = pd.read_csv('new_file.csv').rename(columns=str.lower)
The csv has some text in the top 8 lines before your actual data begins. You can skip those by using the skiprows argument
df = pd.read_csv('f2m_ratios.csv', skiprows=8)
Let's say you want to plot the heatmap only for one Area:
df = df[df['Area'] == 'Afghanistan']
Before you plot a heatmap, you need the data in a certain format (a pivot table):
df = df.pivot(index='Year', columns='Age', values='Ratio')
Now your dataframe is ready for a heatmap
sns.heatmap(df)
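Putting those steps together on a tiny stand-in frame (the values are made up for illustration; the seaborn call is left as a comment so the sketch runs without a display):

```python
import pandas as pd

# Tiny stand-in for the ratios file (illustrative values only)
df = pd.DataFrame({
    'Area': ['Afghanistan'] * 4,
    'Year': [2000, 2000, 2001, 2001],
    'Age': [20, 30, 20, 30],
    'Ratio': [1.1, 1.2, 1.3, 1.4],
})

# Restrict to one Area, then pivot: rows = Year, columns = Age, cells = Ratio
df = df[df['Area'] == 'Afghanistan']
pivoted = df.pivot(index='Year', columns='Age', values='Ratio')
print(pivoted)
# import seaborn as sns; sns.heatmap(pivoted)  # draw the heatmap
```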
I want to append a pandas DataFrame object to an existing h5py file, whether as a subgroup or dataset, with all the index and header information. Is that possible? I tried the following:
import pandas as pd
import h5py
f = h5py.File('f.h5', 'r+')
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['A', 'B', 'C'], index=['X', 'Y'])
f['df'] = df
From another script, I would like to access f.h5, but the output of f['df'][()] is array([[1, 2, 3],[4, 5, 6]]), which doesn't contain the header information.
You can write to an existing hdf5 file directly with Pandas via pd.DataFrame.to_hdf() and read it back in with pd.read_hdf(). You just have to make sure to read and write with the same key.
To write to the h5 file:
existing_hdf5 = "f.h5"
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  columns=['A', 'B', 'C'], index=['X', 'Y'])
df.to_hdf(existing_hdf5, key='df')
Then you can read by:
df2 = pd.read_hdf(existing_hdf5, key='df')
print(df2)
A B C
X 1 2 3
Y 4 5 6
Note that you can also make the dataframe appendable by using format="table", which requires the optional dependency PyTables.
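A sketch of that append behaviour (this assumes PyTables is installed; the temporary path is illustrative):

```python
import os
import tempfile
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'f.h5')

# Write the first chunk in table format so the key is appendable
df = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'C'], index=['X'])
df.to_hdf(path, key='df', format='table')

# Append more rows under the same key
extra = pd.DataFrame([[4, 5, 6]], columns=['A', 'B', 'C'], index=['Y'])
extra.to_hdf(path, key='df', format='table', append=True)

combined = pd.read_hdf(path, key='df')
print(combined)
```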
I have .txt files I'm reading in with pandas and the header line starts with '~A'. I need to ignore the '~A' and have the next header correspond to the data in the first column. Thanks!
You can do this:
import pandas as pd
data = pd.read_csv("./test.txt", names=['A', 'B'], skiprows=1)
print(data)
and the output for input:
~A, A, B
1, 2
3, 4
is:
c:\Temp\python>python test.py
A B
0 1 2
1 3 4
You have to name the columns yourself but given that your file seems to be malformed I guess it is not that bad.
If your header lines are not the same in all files, then you can just read them in Python:
import pandas as pd
# read the first line
with open("./test.txt") as myfile:
    headRow = next(myfile)
# read the column names, dropping the leading '~A'
columns = [x.strip() for x in headRow.split(',')]
# process with pandas
data = pd.read_csv("./test.txt", names=columns[1:], skiprows=1)
print(data)