Merge multiple csv files in python - pandas

Need help with merging multiple csv files:
import pandas as pd
import glob
import csv

r1 = glob.glob("path/*.csv")
wr1 = csv.writer(open("path/merge.csv", 'wb'), delimiter=',')
for files in r1:
    rd = csv.reader(open(files, 'r'), delimiter=',')
    for row in rd:
        print(row)
        wr1.writerow(row)
I am getting a type error:
TypeError: a bytes-like object is required, not 'str'
Not sure how to resolve this.

Using pandas you can do it like this:
import glob
import pandas as pd

files = glob.glob('path/*.csv')
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
result.to_csv('path/merge.csv', index=False)  # to_csv has no ignore_index; use index=False
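As for the TypeError itself: in Python 3, csv.writer writes str, so the output file must be opened in text mode rather than 'wb'. A minimal sketch of the original csv-module approach with that fix (newline='' avoids blank lines on Windows):
import csv
import glob

# open in text mode with newline='' so csv.writer can write str rows
with open("path/merge.csv", 'w', newline='') as out:
    wr1 = csv.writer(out, delimiter=',')
    for f in glob.glob("path/*.csv"):
        with open(f, 'r', newline='') as src:
            for row in csv.reader(src, delimiter=','):
                wr1.writerow(row)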

Related

Selecting multiple excel files in different folders and comparing them with pandas

I am trying to compare multiple Excel files with pandas and glob, but I am getting the following output from my code:
Empty DataFrame
Columns: []
Index: []
This is my code:
import glob
import pandas as pd

all_data = pd.DataFrame()
path = 'S:\data\**\C_76_00_a?.xlsx'
for f in glob.glob(path):
    df = pd.read_excel(f, sheet_name=None, ignore_index=True, skiprows=10, usecols=4)
    cdf = pd.concat(df.values(), axis=1)
    all_data = all_data.append(cdf, ignore_index=True)
print(all_data)
I am using Jupyter Notebook. The folder structure is \2021\12\2021-30-12\ and the file name contains "C_76_00_a" + something unknown + ".xlsx".
Below is a screenshot of the type of document I am trying to compare.
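Since no answer is shown here, a guess at the likely cause: glob.glob only expands ** when recursive=True, and the unescaped backslashes in 'S:\data\...' are fragile, so a raw string is safer. Also, ignore_index is not a documented read_excel argument and an integer usecols is not supported by recent pandas, so both are dropped in this sketch:
import glob
import pandas as pd

# raw string avoids backslash escapes; recursive=True lets ** match subfolders
path = r'S:\data\**\C_76_00_a*.xlsx'  # * matches any run of characters, ? exactly one

frames = []
for f in glob.glob(path, recursive=True):
    # sheet_name=None returns a dict of DataFrames, one per sheet
    sheets = pd.read_excel(f, sheet_name=None, skiprows=10)
    frames.append(pd.concat(sheets.values(), axis=1))

all_data = pd.concat(frames, ignore_index=True)
print(all_data)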

Pandas read csv using column names included in a list

I'm quite new to Pandas.
I'm trying to create a dataframe reading thousands of csv files.
The files are not all structured in the same way, but I want to extract only the columns I'm interested in, so I created a list which includes all the column names I want; but then I get an error because not all of them are included in each dataset.
import pandas as pd
import numpy as np
import os
import glob

# select the csv folder
csv_folder = r'myPath'
# select all csv files within the folder
all_files = glob.glob(csv_folder + "/*.csv")
# Set the column names to include in the dataframe
columns_to_use = ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6']
# read the csv files one by one
for filename in all_files:
    df = pd.read_csv(filename,
                     header=0,
                     usecols=columns_to_use)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-0d9670495660> in <module>
1 for filename in all_files:
----> 2 df = pd.read_csv(filename,
3 header=0,
4 usecols = columns_to_use)
5
ValueError: Usecols do not match columns, columns expected but not found: ['Name1', 'Name2', 'Name4']
How could I handle this issue, including a column only if it is present in the list?
Use a callable for usecols, i.e. df = pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use). From the docs of the usecols parameter:
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True.
Working example that will only read col1 and not throw an error on missing col3:
import pandas as pd
import io
s = """col1,col2
1,2"""
df = pd.read_csv(io.StringIO(s), usecols=lambda c: c in ['col1', 'col3'])
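To read thousands of files into one dataframe, as the question asks, the same callable can be applied per file and the results concatenated; a sketch reusing the question's csv_folder and columns_to_use:
import glob
import pandas as pd

csv_folder = r'myPath'
columns_to_use = ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6']

frames = [pd.read_csv(f, header=0, usecols=lambda c: c in columns_to_use)
          for f in glob.glob(csv_folder + "/*.csv")]
# files missing some of the columns simply yield NaN there after concat
result = pd.concat(frames, ignore_index=True)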

Parse list and create DataFrame

I have been given a list called data which has the following content
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
I want a pandas dataframe that looks like the linked image:
Expected Dataframe
How can I achieve this?
You can try this:
import pandas as pd

data = [b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
processed_data = [x.split(',') for x in data[0].decode().replace('\r', '').strip().split('\n')]
df = pd.DataFrame(columns=processed_data[0], data=processed_data[1:])
Hope it helps.
I would recommend converting this list to a string, as the list has only one element:
str1 = b''.join(data).decode()  # the element is bytes, so join with b'' and decode
Then use the solution provided here:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

TESTDATA = StringIO(str1)
df = pd.read_csv(TESTDATA, sep=",")
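A shorter alternative (not from the original answers): pandas.read_csv also accepts a binary buffer, so the bytes never need to be decoded by hand:
import io
import pandas as pd

data = [b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
# read_csv consumes the bytes directly; \r\n line endings are handled for us
df = pd.read_csv(io.BytesIO(data[0]))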

How to add/edit text in pandas.io.parsers.TextFileReader

I have a large CSV file. Since the file is large (almost 7 GB), it cannot be read into a pandas dataframe all at once.
import pandas as pd

df1 = pd.read_csv('tblViewPromotionDataVolume_202004070600.csv', sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
df1 is of type pandas.io.parsers.TextFileReader
Now I want to edit/add/insert some text (a new row) into this file, and convert it back to a pandas dataframe. Please let me know of possible solutions. Thanks in advance.
Each chunk here is a DataFrame, so process it directly; then write it to file with DataFrame.to_csv, using mode='a' for append mode:
import pandas as pd
import os

infile = 'tblViewPromotionDataVolume_202004070600.csv'
outfile = 'out.csv'

df1 = pd.read_csv(infile, sep='\t', iterator=True, chunksize=1000)
for chunk in df1:
    print(chunk)
    # processing with chunk
    # https://stackoverflow.com/a/30991707/2901002
    # if file does not exist write header with first chunk
    if not os.path.isfile(outfile):
        chunk.to_csv(outfile, sep='\t')
    else:  # else it exists, so append without writing the header
        chunk.to_csv(outfile, sep='\t', mode='a', header=False)
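To actually insert a new row, as the question asks, one option (a sketch; the column names and file names are placeholders, since the real schema isn't shown) is to extend a chunk with pd.concat before writing it out:
import pandas as pd

# hypothetical columns; replace with the real schema of the file
new_row = pd.DataFrame([{'colA': 'new value', 'colB': 123}])

reader = pd.read_csv('in.csv', sep='\t', iterator=True, chunksize=1000)
for i, chunk in enumerate(reader):
    if i == 0:
        # prepend the new row to the first chunk only
        chunk = pd.concat([new_row, chunk], ignore_index=True)
    chunk.to_csv('out.csv', sep='\t', mode='a', header=(i == 0), index=False)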

How do I combine multiple pandas dataframes into an HDF5 object under one key/group?

I am parsing data from a large csv, sized 800 GB. I save each line of data as a pandas dataframe.
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):  # enumerate gives the row index i
    # parse: create a dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this into an HDF5 format, and query the h5 as if it was the entire csv file.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
My approach so far has been:
import pandas as pd

store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]

readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse: create a dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to save each dataframe df into the HDF5 under one key. However, this fails:
Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.
import pandas as pd

store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]

readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse: create a dictionary of key:value pairs by csv field:value, "dictionary_line"
    # save as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])  # grows one big DataFrame
and now store into HDF5 format
store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
However, I don't think I have the RAM to hold all the csv lines in total_df before writing it out in HDF5 format.
So, how do I append each "single-line" df into an HDF5 so that it ends up as one big dataframe (like the original csv)?
EDIT: Here's a concrete example of a csv file with different data types:
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....
Your code should work; can you try the following code:
import pandas as pd
import numpy as np

store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]

for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)  # data_columns, not data_column
store.close()
If this code works, then something is wrong with your data.
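One data-related pitfall worth checking (an addition, not part of the original answer): the sample rows contain strings such as category23, and an HDFStore table fixes the width of a string column at the first append, so a longer string in a later chunk makes the append fail. pandas' min_itemsize parameter reserves a larger width up front:
import pandas as pd

store = pd.HDFStore("file.h5", "w")
df = pd.DataFrame({"order": [1], "start": [1342], "end": [1357],
                   "value": ["category1"]})
# reserve 50 characters for 'value' so longer category strings fit later
store.append("single_key", df, data_columns=True, min_itemsize={"value": 50})
store.close()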